Human copy and paste: The data is gathered by a human who manually examines the pages and copies and pastes the relevant content.
Text pattern matching: The UNIX grep command, or the regular-expression matching facilities of a programming language, is used to extract information from web pages.
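A minimal sketch of this approach in Python, using the standard-library re module; the HTML snippet and the name/price pattern are purely illustrative assumptions, not taken from any particular site.

```python
import re

# Illustrative HTML snippet; a real scraper would read this from a downloaded page.
html = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$34.50</span></li>
</ul>
"""

# Regular expression that pairs each product name with its price.
pattern = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

for match in pattern.finditer(html):
    print(match.group("name"), "->", float(match.group("price")))
```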
HTTP programming: Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server, for example using socket programming.
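As a rough illustration of the socket-level variant, the following Python sketch issues a hand-built HTTP/1.1 GET request with the standard socket module; the target host example.com and the 4096-byte read loop are illustrative choices, and production scrapers would normally rely on a higher-level HTTP client library instead.

```python
import socket

HOST = "example.com"  # illustrative target host
PORT = 80

# Build a minimal HTTP/1.1 GET request by hand.
request = (
    f"GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    f"Connection: close\r\n"
    f"\r\n"
).encode("ascii")

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall(request)
    response = b""
    while True:
        chunk = sock.recv(4096)  # read until the server closes the connection
        if not chunk:
            break
        response += chunk

# Split the raw response into headers and body (the HTML to be scraped).
headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("iso-8859-1").splitlines()[0])  # e.g. "HTTP/1.1 200 OK"
print(len(body), "bytes of body")
```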
HTML parsing: Semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content; extraction programs built this way are often called wrappers.
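Standalone XQuery or HTQL engines are not assumed to be available here, so the following Python sketch stands in with XPath queries via the third-party lxml package, which plays a comparable querying role over the parsed page; the sample table markup and its IDs and classes are illustrative assumptions.

```python
# Requires the third-party lxml package (pip install lxml).
from lxml import html

# Illustrative page fragment; a real scraper would fetch this over HTTP.
page = """
<html><body>
  <table id="quotes">
    <tr><td class="sym">ACME</td><td class="px">12.34</td></tr>
    <tr><td class="sym">GLOBEX</td><td class="px">56.78</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(page)

# XPath query: for each table row, pull out the symbol and price cells.
for row in tree.xpath('//table[@id="quotes"]//tr'):
    symbol = row.xpath('./td[@class="sym"]/text()')[0]
    price = float(row.xpath('./td[@class="px"]/text()')[0])
    print(symbol, price)
```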
DOM parsing: The Document Object Model is used to parse a web page into a DOM tree, from which programs can then retrieve selected parts of the page.
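A minimal DOM-parsing sketch using Python's standard-library xml.dom.minidom; it assumes the page is well-formed XHTML, which is rarely true of real-world HTML, so in practice the markup would first be tidied or handed to an HTML-aware DOM parser.

```python
from xml.dom import minidom

# Illustrative, well-formed XHTML fragment; minidom requires valid XML.
page = """
<html>
  <body>
    <div id="articles">
      <h2>First headline</h2>
      <h2>Second headline</h2>
    </div>
  </body>
</html>
"""

dom = minidom.parseString(page)

# Walk the DOM tree and pull the text out of every <h2> element.
for heading in dom.getElementsByTagName("h2"):
    text = "".join(
        node.data for node in heading.childNodes
        if node.nodeType == node.TEXT_NODE
    )
    print(text)
```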
Vertical aggregation: Harvesting platforms built by companies with access to large-scale computing power and aimed at specific verticals; some companies run these data-harvesting platforms in the cloud.
Semantic annotation recognizing: Annotations are stored and managed separately from the web pages themselves, so scrapers can retrieve the data schema and extraction instructions from that layer before scraping the pages.
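The following Python sketch illustrates the idea under stated assumptions: a hypothetical annotation document (here an inline JSON string) is managed apart from the pages and maps field names to XPath expressions; the field names, XPath strings, and sample page are all invented for illustration, and the XPath is applied with the third-party lxml package as in the earlier example.

```python
import json
from lxml import html  # third-party; used here only to apply the retrieved rules

# Hypothetical annotation layer, stored apart from the pages themselves.
# In practice this could be fetched from an annotation server or sidecar file.
annotations_json = """
{
  "product_page": {
    "title": "//h1[@class='title']/text()",
    "price": "//span[@class='price']/text()"
  }
}
"""
schema = json.loads(annotations_json)["product_page"]

# Illustrative page; a real scraper would download it separately.
page = html.fromstring(
    "<html><body><h1 class='title'>Widget</h1>"
    "<span class='price'>19.99</span></body></html>"
)

# Drive the extraction entirely from the externally stored annotations.
record = {field: page.xpath(xpath)[0] for field, xpath in schema.items()}
print(record)  # {'title': 'Widget', 'price': '19.99'}
```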
Computer vision webpage analysis: Machine learning and computer vision techniques are used to identify and extract information from web pages by interpreting them visually, much as a human reader would.
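A deliberately simplified sketch of the visual angle, with heavy assumptions: it takes a pre-rendered screenshot (page.png) and candidate region bounding boxes as givens, and uses crude pixel statistics from the third-party Pillow library as a stand-in for the trained vision model a real system would apply.

```python
from PIL import Image  # third-party Pillow package

# Assumed inputs: a rendered screenshot of the page and candidate regions
# (left, top, right, bottom) produced by an earlier layout-analysis step.
SCREENSHOT = "page.png"
regions = [(0, 0, 800, 120), (0, 120, 800, 600)]

def region_stats(image, box):
    """Crude visual features for one region: brightness and contrast."""
    crop = image.crop(box).convert("L")  # grayscale crop of the region
    pixels = list(crop.getdata())
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return mean, spread

image = Image.open(SCREENSHOT)
for box in regions:
    mean, spread = region_stats(image, box)
    # Toy heuristic standing in for a trained classifier: bright, low-contrast
    # regions are treated as text blocks, the rest as imagery.
    label = "text block" if spread < 128 and mean > 180 else "image/figure"
    print(box, label)
```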