Human copy and paste: The data is gathered by a human who manually examines the pages and copies and pastes the relevant content.
Text pattern matching: The UNIX grep command, or the regular-expression matching facilities of a programming language, is used to extract information from web pages.
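A minimal sketch of this approach in Python, using the standard-library re module; the HTML snippet and the name/price pattern are purely illustrative assumptions, not taken from any particular site.

```python
import re

# Illustrative HTML snippet; a real scraper would read this from a downloaded page.
html = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$34.50</span></li>
</ul>
"""

# Regular expression that pairs each product name with its price.
pattern = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

for match in pattern.finditer(html):
    print(match.group("name"), "->", float(match.group("price")))
```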
HTTP programming: Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server, for example using socket programming.
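As a rough illustration of the socket-level variant, the following Python sketch issues a hand-built HTTP/1.1 GET request with the standard socket module; the target host example.com and the 4096-byte read loop are illustrative choices, and production scrapers would normally rely on a higher-level HTTP client library instead.

```python
import socket

HOST = "example.com"  # illustrative target host
PORT = 80

# Build a minimal HTTP/1.1 GET request by hand.
request = (
    f"GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    f"Connection: close\r\n"
    f"\r\n"
).encode("ascii")

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall(request)
    response = b""
    while True:
        chunk = sock.recv(4096)  # read until the server closes the connection
        if not chunk:
            break
        response += chunk

# Split the raw response into headers and body (the HTML to be scraped).
headers, _, body = response.partition(b"\r\n\r\n")
print(headers.decode("iso-8859-1").splitlines()[0])  # e.g. "HTTP/1.1 200 OK"
print(len(body), "bytes of body")
```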
HTML parsing: Semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content; extraction programs built this way are often called wrappers.
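Standalone XQuery or HTQL engines are not assumed to be available here, so the following Python sketch stands in with XPath queries via the third-party lxml package, which plays a comparable querying role over the parsed page; the sample table markup and its IDs and classes are illustrative assumptions.

```python
# Requires the third-party lxml package (pip install lxml).
from lxml import html

# Illustrative page fragment; a real scraper would fetch this over HTTP.
page = """
<html><body>
  <table id="quotes">
    <tr><td class="sym">ACME</td><td class="px">12.34</td></tr>
    <tr><td class="sym">GLOBEX</td><td class="px">56.78</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(page)

# XPath query: for each table row, pull out the symbol and price cells.
for row in tree.xpath('//table[@id="quotes"]//tr'):
    symbol = row.xpath('./td[@class="sym"]/text()')[0]
    price = float(row.xpath('./td[@class="px"]/text()')[0])
    print(symbol, price)
```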
DOM parsing: The Document Object Model is used to parse a web page into a DOM tree, from which programs can then retrieve selected parts of the page.
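A minimal DOM-parsing sketch using Python's standard-library xml.dom.minidom; it assumes the page is well-formed XHTML, which is rarely true of real-world HTML, so in practice the markup would first be tidied or handed to an HTML-aware DOM parser.

```python
from xml.dom import minidom

# Illustrative, well-formed XHTML fragment; minidom requires valid XML.
page = """
<html>
  <body>
    <div id="articles">
      <h2>First headline</h2>
      <h2>Second headline</h2>
    </div>
  </body>
</html>
"""

dom = minidom.parseString(page)

# Walk the DOM tree and pull the text out of every <h2> element.
for heading in dom.getElementsByTagName("h2"):
    text = "".join(
        node.data for node in heading.childNodes
        if node.nodeType == node.TEXT_NODE
    )
    print(text)
```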
Vertical aggregation: Harvesting platforms built by companies with access to large-scale computing power and aimed at specific verticals; some companies run these data-harvesting platforms in the cloud.
Semantic annotation recognizing: Annotations are stored and managed separately from the web pages themselves, so scrapers can retrieve the data schema and extraction instructions from that layer before scraping the pages.
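The following Python sketch illustrates the idea under stated assumptions: a hypothetical annotation document (here an inline JSON string) is managed apart from the pages and maps field names to XPath expressions; the field names, XPath strings, and sample page are all invented for illustration, and the XPath is applied with the third-party lxml package as in the earlier example.

```python
import json
from lxml import html  # third-party; used here only to apply the retrieved rules

# Hypothetical annotation layer, stored apart from the pages themselves.
# In practice this could be fetched from an annotation server or sidecar file.
annotations_json = """
{
  "product_page": {
    "title": "//h1[@class='title']/text()",
    "price": "//span[@class='price']/text()"
  }
}
"""
schema = json.loads(annotations_json)["product_page"]

# Illustrative page; a real scraper would download it separately.
page = html.fromstring(
    "<html><body><h1 class='title'>Widget</h1>"
    "<span class='price'>19.99</span></body></html>"
)

# Drive the extraction entirely from the externally stored annotations.
record = {field: page.xpath(xpath)[0] for field, xpath in schema.items()}
print(record)  # {'title': 'Widget', 'price': '19.99'}
```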
Computer vision webpage analysis: Machine learning and computer vision techniques are used to identify and extract information from web pages by interpreting them visually, much as a human reader would.
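A deliberately simplified sketch of the visual angle, with heavy assumptions: it takes a pre-rendered screenshot (page.png) and candidate region bounding boxes as givens, and uses crude pixel statistics from the third-party Pillow library as a stand-in for the trained vision model a real system would apply.

```python
from PIL import Image  # third-party Pillow package

# Assumed inputs: a rendered screenshot of the page and candidate regions
# (left, top, right, bottom) produced by an earlier layout-analysis step.
SCREENSHOT = "page.png"
regions = [(0, 0, 800, 120), (0, 120, 800, 600)]

def region_stats(image, box):
    """Crude visual features for one region: brightness and contrast."""
    crop = image.crop(box).convert("L")  # grayscale crop of the region
    pixels = list(crop.getdata())
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return mean, spread

image = Image.open(SCREENSHOT)
for box in regions:
    mean, spread = region_stats(image, box)
    # Toy heuristic standing in for a trained classifier: bright, low-contrast
    # regions are treated as text blocks, the rest as imagery.
    label = "text block" if spread < 128 and mean > 180 else "image/figure"
    print(box, label)
```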