
Web crawlers in Java
Web crawling is the process of traversing a series of interconnected web pages and extracting relevant information from those pages. A crawler does this by isolating and then following the links on each page. While there are many precompiled datasets readily available, it may still be necessary to collect data directly from the Internet. Some sources, such as news sites, are continually updated and need to be revisited from time to time.
A web crawler is an application that visits various sites and collects information. The web crawling process consists of a series of steps:
- Select a URL to visit
- Fetch the page
- Parse the page
- Extract relevant content
- Extract relevant URLs to visit
This process is repeated for each URL visited.
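To make these steps concrete, the following is a minimal sketch of that loop using the jsoup library (https://jsoup.org) to fetch and parse pages. The starting URL and the page limit are placeholders chosen for illustration, and the sketch omits the safeguards discussed next:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.Deque;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        // The frontier holds URLs waiting to be visited
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add("https://example.com"); // placeholder starting URL

        int pagesToVisit = 10; // stop after a handful of pages for this demo
        while (!frontier.isEmpty() && pagesToVisit-- > 0) {
            // Step 1: select a URL to visit
            String url = frontier.poll();

            // Step 2: fetch the page
            Document page = Jsoup.connect(url).get();

            // Steps 3 and 4: parse the page and extract relevant content
            System.out.println(page.title());

            // Step 5: extract relevant URLs to visit later
            for (Element link : page.select("a[href]")) {
                frontier.add(link.attr("abs:href")); // resolve to an absolute URL
            }
        }
    }
}
```

The deque serves as the crawl frontier: newly discovered URLs are appended at the tail and visited in the order they were found, giving a breadth-first traversal.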
There are several issues that need to be considered when fetching and parsing a page, such as:
- Page importance: We do not want to process irrelevant pages.
- Exclusively HTML: We will not normally follow links to images, for example.
- Spider traps: We want to bypass sites that may result in an infinite number of requests. This can occur with dynamically generated pages where one request leads to another.
- Repetition: It is important to avoid crawling the same page more than once.
- Politeness: Do not make an excessive number of requests to a website, and observe its robots.txt file, which specifies the parts of the site that should not be crawled. Several of these concerns are illustrated in the sketch following this list.
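The following sketch, again assuming jsoup, shows one way to address the last four concerns: a URL filter skips non-HTML resources, a depth limit guards against spider traps, a visited set avoids repetition, and a fixed delay keeps the request rate polite. The class and constant names are illustrative, and the sketch omits robots.txt handling, which a production crawler must also honor:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class PoliteCrawler {
    private static final int MAX_DEPTH = 3;    // guards against spider traps
    private static final long DELAY_MS = 1000; // politeness: at most one request per second
    private final Set<String> visited = new HashSet<>(); // avoids repeat visits

    public void crawl(String url, int depth) throws InterruptedException {
        // Skip mailto:, javascript:, and links that could not be resolved
        if (!url.startsWith("http")) {
            return;
        }
        // Exclusively HTML: skip obvious image links
        if (url.endsWith(".jpg") || url.endsWith(".png") || url.endsWith(".gif")) {
            return;
        }
        // Depth limit and visited check; Set.add returns false for duplicates
        if (depth > MAX_DEPTH || !visited.add(url)) {
            return;
        }
        Thread.sleep(DELAY_MS); // politeness delay between requests

        Document page;
        try {
            page = Jsoup.connect(url).get();
        } catch (IOException e) {
            return; // skip pages that fail to load
        }
        System.out.println("Visited: " + url);

        for (Element link : page.select("a[href]")) {
            crawl(link.attr("abs:href"), depth + 1);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new PoliteCrawler().crawl("https://example.com", 0); // placeholder URL
    }
}
```

Page importance is not handled here; judging relevance usually requires domain-specific rules, such as restricting the crawl to certain hosts or URL patterns.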
The process of creating a web crawler can be daunting. For all but the simplest needs, it is recommended that one of several open source web crawlers be used. A partial list follows:
- Nutch: http://nutch.apache.org
- crawler4j: https://github.com/yasserg/crawler4j
- JSpider: http://j-spider.sourceforge.net/
- WebSPHINX: http://www.cs.cmu.edu/~rcm/websphinx/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix
We can either create our own web crawler or use an existing one; in this chapter, we will examine both approaches. For specialized processing, a custom crawler can be desirable. We will demonstrate how to create a simple web crawler in Java to provide more insight into how web crawlers work. This will be followed by a brief discussion of other web crawlers.
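As a preview of the existing-crawler approach, the following sketch shows roughly what a crawler4j client looks like. It follows the crawler4j 4.x style, so exact signatures may differ in other versions; the seed URL, domain filter, and storage folder are placeholders. Note that crawler4j handles robots.txt and request throttling for us through its RobotstxtServer:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class SampleCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Restrict the crawl to a single site (placeholder domain)
        return url.getURL().toLowerCase().startsWith("https://example.com/");
    }

    @Override
    public void visit(Page page) {
        // Extract relevant content from pages that parsed as HTML
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData data = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + ": " + data.getTitle());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder storage folder
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("https://example.com/"); // placeholder seed URL
        controller.start(SampleCrawler.class, 1);   // run a single crawler thread
    }
}
```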