Java: Data Science Made Easy

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

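    import java.util.ArrayList;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import static java.lang.System.out;

    public class SimpleWebCrawler {

        private String topic;
        private String startingURL;
        private String urlLimiter;
        private final int pageLimit = 20;
        private ArrayList<String> visitedList = new ArrayList<>();
        private ArrayList<String> pageList = new ArrayList<>();
        ...
        public static void main(String[] args) {
            new SimpleWebCrawler();
        }

    }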

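The instance variables are detailed here:

topic: The text that a page must contain to be accepted
startingURL: The URL of the first page to visit
urlLimiter: A string that a link must contain to be followed
pageLimit: The maximum number of pages to accept
visitedList: The URLs of the pages that have already been visited
pageList: The URLs of the accepted pages of interest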

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, a small rock in the Isles of Scilly, off the coast of Cornwall, England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which restricts the embedded links that are followed to those containing that string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() {
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,_"
            + "Isles_of_Scilly";
        urlLimiter = "Bishop_Rock";
        topic = "shipping route";
        visitPage(startingURL);
    }

In the visitPage method, the size of the pageList ArrayList is checked to see whether the maximum number of accepted pages has been reached. If it has, then the method returns without visiting the page, which brings the search to an end:

    public void visitPage(String url) {
        if (pageList.size() >= pageLimit) {
            return;
        }
        ...
    }

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) {
        // URL already visited
    } else {
        visitedList.add(url);
        ...
    }
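An ArrayList is adequate for a crawl limited to 20 accepted pages, but its contains method scans the whole list on every call. As a minimal sketch of an alternative, the check and the insertion can be combined with a HashSet, whose add method returns false when the element is already present. The VisitedTracker class and its markVisited method are hypothetical names used only for illustration; they are not part of SimpleWebCrawler.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the original example:
// tracks visited URLs in a set instead of a list.
class VisitedTracker {
    private final Set<String> visited = new HashSet<>();

    // Returns true the first time a URL is seen and false on later calls,
    // so one call both tests for and records a visit.
    boolean markVisited(String url) {
        return visited.add(url);
    }

    // Number of distinct URLs recorded so far.
    int size() {
        return visited.size();
    }
}
```

With such a helper, the body of the else branch would run only when markVisited(url) returns true.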

Jsoup is used to retrieve and parse the page, returning a Document object. Many problems can occur at this point, such as a malformed URL, a retrieval timeout, or simply a bad link. The catch block needs to handle these types of problems. We will provide a more in-depth explanation of jsoup in Web scraping in Java:

    try {
        Document doc = Jsoup.connect(url).get();
        ...
    } catch (Exception ex) {
        // Handle exceptions
    }

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) {
        out.println((pageList.size() + 1) + ": [" + url + "]");
        pageList.add(url);

        // Process page links
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            if (link.attr("href").contains(urlLimiter)) {
                visitPage(link.attr("abs:href"));
            }
        }
    }

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.
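To experiment with this control flow without fetching real pages, the same logic can be run against a small in-memory set of pages. The following is a sketch under stated assumptions rather than the book's implementation: the InMemoryCrawler class, its demo method, and the made-up site/... URLs and page text are all hypothetical, and maps stand in for the documents that jsoup would return. It also tests the limiter against the full link string, whereas the original tests the raw href attribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for SimpleWebCrawler that reads pages from maps
// instead of retrieving them with jsoup.
class InMemoryCrawler {
    private final Map<String, String> pageText;        // URL -> page text
    private final Map<String, List<String>> pageLinks; // URL -> embedded links
    private final String topic;
    private final String urlLimiter;
    private final int pageLimit;
    private final List<String> visitedList = new ArrayList<>();
    private final List<String> pageList = new ArrayList<>();

    InMemoryCrawler(Map<String, String> pageText,
                    Map<String, List<String>> pageLinks,
                    String topic, String urlLimiter, int pageLimit) {
        this.pageText = pageText;
        this.pageLinks = pageLinks;
        this.topic = topic;
        this.urlLimiter = urlLimiter;
        this.pageLimit = pageLimit;
    }

    // Starts the crawl and returns the accepted pages in visit order.
    List<String> crawl(String startingURL) {
        visitPage(startingURL);
        return pageList;
    }

    private void visitPage(String url) {
        // Stop at the page limit and skip pages that were already visited
        if (pageList.size() >= pageLimit || visitedList.contains(url)) {
            return;
        }
        visitedList.add(url);
        String text = pageText.get(url);
        if (text == null) {
            return; // Stands in for a bad link or a failed retrieval
        }
        if (text.contains(topic)) {
            pageList.add(url);
            // Follow only the links that contain the limiting text
            for (String link : pageLinks.getOrDefault(url, new ArrayList<>())) {
                if (link.contains(urlLimiter)) {
                    visitPage(link);
                }
            }
        }
    }

    // Builds a tiny made-up site for demonstration purposes
    static InMemoryCrawler demo(int pageLimit) {
        Map<String, String> text = new HashMap<>();
        Map<String, List<String>> links = new HashMap<>();
        text.put("site/Bishop_Rock", "on a busy shipping route");
        text.put("site/Bishop_Rock_Lighthouse", "guards the shipping route");
        text.put("site/Bishop_Rock_History", "no matching text here");
        text.put("site/Cornwall", "shipping route");
        links.put("site/Bishop_Rock", Arrays.asList(
            "site/Bishop_Rock_Lighthouse",
            "site/Bishop_Rock_History",
            "site/Cornwall"));
        links.put("site/Bishop_Rock_Lighthouse",
            Arrays.asList("site/Bishop_Rock"));
        return new InMemoryCrawler(text, links,
            "shipping route", "Bishop_Rock", pageLimit);
    }
}
```

Calling InMemoryCrawler.demo(20).crawl("site/Bishop_Rock") accepts the starting page and the lighthouse page. The history page is visited but rejected because it lacks the topic text, the Cornwall page is never followed because its URL lacks the limiting text, and the link back to the starting page is ignored because that page has already been visited.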

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl to an external store. Normally this is necessary, and the results can be saved to a file or a database.
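As one possible way to persist the results, the accepted URLs could be written to a text file, one per line, using the standard java.nio.file API. This is only a sketch: the CrawlResultWriter class and its save and load methods are hypothetical names, and a real crawler might store page content or use a database instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the original example:
// stores the accepted URLs in a plain text file.
class CrawlResultWriter {

    // Writes one URL per line, replacing any existing file content.
    static void save(List<String> pageList, Path target) {
        try {
            Files.write(target, pageList, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Reads the URLs back in the order in which they were written.
    static List<String> load(Path source) {
        try {
            return Files.readAllLines(source, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

In SimpleWebCrawler, a call such as CrawlResultWriter.save(pageList, Paths.get("crawl-results.txt")) could follow the call to visitPage in the constructor; the file name here is an arbitrary choice.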