Python Web Scraping Cookbook

Querying the DOM with XPath and lxml

XPath is a query language for selecting nodes from an XML document, and it is a must-learn language for anyone performing web scraping. XPath offers a number of benefits over other node-selection tools (one of which is demonstrated in the short sketch after this list):

  • It can easily navigate through the DOM tree
  • It is more sophisticated and powerful than other selectors, such as CSS selectors and regular expressions
  • It has a great set (200+) of built-in functions and is extensible with custom functions
  • It is widely supported by parsing libraries and scraping platforms
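As a minimal sketch of that extra power, the following snippet (built on a small, hypothetical fragment) matches an anchor by its text content and then steps back up to its parent list item, something a plain CSS selector cannot express:

    from lxml import etree

    # A small, hypothetical fragment used only for illustration
    tree = etree.fromstring(
        '<ul><li><a href="a.html">first</a></li>'
        '<li><a href="b.html">second</a></li></ul>')

    # Match on text content, then step back up to the parent <li>
    item = tree.xpath('//a[text()="second"]/parent::li')[0]
    print(etree.tostring(item))   # b'<li><a href="b.html">second</a></li>'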

The XPath data model defines seven kinds of nodes (we have seen some of them previously); the sketch after this list shows how several of them look when queried:

  • root node (top level parent node)
  • element nodes (<a>..</a>)
  • attribute nodes (href="example.html")
  • text nodes ("this is a text")
  • comment nodes (<!-- a comment -->)
  • namespace nodes 
  • processing instruction nodes
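Here is a minimal sketch of selecting several of these node kinds with lxml; the HTML fragment is hypothetical and used only for illustration:

    from lxml import etree

    html = """
    <html>
      <body>
        <!-- a comment -->
        <a href="example.html">this is a text</a>
      </body>
    </html>
    """

    root = etree.HTML(html)                  # root node of the parsed tree

    elements = root.xpath('//a')             # element nodes
    attributes = root.xpath('//a/@href')     # attribute nodes
    texts = root.xpath('//a/text()')         # text nodes
    comments = root.xpath('//comment()')     # comment nodes

    print(elements[0].tag)     # a
    print(attributes[0])       # example.html
    print(texts[0])            # this is a text
    print(comments[0].text)    # ' a comment '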

XPath expressions can return different data types, as the sketch following this list demonstrates:

  • strings
  • booleans
  • numbers
  • node-sets (probably the most common case)
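The following sketch (with a made-up XML snippet) shows lxml returning each of these types from the same document:

    from lxml import etree

    # Hypothetical document used only to demonstrate the return types
    tree = etree.fromstring('<items><item price="3"/><item price="4"/></items>')

    nodes = tree.xpath('//item')             # node-set: a list of Elements
    count = tree.xpath('count(//item)')      # number:   2.0 (a Python float)
    name = tree.xpath('name(/*)')            # string:   'items'
    exists = tree.xpath('boolean(//item)')   # boolean:  True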

An XPath axis defines a node-set relative to the current node. A total of 13 axes are defined in XPath, which make it easy to select different parts of the document starting from the current context node or from the root node.
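A brief sketch of a few of these axes in action, again on a hypothetical fragment:

    from lxml import etree

    tree = etree.fromstring(
        '<div><p id="first">one</p><p id="second">two</p></div>')

    first = tree.xpath('//p[@id="first"]')[0]

    # child:: is the default axis: //div/p is shorthand for //div/child::p
    children = tree.xpath('//div/child::p')            # both <p> elements

    # following-sibling:: selects siblings that come after the context node
    sibling = first.xpath('following-sibling::p')[0]   # <p id="second">

    # ancestor:: walks upward from the context node toward the root
    ancestors = first.xpath('ancestor::div')           # [<div>]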

lxml is a Python wrapper around the libxml2 XML parsing library, which is written in C. The C implementation makes it faster than Beautiful Soup, but also harder to install on some computers. The latest installation instructions are available at: http://lxml.de/installation.html.
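Once lxml is installed (typically with pip install lxml), a quick check such as the following, assuming a standard installation, reports the lxml version and the libxml2 version it was built against:

    from lxml import etree

    # Report the versions of lxml and the underlying libxml2 library
    print(etree.__version__)
    print(etree.LIBXML_VERSION)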

lxml supports XPath, which makes it considerably easier to manage complex XML and HTML documents. We will examine several techniques for using lxml and XPath together to navigate the DOM and access data.
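As a hint of what follows, here is a minimal, hypothetical sketch that combines lxml with an XPath query to pull every link out of a page; the URL and the use of the requests library are assumptions for illustration, not part of this recipe:

    from lxml import html
    import requests

    # Hypothetical URL, used only for illustration
    response = requests.get('http://example.com/')
    tree = html.fromstring(response.content)

    # Select every anchor element and print its text and href attribute
    for anchor in tree.xpath('//a'):
        print(anchor.text_content(), anchor.get('href'))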