Python Web Scraping Cookbook

Introduction

The keys to effective scraping are understanding how content and data are stored on web servers, identifying the data you want to retrieve, and understanding how your tools support that extraction. In this chapter, we will discuss website structures and the DOM, and introduce techniques to parse and query websites with lxml, XPath, and CSS selectors. We will also look at how to work with websites written in other languages and in different encodings, such as Unicode.

Ultimately, understanding how to find and extract data within an HTML document comes down to understanding the structure of the HTML page, its representation in the DOM, the process of querying the DOM for specific elements, and how to specify which elements you want to retrieve based upon how the data is represented.
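
As a rough illustration of that process, the following minimal sketch parses a small page into a tree and queries it for specific elements. It makes several assumptions: the sample markup and its element names are hypothetical, and the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath and requires well-formed markup, stands in for lxml so that the example runs without third-party packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
PAGE = """<html>
  <body>
    <h1 id="title">Planets</h1>
    <ul class="planets">
      <li class="planet">Mercury</li>
      <li class="planet">Venus</li>
      <li class="moon">Luna</li>
    </ul>
  </body>
</html>"""

# Parse the markup into a tree of elements (an in-memory stand-in for the DOM).
root = ET.fromstring(PAGE)

# Query the tree: every <li> whose class attribute equals "planet".
planets = [li.text for li in root.findall(".//li[@class='planet']")]

# Query the tree: the single <h1> identified by its id attribute.
title = root.find(".//h1[@id='title']").text

print(title, planets)
```

The queries specify elements by how the data is represented (tag name plus attribute value) rather than by position, so they are less likely to break if the page gains or reorders items. Real-world HTML is often not well formed, which is one reason a more forgiving parser such as lxml is usually preferred in practice.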