How it works...
The first step is to download the page. The raw text is then parsed, as in step 3, and the resulting page object contains the parsed information.
The html.parser parser is the default one, but it can have problems with specific operations. For example, it can be slow on big pages, or have issues parsing highly dynamic web pages. You can use other parsers, such as lxml, which is much faster, or html5lib, which is closer to how a browser operates, including dynamic changes produced by HTML5. Both are external modules that will need to be added to the requirements.txt file.
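The choice of parser can be sketched as follows. This is a minimal example using illustrative markup; it falls back to the built-in parser so it runs even when lxml is not installed:

```python
from bs4 import BeautifulSoup, FeatureNotFound

HTML = "<html><body><h3>Title</h3></body></html>"

# html.parser ships with Python and needs no extra dependencies.
page = BeautifulSoup(HTML, "html.parser")

# lxml (faster) and html5lib (browser-like) are external packages;
# BeautifulSoup raises FeatureNotFound if the requested parser is
# not installed, so fall back to the built-in one here.
try:
    fast_page = BeautifulSoup(HTML, "lxml")
except FeatureNotFound:
    fast_page = BeautifulSoup(HTML, "html.parser")

print(page.h3.text)
```

All parsers expose the same BeautifulSoup interface, so swapping one for another only means changing the second argument.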
BeautifulSoup allows us to search for HTML elements. It can find the first match with .find() or return a list of all matches with .find_all(). In step 5, we search for a specific <a> tag that has a particular attribute, name=link. From there, we keep iterating over .next_elements until we find the next h3 tag, which marks the end of the section.
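The search-and-walk pattern can be sketched like this. The markup is a small stand-in for the downloaded article, with the same name=link anchor and closing h3 structure assumed by the recipe:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a downloaded page.
HTML = """<html><body>
<a name="link"></a>
First line<br/>
Second line<br/>
<h3>Next section</h3>
Not included
</body></html>"""

page = BeautifulSoup(HTML, "html.parser")

# Find the first <a> tag carrying the attribute name="link".
# attrs is used because 'name' clashes with find()'s own parameter.
link_section = page.find("a", attrs={"name": "link"})

# Walk forward through the document until the next <h3> tag,
# which marks the end of the section.
section = []
for element in link_section.next_elements:
    if element.name == "h3":
        break
    # .string is None for tags without a single text child;
    # the "or ''" keeps None out of the list.
    section.append(element.string or "")

text = "".join(section).strip()
print(text)
```

Note that .next_elements visits every node in document order (tags and text alike), which is why the loop terminates cleanly at the first h3 it encounters.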
The text of each element is extracted and finally composed into a single string. Note the or clause, which avoids storing None, the value returned when an element has no text.
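The None-avoidance trick can be seen in isolation below. This sketch uses hypothetical markup: a tag with more than one child has no single .string, so it contributes an empty string instead of None:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p><b>two</b> parts</p>", "html.parser")

# The second <p> has two children (<b> and a text node), so its
# .string is None; "or ''" substitutes an empty string so that
# str.join() does not fail on a None value.
pieces = [tag.string or "" for tag in soup.find_all("p")]
text = "".join(pieces)
print(text)  # → "one"
```

Without the or clause, "".join(pieces) would raise a TypeError as soon as one element had no text.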
HTML is highly versatile and can have many different structures. The case presented in this recipe is typical, but other ways of dividing sections include grouping related content inside a big <div> tag or other elements, or even using raw text. Some experimentation will be required until you find the specific process for extracting the juicy bits of a web page. Don't be afraid to try!