How to do it...
- Import BeautifulSoup and requests:
>>> import requests
>>> from bs4 import BeautifulSoup
- Set up the URL of the page to download and retrieve it:
>>> URL = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(URL)
>>> response
<Response [200]>
- Parse the downloaded page:
>>> page = BeautifulSoup(response.text, 'html.parser')
- Obtain the title of the page and check that it matches the one displayed in the browser:
>>> page.title
<title>Sample Web Page</title>
>>> page.title.string
'Sample Web Page'
- Find all the h3 elements in the page, to determine the existing sections:
>>> page.find_all('h3')
[<h3><a name="contents">CONTENTS</a></h3>, <h3><a name="basics">1. Creating a Web Page</a></h3>, <h3><a name="syntax">2. HTML Syntax</a></h3>, <h3><a name="chars">3. Special Characters</a></h3>, <h3><a name="convert">4. Converting Plain Text to HTML</a></h3>, <h3><a name="effects">5. Effects</a></h3>, <h3><a name="lists">6. Lists</a></h3>, <h3><a name="links">7. Links</a></h3>, <h3><a name="tables">8. Tables</a></h3>, <h3><a name="install">9. Installing Your Web Page on the Internet</a></h3>, <h3><a name="more">10. Where to go from here</a></h3>]
- Extract the text of the section on links, stopping when you reach the next <h3> tag:
>>> link_section = page.find('a', attrs={'name': 'links'})
>>> section = []
>>> for element in link_section.next_elements:
...     if element.name == 'h3':
...         break
...     section.append(element.string or '')
...
>>> result = ''.join(section)
>>> result
'7. Links\n\nLinks can be internal within a Web page (like to\nthe Table of ContentsTable of Contents at the top), or they\ncan be to external web pages or pictures on the same website, or they\ncan be to websites, pages, or pictures anywhere else in the world.\n\n\n\nHere is a link to the Kermit\nProject home pageKermit\nProject home page.\n\n\n\nHere is a link to Section 5Section 5 of this document.\n\n\n\nHere is a link to\nSection 4.0Section 4.0\nof the C-Kermit\nfor Unix Installation InstructionsC-Kermit\nfor Unix Installation Instructions.\n\n\n\nHere is a link to a picture:\nCLICK HERECLICK HERE to see it.\n\n\n'
Notice that there are no HTML tags; it's all raw text.
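The link text in the result appears twice (for example, 'Table of ContentsTable of Contents'). This happens because next_elements yields each <a> tag and then the text node inside it, and .string returns the same text for both. One way to avoid the repetition is to keep only the text nodes. The following is a minimal sketch of that idea, not part of the recipe itself: the section_text helper and the inline HTML sample are hypothetical, and the sample stands in for the downloaded page so the example runs without network access.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, loosely modelled on the structure of the sample page
HTML = '''
<html><body>
<h3><a name="lists">6. Lists</a></h3>
<p>Some <b>list</b> items.</p>
<h3><a name="links">7. Links</a></h3>
<p>Here is a link to the <a href="#contents">Table of Contents</a> at the top.</p>
<h3><a name="tables">8. Tables</a></h3>
<p>A table.</p>
</body></html>
'''


def section_text(page, anchor_name):
    """Return the text from the named anchor up to the next <h3> tag."""
    anchor = page.find('a', attrs={'name': anchor_name})
    parts = []
    for element in anchor.next_elements:
        # Stop at the heading that opens the next section
        if getattr(element, 'name', None) == 'h3':
            break
        # Keep text nodes only; the text of a tag is reached again
        # through its children, so adding both would duplicate it
        if isinstance(element, NavigableString):
            parts.append(str(element))
    return ''.join(parts)


page = BeautifulSoup(HTML, 'html.parser')
text = section_text(page, 'links')
print(text)
```

With this filter, each piece of text is collected once, while the stopping condition at the next <h3> stays the same as in the recipe.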