Python Web Scraping Cookbook

How to do it

We will start by converting the planets data into a CSV file.

  1. This will be performed using the csv module. The following code writes the planets data to a CSV file (the code is in 03/create_csv.py):
import csv
from get_planet_data import get_planet_data

planets = get_planet_data()

with open('../../www/planets.csv', 'w+', newline='') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(['Name', 'Mass', 'Radius', 'Description', 'MoreInfo'])
    for planet in planets:
        writer.writerow([planet['Name'], planet['Mass'], planet['Radius'], planet['Description'], planet['MoreInfo']])

  1. The output file is put into the www folder of our project. Examining it, we see the following content:
Name,Mass,Radius,Description,MoreInfo
Mercury,0.330,4879,Named Mercurius by the Romans because it appears to move so swiftly.,https://en.wikipedia.org/wiki/Mercury_(planet)
Venus,4.87,12104,Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.,https://en.wikipedia.org/wiki/Venus
Earth,5.97,12756,"The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,' Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning 'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'",https://en.wikipedia.org/wiki/Earth
Mars,0.642,6792,"Named by the Romans for their god of war because of its red, bloodlike color. Other civilizations also named this planet from this attribute; for example, the Egyptians named it ""Her Desher,"" meaning ""the red one.""",https://en.wikipedia.org/wiki/Mars
Jupiter,1898,142984,The largest and most massive of the planets was named Zeus by the Greeks and Jupiter by the Romans; he was the most important deity in both pantheons.,https://en.wikipedia.org/wiki/Jupiter
Saturn,568,120536,"Roman name for the Greek Cronos, father of Zeus/Jupiter. Other civilizations have given different names to Saturn, which is the farthest planet from Earth that can be observed by the naked human eye. Most of its satellites were named for Titans who, according to Greek mythology, were brothers and sisters of Saturn.",https://en.wikipedia.org/wiki/Saturn
Uranus,86.8,51118,"Several astronomers, including Flamsteed and Le Monnier, had observed Uranus earlier but had recorded it as a fixed star. Herschel tried unsuccessfully to name his discovery ""Georgian Sidus"" after George III; the planet was named by Johann Bode in 1781 after the ancient Greek deity of the sky Uranus, the father of Kronos (Saturn) and grandfather of Zeus (Jupiter).",https://en.wikipedia.org/wiki/Uranus
Neptune,102,49528,"Neptune was ""predicted"" by John Couch Adams and Urbain Le Verrier who, independently, were able to account for the irregularities in the motion of Uranus by correctly predicting the orbital elements of a trans- Uranian body. Using the predicted parameters of Le Verrier (Adams never published his predictions), Johann Galle observed the planet in 1846. Galle wanted to name the planet for Le Verrier, but that was not acceptable to the international astronomical community. Instead, this planet is named for the Roman god of the sea.",https://en.wikipedia.org/wiki/Neptune
Pluto,0.0146,2370,"Pluto was discovered at Lowell Observatory in Flagstaff, AZ during a systematic search for a trans-Neptune planet predicted by Percival Lowell and William H. Pickering. Named after the Roman god of the underworld who was able to render himself invisible.",https://en.wikipedia.org/wiki/Pluto

We wrote this file into the www directory so that we can download it with our web server.
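Because each planet is a dictionary, the same file could also be produced with csv.DictWriter, which maps keys to columns by name. The following is only a minimal sketch: the two rows are shortened stand-ins for the result of get_planet_data(), and the output goes to an in-memory buffer rather than the www folder so that the quoting is easy to inspect.

```python
import csv
import io

# Stand-in rows (shortened); the recipe's get_planet_data() is not reproduced here.
planets = [
    {"Name": "Mercury", "Mass": "0.330", "Radius": "4879",
     "Description": "Named Mercurius by the Romans.",
     "MoreInfo": "https://en.wikipedia.org/wiki/Mercury_(planet)"},
    {"Name": "Mars", "Mass": "0.642", "Radius": "6792",
     "Description": "Named for the god of war because of its red, bloodlike color.",
     "MoreInfo": "https://en.wikipedia.org/wiki/Mars"},
]

fieldnames = ["Name", "Mass", "Radius", "Description", "MoreInfo"]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()        # header row taken from fieldnames
writer.writerows(planets)   # one row per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

Only the Mars description contains a comma, so only that field is wrapped in double quotes, which matches the pattern seen in the file listing above.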

  1. This data can now be used in applications that support CSV content, such as Excel:
The File Opened in Excel
  1. CSV data can also be read from a web server by first retrieving the content with requests and then parsing it with the csv library. The following code is in 03/read_csv_from_web.py:
import requests
import csv

planets_data = requests.get("http://localhost:8080/planets.csv").text
planets = planets_data.split('\n')
reader = csv.reader(planets, delimiter=',', quotechar='"')
lines = [line for line in reader][:-1]
for line in lines: print(line)

The following is a portion of the output:

['Name', 'Mass', 'Radius', 'Description', 'MoreInfo']
['Mercury', '0.330', '4879', 'Named Mercurius by the Romans because it appears to move so swiftly.', 'https://en.wikipedia.org/wiki/Mercury_(planet)']
['Venus', '4.87', '12104', 'Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.', 'https://en.wikipedia.org/wiki/Venus']
['Earth', '5.97', '12756', "The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,' Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning 'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'", 'https://en.wikipedia.org/wiki/Earth']

One thing to point out is that the CSV file ends with a line terminator, so splitting the text on '\n' leaves a trailing blank line that would add an empty list item if not handled. This was handled by slicing the rows. The following statement returns all lines except the last one:

lines = [line for line in reader][:-1]
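The trailing empty row can also be avoided without slicing. The sketch below is one possible alternative, not the recipe's code: it wraps the text in io.StringIO so the reader sees real lines, and filters out any empty rows. The csv_text value is a made-up stand-in for the body returned by requests.

```python
import csv
import io

# Hypothetical response body standing in for requests.get(...).text.
csv_text = 'Name,Mass,Description\r\nMars,0.642,"Named for its red, bloodlike color."\r\n'

# Splitting on '\n' leaves a final empty string, which the reader returns as [].
naive_rows = list(csv.reader(csv_text.split('\n')))
print(naive_rows[-1])

# Reading through StringIO and skipping empty rows needs no slicing.
filtered_rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
for row in filtered_rows:
    print(row)
```

Filtering on `if row` also keeps the last data row intact when a file happens not to end with a line terminator, a case where `[:-1]` would drop real data.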
  1. This can also be done quite easily using pandas. The following constructs a DataFrame from the scraped data. The code is in 03/create_df_planets.py:
import pandas as pd
planets_df = pd.read_csv("http://localhost:8080/planets_pandas.csv", index_col='Name')
print(planets_df)

Running this gives the following output:

                                               Description Mass Radius
Name
Mercury Named Mercurius by the Romans because it appea... 0.330 4879
Venus Roman name for the goddess of love. This plane... 4.87 12104
Earth The name Earth comes from the Indo-European ba... 5.97 12756
Mars Named by the Romans for their god of war becau... 0.642 6792
Jupiter The largest and most massive of the planets wa... 1898 142984
Saturn Roman name for the Greek Cronos, father of Zeu... 568 120536
Uranus Several astronomers, including Flamsteed and L... 86.8 51118
Neptune Neptune was "predicted" by John Couch Adams an... 102 49528
Pluto Pluto was discovered at Lowell Observatory in ... 0.0146 2370
  1. And the DataFrame can be saved to a CSV file with a simple call to .to_csv() (code is in 03/save_csv_pandas.py):
import pandas as pd
from get_planet_data import get_planet_data

# construct a DataFrame from the list
planets = get_planet_data()
planets_df = pd.DataFrame(planets).set_index('Name')
planets_df.to_csv("../../www/planets_pandas.csv")
  1. A CSV file can be read in from a URL very easily with pd.read_csv(), with no need for other libraries. You can use the code in 03/read_csv_via_pandas.py:
import pandas as pd
planets_df = pd.read_csv("http://localhost:8080/planets_pandas.csv", index_col='Name')
print(planets_df)
  1. Converting data to JSON is also quite easy. Manipulation of JSON with Python can be done with the Python json library. This library can be used to convert Python objects to and from JSON. The following converts the list of planets into JSON and prints it to the console (code in 03/convert_to_json.py):
import json
from get_planet_data import get_planet_data
planets=get_planet_data()
print(json.dumps(planets, indent=4))

Executing this script produces the following output (some of the output is omitted):

[
    {
        "Name": "Mercury",
        "Mass": "0.330",
        "Radius": "4879",
        "Description": "Named Mercurius by the Romans because it appears to move so swiftly.",
        "MoreInfo": "https://en.wikipedia.org/wiki/Mercury_(planet)"
    },
    {
        "Name": "Venus",
        "Mass": "4.87",
        "Radius": "12104",
        "Description": "Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.",
        "MoreInfo": "https://en.wikipedia.org/wiki/Venus"
    },
  1. And this can also be used to easily save JSON to a file (03/save_as_json.py):
import json
from get_planet_data import get_planet_data
planets = get_planet_data()
with open('../../www/planets.json', 'w+') as jsonFile:
    json.dump(planets, jsonFile, indent=4)
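One way to confirm that a saved file holds the same data is to load it back with json.load() and compare. This is a minimal sketch under stated assumptions: the planets list is a shortened stand-in for get_planet_data(), and a temporary directory replaces the www folder.

```python
import json
import os
import tempfile

# Stand-in data (shortened); the recipe's get_planet_data() is not reproduced here.
planets = [
    {"Name": "Mercury", "Mass": "0.330", "Radius": "4879"},
    {"Name": "Venus", "Mass": "4.87", "Radius": "12104"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "planets.json")

    # Save the list as indented JSON, as in the recipe.
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(planets, json_file, indent=4)

    # Load it back into Python objects.
    with open(path, encoding="utf-8") as json_file:
        restored = json.load(json_file)

print(restored == planets)
```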
  1. Checking the output using !head -n 13 ../../www/planets.json shows:
[
    {
        "Name": "Mercury",
        "Mass": "0.330",
        "Radius": "4879",
        "Description": "Named Mercurius by the Romans because it appears to move so swiftly.",
        "MoreInfo": "https://en.wikipedia.org/wiki/Mercury_(planet)"
    },
    {
        "Name": "Venus",
        "Mass": "4.87",
        "Radius": "12104",
        "Description": "Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.",
  1. JSON can be read from a web server with requests and converted to a Python object (03/read_http_json_requests.py):
import requests
import json

planets_request = requests.get("http://localhost:8080/planets.json")
print(json.loads(planets_request.text))
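Note that the parsed objects keep Mass and Radius as strings, as the JSON output shown earlier indicates. If numeric values are needed, they can be converted after parsing. The following is a rough sketch: json_text is a shortened stand-in for planets_request.text, and to_numeric is a hypothetical helper, not part of the book's code.

```python
import json

# Shortened stand-in for planets_request.text.
json_text = """
[
    {"Name": "Mercury", "Mass": "0.330", "Radius": "4879"},
    {"Name": "Venus", "Mass": "4.87", "Radius": "12104"}
]
"""


def to_numeric(planet):
    """Return a copy of the planet dict with Mass and Radius as numbers."""
    converted = dict(planet)
    converted["Mass"] = float(planet["Mass"])
    converted["Radius"] = int(planet["Radius"])
    return converted


planets = [to_numeric(planet) for planet in json.loads(json_text)]

# With numeric values, comparisons behave as expected.
heaviest = max(planets, key=lambda planet: planet["Mass"])
print(heaviest["Name"])
```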
  1. pandas also provides the ability to save a DataFrame as JSON (03/save_json_pandas.py):
import pandas as pd
from get_planet_data import get_planet_data

planets = get_planet_data()
planets_df = pd.DataFrame(planets).set_index('Name')
planets_df.reset_index().to_json("../../www/planets_pandas.json", orient='records')

Older versions of pandas had no way to pretty-print the JSON that is output from .to_json(); pandas 1.0 and later accept an indent argument for this. Also note the use of orient='records' and the use of reset_index(). These are necessary for reproducing a JSON structure identical to the one written in the json library example.
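The effect of orient='records' together with reset_index() can be checked on a small example. This sketch assumes pandas is installed and uses a shortened stand-in for get_planet_data(); it relies on .to_json() returning the JSON text when no path is given.

```python
import json

import pandas as pd

# Stand-in data (shortened); the recipe's get_planet_data() is not reproduced here.
planets = [
    {"Name": "Mercury", "Mass": "0.330", "Radius": "4879"},
    {"Name": "Venus", "Mass": "4.87", "Radius": "12104"},
]
planets_df = pd.DataFrame(planets).set_index("Name")

# reset_index() turns Name back into a column; orient='records' emits a list of objects.
records_json = planets_df.reset_index().to_json(orient="records")

# The default orientation for a DataFrame nests values under each column name instead.
default_json = planets_df.to_json()

print(json.loads(records_json) == planets)
print(sorted(json.loads(default_json)))
```

Without reset_index(), the Name values would be missing from each record, because orient='records' does not write the index.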

  1. JSON can be read into a DataFrame using pd.read_json(), which accepts both URLs and file paths (03/read_json_http_pandas.py):
import pandas as pd
planets_df = pd.read_json("http://localhost:8080/planets_pandas.json").set_index('Name')
print(planets_df)