Practical Data Analysis

Data formats

When we work with data intended for human consumption, the easiest way to store it is in text files. In this section, we present parsing examples for the most common formats, such as CSV, JSON, and XML. These examples will be very helpful in the following chapters.

Tip

The dataset used for these examples is a list of Pokémon characters by National Pokedex number, obtained at the URL http://bulbapedia.bulbagarden.net/.

All the scripts and dataset files can be found in the author's GitHub repository available at the URL https://github.com/hmcuesta/PDA_Book/tree/master/Chapter3/.

CSV

CSV is a very simple and common open format for table-like data, which can be exported and imported by most data analysis tools. CSV is a plain text format; this means that the file is a sequence of characters, with no data that has to be interpreted in another way (for example, as binary numbers).

There are many ways to parse a CSV file from Python; in a moment we will discuss two of them.

The first eight records of the CSV file (pokemon.csv) look as follows:

 id, typeTwo, name, type
 001, Poison, Bulbasaur, Grass
 002, Poison, Ivysaur, Grass
 003, Poison, Venusaur, Grass
 006, Flying, Charizard, Fire
 012, Flying, Butterfree, Bug
 013, Poison, Weedle, Bug
 014, Poison, Kakuna, Bug
 015, Poison, Beedrill, Bug
. . .
Parsing a CSV file with the csv module

Firstly, we need to import the csv module:

import csv

Then we open the file pokemon.csv and parse it with the function csv.reader(f):

with open("pokemon.csv") as f:
    data = csv.reader(f)
    #Now we just iterate over the reader 

    for line in data:
        print(" id: {0} , typeTwo: {1}, name:  {2}, type: {3}"
              .format(line[0],line[1],line[2],line[3]))

Output:
id: id , typeTwo: typeTwo, name: name, type: type
id: 001 , typeTwo: Poison, name: Bulbasaur, type: Grass
id: 002 , typeTwo: Poison, name: Ivysaur, type: Grass
id: 003 , typeTwo: Poison, name: Venusaur, type: Grass
id: 006 , typeTwo: Flying, name: Charizard, type: Fire
. . .
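
Since the sample file has a header row and a space after each comma, csv.DictReader can be a convenient alternative to indexing each line by position. The following is only a minimal sketch: the in-memory SAMPLE string and the helper name read_rows are assumptions made for illustration so that the code runs without pokemon.csv.

```python
import csv
import io

# Hypothetical in-memory sample in the same layout as pokemon.csv,
# used here only so the sketch runs without the real file.
SAMPLE = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
006, Flying, Charizard, Fire
"""


def read_rows(f):
    """Return the rows of a CSV file object as dicts keyed by the header."""
    # skipinitialspace=True drops the blank that follows each comma,
    # so both the header names and the values come back without it.
    reader = csv.DictReader(f, skipinitialspace=True)
    return [dict(row) for row in reader]


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print("id: {id}, typeTwo: {typeTwo}, name: {name}, type: {type}".format(**row))
```

To read the real file, the same helper could be called inside a with open("pokemon.csv", newline="") as f: block. The header line is consumed as the field names, so it is not printed as a record.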
Parsing a CSV file using NumPy

Perform the following steps for parsing a CSV file:

  1. Firstly, we need to import the numpy library:
    import numpy as np
  2. NumPy provides us with the genfromtxt function, which we call here with four arguments. First, we provide the name of the file, pokemon.csv. Then we skip the first line, which is the header (skip_header). Next, we specify the data type (dtype); passing None lets NumPy infer the type of each column. Finally, we define the comma as the delimiter.
    data = np.genfromtxt("pokemon.csv"
                            ,skip_header=1
                            ,dtype=None
                            ,delimiter=',')
  3. Then just print the result.
    print(data)
    
    Output:
    [(1, b' Poison', b' Bulbasaur', b' Grass')
     (2, b' Poison', b' Ivysaur', b' Grass')
     (3, b' Poison', b' Venusaur', b' Grass')
     (6, b' Flying', b' Charizard', b' Fire')
     (12, b' Flying', b' Butterfree', b' Bug')
     . . .]
    

JSON

JSON is a common format to exchange data. Although it is derived from JavaScript, Python provides us with a library to parse JSON.

Parsing a JSON file using the json module

The first three records of the JSON file (pokemon.json) look as follows:

 [
    {
        "id": " 001",
        "typeTwo": " Poison",
        "name": " Bulbasaur",
        "type": " Grass"
    },
    {
        "id": " 002",
        "typeTwo": " Poison",
        "name": " Ivysaur",
        "type": " Grass"
    },
    {
        "id": " 003",
        "typeTwo": " Poison",
        "name": " Venusaur",
        "type": " Grass"
    },
. . .]

Firstly, we need to import the json module and pprint (pretty-print) module.

import json
from pprint import pprint

Then we open the file pokemon.json and parse its contents with the function json.loads.

with open("pokemon.json") as f:
    data = json.loads(f.read())

Finally, just print the result with the function pprint.

pprint(data)

Output:

[{'id': ' 001', 'name': ' Bulbasaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 002', 'name': ' Ivysaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 003', 'name': ' Venusaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 006', 'name': ' Charizard', 'type': ' Fire', 'typeTwo': ' Flying'},
 {'id': ' 012', 'name': ' Butterfree', 'type': ' Bug', 'typeTwo': ' Flying'}, . . . ]
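
Note that every value in the sample data carries a leading space. As a minimal sketch, we can read the records with json.load, which takes a file object directly instead of a string, and strip that space while loading. The SAMPLE string and the helper name load_records are assumptions made for illustration.

```python
import io
import json

# Hypothetical in-memory sample in the same layout as pokemon.json,
# used here only so the sketch runs without the real file.
SAMPLE = """[
    {"id": " 001", "typeTwo": " Poison", "name": " Bulbasaur", "type": " Grass"},
    {"id": " 002", "typeTwo": " Poison", "name": " Ivysaur", "type": " Grass"}
]"""


def load_records(f):
    """Parse a JSON array of flat string records and strip each value."""
    records = json.load(f)  # json.load reads from the file object itself
    return [
        {key: value.strip() for key, value in record.items()}
        for record in records
    ]


records = load_records(io.StringIO(SAMPLE))
# json.dumps is the inverse operation: it serializes the data back to text.
print(json.dumps(records[0], indent=4))
```

This sketch assumes that every value is a string, as in the sample above; records with numbers or nested objects would need a type check before calling strip.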

XML

According to the World Wide Web Consortium (W3C), at http://www.w3.org/XML/:

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

The first three records of the XML file (pokemon.xml) look as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<pokemon>
  <row>
    <id> 001</id>
    <typeTwo> Poison</typeTwo>
    <name> Bulbasaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 002</id>
    <typeTwo> Poison</typeTwo>
    <name> Ivysaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 003</id>
    <typeTwo> Poison</typeTwo>
    <name> Venusaur</name>
    <type> Grass</type>
  </row>
. . .
</pokemon>
Parsing an XML file in Python using the xml module

Firstly, we need to import the ElementTree module from the xml.etree package.

from xml.etree import ElementTree

Then we open the file pokemon.xml and parse it with the function ElementTree.parse.

with open("pokemon.xml") as f:
    doc = ElementTree.parse(f)

Finally, we use the findall function to visit each 'row' element and print its fields:

for node in doc.findall('row'):
    print("")
    print("id: {0}".format(node.find('id').text))
    print("typeTwo: {0}".format(node.find('typeTwo').text))
    print("name: {0}".format(node.find('name').text))
    print("type: {0}".format(node.find('type').text))
        
Output:

id: 001
typeTwo: Poison
name: Bulbasaur
type: Grass

id: 002
typeTwo: Poison
name: Ivysaur
type: Grass

id: 003
typeTwo: Poison
name: Venusaur
type: Grass

. . .
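
Instead of printing each field, we may prefer to collect the rows into a list of dictionaries, which gives the same shape as the parsed JSON data. The following is a minimal sketch under that assumption: the SAMPLE bytes and the helper name rows_as_dicts are introduced here for illustration only.

```python
import io
from xml.etree import ElementTree

# Hypothetical in-memory sample in the same layout as pokemon.xml,
# used here only so the sketch runs without the real file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<pokemon>
  <row>
    <id> 001</id>
    <typeTwo> Poison</typeTwo>
    <name> Bulbasaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 002</id>
    <typeTwo> Poison</typeTwo>
    <name> Ivysaur</name>
    <type> Grass</type>
  </row>
</pokemon>
"""


def rows_as_dicts(f):
    """Parse the XML and return one dict per <row>, keyed by child tag."""
    doc = ElementTree.parse(f)
    return [
        # Iterating over an element yields its direct children;
        # strip() removes the leading space stored in each text node.
        {child.tag: (child.text or "").strip() for child in row}
        for row in doc.findall("row")
    ]


rows = rows_as_dicts(io.BytesIO(SAMPLE))
for row in rows:
    print(row)
```

Because the keys come from the tag names, this helper does not need to know the column names in advance, unlike the explicit find calls used earlier.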

YAML

YAML Ain't Markup Language (YAML) is a human-friendly data serialization format. It is not as popular as JSON or XML, but it was designed to map easily onto the data types common to most high-level languages. A Python parser called PyYAML is available from the PyPI repository, and its usage is very similar to that of the json module.

The first three records of the YAML file (pokemon.yaml) look as follows:

Pokemon:
 - id: 001
   typeTwo: Poison
   name: Bulbasaur
   type: Grass
 - id: 002
   typeTwo: Poison
   name: Ivysaur
   type: Grass
 - id: 003
   typeTwo: Poison
   name: Venusaur
   type: Grass
. . .