
Data formats
When we are working with data for human consumption, the easiest way to store it is in text files. In this section, we present parsing examples for the most common formats, such as CSV, JSON, and XML, along with a brief look at YAML. These examples will be very helpful in the next chapters.
Tip
The dataset used for these examples is a list of Pokémon characters by National Pokedex number, obtained at the URL http://bulbapedia.bulbagarden.net/.
All the scripts and dataset files can be found in the author's GitHub repository available at the URL https://github.com/hmcuesta/PDA_Book/tree/master/Chapter3/.
CSV
CSV is a very simple and common open format for table-like data, which can be exported and imported by most data analysis tools. CSV is a plain text format: the file is a sequence of characters, with no data that has to be interpreted in another way, such as binary numbers.
There are many ways to parse a CSV file from Python, and in a moment we will discuss two of them: the csv module and NumPy.
The first eight records of the CSV file (pokemon.csv) look as follows:
id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
003, Poison, Venusaur, Grass
006, Flying, Charizard, Fire
012, Flying, Butterfree, Bug
013, Poison, Weedle, Bug
014, Poison, Kakuna, Bug
015, Poison, Beedrill, Bug
. . .
Parsing a CSV file with the csv module
Firstly, we need to import the csv module:
import csv
Then we open the file pokemon.csv and parse it with the function csv.reader(f):
with open("pokemon.csv") as f:
    data = csv.reader(f)
    # Now we just iterate over the reader
    for line in data:
        print(" id: {0} , typeTwo: {1}, name: {2}, type: {3}"
              .format(line[0], line[1], line[2], line[3]))
Output:
id: id , typeTwo: typeTwo, name: name, type: type
id: 001 , typeTwo: Poison, name: Bulbasaur, type: Grass
id: 002 , typeTwo: Poison, name: Ivysaur, type: Grass
id: 003 , typeTwo: Poison, name: Venusaur, type: Grass
id: 006 , typeTwo: Flying, name: Charizard, type: Fire
. . .
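The values in this file carry a blank after each comma, and the header row is printed like any other record. As a minimal sketch of one way to handle both, assuming a small in-memory sample in place of pokemon.csv (the names SAMPLE and read_records are illustrative and not part of the book's scripts), csv.DictReader can key each row by the header fields:

```python
import csv
import io

# Assumption: a small in-memory sample standing in for pokemon.csv.
SAMPLE = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
006, Flying, Charizard, Fire
"""


def read_records(f):
    """Return one dict per data row, keyed by the header fields."""
    # skipinitialspace=True drops the blank that follows each comma,
    # in the header row as well as in the data rows.
    reader = csv.DictReader(f, skipinitialspace=True)
    return [dict(row) for row in reader]


# io.StringIO behaves like an open text file; open("pokemon.csv", newline="")
# could be passed to read_records in the same way.
records = read_records(io.StringIO(SAMPLE))
for rec in records:
    print(rec["id"], rec["name"], rec["type"], rec["typeTwo"])
```

Because the header row becomes the dictionary keys, it no longer appears as a record, and fields can be accessed by name rather than by position.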
Parsing a CSV file using NumPy
Perform the following steps for parsing a CSV file:
- Firstly, we need to import the numpy library:
import numpy as np
- NumPy provides us with the genfromtxt function, to which we pass four parameters here. First, we provide the name of the file, pokemon.csv. Then we skip the first line as a header (skip_header). Next, we specify the data type (dtype); passing None lets NumPy infer the type of each column. Finally, we define the comma as the delimiter.
data = np.genfromtxt("pokemon.csv", skip_header=1, dtype=None, delimiter=',')
- Then just print the result.
print(data)
Output:
[(1, b' Poison', b' Bulbasaur', b' Grass')
 (2, b' Poison', b' Ivysaur', b' Grass')
 (3, b' Poison', b' Venusaur', b' Grass')
 (6, b' Flying', b' Charizard', b' Fire')
 (12, b' Flying', b' Butterfree', b' Bug')
 . . .]
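The result above keeps the leading blanks and returns the text columns as bytes. The following is a hedged sketch of how other genfromtxt parameters might be used to get named, stripped columns; the in-memory SAMPLE is an assumption standing in for pokemon.csv, and the exact defaults can differ between NumPy versions:

```python
import io

import numpy as np

# Assumption: a small in-memory sample standing in for pokemon.csv.
SAMPLE = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
003, Poison, Venusaur, Grass
"""

data = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter=",",
    names=True,        # take the field names from the first line
    dtype=None,        # infer a type for each column
    autostrip=True,    # strip whitespace around each value
    encoding="utf-8",  # return text columns as str rather than bytes
)

# The result is a structured array, so columns can be selected by name.
print(data.dtype.names)
print(data["name"])
```

Note that the id column is inferred as an integer, so the leading zeros of the Pokedex numbers are not preserved.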
JSON
JSON is a common format to exchange data. Although it is derived from JavaScript, Python provides us with a library to parse JSON.
Parsing a JSON file using the json module
The first three records of the JSON file (pokemon.json) look as follows:
[
  {
    "id": " 001",
    "typeTwo": " Poison",
    "name": " Bulbasaur",
    "type": " Grass"
  },
  {
    "id": " 002",
    "typeTwo": " Poison",
    "name": " Ivysaur",
    "type": " Grass"
  },
  {
    "id": " 003",
    "typeTwo": " Poison",
    "name": " Venusaur",
    "type": " Grass"
  },
  . . .
]
Firstly, we need to import the json module and the pprint (pretty-print) function from the pprint module.
import json
from pprint import pprint
Then we open the file pokemon.json and parse it with the function json.loads.
with open("pokemon.json") as f:
    data = json.loads(f.read())
Finally, just print the result with the function pprint.
pprint(data)
Output:
[{'id': ' 001', 'name': ' Bulbasaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 002', 'name': ' Ivysaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 003', 'name': ' Venusaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 006', 'name': ' Charizard', 'type': ' Fire', 'typeTwo': ' Flying'},
 {'id': ' 012', 'name': ' Butterfree', 'type': ' Bug', 'typeTwo': ' Flying'},
 . . . ]
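Once parsed, the data is an ordinary list of dictionaries, so it can be cleaned and filtered with plain Python. A minimal sketch, assuming an in-memory sample in place of pokemon.json (SAMPLE and load_clean are illustrative names, not part of the book's scripts); it uses json.load, which reads directly from a file object and is equivalent to json.loads(f.read()):

```python
import io
import json

# Assumption: a two-record sample in the same shape as pokemon.json.
SAMPLE = """[
  {"id": " 001", "typeTwo": " Poison", "name": " Bulbasaur", "type": " Grass"},
  {"id": " 006", "typeTwo": " Flying", "name": " Charizard", "type": " Fire"}
]"""


def load_clean(f):
    """Parse a JSON array of flat objects and strip blanks from the values."""
    records = json.load(f)  # reads from the file object directly
    return [{key: value.strip() for key, value in rec.items()} for rec in records]


records = load_clean(io.StringIO(SAMPLE))

# Select the names of all records whose primary type is Grass.
grass = [rec["name"] for rec in records if rec["type"] == "Grass"]
print(grass)
```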
XML
According to the World Wide Web Consortium (W3C), at http://www.w3.org/XML/:
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
The first three records of the XML file (pokemon.xml) look as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<pokemon>
  <row>
    <id> 001</id>
    <typeTwo> Poison</typeTwo>
    <name> Bulbasaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 002</id>
    <typeTwo> Poison</typeTwo>
    <name> Ivysaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 003</id>
    <typeTwo> Poison</typeTwo>
    <name> Venusaur</name>
    <type> Grass</type>
  </row>
  . . .
</pokemon>
Parsing an XML file in Python using the xml module
Firstly, we need to import the ElementTree module from the xml.etree package.
from xml.etree import ElementTree
Then we open the file pokemon.xml and parse it with the function ElementTree.parse.
with open("pokemon.xml") as f:
    doc = ElementTree.parse(f)
Finally, just print each 'row' element with the findall function:
for node in doc.findall('row'):
    print("")
    print("id: {0}".format(node.find('id').text))
    print("typeTwo: {0}".format(node.find('typeTwo').text))
    print("name: {0}".format(node.find('name').text))
    print("type: {0}".format(node.find('type').text))
Output:

id: 001
typeTwo: Poison
name: Bulbasaur
type: Grass

id: 002
typeTwo: Poison
name: Ivysaur
type: Grass

id: 003
typeTwo: Poison
name: Venusaur
type: Grass
. . .
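To work with the rows rather than only print them, each 'row' element can be turned into a dictionary, which gives the same shape as the JSON example. This is a minimal sketch under the assumption of a short in-memory document in place of pokemon.xml; it uses ElementTree.fromstring, which parses a string and returns the root element:

```python
from xml.etree import ElementTree

# Assumption: a two-row sample in the same shape as pokemon.xml.
SAMPLE = """<pokemon>
  <row>
    <id> 001</id>
    <typeTwo> Poison</typeTwo>
    <name> Bulbasaur</name>
    <type> Grass</type>
  </row>
  <row>
    <id> 002</id>
    <typeTwo> Poison</typeTwo>
    <name> Ivysaur</name>
    <type> Grass</type>
  </row>
</pokemon>"""

root = ElementTree.fromstring(SAMPLE)  # root is the <pokemon> element

# Iterating over a row yields its child elements; map each tag to its
# stripped text (an empty element has text None, hence the "or").
records = [
    {child.tag: (child.text or "").strip() for child in row}
    for row in root.findall("row")
]

for rec in records:
    print(rec)
```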
YAML
YAML Ain't Markup Language (YAML) is a human-friendly data serialization format. It's not as popular as JSON or XML, but it was designed to be easily mapped to data types common to most high-level languages. A Python parser implementation called PyYAML is available in the PyPI repository, and its interface is very similar to that of the json module.
The first three records of the YAML file (pokemon.yaml) look as follows:
Pokemon:
  - id : 001
    typeTwo : Poison
    name : Bulbasaur
    type : Grass
  - id : 002
    typeTwo : Poison
    name : Ivysaur
    type : Grass
  - id : 003
    typeTwo : Poison
    name : Venusaur
    type : Grass
  . . .
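Parsing YAML with PyYAML follows the same pattern as the JSON example. The sketch below assumes that PyYAML is installed (it is a third-party package, not part of the standard library) and uses a short in-memory sample in place of pokemon.yaml; yaml.safe_load builds only plain Python objects such as dictionaries, lists, strings, and numbers:

```python
import yaml  # PyYAML, a third-party package from PyPI

# Assumption: a two-record sample in the same shape as pokemon.yaml.
SAMPLE = """
Pokemon:
  - id : 001
    typeTwo : Poison
    name : Bulbasaur
    type : Grass
  - id : 002
    typeTwo : Poison
    name : Ivysaur
    type : Grass
"""

# safe_load accepts a string or an open file, much like json.loads/json.load.
data = yaml.safe_load(SAMPLE)

# The top-level mapping has one key, "Pokemon", holding a list of records.
# Unquoted values such as 001 may be loaded as numbers; quoting them in the
# file ("001") would keep them as strings.
names = [entry["name"] for entry in data["Pokemon"]]
print(names)
```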