Python:Data Analytics and Visualization

The Pandas data structure

Let's first get acquainted with two of Pandas' primary data structures: the Series and the DataFrame. Together they handle the majority of use cases in finance, statistics, social science, and many areas of engineering.

Series

A Series is a one-dimensional object similar to an array, a list, or a column in a table. Each item in a Series is associated with a label in an index:

>>> import numpy as np
>>> import pandas as pd
>>> s1 = pd.Series(np.random.rand(4),
 index=['a', 'b', 'c', 'd'])
>>> s1
a 0.6122
b 0.98096
c 0.3350
d 0.7221
dtype: float64

If no index is passed, Pandas creates a default integer index with values ranging from 0 to N-1, where N is the length of the Series:

>>> s2 = pd.Series(np.random.rand(4))
>>> s2
0 0.6913
1 0.8487
2 0.8627
3 0.7286
dtype: float64

We can read or assign the values of a Series through its index labels. A single label returns one value, and a list of labels returns a new Series:

>>> s1['c']
0.3350
>>> s1['c'] = 3.14
>>> s1[['c', 'a', 'b']]
c 3.14
a 0.6122
b 0.98096
dtype: float64
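
The different ways of reading a Series can be put side by side in a short sketch. The values below are made up for illustration (they are not the random numbers from the session above), and the use of iloc for positional access is an addition to what the text describes:

```python
import pandas as pd

# Illustrative values; the labels mirror s1 from the text.
s = pd.Series([0.61, 0.98, 0.33, 0.72], index=['a', 'b', 'c', 'd'])

single = s['c']               # one label returns a scalar
subset = s[['c', 'a', 'b']]   # a list of labels returns a new Series
by_position = s.iloc[2]       # positional access, independent of the labels

s['c'] = 3.14                 # assignment through a label updates the Series

print(subset)
print(s['c'])
```

Note the double brackets in the second lookup: the outer pair is the indexing operator and the inner pair is the list of labels.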

This style of access is similar to that of a Python dictionary. Accordingly, Pandas also allows us to initialize a Series object directly from a Python dictionary:

>>> s3 = pd.Series({'001': 'Nam', '002': 'Mary',
 '003': 'Peter'})
>>> s3
001 Nam
002 Mary
003 Peter
dtype: object

Sometimes, we want to filter or reorder the index of a Series created from a Python dictionary. In that case, we can pass the desired list of labels to the constructor through the index argument. Only dictionary keys that appear in the index list end up in the Series object. Conversely, labels in the index list that are missing from the dictionary are given the default value NaN:

>>> s4 = pd.Series({'001': 'Nam', '002': 'Mary',
 '003': 'Peter'}, index=[
 '002', '001', '024', '065'])
>>> s4
002 Mary
001 Nam
024 NaN
065 NaN
dtype: object

The library also supports functions that detect missing data:

>>> pd.isnull(s4)
002 False
001 False
024 True
065 True
dtype: bool
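
As a hedged sketch of how the missing-data check is typically used, the boolean result can serve as a filter. The notnull and dropna methods used here are standard Pandas methods that the text above does not mention:

```python
import pandas as pd

people = {'001': 'Nam', '002': 'Mary', '003': 'Peter'}
s = pd.Series(people, index=['002', '001', '024', '065'])

missing_mask = s.isnull()      # True where the label had no entry in the dict
present = s[s.notnull()]       # boolean mask keeps only the known values
dropped = s.dropna()           # shortcut that gives the same result

print(missing_mask)
print(present)
```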

Similarly, we can also initialize a Series from a scalar value:

>>> s5 = pd.Series(2.71, index=['x', 'y'])
>>> s5
x 2.71
y 2.71
dtype: float64

A Series object can be initialized with NumPy objects as well, such as ndarray. Moreover, Pandas can automatically align data indexed in different ways in arithmetic operations:

>>> s6 = pd.Series(np.array([2.71, 3.14]), index=['z', 'y'])
>>> s6
z 2.71
y 3.14
dtype: float64
>>> s5 + s6
x NaN
y 5.85
z NaN
dtype: float64
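
The NaN entries appear because a label must exist on both sides for the sum to be defined. If a default for the missing side is preferred, one option (an addition to the text, using the standard Series.add method with its fill_value parameter) looks like this:

```python
import numpy as np
import pandas as pd

s5 = pd.Series(2.71, index=['x', 'y'])
s6 = pd.Series(np.array([2.71, 3.14]), index=['z', 'y'])

plain = s5 + s6                     # NaN where a label is missing on one side
filled = s5.add(s6, fill_value=0)   # treat the missing side as 0 instead

print(plain)
print(filled)
```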

The DataFrame

The DataFrame is a tabular data structure comprising a set of ordered columns and rows. It can be thought of as a group of Series objects, one per column, that share the same row index. There are a number of ways to initialize a DataFrame object. First, let's take a look at the common example of creating a DataFrame from a dictionary of lists:

>>> data = {'Year': [2000, 2005, 2010, 2014],
 'Median_Age': [24.2, 26.4, 28.5, 30.3],
 'Density': [244, 256, 268, 279]}
>>> df1 = pd.DataFrame(data)
>>> df1
 Density Median_Age Year
0 244 24.2 2000
1 256 26.4 2005
2 268 28.5 2010
3 279 30.3 2014

In the Pandas version used here, the DataFrame constructor orders the columns of a dictionary alphabetically (more recent releases keep the dictionary's insertion order instead). We can set the order explicitly by passing the columns argument to the constructor:

>>> df2 = pd.DataFrame(data, columns=['Year', 'Density', 
 'Median_Age'])
>>> df2
 Year Density Median_Age
0 2000 244 24.2
1 2005 256 26.4
2 2010 268 28.5
3 2014 279 30.3
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
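
The columns argument does more than reorder. The sketch below, which reuses the data dictionary from the text, also selects a subset of columns after construction and shows what happens when a requested column is not in the dictionary. The column name Population is hypothetical and chosen only to trigger that case:

```python
import pandas as pd

data = {'Year': [2000, 2005, 2010, 2014],
        'Median_Age': [24.2, 26.4, 28.5, 30.3],
        'Density': [244, 256, 268, 279]}

# Explicit column order at construction time.
df = pd.DataFrame(data, columns=['Year', 'Density', 'Median_Age'])

# Selecting with a list of names reorders or trims an existing DataFrame.
reordered = df[['Median_Age', 'Year']]

# A name that is not a key of the dictionary yields a column of NaN.
extra = pd.DataFrame(data, columns=['Year', 'Population'])

print(reordered)
print(extra)
```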

We can provide the index labels of a DataFrame just as we did for a Series:

>>> df3 = pd.DataFrame(data, columns=['Year', 'Density', 
 'Median_Age'], index=['a', 'b', 'c', 'd'])
>>> df3.index
Index([u'a', u'b', u'c', u'd'], dtype='object')

We can construct a DataFrame out of nested lists as well:

>>> df4 = pd.DataFrame([
 ['Peter', 16, 'pupil', 'TN', 'M', None],
 ['Mary', 21, 'student', 'SG', 'F', None],
 ['Nam', 22, 'student', 'HN', 'M', None],
 ['Mai', 31, 'nurse', 'SG', 'F', None],
 ['John', 28, 'lawyer', 'SG', 'M', None]],
 columns=['name', 'age', 'career', 'province', 'sex', 'award'])

Each column can be retrieved as a Series by its name, either with dictionary-like notation or as an attribute, provided the column name is a syntactically valid attribute name:

>>> df4.name # or df4['name'] 
0 Peter
1 Mary
2 Nam
3 Mai
4 John
Name: name, dtype: object

To modify or append a new column to the created DataFrame, we specify the column name and the value we want to assign:

>>> df4['award'] = None
>>> df4
 name age career province sex award
0 Peter 16 pupil TN M None
1 Mary 21 student SG F None
2 Nam 22 student HN M None
3 Mai 31 nurse SG F None
4 John 28 lawyer SG M None
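
The assigned value does not have to be a single scalar. The following sketch, on a reduced version of the table, shows a scalar being broadcast, a column derived from another column, and an existing column being overwritten. The column name is_adult is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Peter', 'Mary', 'Nam'],
                   'age': [16, 21, 22]})

df['award'] = None                 # a scalar is repeated for every row
df['is_adult'] = df['age'] >= 18   # new column computed from an existing one
df['age'] = df['age'] + 1          # assigning to an existing name replaces it

print(df)
```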

Rows can be retrieved by label with the loc indexer or by integer position with the iloc indexer. With the default integer index, both return the same row:

>>> df4.loc[1]
name Mary
age 21
career student
province SG
sex F
award None
Name: 1, dtype: object
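
The difference between selecting by name and by position is easier to see when the index is not the default range of integers. A minimal sketch with made-up labels, using the standard loc and iloc indexers:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Peter', 'Mary', 'Nam'],
                   'age': [16, 21, 22]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b']       # row whose index label is 'b'
by_position = df.iloc[1]     # second row, whatever its label is
first_two = df.iloc[:2]      # positional slices return a DataFrame

print(by_label)
print(first_two)
```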

A DataFrame object can also be created from different data structures such as a list of dictionaries, a dictionary of Series, or a record array. The method to initialize a DataFrame object is similar to the examples above.
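
Two of the alternatives mentioned above might look like the following sketch. The sample values are illustrative. When the Series in a dictionary have different indexes, the resulting DataFrame uses the union of the labels and fills the gaps with NaN:

```python
import pandas as pd

# A list of dictionaries: each dictionary becomes one row.
records = [{'name': 'Peter', 'age': 16},
           {'name': 'Mary', 'age': 21}]
df_records = pd.DataFrame(records)

# A dictionary of Series: each Series becomes one column,
# and rows are aligned on the Series index labels.
series_dict = {
    'age': pd.Series([16, 21], index=['Peter', 'Mary']),
    'province': pd.Series(['TN', 'SG', 'HN'],
                          index=['Peter', 'Mary', 'Nam']),
}
df_series = pd.DataFrame(series_dict)

print(df_records)
print(df_series)
```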

Another common case is to load a DataFrame from an external source such as a text file. For this we use the read_csv function, which by default expects a comma as the column separator. We can change that with the sep parameter:

# person.csv file
name,age,career,province,sex
Peter,16,pupil,TN,M
Mary,21,student,SG,F
Nam,22,student,HN,M
Mai,31,nurse,SG,F
John,28,lawyer,SG,M
# loading person.csv into a DataFrame
>>> df4 = pd.read_csv('person.csv')
>>> df4
 name age career province sex
0 Peter 16 pupil TN M
1 Mary 21 student SG F
2 Nam 22 student HN M
3 Mai 31 nurse SG F
4 John 28 lawyer SG M
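
The sep parameter mentioned above can be tried without creating a file. In this sketch the file content is held in a string and wrapped in io.StringIO, which read_csv accepts in place of a path. The semicolon-separated sample is made up for illustration:

```python
import io

import pandas as pd

# Stand-in for a file whose columns are separated by semicolons.
text = "name;age;career\nPeter;16;pupil\nMary;21;student\n"

df = pd.read_csv(io.StringIO(text), sep=';')

print(df)
```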

While reading a data file, we sometimes want to skip a line or an invalid value. As of Pandas 0.16.2, read_csv supports over 50 parameters for controlling the loading process. Some commonly useful parameters are as follows:

  • sep: This is the delimiter between columns. The default is a comma.
  • dtype: This is the data type for the data or for individual columns.
  • header: This gives the row number(s) to use as the column names.
  • skiprows: This gives the line numbers to skip, or the number of lines to skip at the start of the file.
  • error_bad_lines: This controls how invalid lines (for example, lines with too many fields) are handled. By default, such a line raises an exception and no DataFrame is returned. If we set this parameter to False, the bad lines are skipped. Recent Pandas releases replace this parameter with on_bad_lines.
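
A few of these parameters can be combined in one call. The sketch below assumes a small made-up export whose first line is a comment to be skipped, and leaves out error_bad_lines because its availability depends on the Pandas version:

```python
import io

import pandas as pd

# Hypothetical file content: one junk line, then a header row, then data.
raw = ("exported by a hypothetical tool\n"
       "name,age,province\n"
       "Peter,16,TN\n"
       "Mary,21,SG\n")

df = pd.read_csv(
    io.StringIO(raw),
    sep=',',                   # the default, written out for clarity
    skiprows=1,                # drop the first line of the file
    header=0,                  # first remaining line holds the column names
    dtype={'age': 'float64'},  # force a type for one column
)

print(df)
print(df.dtypes)
```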

Pandas can also read a DataFrame from, and write it to, a database, for example through the read_frame and write_frame functions in the pandas.io.sql module (later releases replace them with read_sql and DataFrame.to_sql). We will come back to these methods later in this chapter.
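
As a rough preview, and only as a sketch under assumptions, the round trip to a database could look like the following. It uses an in-memory SQLite database from the Python standard library together with the to_sql method and the read_sql function found in more recent Pandas releases, rather than read_frame and write_frame. The table name person is made up:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'name': ['Peter', 'Mary'], 'age': [16, 21]})

conn = sqlite3.connect(':memory:')         # throwaway in-memory database
df.to_sql('person', conn, index=False)     # write the DataFrame as a table
loaded = pd.read_sql('SELECT name, age FROM person ORDER BY age', conn)
conn.close()

print(loaded)
```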