Building Machine Learning Systems with Python

Reading in the data

We have collected the web statistics for the last month and aggregated them in a file named ch01/data/web_traffic.tsv (.tsv because it contains tab-separated values). They are stored as the number of hits per hour. Each line contains the hour and the number of web hits in that hour. The hours are listed consecutively.

Using NumPy's genfromtxt(), we can read in the data with the following code:

>>> import numpy as np
>>> data = np.genfromtxt("web_traffic.tsv", delimiter="\t")

We have to specify tabs as the delimiter so that the columns are correctly determined. A quick check shows that we have correctly read in the data:

>>> print(data[:10])
[[  1.00000000e+00   2.27333105e+03]
 [  2.00000000e+00   1.65725549e+03]
 [  3.00000000e+00              nan]
 [  4.00000000e+00   1.36684644e+03]
 [  5.00000000e+00   1.48923438e+03]
 [  6.00000000e+00   1.33802002e+03]
 [  7.00000000e+00   1.88464734e+03]
 [  8.00000000e+00   2.28475415e+03]
 [  9.00000000e+00   1.33581091e+03]
 [  1.00000000e+01   1.02583240e+03]]
>>> print(data.shape)
(743, 2)

As you can see, we have 743 data points, each with two values: the hour and the number of web hits in that hour.
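
The output above contains a nan entry for hour 3, so it can be useful to separate the two columns and check how many hit counts are missing before going further. The following is a minimal sketch of one way to do that, not necessarily the approach the book takes. To keep it self-contained, it reads a small in-memory stand-in built from the first five rows printed above instead of the real file. The assumption that a missing value appears as the literal text nan in the file, and the names sample, hours, hits, missing, and n_missing, are illustrative only.

```python
import io

import numpy as np

# Hypothetical stand-in for the first rows of web_traffic.tsv, using the
# values printed above. Assumption: a missing hit count is stored as "nan".
sample = io.StringIO(
    "1\t2273.33105\n"
    "2\t1657.25549\n"
    "3\tnan\n"
    "4\t1366.84644\n"
    "5\t1489.23438\n"
)

# Same call as in the text, applied to the in-memory sample.
data = np.genfromtxt(sample, delimiter="\t")

# Split the two columns into separate vectors.
hours = data[:, 0]
hits = data[:, 1]

# Boolean mask marking the rows whose hit count is missing.
missing = np.isnan(hits)
n_missing = int(missing.sum())
print(data.shape, n_missing)

# Keep only the rows with a valid hit count.
hours_clean = hours[~missing]
hits_clean = hits[~missing]
```

With the real file, replacing sample with the path to web_traffic.tsv should give the same structure with 743 rows. Whether to drop the rows with missing values, as done here, or to fill them in some other way depends on the analysis that follows.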