Python Machine Learning Cookbook (Second Edition)

How to do it…

Let's see how to estimate bicycle demand distribution:

We first need to import a couple of new packages, as follows:

import csv
import numpy as np

We are processing a CSV file, so the csv package is useful for handling it. Let's import the data into the Python environment:

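filename = "bike_day.csv"
X, y = [], []
# Open the file in a with block so that it is closed once reading is done
with open(filename, 'r', newline='') as f:
    file_reader = csv.reader(f, delimiter=',')
    for row in file_reader:
        X.append(row[2:13])   # feature columns
        y.append(row[-1])     # output value (last column)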

This piece of code reads all the data from the CSV file. The csv.reader() function returns a reader object that iterates over the lines of the given CSV file, and each row is returned as a list of strings. From each row, we keep the feature columns in X and the last column, which holds the output value, in y. Next, we extract the feature names:

feature_names = np.array(X[0])

The feature names are useful when we label the graph later. Because the first row of X and y holds the column headers rather than data, we have to remove it:

X = np.array(X[1:]).astype(np.float32)
y = np.array(y[1:]).astype(np.float32)

This step also converts the two lists of strings into two NumPy arrays of 32-bit floats.
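
To make the header handling concrete, here is a minimal sketch that runs the same steps on a tiny in-memory CSV instead of bike_day.csv. The column names and values are made up for illustration; only the pattern of slicing off the first row and converting the rest mirrors the recipe.

```python
import csv
import io

import numpy as np

# Hypothetical three-column file: two features and a target (not the real data).
sample = io.StringIO("temp,hum,cnt\n0.34,0.80,985\n0.36,0.69,801\n")

X, y = [], []
for row in csv.reader(sample, delimiter=','):
    X.append(row[:-1])   # every column except the last is a feature
    y.append(row[-1])    # the last column is the target

# The first row holds the column headers, so keep it aside as names...
feature_names = np.array(X[0])

# ...and convert the remaining rows from strings to 32-bit floats.
X = np.array(X[1:]).astype(np.float32)
y = np.array(y[1:]).astype(np.float32)

print(feature_names)     # the header strings
print(X.shape, y.shape)  # (2, 2) (2,)
```

If the header row were left in place, the astype() call would fail, because strings such as 'temp' cannot be parsed as numbers.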

Let's shuffle these two arrays to make them independent of the order in which the data is arranged in the file:

from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=7)
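
The point of passing both arrays to shuffle() in one call is that they receive the same permutation, so every row keeps its own output value. The following sketch checks this on toy data invented for the purpose; it is an illustration, not part of the recipe.

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data: row i of X is [i, i], and its label is 10 * i.
X = np.array([[i, i] for i in range(5)], dtype=np.float32)
y = np.array([10 * i for i in range(5)], dtype=np.float32)

X_shuffled, y_shuffled = shuffle(X, y, random_state=7)

# Both arrays receive the same permutation, so each row keeps its label.
for features, label in zip(X_shuffled, y_shuffled):
    assert label == 10 * features[0]

# Re-running with the same random_state reproduces the same order.
X_again, y_again = shuffle(X, y, random_state=7)
assert np.array_equal(X_shuffled, X_again)
```

Shuffling X and y in two separate calls with different seeds would break this pairing.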

As we did earlier, we need to separate the data into training and testing data. This time, let's use 90% of the data for training and the remaining 10% for testing:

num_training = int(0.9 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]
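
The slicing above can be wrapped in a small helper. The sketch below uses a hypothetical function name, split_train_test, and toy arrays; it assumes the data has already been shuffled, since it simply takes the leading rows for training and the trailing rows for testing.

```python
import numpy as np

def split_train_test(X, y, train_fraction=0.9):
    """Split already-shuffled arrays into leading train and trailing test parts."""
    num_training = int(train_fraction * len(X))  # int() truncates toward zero
    return (X[:num_training], y[:num_training],
            X[num_training:], y[num_training:])

# Toy data: 20 samples with 2 features each.
X = np.arange(40, dtype=np.float32).reshape(20, 2)
y = np.arange(20, dtype=np.float32)

X_train, y_train, X_test, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 18 2
```

Because int() truncates, any leftover sample from a non-integer split goes to the test set.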

Let's go ahead and train the regressor:

from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=1000, max_depth=10, min_samples_split=2)
rf_regressor.fit(X_train, y_train)

The RandomForestRegressor() constructor builds a random forest regressor. Here, n_estimators refers to the number of estimators, which is the number of decision trees that we want to use in our random forest. The max_depth parameter refers to the maximum depth of each tree, and the min_samples_split parameter refers to the minimum number of data samples that a node must contain before it can be split.
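
As a rough check of what these parameters control, the following sketch fits a much smaller forest on synthetic data and inspects the fitted trees. The data and the parameter values are assumptions chosen so that the example runs quickly; they are not the values used in the recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used here only for illustration.
rng = np.random.RandomState(7)
X = rng.rand(200, 3)
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + 0.1 * rng.randn(200)

# Smaller values than in the recipe, so the sketch runs quickly.
forest = RandomForestRegressor(n_estimators=20, max_depth=4,
                               min_samples_split=2, random_state=7)
forest.fit(X, y)

# One fitted decision tree per estimator.
print(len(forest.estimators_))  # 20

# No tree grows deeper than max_depth.
depths = [tree.get_depth() for tree in forest.estimators_]
print(max(depths) <= 4)  # True
```

Raising n_estimators generally makes the averaged prediction more stable at the cost of training time, while max_depth and min_samples_split limit how closely each tree can fit the training data.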

Let's evaluate the performance of the random forest regressor:

from sklearn.metrics import mean_squared_error, explained_variance_score
y_pred = rf_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print("#### Random Forest regressor performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

The following results are returned:

#### Random Forest regressor performance ####
Mean squared error = 357864.36
Explained variance score = 0.89
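
To see what these two numbers measure, here is a minimal sketch that computes both metrics by hand with NumPy on four made-up values. The helper names are hypothetical; the formulas are the standard definitions (the mean of the squared residuals, and one minus the ratio of the residual variance to the variance of the true values).

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def evs_by_hand(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); 1.0 is the best possible score."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

# Made-up values, chosen so that the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

print("Mean squared error =", round(mse_by_hand(y_true, y_pred), 2))        # 0.5
print("Explained variance score =", round(evs_by_hand(y_true, y_pred), 2))  # 0.9
```

Because the mean squared error is expressed in squared units of the target, its square root (about 598 for the value reported above) is easier to compare with the rental counts themselves.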

Let's extract the relative importance of the features:

RFFImp = rf_regressor.feature_importances_
RFFImp = 100.0 * (RFFImp / max(RFFImp))
index_sorted = np.flipud(np.argsort(RFFImp))
pos = np.arange(index_sorted.shape[0]) + 0.5
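
These four lines are easier to follow on a small example. The sketch below uses four made-up importances and placeholder feature names, which are assumptions for illustration only, and applies the same scaling and sorting.

```python
import numpy as np

# Hypothetical importances for four made-up features (they sum to 1).
names = np.array(["a", "b", "c", "d"])
importances = np.array([0.1, 0.5, 0.15, 0.25])

# Scale so that the most important feature scores 100.
relative = 100.0 * (importances / importances.max())

# argsort is ascending; flipping it orders indices from most to least important.
index_sorted = np.flipud(np.argsort(relative))

# Bar centers at 0.5, 1.5, 2.5, ...
pos = np.arange(index_sorted.shape[0]) + 0.5

print(names[index_sorted])  # ['b' 'd' 'c' 'a']
```

Indexing both the scores and the names with index_sorted keeps each bar aligned with its label.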

To visualize the results, we will plot a bar graph:

import matplotlib.pyplot as plt
plt.figure()
plt.bar(pos, RFFImp[index_sorted], align='center')
plt.xticks(pos, feature_names[index_sorted])
plt.ylabel('Relative Importance')
plt.title("Random Forest regressor")
plt.show()

The following output is plotted:

It looks like temperature is the most important factor controlling bicycle rentals.