Python Machine Learning Cookbook (Second Edition)

How to do it…

Let's see how to estimate bicycle demand distribution:

  1. We first need to import a couple of new packages, as follows:
import csv
import numpy as np
  1. We are processing a CSV file, so the csv package is useful for handling it. Let's import the data into the Python environment:
filename = "bike_day.csv"
file_reader = csv.reader(open(filename, 'r'), delimiter=',')
X, y = [], []
for row in file_reader:
    X.append(row[2:13])
    y.append(row[-1])

This piece of code reads all the data from the CSV file. The csv.reader() function returns a reader object that iterates over the lines of the given CSV file and returns each row as a list of strings. We collect columns 2 to 12 of each row in X and the last column in y, so the input data is kept separate from the output values. At this point, the first row of each list still holds the column headers. Now we will extract the feature names:

feature_names = np.array(X[0])

The feature names are useful later, when we label the bars of a graph. Because the first row of X and y holds the column headers rather than data, we have to remove it before converting the values to numbers:

X = np.array(X[1:]).astype(np.float32)
y = np.array(y[1:]).astype(np.float32)

We have also converted the two lists into two arrays.
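The loading steps above can be tried without the real file. The following is a minimal sketch that assumes a tiny in-memory CSV; the column names, the values, and the slice row[1:3] are made up for illustration and are not taken from bike_day.csv:

```python
import csv
import io

import numpy as np

# Hypothetical miniature stand-in for a CSV file: one header row, two data rows.
sample = io.StringIO(
    "id,feat_a,feat_b,target\n"
    "1,0.25,0.50,100\n"
    "2,0.75,0.10,200\n"
)

X, y = [], []
for row in csv.reader(sample, delimiter=","):
    X.append(row[1:3])   # feature columns (feat_a, feat_b)
    y.append(row[-1])    # last column is the target

# csv.reader yields every field as a string, numbers included.
first_data_row = X[1]

feature_names = np.array(X[0])          # header row holds the column names
X = np.array(X[1:]).astype(np.float32)  # remaining rows become numeric features
y = np.array(y[1:]).astype(np.float32)  # remaining rows become numeric targets
```

The header row has to be removed before the astype() call, because a string such as "feat_a" cannot be converted to a float.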

  1. Let's shuffle these two arrays to make them independent of the order in which the data is arranged in the file:
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=7)
  1. As we did earlier, we need to separate the data into training and testing data. This time, let's use 90% of the data for training and the remaining 10% for testing:
num_training = int(0.9 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]
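To see why X and y must be shuffled together, the sketch below applies one shared permutation to both arrays and then performs the same 90/10 split. It uses a plain NumPy permutation rather than sklearn.utils.shuffle, purely to make the mechanism visible; the toy arrays are assumptions for illustration:

```python
import numpy as np

# Toy data: 10 samples with 2 features; each target is derived from its own row,
# so any misalignment between X and y after shuffling would be easy to detect.
X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = X[:, 0] * 10

# One permutation of the row indices, applied to both arrays.
rng = np.random.default_rng(7)
perm = rng.permutation(len(X))
X_shuf, y_shuf = X[perm], y[perm]

# Same 90/10 split as in the recipe.
num_training = int(0.9 * len(X))
X_train, y_train = X_shuf[:num_training], y_shuf[:num_training]
X_test, y_test = X_shuf[num_training:], y_shuf[num_training:]

# Each target still belongs to its original feature row.
still_aligned = np.array_equal(y_shuf, X_shuf[:, 0] * 10)
```

Shuffling the two arrays independently would break the pairing between each sample and its output value, which is why both are passed to a single shuffle() call in the recipe.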
  1. Let's go ahead and train the regressor:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=1000, max_depth=10, min_samples_split=2)
rf_regressor.fit(X_train, y_train)

The RandomForestRegressor() constructor builds a random forest regressor. Here, n_estimators is the number of estimators, that is, the number of decision trees that we want to use in our random forest. The max_depth parameter is the maximum depth of each tree, and the min_samples_split parameter is the minimum number of data samples that a node must contain before it can be split.
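The effect of these parameters can be checked on a fitted model. The following sketch assumes synthetic data and much smaller settings than the recipe (20 trees, depth 3) so that it runs quickly; the variable names are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: the target depends mainly on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=20,      # number of decision trees in the forest
    max_depth=3,          # no tree may grow deeper than this
    min_samples_split=2,  # a node needs at least this many samples to be split
    random_state=0,
)
forest.fit(X_demo, y_demo)

# The fitted trees are exposed through the estimators_ attribute.
n_trees = len(forest.estimators_)
depths = [tree.tree_.max_depth for tree in forest.estimators_]
```

After fitting, n_trees matches n_estimators and no entry in depths exceeds max_depth. Larger values of min_samples_split would stop splitting earlier and yield smaller trees.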

  1. Let's evaluate the performance of the random forest regressor:
y_pred = rf_regressor.predict(X_test)
from sklearn.metrics import mean_squared_error, explained_variance_score
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print("#### Random Forest regressor performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

The following results are returned:

#### Random Forest regressor performance ####
Mean squared error = 357864.36
Explained variance score = 0.89
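Both metrics are simple to compute by hand, which helps when interpreting them. The mean squared error is expressed in squared units of the target, so its square root is easier to read: for the value reported above, it is roughly 598 rentals. The sketch below reproduces the two formulas with NumPy on a small made-up example; the arrays are assumptions, not values from the bike dataset:

```python
import numpy as np

# Made-up true and predicted values, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_hat

# Mean squared error: average of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# Root mean squared error: same units as the target.
rmse_manual = np.sqrt(mse_manual)

# Explained variance score, as documented by scikit-learn:
# 1 - Var(y_true - y_hat) / Var(y_true)
evs_manual = 1.0 - np.var(residuals) / np.var(y_true)
```

An explained variance score close to 1 means that the residuals vary little compared with the target itself. Unlike the mean squared error, it does not penalize a constant offset in the predictions.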
  1. Let's extract the relative importance of the features:
RFFImp = rf_regressor.feature_importances_
RFFImp = 100.0 * (RFFImp / max(RFFImp))
index_sorted = np.flipud(np.argsort(RFFImp))
pos = np.arange(index_sorted.shape[0]) + 0.5

To visualize the results, we will plot a bar graph:

import matplotlib.pyplot as plt
plt.figure()
plt.bar(pos, RFFImp[index_sorted], align='center')
plt.xticks(pos, feature_names[index_sorted])
plt.ylabel('Relative Importance')
plt.title("Random Forest regressor")
plt.show()

The resulting bar graph shows the features sorted by decreasing relative importance.

It looks like temperature is the most important factor influencing bicycle rentals.
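The scaling and sorting applied to the feature importances can be followed on a small example. This is a minimal sketch with made-up importance values and hypothetical feature names, not the values produced by the model above:

```python
import numpy as np

# Hypothetical importances (they sum to 1, as feature_importances_ does).
names = np.array(["feat_a", "feat_b", "feat_c", "feat_d"])
importances = np.array([0.10, 0.50, 0.25, 0.15])

# Scale so that the most important feature gets a score of 100.
relative = 100.0 * (importances / importances.max())

# argsort returns ascending order; flipping it gives descending order.
index_sorted = np.flipud(np.argsort(relative))

sorted_names = names[index_sorted]
sorted_scores = relative[index_sorted]
```

Here sorted_names starts with the feature that has the largest importance, which is the order used for the bars and tick labels in the plot.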