Mastering Machine Learning with scikit-learn (Second Edition)

Evaluating the model

We have used a learning algorithm to estimate a model's parameters from training data. How can we assess whether our model is a good representation of the real relationship? Let's assume that you have found another page in your pizza journal. We will use this page's entries as a test set to measure the performance of our model. We have added a fourth column; it contains the prices predicted by our model.
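As a minimal sketch of how such a predicted-price column can be produced, the snippet below fits the model and predicts a price for each test instance. It assumes the training and test values used in the scikit-learn example at the end of this section; the names `observed` and `predicted` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data (assumed to match the example at the end of this section):
# one explanatory variable per instance, and the observed prices.
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = np.array([7, 9, 13, 17.5, 18])

# Test data: explanatory variable and observed prices.
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
observed = np.array([11, 8.5, 15, 18, 11])

model = LinearRegression()
model.fit(X_train, y_train)

# The predicted prices play the role of the fourth column.
predicted = model.predict(X_test)

for x, y_obs, y_hat in zip(X_test.ravel(), observed, predicted):
    print(f"x={x:>2}  observed={y_obs:>5.2f}  predicted={y_hat:>6.3f}")
```

Comparing the observed and predicted values side by side gives a first impression of the fit before any summary metric is computed.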

Several measures can be used to assess our model's predictive capability. We will evaluate our pizza price predictor using a measure called R-squared. Also known as the coefficient of determination, R-squared measures how close the data are to a regression line. There are several methods for calculating R-squared. In the case of simple linear regression, R-squared is equal to the square of the Pearson product-moment correlation coefficient (PPMCC), or Pearson's r. Using this method, R-squared must be a number between zero and one, inclusive. This method is intuitive; if R-squared describes the proportion of variance in the response variable that is explained by the model, it cannot be greater than one or less than zero. Other methods, including the method used by scikit-learn, do not calculate R-squared as the square of Pearson's r. Using these methods, R-squared can be negative if the model performs worse than simply predicting the mean of the response variable. It is important to note the limitations of performance metrics. R-squared in particular is sensitive to outliers, and can spuriously increase when features are added to the model.
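The difference between the two methods can be illustrated with a short sketch. It assumes the training values from the example at the end of this section, and the three-point arrays `y_true` and `y_bad` are made up purely for illustration. For an ordinary least squares line with an intercept, scored on the same data it was fitted to, scikit-learn's R-squared should agree with the square of Pearson's r. On deliberately poor predictions, `r2_score` drops below zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Training values assumed from the example at the end of this section.
x = np.array([6, 8, 10, 14, 18])
y = np.array([7, 9, 13, 17.5, 18])

# Method 1: square of Pearson's r between x and y.
pearson_r = np.corrcoef(x, y)[0, 1]
r_squared_from_r = pearson_r ** 2

# Method 2: scikit-learn's score, evaluated on the data used for fitting.
model = LinearRegression().fit(x.reshape(-1, 1), y)
r_squared_from_score = model.score(x.reshape(-1, 1), y)

print(r_squared_from_r, r_squared_from_score)

# Hypothetical predictions that are worse than predicting the mean:
# the ordering of the true values is reversed.
y_true = np.array([1.0, 2.0, 3.0])
y_bad = np.array([3.0, 2.0, 1.0])
negative_r_squared = r2_score(y_true, y_bad)
print(negative_r_squared)
```

The two methods are only guaranteed to coincide on the fitting data; on a separate test set they can differ, which is why a negative score is possible there.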

We will follow the method used by scikit-learn to calculate R-squared for our pizza price predictor. First, we must measure the total sum of squares, $SS_{tot}$. Here $y_i$ is the observed value of the response variable for the $i$th test instance, and $\bar{y}$ is the mean of the observed values of the response variable.

$$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

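Next, we must find the residual sum of squares (RSS), written $SS_{res}$, where $f(x_i)$ is the price the model predicts for the $i$th test instance. Recall that this is also our cost function.

$$SS_{res} = \sum_{i=1}^{n} (y_i - f(x_i))^2$$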

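Finally, we can find R-squared from the residual sum of squares, $SS_{res}$, and the total sum of squares, $SS_{tot}$, using the following:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$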
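The three steps above (total sum of squares, residual sum of squares, and their ratio) can be sketched with NumPy as follows. This is an illustrative calculation that assumes the training and test values from the scikit-learn example at the end of this section; the variable names `ss_tot`, `ss_res`, and `r_squared_manual` are chosen here and do not come from the library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data assumed to match the example at the end of this section.
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = np.array([7, 9, 13, 17.5, 18])
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
y_test = np.array([11, 8.5, 15, 18, 11])

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 1: total sum of squares, measured around the mean of the test prices.
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

# Step 2: residual sum of squares between observed and predicted prices.
ss_res = np.sum((y_test - y_pred) ** 2)

# Step 3: R-squared as one minus the ratio of the two sums.
r_squared_manual = 1 - ss_res / ss_tot

print(f"SS_tot    = {ss_tot:.4f}")
print(f"SS_res    = {ss_res:.4f}")
print(f"R-squared = {r_squared_manual:.4f}")
```

Note that the mean in step 1 is the mean of the test set's observed prices, not of the training prices.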

The R-squared score of 0.662 indicates that roughly two-thirds of the variance in the test instances' prices is explained by the model. Now let's confirm our calculation using scikit-learn. The score method of LinearRegression returns the model's R-squared value, as seen in the following example:

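# In[1]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: one explanatory variable per instance, and the observed prices
X_train = np.array([6, 8, 10, 14, 18]).reshape(-1, 1)
y_train = [7, 9, 13, 17.5, 18]

# Test data: held out from training and used only for evaluation
X_test = np.array([8, 9, 11, 16, 12]).reshape(-1, 1)
y_test = [11, 8.5, 15, 18, 11]

model = LinearRegression()
model.fit(X_train, y_train)
# score() returns R-squared computed on the data passed to it
r_squared = model.score(X_test, y_test)
# Round to four decimal places for display
print(f'{r_squared:.4f}')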

# Out[1]:
0.6620