Linear regression with scikit-learn and higher dimensionality
The scikit-learn library offers the LinearRegression class, which works with n-dimensional spaces. For this purpose, we're going to use the Boston dataset:
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
(506, 13)
print(boston.target.shape)
(506,)
It has 506 samples with 13 input features and one output. In the following graph, there's a collection of the plots of the first 12 features:
The features have different scales and there are outliers (which can be removed using the methods studied in the previous chapters), so it's better to ask the model to normalize the data before processing it (by setting the normalize=True parameter). Moreover, for testing purposes, we split the original dataset into a training set (90%) and a test set (10%):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.1)
lr = LinearRegression(normalize=True)
lr.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
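In recent scikit-learn releases, the normalize parameter has been removed from LinearRegression; a Pipeline with StandardScaler is a close (though not identical) substitute. The following sketch uses a synthetic dataset generated with make_regression as a stand-in for the Boston data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston dataset (an assumption for this sketch)
X, y = make_regression(n_samples=120, n_features=13, noise=20.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Standardize the features, then fit the linear model in one estimator
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, Y_train)
print(model.score(X_test, Y_test))
```

A pipeline guarantees that the scaler is fitted only on the training folds, which avoids leaking test-set statistics into the preprocessing step.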
When the original dataset isn't large enough, splitting it into training and test sets reduces the number of samples available for fitting the model. As we assume that the dataset represents an underlying data-generating process, it's absolutely necessary that both the training and test sets reflect it. A small dataset may contain only a few samples representing a specific region of the data-generating process, and it's important to include them in the training set to avoid a loss of generalization ability.
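The sensitivity to the split can be seen directly by repeating train_test_split with different seeds and comparing the resulting scores. This sketch uses a small synthetic dataset (an assumption, not the Boston data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=13, noise=25.0, random_state=0)

scores = []
for seed in range(10):
    # A different random split on each iteration
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=seed)
    lr = LinearRegression()
    lr.fit(X_tr, y_tr)
    scores.append(lr.score(X_te, y_te))

# The spread between the best and worst split shows how much a single
# random split can bias the evaluation of a small dataset
print(min(scores), max(scores))
```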
K-fold cross-validation can help solve this problem with a different strategy. The whole dataset is split into k folds; at each iteration, k-1 folds are used for training and the remaining one to validate the model. K iterations are performed, each time with a different validation fold. In the following diagram, there's an example with three folds/iterations:
In this way, the final score can be determined as the average of all values, and every sample is used for training k-1 times. Whenever the training time is not prohibitively long, this approach is the best way to assess the performance of a model, in particular its generalization ability, which can be compromised when some samples are absent from the training set.
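The fold structure described above can be inspected directly with scikit-learn's KFold iterator; on a toy set of nine samples split into three folds, each sample appears in exactly one validation fold and in the training set of the other two iterations:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(9, 1)  # nine toy samples

kf = KFold(n_splits=3)
folds = list(kf.split(X))
for train_idx, val_idx in folds:
    # Each iteration: k-1 folds for training, one fold for validation
    print("train:", train_idx, "validate:", val_idx)
```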
To check the accuracy of a regression, scikit-learn provides the score(X, y) method, which evaluates the model on test data using the R2 score (see the next section):
print(lr.score(X_test, Y_test))
0.77371996006718879
So, the overall accuracy is about 77%, which is an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the subdivision made by train_test_split (as in our case). In fact, the test set may contain points that are easy to predict even when the overall accuracy is unacceptable. For this reason, it's preferable not to trust this measure immediately and to fall back on a cross-validation (CV) evaluation.
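The R2 score returned by score() can be reproduced by hand from its definition, R2 = 1 - SS_res / SS_tot. This sketch uses a synthetic dataset (an assumption) rather than the Boston data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic dataset (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=1)

lr = LinearRegression().fit(X_tr, y_tr)

# Manual R^2: 1 - (residual sum of squares / total sum of squares)
y_pred = lr.predict(X_te)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(np.isclose(r2, lr.score(X_te, y_te)))  # → True
```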
To perform a k-fold cross-validation, we can use the cross_val_score() function, which works with all estimators. The scoring parameter is very important because it determines which metric is adopted for the tests. As LinearRegression works with ordinary least squares, we prefer the negative mean squared error, which is a cumulative measure that must be evaluated against the actual values (it's not relative):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, boston.data, boston.target, cv=7, scoring='neg_mean_squared_error')
print(scores)
[ -11.32601065  -10.96365388  -32.12770594  -33.62294354  -10.55957139
 -146.42926647  -12.98538412]
print(scores.mean())
-36.859219426420601
print(scores.std())
45.704973900600457
The high standard deviation confirms that this dataset is very sensitive to the split strategy. In some cases, the probability distributions of the training and test sets are rather similar, but in other situations (at least three out of the seven folds), they differ; hence, the algorithm cannot learn to predict correctly.
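Since neg_mean_squared_error is a sign-flipped MSE, it can be converted back into a root mean squared error, which is expressed in the same units as the target and is often easier to interpret. A sketch on a synthetic dataset (an assumption):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset (assumption for illustration)
X, y = make_regression(n_samples=150, n_features=10, noise=15.0, random_state=0)

lr = LinearRegression()
neg_mse = cross_val_score(lr, X, y, cv=7, scoring='neg_mean_squared_error')

# Flip the sign back, then take the square root to obtain per-fold RMSE
rmse = np.sqrt(-neg_mse)
print(rmse.mean(), rmse.std())
```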