Setting hyperparameters in a principled way
In the preceding example, we set the penalty parameter to 0.1. We could just as well have set it to 0.7 or 23.9. Naturally, the results will vary each time. If we pick an overly large value, we get underfitting. In an extreme case, the learning system will just return every coefficient equal to zero. If we pick a value that is too small, we are very close to OLS, which overfits and generalizes poorly (as we saw earlier).
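To make this concrete, the following sketch (the alpha values are only illustrative, and data and target are the arrays from the preceding example) fits a Lasso model at several penalty values and counts how many coefficients survive; with a very large penalty, every coefficient is driven to zero:
import numpy as np
from sklearn.linear_model import Lasso

for alpha in [0.01, 0.1, 1.0, 10.0, 1000.0]:
    lasso = Lasso(alpha=alpha).fit(data, target)
    # Count how many coefficients the penalty has not driven to zero
    print("alpha={:8.2f}  nonzero coefficients: {}".format(
        alpha, np.sum(lasso.coef_ != 0)))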
How do we choose a good value? This is a general problem in machine learning: setting parameters for our learning methods. A generic solution is to use cross-validation. We pick a set of possible values, and then use cross-validation to choose which one is best. This performs more computation (five times more if we use five folds), but is always applicable and unbiased.
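As a minimal sketch of this idea (the candidate values below are illustrative, and data and target are the arrays from the preceding example), we can score each candidate penalty with cross_val_score and keep the one with the best mean score:
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

candidate_alphas = [0.01, 0.1, 1.0, 10.0]
# Mean five-fold R2 score for each candidate penalty
mean_scores = {a: cross_val_score(Lasso(alpha=a), data, target, cv=5).mean()
               for a in candidate_alphas}
best_alpha = max(mean_scores, key=mean_scores.get)
print("Best alpha by cross-validation:", best_alpha)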
We must be careful, though. In order to obtain an unbiased estimate of generalization, we must use two levels of cross-validation: the top level estimates the generalization ability of the system, while the second level finds good parameters. That is, we split the data into, for example, five folds. We start by holding out the first fold and learning on the other four. We then split these four again into five folds in order to choose the parameters. Once we have set our parameters, we test on the first fold. We then repeat this process four more times:
The preceding figure shows how you break up a single training fold into subfolds. We would need to repeat it for all the other folds. In this case, we are looking at five outer folds and five inner folds, but there is no reason to use the same number of outer and inner folds; you can use any number you want as long as you keep the folds separate.
This leads to a lot of computation, but it is necessary in order to do things correctly. The problem is that if you use any datapoint to make any decision about your model (including which parameters to set), you can no longer use that same datapoint to test your model's generalization ability. This is a subtle point and it may not be immediately obvious. In fact, many users of machine learning still get this wrong and overestimate how well their systems are doing, because they do not perform cross-validation correctly!
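One way to implement this two-level scheme is to nest a parameter search inside an outer cross-validation loop. The following is a minimal sketch using GridSearchCV for the inner loop (the candidate alphas are illustrative, and data and target are the arrays from the preceding example):
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Inner loop: choose alpha by five-fold cross-validation
inner = GridSearchCV(Lasso(), {'alpha': [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=5))
# Outer loop: estimate how well the whole procedure generalizes
outer_scores = cross_val_score(inner, data, target, cv=KFold(n_splits=5))
print("Nested cross-validation R2: {:.2f}".format(outer_scores.mean()))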
Fortunately, scikit-learn makes it very easy to do the right thing; it provides classes named LassoCV, RidgeCV, and ElasticNetCV, all of which encapsulate an inner cross-validation loop to optimize the necessary parameter (hence the letters CV at the end of the class name). The code is almost exactly like the previous one, except that we do not need to specify any value for alpha:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import r2_score

# ElasticNetCV runs its own inner cross-validation loop to pick alpha
met = ElasticNetCV()
kf = KFold(n_splits=5)
p = cross_val_predict(met, data, target, cv=kf)
r2_cv = r2_score(target, p)
print("R2 ElasticNetCV: {:.2}".format(r2_cv))
R2 ElasticNetCV: 0.65
This results in a lot of computation, so, depending on how fast your computer is, you may want to brew some coffee or tea while you are waiting. You can get better performance by taking advantage of multiple processors. This is a built-in feature of scikit-learn, which can be accessed quite simply through the n_jobs parameter of the ElasticNetCV constructor. To use four CPUs, use the following code:
met = ElasticNetCV(n_jobs=4)
Set the n_jobs parameter to -1 to use all the available CPUs:
met = ElasticNetCV(n_jobs=-1)
You may have wondered why, if an ElasticNet has two penalties (the L1 and the L2 penalty), we only need to set a single value for alpha. In fact, the two values are controlled by specifying alpha and the l1_ratio variable separately. Then, α1 and α2 are set as follows (where ρ stands for l1_ratio):
α1 = ρα
α2 = (1 − ρ)α
In an intuitive sense, alpha sets the overall amount of regularization while l1_ratio sets the trade-off between the different types of regularization, L1 and L2.
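As a quick sanity check of this parametrization (the alpha value here is only illustrative, and data and target are the arrays from the preceding example), setting l1_ratio to 1 makes the L2 term vanish, so ElasticNet should reduce to a plain Lasso with the same alpha:
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# With l1_ratio=1.0 only the L1 penalty remains
enet = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(data, target)
lasso = Lasso(alpha=0.5).fit(data, target)
print("Largest coefficient difference:",
      np.max(np.abs(enet.coef_ - lasso.coef_)))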
We can request that the ElasticNetCV object test different values of l1_ratio, as shown in the following code:
l1_ratio = [.01, .05, .25, .5, .75, .95, .99]
met = ElasticNetCV(l1_ratio=l1_ratio, n_jobs=-1)
This set of l1_ratio values is recommended in the documentation. It will test models that are almost like Ridge (when l1_ratio is 0.01 or 0.05) as well as models that are almost like Lasso (when l1_ratio is 0.95 or 0.99). Thus, we explore a full range of different options.
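After fitting, the ElasticNetCV object records the values it selected in its alpha_ and l1_ratio_ attributes, so we can check which combination the inner cross-validation loop picked (data and target are again the arrays from the preceding example):
met.fit(data, target)
print("Chosen alpha: {:.4f}".format(met.alpha_))
print("Chosen l1_ratio: {:.2f}".format(met.l1_ratio_))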
Putting all this together, we can now visualize the prediction versus real fit on this large dataset:
import matplotlib.pyplot as plt

l1_ratio = [.01, .05, .25, .5, .75, .95, .99]
met = ElasticNetCV(l1_ratio=l1_ratio, n_jobs=-1)
pred = cross_val_predict(met, data, target, cv=kf)
fig, ax = plt.subplots()
# Scatter the cross-validated predictions against the true target values
ax.scatter(pred, target)
# A perfect prediction would fall exactly on this diagonal line
ax.plot([pred.min(), pred.max()], [pred.min(), pred.max()])
This results in the following plot:
We can see that the predictions do not match very well at the bottom end of the value range. This is perhaps because there are far fewer datapoints at this end of the target range, so only a small minority of datapoints is affected.