Building Machine Learning Systems with Python

Evaluation – holding out data and cross-validation

The model discussed in the previous section is a simple model; it achieves 96.0 percent accuracy on the whole dataset. However, this evaluation is almost certainly overly optimistic. We used the data to define what the tree would look like, and then we used the same data to evaluate the model. Of course, the model will perform well on this dataset: it was optimized to perform well on it. The reasoning is circular.

What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance on instances that the algorithm did not see during training. Therefore, we are going to do a more rigorous evaluation and use held-out data. To do this, we are going to break the data up into two groups: on one group, we'll train the model, and on the other, the held-out group, we'll test it. The full code, which is an adaptation of the code presented earlier, is available in the online support repository. Its output is as follows:

Training accuracy was 96.0%.
Testing accuracy was 94.7%.  
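
The full repository code is not reproduced here, but a minimal sketch of such a held-out evaluation, assuming the features and labels arrays and the decision tree classifier tr from the earlier sections, and using scikit-learn's train_test_split to hold out half of the data, might look like this:

import numpy as np
from sklearn.model_selection import train_test_split

# Hold out half of the data for testing (the split used in this section)
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.5, random_state=0)

tr.fit(train_features, train_labels)
train_acc = np.mean(tr.predict(train_features) == train_labels)
test_acc = np.mean(tr.predict(test_features) == test_labels)
print('Training accuracy was {:.1%}.'.format(train_acc))
print('Testing accuracy was {:.1%}.'.format(test_acc))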

The result on the training data (which is a subset of the whole data) is the same as before. However, the important point is that the result on the testing data is lower than the result on the training data. In this case, the difference is very small, but it can be much larger. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing! While this may surprise a newcomer to machine learning, it is expected that testing accuracy will be lower than training accuracy.

To understand why, think about how the decision tree works: it defines a series of thresholds on different features. Sometimes it may be very clear where the threshold should be, but there are areas where even a single datapoint can change the threshold and move it up or down.
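
As a quick illustration (a sketch only, reusing the tr, features, and labels from earlier and scikit-learn's tree_ attribute), you can refit the tree with a single example removed and compare the learned thresholds; depending on the dataset, some of them will shift:

import numpy as np

# Fit on all of the data, then on the data minus one example, and compare thresholds
tr.fit(features, labels)
thresholds_all = tr.tree_.threshold.copy()
tr.fit(np.delete(features, 0, axis=0), np.delete(labels, 0, axis=0))
print(thresholds_all)          # leaf nodes appear as placeholder values
print(tr.tree_.threshold)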

The accuracy on the training data, the training accuracy, is almost always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing accuracy, which is the accuracy on a collection of examples that were not used for training.

One possible problem with what we just did is that we only used half the data for training. Perhaps it would have been better to use more training data. On the other hand, if we then leave too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well, which is impossible.

We can achieve a good approximation of this impossible ideal via a method called cross-validation. One simple form of cross-validation is leave-one-out cross-validation. We will take an example out of the training data, learn a model without this example, and then test whether the model classifies this example correctly.

This process is then repeated for all the elements in the dataset:

# tr, features, and labels are the decision tree classifier and data from the earlier sections
import numpy as np

predictions = []
for i in range(len(features)):
    # Remove example i, train on the remaining data, and predict the held-out example
    train_features = np.delete(features, i, axis=0)
    train_labels = np.delete(labels, i, axis=0)
    tr.fit(train_features, train_labels)
    predictions.append(tr.predict(features[i:i+1])[0])
predictions = np.array(predictions)

At the end of this loop, we will have tested a series of models on all the examples and will have obtained a final average result. When using cross-validation, there is no circularity problem because each example was tested on a model that was built without taking that datapoint into account. Therefore, the cross-validated estimate is a reliable estimate of how well the models will generalize to new data.
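
With the predictions collected in the loop above, that final average is simply the fraction of held-out examples that were classified correctly:

loo_accuracy = np.mean(predictions == labels)
print('Leave-one-out accuracy: {:.1%}'.format(loo_accuracy))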

The major problem with leave-one-out cross-validation is that it forces us to do much more work: we must learn a whole new model for each and every example, and this cost increases as the dataset grows.

We can get most of the benefits of leave-one-out at a fraction of the cost by using k-fold cross-validation, where k stands for a small number. For example, to perform five-fold cross-validation, we break the data up into five groups, the so-called folds.

Then we learn five models, each time leaving one fold out of the training data. The resulting code is similar to the code given earlier in this section, except that we leave out 20 percent of the data instead of a single element. We test each of these models on the left-out fold and average the results.
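
A minimal hand-rolled sketch of that five-fold loop might look as follows, again assuming the tr, features, and labels from earlier (scikit-learn offers a more convenient route, shown at the end of this section):

import numpy as np

# Shuffle the indices and split them into five roughly equal folds
# (note: this simple split does not keep the classes balanced across folds)
indices = np.arange(len(features))
np.random.RandomState(42).shuffle(indices)
folds = np.array_split(indices, 5)

accuracies = []
for test_idx in folds:
    # Train on everything outside the current fold, then test on the fold itself
    train_idx = np.setdiff1d(indices, test_idx)
    tr.fit(features[train_idx], labels[train_idx])
    fold_predictions = tr.predict(features[test_idx])
    accuracies.append(np.mean(fold_predictions == labels[test_idx]))
print('Mean five-fold accuracy: {:.1%}'.format(np.mean(accuracies)))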

To spell the process out for five folds: the dataset is split into five pieces; for each fold, you hold out one of the pieces for testing and train on the other four. You can use any number of folds you wish. There is a trade-off between computational efficiency (the more folds, the more computation is necessary) and accuracy of the estimate (the more folds, the closer you are to using the whole of the data for training). Five folds is often a good compromise: it corresponds to training with 80 percent of your data, which should already be close to what you would get from using all of it. If you have little data, you can even consider using 10 or 20 folds. In the extreme case, where you have as many folds as datapoints, you are simply performing leave-one-out cross-validation. On the other hand, if computation time is an issue and you have more data, two or three folds may be the more appropriate choice.

When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, then the results will not be representative. We will not go into the details of how to do this, because the scikit-learn machine learning package will handle them for you. Here is how to perform five-fold cross-validation with scikit-learn:

from sklearn import model_selection

# cv=5 performs five-fold cross-validation; for classifiers, scikit-learn
# automatically keeps the folds balanced (stratified) across classes.
# Passing cv=model_selection.LeaveOneOut() instead would reproduce the
# leave-one-out loop from earlier in this section.
predictions = model_selection.cross_val_predict(
    tr,
    features,
    labels,
    cv=5)
print(np.mean(predictions == labels))

We have now generated several models instead of just one. So, what final model do we return and use for new data? The simplest solution is to now train a single overall model on all your training data. The cross-validation loop gives you an estimate of how well this model should generalize.
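
In code, that final step is just a single fit on the full dataset (using the same tr, features, and labels as before):

# The cross-validated accuracy estimates how well this final model should generalize
tr.fit(features, labels)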

A cross-validation scheme allows you to use all your data to estimate whether your methods are doing well. At the end of the cross-validation loop, you can then use all of your data to train a final model.

Although it was not properly recognized when machine learning was starting out as a field, nowadays, it is seen as a very bad sign to even discuss the training accuracy of a classification system. This is because the results can be very misleading and even just presenting them marks you as a newbie in machine learning. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation scheme.