Overfitting
If underfitting is the consequence of low capacity and high bias, overfitting is the phenomenon associated with high variance. In general, we observe a very high training accuracy (even close to the Bayes level) together with a much poorer validation accuracy. This means that the capacity of the model is high enough, or even excessive, for the task (the higher the capacity, the higher the probability of large variances), and that the training set isn't a good representation of pdata. To understand the problem, consider the following classification scenarios:
Acceptable fitting (left), overfitted classifier (right)
The left plot has been obtained using logistic regression, while the right one uses an SVM with a sixth-degree polynomial kernel. If we consider the second model, the decision boundaries seem much more precise, with some samples lying just beyond them. Considering the shapes of the two subsets, one might conclude that the non-linear SVM captures the dynamics better; however, if we sample another dataset from pdata and the diagonal tail becomes wider, logistic regression continues to classify the points correctly, while the accuracy of the SVM decreases dramatically. The second model is very likely overfitted, and some corrections are necessary.

When the validation accuracy is much lower than the training accuracy, a good strategy is to increase the number of training samples so that the training set better represents the real pdata. In fact, it can happen that a training set is built starting from a hypothetical distribution that doesn't reflect the real one, or that the number of samples reserved for validation is too high, reducing the amount of information carried by the remaining training samples. Cross-validation is a good way to assess the quality of datasets, but it can always happen that completely new subsets (for example, generated when the application is deployed in a production environment) are misclassified, even though they were supposed to belong to pdata. If it's not possible to enlarge the training set, data augmentation could be a valid solution, because it allows creating artificial samples (for images, it's possible to mirror, rotate, or blur them) starting from the information stored in the known ones.

Other strategies to prevent overfitting are based on a technique called regularization, which we're going to discuss in the last part of this chapter. For now, we can say that the effect of regularization is similar to a partial linearization, which implies a capacity reduction with a consequent variance decrease.
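The gap between training and validation accuracy can be observed directly. The following is a minimal sketch, not taken from the figure's original data, that compares logistic regression with a sixth-degree polynomial SVM on a small, noisy two-class dataset; the dataset generator (make_moons), its parameters, and the hyperparameters (noise, C) are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Small, noisy dataset sampled from an assumed data-generating process
X, Y = make_moons(n_samples=100, noise=0.25, random_state=1000)

# Low-capacity model: logistic regression (linear decision boundary)
lr = LogisticRegression(max_iter=1000)

# High-capacity model: SVM with a sixth-degree polynomial kernel
# (C=10.0 is an arbitrary illustrative choice)
svm = SVC(kernel='poly', degree=6, C=10.0)

for name, clf in [('Logistic regression', lr), ('Polynomial SVM', svm)]:
    # Training accuracy (fit and score on the same samples)
    train_acc = clf.fit(X, Y).score(X, Y)
    # Cross-validated accuracy as a proxy for validation accuracy
    cv_acc = cross_val_score(clf, X, Y, cv=10).mean()
    print('{}: training accuracy = {:.3f}, CV accuracy = {:.3f}'
          .format(name, train_acc, cv_acc))

If the polynomial SVM shows a near-perfect training score but a noticeably lower cross-validated score, while logistic regression keeps the two values close, the higher-capacity model is overfitting the training set.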
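As a complement, here is a minimal sketch of the kind of image augmentation mentioned above (mirroring, rotation, and blurring), based on NumPy and SciPy. The function name augment, the batch shape, and the angle/sigma values are hypothetical choices, not a prescribed recipe:

import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def augment(image_batch, angle=15.0, sigma=1.0):
    # image_batch is assumed to be an array of grayscale images
    # with shape (n_images, height, width)
    augmented = []
    for img in image_batch:
        # Horizontal mirroring
        augmented.append(np.fliplr(img))
        # Small rotation (reshape=False keeps the original size)
        augmented.append(rotate(img, angle=angle, reshape=False))
        # Gaussian blur
        augmented.append(gaussian_filter(img, sigma=sigma))
    return np.array(augmented)

# Example usage with random "images"
images = np.random.uniform(size=(10, 28, 28))
new_images = augment(images)
print(new_images.shape)  # (30, 28, 28)

Each artificial sample is derived only from information already stored in the known ones, so the augmented set remains consistent with the original pdata while exposing the model to more variability.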