Which classifier to use
So far, we have looked at two classical classifiers, namely the decision tree and the nearest neighbor classifier. Scikit-learn supports many more, but it does not support everything that has ever been proposed in academic literature. Thus, one may be left wondering: which one should I use? Is it even important to learn about all of them?
In many cases, knowledge of your dataset may help you decide which classifier has a structure that best matches your problem. However, there is a very good study by Manuel Fernández-Delgado and his colleagues titled Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? It is a very readable and practically-oriented study, in which the authors conclude that there is actually one classifier which is very likely to be the best (or close to the best) for the majority of problems, namely random forests.
What is a random forest? As the name suggests, a forest is a collection of trees; in this case, a collection of decision trees. How do we obtain many trees from a single dataset? If you call the methods we used before several times, you will get the exact same tree every time. The trick is to call the method several times on different random variations of the dataset: each tree is trained on a random sample of the examples and, at each split, only a random subset of the features is considered. Thus, each time, a different tree is built. At classification time, all the trees vote and a final decision is reached. There are many parameters that control the minor details, but only one really matters, namely the number of trees you use. In general, the more trees you build, the more memory will be required, but the classification accuracy will also increase (up to a plateau of optimal performance). The default in scikit-learn is 10 trees. Unless your dataset is very large, such that memory usage becomes problematic, increasing this value is often advantageous:
from sklearn import ensemble, model_selection
import numpy as np

# features and target are the same arrays we used with the earlier classifiers
rf = ensemble.RandomForestClassifier(n_estimators=100)
predict = model_selection.cross_val_predict(rf, features, target)
print("RF accuracy: {:.1%}".format(np.mean(predict == target)))
On this dataset, the result is about 86 percent (your result may be slightly different when you run it, since random forests are randomized).
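To make the voting idea described above more concrete, here is a minimal sketch of a forest built by hand: each tree is fit on a bootstrap sample of the data, each split only considers a random subset of the features, and the final prediction is a majority vote. This is only an illustration (it assumes the same features and target arrays as above); RandomForestClassifier does all of this internally and more efficiently:

import numpy as np
from collections import Counter
from sklearn import tree

# Illustration only: a hand-rolled "forest" of decision trees.
# features and target are assumed to be the numpy arrays used above.
n_trees = 100
rng = np.random.RandomState(42)
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many examples as we have, with replacement
    idx = rng.randint(0, len(features), len(features))
    # max_features='sqrt' means each split only considers a random subset of features
    t = tree.DecisionTreeClassifier(max_features='sqrt', random_state=rng)
    t.fit(features[idx], target[idx])
    trees.append(t)

# Every tree votes on every example; the majority class wins
votes = np.array([t.predict(features) for t in trees])
prediction = np.array([Counter(column).most_common(1)[0][0]
                       for column in votes.T])

Note that this sketch predicts on the training data purely to show the voting step; for an honest accuracy estimate you would still use cross-validation as we did above.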
Another big advantage of random forests is that, since they are built from decision trees, they ultimately only perform binary decisions based on feature thresholds. Thus, they are invariant to scaling the features up or down: rescaling a feature changes the threshold values inside the trees, but not the decisions the trees make.
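We can check this with a quick experiment: rescale one of the features by a large constant and verify that the cross-validated predictions do not change. This is a sketch that assumes features is a numpy array and fixes the forest's random_state so that both runs make the same random choices:

import numpy as np
from sklearn import ensemble, model_selection

rf = ensemble.RandomForestClassifier(n_estimators=100, random_state=12)

scaled = features.copy()
scaled[:, 0] *= 1000.0  # blow up the first feature by a factor of 1000

pred_plain = model_selection.cross_val_predict(rf, features, target)
pred_scaled = model_selection.cross_val_predict(rf, scaled, target)

# The trees pick the same splits (only the threshold values change),
# so the predictions should be identical
print("Identical predictions:", np.all(pred_plain == pred_scaled))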