
There's more...

Despite our hard work and the elegance of the scikit-learn syntax, the score on the test set at the very end remains suspicious. The reason is that the train-test split is not necessarily balanced: the train and test sets do not necessarily contain similar proportions of all the classes.
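To see the problem concretely, here is a minimal sketch (the imbalanced toy labels are hypothetical, not part of the recipe) of how an unstratified split can skew class proportions:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 90% class 0, 10% class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
print("Class-1 share in full set:", y_toy.mean())
print("Class-1 share in test set:", y_te.mean())

Depending on the random draw, the class-1 share of the test set can land noticeably away from the 10% present in the full data.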

This is easily remedied by using a stratified train-test split:

from sklearn.model_selection import train_test_split

# stratify=y preserves the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

By passing the target vector y as the stratify argument, the target classes are balanced across the train and test sets. This brings the SVC scores closer together.

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

svc_clf.fit(X_train, y_train)  # refit on the new stratified training set
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)
print("Average SVC scores:", svc_scores.mean())
print("Standard Deviation of SVC scores:", svc_scores.std())
print("Score on Final Test Set:", accuracy_score(y_test, svc_clf.predict(X_test)))

Average SVC scores: 0.831547619048
Standard Deviation of SVC scores: 0.0792488953372
Score on Final Test Set: 0.789473684211
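As a quick sanity check (a sketch, not part of the original recipe; it assumes y holds non-negative integer class labels), you can confirm that the stratified split preserved the class proportions:

import numpy as np

print(np.bincount(y) / len(y))              # class shares in the full target
print(np.bincount(y_train) / len(y_train))  # should be nearly identical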

Additionally, note that in the preceding example the cross-validation procedure already produces stratified folds by default, because svc_clf is a classifier:

from sklearn.model_selection import cross_val_score
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)

The preceding code is equivalent to:

from sklearn.model_selection import cross_val_score, StratifiedKFold
skf = StratifiedKFold(n_splits=4)
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=skf)
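If you want the samples shuffled before they are assigned to folds, StratifiedKFold also accepts shuffle and random_state arguments; the particular settings below are a sketch, not part of the recipe:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# shuffle=True randomizes the fold assignment; random_state makes it reproducible
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=7)
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=skf)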