Learning Data Mining with Python
上QQ阅读APP看书,第一时间看更新

scikit-learn estimators

Estimators are scikit-learn's abstraction, allowing for the standardized implementation of a large number of classification algorithms. Estimators are used for classification. Estimators have the following two main functions:

  • fit(): This performs the training of the algorithm and sets internal parameters. It takes two inputs, the training sample dataset and the corresponding classes for those samples.
  • predict(): This predicts the class of the testing samples that is given as input. This function returns an array with the predictions of each input testing sample.

Most scikit-learn estimators use the NumPy arrays or a related format for input and output.

There are a large number of estimators in scikit-learn. These include support vector machines (SVM), random forests, and neural networks. Many of these algorithms will be used in later chapters. In this chapter, we will use a different estimator from scikit-learn: nearest neighbor.

Note

For this chapter, you will need to install a new library called matplotlib. The easiest way to install it is to use pip3, as you did in Chapter 1, Getting Started with Data Mining, to install scikit-learn:

$pip3 install matplotlib

If you have any difficulty installing matplotlib, seek the official installation instructions at http://matplotlib.org/users/installing.html.

Nearest neighbors

Nearest neighbors is perhaps one of the most intuitive algorithms in the set of standard data mining algorithms. To predict the class of a new sample, we look through the training dataset for the samples that are most similar to our new sample. We take the most similar sample and predict the class that the majority of those samples have.

As an example, we wish to predict the class of the triangle, based on which class it is more similar to (represented here by having similar objects closer together). We seek the three nearest neighbors, which are two diamonds and one square. There are more diamonds than circles, and the predicted class for the triangle is, therefore, a diamond:

Nearest neighbors

Nearest neighbors can be used for nearly any dataset-however, it can be very computationally expensive to compute the distance between all pairs of samples. For example if there are 10 samples in the dataset, there are 45 unique distances to compute. However, if there are 1000 samples, there are nearly 500,000! Various methods exist for improving this speed dramatically; some of which are covered in the later chapters of this book.

It can also do poorly in categorical-based datasets, and another algorithm should be used for these instead.

Distance metrics

A key underlying concept in data mining is that of distance. If we have two samples, we need to know how close they are to each other. Further more, we need to answer questions such as are these two samples more similar than the other two? Answering questions like these is important to the outcome of the case.

The most common distance metric that the people are aware of is Euclidean distance, which is the real-world distance. If you were to plot the points on a graph and measure the distance with a straight ruler, the result would be the Euclidean distance. A little more formally, it is the square root of the sum of the squared distances for each feature.

Euclidean distance is intuitive, but provides poor accuracy if some features have larger values than others. It also gives poor results when lots of features have a value of 0, known as a sparse matrix. There are other distance metrics in use; two commonly employed ones are the Manhattan and Cosine distance.

The Manhattan distance is the sum of the absolute differences in each feature (with no use of square distances). Intuitively, it can be thought of as the number of moves a rook piece (or castle) in chess would take to move between the points, if it were limited to moving one square at a time. While the Manhattan distance does suffer if some features have larger values than others, the effect is not as dramatic as in the case of Euclidean.

The Cosine distance is better suited to cases where some features are larger than others and when there are lots of zeros in the dataset. Intuitively, we draw a line from the origin to each of the samples, and measure the angle between those lines. This can be seen in the following diagram:

Distance metrics

In this example, each of the grey circles are in the same distance from the white circle. In (a), the distances are Euclidean, and therefore, similar distances fit around a circle. This distance can be measured using a ruler. In (b), the distances are Manhattan, also called City Block. We compute the distance by moving across rows and columns, similar to how a Rook (Castle) in Chess moves. Finally, in (c), we have the Cosine distance that is measured by computing the angle between the lines drawn from the sample to the vector, and ignore the actual length of the line.

The distance metric chosen can have a large impact on the final performance. For example, if you have many features, the Euclidean distance between random samples approaches the same value. This makes it hard to compare samples as the distances are the same! Manhattan distance can be more stable in some circumstances, but if some features have very large values, this can overrule lots of similarities in other features. Finally, Cosine distance is a good metric for comparing items with a large number of features, but it discards some information about the length of the vector, which is useful in some circumstances.

For this chapter, we will stay with Euclidean distance, using other metrics in later chapters.

Loading the dataset

The dataset we are going to use is called Ionosphere, which is the recording of many high-frequency antennas. The aim of the antennas is to determine whether there is a structure in the ionosphere and a region in the upper atmosphere. Those that have a structure are deemed good, while those that do not are deemed bad. The aim of this application is to build a data mining classifier that can determine whether an image is good or bad.

Loading the dataset

(Image Credit: https://www.flickr.com/photos/geckzilla/16149273389/)

This can be downloaded from the UCL Machine Learning data repository, which contains a large number of datasets for different data mining applications. Go to http://archive.ics.uci.edu/ml/datasets/Ionosphere and click on Data Folder. Download the ionosphere.data and ionosphere.names files to a folder on your computer. For this example, I'll assume that you have put the dataset in a directory called Data in your home folder.

Note

The location of your home folder depends on your operating system. For Windows, it is usually at C:\Documents and Settings\username. For Mac or Linux machines, it is usually at /home/username. You can get your home folder by running this python code:

import os
print(os.path.expanduser("~"))

For each row in the dataset, there are 35 values. The first 34 are measurements taken from the 17 antennas (two values for each antenna). The last is either 'g' or 'b'; that stands for good and bad, respectively.

Start the IPython Notebook server and create a new notebook called Ionosphere Nearest Neighbors for this chapter.

First, we load up the NumPy and csv libraries that we will need for our code:

import numpy as np
import csv

To load the dataset, we first get the filename of the dataset. First, get the folder the dataset is stored in from your data folder:

data_filename = os.path.join(data_folder, "Ionosphere", "ionosphere.data")

We then create the X and y NumPy arrays to store the dataset in. The sizes of these arrays are known from the dataset. Don't worry if you don't know the size of future datasets—we will use other methods to load the dataset in future chapters and you won't need to know this size beforehand:

X = np.zeros((351, 34), dtype='float')
y = np.zeros((351,), dtype='bool')

The dataset is in a Comma-Separated Values (CSV) format, which is a commonly used format for datasets. We are going to use the csv module to load this file. Import it and set up a csv reader object:

with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)

Next, we loop over the lines in the file. Each line represents a new set of measurements, which is a sample in this dataset. We use the enumerate function to get the line's index as well, so we can update the appropriate sample in the dataset (X):

    for i, row in enumerate(reader):

We take the first 34 values from this sample, turn each into a float, and save that to our dataset:

  data = [float(datum) for datum in row[:-1]]
  X[i] = data

Finally, we take the last value of the row and set the class. We set it to 1 (or True) if it is a good sample, and 0 if it is not:

  y[i] = row[-1] == 'g'

We now have a dataset of samples and features in X, and the corresponding classes in y, as we did in the classification example in Chapter 1, Getting Started with Data Mining.

Moving towards a standard workflow

Estimators in scikit-learn have two main functions: fit() and predict(). We train the algorithm using the fit method and our training set. We evaluate it using the predict method on our testing set.

First, we need to create these training and testing sets. As before, import and run the train_test_split function:

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

Then, we import the nearest neighbor class and create an instance for it. We leave the parameters as defaults for now, and will choose good parameters later in this chapter. By default, the algorithm will choose the five nearest neighbors to predict the class of a testing sample:

from sklearn.neighbors import KNeighborsClassifier estimator = KNeighborsClassifier()

After creating our estimator, we must then fit it on our training dataset. For the nearest neighbor class, this records our dataset, allowing us to find the nearest neighbor for a new data point, by comparing that point to the training dataset:

estimator.fit(X_train, y_train)

We then train the algorithm with our test set and evaluate with our testing set:

y_predicted = estimator.predict(X_test)
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))

This scores 86.4 percent accuracy, which is impressive for a default algorithm and just a few lines of code! Most scikit-learn default parameters are chosen explicitly to work well with a range of datasets. However, you should always aim to choose parameters based on knowledge of the application experiment.

Running the algorithm

In our earlier experiments, we set aside a portion of the dataset as a testing set, with the rest being the training set. We train our algorithm on the training set and evaluate how effective it will be based on the testing set. However, what happens if we get lucky and choose an easy testing set? Alternatively, what if it was particularly troublesome? We can discard a good model due to poor results resulting from such an "unlucky" split of our data.

The cross-fold validation framework is a way to address the problem of choosing a testing set and a standard methodology in data mining. The process works by doing a number of experiments with different training and testing splits, but using each sample in a testing set only once. The procedure is as follows:

  1. Split the entire dataset into a number of sections called folds.
  2. For each fold in the dataset, execute the following steps:
    • Set that fold aside as the current testing set
    • Train the algorithm on the remaining folds
    • Evaluate on the current testing set
  3. Report on all the evaluation scores, including the average score.
  4. In this process, each sample is used in the testing set only once. This reduces (but doesn't completely eliminate) the likelihood of choosing lucky testing sets.
    Note

    Throughout this book, the code examples build upon each other within a chapter. Each chapter's code should be entered into the same IPython Notebook, unless otherwise specified.

The scikit-learn library contains a number of cross fold validation methods. A helper function is given that performs the preceding procedure. We can import it now in our IPython Notebook:

from sklearn.cross_validation import cross_val_score
Note

By default, cross_val_score uses a specific methodology called Stratified K Fold to split the dataset into folds. This creates folds that have approximately the same proportion of classes in each fold, again reducing the likelihood of choosing poor folds. This is a great default, so we won't mess with it right now.

Next, we use this function, passing the original (full) dataset and classes:

scores = cross_val_score(estimator, X, y, scoring='accuracy')
average_accuracy = np.mean(scores) * 100
print("The average accuracy is {0:.1f}%".format(average_accuracy))

This gives a slightly more modest result of 82.3 percent, but it is still quite good considering we have not yet tried setting better parameters. In the next section, we will see how we would go about changing the parameters to achieve a better outcome.

Setting parameters

Almost all data mining algorithms have parameters that the user can set. This is often a cause of generalizing an algorithm to allow it to be applicable in a wide variety of circumstances. Setting these parameters can be quite difficult, as choosing good parameter values is often highly reliant on features of the dataset.

The nearest neighbor algorithm has several parameters, but the most important one is that of the number of nearest neighbors to use when predicting the class of an unseen attribution. In scikit-learn, this parameter is called n_neighbors. In the following figure, we show that when this number is too low, a randomly labeled sample can cause an error. In contrast, when it is too high, the actual nearest neighbors have a lower effect on the result:

Setting parameters

In the figure (a), on the left-hand side, we would usually expect the test sample (the triangle) to be classified as a circle. However, if n_neighbors is 1, the single red diamond in this area (likely a noisy sample) causes the sample to be predicted as being a diamond, while it appears to be in a red area. In the figure (b), on the right-hand side, we would usually expect the test sample to be classified as a diamond. However, if n_neighbors is 7, the three nearest neighbors (which are all diamonds) are overridden by the large number of circle samples.

If we want to test a number of values for the n_neighbors parameter, for example, each of the values from 1 to 20, we can rerun the experiment many times by setting n_neighbors and observing the result:

avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Include 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy')

Compute and store the average in our list of scores. We also store the full set of scores for later analysis:

    avg_scores.append(np.mean(scores))
    all_scores.append(scores)

We can then plot the relationship between the value of n_neighbors and the accuracy. First, we tell the IPython Notebook that we want to show plots inline in the notebook itself:

%matplotlib inline

We then import pyplot from the matplotlib library and plot the parameter values alongside average scores:

from matplotlib import pyplot as plt plt.plot(parameter_values, avg_scores, '-o')

Setting parameters

While there is a lot of variance, the plot shows a decreasing trend as the number of neighbors increases.