Mastering Machine Learning Algorithms

Example of Isomap

We can now test the Scikit-Learn Isomap implementation using the Olivetti faces dataset (provided by AT&T Laboratories, Cambridge), which is made up of 400 64 × 64 grayscale portraits belonging to 40 different people. Examples of these images are shown here:

Subset of the Olivetti faces dataset

The original dimensionality is 4096, but we want to visualize the dataset in two dimensions. It's important to understand that the Euclidean distance might not be the best choice for measuring the similarity of images, so it's surprising to see how well the samples are clustered by such a simple algorithm.

The first step is loading the dataset:

from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()

The faces object is dictionary-like (a Scikit-Learn Bunch) and contains three main elements:

  • images: Image array with shape 400 × 64 × 64
  • data: Flattened array with shape 400 × 4096
  • target: One-dimensional array with 400 elements containing the integer labels (from 0 to 39)
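A quick way to confirm this layout is to inspect the loaded object directly. The following is a minimal sketch; summarize_faces is a hypothetical helper name introduced here for illustration, and it only assumes that the object exposes the three keys listed above:

```python
import numpy as np


def summarize_faces(faces):
    """Return the shape of each main array plus the range of the labels.

    `faces` is assumed to be dictionary-like with the keys
    'images', 'data', and 'target' (hypothetical helper, not part of Scikit-Learn).
    """
    labels = np.asarray(faces["target"])
    return {
        "images": np.asarray(faces["images"]).shape,
        "data": np.asarray(faces["data"]).shape,
        "target": labels.shape,
        "label_range": (int(labels.min()), int(labels.max())),
    }
```

Calling summarize_faces(faces) on the object loaded above should report (400, 64, 64) for images, (400, 4096) for data, and labels ranging from 0 to 39.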

At this point, we can instantiate the Isomap class provided by Scikit-Learn, setting n_components=2 and n_neighbors=5 (the reader can try different configurations), and then fit the model:

from sklearn.manifold import Isomap

isomap = Isomap(n_neighbors=5, n_components=2)
X_isomap = isomap.fit_transform(faces['data'])
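Since the neighborhood size is left open to experimentation, one possible way to compare configurations is to fit a model per value and look at the reconstruction error exposed by the Isomap class. This is only a sketch: compare_neighbors is a hypothetical helper, the candidate values are arbitrary, and the reconstruction error is a rough indicator rather than a definitive selection criterion:

```python
import numpy as np
from sklearn.manifold import Isomap


def compare_neighbors(X, neighbor_values=(3, 5, 10), n_components=2):
    """Fit one Isomap model per n_neighbors value.

    Returns a dict mapping n_neighbors -> (embedding, reconstruction_error).
    The candidate values are illustrative assumptions, not recommendations.
    """
    results = {}
    for k in neighbor_values:
        model = Isomap(n_neighbors=k, n_components=n_components)
        embedding = model.fit_transform(X)
        results[k] = (embedding, model.reconstruction_error())
    return results
```

For the dataset used here, the call would be compare_neighbors(faces['data']). Each embedding can then be plotted and inspected visually alongside its error value.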

As the resulting plot with all 400 elements is very dense, the following plot shows only the first 100 samples:

Isomap applied to 100 samples drawn from the Olivetti faces dataset
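A plot of this kind can be produced with Matplotlib, which is an assumption here because the text does not state which plotting tool was used. In the following sketch, plot_embedding is a hypothetical helper that scatters the first n_samples points and annotates each one with its class label:

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_embedding(X_2d, labels, n_samples=100):
    """Scatter the first n_samples 2D points and annotate each with its label."""
    X_2d = np.asarray(X_2d)[:n_samples]
    labels = np.asarray(labels)[:n_samples]

    fig, ax = plt.subplots(figsize=(10, 8))
    # Color the points by class so that groups are easier to spot
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab20", s=20)
    for (x, y), label in zip(X_2d, labels):
        ax.annotate(str(label), xy=(x, y), fontsize=8)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    return fig, ax
```

With the variables defined earlier, plot_embedding(X_isomap, faces['target']) followed by plt.show() displays the figure. The styling choices (figure size, color map, font size) are arbitrary.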

As we can see, samples belonging to the same class are grouped in rather dense clusters. The classes that seem best separated are 7 and 1. Checking the corresponding faces, for class 7, we get:

Samples belonging to class 7

The set contains portraits of a young woman with a fair complexion, quite different from most of the other subjects. For class 1, instead, we get:

Samples belonging to class 1
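The portraits of a single class, such as the two sets shown above, can be extracted by filtering the image array with a boolean mask on the labels. This is a minimal sketch, and images_of_class is a hypothetical helper name:

```python
import numpy as np


def images_of_class(images, targets, label):
    """Return the subset of images whose target equals the given label."""
    images = np.asarray(images)
    targets = np.asarray(targets).ravel()
    return images[targets == label]
```

For example, images_of_class(faces['images'], faces['target'], 7) returns the ten 64 × 64 portraits of class 7, which can then be displayed one by one with an image viewer of choice.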

The class 1 portraits show a man with big glasses and a distinctive mouth expression. In the dataset, there are only a few people with glasses, and one of them has a dark beard. We can conclude that Isomap created a low-dimensional representation that is coherent with the original geodesic distances. In some cases, there's a partial overlap between clusters that can be mitigated by increasing the dimensionality or adopting a more complex strategy.
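One way to check whether a higher dimensionality actually reduces the overlap is to measure how well the known classes are separated in each embedding. The sketch below uses the silhouette score from Scikit-Learn for this purpose. This metric is a choice made here for illustration, not one taken from the text, and silhouette_by_dimension and its default values are hypothetical:

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.metrics import silhouette_score


def silhouette_by_dimension(X, labels, dimensions=(2, 3, 5), n_neighbors=5):
    """Map each n_components value to the silhouette score of its embedding.

    The true class labels are used as cluster assignments, so a higher
    score suggests that the classes are better separated.
    """
    scores = {}
    for d in dimensions:
        model = Isomap(n_neighbors=n_neighbors, n_components=d)
        embedding = model.fit_transform(X)
        scores[d] = silhouette_score(embedding, labels)
    return scores
```

Running silhouette_by_dimension(faces['data'], faces['target']) gives one score per dimensionality. Whether the score improves as the number of components grows has to be verified on the actual data.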