Mastering Machine Learning Algorithms

Isomap

Isomap is one of the simplest manifold learning algorithms, and it's based on the idea of reducing the dimensionality while trying to preserve the geodesic distances measured on the original manifold where the input data lies. The algorithm works in three steps. The first operation is a k-nearest neighbor search and the construction of the corresponding graph, whose vertices are the samples, while the edges represent the connections among nearest neighbors, weighted by the distance to the corresponding neighbor.
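
As a reference, the first step can be sketched with scikit-learn's kneighbors_graph function; the snippet below is only a minimal example, where the dataset (a random one) and the value of k are arbitrary assumptions:

import numpy as np
from sklearn.neighbors import kneighbors_graph

# Hypothetical dataset: 100 samples in a 3-dimensional space
X = np.random.RandomState(1000).uniform(-1.0, 1.0, size=(100, 3))
k = 10

# Sparse (n x n) matrix whose non-zero entries are the Euclidean
# distances between each sample and its k nearest neighbors
W = kneighbors_graph(X, n_neighbors=k, mode='distance', include_self=False)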

The second step adopts the Dijkstra algorithm to compute the shortest pairwise distances on the graph for all couples of samples. The following figure shows a portion of such a graph, where some shortest distances are marked:

Example of a graph with marked shortest distances

For example, as x3 is a neighbor of x5 and x7, applying the Dijkstra algorithm we obtain the shortest paths d(x3, x5) = w53 and d(x3, x7) = w73. The computational complexity of this step is about O(n²log n + n²k), which is lower than O(n³) when k << n (a condition normally met); however, for large graphs (with n >> 1), this is often the most expensive part of the whole algorithm.
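
Continuing the previous sketch, the pairwise geodesic distances can be computed, for example, with SciPy's shortest_path function; method='D' selects the Dijkstra algorithm described above:

from scipy.sparse.csgraph import shortest_path

# All-pairs shortest paths on the k-nearest neighbor graph.
# directed=False treats the graph as undirected. If the graph is not
# connected, some distances will be infinite and k must be increased.
geodesic_distances = shortest_path(W, method='D', directed=False)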

The third step is called metric multidimensional scaling (MDS), which is a technique for finding a low-dimensional representation while trying to preserve the inner products among samples. If we have a P-dimensional dataset X, the algorithm must find a Q-dimensional set Φ, with Q < P, minimizing the function:

L(Φ) = Σi,j (xi · xj − φi · φj)²

As proven in Semi-Supervised Learning, Chapelle O., Schölkopf B., Zien A. (editors), The MIT Press, the optimization is achieved by taking the top Q eigenvectors of the Gram matrix Gij = xi · xj (or, in matrix form, G = XXT if X ∈ ℜn × M); however, as the Isomap algorithm works with pairwise distances, we need to compute the matrix D of squared distances:

Dij = d(xi, xj)², where d(xi, xj) is the shortest graph distance computed in the previous step

If the X dataset is zero-centered, it's possible to derive a simplified Gram matrix GD from D, as described by M. A. A. Cox and T. F. Cox:

GD = -(1/2) C D C, where C = In − (1/n) 1 1T is the centering matrix and 1 is the column vector of n ones

Isomap computes the top Q eigenvalues λ1, λ2, ..., λQ of GD and the corresponding eigenvectors ν1, ν2, ..., νQ, and determines the Q-dimensional vectors as:

φi = (√λ1 ν1(i), √λ2 ν2(i), ..., √λQ νQ(i)), where νj(i) is the i-th component of νj

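
The double centering and the final projection can be translated almost directly into NumPy. The following sketch, which reuses the geodesic distance matrix computed before and assumes Q = 2 with non-negative top eigenvalues, is meant only to expose the structure of the computation, not to replace the optimized scikit-learn implementation:

import numpy as np

Q = 2
n = geodesic_distances.shape[0]

# Matrix of squared geodesic distances
D = geodesic_distances ** 2

# Double centering: GD = -1/2 * C D C with C = I - (1/n) * 1 1^T
C = np.eye(n) - np.ones((n, n)) / n
GD = -0.5 * C.dot(D).dot(C)

# Top Q eigenvalues and eigenvectors (eigh returns them in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(GD)
top = np.argsort(eigenvalues)[::-1][:Q]

# Q-dimensional embedding: each eigenvector is scaled by the square root
# of the corresponding (assumed non-negative) eigenvalue
Phi = eigenvectors[:, top] * np.sqrt(eigenvalues[top])
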
As we're going to discuss in Chapter 5, EM Algorithm and Applications (and also as pointed out in Spectral Methods for Dimensionality Reduction, Saul L. K., Weinberger K. Q., Sha F., Ham J., and Lee D. D.), this kind of projection is also exploited by Principal Component Analysis (PCA), which finds the directions with the highest variance, corresponding to the top eigenvectors of the covariance matrix. In fact, when applying the SVD to the dataset X, we get:

X = U √Λ VT, where U and V are orthogonal matrices

The diagonal matrix Λ contains the eigenvalues of both XXT and XTX; therefore, the eigenvalues λGi of G are equal to M·λΣi, where λΣi are the eigenvalues of the covariance matrix Σ = (1/M) XTX. Hence, Isomap achieves the dimensionality reduction while trying to preserve the pairwise distances, projecting the dataset onto the subspace spanned by the eigenvectors where the maximum explained variance is achieved. In terms of information theory, this condition guarantees the minimum loss with an effective reduction of the dimensionality.
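
The relation between the eigenvalues of G and those of the covariance matrix can be verified numerically; in the following sketch, Xr is an arbitrary zero-centered random dataset and M is the normalization constant used in the definition of Σ given above:

import numpy as np

# Random zero-centered dataset (independent of the previous sketches)
rng = np.random.RandomState(1000)
Xr = rng.normal(size=(100, 5))
Xr -= np.mean(Xr, axis=0)

M = Xr.shape[1]
G = Xr.dot(Xr.T)            # Gram matrix
Sigma = Xr.T.dot(Xr) / M    # covariance matrix with the normalization above

# The non-zero eigenvalues of G equal M times the eigenvalues of Sigma
lambda_G = np.sort(np.linalg.eigvalsh(G))[::-1][:M]
lambda_Sigma = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
print(np.allclose(lambda_G, M * lambda_Sigma))  # True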

For the shortest-path step, Scikit-Learn also implements the Floyd-Warshall algorithm, which is generally slower than Dijkstra's algorithm on sparse k-nearest neighbor graphs. For further information, please refer to Introduction to Algorithms, Cormen T. H., Leiserson C. E., Rivest R. L., The MIT Press.
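
In practice, there's no need to implement the single steps manually, because the whole pipeline is provided by the Isomap class. The following is a minimal usage sketch, where the hyperparameters are arbitrary and X is assumed to be the dataset defined in the first snippet:

from sklearn.manifold import Isomap

# n_neighbors corresponds to k and n_components to Q;
# path_method='D' forces the Dijkstra algorithm ('FW' selects Floyd-Warshall)
isomap = Isomap(n_neighbors=10, n_components=2, path_method='D')
X_isomap = isomap.fit_transform(X)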