Agglomerative versus Divisive Clustering
So far, our instances of hierarchical clustering have all been agglomerative – that is, they have been built from the bottom up. While this is typically the most common approach for this type of clustering, it is important to know that it is not the only way a hierarchy can be created. The opposite hierarchical approach, that is, one built from the top down, can also be used to create your taxonomy. This approach is called divisive hierarchical clustering and works by starting with all the data points in your dataset in one massive cluster, which is then repeatedly split into smaller clusters. Many of the internal mechanics of the divisive approach will prove to be quite similar to the agglomerative approach.
As with most problems in unsupervised learning, deciding on the best approach is often highly dependent on the problem you are faced with solving.
Imagine that you are an entrepreneur who has just bought a new grocery store and needs to stock it with goods. You receive a large shipment of food and drink in a container, but you've lost track of all the shipment information. In order to effectively sell your products, you must group similar products together (your store will be a huge mess if you just put everything on the shelves in a random order). Setting out on this organizational goal, you can take either a bottom-up or top-down approach. On the bottom-up side, you will go through the shipping container and think of everything as disorganized – you will then pick up a random object and find its most similar product. For example, you may pick up apple juice and realize that it makes sense to group it together with orange juice. With the top-down approach, you will view everything as organized in one large group. Then, you will move through your inventory and split the groups based on the largest differences in similarity. For example, if you were organizing a grocery store, you may originally think that apples and apple juice go together, but on second thoughts, they are quite different. Therefore, you will break them into smaller, dissimilar groups.
In general, it helps to think of agglomerative as the bottom-up approach and divisive as the top-down approach – but how do they trade off in terms of performance? The agglomerative approach builds each cluster by repeatedly merging whatever happens to be closest at that moment. This behavior of immediately grabbing the closest thing is known as "greedy learning;" it has the potential to be fooled by local neighbors and not see the larger implications of the clusters it forms at any given time. On the flip side, the divisive approach has the benefit of seeing the entire data distribution as one from the beginning and choosing the best way to break down clusters. This insight into what the entire dataset looks like is helpful for potentially creating more accurate clusters and should not be overlooked. Unfortunately, the top-down approach typically pays for this potential accuracy with greater computational complexity. In practice, an agglomerative approach works most of the time and should be the preferred starting point when it comes to hierarchical clustering. If, after reviewing the hierarchies, you are unhappy with the results, it may help to take a divisive approach.
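To make the top-down idea concrete, here is a minimal sketch of divisive clustering that repeatedly splits the largest remaining cluster in two using k-means. The divisive_clustering helper is purely illustrative – it is not part of scikit-learn or of the exercises that follow:
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters):
    # Illustrative top-down (divisive) sketch using repeated 2-means splits
    # Start with every point in a single, all-encompassing cluster
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        # Pick the largest remaining cluster and split it in two
        largest = np.bincount(labels).argmax()
        mask = labels == largest
        split = KMeans(n_clusters=2, random_state=0).fit_predict(X[mask])
        # Points falling in the second half receive a brand-new label
        labels[np.where(mask)[0][split == 1]] = labels.max() + 1
    return labels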
Exercise 2.03: Implementing Agglomerative Clustering with scikit-learn
In most business use cases, you will likely find yourself implementing hierarchical clustering with a package that abstracts everything away, such as scikit-learn. Scikit-learn is a free package that is indispensable when it comes to machine learning in Python. It conveniently provides highly optimized forms of the most popular algorithms, such as regression, classification, and clustering. By using an optimized package such as scikit-learn, your work becomes much easier. However, you should only use it when you fully understand how hierarchical clustering works, as we discussed in the previous sections. This exercise will compare two potential routes that you can take when forming clusters – using SciPy and scikit-learn. By completing this exercise, you will learn what the pros and cons are of each, and which suits you best from a user perspective. Follow these steps to complete this exercise:
- Scikit-learn makes implementation as easy as just a few lines of code. First, import the necessary packages and assign the model to the ac variable. Then, create the blob data as shown in the previous exercises:
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
ac = AgglomerativeClustering(n_clusters=8, \
                             affinity="euclidean", \
                             linkage="average")
X, y = make_blobs(n_samples=1000, centers=8, \
                  n_features=2, random_state=800)
First, we assign the model to the ac variable by passing in parameters that we are familiar with, such as affinity (the distance function) and linkage.
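One caveat worth noting: in more recent scikit-learn releases (1.2 onward), the affinity parameter of AgglomerativeClustering has been renamed to metric, and the old name is eventually removed. If your version warns about or rejects affinity, the equivalent call would be:
# Equivalent call for newer scikit-learn versions, where affinity is
# renamed to metric
ac = AgglomerativeClustering(n_clusters=8, \
                             metric="euclidean", \
                             linkage="average")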
- Then, reuse the linkage and fcluster functions that we used in the prior exercises:
distances = linkage(X, method="centroid", metric="euclidean")
sklearn_clusters = ac.fit_predict(X)
scipy_clusters = fcluster(distances, 3, criterion="distance")
After instantiating the model, we can simply fit it to the dataset using .fit_predict() and assign the resulting cluster labels to an additional variable. This gives us the cluster assignments as part of the model-fitting process.
- Then, we can compare how the two approaches work by plotting their final cluster results. Let's take a look at the clusters from the scikit-learn approach:
plt.figure(figsize=(6,4))
plt.title("Clusters from Sci-Kit Learn Approach")
plt.scatter(X[:, 0], X[:, 1], c=sklearn_clusters, \
            s=50, cmap='tab20b')
plt.show()
Here is the output for the clusters from the scikit-learn approach:
Take a look at the clusters from the SciPy approach:
plt.figure(figsize=(6,4))
plt.title("Clusters from SciPy Approach")
plt.scatter(X[:, 0], X[:, 1], c=scipy_clusters, \
            s=50, cmap='tab20b')
plt.show()
The output is as follows:
As you can see, the two converge to basically the same clusters.
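If you want to confirm this agreement numerically rather than by eye, one option (not part of the original exercise) is scikit-learn's adjusted Rand index, which scores how well two labelings agree regardless of how the cluster numbers happen to be assigned:
from sklearn.metrics import adjusted_rand_score
# Values close to 1 indicate that the two labelings are nearly identical
print(adjusted_rand_score(sklearn_clusters, scipy_clusters))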
Note
To access the source code for this specific section, please refer to https://packt.live/2DngJuz.
You can also run this example online at https://packt.live/3f5PRgy.
While this is great from a toy problem perspective, in the next activity, you will learn that small changes to the input parameters can lead to wildly different results.
Activity 2.01: Comparing k-means with Hierarchical Clustering
You are managing a store's inventory and receive a large shipment of wine, but the brand labels fell off the bottles in transit. Fortunately, your supplier has provided you with the chemical readings for each bottle, along with their respective serial numbers. Unfortunately, you aren't able to open each bottle of wine and taste test the difference – you must find a way to group the unlabeled bottles back together according to their chemical readings. You know from the order list that you ordered three different types of wine and are given only two wine attributes to group the wine types back together. In this activity, we will be using the wine dataset. This dataset comprises chemical readings from three different types of wine, and as per the source on the UCI Machine Learning Repository, it contains these features:
- Alcohol
- Malic acid
- Ash
- Alkalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Note
The wine dataset is sourced from https://archive.ics.uci.edu/ml/machine-learning-databases/wine/ (UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science). It can also be accessed at https://packt.live/3aP8Tpv.
The aim of this activity is to implement k-means and hierarchical clustering on the wine dataset and to determine which of these approaches is more accurate in forming three separate clusters for each wine type. You can try different combinations of scikit-learn implementations and use helper functions in SciPy and NumPy. You can also use the silhouette score to compare the different clustering methods and visualize the clusters on a graph.
After completing this activity, you will see first-hand how two different clustering algorithms perform on the same dataset, allowing easy comparison when it comes to hyperparameter tuning and overall performance evaluation. You will probably notice that one method performs better than the other, depending on how the data is shaped. Another key outcome from this activity is gaining an understanding of how important hyperparameters are in any given use case.
Here are the steps to complete this activity:
- Import the necessary packages from scikit-learn (KMeans, AgglomerativeClustering, and silhouette_score).
- Read the wine dataset into a pandas DataFrame and print a small sample.
- Visualize some features from the dataset by plotting the OD Reading feature (OD280/OD315 of diluted wines) against the proline feature.
- Use the sklearn implementation of k-means on the wine dataset, knowing that there are three wine types.
- Use the sklearn implementation of hierarchical clustering on the wine dataset.
- Plot the predicted clusters from k-means.
- Plot the predicted clusters from hierarchical clustering.
- Compare the silhouette score of each clustering method (a minimal sketch of this comparison follows this list).
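Here is a minimal sketch of what the silhouette comparison in the final step might look like. It is not the full activity solution: scikit-learn's built-in copy of the wine data stands in for the UCI file, and only the two features used in the activity (OD280/OD315 and proline) are kept:
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Columns 11 and 12 are OD280/OD315 of diluted wines and proline
X = load_wine().data[:, [11, 12]]
km_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("k-means silhouette:", silhouette_score(X, km_labels))
print("hierarchical silhouette:", silhouette_score(X, hc_labels))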
Upon completing this activity, you should have plotted the predicted clusters you obtained from k-means as follows:
A similar plot should also be obtained for the cluster that was predicted by hierarchical clustering, as shown here:
Note
The solution to this activity can be found on page 423.