Finding the optimal cluster size with Yellowbrick

Atsushi Hara
3 min read · Nov 3, 2021

What is the optimal cluster size?

When you use the K-Means method or another clustering method, you may wonder what the optimal cluster size (the number of clusters) is for your dataset, and finding it matters. If you set the cluster size too small, you can only see the obvious features; and if you set it too large, you can no longer see the features the data has in common.

To avoid this, there are a few well-known methods for finding and visualizing the optimal cluster size, and thanks to a lot of community support there are many blog posts written about this topic.

But when you pick one of those posts and use it as a solution, you sometimes run into problems: the library version in the post is very old and doesn't work in your environment, the code is not reusable, and so on.

Yellowbrick, the visualization library for machine learning

Yellowbrick is a visualization library for machine learning.

It already has a lot of features and examples for dealing with many types of problems, including, of course, finding the best cluster size.

As I said, the Yellowbrick docs already have an example of visualizing K-Means, so another walkthrough might seem meaningless, but let me explain why I prefer Yellowbrick, with some code.

How to visualize K-Means with the Elbow Method

First, you need to install the library.
Yellowbrick works with Python, so you can install it via pip.
I prefer to use pipenv or poetry to control library versions.

$ pip install yellowbrick
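
If you use pipenv or poetry instead, the equivalent commands are:

$ pipenv install yellowbrick
$ poetry add yellowbrick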

After installing, you can follow the example code below.
I've added some comments to make it easier to understand.

# Import KElbowVisualizer to visualize K-Means model performance
from yellowbrick.cluster import KElbowVisualizer
# Import KMeans from sklearn to build the K-Means model
from sklearn.cluster import KMeans
# Import make_blobs from sklearn to build an example dataset
from sklearn.datasets import make_blobs

# Generate a synthetic dataset with 8 blobs
X, y = make_blobs(
    n_samples=1000,
    n_features=12,
    centers=8,
    shuffle=True,
    random_state=200
)

# The good part of Yellowbrick: wrap the model in a visualizer
# that tries every k in the given range
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4, 12))

# Fit the visualizer to the dataset and show the scores
visualizer.fit(X)
visualizer.show()

After running the code above, you should get a result like the one below.

You can see that the optimal cluster size is 8, at the elbow where the blue score line first flattens out, which matches the centers=8 we passed to make_blobs.
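
By the way, if you'd rather read the detected elbow programmatically than from the chart, recent Yellowbrick versions store it on the fitted visualizer (check the docs for your version):

print(visualizer.elbow_value_)  # k at the elbow, 8 here
print(visualizer.elbow_score_)  # distortion score at that k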

Now, I'd like to start writing about what I like about Yellowbrick.
For the most part, it comes down to these two lines of code.

model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

When you pass the K-Means model and the range of cluster sizes to KElbowVisualizer and call its fit method, it fits the model once for every k in the range, stores the scores, and visualizes them.
If you wanted to do that yourself, you would have to implement it. Of course, it wouldn't be tons of code, but it almost certainly wouldn't be cleaner than this; see the rough sketch below.
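
Just as a rough sketch of the manual version (not Yellowbrick's actual internals), reusing the X from above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(4, 12)
scores = []
for k in ks:
    # inertia_ is the sum of squared distances to the closest
    # center, i.e. the distortion score the elbow method plots
    km = KMeans(n_clusters=k, random_state=200).fit(X)
    scores.append(km.inertia_)

plt.plot(ks, scores, marker="o")
plt.xlabel("k")
plt.ylabel("distortion score")
plt.show()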
And if the range of cluster sizes isn't wide enough and you want to search further, you just need to widen it, for example k=(2, 20).
Do you want to change the scoring metric? It's very easy. Just pass the metric name as the metric argument to KElbowVisualizer, like below.

visualizer = KElbowVisualizer(model, k=(2,11), metric='distortion')
visualizer = KElbowVisualizer(model, k=(2,11), metric='silhouette')
visualizer = KElbowVisualizer(model, k=(2,11), metric='calinski_harabasz')

Yellowbrick has already prepared three metrics: distortion (the sum of squared distances from each point to its cluster center), silhouette (the mean silhouette coefficient of all samples), and calinski_harabasz (the ratio of between-cluster to within-cluster dispersion). The default metric is distortion.
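
If you want to sanity-check a single k outside of Yellowbrick, the last two metrics are also available directly in scikit-learn:

from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=8, random_state=200).fit_predict(X)
print(silhouette_score(X, labels))
print(calinski_harabasz_score(X, labels))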

Conclusion

I've introduced Yellowbrick, the visualization library for machine learning, and demonstrated how to use it and what I like about it.

I hope this article helps somebody, and I'd appreciate it if you give it a lot of claps.
Thank you for reading until the end!
