A (Highly) Important Point to Consider Before You Use KMeans Next Time
Estimating the success rate of KMeans.
The most important yet often overlooked step of KMeans is its centroid initialization. Here's something to consider before you use it next time.
KMeans selects the initial centroids randomly. As a result, it sometimes converges to a poor, suboptimal clustering. This forces us to repeat the clustering several times with different initializations.
Yet even repeated clustering offers no guarantee that you will end up with the correct clusters anytime soon. This is especially true when you have many centroids to begin with.
Instead, KMeans++ takes a smarter approach to initialize centroids.
The first centroid is selected randomly. Each subsequent centroid is then chosen with probability proportional to its squared distance from the nearest centroid already selected.
In other words, a point that lies far from the centroids chosen so far is more likely to become the next initial centroid. This way, the initial centroids tend to land in different clusters from the start, and the algorithm converges faster and more reliably.
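To make that selection rule concrete, here is a minimal NumPy sketch of the KMeans++ seeding step. It is a sketch of the standard algorithm, not the exact code behind the plots below; the function name and arguments are placeholders.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Select k initial centroids from X using KMeans++ seeding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(n)]]

    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid.
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)

        # Sample the next centroid with probability proportional to that squared distance.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])

    return np.asarray(centroids)
```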
The impact is evident from the bar plots shown below. They depict the frequency of the number of misplaced centroids obtained (analyzed manually) after training 50 different models with KMeans and KMeans++.
On the given dataset, out of the 50 models, KMeans produced zero misplaced centroids only once, a success rate of just 2%.
In contrast, KMeans++ never produced any misplaced centroids.
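The exact dataset and the manual centroid-counting step aren't reproduced here, but a comparison of this kind can be sketched in scikit-learn roughly as follows. Synthetic blobs stand in for the original data, and inertia is used as an automatic proxy for the manual misplaced-centroid count; both are assumptions on my part.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the dataset above: 10 well-separated blobs (an assumption).
X, _ = make_blobs(n_samples=2000, centers=10, cluster_std=1.0, random_state=0)

def inertia_over_runs(init, runs=50):
    """Fit `runs` models, each with a single initialization, and collect their inertias."""
    return [
        KMeans(n_clusters=10, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(runs)
    ]

random_inertias = inertia_over_runs("random")
pp_inertias = inertia_over_runs("k-means++")

# A wide spread for "random" vs. a tight spread for "k-means++" mirrors the bar plots.
print(min(random_inertias), max(random_inertias))
print(min(pp_inertias), max(pp_inertias))
```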
Luckily, if you are using sklearn, you don’t need to worry about the initialization step, because sklearn uses the KMeans++ approach by default.
However, if you have a custom implementation, do give it a thought.
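If you do roll your own KMeans, scikit-learn also exposes the seeding step on its own via kmeans_plusplus, so you can reuse it without adopting the full estimator. A small sketch; the toy data here is assumed.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

# Assumed toy data: 500 points in two dimensions.
X = np.random.default_rng(42).normal(size=(500, 2))

# Returns only the KMeans++ seeds (and their row indices in X),
# ready to plug into a hand-rolled Lloyd-iteration loop.
centers, indices = kmeans_plusplus(X, n_clusters=5, random_state=42)
print(centers.shape)  # (5, 2)
```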
👉 Tell me you liked this post by leaving a heart react 🤍.
👉 If you love reading this newsletter, feel free to share it with friends!
Hey there!
You may have noticed that I have made these daily newsletter issues a bit more detailed. What do you think of this change?
Find the code for my tips here: GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
Very nice short empirical analysis. Questions:
1. Here, I assume you knew the actual number of clusters. Does kmeans++ still manage to converge most of the time if the number of clusters used is over or under the actual number?
2. In your example, all the clusters are evenly distributed in space, which makes it perfectly suited for the method, since the initial centroids will also be well distributed in space. What happens when the clusters are not well distributed in space, for instance if you have a group of clusters on one side of the space, well separated from another group of clusters? My intuition is that even kmeans++ will not be able to converge most of the time in that case. Have you tried this?
Thanks for your newsletter. I enjoy reading it!
What is the method used to choose the second centroid, and then the third?