5 Comments

Very nice short empirical analysis. Questions:

1. Here, I assume you knew the actual number of clusters. Does KMeans++ still manage to converge most of the time if the number of clusters used is above or below the actual number?

2. In your example, all the clusters are evenly distributed in space, which makes it easy for the method to initialize the centroids, as those will also be well distributed. What happens when the clusters are not well distributed in space, for instance if you have a group of clusters on one side of the space, well separated from another group of clusters? My intuition is that even KMeans++ will not be able to converge most of the time in that case. Have you tried this?

Thanks for your newsletter. I enjoy reading it!


Those are pretty good questions.

1. The only difference between KMeans and KMeans++ lies in the initialization step. Thus, any limitation of KMeans also applies to KMeans++, one of which is having to know the number of clusters before clustering.

2. I tried this, and it works pretty well as long as you have spherical clusters (which is another assumption of KMeans). I am not able to attach an image here, but please find it at: https://imgur.com/a/6n3xqKj.

Lastly, thanks for appreciating the work. Glad to know you enjoy reading the newsletter :)


Oh, thank you. It is like you were waiting for somebody to ask question #2. Perfect answer! Keep up the good work! :)


What is the method for choosing the second centroid, then the third centroid?


The metric used is distance. The first centroid is always chosen uniformly at random. For the next centroid, the farther a point is from the first centroid, the more likely it is to be selected. So the process of selecting the second centroid is still random, yet weighted: the greater the distance, the higher the probability of being selected.

Now, when selecting the third centroid, you have already chosen two centroids. We compute the distance of each point from both centroids and keep the minimum of the two. In other words, for every candidate point, you know the distance to its closest chosen centroid. Selecting the third centroid then follows the same process as the second: the greater that distance, the higher the probability of being selected.

If you are wondering why we keep the minimum of the two distances for each point, here's the reason. If you want the third centroid to be well separated from the existing centroids, it should be far from BOTH of the already chosen centroids, not just one of them. By keeping the minimum, you ensure that a point with a high probability of being selected is far from ALL chosen centroids. Hope that helps :)
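The selection process described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's code; `kmeans_pp_init` and its arguments are hypothetical names, and it uses squared distances for the sampling weights:

```python
import random

def kmeans_pp_init(points, k, rng=random.Random(0)):
    """Sketch of KMeans++ initialization on a list of coordinate tuples."""
    # Step 1: the first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]
    for _ in range(k - 1):
        # Step 2: for each point, squared distance to its CLOSEST chosen
        # centroid (the "minimum of the distances" from the explanation).
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, cen))
                  for cen in centroids)
              for pt in points]
        # Step 3: sample the next centroid with probability proportional to
        # that distance -- farther from all chosen centroids = more likely.
        total = sum(d2)
        r = rng.random() * total
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc > r:
                centroids.append(pt)
                break
    return centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
chosen = kmeans_pp_init(pts, 3)
```

Note that an already-chosen centroid has distance zero to itself, so its sampling weight is zero and it can never be picked twice; that is exactly why the minimum distance, rather than any single distance, is stored.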
