An Overlooked Source of (Massive) Run-time Optimization in KMeans
Another issue with KMeans, and here's how to address it.
Yesterday, we learned how the mini-batch implementation of the KMeans clustering algorithm works.
More specifically, we discussed how during the centroid reassignment step in traditional KMeans, all data points must be available in memory to average over.
But in large datasets, keeping all data points in memory can be difficult.
MiniBatchKMeans is a mini-batch implementation of KMeans, which addresses this issue by maintaining the following details for every centroid while iterating through all mini-batches:
A “sum-vector” to store the sum of the data point vectors assigned to the centroid.
A “count” variable to store the number of data point vectors assigned to the centroid.
After iterating through all mini-batches, we perform a scalar division between the sum-vector and the count to get the average vector:
This average vector will precisely tell us the new location of the centroid as if all data points were present in memory.
This solves the memory bottleneck issues of KMeans.
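To make this concrete, here is a minimal sketch of one reassignment pass, assuming the data arrives as an iterable of NumPy mini-batches. The function name and looping structure are illustrative, not the actual MiniBatchKMeans implementation:

```python
import numpy as np

def minibatch_centroid_update(batches, centroids):
    """Sketch of the sum-vector / count bookkeeping described above.

    `batches` is any iterable of (batch_size, n_features) arrays,
    so the full dataset never has to sit in memory at once.
    """
    k, n_features = centroids.shape
    sum_vectors = np.zeros((k, n_features))  # running sum of assigned points
    counts = np.zeros(k)                     # points assigned to each centroid

    for batch in batches:
        # Find the nearest centroid for every point in this mini-batch
        distances = np.linalg.norm(batch[:, None, :] - centroids[None, :, :], axis=2)
        nearest = distances.argmin(axis=1)

        # Accumulate the sum-vector and count per centroid
        for j in range(k):
            assigned = batch[nearest == j]
            sum_vectors[j] += assigned.sum(axis=0)
            counts[j] += len(assigned)

    # Scalar division gives the new centroid locations,
    # exactly as if all points had been averaged in memory at once
    nonempty = counts > 0
    centroids[nonempty] = sum_vectors[nonempty] / counts[nonempty, None]
    return centroids
```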
In fact, a mini-batch approach can also offer a run-time improvement. While vectorization typically provides magical run-time gains on large amounts of data, its performance often degrades beyond a certain data size.
Thus, by loading fewer training instances at a time into memory, we can get a better training run-time.
For more details, please refer to yesterday’s issue.
Another limitation of KMeans
Nevertheless, there’s another major limitation of both KMeans and MiniBatchKMeans, which we didn’t discuss yesterday.
To recall, KMeans is trained as follows:
Step 1) Initialize centroids
Step 2) Find the nearest centroid for each point
Step 3) Reassign centroids
Step 4) Repeat until convergence
But in this implementation, even “Step 2” has a run-time bottleneck, as this step involves a brute-force and exhaustive search.
In other words, it finds the distance of every data point from every centroid.
As a result, this step is unoptimized and consumes a large share of both training and prediction time.
This is especially challenging with large datasets.
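To see why this step is expensive, here is a minimal sketch of the brute-force assignment, assuming NumPy arrays. Every one of the n points is compared against every one of the k centroids, so the cost grows as O(n · k · d):

```python
import numpy as np

def brute_force_assign(points, centroids):
    """Exhaustive Step 2: distance of every point to every centroid."""
    # (n, k) matrix of pairwise distances; cost scales with n * k * d
    # (and the intermediate array can itself become very large)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)  # index of the nearest centroid per point

# For 500k points, 1024 dimensions, and, say, 100 centroids, that is
# roughly 51 billion distance terms per iteration, repeated until convergence.
```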
To speed up KMeans, one of the implementations I usually prefer, especially on large datasets, is Faiss by Facebook AI Research.
To elaborate further, Faiss provides a much faster nearest-neighbor search using approximate nearest-neighbor search algorithms.
It uses an “Inverted Index”, which is an optimized data structure to store and index the data points.
This makes performing clustering extremely efficient, especially on large datasets, which is also evident from the image below:
As shown above, on a dataset of 500k data points (1024 dimensions), Faiss is roughly 20x faster than KMeans from Sklearn, which is an insane speedup.
What’s more, Faiss can also run on a GPU, which can further speed up your clustering run-time performance.
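If you want to try it, here is a minimal sketch using Faiss's built-in KMeans wrapper. The data here is random and purely for illustration, and the exact keyword arguments may differ slightly across Faiss versions:

```python
import numpy as np
import faiss

# Toy data purely for illustration (kept small so it runs quickly)
d, k = 1024, 100
x = np.random.random((50_000, d)).astype("float32")  # Faiss expects float32

# Train KMeans with Faiss; set gpu=True to run on a GPU if one is available
kmeans = faiss.Kmeans(d, k, niter=20, verbose=True, gpu=False)
kmeans.train(x)

# Cluster centers and the nearest centroid for each point
centroids = kmeans.centroids                   # shape: (k, d)
distances, labels = kmeans.index.search(x, 1)  # cluster assignment per point
```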
👉 Get started with Faiss here: GitHub.
👉 Over to you: What are some other limitations of the KMeans algorithm?
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.
The button is located towards the bottom of this email.
Thanks for reading!
Latest full articles
If you’re not a full subscriber, here’s what you missed last month:
DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning
You Cannot Build Large Data Projects Until You Learn Data Version Control!
Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.
Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit
Gaussian Mixture Models (GMMs): The Flexible Twin of KMeans.
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you love reading this newsletter, feel free to share it with friends!