The KMeans algorithm has a major limitation.
As a recap, KMeans is trained as follows:
1. Initialize centroids.
2. Find the nearest centroid for each point.
3. Reassign centroids.
4. Repeat until convergence.
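The training loop above can be sketched in a few lines of NumPy (a minimal illustration; the function and variable names are my own):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal KMeans sketch following the four steps above."""
    rng = np.random.default_rng(seed)
    # 1) Initialize centroids by sampling k distinct points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2) Find the nearest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Reassign each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # 4) Repeat until convergence (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```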
But the second step:
- involves a brute-force, exhaustive search, and
- computes the distance of every point from every centroid.
As a result, this implementation:
- isn't optimized, and
- takes plenty of time to train and predict.
This is especially challenging with large datasets.
To speed up KMeans, use Faiss by Facebook AI Research.
Faiss:
- provides a much faster nearest-neighbor search,
- finds an approximate (rather than exact) best solution, and
- uses an "Inverted Index", an optimized data structure to store and index the data points.
This makes performing clustering extremely efficient.
In my benchmark above, Faiss was roughly 20x faster than the standard KMeans implementation from Sklearn.
Get started with Faiss: GitHub.
👉 Over to you: What are some other limitations of the KMeans algorithm?
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.
👉 If you love reading this newsletter, feel free to share it with friends!
👉 Sponsor the Daily Dose of Data Science Newsletter. More info here: Sponsorship details.
Find the code for my tips here: GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.