The KMeans algorithm has a major limitation.
To recall, KMeans is trained as follows:
Initialize centroids
Find the nearest centroid for each point
Reassign centroids
Repeat until convergence
But the second step:
involves a brute-force and exhaustive search
finds the distance of every point from every centroid
As a result, this implementation:
isn't optimized
takes plenty of time to train and predict
This is especially challenging with large datasets.
To speed up KMeans, use Faiss by Facebook AI Research.
Faiss:
provides a much faster nearest-neighbor search.
finds an approximate best solution
uses "Inverted Index", which is an optimized data structure to store and index the data point
This makes performing clustering extremely efficient.
As shown above, Faiss is roughly 20x as fast as running the ideal KMeans algorithm from Sklearn.
Get started with Faiss: GitHub.
π Over to you: What are some other limitations of the KMeans algorithm?
π Tell the world what makes this newsletter special for you by leaving a review here :)
π If you liked this post, donβt forget to leave a like β€οΈ. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.
π If you love reading this newsletter, feel free to share it with friends!
π Sponsor the Daily Dose of Data Science Newsletter. More info here: Sponsorship details.
Find the code for my tips here:Β GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.