Get real-time web data that won’t break your pipeline!
Slow, incomplete, or outdated data can break everything—AI models fail, dashboards mislead, and decisions suffer.
But with the right tools, you can ensure your data is always real-time, accurate, and scalable.
Bright Data gives you:
Real-time data ingestion to keep AI models and analytics up to date.
Automated data validation to catch inconsistencies before they cause issues.
Anomaly detection to prevent bad data from corrupting your insights.
Scalable data pipelines that grow with your needs—without bottlenecks.
Seamless API integration for easy deployment into your existing stack.
If bad data is slowing you down, it’s time to upgrade.
Bright Data helps you move faster, scale smarter, and make decisions with confidence.
Thanks to Bright Data for partnering today.
Speed up KMeans with Faiss
KMeans is trained as follows:
Step 1) Initialize centroids
Step 2) Find the nearest centroid for each point
Step 3) Reassign centroids
Step 4) Repeat until convergence
But in a standard implementation, Step 2 is a run-time bottleneck since it computes the distance of every data point to every centroid, as in the sketch below:
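To make the bottleneck concrete, here is a minimal NumPy sketch of the assignment step (the shapes and names are illustrative assumptions, not code from this post):

```python
import numpy as np

def assign_clusters(X, centroids):
    # X: (n, d) data points; centroids: (k, d) current centroids.
    # Computes the full (n, k) distance matrix: O(n * k * d) work.
    # For large n and k, this exhaustive comparison dominates run time.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # index of the nearest centroid per point
```

Every iteration of KMeans repeats this exhaustive search, which is exactly what Faiss replaces with a faster index-based lookup.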
Faiss (by Facebook AI Research) provides much faster nearest-neighbor search using approximate search algorithms.
It uses an “Inverted Index,” an optimized data structure to store and index data points.
We covered indexing techniques in the vector databases article here: A Beginner-friendly and Comprehensive Deep Dive on Vector Databases
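For intuition, here is a minimal sketch of building and querying an inverted-file (IVF) index in Faiss; the dataset and the parameter values (nlist, nprobe) are illustrative assumptions:

```python
import numpy as np
import faiss

d = 128                                           # vector dimensionality (illustrative)
X = np.random.rand(100_000, d).astype("float32")  # Faiss expects float32

quantizer = faiss.IndexFlatL2(d)                # coarse quantizer defining the lists
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # 1024 inverted lists (illustrative)

index.train(X)   # learn the coarse centroids that partition the space
index.add(X)     # assign each vector to its inverted list

index.nprobe = 8               # scan only 8 of the 1024 lists: approximate but fast
D, I = index.search(X[:5], 1)  # nearest neighbor for the first 5 vectors
```

Because a query only scans a few inverted lists instead of the whole dataset, each nearest-centroid lookup touches far fewer points.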
This makes search extremely efficient, especially on large datasets, as the image below also shows:
As shown above, on a dataset of 500k data points (1024 dimensions), Faiss is roughly 20x faster than scikit-learn's KMeans, which is an insane speedup.
What’s more, Faiss can also run on a GPU, which can further speed up your clustering run-time performance.
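If you want to try it yourself, here is a minimal sketch using Faiss's built-in KMeans; the dataset shape and hyperparameters (k, niter) are illustrative assumptions:

```python
import numpy as np
import faiss

d = 1024
X = np.random.rand(500_000, d).astype("float32")  # Faiss expects float32

# niter and verbose are standard faiss.Kmeans options;
# gpu=True offloads clustering to a GPU if one is available.
kmeans = faiss.Kmeans(d, k=100, niter=20, verbose=True, gpu=False)
kmeans.train(X)

centroids = kmeans.centroids      # (k, d) final centroids
D, I = kmeans.index.search(X, 1)  # cluster assignment (nearest centroid) per point
```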
Over to you: What are some other limitations of the KMeans algorithm?
P.S. For those wanting to develop “Industry ML” expertise:
At the end of the day, all businesses care about impact. That’s it!
Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?
We have discussed several other topics (with implementations) that align with these goals.
Here are some of them:
Learn sophisticated graph architectures and how to train them on graph data: A Crash Course on Graph Neural Networks – Part 1.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here: Bi-encoders and Cross-encoders for Sentence Pair Similarity Scoring – Part 1.
Learn techniques to run large models on small devices: Quantization: Optimize ML Models to Run Them on Tiny Hardware.
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust: Conformal Predictions: Build Confidence in Your ML Model’s Predictions.
Learn how to identify causal relationships and answer business questions: A Crash Course on Causality – Part 1.
Learn how to scale ML model training: A Practical Guide to Scaling ML Model Training.
Learn techniques to reliably roll out new models in production: 5 Must-Know Ways to Test ML Models in Production (Implementation Included).
Learn how to build privacy-first ML systems: Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
Learn how to compress ML models and reduce costs: Model Compression: A Critical Step Towards Efficient Machine Learning.
All these resources will help you cultivate key skills that businesses and companies care about the most.