DBSCAN++: The Faster and Scalable Alternative to DBSCAN
Addressing major limitations of DBSCAN.
You have likely heard about DBSCAN clustering.
A common issue is its run-time, which grows quadratically with the dataset’s size.
DBSCAN++ is a faster and more scalable alternative, which we have discussed in detail (with implementation) here: DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.
Why care?
Non-parametric unsupervised algorithms are widely used in industry to analyze large datasets.
This is because in many practical applications, gathering true labels is pretty difficult or infeasible.
In such cases:
Either data teams annotate the data, which can be practically impossible at times.
Or, they use unsupervised methods to identify patterns.
While KMeans is widely used here due to its simplicity and effectiveness as a clustering algorithm, it has many limitations:
It does not account for cluster covariance.
It can only produce spherical clusters: even if the data has non-spherical clusters, it still forces them into round shapes, as shown below and in the sketch after this list.
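To make this concrete, here is a minimal sketch (assuming scikit-learn and its synthetic two-moons dataset, a standard non-spherical example) of KMeans forcing a straight split through two crescent-shaped clusters:

```python
# KMeans on two crescent-shaped (non-spherical) clusters: the straight
# decision boundary cuts across the crescents instead of following them.
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# A low score here reflects that KMeans cannot recover the crescents.
print("Adjusted Rand score:", adjusted_rand_score(y_true, labels))
```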
Density-based algorithms, like DBSCAN, address these limitations quite effectively.
The core idea behind DBSCAN is to group data points based on “density”: points that sit close together form a high-density cluster, and clusters are separated by regions of lower density.
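As a quick illustration, here is DBSCAN recovering the same two crescents from the sketch above (the eps and min_samples values are illustrative and typically need tuning per dataset):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: points needed for a dense region.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# A score near 1.0 indicates the crescents are recovered almost exactly.
print("Adjusted Rand score:", adjusted_rand_score(y_true, labels))
```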
Unfortunately, DBSCAN introduces an often-overlooked problem of its own: its run-time can make it infeasible on large datasets.
Until recently, it was believed that DBSCAN had a run-time of O(n log n), but it was proven to be O(n²) in the worst case.
In fact, this can also be verified from the figure below, which depicts DBSCAN’s run-time as a function of the number of samples:
It is clear that DBSCAN has a quadratic relation with the dataset’s size.
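If you want to verify this yourself, here is a rough timing sketch (assuming scikit-learn; the absolute numbers depend heavily on eps, the data distribution, and your hardware):

```python
# Time DBSCAN at increasing sample sizes to observe the growth trend.
import time
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
for n in [5_000, 10_000, 20_000, 40_000]:
    X = rng.normal(size=(n, 2))
    start = time.perf_counter()
    DBSCAN(eps=0.1, min_samples=5).fit(X)
    print(f"n={n:>6}: {time.perf_counter() - start:.2f}s")
```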
Thus, there is an increasing need to establish more efficient versions of DBSCAN.
DBSCAN++ is a major step towards a fast and scalable DBSCAN.
Simply put, DBSCAN++ is based on the observation that we only need to compute density estimates for a subset “m” of the “n” data points in the given dataset, where “m” can be much smaller than “n”, to cluster properly.
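Here is a minimal, illustrative sketch of that idea, assuming uniform random sampling of the m candidates. This is not the article’s vectorized implementation, and for brevity it attaches every point to its nearest core point instead of marking distant points as noise:

```python
# Sketch of the DBSCAN++ idea: estimate density only for m sampled
# candidates, find the core points among them, then attach everything
# else to its nearest core point's cluster.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_pp_sketch(X, m, eps=0.2, min_samples=5, seed=0):
    rng = np.random.default_rng(seed)
    subset = rng.choice(len(X), size=m, replace=False)

    # Step 1: estimate density only for the m sampled candidates,
    # counting their neighbors over the FULL dataset.
    nn = NearestNeighbors(radius=eps).fit(X)
    neighbor_counts = np.array(
        [len(ind) for ind in nn.radius_neighbors(X[subset])[1]]
    )
    core = X[subset][neighbor_counts >= min_samples]  # assumes some exist

    # Step 2: connect the core points (plain DBSCAN with min_samples=1
    # on this much smaller set, i.e., eps-connected components).
    core_labels = DBSCAN(eps=eps, min_samples=1).fit_predict(core)

    # Step 3: assign every point the cluster of its nearest core point.
    nearest_core = NearestNeighbors(n_neighbors=1).fit(core)
    idx = nearest_core.kneighbors(X, return_distance=False).ravel()
    return core_labels[idx]
```

Because density estimates are computed for only m of the n points, the dominant neighbor-search cost drops from roughly O(n²) to roughly O(mn).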
The effectiveness of DBSCAN++ is evident from the image below:
On a dataset of 60k data points:
DBSCAN++ is 20x faster than DBSCAN.
DBSCAN++ produces the same clustering scores as DBSCAN.
But how does it work end-to-end?
Why is selecting a subset of data points still effective?
What strategies can we use to select a subset of data points?
What other limitations does DBSCAN++ address?
DBSCAN++ is a relatively new algorithm, so what should you keep in mind to use it effectively?
If you are curious, we have gone into the full details here: DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.
Most real-world datasets are unlabeled and inherently quite large:
KMeans is typically not useful because of the strict assumptions stated above.
DBSCAN can work, but it suffers from long run-times.
As a result, awareness of algorithms like DBSCAN++ becomes pretty crucial.
Of course, I know that many of you might not have a clear understanding of DBSCAN itself.
So, I have provided a detailed overview of DBSCAN as well, which serves as a natural buildup for DBSCAN++.
Also, one issue with DBSCAN++ is that it is not integrated into Sklearn.
But don’t worry.
For this article, I wrote and explained a vectorized implementation of DBSCAN++.
Read it here: DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.
If you want to dive into more advanced clustering algorithms (with from-scratch implementations), here are the resources:
P.S. For those wanting to develop “Industry ML” expertise:
At the end of the day, all businesses care about impact. That’s it!
Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?
We have discussed several other topics (with implementations) in the past that align with these goals.
Here are some of them:
Learn sophisticated graph architectures and how to train them on graph data: A Crash Course on Graph Neural Networks – Part 1.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here: Bi-encoders and Cross-encoders for Sentence Pair Similarity Scoring – Part 1.
Learn techniques to run large models on small devices: Quantization: Optimize ML Models to Run Them on Tiny Hardware.
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust: Conformal Predictions: Build Confidence in Your ML Model’s Predictions.
Learn how to identify causal relationships and answer business questions: A Crash Course on Causality – Part 1.
Learn how to scale ML model training: A Practical Guide to Scaling ML Model Training.
Learn techniques to reliably roll out new models in production: 5 Must-Know Ways to Test ML Models in Production (Implementation Included).
Learn how to build privacy-first ML systems: Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
Learn how to compress ML models and reduce costs: Model Compression: A Critical Step Towards Efficient Machine Learning.
All these resources will help you cultivate key skills that businesses and companies care about the most.
SPONSOR US
Get your product in front of 105,000+ data scientists and machine learning professionals.
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.
To ensure your product reaches this influential audience, reserve your space here or reply to this email.
I appreciate you sharing!
Could you kindly advise us on how to minimize the noise points that HDBSCAN produces? If we wish to preserve every data point, how can we manage that? For instance, when clustering a large number of keywords with HDBSCAN, it also labels a bunch of keywords as noise.
I haven't used DBSCAN before because I've been using KMeans lately. With articles like this, there's much I can learn.