AI isn’t magic. It’s math.
Understand the concepts powering technology like ChatGPT in minutes a day with Brilliant.
Thousands of quick, interactive lessons in AI, programming, logic, data science, and more make it easy.
Try it free for 30 days here →
Thanks to Brilliant for partnering with us today!
Clustering Evaluation Without Labels
Evaluating clustering quality is usually difficult since we have no labels. Thus, we must rely on intrinsic measures instead.
Here are three metrics I commonly use:
1) Silhouette coefficient:
Here's the core idea:
If the average distance to all data points in the same cluster is small...
...but the average distance to points in the nearest other cluster is large...
...this indicates that the clusters are well separated and somewhat "reliable."
It is measured as follows:
For every data point:
A → average distance to all other points within its cluster.
B → average distance to all points in the nearest cluster.
score = (B − A) / max(A, B)
Next, compute the average of all scores to get the overall clustering score.
If B is much greater than A, the score approaches 1, indicating that the clusters are well separated.
Measuring it across a range of centroid counts (k) can reveal which clustering results are most promising:
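For instance, here's a minimal sketch using scikit-learn's silhouette_score; the toy dataset and the range of k are placeholder choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy dataset (placeholder for your own data)
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Average Silhouette score across a range of k values
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)  # mean of (B - A) / max(A, B) over all points
    print(f"k={k}: silhouette = {score:.3f}")
```

The k with the highest average score is typically the most promising choice.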
2) Calinski-Harabasz Index
The run-time of the Silhouette score grows quadratically with the number of data points.
The Calinski-Harabasz Index addresses this while capturing a similar intuition.
Here’s how it is measured:
A → sum of squared distances between the centroids and the dataset's center.
B → sum of squared distances between all points and their specific centroid.
The metric is computed as A/B (with an additional scaling factor).
If A is much greater than B, the score is much greater than 1, indicating that the clusters are well separated.
Calinski-Harabasz Index makes the same intuitive sense as the Silhouette Coefficient while being much faster to compute.
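Here's a minimal sketch with scikit-learn's calinski_harabasz_score, assuming the same toy setup as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Ratio of between-cluster (A) to within-cluster (B) dispersion,
    # times the scaling factor (n_samples - k) / (k - 1); higher is better.
    score = calinski_harabasz_score(X, labels)
    print(f"k={k}: CH index = {score:.1f}")
```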
3) DBCV
Silhouette score and Calinski-Harabasz index are typically higher for globular (spherical in the case of 3D) clusters.
Thus, using them on density-based clustering can produce misleading results.
DBCV (density-based clustering validation) solves this, and it computes two values:
The density within a cluster.
The density overlap between clusters.
A high density within a cluster and a low density overlap between clusters indicate good clustering results. The effectiveness of DBCV is evident from the image below:
As depicted above:
The clustering output of KMeans is visibly worse, yet its Silhouette score is higher than that of density-based clustering.
With DBCV, KMeans receives the worse (lower) score, and density-based clustering receives the higher one.
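DBCV isn't available in scikit-learn, but the hdbscan package ships an implementation (hdbscan.validity.validity_index). A minimal sketch, assuming hdbscan is installed; the two-moons dataset and min_cluster_size are placeholder choices:

```python
import numpy as np
from hdbscan import HDBSCAN
from hdbscan.validity import validity_index
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Non-globular toy data where density-based clustering shines
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X = X.astype(np.float64)  # validity_index expects float64 input

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
hdbscan_labels = HDBSCAN(min_cluster_size=15).fit_predict(X)

# DBCV lies in [-1, 1]; higher means denser, better-separated clusters
print("KMeans  DBCV:", validity_index(X, kmeans_labels))
print("HDBSCAN DBCV:", validity_index(X, hdbscan_labels))
```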
That said, we only covered centroid-based and density-based evaluation here.
You can read about distribution-based clustering and its evaluation here: Gaussian Mixture Models (GMMs).
Also, you can read about DBSCAN++ here: DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.
👉 Over to you: What are some other ways to evaluate clustering performance in such situations?
Final reminder
This is another reminder that lifetime access to Daily Dose of Data Science is available at 30% off.
The offer ends in 1.5 days. Join here: Lifetime membership.
It gives you lifetime access to the no-fluff, industry-relevant, and practical DS and ML resources that help you succeed and stay relevant in these roles:
Our recent 7-part crash course on building RAG systems.
LLM fine-tuning techniques and implementations.
Our crash courses on graph neural networks, PySpark, model interpretability, model calibration, causal inference, and more.
Scaling ML models with implementations.
Building privacy-preserving ML systems.
Mathematical deep dives on core DS topics, clustering, etc.
From-scratch implementations of several core ML algorithms.
Building 100% reproducible ML projects.
50+ more industry-relevant topics (usually 20+ minute reads covering several details).
Also, all weekly deep dives that we will publish in the future are included.
Join below at 30% off: Lifetime membership.
Our next price drop will not happen for at least 8-9 months. If you find value in this work, now is a great time to upgrade to lifetime access.
P.S. If you are an existing monthly or yearly member and wish to upgrade to lifetime, please reply to this email.