Identify Fuzzy Duplicates in a Dataset with Million Records
A clever technique to optimize the deduplication algorithm.
Data duplication is a big problem that many organizations face.
This creates problems because:
It wastes storage space.
It can lead to incorrect data analysis.
It can result in reporting errors and more.
As you are reading this, you might be thinking that we can remove duplicates by using common methods like df.drop_duplicates()
.
But what if the data has fuzzy duplicates?
Fuzzy duplicates are those records that are not exact copies of each other, but somehow, they appear to be the same. This is shown below:
The Pandas method will be ineffective because it will only remove exact duplicates.
So what can we do here?
Let’s imagine that your data has over a million records.
How would you identify fuzzy duplicates in this dataset?
One way could be to naively compare every pair of records, as depicted below:
We can formulate a distance metric for each field and generate a similarity score for each pair of records.
But this approach is infeasible at scale.
For instance, on a dataset with just a million records, comparing every pair of records will result in 10^12 comparisons (n^2).
Even if we assume a decent speed of 10,000 comparisons per second, this approach will take ~3 years to complete.
Can we do better?
Of course, we can!
But first, we need to understand a special property of fuzzy duplicates.
If two records are duplicates, they will certainly possess some lexical (or textual) overlap.
For instance, consider the below dataset:
Here, comparing the name “Daniel” to “Philip” or “Shannon” to “Julia” makes no sense. There is literally no lexical overlap.
Thus, they are guaranteed to be distinct records.
This makes intuitive sense as well.
If we are calling two records as “duplicates,” there must be some lexical overlap.
Yet, the naive approach will still waste time in comparing them.
We can utilize this “lexical overlap” property of duplicates to cleverly reduce the total comparisons.
More specifically, we segregate the data into smaller buckets by applying some rules.
For instance, consider the above dataset again. One rule could be to create buckets based on the first three letters of the first name.
Thus, we will only compare two records if they are in the same bucket.
If the first three letters are different, the records will fall into different buckets. Thus, they won’t be compared at all.
After segregating the records, we would have eliminated almost 98-99% of unnecessary comparisons that would have happened otherwise.
The figure “98-99%” comes from my practical experience on solving this problem on a dataset of such massive size.
Finally, we can use our naive comparison algorithm on each individual bucket.
The optimized approach can run in just a few hours instead of taking years.
This way, we can drastically reduce the run time and still achieve great deduplication accuracy.
Isn’t that cool?
Of course, we would have to analyze the data thoroughly to come up with the above data split rules.
But what is a more wise thing to do:
Using the naive approach, which takes three years to run, OR,
Spending some time analyzing the data, devising rules, and running the deduplication approach in a few hours?
👉 Over to you: Can you further optimize this approach?
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.
The button is located towards the bottom of this email.
Thanks for reading!
Whenever you are ready, here’s one more way I can help you:
Every week, I publish 1-2 in-depth deep dives (typically 20+ mins long). Here are some of the latest ones that you will surely like:
[FREE] A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.
You Are Probably Building Inconsistent Classification Models Without Even Realizing
Why Sklearn’s Logistic Regression Has no Learning Rate Hyperparameter?
PyTorch Models Are Not Deployment-Friendly! Supercharge Them With TorchScript.
DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
You Cannot Build Large Data Projects Until You Learn Data Version Control!
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 If you love reading this newsletter, feel free to share it with friends!
Thanks Avi. It is very useful for me.
Thanks Avi, very useful.