How Would You Identify Fuzzy Duplicates in a Dataset With a Million Records?
Imagine you have over a million records with fuzzy duplicates. How would you identify potential duplicates?
The naive approach of comparing every pair of records is infeasible at this scale: that's on the order of 10^12 comparisons (n^2). At a speed of 10,000 comparisons per second, it would take roughly 3 years to complete.
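Here's a quick back-of-the-envelope check of that estimate, using the same assumed record count and comparison speed as above:

```python
# Brute-force pairwise comparison: a rough time estimate
n = 1_000_000                # number of records
comparisons = n ** 2         # ~10^12 pairwise comparisons
speed = 10_000               # assumed comparisons per second

seconds = comparisons / speed
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years")  # ≈ 3.2 years
```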
The csvdedupe tool (linked at the end of this post) solves this by cleverly reducing the number of comparisons. For instance, comparing the name “Daniel” to “Philip”, or “Shannon” to “Julia”, makes no sense: such records are almost certainly distinct.
Thus, it groups the data into smaller buckets based on simple rules, a technique commonly known as blocking. One rule could be to group all records that share the first three letters of the name.
This way, it drastically reduces the number of comparisons while still catching the vast majority of true duplicates.
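To make the idea concrete, here is a minimal sketch of blocking using only Python's standard library. This is not how csvdedupe itself is implemented; the sample records and the 0.85 similarity threshold are purely illustrative:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative records; in practice these would come from your CSV.
records = [
    {"id": 1, "name": "Daniel Smith"},
    {"id": 2, "name": "Danial Smith"},   # fuzzy duplicate of id 1
    {"id": 3, "name": "Philip Jones"},
    {"id": 4, "name": "Shannon Lee"},
]

# Blocking rule: bucket records by the first three letters of the name.
buckets = defaultdict(list)
for record in records:
    key = record["name"][:3].lower()
    buckets[key].append(record)

# Compare records only within each bucket, never across the whole dataset.
THRESHOLD = 0.85  # illustrative similarity cutoff
for bucket in buckets.values():
    for a, b in combinations(bucket, 2):
        score = SequenceMatcher(None, a["name"], b["name"]).ratio()
        if score >= THRESHOLD:
            print(f"Potential duplicate: {a['name']} <-> {b['name']} ({score:.2f})")
```

Because only records that land in the same bucket are ever compared, the work scales with the bucket sizes rather than with all n^2 pairs.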
Read more: csvdedupe.
Find the code for my tips here: GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn.