Thanks Avi. It is very useful for me.
Thanks Avi, very useful.
Great post. But can you please share some code?
Hi Ron, someone can jump in if I'm missing something, but Avi describes a method that has a manual step. Identifying the lexical duplicates in his example amounts to checking for duplicate values in the name column. To find those duplicates in Python, you can use the pandas duplicated() method to subset the full dataset; in R, data.table (or dplyr/tidyr) has a very similar function. After that, you have to look through the subset and find a common rule like "all values are the same, but column c has a NaN", which I think would be the rule for Avi's illustrated examples. Hope this helps!
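To make that concrete, here's a minimal pandas sketch (the DataFrame and column names are made up for illustration, not from Avi's actual data):

import pandas as pd
import numpy as np

# Toy data in the spirit of the example: two rows share a name,
# but one of them is missing column c (hypothetical columns).
df = pd.DataFrame({
    "name": ["Acme Corp", "Acme Corp", "Widget Inc"],
    "b":    [10, 10, 25],
    "c":    [1.5, np.nan, 3.0],
})

# keep=False flags every row whose name occurs more than once.
dupes = df[df.duplicated(subset="name", keep=False)]
print(dupes)

# After eyeballing the subset, apply the rule
# "same values, but column c is NaN" -> drop the NaN copy.
deduped = df.drop(dupes[dupes["c"].isna()].index)
print(deduped)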
Further info on duplicated():
https://www.geeksforgeeks.org/pandas-dataframe-duplicated/amp/
Thanks Quinn
@Ron, this will help: https://github.com/dedupeio/dedupe. I wrote a PySpark implementation at a previous job, so unfortunately I don't have that code with me, but the GitHub library I linked works in a pretty similar way.
As far as rules are concerned, I mostly resorted to manual analysis to understand the data patterns: what kinds of names appear similar and what patterns they follow. For instance, in my case I had address information (with fuzzy matches, of course). One rule that worked pretty well there was "2 common tokens".
So essentially, if two addresses have 2 words in common, we group them in the same bucket. You can come up with any rule of your choice; it's that flexible. The GitHub library I shared provides a more limited set of predefined rules, I think, but they should cover a wide variety of fuzzy matching.
Sorry for not being able to help with the code, as I literally don't have it with me.
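That said, the "2 common tokens" idea is simple enough to sketch from scratch. Something like this rough plain-Python illustration (toy addresses and a greedy grouping loop I'm improvising here, not the original PySpark job):

# Hypothetical addresses, just to show the bucketing.
addresses = [
    "12 Baker Street London",
    "12 Baker St London",
    "45 Park Avenue New York",
    "45 Park Ave New York",
]

def common_tokens(a, b):
    # Number of words the two strings share, case-insensitive.
    return len(set(a.lower().split()) & set(b.lower().split()))

# Greedy grouping: an address joins the first bucket where it shares
# at least 2 tokens with some member, otherwise it starts a new bucket.
buckets = []
for addr in addresses:
    for bucket in buckets:
        if any(common_tokens(addr, member) >= 2 for member in bucket):
            bucket.append(addr)
            break
    else:
        buckets.append([addr])

print(buckets)

A real implementation would also need blocking so you don't compare every pair of records; that's the kind of thing the dedupe library handles for you.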
@Avi
Thank you. You provided a wonderful starting place.
Wonderful, Ron. Good luck :)