Thanks Avi. It is very useful for me.
Thanks Avi, very useful.
Great post. But can you please share some code?
Hi Ron, someone can jump in if I'm missing something, but Avi describes a method that has a manual step. Identifying the lexical duplicates in his example amounts to checking for duplicate values in the name column. To find those duplicates in Python, you can use the pandas duplicated() method to subset the full dataset; in R, data.table (or dplyr/tidyr) has a very similar function. After that, you have to look through the subset and find a common rule like "all values are the same, but column c has a NaN", which I think would be the rule for Avi's illustrated examples. Hope this helps!
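To make that concrete, here's a minimal pandas sketch (the DataFrame and column names are made up for illustration, not from Avi's actual data):

import pandas as pd
import numpy as np

# Toy data in the spirit of the example: two rows share a name,
# but one of them is missing column c (hypothetical columns).
df = pd.DataFrame({
    "name": ["Acme Corp", "Acme Corp", "Widget Inc"],
    "b":    [10, 10, 25],
    "c":    [1.5, np.nan, 3.0],
})

# keep=False flags every row whose name occurs more than once.
dupes = df[df.duplicated(subset="name", keep=False)]
print(dupes)

# After eyeballing the subset, apply the rule
# "same values, but column c is NaN" -> drop the NaN copy.
deduped = df.drop(dupes[dupes["c"].isna()].index)
print(deduped)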
Further info on duplicated():
https://www.geeksforgeeks.org/pandas-dataframe-duplicated/amp/
Thanks Quinn
@Ron, this will help: https://github.com/dedupeio/dedupe. I wrote a PySpark implementation at a previous job, so unfortunately I don't have that code with me, but the GitHub library I linked works in a pretty similar way.
As far as rules are concerned, I mostly resorted to manual analysis to understand the data patterns: what kinds of names appear similar and what patterns they follow. For instance, in my case I had address information (with fuzzy matches, of course). One rule that worked pretty well there was "2 common tokens".
So essentially, if two addresses have 2 words in common, we group them in the same bucket. You can come up with any rule of your choice; it's that flexible. The GitHub library I shared provides a more limited set of predefined rules, I think, but they should cover a wide variety of fuzzy matching.
Sorry for not being able to help with the code, as I literally don't have it with me.
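That said, the "2 common tokens" idea is simple enough to sketch from scratch. Something like this rough plain-Python illustration (toy addresses and a greedy grouping loop I'm improvising here, not the original PySpark job):

# Hypothetical addresses, just to show the bucketing.
addresses = [
    "12 Baker Street London",
    "12 Baker St London",
    "45 Park Avenue New York",
    "45 Park Ave New York",
]

def common_tokens(a, b):
    # Number of words the two strings share, case-insensitive.
    return len(set(a.lower().split()) & set(b.lower().split()))

# Greedy grouping: an address joins the first bucket where it shares
# at least 2 tokens with some member, otherwise it starts a new bucket.
buckets = []
for addr in addresses:
    for bucket in buckets:
        if any(common_tokens(addr, member) >= 2 for member in bucket):
            bucket.append(addr)
            break
    else:
        buckets.append([addr])

print(buckets)

A real implementation would also need blocking so you don't compare every pair of records; that's the kind of thing the dedupe library handles for you.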
@Avi
Thank you. You provided a wonderful starting place.
Wonderful, Ron. Good luck :)