6 Comments
Sep 14, 2023 (liked by Avi Chawla)

When I encountered fuzzy de-duplication in the past, it was usually sufficient to normalize the data. For phone numbers, you can remove all non-digit characters, and you can normalize addresses using a service like Smarty. Then do an exact comparison. However, normalization might not be good enough for all types of data.
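A minimal sketch of the phone-number case, assuming records are plain dicts with a `phone` key (the record layout and the helper names `normalize_phone` and `dedupe_by_phone` are made up for illustration). It strips every non-digit character and then de-duplicates by exact match on the normalized value:

```python
import re


def normalize_phone(raw: str) -> str:
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", raw)


def dedupe_by_phone(records):
    """Keep the first record seen for each normalized phone number."""
    seen = {}
    for rec in records:
        key = normalize_phone(rec["phone"])
        seen.setdefault(key, rec)  # later duplicates are ignored
    return list(seen.values())


# Hypothetical sample data: the first two rows differ only in formatting.
records = [
    {"name": "Daniel Lopez", "phone": "(937) 118-4929"},
    {"name": "Daniel", "phone": "937.118.4929"},
    {"name": "Srihari", "phone": "987-878-8928"},
]
unique = dedupe_by_phone(records)
```

This only handles formatting differences. Numbers written with and without a country code would still be treated as distinct, which is one place where normalization alone may fall short.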


The method you provided is far from perfect. There are significantly better methods available for checking fuzzy matching with a specified number of errors. You can search for NMSLIB. We have developed our own method that surpasses NMSLIB, providing faster results and matching more strings. - Panna Lal Patodia, CEO, Patodia Infotech Private Limited

author

Of course, Mr. Panna Lal. I never claimed that it is the best method :)

When I used it in my case, though, it worked out pretty well, and my team and I delivered a well-performing solution. Also, thanks for introducing me to NMSLIB. Can you provide some more resources for me to explore?

Thanks a lot!


After narrowing the data down to buckets, is it possible to convert the records to embeddings and apply nearest-neighbour search? If two records are close enough, can we mark them as the same? Please share your thoughts.


I mean cosine similarity. I did try this with ChatGPT; here is the reply.

Given the type of the data, one common approach to convert such records into embeddings is to use vector representations of text, like TF-IDF or word embeddings (e.g., Word2Vec, FastText). However, these embeddings usually work best for larger chunks of text. Here, we're working with structured data with potentially short string entries.

Let's take a simpler, high-level approach:

1. Preprocess and standardize each field.

2. Convert each field into a binary presence vector.

3. Calculate the cosine similarity between the two resulting vectors.

### Preprocessing Steps:

1. First_Name: Keep as is.

2. Last_Name: Handle NaN as a missing value.

3. Address: Convert 'St.' to 'Street' for consistency. Remove any punctuation and unnecessary whitespace.

4. Phone: Remove any punctuation and ensure it's in a consistent format.
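The four preprocessing steps above could be sketched as follows. The dict keys, the helper names, and the raw input values are assumptions made for illustration; only the cleaned output is taken from the example data in this reply.

```python
import math
import re


def clean_last_name(value):
    """Map a missing last name (None or float NaN) to None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return value


def clean_address(address: str) -> str:
    # Expand the standalone abbreviation "St" or "St." to "Street".
    address = re.sub(r"\bSt\b\.?", "Street", address)
    # Drop punctuation, then collapse repeated whitespace.
    address = re.sub(r"[^\w\s]", "", address)
    return re.sub(r"\s+", " ", address).strip()


def clean_phone(phone: str) -> str:
    # Keep digits only so every number has the same format.
    return re.sub(r"\D", "", phone)


def preprocess(record: dict) -> dict:
    return {
        "first_name": record["first_name"],  # kept as is
        "last_name": clean_last_name(record["last_name"]),
        "address": clean_address(record["address"]),
        "phone": clean_phone(record["phone"]),
    }


# Hypothetical raw record; the real input format is not shown in the thread.
raw = {
    "first_name": "Daniel",
    "last_name": float("nan"),
    "address": "719  Green St.,  East Rhoda",
    "phone": "(937) 118-4929",
}
cleaned = preprocess(raw)
```

Note that the `St` rule is naive: it would also turn "St. Louis" into "Street Louis", so a real pipeline would likely need a more careful address normalizer.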

### Data After Preprocessing:

1. Daniel, Lopez, 719 Greene Street East Rhonda, 9371184929

2. Daniel, NaN, 719 Green Street East Rhoda, 9371184929

### Convert to Binary Presence Vectors:

For simplicity, let's say:

1 = present and matches in both records,

0 = absent or different between the two records.

1. First_Name: Daniel = 1 (because it's present in both)

2. Last_Name: Lopez vs NaN = 0 (one is absent)

3. Address: Different = 0 (they are slightly different)

4. Phone: 9371184929 = 1 (they are the same)

So, the vectors for both records are:

1. [1, 0, 0, 1]

2. [1, 0, 0, 1]

### Cosine Similarity:

Using the formula:

\[ \text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} \]

Given the vectors are identical, the cosine similarity is 1.
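As a rough sketch of the computation above, assuming the preprocessed records are dicts with the four fields (the key names and function names are illustrative, not from the reply):

```python
import math

FIELDS = ("first_name", "last_name", "address", "phone")


def match_vector(a: dict, b: dict) -> list:
    """1 if a field is present in both records and equal, otherwise 0."""
    return [
        1 if a[f] is not None and b[f] is not None and a[f] == b[f] else 0
        for f in FIELDS
    ]


def cosine_similarity(u, v) -> float:
    """Plain cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


rec1 = {"first_name": "Daniel", "last_name": "Lopez",
        "address": "719 Greene Street East Rhonda", "phone": "9371184929"}
rec2 = {"first_name": "Daniel", "last_name": None,
        "address": "719 Green Street East Rhoda", "phone": "9371184929"}

vec = match_vector(rec1, rec2)           # one shared vector: [1, 0, 0, 1]
score = cosine_similarity(vec, vec)      # both records get the same vector
match_fraction = sum(vec) / len(vec)     # share of fields that agree
```

Because the vector describes the pair rather than each record, both records always receive the same vector, so the cosine score comes out at (approximately) 1 whenever at least one field matches. The fraction of matching fields, 0.5 here, may be the more informative number under this scheme.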

However, this is a simple and naive way to handle such data. Real-world applications might require more intricate embeddings, especially if the goal is to capture semantic similarities.
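One possible step up from exact field matches, offered here only as an illustrative sketch and not as the method from the article or from this reply, is to turn each string into a vector of character trigram counts and take the cosine similarity of those. Near-duplicates such as the two addresses above then get a graded score instead of a flat 0. The function names and the choice of trigrams are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the lowercased, padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty input
    return dot / (norm_a * norm_b)


addr1 = "719 Greene Street East Rhonda"
addr2 = "719 Green Street East Rhoda"
addr3 = "666 Red Street West Roda"

near = cosine_similarity(char_ngrams(addr1), char_ngrams(addr2))
far = cosine_similarity(char_ngrams(addr1), char_ngrams(addr3))
```

Here `near` comes out higher than `far`, so a threshold on the score could be used to flag likely duplicates within a bucket. The threshold itself would have to be tuned on real data.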

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

For totally distinct values, here is the response:

Given this new data, we'll still use the previously mentioned method: preprocess and standardize each field, convert each field into a binary presence vector, and then compute the cosine similarity.

### Preprocessing Steps:

1. First_Name: Use as is.

2. Last_Name: Handle NaN as a missing value.

3. Address: Convert 'St.' to 'Street' and 'Str' to 'Street' for consistency. Remove punctuation and extra whitespace.

4. Phone: Remove any punctuation and ensure it's in a consistent format.

### Data After Preprocessing:

1. Daniel, Lopez, 719 Greene Street East Rhonda, 9371184929

2. Srihari, NaN, 666 Red Street West Roda, 9878788928

### Convert to Binary Presence Vectors:

For simplicity, we use:

1 = present and matches in both records,

0 = absent or different between the two records.

1. First_Name: Different names = 0

2. Last_Name: Lopez vs NaN = 0 (one is absent)

3. Address: Different addresses = 0

4. Phone: Different numbers = 0

So, the vectors for both records are:

1. [0, 0, 0, 0]

2. [0, 0, 0, 0]

### Cosine Similarity:

Given that both vectors are zero-vectors (all components are 0), the cosine similarity is undefined because the magnitude of both vectors is 0 (which would result in a division by zero). In practical terms, we can say that the records are completely dissimilar, although the formal calculation is undefined.
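In code, the undefined case has to be handled explicitly, or the division raises `ZeroDivisionError`. A minimal sketch, assuming we adopt the convention suggested above of treating zero vectors as completely dissimilar (the function name and the `default` parameter are illustrative):

```python
import math


def safe_cosine_similarity(u, v, default=0.0):
    """Cosine similarity that returns `default` if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return default  # formally undefined; treated as "no similarity"
    return dot / (norm_u * norm_v)


zero_case = safe_cosine_similarity([0, 0, 0, 0], [0, 0, 0, 0])
match_case = safe_cosine_similarity([1, 0, 0, 1], [1, 0, 0, 1])
```

Returning `0.0` is a convention, not a mathematical result. Passing `default=None` instead would let the caller tell "undefined" apart from a genuinely computed score of 0.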

Remember, this approach is a very basic one that doesn't capture any nuanced similarities between records. Advanced methods that handle this kind of data often employ more sophisticated embeddings and similarity metrics, especially when working with larger datasets or when the requirement is to capture more subtle similarities.
