In my experience, many unexpected errors in libraries/tools/languages can be attributed to the presence of missing values.
For instance, consider a 1D NumPy array with NaN values.
When we aggregate this NumPy array to calculate, say, its sum, we get the following output:
Strange, right?
Although an output of 6
may have made more sense here, NumPy produces a NaN
value instead.
Anyway, this is not the topic of today’s post, but I hope you get the point.
A similar silent mistake can be found in SQL as well, specifically in the usage of “NOT IN
” clause, which many SQL users are not aware of.
Let’s understand!
Consider we have the following tables (students
and names
), in our database:
The task is to select records from students
table where first_name
is not in the names
table.
One way to do this is by using the NOT IN
clause, as demonstrated below:
This provides the expected results as well.
Everything’s good so far!
Now, say our names
table had a NULL value:
If we run the above query again, we get no records this time:
On a side note, had we used the IN clause to select rows from the students
table where first_name
was in the names
table, we notice that it works as expected:
That’s strange, isn’t it?
Why do we get no records with NOT IN?
The reason we get no records when we use NOT IN
but the IN
clause works as expected has to do with how these two clauses operate internally.
For simplicity, consider we are currently checking the first record (where first_name
is “John”) from the students
table:
The WHERE
clause needs a boolean value to determine whether a record must be filtered or not.
When we use “IN
”, this boolean value is evaluated using the OR
operator as follows:
If any condition is TRUE
, the row gets filtered.
However, when we use “NOT IN
”, the boolean value is evaluated using the AND
operator as follows:
For the above expression to be TRUE
, all individual conditions MUST be TRUE
.
But the (JOHN != None
) condition produces a conflict because, typically, this condition results in an UNKNOWN
value.
Thus, the entire expression evaluates to UNKNOWN
— producing no records.
This happens for every record in the students
table. As a result, the query results in no records:
What to do instead?
There are many ways to avoid this.
As our task is to select records from students
table where first_name
is not in the names
table, we can:
Filter out the NULL values in the sub-query:
Use Anti Joins: Anti join returns only those rows from the left table where no match is found in the right table. This is precisely what we need in our case and it is implemented below:
At times, such mistakes can take some serious time to debug if you are not aware of them beforehand.
If you wish to experiment with what we discussed about the NOT IN
clause today, download this Jupyter Notebook: SQL NOT IN Notebook.
👉 Over to you: Can you propose another solution for the problem above?
Are you overwhelmed with the amount of information in ML/DS?
Every week, I publish no-fluff deep dives on topics that truly matter to your skills for ML/DS roles.
For instance:
Conformal Predictions: Build Confidence in Your ML Model’s Predictions
Quantization: Optimize ML Models to Run Them on Tiny Hardware
5 Must-Know Ways to Test ML Models in Production (Implementation Included)
8 Fatal (Yet Non-obvious) Pitfalls and Cautionary Measures in Data Science
Implementing Parallelized CUDA Programs From Scratch Using CUDA Programming
You Are Probably Building Inconsistent Classification Models Without Even Realizing
And many many more.
Join below to unlock all full articles:
SPONSOR US
Get your product in front of 85,000 data scientists and other tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.
To ensure your product reaches this influential audience, reserve your space here or reply to this email to ensure your product reaches this influential audience.
useful tip nicely explained 👍 thanks!
Many Thank!