In my experience, many unexpected errors in libraries/tools/languages can be attributed to the presence of missing values.
For instance, consider a 1D NumPy array with NaN values.
When we aggregate this NumPy array to calculate, say, its sum, we get the following output:
Strange, right?
Although an output of 6
may have made more sense here, NumPy produces a NaN
value instead.
Anyway, this is not the topic of today’s post, but I hope you get the point.
A similar silent mistake can be found in SQL as well, specifically in the usage of “NOT IN
” clause, which many SQL users are not aware of.
Let’s understand!
Consider we have the following tables (students
and names
), in our database:
The task is to select records from students
table where first_name
is not in the names
table.
One way to do this is by using the NOT IN
clause, as demonstrated below:
This provides the expected results as well.
Everything’s good so far!
Now, say our names
table had a NULL value:
If we run the above query again, we get no records this time:
On a side note, had we used the IN clause to select rows from the students
table where first_name
was in the names
table, we notice that it works as expected:
That’s strange, isn’t it?
Why do we get no records with NOT IN?
The reason we get no records when we use NOT IN
but the IN
clause works as expected has to do with how these two clauses operate internally.
For simplicity, consider we are currently checking the first record (where first_name
is “John”) from the students
table:
The WHERE
clause needs a boolean value to determine whether a record must be filtered or not.
When we use “IN
”, this boolean value is evaluated using the OR
operator as follows:
If any condition is TRUE
, the row gets filtered.
However, when we use “NOT IN
”, the boolean value is evaluated using the AND
operator as follows:
For the above expression to be TRUE
, all individual conditions MUST be TRUE
.
But the (JOHN != None
) condition produces a conflict because, typically, this condition results in an UNKNOWN
value.
Thus, the entire expression evaluates to UNKNOWN
— producing no records.
This happens for every record in the students
table. As a result, the query results in no records:
What to do instead?
There are many ways to avoid this.
As our task is to select records from students
table where first_name
is not in the names
table, we can:
Filter out the NULL values in the sub-query:
Use Anti Joins: Anti join returns only those rows from the left table where no match is found in the right table. This is precisely what we need in our case and it is implemented below:
At times, such mistakes can take some serious time to debug if you are not aware of them beforehand.
Hope that helped!
If you wish to experiment with what we discussed about the NOT IN
clause today, download this Jupyter Notebook: NOT IN Clause Notebook.
👉 Over to you: Can you propose another solution for the problem above?
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.
The button is located towards the bottom of this email.
Thanks for reading!
Latest full articles
If you’re not a full subscriber, here’s what you missed last month:
DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning
You Cannot Build Large Data Projects Until You Learn Data Version Control!
Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.
Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit
Gaussian Mixture Models (GMMs): The Flexible Twin of KMeans.
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you love reading this newsletter, feel free to share it with friends!
Very weird how the `NOT IN` logic is implemented. I assumed it would've just been the negation (`~`) of the `IN` logic's operator since that would be much more efficient. Also, first time hearing about ANTI JOINs!
Very nice!