Thank you!
A couple of days back, I completed two years of writing this newsletter. Big thanks to everyone who has found value here and supported the work :)
I am immensely grateful to every one of you who make time every single day and look forward to reading this newsletter.
Growing to 100k readers hasn’t been easy, but you have made it a fulfilling endeavor.
Thanks again :)
Let’s get to today’s issue now!
We all have heard of LEFT JOIN, RIGHT JOIN, INNER JOIN, and OUTER JOIN, haven’t we?
This visual from ByteByteGo neatly summarizes them:
Here’s a short description:
Inner Join: Returns only matching rows from both tables.
Left Join: Returns all rows from the left table and matching rows from the right. Non-matching rows from the left table contain null values.
Right Join: Returns all rows from the right tables and matching rows from the left. Non-matching rows from the right table contain null values.
Full Outer Join: Returns all rows when there is a match in either table. Non-matching rows contain null values.
These four are the most prevalent types of SQL joins.
But if you use DuckDB, you’ll find more.
And today, I want to introduce you to three of them which I find pretty handy at times.
These are:
Semi Join
Anti Join
Natural Join
Let’s understand them today!
#1) Semi-Join
Semi-join appears quite similar to a left join, but there are three notable differences:
If the join condition between two rows is
TRUE
, columns from only the left table are returned. Compare this to the left join, which returns columns from both tables.
If a row in the left table has no match, then that row is not returned. In left join, however, all rows from the left table are returned irrespective of whether they have a match or not.
If a row in the left table has multiple matches, only one entry is returned. In left join, however, multiple matches are returned an equivalent number of times.
For instance, consider we have the following two tables:
Executing a semi-join, we get the following results:
As depicted above, unlike left-join:
It only returns columns from the left table.
It only returns the matched rows from the left table.
If a record has multiple matches, like in this case:
…then we notice that semi-join only returns one record from the left table:
I find semi-join to be particularly useful when I only care about the existence of records in another table. Left join returns duplicates which are not of interest at that point.
#2) Anti-Join
The rows discarded by the semi-join from the left table are the results of an anti-join.
So, in a way, we can say that:
Consider the above orders
and users
table again (one in which there were no multiple matches):
Executing an anti-join, we get the following results:
It is clear from the semi-join and anti-join results that:
Of course, the order can be different. When I say “[SEMI JOIN] + [ANTI JOIN] = [LEFT TABLE]
”, I mean the collection of all records.
I find anti-join to be particularly useful when I wish to know which records do not exist in another table.
#3) Natural Join
This one is similar to INNER JOIN, but there’s no need to specify a join condition explicitly.
Instead, it automatically considers a join condition on ALL the matching column names.
Consider the users and orders table yet again:
Here, the User_ID
column is present in both tables.
Executing a natural join, we get the following results:
As depicted above, the results are similar to what we would get with INNER JOIN.
However, we did not have to explicitly specify a JOIN condition, which, admittedly, could be good or bad.
It is good because it helps us write concise queries.
It is bad because we are not explicit about the columns being joined.
These were three more types of SQL Joins, available in DuckDB, which I use at times to write concise and elegant SQL queries.
Hope you learned something new.
The visual below neatly summarizes what we discussed today:
If you wish to experiment with what we discussed today, download this Jupyter Notebook: Semi-Anti-Natural Join Notebook.
👉 Over to you: Have I missed any other type of SQL Join?
For those wanting to develop “Industry ML” expertise:
We have discussed several other topics (with implementations) in the past that align with “industry ML.”
Here are some of them:
Quantization: Optimize ML Models to Run Them on Tiny Hardware
Conformal Predictions: Build Confidence in Your ML Model’s Predictions
5 Must-Know Ways to Test ML Models in Production (Implementation Included)
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning
Model Compression: A Critical Step Towards Efficient Machine Learning
At the end of the day, all businesses care about impact. That’s it!
Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?
All these resources will help you cultivate those key skills.