We all have heard of LEFT JOIN, RIGHT JOIN, INNER JOIN, and OUTER JOIN, haven’t we?
This visual from ByteByteGo neatly summarizes them:
Here’s a short description:
Inner Join: Returns only matching rows from both tables.
Left Join: Returns all rows from the left table and matching rows from the right. Non-matching rows from the left table contain null values.
Right Join: Returns all rows from the right tables and matching rows from the left. Non-matching rows from the right table contain null values.
Full Outer Join: Returns all rows when there is a match in either table. Non-matching rows contain null values.
These four are the most prevalent types of SQL joins.
But there are more.
And today, I want to introduce you to three of them which I find pretty handy at times.
These are:
Semi Join
Anti Join
Natural Join
If you already know them, you can stop reading here.
If not, let’s understand!
#1) Semi-Join
Semi-join appears quite similar to a left join, but there are three notable differences:
If the join condition between two rows is
TRUE
, columns from only the left table are returned. Compare this to the left join, which returns columns from both tables.
If a row in the left table has no match, then that row is not returned. In left join, however, all rows from the left table are returned irrespective of whether they have a match or not.
If a row in the left table has multiple matches, only one entry is returned. In left join, however, multiple matches are returned an equivalent number of times.
For instance, consider we have the following two tables:
Executing a semi-join, we get the following results:
As depicted above, unlike left-join:
It only returns columns from the left table.
It only returns the matched rows from the left table.
If a record has multiple matches, like in this case:
…then we notice that semi-join only returns one record from the left table:
I find semi-join to be particularly useful when I only care about the existence of records in another table. Left join returns duplicates which are not of interest at that point.
#2) Anti-Join
The rows discarded by the semi-join from the left table are the results of an anti-join.
So, in a way, we can say that:
Consider the above orders
and users
table again (one in which there were no multiple matches):
Executing an anti-join, we get the following results:
It is clear from the semi-join and anti-join results that:
Of course, the order can be different. When I say “[SEMI JOIN] + [ANTI JOIN] = [LEFT TABLE]
”, I mean the collection of all records.
I find anti-join to be particularly useful when I wish to know which records do not exist in another table.
#3) Natural Join
This one is similar to INNER JOIN, but there’s no need to specify a join condition explicitly.
Instead, it automatically considers a join condition on ALL the matching column names.
Consider the users and orders table yet again:
Here, the User_ID
column is present in both tables.
Executing a natural join, we get the following results:
As depicted above, the results are similar to what we would get with INNER JOIN.
However, we did not have to explicitly specify a JOIN condition, which, admittedly, could be good or bad.
It is good because it helps us write concise queries.
It is bad because we are not explicit about the columns being joined.
These were three more types of SQL Joins, which I use at times to write concise and elegant SQL queries.
Hope you learned something new.
The visual below neatly summarizes what we discussed today:
If you wish to experiment with what we discussed today, download this Jupyter Notebook: Semi-Anti-Natural Join Notebook.
👉 Over to you: Have I missed any other type of SQL Join?
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.
The button is located towards the bottom of this email.
Thanks for reading!
Latest full articles
If you’re not a full subscriber, here’s what you missed last month:
You Are Probably Building Inconsistent Classification Models Without Even Realizing
Why Sklearn’s Logistic Regression Has no Learning Rate Hyperparameter?
PyTorch Models Are Not Deployment-Friendly! Supercharge Them With TorchScript.
How To (Immensely) Optimize Your Machine Learning Development and Operations with MLflow.
Don’t Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark.
DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
You Cannot Build Large Data Projects Until You Learn Data Version Control!
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you love reading this newsletter, feel free to share it with friends!
The natural join is actually the default behavior when merging dataframes with pandas.merge
Well explained... The examples made me understand that joins more clearly