It is standard practice to split the available data into train, validation, and test sets.
However, there are quite a few misconceptions about how they are meant to be used, especially the validation and test sets.
Today, let’s clear them up and see how to use train, validation, and test sets properly.
Let’s begin!
As we all know, we begin by splitting the data into:
Train
Validation
Test
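To make the split concrete, here is a minimal sketch using only Python’s standard library. The helper name three_way_split, the 70/15/15 proportions, and the toy data are assumptions for illustration; in practice you would likely use your library’s own splitting utility.

```python
import random

def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows once and cut them into train, validation, and test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))  # stand-in for 100 labeled examples (assumption)
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))
```

Every example lands in exactly one of the three sets, which is the property that matters for everything that follows.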
At this point, just assume that the test data does not even exist. Forget about it instantly.
Begin with the train set. This is your whole world now.
You analyze it
You transform it
You use it to determine features
You fit a model on it
After modeling, you would want to measure the model’s performance on unseen data, wouldn’t you?
Bring in the validation set now.
Based on validation performance, improve the model.
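Here is a rough sketch of what “fit on train, score on validation, improve” can look like. The toy 1-D data, the k-nearest-neighbor model, and the candidate values of k are all assumptions chosen to keep the example dependency-free. Notice that no test set appears anywhere in it.

```python
import random

def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train_rows, eval_rows, k):
    """Fraction of eval_rows that a k-NN model built on train_rows gets right."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

rng = random.Random(0)

def make_rows(n):
    """Toy data: label is 1 when x > 0.5, with roughly 10% of labels flipped."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        rows.append((x, y))
    return rows

train_rows, val_rows = make_rows(200), make_rows(60)

# Each candidate is "fit" on train and scored on validation only.
val_scores = {k: accuracy(train_rows, val_rows, k) for k in (1, 5, 15)}
best_k = max(val_scores, key=val_scores.get)
print(val_scores, best_k)
```

The validation score is what drives the choice of k here, which is exactly why that score stops being an unbiased estimate once you have tuned against it many times.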
Here’s how you iteratively build your model:
Train using the train set
Evaluate it using the validation set
Improve the model
Evaluate again using the validation set
Improve the model again
and so on.
Until...
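You reach a point where you start overfitting the validation set: the validation score keeps improving because your choices are tuned to the quirks of that particular set, not because the model generalizes better.
This indicates that you have exploited (or polluted) the validation set, and its score is no longer a trustworthy estimate of performance on unseen data.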
No worries.
Merge it with the train set and generate a new split of train and validation.
Note: Rely on cross-validation if needed, especially when you don’t have much data. You can also use it when you have plenty of data, but it can be computationally intensive because the model is retrained once per fold. Here’s a newsletter issue on cross-validation.
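As a sketch of both ideas, merging the old validation set back in and rotating the validation role across folds, here is a small k-fold splitter in plain Python. The function name k_fold_splits, the choice of 5 folds, and the integer stand-in data are assumptions. The test set stays out of the pool entirely.

```python
import random

def k_fold_splits(rows, k=5, seed=0):
    """Yield (train, validation) pairs; each row is in validation exactly once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        val_fold = folds[i]
        train_folds = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train_folds, val_fold

# Stand-ins for an existing train set and an over-used validation set.
old_train, old_val = list(range(0, 80)), list(range(80, 100))

pool = old_train + old_val               # merge the polluted validation set back in
splits = list(k_fold_splits(pool, k=5))  # five fresh train/validation splits
print([len(v) for _, v in splits])
```

Averaging the validation score over the folds gives an estimate that depends less on any single lucky or unlucky split, at the cost of k training runs.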
Now, if you are happy with the model’s validation performance, evaluate it on the test set.
✅ What you use a test set for:
Get a final, unbiased estimate of how the model performs on unseen data.
❌ What you DON’T use a test set for:
Analysis, feature selection, hyperparameter tuning, or any other modeling decision.
If the model is underperforming on the test set, no problem.
Go back to the modeling stage and improve it.
BUT here’s what most people do wrong:
They evaluate the improved model on the same test set again.
This is not allowed!
Think of it this way.
Your professor taught you in the classroom. All in-class lessons and examples are the train set.
The professor gave you take-home assignments, which acted like validation sets.
You got some wrong and some right.
Based on this, you revisited the topics you got wrong and strengthened your fundamentals, i.e., improved the model.
Now, if you keep solving the same take-home assignment repeatedly, you will eventually overfit it, won’t you?
That is why we bring in a new validation set after some iterations.
The final exam paper is your test set.
If you do well, awesome!
But if you fail, the professor cannot give you the exact same exam paper next time, can they? This is because you already know what’s inside.
Of course, when you evaluate a model on the test set, the model itself never gets to “know” the precise examples inside that set.
But you do: the test set has now been exposed to the person making the modeling decisions.
Every change you make after seeing the test score is informed by that score, so any further evaluation on that same test set will be optimistically biased.
That is why you must always use a specific test set only ONCE.
Once you do, merge it with the train and validation sets and generate an entirely new split.
Repeat.
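One way to enforce the “only ONCE” rule in code is to wrap the test set in an object that refuses a second evaluation. The class below is purely hypothetical (OneShotTestSet, evaluate, and release are names invented for this sketch, not part of any library), and the four-row test set is made-up data.

```python
class OneShotTestSet:
    """Hypothetical wrapper that lets a held-out set be scored exactly once."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._used = False

    def evaluate(self, predict):
        """Return the accuracy of predict on the held-out rows, then lock the set."""
        if self._used:
            raise RuntimeError("test set already used; merge it back and re-split")
        self._used = True
        hits = sum(predict(x) == y for x, y in self._rows)
        return hits / len(self._rows)

    def release(self):
        """Hand the rows back so they can be merged into a new pool."""
        return list(self._rows)

test_set = OneShotTestSet([(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)])
final_score = test_set.evaluate(lambda x: int(x > 0.5))  # the one allowed look

try:
    test_set.evaluate(lambda x: int(x > 0.3))  # a second look is refused
    second_attempt_blocked = False
except RuntimeError:
    second_attempt_blocked = True

print(final_score, second_attempt_blocked)
```

After the single evaluation, release() returns the rows so they can be pooled with the train and validation data before a fresh three-way split.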
And that is how you use train, validation, and test sets in machine learning.
Hope that helped!
That said, there’s a situation where random splitting can be fatal. We discussed it in an earlier issue of this newsletter.
Moreover, you can learn about 8 more fatal (and non-obvious) pitfalls here: 8 Fatal (Yet Non-obvious) Pitfalls and Cautionary Measures in Data Science.
👉 Over to you: While this may sound simple, there are quite a few things to care about, like avoiding data leakage. What are some other things that come to your mind?
Thanks for reading!