Are You Sure You Are Using the Train, Validation, and Test Sets Correctly?
Here's the explanation you always wanted to read, along with an intuitive analogy.
It is pretty conventional and well known to split the given data into train, validation, and test sets.
However, many folks don’t use them the way they are meant to be used, especially the validation and test sets.
Today, let’s clear up some misconceptions and see how to truly use train, validation, and test sets.
Let’s begin!
As we all know, we begin by splitting the data into three sets (a quick code sketch follows this list):
Train
Validation
Test
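Here's a minimal sketch of such a split with scikit-learn. The 60/20/20 ratios and the toy dataset are assumptions purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data purely for illustration.
X, y = make_classification(n_samples=1_000, random_state=42)

# Carve out the test set first (20% of the data), then split
# the rest into train (60%) and validation (20%).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 of 80% = 20%
)
```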
At this point, just assume that the test data does not even exist. Forget about it entirely.
Begin with the train set. This is your whole world now.
You analyze it
You transform it
You use it to determine features
You fit a model on it
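For instance, any transformation (like scaling) should learn its parameters from the train set alone. A minimal sketch, continuing from the split above; the scaler and model here are arbitrary choices:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The pipeline fits the scaler on the train set only,
# so no validation/test statistics leak into training.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
```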
After modeling, you would want to measure the model’s performance on unseen data, wouldn’t you?
Bring in the validation set now.
Based on validation performance, improve the model.
Here’s how you iteratively build your model:
Train using a train set
Evaluate it using the validation set
Improve the model
Evaluate again using the validation set
Improve the model again
and so on.
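In code, that loop might look something like this, continuing the sketch from above (the hyperparameter grid is just a placeholder):

```python
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

best_score, best_model = -1.0, None

# Train on the train set, evaluate on the validation set,
# and keep the best candidate so far.
for C in [0.01, 0.1, 1.0, 10.0]:
    candidate = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    candidate.fit(X_train, y_train)
    score = accuracy_score(y_val, candidate.predict(X_val))
    if score > best_score:
        best_score, best_model = score, candidate
```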
Until...
You reach a point where you start overfitting the validation set.
This indicates that you have exploited (or polluted) the validation set.
No worries.
Merge it with the train set and generate a new split of train and validation.
Note: Rely on cross-validation if needed, especially when you don’t have much data. You can still use it when data is plentiful, but it can get computationally expensive. Here’s a newsletter issue on cross-validation.
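If you go the cross-validation route, here's a quick sketch with scikit-learn's cross_val_score, run on the merged train and validation data from the sketches above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# Merge train and validation, then run 5-fold cross-validation.
X_trval = np.concatenate([X_train, X_val])
y_trval = np.concatenate([y_train, y_val])

scores = cross_val_score(best_model, X_trval, y_trval, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```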
Now, if you are happy with the model’s performance, evaluate it on test data.
✅ What you use a test set for:
Get a final and unbiased review of the model.
❌ What you DON’T use a test set for:
Analysis, decision-making, etc.
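Continuing the sketch from above, that one-time check could be as simple as:

```python
from sklearn.metrics import accuracy_score

# Final, one-time evaluation. The only decision made here is
# "ship it" or "go back to the modeling stage".
test_score = accuracy_score(y_test, best_model.predict(X_test))
print(f"Final test accuracy: {test_score:.3f}")
```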
If the model is underperforming on the test set, no problem.
Go back to the modeling stage and improve it.
BUT (and here’s what most people do wrong)!
They use the same test set again.
This is not allowed!
Think of it this way.
Your professor taught you in the classroom. All in-class lessons and examples are the train set.
The professor gave you take-home assignments, which acted like validation sets.
You got some wrong and some right.
Based on this, you adjusted your topic fundamentals, i.e., improved the model.
Now, if you keep solving the same take-home assignment repeatedly, you will eventually overfit it, won’t you?
That is why we bring in a new validation set after some iterations.
The final exam day paper is your test set.
If you do well, awesome!
But if you fail, the professor cannot give you the exact same exam paper next time, can they? After all, you already know what’s inside.
Of course, evaluating a model on the test set does not mean the model gets to “know” the precise examples inside it.
But the issue is that the test set has been exposed now.
Your previous evaluation will inevitably influence any further evaluations on that specific test set.
That is why you must always use a specific test set only ONCE.
Once you do, merge it with the train and validation sets and generate an entirely new split.
Repeat.
And that is how you use train, validation, and test sets in machine learning.
Hope that helped!
👉 Over to you: While this may sound simple, there are quite a few things to care about, like avoiding data leakage. What are some other things that come to your mind?
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.
The button is located towards the bottom of this email.
Thanks for reading!
Latest full articles
If you’re not a full subscriber, here’s what you missed last month:
You Cannot Build Large Data Projects Until You Learn Data Version Control!
Why Bagging is So Ridiculously Effective At Variance Reduction?
Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.
Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit
Gaussian Mixture Models (GMMs): The Flexible Twin of KMeans.
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you love reading this newsletter, feel free to share it with friends!
In a time series context, where we learn from the past to predict the future, we may have to make do with a single test set consisting of the most recent x% of data. I wouldn't want to throw away a chunk of data every time I evaluate a model. But we would want to evaluate on the test set only when absolutely necessary; for example, to validate the final model before deployment.
So when you merge the test set with the training and validation sets and then create an entirely new split (yielding a new test set), can I use this new test set the same way, even though the model was technically exposed to all of the previous data (train, validation, and test)?