Everyone knows about the train, test, and validation sets. But very few understand how to use them correctly.
Here’s what you should know about splitting data and using it for ML models.
Begin by splitting the data into:
Train
Validation
Test
Now, assume that the test data does not even exist. Forget about it instantly.
Begin with the train set. This is your whole world now.
Analyze it
Transform it
Model it
As you finish modeling, you’d want to measure the model’s performance on unseen data.
Bring in the validation set now.
Based on validation performance, improve the model.
Here’s how you iteratively build your model:
Train using a train set
Evaluate it using the validation set
Improve the model
Evaluate again using the validation set
Improve the model again
and so on.
Until...
You start overfitting the validation set. This indicates that you exploited and polluted the validation set.
No worries.
Merge it with the train set and generate a new split of train and validation.
Note: Rely on cross-validation if needed, especially when you don’t have much data. You may still use cross-validation if you have enough data. But it can be computationally intensive.
Now, if you are happy with the model’s performance, evaluate it on test data.
👉 What you use a test set for:
Get a final and unbiased review of the model.
👉 What you DON’T use a test set for:
Analysis, decision-making, etc.
If the model is underperforming on the test set, no problem. Go back to the modeling stage and improve it.
BUT (and here’s what most people do wrong)!
They use the same test set again.
Not allowed!
Think of it this way.
Your professor taught you in class. All in-class lessons and examples are the train set.
The professor gave you take-home assignments, which act like validation sets.
You get some wrong and some right.
Based on this, you adjust your topic fundamentals, i.e., improve the model.
If you keep solving the same take-home assignment, then you will eventually overfit it, won’t you?
That is why we bring in a new validation set after some iterations.
The final exam day paper is your test set.
If you do well, awesome!
But if you fail, the professor cannot give you the same exam paper next time, can they? You know what’s inside.
That is why we always use a specific test set only ONCE.
Once you do, merge it with the train and validation set and generate a new split.
Repeat.
👉 Over to you: While this may sound simple, there are quite a few things to care about, like avoiding data leakage. What are some other things that come to your mind?
👉 Read what others are saying about this post on LinkedIn.
👉 Tell the world what makes this newsletter special for you by leaving a review here. It would mean the world to me :)
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.
👉 If you love reading this newsletter, feel free to share it with friends!
Find the code for my tips here: GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
This is great explanation but not to clear on it.