During model development, it is common to hit a point where, no matter what you try, the model performance barely improves:
Feature engineering yields only marginal improvements.
Trying different models does not produce satisfactory results either.
And more…
This is usually (not always) an indicator that the model is data deficient.
In other words, we don’t have enough data to work with.
However, gathering new data can be a time-consuming and tedious process.
So before heading in that direction, it would be good to get some insight into whether new data will actually help.
Here’s a trick I have often used to determine this.
Data subsetting and Model building
Let’s say you have a full training set and a separate validation set.
Divide the training dataset into “k” equal parts. The validation set remains as is.
“k” does not have to be super large. Any number between 7 and 12 is fine, depending on how much data you have. If there’s plenty of data, setting a low value in this range is recommended (you will understand why shortly).
Next, train models cumulatively on the above subsets and measure their performance on the validation set:
Train a model on the first subset and evaluate it on the validation set.
Train a model on the first two subsets and evaluate it on the validation set.
Train a model on the first three subsets and evaluate it on the validation set.
And so on…
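The cumulative procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper name `cumulative_subset_scores`, the `fit` and `score` callables, the synthetic data, and the least-squares stand-in model are all assumptions made for the example. Swap in your own model and metric.

```python
import numpy as np


def cumulative_subset_scores(fit, score, X_train, y_train, X_val, y_val, k=8, seed=0):
    """Train on the first 1, 2, ..., k subsets; score each model on the fixed validation set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_train))   # shuffle once so the subsets are comparable
    parts = np.array_split(order, k)        # k (nearly) equal index subsets
    sizes, scores = [], []
    for i in range(1, k + 1):
        idx = np.concatenate(parts[:i])     # indices of the first i subsets
        model = fit(X_train[idx], y_train[idx])   # a fresh model for every step
        sizes.append(len(idx))
        scores.append(score(model, X_val, y_val))  # validation set stays as is
    return sizes, scores


# Hypothetical stand-ins: a least-squares linear model, scored by negative MSE
# (higher is better, so the curve reads the same way as accuracy would).
def fit_linear(X, y):
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def neg_mse(coef, X, y):
    pred = np.column_stack([X, np.ones(len(X))]) @ coef
    return -float(np.mean((pred - y) ** 2))


# Synthetic data, purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=1000)
X_train, y_train, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

sizes, scores = cumulative_subset_scores(
    fit_linear, neg_mse, X_train, y_train, X_val, y_val, k=8
)
for n, s in zip(sizes, scores):
    print(f"{n:4d} training rows -> validation score {s:.4f}")
```

Shuffling before splitting is a choice made in this sketch so that every subset looks like the full training set. For time-ordered data you would likely want to skip the shuffle.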
Plotting the validation performance of these models (in order of increasing training data) is likely to produce one of two types of lines:
Line A is still rising when you reach the full training set. It conveys that adding more data is likely to increase model performance.
Line B has flattened out. It conveys that the model performance has already saturated, so adding more data will most likely not result in any considerable gains.
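If you would rather not eyeball the plot, a rough numeric check can tell the two shapes apart. The sketch below is only a heuristic I am assuming for illustration: the function name `still_improving`, the window of three steps, and the 0.005 tolerance are arbitrary choices, and the two score lists are made-up values that mimic Line A and Line B. It assumes a metric where higher is better.

```python
def still_improving(scores, last=3, tol=0.005):
    """Return True if the score gained more than `tol` over the last `last` steps."""
    if len(scores) <= last:
        raise ValueError("need more points than the comparison window")
    gain = scores[-1] - scores[-1 - last]   # improvement across the final steps
    return gain > tol


# Hypothetical validation scores, ordered by increasing training data.
line_a = [0.62, 0.66, 0.70, 0.73, 0.76, 0.79, 0.81, 0.83]     # keeps climbing
line_b = [0.62, 0.74, 0.80, 0.82, 0.83, 0.831, 0.832, 0.832]  # has flattened

print(still_improving(line_a))  # True: more data is likely to help
print(still_improving(line_b))  # False: performance looks saturated
```

Pick the tolerance based on how noisy your validation metric is. A gain smaller than the run-to-run variation of the metric should not count as an improvement.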
Now, you might also understand why I mentioned this above: “If there’s plenty of data, setting a low value in this range is recommended.”
Because we train one model per step, setting a high value of “k” means more subsets, which in turn means more models to train. With plenty of data, each of those training runs is expensive, so a lower “k” keeps the total cost in check.
This way, you can determine whether the model is data deficient and whether gathering data will be helpful or not.
Isn’t that simple and effective?
Other than the above technique, I discussed 11 more high-utility techniques to improve ML models in a recent article here: 11 Powerful Techniques To Supercharge Your ML Models.
👉 Over to you: What would you do if you get Line B? How would you proceed ahead?
Thanks for reading!