There are instances when randomly splitting the data into train and validation sets can lead to data leakage.
To understand this, consider an image captioning use case.
Due to the inherent nature of language, every image can have many different captions:
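If we randomly split this dataset at the row level, the same data point (image) can end up in both the training and validation sets.
This is data leakage: the model is validated on images it has already seen during training, so validation scores look better than the true generalization performance and overfitting goes undetected.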
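To make the problem concrete, here is a minimal sketch of a naive row-level split. The dataset is hypothetical (20 images with 5 captions each, with made-up IDs), and the split uses only Python's standard library:

```python
import random

# Hypothetical toy dataset: 20 images, each with 5 captions (100 rows).
rows = [(f"img_{i:02d}", f"caption {j} for image {i}")
        for i in range(20) for j in range(5)]

# Naive random split at the row level: 80% train, 20% validation.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train_rows, val_rows = shuffled[:cut], shuffled[cut:]

# Images that appear on both sides of the split are leaked.
train_images = {image_id for image_id, _ in train_rows}
val_images = {image_id for image_id, _ in val_rows}
leaked = train_images & val_images

print(f"{len(leaked)} of {len(val_images)} validation images also appear in training")
```

Because every image has several captions, a row-level shuffle almost always scatters them across both sets.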
Group shuffle split helps us solve this.
There are two steps:
Group all training instances that correspond to one image (or, more generally, that share whatever attribute could cause leakage).
After grouping, randomly assign each whole group to either the training set or the validation set.
This will prevent data leakage.
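The two steps above can be sketched in plain Python. This is only an illustration: the function name `group_split`, its parameters, and the image IDs are assumptions made for this example, and `val_fraction` here refers to the fraction of groups (not rows) sent to validation.

```python
import random
from collections import defaultdict

def group_split(groups, val_fraction=0.2, seed=0):
    """Split row indices so that every group lands entirely in one set."""
    # Step 1: collect the row indices that belong to each group.
    rows_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        rows_by_group[group].append(idx)

    # Step 2: shuffle the group keys and send whole groups to validation.
    keys = sorted(rows_by_group)
    random.Random(seed).shuffle(keys)
    n_val = max(1, round(val_fraction * len(keys)))
    val_keys = set(keys[:n_val])

    train_idx = [i for g, idxs in rows_by_group.items()
                 if g not in val_keys for i in idxs]
    val_idx = [i for g, idxs in rows_by_group.items()
               if g in val_keys for i in idxs]
    return train_idx, val_idx

# One group label per caption row (hypothetical image IDs).
image_ids = ["img_a", "img_a", "img_b", "img_b", "img_b", "img_c", "img_d", "img_d"]
train_idx, val_idx = group_split(image_ids, val_fraction=0.25)

train_groups = {image_ids[i] for i in train_idx}
val_groups = {image_ids[i] for i in val_idx}
print("train:", sorted(train_groups))
print("val:  ", sorted(val_groups))
```

Since groups differ in size, the row-level split ratio is only approximately equal to `val_fraction`.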
One thing to note is that in the above image captioning example, all features in the dataset (image pixels) contributed to the grouping criteria.
But in practice, often only a subset of the features (or a single identifier) determines which samples must be kept together when splitting the data.
For instance, consider a dataset containing medical imaging data. Each sample consists of multiple images (e.g., different views of the same patient’s body part), and the model is intended to detect the severity of a disease.
In this case, group splitting will be done based on patient ID, so that all images of one patient stay on the same side of the split.
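As a rough sketch of this scenario, the snippet below splits by patient ID using scikit-learn's GroupShuffleSplit (introduced in the demo section). Everything about the data is an assumption for illustration: 6 patients, 3 views each, random stand-in features, and made-up severity labels.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 6 patients, 3 views each (18 samples).
patient_id = np.repeat(np.arange(6), 3)
X = np.random.default_rng(0).normal(size=(18, 4))  # stand-in image features
severity = np.repeat([0, 1, 2, 0, 1, 2], 3)        # stand-in labels

# Only patient_id defines the groups; the image features do not.
# An integer test_size is the number of groups (patients) held out.
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_idx, val_idx = next(splitter.split(X, severity, groups=patient_id))

train_patients = set(patient_id[train_idx].tolist())
val_patients = set(patient_id[val_idx].tolist())
print("train patients:", sorted(train_patients))
print("val patients:  ", sorted(val_patients))
```

No patient appears in both sets, even though each patient contributes several samples.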
Demo
If you use scikit-learn, the GroupShuffleSplit class implements this idea.
Consider a dataset with four columns: x1 and x2 are the features, y is the target variable, and group denotes the grouping criteria.
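A dataset of this shape could be built as follows. The values are hypothetical and chosen only for illustration; only the column layout (x1, x2, y, group) and the group labels A, B, and C come from the text.

```python
import pandas as pd

# Hypothetical values; only the column layout follows the text.
df = pd.DataFrame({
    "x1":    [0.5, 1.2, 0.3, 1.8, 0.9, 1.1, 0.7],
    "x2":    [2.1, 0.4, 1.5, 0.8, 2.2, 1.9, 0.6],
    "y":     [0, 1, 0, 1, 0, 1, 1],
    "group": ["A", "A", "B", "B", "B", "C", "C"],
})
print(df)
```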
First, we import GroupShuffleSplit from sklearn and instantiate the object:
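A minimal version of this step might look like the snippet below. The parameter values are assumptions, not necessarily the ones used originally.

```python
from sklearn.model_selection import GroupShuffleSplit

# n_splits=1 gives a single train/validation split.
# test_size=0.3 is the (approximate) fraction of groups held out, not of rows.
# random_state is fixed only for reproducibility; the value is arbitrary.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
```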
The split() method of this object lets us perform group splitting:
This returns a generator, and we can unpack it to get the following output:
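Here is a self-contained sketch of the split and unpacking steps. The data values and `random_state` are assumptions, so which group lands in the validation set may differ from the output described here; what is guaranteed is that each group stays entirely on one side.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical values; only the column layout follows the text.
df = pd.DataFrame({
    "x1":    [0.5, 1.2, 0.3, 1.8, 0.9, 1.1, 0.7],
    "x2":    [2.1, 0.4, 1.5, 0.8, 2.2, 1.9, 0.6],
    "y":     [0, 1, 0, 1, 0, 1, 1],
    "group": ["A", "A", "B", "B", "B", "C", "C"],
})
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# split() is lazy: it returns a generator of (train_idx, val_idx) pairs.
splits = gss.split(df[["x1", "x2"]], df["y"], groups=df["group"])
train_idx, val_idx = next(splits)

train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
print("train groups:", sorted(train_df["group"].unique()))
print("val groups:  ", sorted(val_df["group"].unique()))
```

With three groups and test_size=0.3, one group is held out for validation and the other two go to training.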
As demonstrated above:
The data points in groups “A” and “C” are together in the training set.
The data points in group “B” are together in the validation/test set.
👉 Over to you: What are some other ways data leakage may kick in?