Deep learning models may fail to converge for various reasons.
Some causes are common and easy to spot, and therefore quick to fix: a learning rate that is too high or too low, missing data normalization, missing batch normalization, etc.
But the problem arises when the cause isn’t that apparent. If you are unaware of such a cause, debugging can take some serious time.
Today, I want to talk about one such data-related mistake, which I once committed during my early days in machine learning. Admittedly, it took me quite some time to figure it out back then because I had no idea about the issue.
Let’s understand!
Experiment
Consider a classification neural network trained using mini-batch gradient descent:
Mini-batch gradient descent: Update network weights using a few data points at a time.
Here, we train two different neural networks:
Version 1: The dataset is ordered by labels.
Version 2: The dataset is randomly shuffled.
And, of course, before training, we ensure that both networks have the same initial weights, learning rate, and other settings.
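To make the two versions concrete, here is a minimal sketch of how they might be built from the same samples. The toy dataset, the seed, and the helper name `batch_label_counts` are assumptions for illustration only, not the setup used in the original experiment.

```python
import random
from collections import Counter

# Hypothetical toy dataset: 3 classes, 4 samples each, as (feature, label) pairs.
data = [(float(i), i % 3) for i in range(12)]

# Version 1: ordered by label.
ordered = sorted(data, key=lambda pair: pair[1])

# Version 2: the same samples in random order (seeded for reproducibility).
shuffled = data[:]
random.Random(42).shuffle(shuffled)

def batch_label_counts(samples, batch_size):
    """Return a Counter of labels for each consecutive mini-batch."""
    return [
        Counter(label for _, label in samples[i:i + batch_size])
        for i in range(0, len(samples), batch_size)
    ]

print(batch_label_counts(ordered, 4))   # every batch holds a single class
print(batch_label_counts(shuffled, 4))  # batches typically mix classes
```

Both lists hold exactly the same samples. Only the order, and therefore the label composition of each mini-batch, differs.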
The following image depicts the epoch-by-epoch performance of the two models. On the left, we have the model trained on label-ordered data, and the one on the right was trained on the shuffled dataset.
It is clear that the model trained on the label-ordered dataset fails to converge, while the model trained on the shuffled dataset converges.
Why does that happen?
Now, if you think about it for a second, overall, both models received the same data, didn’t they?
Yet, the order in which the data was fed to these models totally determined their performance.
I vividly remember that when I faced this issue, I knew that my data was ordered by labels.
Yet, it never occurred to me that ordering may influence the model performance because the data will always be the same regardless of the ordering.
But later, I realized that this reasoning holds only when the model sees the entire dataset and updates its weights in one go, i.e., in batch gradient descent, as depicted below:
But in the case of mini-batch gradient descent, the weights are updated after every mini-batch.
Thus, the predictions and the weight update on any mini-batch are influenced by all the mini-batches that came before it.
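The difference between the two update schemes can be checked on a tiny example. The sketch below uses a hypothetical one-parameter regression model (not the classification network from the experiment) with made-up data, chosen only because the arithmetic is easy to follow.

```python
# Toy (x, y) pairs: the first half follows y = x, the second half y = 3x.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 9.0), (4.0, 12.0)]

def gradient(w, samples):
    """Gradient of the mean of 0.5 * (w*x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)

def one_epoch_minibatch(samples, batch_size=2, lr=0.1, w=0.0):
    """Run one pass of mini-batch gradient descent and return the final weight."""
    for i in range(0, len(samples), batch_size):
        w -= lr * gradient(w, samples[i:i + batch_size])
    return w

forward = data          # original order
backward = data[::-1]   # same samples, reversed order

# Batch gradient descent: one gradient over all samples, so order is irrelevant.
full_forward = gradient(0.0, forward)
full_backward = gradient(0.0, backward)

# Mini-batch gradient descent: each update starts from the previous one.
w_forward = one_epoch_minibatch(forward)
w_backward = one_epoch_minibatch(backward)

print(full_forward == full_backward)  # the full-batch gradients match
print(w_forward, w_backward)          # the mini-batch results do not
```

With these settings, the full-batch gradient is the same in both orders, while one epoch of mini-batch updates ends at roughly 3.69 in one order and 3.06 in the other.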
With label-ordered data, where samples of the same class are grouped together, most mini-batches contain a single class. Each update therefore pushes the model toward the class it is currently seeing, so it learns patterns specific to the class it saw excessively early in training rather than features that separate the classes.
In contrast, with randomly ordered data, each mini-batch is a roughly representative mix of classes (on average, not exactly balanced). This allows the model to learn a more comprehensive set of features throughout the training process.
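A deliberately simplified sketch can illustrate this drift. It assumes a bias-only logistic model, i.e., a model that can only learn the class prior, plus a made-up two-class label list. The function name `track_prior`, the learning rate, and the batch size are all illustrative choices and not the network from the experiment.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def track_prior(labels, batch_size=10, lr=1.0):
    """Train a bias-only logistic model for one epoch of mini-batch SGD.

    Returns the predicted P(class 1) after every weight update.
    """
    b = 0.0
    history = []
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        # Gradient of the mean log-loss w.r.t. the bias: p - mean(labels).
        grad = sigmoid(b) - sum(batch) / len(batch)
        b -= lr * grad
        history.append(sigmoid(b))
    return history

def spread(history):
    """Range covered by the predicted probability during the epoch."""
    return max(history) - min(history)

ordered = [0] * 100 + [1] * 100        # label-ordered
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)     # same labels, random order

hist_ordered = track_prior(ordered)
hist_shuffled = track_prior(shuffled)

print(spread(hist_ordered), spread(hist_shuffled))
```

In the ordered run, the estimate sinks to about 0.1 while only class 0 is visible and then climbs to nearly 0.9 once class 1 takes over, even though the true prior is 0.5 throughout. In the shuffled run, it stays much closer to 0.5.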
Of course, shuffling is not appropriate for time-series datasets, where the temporal order itself carries information.
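For time-ordered data, one common alternative is to split chronologically and leave the order untouched. The helper `chronological_split` and the toy readings below are hypothetical and only meant as a sketch.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered sequence into train/validation parts without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

# Hypothetical daily readings as (day, value) pairs, already sorted by time.
readings = [(day, 20.0 + 0.5 * day) for day in range(10)]

train, valid = chronological_split(readings)

# Every training sample precedes every validation sample in time.
print(train[-1][0], valid[0][0])
```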
The good thing is that if you happen to use, say, PyTorch DataLoader, shuffling is already built in. Note, however, that it is off by default: you must pass shuffle=True so that the data is reshuffled at every epoch. And if you have a custom implementation, ensure that you are not making this error.
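For a custom implementation, one possible approach is to reshuffle the sample indices at the start of every epoch. The generator `iterate_minibatches` below is a hypothetical helper written in plain Python, not part of any library.

```python
import random

def iterate_minibatches(samples, batch_size, rng):
    """Yield mini-batches after shuffling the sample order for this epoch."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)  # a new permutation on every call
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(20))  # stand-in for label-ordered (feature, label) pairs
rng = random.Random(0)     # seeded so that runs are reproducible

for epoch in range(2):
    batches = list(iterate_minibatches(samples, batch_size=5, rng=rng))
    # Each epoch still visits every sample exactly once.
    assert sorted(x for batch in batches for x in batch) == samples
    print(epoch, batches[0])
```

Shuffling indices rather than the data itself leaves the underlying dataset untouched, which can be convenient when the samples are large or stored elsewhere.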
Before I end, one thing that you must ALWAYS remember when training neural networks is that these models can proficiently learn patterns that do not actually exist in your dataset, such as an artifact of the order in which the samples arrive. So never give them any chance to do so.
In tomorrow's issue, I will share something peculiar and counterintuitive that I learned about PyTorch DataLoader pretty recently while optimizing the training procedure of one of my ML models.
Stay tuned!
Here’s some further reading in the meantime related to today’s issue:
We covered 8 fatal (yet non-obvious) pitfalls and cautionary measures in data science here.
We discussed 11 uncommon powerful techniques to supercharge your ML models here.
👉 Over to you: What are some other uncommon sources of error in training deep learning models?
I am missing something. What non-existent pattern in the data did the model in the example learn?