Label Smoothing: The Overlooked and Rarely Discussed Regularization Technique
Make your model less overconfident.
In single-label classification datasets, every instance assigns the entire probability mass to a single class, and all other classes get zero.
This is depicted below:
The issue is that such label distributions can push the model to predict the true class of every training sample with very high confidence.
This overconfidence can hurt its ability to generalize.
Label smoothing is a rarely discussed regularization technique that elegantly addresses this issue.
As depicted above, with label smoothing:
We intentionally reduce the probability mass of the true class slightly.
The reduced probability mass is uniformly distributed to all other classes.
Simply put, this can be thought of as asking the model to be “less overconfident” during training and prediction while still attempting to make accurate predictions.
This makes intuitive sense as well.
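The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the notebook: the function name `smooth_labels` and the value `epsilon=0.1` are assumptions made for the example. It follows the description above, where the removed mass goes only to the other classes. Some library implementations (for example, the `label_smoothing` option of PyTorch's `CrossEntropyLoss`) may instead spread it over all classes, including the true one.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return a smoothed label distribution as a list of floats.

    The true class keeps 1 - epsilon, and the removed mass epsilon is
    split evenly over the remaining classes (hypothetical helper).
    """
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_value = epsilon / (num_classes - 1)  # share given to each wrong class
    labels = [off_value] * num_classes
    labels[true_class] = 1.0 - epsilon       # reduced mass for the true class
    return labels


hard = smooth_labels(true_class=2, num_classes=4, epsilon=0.0)  # plain one-hot label
soft = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)  # smoothed label
print(hard)
print(soft)
```

With `epsilon=0.0` the function returns the usual one-hot label, so the same helper covers both the smoothed and the unsmoothed case.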
The efficacy of this technique is evident from the image below:
In this experiment, I trained two neural networks on the Fashion MNIST dataset with the exact same weight initialization.
One without label smoothing.
Another with label smoothing.
The model with label smoothing achieved better test accuracy, i.e., better generalization.
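To see why the two training setups behave differently, it helps to compare the loss each one assigns to the same prediction. The sketch below is not the notebook's training code. It is a stdlib-only illustration with assumed values (10 classes, true class 3, `epsilon = 0.1`) and a hypothetical helper `losses_for_gap`. It raises the logit of the true class step by step and reports the cross-entropy against the hard label and against the smoothed label.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


num_classes, true_class, epsilon = 10, 3, 0.1  # illustrative values only

hard_target = [0.0] * num_classes
hard_target[true_class] = 1.0

soft_target = [epsilon / (num_classes - 1)] * num_classes
soft_target[true_class] = 1.0 - epsilon


def losses_for_gap(gap):
    """Loss under both targets when the true-class logit leads by `gap`."""
    logits = [0.0] * num_classes
    logits[true_class] = gap
    return cross_entropy(logits, hard_target), cross_entropy(logits, soft_target)


for gap in (2.0, 5.0, 10.0):
    hard_loss, soft_loss = losses_for_gap(gap)
    print(f"gap={gap:>4}: hard={hard_loss:.4f}  smoothed={soft_loss:.4f}")
```

With the hard label, the loss keeps shrinking as the logit gap grows, so the model is always rewarded for being more confident. With the smoothed label, the loss starts rising again once the gap becomes large, which penalizes extreme confidence.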
Pretty handy, isn’t it?
When not to use label smoothing?
After using label smoothing for many of my projects, I have also realized that it is not well suited for all use cases.
So it’s important to know when you should not use it.
If you only care about getting the final prediction correct and improving generalization, label smoothing is a handy technique.
However, I wouldn’t recommend using it if you care about both:
Getting the prediction correct.
And understanding the model’s confidence in that prediction.
This is because, as discussed above, label smoothing guides the model to become “less overconfident” about its predictions.
Thus, we typically notice a drop in the confidence values across predictions, as depicted below:
On a specific test instance:
The model without label smoothing outputs 99% probability for class 3.
With label smoothing, although the prediction is still correct, the confidence drops to 74%.
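The direction of this drop can be reproduced without training anything. The sketch below makes a simplifying assumption: the model puts some confidence on the true class and splits the rest evenly over the other classes. It then scans for the confidence that minimizes the smoothed cross-entropy. The helpers `smoothed_loss` and `best_confidence` and the `epsilon` values are hypothetical, chosen only for illustration. A real network will not land exactly on these numbers, and the 99% and 74% figures above come from the actual experiment, not from this calculation.

```python
import math


def smoothed_loss(confidence, num_classes, epsilon):
    """Cross-entropy against a smoothed target when the model puts
    `confidence` on the true class and splits the rest evenly."""
    other_prob = (1.0 - confidence) / (num_classes - 1)
    loss = -(1.0 - epsilon) * math.log(confidence)
    if epsilon > 0:
        # The wrong classes together carry a target mass of epsilon.
        loss -= epsilon * math.log(other_prob)
    return loss


def best_confidence(num_classes, epsilon, steps=999):
    """Grid-search the true-class confidence with the lowest loss."""
    candidates = [i / (steps + 1) for i in range(1, steps + 1)]  # 0.001 .. 0.999
    return min(candidates, key=lambda c: smoothed_loss(c, num_classes, epsilon))


for eps in (0.0, 0.1, 0.2):
    print(f"epsilon={eps}: loss-minimizing confidence = {best_confidence(10, eps)}")
```

Without smoothing, the search runs to the edge of the grid because higher confidence is always better. With smoothing, the loss is minimized when the predicted confidence matches the smoothed target, so the reported probabilities stop being a direct measure of how sure the model is.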
This is something to keep in mind when using label smoothing.
Nonetheless, the technique is a promising way to regularize deep learning models.
You can download the code notebook here: Label Smoothing Notebook.
👉 Over to you: What could be some other things to take care of when using label smoothing?
Thanks for reading!