Double Descent vs. Bias-Variance Trade-off
A counterintuitive phenomenon while training ML models.
It is well-known that as the number of model parameters increases, we typically overfit the data more and more.
For instance, consider fitting a polynomial regression model to the dummy dataset below:
In case you don’t know, this is called a polynomial regression model:
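Assuming a single input feature x and learned weights w_0, ..., w_m (notation assumed here), a degree-m polynomial regression model is typically written as:

```latex
\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m = \sum_{j=0}^{m} w_j x^j
```

The degree m controls the number of parameters: a degree-m model has m + 1 weights.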
It is expected that as we increase the degree (m) and train the polynomial regression model:
The training loss will get closer and closer to zero.
The test (or validation) loss will first decrease and then increase.
This is because, with a higher degree, the model will find it easier to contort its regression fit through each training data point, which makes sense.
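To make this concrete, here is a minimal sketch in Python. The dataset (a noisy sine curve), the sample sizes, and the helper names make_data and sweep_degrees are assumptions for illustration, not the data behind the plots in this post. It fits in NumPy's Legendre basis for numerical stability, which spans the same polynomials as the usual powers of x.

```python
import numpy as np
from numpy.polynomial import legendre as L


def make_data(n, rng, noise=0.3):
    """Hypothetical dummy data: y = sin(pi * x) + Gaussian noise, x in [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + noise * rng.normal(size=n)
    return x, y


def sweep_degrees(degrees, n_train=20, n_test=200, seed=0):
    """Fit one least-squares polynomial per degree; return train and test MSE."""
    rng = np.random.default_rng(seed)
    x_tr, y_tr = make_data(n_train, rng)
    x_te, y_te = make_data(n_test, rng)
    train_mse, test_mse = [], []
    for m in degrees:
        coef = L.legfit(x_tr, y_tr, m)  # least-squares fit of degree m
        train_mse.append(np.mean((L.legval(x_tr, coef) - y_tr) ** 2))
        test_mse.append(np.mean((L.legval(x_te, coef) - y_te) ** 2))
    return np.array(train_mse), np.array(test_mse)


# Degrees stay below the number of training points (the classical regime).
degrees = list(range(1, 13))
train_mse, test_mse = sweep_degrees(degrees)
for m, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={m:2d}  train MSE={tr:.4f}  test MSE={te:.4f}")
```

The training loss can only go down as m grows, because every higher-degree model contains the lower-degree ones. How the test loss behaves depends on the noise level and the particular sample.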
In fact, this is also evident from the following loss plot:
But notice what happens when we continue to increase the degree (m):
That’s strange, right?
Why does the test loss increase to a certain point but then decrease?
This was not expected, was it?
Well…what you are seeing is called the “double descent phenomenon,” which is quite commonly observed in many ML models, especially deep learning models.
It shows that, counterintuitively, increasing the model complexity beyond the point of interpolation can improve generalization performance.
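If you want to probe this yourself, here is a rough sketch that pushes the degree past the interpolation point. Everything in it is an assumption for illustration: the noisy sine data, the 20 training points (so interpolation starts at degree 19), the Legendre features, and the helper name noisy_sine. Once the degree exceeds the number of training points minus one, many weight vectors fit the training data exactly. np.linalg.lstsq returns the one with the smallest L2 norm, which is just one possible choice. Whether and where a second descent shows up in the test loss depends on the basis, the noise, and that choice.

```python
import numpy as np
from numpy.polynomial import legendre as L


def noisy_sine(n, rng, noise=0.3):
    """Hypothetical dummy data: y = sin(pi * x) + Gaussian noise, x in [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(np.pi * x) + noise * rng.normal(size=n)


rng = np.random.default_rng(0)
n_train = 20
x_tr, y_tr = noisy_sine(n_train, rng)
x_te, y_te = noisy_sine(500, rng)

# Degrees on both sides of the interpolation threshold (m = n_train - 1 = 19).
degrees = [1, 2, 4, 8, 12, 16, 19, 25, 40, 80, 160]
train_mse, test_mse = [], []
for m in degrees:
    Phi_tr = L.legvander(x_tr, m)  # feature matrix, shape (n_train, m + 1)
    Phi_te = L.legvander(x_te, m)
    # Least-squares weights; for m + 1 > n_train this is the minimum-norm interpolator.
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    train_mse.append(np.mean((Phi_tr @ w - y_tr) ** 2))
    test_mse.append(np.mean((Phi_te @ w - y_te) ** 2))

for m, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={m:3d}  train MSE={tr:.2e}  test MSE={te:.2e}")
```

Plotting test_mse against degrees on a log scale is the easiest way to see whether the curve rises near the threshold and comes back down afterwards for your data.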
In fact, this idea is closely tied to why LLMs, despite having billions or even trillions of parameters, can still generalize pretty well.
It is hard to accept because this phenomenon directly challenges the traditional bias-variance trade-off we learn in any introductory ML class:
Put another way, very large models, even those with more parameters than training data points, can still generalize well.
To the best of my knowledge, this is still an open question, and it isn’t entirely clear why neural networks exhibit this behavior.
There are some theories around regularization, however, such as this one:
It could be that training applies some form of implicit regularization, so that the model effectively relies on only as many parameters as it needs to generalize.
But to be honest, nothing is clear yet.
👉 Over to you: I would love to hear from you today on what you think about this phenomenon and its possible causes.
In the polynomial example, perhaps a very high degree polynomial is able to approximate the cubic spline interpolator, which has much lower variance than the Lagrange polynomial (the interpolating polynomial of minimum degree). Cubic splines are provably the lowest variance interpolating piecewise polynomials.
It may depend on how one is choosing a solution in the underdetermined case (degree > data points - 1). Will L2 regularization give something close to the cubic spline?
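On the question of how a solution gets chosen in the underdetermined case, here is a small sketch under assumed settings (10 evenly spaced points, degree 40, Legendre features, a hypothetical ridge helper). It shows three things: the pseudoinverse picks the interpolating weights with the smallest L2 norm, other weight vectors interpolate the same points with a larger norm, and ridge (L2) regression with a tiny penalty lands very close to the minimum-norm solution. It says nothing about how close any of these are to a cubic spline.

```python
import numpy as np
from numpy.polynomial import legendre as L

rng = np.random.default_rng(0)
n, m = 10, 40  # 10 data points, degree 40: far more weights than points
x = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=n)  # hypothetical dummy targets
Phi = L.legvander(x, m)  # feature matrix, shape (n, m + 1)

# Minimum-norm interpolator via the pseudoinverse.
Phi_pinv = np.linalg.pinv(Phi)
w_min = Phi_pinv @ y

# Another interpolator: add a vector from the null space of Phi.
z = rng.normal(size=m + 1)
null_part = z - Phi_pinv @ (Phi @ z)  # component of z that Phi maps to zero
w_other = w_min + null_part


def ridge(Phi, y, lam):
    """Ridge weights in dual form: Phi^T (Phi Phi^T + lam * I)^-1 y."""
    k = Phi.shape[0]
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(k), y)


w_ridge = ridge(Phi, y, lam=1e-8)

print("max train error, min-norm:", np.max(np.abs(Phi @ w_min - y)))
print("max train error, other   :", np.max(np.abs(Phi @ w_other - y)))
print("weight norms (min-norm, other):", np.linalg.norm(w_min), np.linalg.norm(w_other))
print("max |ridge - min-norm|:", np.max(np.abs(w_ridge - w_min)))
```

Both weight vectors pass through every training point, so the training loss alone cannot tell them apart. Any difference in test loss comes from how they behave between the points.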
How would you explain it in the polynomial case?