# Double Descent vs. Bias-Variance Trade-off

### A counterintuitive phenomenon while training ML models.

It is well known that as the number of model parameters increases, a model typically overfits the data more and more.

For instance, consider fitting a polynomial regression model trained on this dummy dataset below:

In case you don’t know, a polynomial regression model fits the data with a polynomial of degree `m`: y = w₀ + w₁x + w₂x² + … + wₘxᵐ.

It is expected that as we increase the degree (`m`) and train the polynomial regression model:

- The training loss will get closer and closer to zero.

- The test (or validation) loss will first decrease and then grow larger and larger.

This is because, with a higher degree, the model will find it easier to contort its regression fit through each training data point, which makes sense.
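This first regime is easy to reproduce. Below is a minimal sketch (the noisy cubic stands in for the post's dummy dataset, which isn't reproduced here) that fits polynomials of increasing degree and tracks both losses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the dummy dataset: a noisy cubic (an assumption, since
# the original dataset isn't reproduced in this post).
x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = x_train**3 - x_train + rng.normal(0, 0.1, size=20)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = x_test**3 - x_test + rng.normal(0, 0.1, size=200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_loss, test_loss = {}, {}
for m in range(1, 16):                        # polynomial degree
    coeffs = np.polyfit(x_train, y_train, deg=m)
    train_loss[m] = mse(coeffs, x_train, y_train)
    test_loss[m] = mse(coeffs, x_test, y_test)

# Higher degrees nest all lower-degree fits, so training loss can
# only shrink as m grows, while test loss eventually starts rising.
```

Plotting `train_loss` and `test_loss` against `m` reproduces the classic picture: training loss falls toward zero while test loss traces the familiar U-shape.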

In fact, this is also evident from the following loss plot:

**But notice what happens when we continue to increase the degree (`m`):**

That’s strange, right?

**Why does the test loss increase up to a certain point but then decrease?**

This was not expected, was it?

Well…what you are seeing is called the “**double descent phenomenon**,” which is quite commonly observed in many ML models, especially deep learning models.

It shows that, counterintuitively, increasing the model complexity beyond the point of interpolation can improve generalization performance.

**In fact, this whole idea is deeply rooted in why LLMs, although massive (billions or even trillions of parameters), can still generalize pretty well.**

And it’s hard to accept because this phenomenon directly challenges the traditional bias-variance trade-off we learn in any introductory ML class:

Put another way, very large models, **even with more parameters than training data points**, can still generalize well.
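A minimal numerical sketch of this (the Chebyshev feature basis, toy data, and sizes are my assumptions, not the post's exact setup): with far more features than data points, the minimum-norm least-squares solution still passes exactly through every training point, yet remains a perfectly usable predictor between them.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(1)

# 20 noisy training points (toy data, for illustration only).
x = np.sort(rng.uniform(-1, 1, 20))
y = np.sin(np.pi * x) + rng.normal(0, 0.1, size=20)

# 101 Chebyshev features for 20 points: heavily overparameterized.
# For an underdetermined system, np.linalg.lstsq returns the
# *minimum-norm* solution among all exact interpolants.
Phi = C.chebvander(x, deg=100)               # shape (20, 101)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The model interpolates: it passes through every training point...
train_residual = float(np.max(np.abs(Phi @ w - y)))

# ...and we can still evaluate it everywhere in between.
x_grid = np.linspace(-1, 1, 500)
preds = C.chebvander(x_grid, deg=100) @ w
```

Despite zero training error, this overparameterized fit tends to behave far more tamely between the data points than the unique degree-19 interpolant sitting exactly at the interpolation threshold.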

To the best of my knowledge, this is still an **open question**, and it isn’t entirely clear why neural networks exhibit this behavior.

There are, however, some theories around regularization, such as this one:

It could be that training applies some sort of **implicit regularization**, which lets the model effectively rely on only as many parameters as generalization demands.
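In linear least squares, this implicit bias can actually be made precise (a toy sketch, not a statement about deep networks): gradient descent started from zero converges to the minimum-norm interpolating solution, the same one the pseudoinverse computes, even though infinitely many zero-loss solutions exist.

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdetermined system: 5 equations, 20 unknowns (toy sizes).
A = rng.normal(size=(5, 20))
y = rng.normal(size=5)

# Gradient descent on 0.5 * ||A w - y||^2, starting from w = 0.
w = np.zeros(20)
eta = 1.0 / np.linalg.norm(A, ord=2) ** 2    # stable step size
for _ in range(20_000):
    w -= eta * A.T @ (A @ w - y)

# Starting at zero, every update lies in the row space of A, so GD
# converges to the *minimum-norm* zero-loss solution: the one given
# by the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(A) @ y
```

So among all the weight vectors that fit the data perfectly, plain gradient descent silently picks the "smallest" one, which is one concrete, provable form of implicit regularization.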

But to be honest, nothing is clear yet.

👉 Over to you: I would love to hear your thoughts on this phenomenon and its possible causes.

**👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.**

**The button is located towards the bottom of this email.**

Thanks for reading!

**Latest full articles**

If you’re not a full subscriber, here’s what you missed last month:

A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.

You Are Probably Building Inconsistent Classification Models Without Even Realizing

Why Sklearn’s Logistic Regression Has no Learning Rate Hyperparameter?

PyTorch Models Are Not Deployment-Friendly! Supercharge Them With TorchScript.

How To (Immensely) Optimize Your Machine Learning Development and Operations with MLflow.

DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering.

Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.

You Cannot Build Large Data Projects Until You Learn Data Version Control!

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

**👉 Tell the world what makes this newsletter special for you by leaving a review here :)**

👉 If you love reading this newsletter, feel free to share it with friends!

In the polynomial example, perhaps a very high degree polynomial is able to approximate the cubic spline interpolator, which has much lower variance than the Lagrange polynomial (the interpolating polynomial of minimum degree). Cubic splines are provably the lowest variance interpolating piecewise polynomials.

It may depend on how one chooses a solution in the underdetermined case (degree > number of data points − 1). Will L2 regularization give something close to the cubic spline?
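One way to probe this question numerically (a sketch; the toy data, the Chebyshev basis, and the use of `scipy` are my assumptions): the minimum-norm high-degree interpolant, which is the limit of L2 (ridge) regularization as the penalty goes to zero, can be compared directly against the cubic spline through the same points.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, 10))
y = np.sin(2 * x) + rng.normal(0, 0.05, size=10)

# Minimum-norm degree-80 interpolant: the limit of L2 (ridge)
# regularization on Chebyshev features as the penalty shrinks to 0.
Phi = C.chebvander(x, deg=80)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The cubic spline through the same points, for comparison.
cs = CubicSpline(x, y)

# Both interpolate the data exactly; measure how far apart the two
# curves get *between* the knots.
grid = np.linspace(x[0], x[-1], 400)
gap = float(np.max(np.abs(C.chebvander(grid, deg=80) @ w - cs(grid))))
```

Plotting both curves (and printing `gap`) gives a quick empirical answer to how close the ridge-limit polynomial comes to the spline on a given dataset.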

How would you explain it in the polynomial case?