2 Mathematical Proofs of Ordinary Least Squares

The origin of OLS.

Jul 13, 2023

Most machine learning algorithms use gradient descent to learn the optimal parameters.

However, in addition to gradient descent, linear regression can model data using another technique called ordinary least squares (OLS).

Ordinary Least Squares (OLS):

It is a deterministic algorithm. If run multiple times on the same data, it always returns the same weights.
It finds the optimal (least-squares) solution in closed form whenever \(X^{T}X\) is invertible.
The solution is mathematically framed as follows:
\(\theta = (X^{T}X)^{-1}X^{T}y\)
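
As a quick sanity check, the closed-form expression can be evaluated directly with NumPy. This is only a minimal sketch on synthetic data: the sizes, the random seed, and the names `true_theta`, `theta_hat`, and `theta_lstsq` are assumptions made for illustration and do not come from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 100, 3                                   # n samples, m features (illustrative sizes)
X = rng.normal(size=(n, m))                     # synthetic input data, shape (n, m)
true_theta = np.array([[2.0], [-1.0], [0.5]])   # hypothetical "true" parameters, shape (m, 1)
y = X @ true_theta + 0.1 * rng.normal(size=(n, 1))  # noisy outputs, shape (n, 1)

# Closed-form OLS: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's built-in least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat.ravel())
print(np.allclose(theta_hat, theta_lstsq))
```

Both routes should agree up to floating-point error, and the estimate should land close to the hypothetical `true_theta` used to generate the data.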

But where does this solution come from?

Can we derive it?

Of course, we can.

Below are two mathematical derivations of the parameters obtained using OLS.

Proof #1: Using Matrix Manipulations

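Step 1) With OLS, the idea is to find the set of parameters (θ) that reproduce the outputs as closely as possible. This proof starts from the idealized case where the fit is exact: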

\(y = X\theta\)

where,

X: input data with dimensions (n,m).
θ: parameters with dimensions (m,1).
y: output data with dimensions (n,1).
n: number of samples.
m: number of features.

Step 2) If X were invertible, the parameter vector θ could be determined directly by multiplying both sides of the equation by the inverse of X, as shown below:

\(\theta = X^{-1}y\)

But that only works if X is a square matrix (n = m) with a non-zero determinant. In practice, there are usually more samples than features, so X is not square.
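
The restriction is easy to see numerically. The sketch below uses small made-up matrices (`X_square`, `y_square`, and `X_tall` are hypothetical names and values) to show that a direct inverse works for a square, non-singular X but is rejected for a tall one.

```python
import numpy as np

# Square, non-singular case: 2 samples, 2 features (determinant = 5)
X_square = np.array([[2.0, 1.0],
                     [1.0, 3.0]])
y_square = np.array([[1.0], [2.0]])
theta = np.linalg.inv(X_square) @ y_square   # theta = X^{-1} y is well defined here

# Tall case: 5 samples, 2 features, so X is not square
X_tall = np.ones((5, 2))
try:
    np.linalg.inv(X_tall)
    inverse_exists = True
except np.linalg.LinAlgError:
    inverse_exists = False                   # NumPy refuses to invert a non-square matrix

print(np.allclose(X_square @ theta, y_square))
print(inverse_exists)
```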

Step 3) To resolve this, we first multiply both sides by the transpose of X, as shown below:

\(X^{T}X\theta = X^{T}y\)

The product \(X^{T}X\) is a square matrix with dimensions (m,m).

Being square, it can be inverted, provided it is non-singular (which requires the columns of X to be linearly independent).

Step 4) Lastly, we multiply both sides by the inverse of \(X^{T}X\) to get the following:

\(\theta = (X^{T}X)^{-1}X^{T}y\)
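
The steps of Proof #1 can be traced in code. This is a sketch under the proof's own assumption that \(y = X\theta\) holds exactly, so the data is generated without noise. The names `gram`, `rhs`, and `theta_true` are illustrative, and `np.linalg.solve` is used in place of an explicit inverse, which is a swapped-in but equivalent way of solving \(X^{T}X\theta = X^{T}y\).

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 50, 4
X = rng.normal(size=(n, m))           # tall matrix, shape (n, m), not invertible itself
theta_true = rng.normal(size=(m, 1))  # hypothetical parameters
y = X @ theta_true                    # noise-free, so y = X theta holds exactly

gram = X.T @ X                        # X^T X, square with shape (m, m)
rhs = X.T @ y                         # X^T y, shape (m, 1)

# Solve X^T X theta = X^T y without forming the inverse explicitly
theta = np.linalg.solve(gram, rhs)

print(gram.shape)
print(np.allclose(theta, theta_true))
```

Solving the linear system is generally preferred over computing the inverse and multiplying, since it is cheaper and numerically more stable, but the result is the same \(\theta\).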

Proof #2: Using Calculus

Step 1) Essentially, the goal of linear regression is to minimize the squared error:

\(L = ||y - X\theta||^{2}\)

Thus, we can use calculus to find the parameters θ that minimize the squared error.

Step 2) Simplify the squared norm above:

\(L = (y - X\theta)^T(y - X\theta)\)

Expanding the expression, we get:

\(L = y^{T}y - y^{T}X\theta - (X\theta)^Ty + (X\theta)^TX\theta\)

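Step 3) Since \(y^{T}X\theta\) is a scalar, it equals its own transpose \((X\theta)^{T}y\), so the two middle terms combine into \(-2\theta^{T}X^{T}y\). Differentiate the expression with respect to θ and simplify: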

\(\frac{dL}{d\theta} = -2X^{T}y + 2X^TX\theta\)
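
One way to gain confidence in this derivative is to compare it with a finite-difference approximation. The sketch below does that on random data; the helper `loss`, the step size `eps`, and all array values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n, m = 20, 3
X = rng.normal(size=(n, m))
y = rng.normal(size=(n, 1))
theta = rng.normal(size=(m, 1))       # arbitrary point at which to check the gradient

def loss(t):
    """Squared error L = ||y - X t||^2."""
    residual = y - X @ t
    return float(np.sum(residual ** 2))

# Analytic gradient from the derivation: -2 X^T y + 2 X^T X theta
analytic = -2 * X.T @ y + 2 * X.T @ X @ theta

# Central finite differences, one coordinate of theta at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(m):
    step = np.zeros_like(theta)
    step[j, 0] = eps
    numeric[j, 0] = (loss(theta + step) - loss(theta - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))
```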

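Step 4) Set the derivative to zero. Because the loss is convex in θ (its Hessian \(2X^{T}X\) is positive semi-definite), a point where the derivative vanishes is a minimum: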

\(-2X^{T}y + 2X^TX\theta = 0\)

After rearranging, we get:

\(X^TX\theta = X^{T}y\)

Step 5) Multiply both sides by the inverse of \(X^{T}X\) (assuming it is non-singular):

\(\theta = (X^{T}X)^{-1}X^{T}y\)

And that’s how we get to the OLS solution.

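It is important to remember that while OLS returns the optimal solution without any hyperparameter tuning, you pay for it in run-time: forming \(X^{T}X\) takes on the order of \(nm^{2}\) operations and inverting it takes on the order of \(m^{3}\), which becomes expensive when the number of features is large.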
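
To make the contrast with gradient descent concrete, here is a rough sketch that fits the same synthetic data both ways. The learning rate, the number of steps, and the data are all assumed values chosen so that this toy example converges; they are not recommendations.

```python
import numpy as np

rng = np.random.default_rng(3)

n, m = 200, 3
X = rng.normal(size=(n, m))
y = X @ np.array([[1.0], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(n, 1))

# Closed-form OLS: one linear solve, no hyperparameters
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: needs a learning rate and an iteration budget
theta_gd = np.zeros((m, 1))
learning_rate = 0.1   # hyperparameter (assumed value)
n_steps = 500         # hyperparameter (assumed value)
for _ in range(n_steps):
    # Gradient of the squared error divided by n; the scaling leaves the minimizer unchanged
    grad = (-2 * X.T @ y + 2 * X.T @ X @ theta_gd) / n
    theta_gd -= learning_rate * grad

print(np.allclose(theta_gd, theta_ols))
```

On this small problem, both approaches should arrive at the same parameters. The difference is that gradient descent gets there iteratively and depends on the chosen learning rate, while OLS gets there in one step at the cost of solving an (m,m) linear system.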

I would highly recommend reading one of my previous posts about this: Most Sklearn Users Don’t Know This About Its LinearRegression Implementation.

👉 Over to you: Which proof is your favorite :)?

Daily Dose of Data Science