When training an ML model, we typically keep a held-out validation/test set for evaluation.
I am sure you already know the purpose, so we won’t discuss that.
But do you know that random forests are an exception to that?
In other words, one can somewhat “evaluate” a random forest using the training set itself.
Let’s understand how.
To recap, a random forest is trained as follows:
First, we create different subsets of the training data by sampling rows with replacement (this process is called bootstrapping).
Next, we train one decision tree per subset.
Finally, we aggregate all predictions to get the final prediction.
This process is depicted below:
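If we look closely above, every subset is missing some data points from the original training set, because sampling with replacement draws some points more than once and leaves others out.
For each decision tree, we can use these left-out observations to validate the model, since that tree never saw them during training.
This is called out-of-bag validation.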
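To make the idea of “missing data points” concrete, here is a minimal sketch of bootstrapping with NumPy. The helper names `bootstrap_indices` and `out_of_bag_indices` are hypothetical and only for illustration; a real random forest implementation handles this internally.

```python
import numpy as np


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap subset)."""
    return rng.integers(0, n_samples, size=n_samples)


def out_of_bag_indices(n_samples, in_bag):
    """Rows of the original training set that the bootstrap subset never drew."""
    return np.setdiff1d(np.arange(n_samples), in_bag)


rng = np.random.default_rng(0)
n_samples = 10  # a tiny, made-up training set size

for tree_id in range(3):
    in_bag = bootstrap_indices(n_samples, rng)
    oob = out_of_bag_indices(n_samples, in_bag)
    print(f"tree {tree_id}: in-bag={np.sort(in_bag)}, out-of-bag={oob}")
```

Each tree gets its own in-bag rows, and whatever rows it never drew are its out-of-bag rows.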
Calculating the out-of-bag score for the whole random forest is simple too.
But one thing to remember is that we CANNOT evaluate individual decision trees on their specific out-of-bag sample and then combine those per-tree scores into some sort of “aggregated score” for the entire random forest model.
This is because a random forest is not about what a decision tree says individually.
Instead, it’s about what all decision trees say collectively.
So here’s how we can generate the out-of-bag score for the random forest model.
For every data point in the training set:
Gather predictions from all decision trees that did not use it as a training data point.
Aggregate these predictions (majority vote for classification, average for regression) to get the final out-of-bag prediction for that data point.
For instance, consider a random forest model with 5 decision trees → (P, Q, R, S, T).
Say a specific data point X was used as a training observation in decision trees P and R.
So we shall gather the out-of-bag prediction for data point X from decision trees Q, S and T.
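As a toy illustration of this example, the snippet below aggregates only the votes of the trees that did not train on X. The class labels in `tree_predictions` are made up purely for illustration.

```python
from collections import Counter

# Hypothetical class predictions for data point X from each of the 5 trees.
tree_predictions = {"P": "cat", "Q": "dog", "R": "cat", "S": "dog", "T": "cat"}

# Trees whose bootstrap subset contained X (P and R in the example).
trained_on_x = {"P", "R"}

# Keep only the votes from trees that never saw X during training.
oob_votes = [
    pred for tree, pred in tree_predictions.items() if tree not in trained_on_x
]

# Majority vote over the out-of-bag trees (Q, S and T).
oob_prediction = Counter(oob_votes).most_common(1)[0][0]
print(oob_votes, "->", oob_prediction)
```

With these made-up labels, the out-of-bag prediction for X differs from the full forest's majority vote, because P and R are excluded from the vote.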
After obtaining out-of-bag predictions for all samples, we compare them against the true labels using a metric of choice (accuracy, for instance) to get the out-of-bag score.
Done!
See…this technique allowed us to evaluate a random forest model on the training set.
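The whole procedure can be sketched by hand as follows. This is a simplified illustration, not sklearn's internal implementation: it assumes a classification task with integer labels 0..n_classes-1, uses plain majority voting, and builds the “forest” from individual `DecisionTreeClassifier` models. The function name `manual_oob_score` and the synthetic dataset are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def manual_oob_score(X, y, n_trees=50, seed=0):
    """OOB accuracy via majority vote; assumes labels are 0..n_classes-1."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    n_classes = int(y.max()) + 1
    # votes[i, c] = number of OOB trees that predicted class c for point i.
    votes = np.zeros((n_samples, n_classes), dtype=int)

    for _ in range(n_trees):
        # Bootstrap subset for this tree, and the rows it never drew.
        in_bag = rng.integers(0, n_samples, size=n_samples)
        oob_rows = np.setdiff1d(np.arange(n_samples), in_bag)
        if oob_rows.size == 0:
            continue  # extremely unlikely, but nothing to score for this tree

        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(2**31 - 1))
        )
        tree.fit(X[in_bag], y[in_bag])

        # Only points this tree never trained on receive its vote.
        preds = tree.predict(X[oob_rows])
        votes[oob_rows, preds] += 1

    # Score only the points that were out-of-bag for at least one tree.
    scored = votes.sum(axis=1) > 0
    oob_pred = votes[scored].argmax(axis=1)
    return float((oob_pred == y[scored]).mean())


X, y = make_classification(n_samples=300, n_features=10, random_state=0)
print(f"manual OOB accuracy: {manual_oob_score(X, y):.3f}")
```

Note that the votes are aggregated per data point first and scored once at the end, rather than scoring each tree on its own out-of-bag sample.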
Of course, I don’t want you to blindly adopt out-of-bag validation without understanding some of its advantages and considerations.
I have found out-of-bag validation to be particularly useful in the following situations:
In low-data situations, out-of-bag validation avoids the need to split the data while still providing a good proxy for model validation.
In large-data situations, traditional cross-validation techniques are computationally expensive because the model must be trained once per fold. Here, out-of-bag validation provides an efficient alternative, since the score comes from a single training run. It rests on a principle similar to cross-validation’s out-of-fold metric: every data point is scored only by models that never saw it during training.
And, of course, an inherent advantage of out-of-bag validation is that a data point is never scored by a tree that was trained on it, which guards against this form of data leakage.
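Luckily, out-of-bag validation is also neatly built into sklearn’s random forest implementation: setting oob_score=True when creating the model makes the fitted model expose the score as oob_score_.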
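A minimal usage sketch with sklearn is shown below. The synthetic dataset and the hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True requires bootstrap=True (the default for random forests).
model = RandomForestClassifier(
    n_estimators=100, oob_score=True, bootstrap=True, random_state=0
)
model.fit(X, y)

# Accuracy computed from out-of-bag predictions on the training set.
print(f"OOB accuracy: {model.oob_score_:.3f}")

# Per-sample OOB class probabilities, shape (n_samples, n_classes).
print(model.oob_decision_function_.shape)
```

For a classifier, `oob_score_` is an accuracy by default, and no separate validation split is needed to obtain it.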
The most significant consideration with the out-of-bag score is that it should be used with caution for model selection, model improvement, etc.
This is because if we repeatedly tune the model against it, we tend to overfit the out-of-bag score: the model is essentially being tuned to perform well on the data points that were left out during its training.
And if we consistently improve the model based on the out-of-bag score, that score becomes an overly optimistic estimate of its generalization performance.
If I were to share just one lesson here based on my experience, it would be that if we don’t have a true (and entirely different) held-out set for validation, we will overfit to some extent.
The decisions made may be too specific to the out-of-bag sample and may not generalize well to new data.
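One way to act on this, sketched below under assumed settings, is to use the out-of-bag score for choosing between candidate models while reserving a separate held-out set for the final check. The dataset, the `max_depth` candidates, and the split size are all arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
# The test split is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

best_oob, best_model = -1.0, None
for max_depth in (2, 4, 8, None):
    model = RandomForestClassifier(
        n_estimators=100, max_depth=max_depth, oob_score=True, random_state=0
    )
    model.fit(X_train, y_train)
    # Model selection uses only the OOB score from the training split.
    if model.oob_score_ > best_oob:
        best_oob, best_model = model.oob_score_, model

# Final, untouched estimate of generalization performance.
test_accuracy = best_model.score(X_test, y_test)
print(f"best OOB accuracy : {best_oob:.3f}")
print(f"held-out accuracy : {test_accuracy:.3f}")
```

Because the out-of-bag score was used to pick the model, the held-out accuracy is the number to report as the generalization estimate.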
That said, in a random forest, have you ever wondered:
Why is Bagging so effective?
Why do we sample rows from the training dataset with replacement?
How can we mathematically formulate the idea of Bagging and prove variance reduction?
Read this to learn more: Why Bagging is So Ridiculously Effective At Variance Reduction?
The article dives into the entire mathematical foundation of Bagging, which will help you:
Truly understand and appreciate the mathematical beauty of Bagging as an effective variance reduction technique.
Understand why the random forest model is designed the way it is.
Also, here’s an issue where we discussed a technique to condense a random forest into a decision tree model: Condense Random Forest into a Decision Tree.
The benefits?
This technique can:
Decrease the prediction run-time.
Improve interpretability.
Reduce the memory footprint.
Simplify the model.
Preserve the generalization power of the random forest model.
👉 Over to you: What other considerations would you like to add here about out-of-bag validation?
It's worth mentioning that with a large number of data points, each tree is trained on about 1 - 1/e (~63%) of the distinct training points on average. So each data point will tend to be used in training for roughly 3 out of every 5 trees, and be OOB for roughly 2 in 5 trees (~37%). So the OOB prediction for each point still draws on a large chunk of the overall ensemble.
The proof is left as an exercise for the reader :)
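Short of a proof, the ~63% figure can be checked numerically. The sketch below simulates bootstrap sampling with arbitrary sizes and compares the average fraction of distinct in-bag points with 1 - 1/e.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees = 10_000, 20  # arbitrary sizes for the simulation

in_bag_fractions = []
for _ in range(n_trees):
    # One bootstrap subset: n_samples draws with replacement.
    in_bag = rng.integers(0, n_samples, size=n_samples)
    # Fraction of distinct training points that made it into the subset.
    in_bag_fractions.append(np.unique(in_bag).size / n_samples)

mean_in_bag = float(np.mean(in_bag_fractions))
print(f"simulated in-bag fraction: {mean_in_bag:.3f}")
print(f"1 - 1/e                  : {1 - np.exp(-1):.3f}")
```

The remaining fraction, roughly 1/e, is the share of trees for which a given point is out-of-bag.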