Why Decision Trees Must Be Thoroughly Inspected After Training
Understanding the structural formulation of decision trees and how it can be problematic at times.
If we were to visualize the decision rules (the conditions evaluated at every node) of ANY decision tree, we would ALWAYS find them to be perpendicular to the feature axes, as depicted below:
In other words, every decision tree progressively segregates feature space based on such perpendicular boundaries to split the data.
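For a quick illustration, here is a minimal sketch that prints the learned rules of a scikit-learn tree (the Iris dataset is used purely as a stand-in, since the post's dataset isn't shared). Every printed condition has the form `feature <= threshold`, i.e., a split perpendicular to one feature axis:

```python
# A minimal sketch: Iris is just a stand-in dataset.
# Printing a fitted tree's rules shows every condition is axis-aligned.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```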
Of course, this is not a “problem” per se.
In fact, this perpendicular splitting is precisely what gives a decision tree the power to perfectly overfit any dataset (read the overfitting experiment section here to learn more).
However, this also brings up a pretty interesting point that is often overlooked when fitting decision trees.
More specifically, what would happen if our dataset had a diagonal decision boundary, as depicted below:
It is easy to guess that, in such a case, the decision boundary learned by a decision tree will appear as follows:
In fact, if we plot this decision tree, we notice that it creates so many splits just to fit this easily separable dataset, which a model like logistic regression, support vector machine (SVM), or even a small neural network can easily handle:
It becomes more evident if we zoom into this decision tree and notice how close the thresholds of its split conditions are:
This is a bit concerning because it clearly shows that the decision tree is painstakingly approximating a diagonal decision boundary with a staircase of axis-aligned splits, which hints that it might not be the best model to proceed with.
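To make this concrete, here is a minimal sketch on synthetic data (a toy band of points whose true boundary is the diagonal x2 = x1, standing in for the dataset above). Fitting an unconstrained tree and listing its split thresholds shows how many closely spaced, axis-aligned cuts it ends up making:

```python
# A minimal sketch on synthetic data: two classes separated only by the
# diagonal line x2 = x1, with points hugging that boundary.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
u = rng.uniform(-3, 3, size=n)          # spread along the diagonal
v = rng.uniform(-0.5, 0.5, size=n)      # small spread across the diagonal
X = np.column_stack([(u - v) / np.sqrt(2), (u + v) / np.sqrt(2)])
y = (v > 0).astype(int)                 # label = which side of x2 = x1

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Depth :", tree.get_depth())      # far deeper than the problem warrants
print("Leaves:", tree.get_n_leaves())

# Thresholds of the internal nodes that split on x1 — note how closely
# packed they are: the tree traces the diagonal with a staircase of tiny cuts.
is_split = tree.tree_.children_left != -1
on_x1 = tree.tree_.feature == 0
print(np.sort(tree.tree_.threshold[is_split & on_x1]))
```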
To double-check this, I often do the following:
1. Take the training data `(X, y)`. Shape of `X`: `(n, m)`; shape of `y`: `(n, 1)`.
2. Run PCA on `X` to project the data into an orthogonal space of `m` dimensions. This gives `X_pca`, whose shape is also `(n, m)`.
3. Fit a decision tree on `X_pca` and visualize it (thankfully, decision trees are always visualizable).
4. If the decision tree's depth is significantly smaller in this case, it validates that there is a diagonal separation.
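Here is a minimal sketch of this check on the same kind of toy diagonal dataset (so the exact numbers are illustrative, not those from the plots in this post):

```python
# A minimal sketch of the PCA check above, on toy diagonal data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(0)
n = 1000
u = rng.uniform(-3, 3, size=n)
v = rng.uniform(-0.5, 0.5, size=n)
X = np.column_stack([(u - v) / np.sqrt(2), (u + v) / np.sqrt(2)])  # shape (n, m)
y = (v > 0).astype(int)                                            # diagonal boundary

# Project X into an orthogonal space of the same dimensionality
X_pca = PCA(n_components=X.shape[1]).fit_transform(X)              # shape stays (n, m)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_pca = DecisionTreeClassifier(random_state=0).fit(X_pca, y)

# A much shallower tree on X_pca hints at a diagonal separation in the raw features
print("Depth on raw features:", tree_raw.get_depth())
print("Depth on PCA features:", tree_pca.get_depth())

plot_tree(tree_pca, filled=True)     # decision trees are always visualizable
plt.show()
```

The depth printed for `X_pca` should be far smaller, because the PCA rotation makes the diagonal boundary perpendicular to one of the projected axes.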
For instance, the PCA projections on the above dataset are shown below:
It is clear that the decision boundary on the PCA projections is almost perpendicular to the `X2` feature (the 2nd principal component).
Fitting a decision tree on this `X_pca` drastically reduces its depth, as depicted below:
This lets us determine that we might be better off using some other algorithm instead.
Or, we can spend some time engineering better features that the decision tree can easily separate with its perpendicular splits.
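As a rough sketch of that second route (the derived feature below is purely illustrative): if we suspect a diagonal boundary between two features, adding their difference as an extra column gives the tree an axis it can split perpendicular to, while the original, interpretable columns stay in the model:

```python
# A minimal, illustrative sketch of the feature-engineering route on the
# same toy diagonal dataset: add x2 - x1 as an extra feature.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
u = rng.uniform(-3, 3, size=n)
v = rng.uniform(-0.5, 0.5, size=n)
X = np.column_stack([(u - v) / np.sqrt(2), (u + v) / np.sqrt(2)])
y = (v > 0).astype(int)

X_eng = np.column_stack([X, X[:, 1] - X[:, 0]])   # add the "diagonal" feature

tree = DecisionTreeClassifier(random_state=0).fit(X_eng, y)
print("Depth with the engineered feature:", tree.get_depth())   # 1 for this toy data
```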
At this point, you might be wondering: why can't we just use the decision tree trained on `X_pca`?
While nothing stops us from doing that, do note that PCA components are not interpretable, and maintaining feature interpretability can be important at times.
Thus, whenever you train your next decision tree model, consider spending some time inspecting what it’s doing.
Before I end…
Through this post, I don’t intend to discourage the use of decision trees. They are the building blocks of some of the most powerful ensemble models we use today.
My point is to bring forward the structural formulation of decision trees and why/when they might not be an ideal algorithm to work with.
That said, did you know that decision tree models can be supercharged with tensor computations for up to 40x faster inference? Check out this article to learn more: Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.
Also, if you want to learn how PCA works end-to-end, along with its entire mathematical details, check out this article: Formulating the Principal Component Analysis (PCA) Algorithm From Scratch.
👉 Over to you: What other approaches would you use to handle diagonal datasets with decision tree models?
Thanks for reading!
Are you preparing for ML/DS interviews or want to upskill at your current job?
Every week, I publish in-depth ML deep dives. The topics align with the practical skills that typical ML/DS roles demand.
Join below to unlock all full articles:
Here are some of the top articles:
[FREE] A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.
A Detailed and Beginner-Friendly Introduction to PyTorch Lightning: The Supercharged PyTorch
Don't Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark.
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.
Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit
Join below to unlock all full articles:
👉 If you love reading this newsletter, share it with friends!
👉 Tell the world what makes this newsletter special for you by leaving a review here :)