How to Inspect Decision Trees After Training with PCA
Understanding the structural formulation of decision trees and how it can be problematic at times.
If we were to visualize the decision rules (the conditions evaluated at every node) of ANY decision tree, we would ALWAYS find them to be perpendicular to the feature axes, as depicted below:
In other words, every decision tree progressively segregates feature space based on such perpendicular boundaries to split the data.
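To see this for yourself, here is a minimal sketch (assuming scikit-learn and its built-in iris dataset, purely for illustration) that prints a fitted tree's rules. Notice that every condition compares a single feature against a threshold, i.e., every split is perpendicular to one feature axis:

```python
# Minimal sketch: print a fitted tree's split rules (scikit-learn assumed).
# Every rule has the form "feature <= threshold" — one feature at a time.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2", "x3", "x4"]))
```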
Of course, this is not a “problem” per se.
In fact, this perpendicular splitting is what gives a decision tree the power to perfectly overfit any dataset (read the overfitting experiment section here to learn more).
However, this also brings up a pretty interesting point that is often overlooked when fitting decision trees.
More specifically, what would happen if our dataset had a diagonal decision boundary, as depicted below:
It is easy to guess that in such a case, the decision boundary learned by a decision tree is expected to appear as follows:
In fact, if we plot this decision tree, we notice that it creates so many splits just to fit this easily separable dataset, which a model like logistic regression, support vector machine (SVM), or even a small neural network can easily handle:
It becomes more evident if we zoom into this decision tree and notice how close the thresholds of its split conditions are:
This is a bit concerning because it clearly shows that the decision tree is meticulously trying to mimic a diagonal decision boundary, which hints that it might not be the best model to proceed with.
To double-check this, I often do the following:
1. Take the training data (X, y). Shape of X: (n, m); shape of y: (n, 1).
2. Run PCA on X to project the data into an orthogonal space of m dimensions. This gives X_pca, whose shape is also (n, m).
3. Fit a decision tree on X_pca and visualize it (thankfully, decision trees are always visualizable).
4. If the decision tree's depth is significantly smaller in this case, it validates that there is a diagonal separation (see the sketch right after this list).
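Here is a minimal sketch of this check, assuming scikit-learn; the diagonally separable dataset below is synthetic and purely illustrative, not the exact dataset from the plots above:

```python
# Minimal sketch of the PCA check described above (scikit-learn assumed).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic, diagonally separable data: two elongated clusters on either
# side of the line x2 = x1 (illustrative only).
n = 500
t = rng.normal(0, 3, size=(2 * n, 1))            # spread along the diagonal
y = np.repeat([0, 1], n)
side = np.where(y == 0, -1.0, 1.0)[:, None]      # which side of the diagonal
along = np.array([[1.0, 1.0]]) / np.sqrt(2)
across = np.array([[-1.0, 1.0]]) / np.sqrt(2)
X = t * along + side * across + rng.normal(0, 0.3, size=(2 * n, 2))

# 1) A tree on the raw features mimics the diagonal boundary with many splits.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)

# 2) Project X into an orthogonal space of m dimensions and refit.
X_pca = PCA(n_components=X.shape[1]).fit_transform(X)
tree_pca = DecisionTreeClassifier(random_state=0).fit(X_pca, y)

print("depth on raw X:", tree_raw.get_depth())
print("depth on X_pca:", tree_pca.get_depth())
# A drastically smaller depth on X_pca hints at a diagonal separation in raw space.
```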
For instance, the PCA projections on the above dataset are shown below:
It is clear that the decision boundary on the PCA projections is almost perpendicular to the X2 feature (the second principal component).
Fitting a decision tree on this X_pca drastically reduces its depth, as depicted below:
This lets us determine that we might be better off using some other algorithm instead.
Or, we can spend some time engineering better features that the decision tree can easily handle with its perpendicular splits, as in the short sketch below.
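Continuing the sketch above, one hypothetical engineered feature (x1 − x2) is enough to turn the diagonal boundary into a single axis-perpendicular split:

```python
# Hypothetical engineered feature: the difference x1 - x2 captures the diagonal,
# so the tree can separate the classes with a single perpendicular split.
X_eng = np.column_stack([X, X[:, 0] - X[:, 1]])
tree_eng = DecisionTreeClassifier(random_state=0).fit(X_eng, y)
print("depth with engineered feature:", tree_eng.get_depth())  # typically 1
```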
At this point, you might be wondering: why can't we just use the decision tree trained on X_pca?
While nothing stops us from doing that, do note that PCA components are not interpretable, and maintaining feature interpretability can be important at times.
Thus, whenever you train your next decision tree model, consider spending some time inspecting what it’s doing.
Before I end…
Through this post, I don’t intend to discourage the use of decision trees. They are the building blocks of some of the most powerful ensemble models we use today.
My point is to bring forward the structural formulation of decision trees and why/when they might not be an ideal algorithm to work with.
That said, did you know that decision tree models can be supercharged with tensor computations for up to 40x faster inference? Check out this article to learn more: Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.
Also, in a recent post, we formulated and implemented the entire XGBoost algorithm from scratch. Read here: Formulating and Implementing XGBoost From Scratch.
Finally, if you want to learn how PCA works end-to-end, along with its entire mathematical details, check out this article: Formulating the Principal Component Analysis (PCA) Algorithm From Scratch.
👉 Over to you: What other ways might you use to handle diagonal datasets with decision tree models?
For those who want to build a career in DS/ML on core expertise, not fleeting trends:
Every week, I publish no-fluff deep dives on topics that truly matter to your skills for ML/DS roles.
For instance:
Conformal Predictions: Build Confidence in Your ML Model’s Predictions
Quantization: Optimize ML Models to Run Them on Tiny Hardware
5 Must-Know Ways to Test ML Models in Production (Implementation Included)
Implementing Parallelized CUDA Programs From Scratch Using CUDA Programming
And many, many more.
Join below to unlock all full articles:
SPONSOR US
Get your product in front of 88,000 data scientists and other tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.
To ensure your product reaches this influential audience, reserve your space here or reply to this email.