9 Comments

This is actually really neat and clever!

I'm guessing you could even base a tree's performance criterion on its weighted train and test scores, so that you neither overfit the training set nor overtune (overfit) the validation/test set.
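A minimal sketch of what such a weighted criterion might look like, assuming a scikit-learn RandomForestClassifier; the blending weight alpha, the toy data, and the use of a validation split for the second term are all illustrative assumptions, not from the original post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy setup: a fitted forest plus separate train and validation splits.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def weighted_tree_score(tree, alpha=0.5):
    """Blend a tree's train and validation accuracy so that neither an
    overfit nor an overtuned tree ends up ranked at the top."""
    return alpha * tree.score(X_train, y_train) + (1 - alpha) * tree.score(X_val, y_val)

# Rank the forest's individual trees by the blended criterion.
scores = [weighted_tree_score(t) for t in model.estimators_]
ranked = np.argsort(scores)[::-1]  # indices of the best trees first
```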

author

Of course. After publishing this, another idea I had was to measure the performance of the smaller forest on another held-out validation set and see if it translates well to that set too.

I like your idea, too, Omar :)

Oct 11, 2023 · Liked by Avi Chawla

Noice! The only thing I was thinking is that the forest is now tuned on the test set, so we don't know how it would perform on another dataset. It is always advised to make decisions about the model using the validation set, never the test set. So the only thing I would change is to reduce the number of trees using the validation set and then, at the end, check the performance on the test set.
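A minimal sketch of that workflow, assuming the scikit-learn setup from the article; for brevity the trees are kept in their original order rather than ranked first, and the deepcopy trick, splits, and names below are illustrative assumptions:

```python
from copy import deepcopy

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Three splits: train to fit, validation to tune, test for the final check.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Tune the number of trees on the validation set only.
results = []
for i in range(1, len(model.estimators_) + 1):
    small_model = deepcopy(model)
    small_model.estimators_ = model.estimators_[:i]  # keep the first i trees
    small_model.n_estimators = i
    results.append((i, small_model.score(X_val, y_val)))

best_i, best_val = max(results, key=lambda r: r[1])

# Touch the test set exactly once, for the final unbiased estimate.
final_model = deepcopy(model)
final_model.estimators_ = model.estimators_[:best_i]
final_model.n_estimators = best_i
print("Trees kept:", best_i, "| validation acc:", best_val,
      "| test acc:", final_model.score(X_test, y_test))
```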

author

Ohh yes, I am sorry. I meant the validation set there but mistakenly ended up writing the test set instead. Thanks for sharing this, Marcell :)


By the way, thanks a lot for your work! I really enjoy reading all of the emails. Keep up the good work Avi!

Oct 11, 2023 · Liked by Avi Chawla

Great. I love this very intuitive approach, although I would be careful not to reduce the number of trees so much that the model loses what is special about the forest. Maybe I would choose the largest number of trees before the accuracy starts to go down. Anyway, thanks for this approach.

PS: In your very last bullet point, should it not be “decrease the run time” instead of “increase the runtime”?

author

Of course, Francois. That is why, when I thought of this approach, my intention was to keep only the best trees, and only up to the point where the accuracy starts to drop. I am giving this approach much more thought and thinking of formalising it even more if possible.

And thanks for pointing out the mistake. Corrected it :)


I think the complete process (code) should use only the train set. Meaning: instead of "result.append([i, small_model.score(X_test, y_test)])", it needs to be "result.append([i, small_model.score(X_train, y_train)])". That way your code will be robust and not overfit.


Because there are 2^n ways to select a subset of the original n trees in the random forest, this may lead to overfitting to the test set. You may want to use a validation set to select the top k trees and then evaluate the performance on a holdout test set.
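A minimal sketch of that suggestion, where ranking individual trees by validation accuracy stands in (greedily) for searching the 2^n subsets; the value of k, the toy data, and the attribute-patching trick are illustrative assumptions:

```python
from copy import deepcopy

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

k = 20  # how many trees to keep (illustrative)

# Greedy proxy for searching the 2^n subsets: rank individual trees by
# their validation accuracy and keep the top k.
val_scores = [t.score(X_val, y_val) for t in model.estimators_]
top_k = np.argsort(val_scores)[::-1][:k]

pruned = deepcopy(model)
pruned.estimators_ = [model.estimators_[i] for i in top_k]
pruned.n_estimators = k

# The holdout test set is used only for this final, unbiased estimate.
print("Holdout test accuracy:", pruned.score(X_test, y_test))
```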
