This is actually really neat and clever!
I'm guessing you could even judge each tree by a weighted combination of its train and test scores, so that you neither overfit the training set nor overtune (overfit) the validation/test set.
Of course. After publishing this, another idea I had was to measure the performance of the smaller tree on another held-out validation set and see if this translates well on that set too.
I like your idea, too, Omar :)
Noice! The only thing I was thinking is that the model is now tuned for the test set, and we don't know how it would perform on another data set. It is always advised to make decisions about the model using the validation set, and never the test set. So the only thing I would change is to reduce the number of trees using the validation set, and then check the performance on the test set at the end.
Ohh yes, I am sorry, there I meant validation set but mistakenly ended up writing the test set instead. Thanks for sharing this, Marcell :)
By the way, thanks a lot for your work! I really enjoy reading all of the emails. Keep up the good work Avi!
Great. I love this very intuitive approach. Although I would be careful not to reduce the number of trees so much that it loses what is special about the forest. Maybe I would have chosen the max number of trees before the accuracy goes down. Anyway, thanks for this approach.
PS: In your very last bullet point, should it not be “decrease the run time” instead of “increase the runtime”?
Of course, Francois. That is why, when I thought of this approach, my intention was to keep only the best trees, and only up to the point where the accuracy starts to drop. I am giving this approach much more thought and thinking of formalising it even more if possible.
And thanks for pointing out the mistake. Corrected it :)
I think the complete process (code) should use only the training set. Meaning: instead of "result.append([i, small_model.score(X_test, y_test)])", it needs to be "result.append([i, small_model.score(X_train, y_train)])". That way your code will be robust and will not overfit.
Because there are 2^n ways to select a subset of the original n trees in the random forest, this may lead to overfitting to the test set. You may want to use a validation set to select the top k trees and then evaluate the performance on a holdout test set.
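A minimal sketch of the validation-based selection suggested in several comments above. It is not the post's original code. It assumes a scikit-learn RandomForestClassifier, a synthetic dataset from make_classification standing in for the real data, and a 60/20/20 train/validation/test split. The helper subset_accuracy and the names ranked, val_scores and best_k are hypothetical, introduced only for illustration. Trees are ranked and the subset size k is chosen on the validation set only; the test set is scored once at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the post's dataset (an assumption).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=0)

# Three-way split: 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)


def subset_accuracy(trees, X, y):
    """Accuracy of a soft vote (averaged class probabilities) over `trees`."""
    proba = np.mean([tree.predict_proba(X) for tree in trees], axis=0)
    predictions = model.classes_[np.argmax(proba, axis=1)]
    return float(np.mean(predictions == y))


# Rank individual trees by their accuracy on the validation set, best first.
tree_val_acc = np.array(
    [subset_accuracy([tree], X_val, y_val) for tree in model.estimators_])
order = np.argsort(tree_val_acc)[::-1]
ranked = [model.estimators_[i] for i in order]

# Choose k (how many of the top trees to keep) on the validation set only.
val_scores = [subset_accuracy(ranked[:k], X_val, y_val)
              for k in range(1, len(ranked) + 1)]
best_k = int(np.argmax(val_scores)) + 1

# The test set is touched exactly once, after every decision has been made.
small_test_acc = subset_accuracy(ranked[:best_k], X_test, y_test)
full_test_acc = float(model.score(X_test, y_test))

print(f"kept {best_k} of {len(ranked)} trees")
print(f"test accuracy, top-{best_k} trees: {small_test_acc:.3f}")
print(f"test accuracy, full forest: {full_test_acc:.3f}")
```

Because the ranking and the choice of k never see the test set, the final number is an honest estimate. Whether the smaller forest matches the full one will depend on the data, so the two test accuracies are printed side by side rather than assumed.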