3 Comments

Better still is not to impute anything, but rather to leave it up to each model how to treat missing values. For example, distance functions can be customised, and in some cases made asymmetric (e.g. to reflect some aspect of the application domain). Preprocessing the data presupposes downstream purposes, and those might change over time.
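To illustrate, one common convention for a NaN-aware distance is to compare only the dimensions both points have observed, then rescale to the full dimensionality. A minimal sketch (my own illustration, not from the post; the rescaling convention is an assumption):

```python
import numpy as np

def nan_aware_distance(a, b):
    """Euclidean distance over dimensions where BOTH vectors are
    observed, rescaled to the full dimensionality (one common
    convention; an assumption here, not the only choice)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = ~(np.isnan(a) | np.isnan(b))
    if not mask.any():
        return np.nan  # no shared observed dimensions
    d = np.sqrt(np.sum((a[mask] - b[mask]) ** 2))
    # Rescale so sparsely observed pairs aren't artificially close.
    return d * np.sqrt(len(a) / mask.sum())

# The missing middle dimension is simply skipped, no imputation needed.
print(nan_aware_distance([1.0, np.nan, 3.0], [2.0, 5.0, 3.0]))
```

A model using a distance like this never needs the data filled in, so the "downstream purpose" question disappears.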


An outlier-tolerant statistic such as the median would be better than the mean, though in your example that would just place the spike in a "better" place.
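The difference is easy to see on a toy feature with one outlier (a sketch of my own, not from the post):

```python
import numpy as np

# A feature with an outlier (100.0) and one missing value.
x = np.array([1.0, 2.0, 3.0, 100.0, np.nan])

mean_fill = np.nanmean(x)      # 26.5: dragged far up by the outlier
median_fill = np.nanmedian(x)  # 2.5: unaffected by the outlier

# Median imputation fills the gap with a value typical of the bulk of
# the data, but (as noted above) the spike itself is still there.
x_imputed = np.where(np.isnan(x), median_fill, x)
print(mean_fill, median_fill, x_imputed)
```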


I've always wanted to try something like this. The iterative approach is interesting. I wonder how much the iteration improves on a single step with a model trained on the non-missing values.

Also, if every feature has some missing data, I guess you can't use MissForest without some modification. Maybe there's another algorithm for this case.
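For what it's worth, scikit-learn's `IterativeImputer` handles the case where every feature has missing values: it starts from a simple initial fill (the mean by default) and then round-robins a regression model over the features, which also makes it easy to compare a single pass against full iteration. A sketch (my toy data, not from the post):

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Every column has at least one missing value.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [np.nan, 8.0, 9.0],
              [7.0, 2.0, 6.0]])

# Roughly "a single step": one round-robin pass over the features...
single = IterativeImputer(max_iter=1, random_state=0).fit_transform(X)

# ...versus iterating until the imputations stabilise (or max_iter hits).
iterated = IterativeImputer(max_iter=20, random_state=0).fit_transform(X)

print(single)
print(iterated)
```

Comparing `single` and `iterated` on held-out masked values would be one way to measure how much the iteration actually buys you.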
