Hi,

I am trying to understand how the rforest command in Stata handles missing values in the independent variables. Here is the description in the help file:

The independent variables may contain missing values. Splits at any node can occur even if some independent variables are missing. If the independent variable is missing from an observation, it will be ignored for estimation but predictions can still be made on the observation. If the dependent variable for the training data contains missing values, the function will exit with an error message. In other words, any missing values in the dependent (response) variable in the training set needs to be imputed or excluded prior to executing the rforest command.
In particular, could you help me understand whether:
  1. It drops all observations with a missing value in any independent variable before running?
OR
  2. When randomly selecting variables at a given node, it drops only the observations that have missing values for the selected variables?
This is puzzling to me because I have tried to estimate a model on a dataset in which at least one variable is systematically missing in every observation. Something like this:
a  | b  | c
-----------
1  | NA | 5
2  | NA | 4
NA | 1  | 3
NA | 2  | 6
And I find that my performance is better when I include the whole dataset than when I run the model separately on the first two rows and on the last two rows. This should not happen if observations with missing values were systematically dropped.
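To make the two possibilities concrete, here is a minimal Python sketch of what each one would mean for the toy data. It is not Stata code and not rforest's actual implementation; I am assuming that c is the dependent variable and that a and b are the independent variables, and the function names are hypothetical.

```python
# Toy data from the example; None stands in for a missing value (NA).
rows = [
    {"a": 1, "b": None, "c": 5},
    {"a": 2, "b": None, "c": 4},
    {"a": None, "b": 1, "c": 3},
    {"a": None, "b": 2, "c": 6},
]
predictors = ["a", "b"]  # assumption: c is the dependent variable


def listwise_complete(rows, predictors):
    """Option 1: keep only rows with no missing predictor at all."""
    return [r for r in rows if all(r[p] is not None for p in predictors)]


def usable_for_split(rows, variable):
    """Option 2: keep rows where the single variable tried at a node is observed."""
    return [r for r in rows if r[variable] is not None]


print(len(listwise_complete(rows, predictors)))  # rows left under option 1
print(len(usable_for_split(rows, "a")))          # rows usable for a split on a
print(len(usable_for_split(rows, "b")))          # rows usable for a split on b
```

In this toy example no rows would survive under option 1, while under option 2 each candidate split would still see two rows.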

Thanks a lot,
Best,

Martin