Hi,

I have a data set with 10k observations on Y and an endogenous regressor X that is missing for most observations (90%). I have an instrument Z with no missing values. I know that the values of X are missing at random.

I think the naive approach would be to run
Code:
ivregress 2sls Y (X=Z)
which does not seem optimal to me, because it ignores the information on Y and Z in all the observations with missing X. Since X is missing at random, I can assume that the relationship between X and Z is the same among those observations.
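
To make the loss of information concrete, here is a small check of how many observations the naive command actually uses. It is only a sketch with my variable names Y, X and Z; since ivregress drops every observation with a missing value in any of the variables, I would expect both numbers to be the same.

```stata
* Number of complete cases (Y, X and Z all observed)
count if !missing(Y, X, Z)
display "Complete cases: " r(N)

* Naive IV estimate; the estimation sample is stored in e(N)
ivregress 2sls Y (X = Z)
display "Observations used by ivregress: " e(N)
```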

Ideally, I would run the first stage on the subsample with non-missing X and the second stage on the full sample, which should give me more power.
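
A rough sketch of what I have in mind, done by hand. This is my own guess at an implementation, not an established command: the program name ts2sls, the number of replications and the seed are arbitrary choices of mine. As far as I understand, the standard errors from the manual second stage are not valid, because they ignore that the fitted values are estimated, so I wrap both stages in a bootstrap.

```stata
* Hypothetical helper: both stages in one program so they can be bootstrapped together
capture program drop ts2sls
program define ts2sls, rclass
    tempvar xhat
    * First stage: only observations with non-missing X
    quietly regress X Z if !missing(X)
    * Fitted values for all observations (Z is never missing)
    quietly predict double `xhat', xb
    * Second stage: full sample, fitted values in place of X
    quietly regress Y `xhat'
    return scalar b = _b[`xhat']
end

* Point estimate on the original data
ts2sls
display "Two-stage estimate: " r(b)

* Bootstrap both stages jointly (reps and seed are arbitrary)
bootstrap b = r(b), reps(500) seed(12345): ts2sls
```

With a single instrument, the point estimate from this should equal the full-sample coefficient of Y on Z divided by the first-stage coefficient of X on Z from the subsample, so the extra observations only enter through the reduced form.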

How is this possible in Stata?
Are there issues I am neglecting?
Are there papers about this?

PS: I am cross-posting a similar question here: https://stats.stackexchange.com/ques...ous-regressors