Hi,

I am trying to estimate a two-step model to account for selection bias. My question is about which variables to include in each step. I am replicating the approach in a paper. In the first step (the probit), the authors include a number of variables (e.g. gender, age, marital status). In the second step (the regression), they start with only gender as a control and then sequentially add variables to see how family or labour characteristics affect the gender gap.

Should I keep all the variables in the probit and only add the new control variables sequentially in the regression? Some variables appear in both steps, which I understand is OK. I just want to check whether this approach is correct, or whether I should instead use the same variables in both steps and add just one extra variable to the probit, so that the estimation does not suffer from endogeneity. My code is below.

capture program drop qr11a
program define qr11a, rclass
probit prob1 gender Dage Dage1 Dchild childI /*
*/ Dmar DmarI DmarII marI marII marIII ageI ageII earningsD earningsI if ra0300<61
tempname b
mat `b' = e(b)
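* Linear index from the first-step probit
predict double xb1, xb
* Inverse Mills ratio: lambda = phi(xb)/Phi(xb)
generate double phi = normalden(xb1)
generate double PHI = normal(xb1)
generate double lambda = phi/PHI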
reg nwealtht gender lambda Dage Dage1 Deduc Deduc1 Dchild nmar nmarI nmarII if partner==0 & ra0300<61
* Return the second-step (regression) coefficients on gender and lambda
return scalar b_g = _b[gender]
return scalar b_l = _b[lambda]
* Drop all generated variables so the program can run again on the next imputation
drop xb1 phi PHI lambda
end
mi estimate, cmdok vceok: qr11a nwealtht gender lambda Dage Dage1 Deduc Deduc1 Dchild nmar nmarI nmarII if partner==0 & ra0300<61
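
To make the first option concrete, here is a rough sketch of what I mean by keeping the full probit and adding controls sequentially in the regression. It is only an illustration: it ignores the multiple imputation (no mi estimate), the names xb_sel, imr, m1, m2 and m3 are made up for this example, and the way I split the controls into three blocks is hypothetical, not the paper's actual sequence. It is written for a do-file (the /// line continuation does not work interactively).

```stata
* Step 1: selection probit with the full set of variables, estimated once
probit prob1 gender Dage Dage1 Dchild childI Dmar DmarI DmarII ///
    marI marII marIII ageI ageII earningsD earningsI if ra0300<61

* Inverse Mills ratio from the probit index (same formula as in qr11a)
predict double xb_sel if e(sample), xb
generate double imr = normalden(xb_sel)/normal(xb_sel)

* Step 2: nested regressions, each with the SAME imr from step 1.
* Note: these OLS standard errors are not adjusted for imr being
* an estimated regressor.
regress nwealtht gender imr if partner==0 & ra0300<61
estimates store m1

regress nwealtht gender imr Dage Dage1 if partner==0 & ra0300<61
estimates store m2

regress nwealtht gender imr Dage Dage1 Deduc Deduc1 Dchild ///
    nmar nmarI nmarII if partner==0 & ra0300<61
estimates store m3

* Compare the gender and selection-term coefficients across specifications
estimates table m1 m2 m3, keep(gender imr) b se

drop xb_sel imr
```

The second option would differ only in step 1, where the probit would be re-estimated for each specification with the same controls as the regression plus one extra variable.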

Thanks,
Ilias