Hello everyone, I hope you all are staying safe out there.

I am trying to build a two-stage model in which the first stage has a binary dependent variable. I have chosen a probit model to estimate that stage.

My setup:
Xit = 1[ γWit + eit > 0 ], so that P(Xit = 1 | Wit) = Φ(γWit), where Φ is the standard Normal cumulative distribution function
Yit = αZit + βXit + uit, where Z may also contain elements of W

I realize this subject has been discussed ad nauseam in this forum, but it is hard to distill a single recommendation from those threads. First, let me link to the relevant posts; my questions follow.

1. 2SLS with Binary Endogenous Variable and linear second stage:
https://www.statalist.org/forums/for...enous-variable
  • Recommended solution is to use either plain 2SLS with a linear first stage (which ignores the fact that X is binary), or
  • Use the solution from Wooldridge (2002, 2010), which is a 3-step process: estimate the probit, obtain its predicted probabilities, then run 2SLS using those predicted probabilities as the instrument for X
  • I assume either version of 2SLS is appropriate, depending on data type (ivregress, or ivreg in older Stata versions, for cross-sections; xtivreg for panels)
  • 2SLS is consistent in both cases, though you lose some precision in the first case because it ignores the binary nature of X
2. 2SLS: Binary Second Stage with Binary Endogenous Variable: https://www.statalist.org/forums/for...ndent-variable
  • Recommended solution is to use either 2SLS (again), which is what Angrist and Pischke recommend in "Mostly Harmless Econometrics", or
  • Use biprobit to jointly estimate both equations by maximum likelihood
  • Wooldridge notes in that post: "A method that plugs in fitted values into nonlinear second stages should be assumed inconsistent unless you prove otherwise."
3. Probit 2SLS: https://stats.stackexchange.com/ques...t-squares-2sls
  • We cannot use the probit model as its own first stage because "neither the conditional expectation nor the linear projection operator passes through nonlinear functions", as discussed in Wooldridge (2010, p. 267).
4. The other question that comes up is whether we can use a control function approach. But, as Wooldridge notes here on page 10 (https://www.nber.org/WNE/Slides7-31-...ntrolfuncs.pdf): "CF approaches are more difficult to apply to nonlinear models, even relatively simple ones. Methods are available when the endogenous explanatory variables are continuous, but few if any results apply to cases with discrete first stages."
  • Despite this, we also have access to the etregress command in Stata. This was first mentioned in the thread linked in #1 above.
  • Also mentioned here in a Statalist archive: https://www.stata.com/statalist/arch.../msg00339.html
  • etregress gives the option to use MLE, two-step estimation, or a control function approach (as of Stata 14, I think)
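
To make the second option in #1 concrete, here is a minimal simulation sketch of that 3-step procedure as I understand it (probit, predicted probabilities, then IV with those probabilities as the instrument for X). Everything specific in it is my own assumption for illustration, not something taken from the linked threads: the data-generating process, a single instrument w, no extra covariates Z, the true values (β = 1, error correlation 0.5), and the hypothetical helper names `simulate`, `fit_probit`, and `cov`. It uses only the Python standard library and reports no standard errors.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # Standard normal CDF via the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def simulate(n, beta=1.0, rho=0.5, seed=0):
    """Assumed DGP: x = 1[w + e > 0], y = 0.5 + beta*x + u, corr(e, u) = rho."""
    rng = random.Random(seed)
    w, x, y = [], [], []
    for _ in range(n):
        wi = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        u = rho * e + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xi = 1.0 if wi + e > 0.0 else 0.0
        w.append(wi)
        x.append(xi)
        y.append(0.5 + beta * xi + u)
    return w, x, y

def fit_probit(x, w, iters=50, tol=1e-10):
    """Newton-Raphson probit of binary x on a constant and w."""
    g0, g1 = 0.0, 0.0
    for _ in range(iters):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for xi, wi in zip(x, w):
            q = 2.0 * xi - 1.0
            eta = g0 + g1 * wi
            # Score weight; the pdf is symmetric, so pdf(q*eta) == pdf(eta).
            lam = q * norm_pdf(eta) / max(norm_cdf(q * eta), 1e-300)
            k = lam * (lam + eta)  # weight in the negative Hessian
            s0 += lam
            s1 += lam * wi
            h00 += k
            h01 += k * wi
            h11 += k * wi * wi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * s0 - h01 * s1) / det
        d1 = (h00 * s1 - h01 * s0) / det
        g0 += d0
        g1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return g0, g1

def cov(a, b):
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

w, x, y = simulate(20000)

# Step 1: probit of the binary endogenous regressor on the instrument.
g0, g1 = fit_probit(x, w)
# Step 2: predicted probabilities from the probit.
xhat = [norm_cdf(g0 + g1 * wi) for wi in w]
# Step 3: just-identified IV, with xhat as the instrument for x.
b_iv = cov(xhat, y) / cov(xhat, x)
# Naive OLS for comparison; it ignores the endogeneity of x.
b_ols = cov(x, y) / cov(x, x)

print("probit slope:", g1)
print("OLS beta:", b_ols)
print("IV beta (probit fitted values as instrument):", b_iv)
```

Note that xhat is used as an instrument for X here, not substituted for X in the second stage; as I read the Wooldridge quote in #2, the consistency warning is about plugging fitted values into nonlinear second stages, which this sketch does not do. With one endogenous regressor, one instrument, and a constant, the IV estimate reduces to the ratio of covariances used in step 3.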


OK, with this information laid out (along with countless other posts that I read through), here are my questions:
  1. People commonly refer to the procedure in Wooldridge (2010), but I cannot find an explicit page reference for it in the 2010 edition. In Section 9.5.2 on page 268, there is a similar discussion regarding a squared first-stage covariate, but not a binary first-stage covariate... but perhaps this is what everyone is referring to? I have combed through the book and cannot seem to find it. Can someone provide the exact reference for this procedure so I can cite it correctly?
  2. Similarly, is there a parallel discussion for this procedure for panel data in the book? The context on p268 is cross-sectional.
  3. Given Wooldridge's comments about the difficulties of applying a CF approach to a nonlinear model with a discrete first stage, how can I trust the output of etregress if I select the CF option?
Thank you all for your time... I hope my post is helpful for aggregating some of this information and can be useful going forward.

Best regards,
RJ