Greetings everyone,

I specify two models in my study with two different dependent variables; one of them is a 0-1 dummy (y1), and the other is a count variable (y2). For each model, the independent variable of interest is a count variable (w), which is potentially endogenous. Thus, in my situation, I encounter two cases: 1) a logit model with a count endogenous explanatory variable and 2) a negative binomial model with a count endogenous explanatory variable.

To address this possible endogeneity, I initially considered a 2SLS-style approach in which the endogenous count variable is replaced with its fitted values from a negative binomial first-stage regression. However, I read on Statalist that simply mimicking standard 2SLS in non-linear models may not be an appropriate way to correct for endogeneity. I therefore decided to use the control function approach (a two-stage residual inclusion, or 2SRI, approach) proposed by Terza et al. (2008), "Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling", as adjusted by Wooldridge (2014), "Quasi-maximum likelihood estimation and testing for nonlinear models with endogenous explanatory variables".

Specifically, I address the endogeneity problem in my case as follows with Stata commands:

1) In the first stage of 2SRI, a negative binomial regression is used in which the count endogenous variable (w) is regressed on two instruments (z1 and z2) and a set of controls (x1...xn):

nbreg w z1 z2 x1...xn, vce(cluster Firm)

2) Compute the generalized residuals (gr), as suggested by Wooldridge (2014):

predict gr, score

3) In the second stage of 2SRI, the generalized residuals, along with the count endogenous variable, are added to my two outcome models. Recall that y1 is a dummy and y2 is a count:

logit y1 w gr x1...xn, vce(cluster Firm)

nbreg y2 w gr x1...xn, vce(cluster Firm)
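
For reference, my understanding of the generalized residual in step 2 is that it is the score of the first-stage log-likelihood with respect to the linear index. As a sketch, assuming the NB2 (mean-dispersion) parameterization, and using my own notation rather than that of the cited papers (v_i collects z1, z2, and the controls; delta is the first-stage coefficient vector; alpha is the dispersion parameter), it would be:

```latex
\[
\mu_i = \exp(\mathbf{v}_i'\boldsymbol{\delta}), \qquad
gr_i = \frac{\partial \ell_i}{\partial(\mathbf{v}_i'\boldsymbol{\delta})}
     = \frac{w_i - \mu_i}{1 + \alpha\,\mu_i}
\]
```

As alpha approaches zero this reduces to the Poisson residual w_i - mu_i. Please correct me if this is not the quantity intended in step 2.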

Given this setup, I have two questions:

Q1: Are the procedures and Stata commands described above correct?

Q2: How can I evaluate the relevance and exogeneity of my two instruments, z1 and z2? Can I use a partial chi-square test of the instruments in the first stage to test for relevance? Also, can I apply the standard overidentification test in the non-linear context by regressing the second-stage residuals on z1, z2, and the other controls (x1...xn) and multiplying the resulting R2 by 2 (the number of instruments) to get the test statistic?
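
To make the relevance part of Q2 concrete, here is a minimal sketch of the first-stage check I have in mind. The local macro `controls` is only a hypothetical stand-in for x1...xn, and whether a joint Wald test is the right tool here is exactly what I am asking:

```stata
* Sketch only: "controls" is a hypothetical macro; replace x1 x2 with the actual control list
local controls x1 x2

* First stage, as in step 1
nbreg w z1 z2 `controls', vce(cluster Firm)

* Joint Wald test that the coefficients on both instruments are zero
test z1 z2
```

With the clustered variance estimator, I would expect this to report a chi-square statistic with 2 degrees of freedom, one per instrument.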

I apologize for this long post.

Kindly help me answer my two questions. I am looking forward to your helpful insights.