Hello,

I'd like to estimate a two-stage least squares regression manually: run the first stage, then run the second stage using the predicted X. Done this way, the second-stage standard errors need to be adjusted to account for the predicted X being a generated regressor.

I did so using code from a Stata.com post and a Kit Baum Statalist post. With clustering, however, the degrees-of-freedom adjustment isn't quite right, and I can't figure out how to fix it. I can get close, but my manual SEs still don't match those from ivregress (they do match when I don't cluster).
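
For reference, the unclustered adjustment in the code below can be written out as follows. The notation is introduced here only for illustration: w_i stands for the exogenous controls (the i and t dummies), \hat{\beta}_x and \hat{\gamma} are the second-stage estimates, \hat{\sigma}_{2nd} is the second-stage RMSE, and N - k is the second-stage residual degrees of freedom.

```latex
\tilde{u}_i = y_i - x_i\hat{\beta}_{x} - w_i'\hat{\gamma}, \qquad
\tilde{\sigma} = \sqrt{\frac{\sum_i \tilde{u}_i^{2}}{N-k}}, \qquad
\widehat{\mathrm{SE}}_{\text{adj}} = \widehat{\mathrm{SE}}_{\text{2nd}} \cdot \frac{\tilde{\sigma}}{\hat{\sigma}_{\text{2nd}}}
```

That is, the residuals are recomputed with the true x in place of the fitted x, and the second-stage SE is rescaled by the ratio of the two RMSEs.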

Can someone correct my SE adjustment so that the manual two-stage procedure recovers the SEs from ivregress?

Below is code that generates an illustrative dataset, computes the accurate correction without clustering (very similar to the code in the links above, but slightly more automated), computes an "almost accurate" correction with clustering, and repeats the whole thing with reghdfe in addition to reg. The reghdfe version has to be done slightly differently because its predict works differently; I'm including it because I thought it might be useful to others.

Thank you,
Mitch

Code:
set seed 30819

******************
/*    DATA SETUP    */
******************

clear

* Generate an unbalanced panel of 1000 observations
set obs 1000
gen n = _n
gen i = ceil(100*runiform())
sort i n
by i: gen t = _n
*tab i, sort
*tab t

* Create need for i fixed effects
gen temp1 = rnormal()
egen a = mean(temp1), by(i)
drop temp1

* Create need for t fixed effects
gen temp2 = t + rnormal()
egen d = mean(temp2), by(t)
drop temp2

gen u1 = rnormal()
gen u2 = rnormal()*2
gen v = rnormal()
gen z = rnormal()

gen x = a + d + z/3 + u1 + v

gen y = a + d + x/2 + u1 + u2

**********************************
/*    REG VERSION WITH DUMMIES    */
/*        NO CLUSTERING            */
**********************************

* First stage
qui: reg x z i.i i.t
predict xfit, xb

* Second stage
qui: reg y xfit i.i i.t
local st2se = _se[xfit]
local st2rmse = `e(rmse)'
di `e(df_r)'
local dfr = `e(df_r)'
di `dfr'

* Getting "corrected" residuals using the true X and the IV-estimated coefficients
replace xfit = x
predict cst2e, resid

* Getting "corrected" sum of squared errors
gen cst2e2 = cst2e^2
qui: sum cst2e2
local csse = `r(sum)'
di `csse'

* Original SEs with no correction for the generated regressor
di `st2se'
* Actual IV standard errors
qui: ivregress 2sls y i.i i.t (x = z)
di _se[x]
* Manually calculated/adjusted SE's
di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
* Actual IV standard errors with small sample correction
qui: ivregress 2sls y i.i i.t (x = z), small
di _se[x]

drop xfit cst2e cst2e2


**********************************
/*    REG VERSION WITH DUMMIES    */
/*        WITH CLUSTERING            */
**********************************

* Note: Clustering matters
qui: reg x z i.i i.t
di _se[z]
qui: reg x z i.i i.t, cluster(i)
di _se[z]

* First stage
qui: reg x z i.i i.t, cluster(i)
predict xfit, xb

* Second stage
qui: reg y xfit i.i i.t, cluster(i)
local st2se = _se[xfit]
local st2rmse = `e(rmse)'
di `e(df_r)'
local dfr = (`e(N)' - `e(df_m)' - `e(df_r)')
di `dfr'

* Getting "corrected" residuals using the true X and the IV-estimated coefficients
replace xfit = x
predict cst2e, resid

* Getting "corrected" sum of squared errors
gen cst2e2 = cst2e^2
qui: sum cst2e2
local csse = `r(sum)'
di `csse'

* Original SEs with no correction for the generated regressor
di `st2se'
* Actual IV standard errors
qui: ivregress 2sls y i.i i.t (x = z), cluster(i)
di _se[x]
* Manually calculated/adjusted SE's
di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
* Actual IV standard errors with small sample correction
qui: ivregress 2sls y i.i i.t (x = z), cluster(i) small
di _se[x]
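
* --- Illustrative sketch (an assumption, not a confirmed fix) ---
* With clustering, the VCE is a sandwich built from within-cluster sums of
* (regressor x residual), so rescaling by a ratio of RMSEs cannot reproduce
* it in general. The lines below rebuild the cluster-robust variance of the
* coefficient on x by hand, using the corrected residuals cst2e from above.
* All variable names introduced here (xhat2, xtil, sc, gsc, gtag, gsc2, xtil2)
* are hypothetical and exist only for this sketch.
qui: reg x z i.i i.t
predict double xhat2, xb
* Partial the exogenous dummies out of the fitted values (Frisch-Waugh-Lovell)
qui: reg xhat2 i.i i.t
predict double xtil, resid
* "Meat": squared within-cluster sums of xtil * corrected residual
gen double sc = xtil*cst2e
egen double gsc = total(sc), by(i)
egen byte gtag = tag(i)
gen double gsc2 = gsc^2 if gtag
qui: sum gsc2
local meat = r(sum)
* "Bread": sum of squared partialled-out fitted values
gen double xtil2 = xtil^2
qui: sum xtil2
local bread = r(sum)
* Number of clusters
qui: count if gtag
local G = r(N)
* SE with no finite-sample factor, then with a G/(G-1) factor. Which factor
* ivregress applies with and without "small" is not confirmed here and should
* be checked against its Methods and formulas section.
di sqrt(`meat')/`bread'
di sqrt(`G'/(`G'-1))*sqrt(`meat')/`bread'
drop xhat2 xtil sc gsc gtag gsc2 xtil2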

drop xfit cst2e cst2e2

**********************
/*    REGHDFE VERSION    */
/*    NO CLUSTERING    */
**********************

* First stage
qui: reghdfe x z, absorb(i t, savefe) resid
predict xfit, xbd

* Second stage
qui: reghdfe y xfit, absorb(i t, savefe) resid
predict st2e, resid
local st2se = _se[xfit]
local st2rmse = `e(rmse)'
local dfr = `e(df_r)'

* Getting "corrected" residuals using the true X and the IV-estimated coefficients
* This is built by hand because predict after reghdfe doesn't work if you change the X values
gen cst2e = st2e + (xfit - x)*_b[xfit]

* Getting "corrected" sum of squared errors
gen cst2e2 = cst2e^2
qui: sum cst2e2
local csse = `r(sum)'

* Original SEs with no correction for the generated regressor
di `st2se'
* Actual IV standard errors
qui: ivreghdfe y (x = z), absorb(i t)
di _se[x]
* Manually calculated/adjusted SE's
di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')

drop xfit st2e cst2e cst2e2

**********************
/*    REGHDFE VERSION    */
/*    WITH CLUSTERING    */
**********************

* First stage
qui: reghdfe x z, absorb(i t, savefe) resid cluster(i)
predict xfit, xbd

* Second stage
qui: reghdfe y xfit, absorb(i t, savefe) resid cluster(i)
predict st2e, resid
local st2se = _se[xfit]
local st2rmse = `e(rmse)'
local dfr = (`e(N)' - `e(df_m)' - `e(df_r)')

* Getting "corrected" residuals using the true X and the IV-estimated coefficients
* This is built by hand because predict after reghdfe doesn't work if you change the X values
gen cst2e = st2e + (xfit - x)*_b[xfit]

* Getting "corrected" sum of squared errors
gen cst2e2 = cst2e^2
qui: sum cst2e2
local csse = `r(sum)'

* Original SEs with no correction for the generated regressor
di `st2se'
* Actual IV standard errors
qui: ivreghdfe y (x = z), absorb(i t) cluster(i)
di _se[x]
* Manually calculated/adjusted SE's
di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')