Post-estimation utilities for -sem- and -gsem- facilitate estimating margins of the OEn w.r.t. any Observable Exogenous (OEx) variable. Searching Statalist for discussions of estimating margins after -sem- or -gsem-, I found a 2014 thread initiated by Jan Hultgren . Jeff Pitblado (StataCorp) ’s responses are particularly helpful. I've also searched for this question in various textbooks about SEM including that by Anders Skrondal and Sophia Rabe-Hesketh entitled Generalized latent variable modeling, 2004. But so far I have not found a discussion of whether to estimate a margin w.r.t. an LEn or, if that were desired, how to best approach the task.
Here is a toy example to motivate the issue and, hopefully, to show why it might be pertinent.
I. Preliminaries
Consider an extremely simple SEM using Stata’s demo data, auto.dta.
Code:
sysuse auto, clear replace price = price/1000 lab var price "Price in $1,000's'" sem (price <- foreign L) (mpg <- foreign L) (L <- displ gear turn trunk )
Code:
sem (price <- foreign L) (mpg <- foreign L) (L <- displ gear turn trunk ) Endogenous variables Observed: price mpg Latent: L Exogenous variables Observed: foreign displacement gear_ratio turn trunk Fitting target model: Iteration 0: log likelihood = -1300.1204 (not concave) <snip> Iteration 13: log likelihood = -1196.7388 Iteration 14: log likelihood = -1196.7388 Structural equation model Number of obs = 74 Estimation method: ml Log likelihood = -1196.7388 ( 1) [price]L = 1 -------------------------------------------------------------------------------- | OIM | Coefficient std. err. z P>|z| [95% conf. interval] ---------------+---------------------------------------------------------------- Structural | price | L | 1 (constrained) foreign | 4.042227 .8126842 4.97 0.000 2.449395 5.635059 _cons | -3.554518 3.569106 -1.00 0.319 -10.54984 3.440801 -------------+---------------------------------------------------------------- mpg | L | -2.0604 .3596857 -5.73 0.000 -2.765371 -1.355429 foreign | -2.739423 1.369886 -2.00 0.046 -5.424351 -.0544954 _cons | 39.66227 7.471028 5.31 0.000 25.01933 54.30522 -------------+---------------------------------------------------------------- L | displacement | .0137306 .0049636 2.77 0.006 .0040021 .0234591 gear_ratio | -.9408139 .7621917 -1.23 0.217 -2.434682 .5530545 turn | .1976456 .0710133 2.78 0.005 .0584622 .336829 trunk | .0588124 .0533685 1.10 0.270 -.0457879 .1634126 ---------------+---------------------------------------------------------------- var(e.price)| 4.576513 .907895 3.102217 6.751453 var(e.mpg)| 10.99188 2.81405 6.655097 18.15472 var(e.L)| .5206339 .4753274 .0869769 3.116456 -------------------------------------------------------------------------------- LR test of model vs. saturated: chi2(3) = 9.17 Prob > chi2 = 0.0271 est store sem_orig
Code:
which regaxis regaxis displ , lticks(atnumlist) maxticks(7) Xeq margins , at(displ = (`atnumlist')) marginsplot, /// title(Predictive margins and 95% CIs from -sem-) /// subtitle(Conventional margins of OEn variables w.r.t. a single OEx variable) /// note(At mean values of other OEx variables, span) /// ytitle(Predicted values of Price and MPG) /// legend(order(3 "Predicted Price" "in $1,000's" 4 "Predicted Miles-" "Per-Gallon")) /// name(OEx_impact_on_OEn, replace)
Array
II. Narrative/Motivation/"Theory"
In the above model, can the latent variable, L, be interpreted as vehicle "quality"? (Identification problems abound. Please suspend your disbelief here.)
If the latent variable L represents something real, but not directly observable, like "Vehicle Quality", then it is of interest to produce a graph like the one above with respect to "Quality", a LEn variable. Such a graph would support a narrative that the Cause part of the MIMIC model estimates a sort of production function for "Quality", while the indicator part of the MIMIC model can be thought of as a hedonic index of quality. Again suspending our disbelief, imagine that -price- reflects willingness-to-pay, the higher the better, while -mpg- reflects the social value of the car, the lower the better.
III. Stata Problem
To my knowledge, neither -sem-'s nor -gsem-'s postestimation commands enables predicting the values of the OEn indicator variables over a range of values of an LEn variable like the L in this model.
As Jeff Pitblado (StataCorp) emphasizes in his Statalist posts on margins with latent variables (ibid), the unobserved LEn are inherently stochastic, being the result of adding a stochastic error term to a function of observed variables, some of which might also be endogenous and therefore stochastic. Thus the problem is to estimate margins w.r.t. an unobserved random variable. It might seem that a plausible approach would be to derive variation in the predicted value of L from variation in the observed values of the OEx on which it depends. However, although such an approach might give the correct margins, it does not allow an unambiguous estimate of the standard errors of the margins.
To see this ambiguity, note that the observed values of the four OEx variables in the demo data are quite different between the Buick Century and the Buick Opel. Here I use my own -mluwild- which can be downloaded from inside Stata.
Code:
net install mluwild, from(http://digital.cgdev.org/doc/stata/MO/Misc) which mluwild *Extract the coefficients of these four OEx variables from the -sem- results est restore sem_orig mluwild e(b)["y1","L:*"] mat def Lcoefs = r(submat) matlist Lcoefs *Use the matrix -Lcoefs- to score the four OEx variables *for the Buick Century and the Buick Opel, creating predicted values of L. *(This -matrix score- command works by matching the column names of the *matrix of coefficients to the names of variables in the data. *It is a is a cool feature of Stata.) matrix score L_hat = Lcoefs if make =="Buick Century" | make=="Buick Opel"
. The two estimates of the predicted -price- are almost the same at $6,366, but the standard error of the prediction is twice as large for the Opel.
. The two estimates of the predicted -mpg- are almost the same at 20.9, but the standard error of the prediction is almost 4 times larger for the Opel.
Code:
. margins , at(displ=196 gear == 2.93 turn==40 trunk==16) at(displ=304 gear == 2.87 turn==34 trunk==10) Predictive margins Number of obs = 74 Model VCE: OIM 1._predict: Linear prediction (Price in $1,000's'), predict(xb(price)) 2._predict: Linear prediction (Mileage (mpg)), predict(xb(mpg)) 1._at: displacement = 196 gear_ratio = 2.93 turn = 40 trunk = 16 2._at: displacement = 304 gear_ratio = 2.87 turn = 34 trunk = 10 ------------------------------------------------------------------------------ | Delta-method | Margin std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- _predict#_at | 1 1 | 6.365587 .2626448 24.24 0.000 5.850813 6.880362 1 2 | 6.366048 .6387133 9.97 0.000 5.114193 7.617903 2 1 | 20.88454 .4258567 49.04 0.000 20.04987 21.7192 2 2 | 20.88359 1.251274 16.69 0.000 18.43113 23.33604 ------------------------------------------------------------------------------
IV. Proposed solution
The work-around I propose is:
- Estimate the predicted value of L on each observation in the sample based only on the causal component of the -sem- model.
- Use the command -predict LfromOEx, xblatent-.
- Assume that LfromOEx is "fixed-in-repeated-samples" as if it were an Observed Exogenous variable
- Fit the OEn indicators to the selected estimate of L using one of Stata's multivariate regression packages such as -reg3- or -sureg-
- . To replicate -sem-'s anchoring of each Len to one of the OEn, constrain the coefficient of L to equal 1.0 in one of the indicator equations.
- Optionally: Replace the e(b) and e(V) matrices in the -reg3- or -sureg- ereturn space with those from the -sem-, which to Stata look like they are from -reg3-, creating a Frankenstein set of estimates
- Apply -margins- and -marginsplot- to the resulting -reg3- or -sureg- estimates to obtain the impact on the OEn (here -price- and -mpg-) of change in the LEn (here "Quality").
Array
And here is the proposed solution using all five steps and thus predicting the margins based on the less precise Frankenstein estimates.
Array
As you see, the two sets of estimates produce identical estimated margins for -price- and -mpg- at each estimated value of L, but very different standard errors and confidence intervals.
V. Questions [Should this be a "poll" :-) ]:
(1) Is the objective of estimating the relationship between the hedonic "quality" of these cars and the indicator variables price and mpg understandable? Suspending our disbelief regarding identification, does this objective make sense?
(2) Is there a way to compute margins with respect to a latent variable other than by estimating/predicting the latent variable and then assuming it is fixed in repeated samples?
(3) Is it meaningful to present margins of the OEn which are conditional on the assumption that L is fixed in repeated samples at the values estimated from the causal part of the model?
(4) After -gsem-, Stata offers only the Empirical Bayes method of estimating the values of the latent variable. Since the Empirical Bayes method uses the OEn variables as well as the OEx variables, the assumption that estimates of the LEn based on the Empirical Bayes method are fixed in repeated samples seems to be a heavier lift. Do folks agree that this distinction argues for using -sem- rather than -gsem-?
(Or to implement this "solution" after -gsem-, one could simply use -matrix score- to score the OEx variables with their -gsem- estimated coefficients. This should produce results with properties similar to those produced by -predict ..., xblatent- after -sem-.)Based on the above and on intuition (and on my experience with large data sets), the estimated coefficients of L from -reg3- without modification are almost the same as the estimated coefficients from the Frankenstein model, but the standard errors are much larger in the Frankenstein model. (Compare graphs named eb_eV_from_reg3 and eb_eV_from_sem.)
(5) Conditional on assuming the estimated values of L are fixed, as I believe we must do to compute the margins of the OEn w.r.t. L, are we justified to make statistical inferences using the much smaller standard errors from -reg3- without modification?
(6) Since replacing the -reg3- results with the e(b) and e(V) values from the -sem- estimates yields larger (and thus more conservative) standard errors, are we on firmer ground to use these Frankenstein results for inference, justifying them as restoring part of the stochastic characteristics inherent in the original -sem- model?
Again, any comments would be appreciated.
0 Response to Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable
Post a Comment