Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

I’d like to get input from the Stata community on the advisability and potential utility of estimating the margins of Observable Endogenous (OEn) variables with respect to a Latent Endogenous variable (Len). The specific questions I am asking are at the bottom of this post. Any comments, positive or negative, will be greatly appreciated.

Post-estimation utilities for -sem- and -gsem- facilitate estimating margins of the OEn w.r.t. any Observable Exogenous (OEx) variable. Searching Statalist for discussions of estimating margins after -sem- or -gsem-, I found a 2014 thread initiated by Jan Hultgren . Jeff Pitblado (StataCorp) ’s responses are particularly helpful. I've also searched for this question in various textbooks about SEM including that by Anders Skrondal and Sophia Rabe-Hesketh entitled Generalized latent variable modeling, 2004. But so far I have not found a discussion of whether to estimate a margin w.r.t. an LEn or, if that were desired, how to best approach the task.

Here is a toy example to motivate the issue and, hopefully, to show why it might be pertinent.

I. Preliminaries
Consider an extremely simple SEM using Stata’s demo data, auto.dta.

Code:

sysuse auto, clear
replace price = price/1000
lab var price "Price in $1,000's'"
sem  (price <- foreign L)  (mpg <- foreign L)  (L <- displ gear turn trunk )

The above model is an instance of a class of SEM models called MIMIC models. See examples 10 and 36g in Stata’s [SEM] PDF documentation. Here are the estimation results:

Code:

sem (price <- foreign L) (mpg <- foreign L) (L <- displ gear turn trunk )
 
Endogenous variables
  Observed: price mpg
  Latent:   L
 
Exogenous variables
  Observed: foreign displacement gear_ratio turn trunk
 
Fitting target model:
Iteration 0:   log likelihood = -1300.1204  (not concave)
<snip>
Iteration 13:  log likelihood = -1196.7388 
Iteration 14:  log likelihood = -1196.7388 
 
Structural equation model                                   Number of obs = 74
Estimation method: ml
 
Log likelihood = -1196.7388
 
 ( 1)  [price]L = 1
--------------------------------------------------------------------------------
               |                 OIM
               | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
---------------+----------------------------------------------------------------
Structural     |
  price        |
             L |          1  (constrained)
       foreign |   4.042227   .8126842     4.97   0.000     2.449395    5.635059
         _cons |  -3.554518   3.569106    -1.00   0.319    -10.54984    3.440801
  -------------+----------------------------------------------------------------
  mpg          |
             L |    -2.0604   .3596857    -5.73   0.000    -2.765371   -1.355429
       foreign |  -2.739423   1.369886    -2.00   0.046    -5.424351   -.0544954
         _cons |   39.66227   7.471028     5.31   0.000     25.01933    54.30522
  -------------+----------------------------------------------------------------
  L            |
  displacement |   .0137306   .0049636     2.77   0.006     .0040021    .0234591
    gear_ratio |  -.9408139   .7621917    -1.23   0.217    -2.434682    .5530545
          turn |   .1976456   .0710133     2.78   0.005     .0584622     .336829
         trunk |   .0588124   .0533685     1.10   0.270    -.0457879    .1634126
---------------+----------------------------------------------------------------
   var(e.price)|   4.576513    .907895                      3.102217    6.751453
     var(e.mpg)|   10.99188    2.81405                      6.655097    18.15472
       var(e.L)|   .5206339   .4753274                      .0869769    3.116456
--------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(3) = 9.17              Prob > chi2 = 0.0271
est store sem_orig

After this estimation, we can estimate and plot the margins of both OEn w.r.t. the OEx, -displacement-. (I use Roger Newson’s utility regaxis available from SSC.)

Code:

which regaxis
regaxis displ , lticks(atnumlist) maxticks(7)
Xeq margins , at(displ = (`atnumlist'))
marginsplot,   ///
  title(Predictive margins and 95% CIs from -sem-)  ///
  subtitle(Conventional margins of OEn variables w.r.t. a single OEx variable)  ///
  note(At mean values of other OEx variables, span)  ///
  ytitle(Predicted values of Price and MPG)  ///
  legend(order(3 "Predicted Price" "in $1,000's" 4 "Predicted Miles-" "Per-Gallon"))  ///
name(OEx_impact_on_OEn, replace)

This code produces the following margin plot.
Array

II. Narrative/Motivation/"Theory"
In the above model, can the latent variable, L, be interpreted as vehicle "quality"? (Identification problems abound. Please suspend your disbelief here.)

If the latent variable L represents something real, but not directly observable, like "Vehicle Quality", then it is of interest to produce a graph like the one above with respect to "Quality", a LEn variable. Such a graph would support a narrative that the Cause part of the MIMIC model estimates a sort of production function for "Quality", while the indicator part of the MIMIC model can be thought of as a hedonic index of quality. Again suspending our disbelief, imagine that -price- reflects willingness-to-pay, the higher the better, while -mpg- reflects the social value of the car, the lower the better.

III. Stata Problem
To my knowledge, neither -sem-'s nor -gsem-'s postestimation commands enables predicting the values of the OEn indicator variables over a range of values of an LEn variable like the L in this model.

As Jeff Pitblado (StataCorp) emphasizes in his Statalist posts on margins with latent variables (ibid), the unobserved LEn are inherently stochastic, being the result of adding a stochastic error term to a function of observed variables, some of which might also be endogenous and therefore stochastic. Thus the problem is to estimate margins w.r.t. an unobserved random variable. It might seem that a plausible approach would be to derive variation in the predicted value of L from variation in the observed values of the OEx on which it depends. However, although such an approach might give the correct margins, it does not allow an unambiguous estimate of the standard errors of the margins.

To see this ambiguity, note that the observed values of the four OEx variables in the demo data are quite different between the Buick Century and the Buick Opel. Here I use my own -mluwild- which can be downloaded from inside Stata.

Code:

net install mluwild, from(http://digital.cgdev.org/doc/stata/MO/Misc)
which mluwild
*Extract the coefficients of these four OEx variables from the -sem- results
est restore sem_orig
mluwild e(b)["y1","L:*"]
mat def Lcoefs = r(submat)
matlist Lcoefs
 
*Use the matrix -Lcoefs- to score the four OEx variables
*for the Buick Century and the Buick Opel, creating predicted values of L.
*(This -matrix score- command works by matching the column names of the 
*matrix of coefficients to the names of variables in the data.
*It is a is a cool feature of Stata.)
matrix score L_hat = Lcoefs if make =="Buick Century" | make=="Buick Opel"

Now note that, when we calculate the margins of the OEn variables -price- and -mpg- for these two different observations:
. The two estimates of the predicted -price- are almost the same at $6,366, but the standard error of the prediction is twice as large for the Opel.
. The two estimates of the predicted -mpg- are almost the same at 20.9, but the standard error of the prediction is almost 4 times larger for the Opel.

Code:

. margins , at(displ=196 gear == 2.93  turn==40  trunk==16) at(displ=304 gear == 2.87  turn==34  trunk==10)
 
Predictive margins                                          Number of obs = 74
Model VCE: OIM
 
1._predict: Linear prediction (Price in $1,000's'), predict(xb(price))
2._predict: Linear prediction (Mileage (mpg)), predict(xb(mpg))
 
1._at: displacement =  196
       gear_ratio   = 2.93
       turn         =   40
       trunk        =   16
2._at: displacement =  304
       gear_ratio   = 2.87
       turn         =   34
       trunk        =   10
 
------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
_predict#_at |
        1 1  |   6.365587   .2626448    24.24   0.000     5.850813    6.880362
        1 2  |   6.366048   .6387133     9.97   0.000     5.114193    7.617903
        2 1  |   20.88454   .4258567    49.04   0.000     20.04987     21.7192
        2 2  |   20.88359   1.251274    16.69   0.000     18.43113    23.33604
------------------------------------------------------------------------------

Since the standard errors of the predictions are necessary for inferences about the statistical significance of the impacts of changes in the latent variable, it seems that in order to proceed we must assume that the latent variable is fixed-in-repeated-samples, not random. Thus our estimates of the margins w.r.t. the latent variable will be conditional on this assumption.

IV. Proposed solution
The work-around I propose is:

Estimate the predicted value of L on each observation in the sample based only on the causal component of the -sem- model.
1. Use the command -predict LfromOEx, xblatent-.
Assume that LfromOEx is "fixed-in-repeated-samples" as if it were an Observed Exogenous variable
Fit the OEn indicators to the selected estimate of L using one of Stata's multivariate regression packages such as -reg3- or -sureg-
1. . To replicate -sem-'s anchoring of each Len to one of the OEn, constrain the coefficient of L to equal 1.0 in one of the indicator equations.
Optionally: Replace the e(b) and e(V) matrices in the -reg3- or -sureg- ereturn space with those from the -sem-, which to Stata look like they are from -reg3-, creating a Frankenstein set of estimates
Apply -margins- and -marginsplot- to the resulting -reg3- or -sureg- estimates to obtain the impact on the OEn (here -price- and -mpg-) of change in the LEn (here "Quality").

The attached DO file implements this proposed solution both with and without step (4). Here is the -marginsplots- result of executing steps (1), (2), (3) and (5), without using the Frankenstein estimates.

Array
And here is the proposed solution using all five steps and thus predicting the margins based on the less precise Frankenstein estimates.
Array
As you see, the two sets of estimates produce identical estimated margins for -price- and -mpg- at each estimated value of L, but very different standard errors and confidence intervals.

V. Questions [Should this be a "poll" :-) ]:
(1) Is the objective of estimating the relationship between the hedonic "quality" of these cars and the indicator variables price and mpg understandable? Suspending our disbelief regarding identification, does this objective make sense?
(2) Is there a way to compute margins with respect to a latent variable other than by estimating/predicting the latent variable and then assuming it is fixed in repeated samples?
(3) Is it meaningful to present margins of the OEn which are conditional on the assumption that L is fixed in repeated samples at the values estimated from the causal part of the model?
(4) After -gsem-, Stata offers only the Empirical Bayes method of estimating the values of the latent variable. Since the Empirical Bayes method uses the OEn variables as well as the OEx variables, the assumption that estimates of the LEn based on the Empirical Bayes method are fixed in repeated samples seems to be a heavier lift. Do folks agree that this distinction argues for using -sem- rather than -gsem-?

(Or to implement this "solution" after -gsem-, one could simply use -matrix score- to score the OEx variables with their -gsem- estimated coefficients. This should produce results with properties similar to those produced by -predict ..., xblatent- after -sem-.)

Based on the above and on intuition (and on my experience with large data sets), the estimated coefficients of L from -reg3- without modification are almost the same as the estimated coefficients from the Frankenstein model, but the standard errors are much larger in the Frankenstein model. (Compare graphs named eb_eV_from_reg3 and eb_eV_from_sem.)

(5) Conditional on assuming the estimated values of L are fixed, as I believe we must do to compute the margins of the OEn w.r.t. L, are we justified to make statistical inferences using the much smaller standard errors from -reg3- without modification?
(6) Since replacing the -reg3- results with the e(b) and e(V) values from the -sem- estimates yields larger (and thus more conservative) standard errors, are we on firmer ground to use these Frankenstein results for inference, justifying them as restoring part of the stochastic characteristics inherent in the original -sem- model?

Again, any comments would be appreciated.

BJ Data Tech Solution

Home / Data Cleaning / Data management / Data Processing / Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable
Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

0 Response to Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

Related Posts with Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

0 Response to Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable
Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable