I am using a database with information on food preservation methods (such as "frozen" and "canned", expressed as tertiles of consumption in grams/day) and their effect on several continuous outcomes (leukocytes, CRP, ...). I am having difficulty choosing an appropriate model.
1. If the dependent variables are kept as continuous variables, should I fit a separate linear regression for each combination of food preservation method and dependent variable?
For example:
Code:
regress leukocyte i.cannedtertile
Code:
regress crp i.cannedtertile
Code:
regress crp i.frozentertile
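The separate models above could also be run in a loop. This is only a minimal sketch, assuming the variables are named exactly as in the commands above (leukocyte, crp, cannedtertile, frozentertile); any further outcomes or exposures would have to be added to the lists by hand.

```stata
* One regression per outcome/exposure pair.
* Variable names are taken from the commands above and are assumptions
* about the actual dataset.
foreach y of varlist leukocyte crp {
    foreach x of varlist cannedtertile frozentertile {
        display as text _newline "Outcome: `y'   Exposure: `x'"
        regress `y' i.`x'
    }
}
```

Each pass fits the same model as the individual commands, so the output should match running them one at a time.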
2. Most dependent variables are not normally distributed. For example, the continuous variable "leukocytes" (measured in 10^3 / mm3) does not have a normal distribution, so I have transformed it logarithmically.
Code:
gen logleukocyte = log(leukocyte)
Code:
regress logleukocyte b1.cannedtertil
Code:
-------------------------------------------------------------------------------
 logleukocyte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
 cannedtertil |
            2 |  -.0365994   .0171835    -2.13   0.033    -.0702951   -.0029037
            3 |  -.0152055   .0171048    -0.89   0.374    -.0487469    .0183359
        _cons |   1.784896   .0121318   147.13   0.000     1.761107    1.808686
- Coefficient: -.0365994
- Exponentiate: 0.9641
- Subtract 1: -0.0359
- Result (multiplied by 100): -3.594
So: "compared to the lowest tertile, those in the second canned food consumption tertile have 3.59 10^3/mm3 fewer leukocytes"
--> Is this correct?
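For reference, the same back-transformation can be obtained inside Stata rather than by hand, together with a confidence interval. This is a hedged sketch that assumes the log-linear model above is the most recently fitted model and that the exposure variable is named cannedtertil as in the output. Note that the exponentiated coefficient of a log-outcome model is a ratio (of geometric means), so the quantity computed here is a relative (percentage) difference rather than an absolute difference in 10^3/mm3.

```stata
* Refit the log-linear model (names as in the post; assumed to exist).
regress logleukocyte b1.cannedtertil

* Exponentiated coefficient for tertile 2 vs tertile 1, with its CI:
* a ratio on the original scale.
lincom 2.cannedtertil, eform

* The same contrast expressed as a percentage change, with a
* delta-method confidence interval.
nlcom (pct_change: 100*(exp(_b[2.cannedtertil]) - 1))
```

The interval from lincom is simply the exponentiated interval of the coefficient, whereas nlcom uses the delta method, so the two need not agree exactly.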
b) However, how would the confidence interval be interpreted?
I have read in this post (https://www.stata.com/stata-news/news34-2/spotlight/) that it is preferable to use either a log transform with linear regression or a Poisson regression, in both cases followed by the "margins" command, so that the predictions and their confidence intervals are on the original scale. The reason given is: "It is tempting to simply exponentiate the predictions to convert them back to wages, but the reverse transformation results in a biased prediction (see references Abrevaya [2002]; Cameron and Trivedi [2010]; Duan [1983]; Wooldridge [2010])."
c) If the above is correct, is the following the right way to do it?
Code:
gsem logleukocyte <- b1.cannedtertil

-------------------------------------------------------------------------------
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
logleukocyte  |
 cannedtertil |
            2 |  -.0365994   .0171659    -2.13   0.033     -.070244   -.0029548
            3 |  -.0152055   .0170873    -0.89   0.374     -.048696     .018285
        _cons |   1.784896   .0121194   147.28   0.000     1.761143     1.80865
--------------+----------------------------------------------------------------
 var(e.leulog)|   .0716775   .0020492                      .0677717    .0758085
Code:
margins, expression(exp(predict(eta))*(exp((_b[/var(e.logleukocyte)])/2)))
Code:
margins, expression(exp(predict(eta))*(exp((_b[/var(e.logleukocyte)])/2))) at(cannedtertile=(1(1)3))

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   6.176397   .0751213    82.22   0.000     6.029162    6.323632
          2  |   5.954431   .0726437    81.97   0.000     5.812052     6.09681
------------------------------------------------------------------------------
--> Also, how should these results (6.176397 and 5.954431) be interpreted? They are not the same as the value obtained above (3.59 10^3/mm3).
c) If not, would you recommend the Poisson model followed by margins (the second option explained here: https://www.stata.com/stata-news/news34-2/spotlight/)? I have tried it too and the results are similar (values of around 5 to 6), and I don't know how to interpret them.
3. If I need to report a p-value, should I use the one obtained from the linear regression on the log-transformed variable?
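For concreteness, the Poisson alternative mentioned above might look like the following. This is a sketch only, assuming the untransformed outcome is named leukocyte and the exposure cannedtertil as in the earlier commands; the robust variance option follows the approach described in the linked article.

```stata
* Poisson regression on the untransformed outcome with robust
* standard errors (variable names are assumptions from the post).
poisson leukocyte b1.cannedtertil, vce(robust)

* Predicted mean leukocyte count in each tertile, on the original
* scale (10^3/mm3), with delta-method confidence intervals.
margins cannedtertil

* Differences in predicted means relative to the reference tertile.
margins r.cannedtertil
```

On this reading, values of around 5 to 6 from margins would be predicted mean leukocyte counts per tertile rather than regression coefficients, and the contrast line reports how far tertiles 2 and 3 are from tertile 1 in the original units.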
I would really appreciate your help.
Thank you in advance.