I am running a linear regression model and found that three of my variables (ssr1, ssr2, ssr3) are highly collinear (pairwise correlations above 0.85). To get around this, I converted them to binary indicators (dssr1, dssr2, dssr3). The indicators are no longer strongly correlated, but whenever I include all three in my baseline regression, the coefficients don't make sense (i.e., their signs differ from when each is included alone).
corr ssr1 ssr2 ssr3
(obs=4,544)

             |     ssr1     ssr2     ssr3
-------------+---------------------------
        ssr1 |   1.0000
        ssr2 |   0.8794   1.0000
        ssr3 |   0.9855   0.8725   1.0000

corr dssr1 dssr2 dssr3
(obs=4,544)

             |    dssr1    dssr2    dssr3
-------------+---------------------------
       dssr1 |   1.0000
       dssr2 |   0.2647   1.0000
       dssr3 |   0.4837   0.3100   1.0000

reg y abs x1 x2 x3 dssr1 dssr2 dssr3

      Source |       SS           df       MS      Number of obs   =     4,389
-------------+----------------------------------   F(7, 4381)      =   3369.98
       Model |  10528.9907         7  1504.14153   Prob > F        =    0.0000
    Residual |  1955.39803     4,381  .446336003   R-squared       =    0.8434
-------------+----------------------------------   Adj R-squared   =    0.8431
       Total |  12484.3887     4,388  2.84512049   Root MSE        =    .66808

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         abs |  -.3938718   .0589501    -6.68   0.000    -.5094438   -.2782998
          x1 |  -.4061771   .0108896   -37.30   0.000    -.4275263    -.384828
          x2 |   1.449583   .0119303   121.50   0.000     1.426194    1.472972
          x3 |  -.0211118   .0013578   -15.55   0.000    -.0237738   -.0184497
       dssr1 |  -.2650344   .0292786    -9.05   0.000    -.3224352   -.2076336
       dssr2 |   -.065791   .0288065    -2.28   0.022    -.1222664   -.0093157
       dssr3 |   .0849137    .025804     3.29   0.001     .0343249    .1355025
       _cons |   1.411668   .0331324    42.61   0.000     1.346712    1.476625
------------------------------------------------------------------------------

reg y abs x1 x2 x3 dssr1

      Source |       SS           df       MS      Number of obs   =     4,389
-------------+----------------------------------   F(5, 4383)      =   4702.47
       Model |  10522.8053         5  2104.56107   Prob > F        =    0.0000
    Residual |  1961.58339     4,383  .447543553   R-squared       =    0.8429
-------------+----------------------------------   Adj R-squared   =    0.8427
       Total |  12484.3887     4,388  2.84512049   Root MSE        =    .66899

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         abs |  -.4119729   .0573253    -7.19   0.000    -.5243595   -.2995863
          x1 |  -.4062483   .0108303   -37.51   0.000    -.4274811   -.3850155
          x2 |   1.451485   .0118196   122.80   0.000     1.428312    1.474657
          x3 |  -.0208816   .0012887   -16.20   0.000    -.0234082    -.018355
       dssr1 |  -.2674783   .0283905    -9.42   0.000    -.3231379   -.2118186
       _cons |   1.412287   .0328553    42.99   0.000     1.347874      1.4767
------------------------------------------------------------------------------
As a diagnostic, I regressed abs (the most likely culprit) on the three indicators, to check whether they explain too much of its variance, i.e., whether the R^2 is too high.
reg abs dssr1 dssr2 dssr3

      Source |       SS           df       MS      Number of obs   =     4,308
-------------+----------------------------------   F(3, 4304)      =    163.34
       Model |  793474.797         3  264491.599   Prob > F        =    0.0000
    Residual |  6969151.29     4,304   1619.2266   R-squared       =    0.1022
-------------+----------------------------------   Adj R-squared   =    0.1016
       Total |  7762626.08     4,307  1802.32786   Root MSE        =     40.24

------------------------------------------------------------------------------
         abs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       dssr1 |   24.94797   1.648215    15.14   0.000     21.71662    28.17933
       dssr2 |   13.83835   1.673241     8.27   0.000     10.55793    17.11876
       dssr3 |   8.870042   1.515782     5.85   0.000     5.898329    11.84176
       _cons |   31.97143   .7562375    42.28   0.000     30.48882    33.45405
------------------------------------------------------------------------------
But I don't know how to interpret the R^2 in this case. If these variables are truly multicollinear, is the R^2 overestimated? I've looked for references but couldn't find any, and I thought maybe you could help me.
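For reference, one common way to read the R^2 of an auxiliary regression like the one above is through the variance inflation factor, VIF = 1/(1 - R^2). The sketch below is only an illustration in Python using the numbers already reported in this post; the function name `vif_from_r2` is made up for the example. Two assumptions to keep in mind: a full VIF for abs would regress it on all the other regressors in the baseline model (x1, x2, x3 as well), so the value from the three indicators alone is at best a rough lower bound; and squaring a pairwise correlation likewise gives only a lower bound on the VIF of either variable in the pair.

```python
def vif_from_r2(r2: float) -> float:
    """Variance inflation factor implied by an auxiliary-regression R^2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2)


# R-squared reported above for: reg abs dssr1 dssr2 dssr3
r2_aux = 0.1022
vif_aux = vif_from_r2(r2_aux)

# Pairwise correlations reported above. Squaring one gives the R^2 of a
# one-regressor auxiliary regression, hence a lower bound on the VIF.
vif_ssr_floor = vif_from_r2(0.9855 ** 2)   # ssr1 vs ssr3, original scale
vif_dssr_floor = vif_from_r2(0.4837 ** 2)  # dssr1 vs dssr3, binary version

print(f"abs on dssr1-dssr3: VIF >= {vif_aux:.2f}")
print(f"ssr1 vs ssr3:       VIF >= {vif_ssr_floor:.1f}")
print(f"dssr1 vs dssr3:     VIF >= {vif_dssr_floor:.2f}")
```

Under these assumptions the lower bounds come out at roughly 1.11, 34.7, and 1.31. A value near 1 means the other regressors inflate a coefficient's variance very little, while a frequently quoted rule of thumb treats values above about 10 as a warning sign. In Stata itself, running `estat vif` after `regress` reports the full VIFs directly.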