Dear community,
I'm trying to run a multiple linear regression in Stata 16 using cross-sectional data. My response variable (d17aum, 1252 obs) captures the increase in the value of export activity in 2009 relative to the previous year for small and medium enterprises. It is therefore expressed as a percentage, from 1 to 100. My main explanatory variables are family ownership (continuous, also expressed as a percentage; 6554 obs) and external management (categorical, taking the value 1 when the CEO is not a member of the family controlling the firm; 6827 obs). Other explanatory variables will be included as controls. This is what the data look like:
[CODE]
input int d17aum float(fam_own2 external_man)
. 82 0
. 100 0
. 100 0
. 100 0
. 71 0
. 50 0
. 100 0
10 50 0
. 92 0
25 77 0
10 27 0
. 100 0
. 100 0
. 100 0
. 95 0
. 100 0
10 51 0
. 100 0
. 100 0
. 100 1
. 52 0
. 100 1
10 52 0
. 90 0
. 99 0
. 62 0
. 60 0
. 52 0
. 77 1
. 67 0
. 53 0
15 100 0
. 100 0
. 27 1
1 52 0
. 52 0
. 99 0
. 100 0
. 47 0
. 100 0
6 100 1
5 100 0
. 100 0
. 100 0
. 78 0
. 100 0
. 52 1
. 51 0
. 100 0
. 100 0
. 100 0
. 63 0
. 50 0
. 92 0
. 80 0
. 100 0
. 100 0
. 26 0
. 100 0
. 100 0
. 27 0
. 27 0
40 77 0
. 100 0
. 77 0
. 52 0
. 77 0
. 37 0
. 100 0
. 100 0
. 27 0
. 100 0
. 100 0
. 100 0
. 77 0
6 52 0
. 100 0
. 100 0
. 100 0
. 62 0
. 76 0
. 100 0
. 77 0
10 100 0
. 60 0
. 52 0
. 77 0
. 100 0
. 100 0
. 100 0
. 100 0
. 100 0
. 35 0
. 62 0
. 62 0
. 52 0
. 53 0
. 69 0
. 76 0
. 100 0
end
[/CODE]
As you can see in boxplot.png and hist.png, my DV has outliers and a right-skewed distribution. I therefore used ln(DV) to normalize the distribution, which gives a bell-shaped curve (see log_distribution.png) and a more reasonable boxplot (see log_boxplot.png). The data now look like this:
[CODE]
input float(wd17 fam_own2 external_man)
. 82 0
. 100 0
. 100 0
. 100 0
. 71 0
. 50 0
. 100 0
2.3025851 50 0
. 92 0
3.218876 77 0
2.3025851 27 0
. 100 0
. 100 0
. 100 0
. 95 0
. 100 0
2.3025851 51 0
. 100 0
. 100 0
. 100 1
. 52 0
. 100 1
2.3025851 52 0
. 90 0
. 99 0
. 62 0
. 60 0
. 52 0
. 77 1
. 67 0
. 53 0
2.70805 100 0
. 100 0
. 27 1
0 52 0
. 52 0
. 99 0
. 100 0
. 47 0
. 100 0
1.7917595 100 1
1.609438 100 0
. 100 0
. 100 0
. 78 0
. 100 0
. 52 1
. 51 0
. 100 0
. 100 0
. 100 0
. 63 0
. 50 0
. 92 0
. 80 0
. 100 0
. 100 0
. 26 0
. 100 0
. 100 0
. 27 0
. 27 0
3.6888795 77 0
. 100 0
. 77 0
. 52 0
. 77 0
. 37 0
. 100 0
. 100 0
. 27 0
. 100 0
. 100 0
. 100 0
. 77 0
1.7917595 52 0
. 100 0
. 100 0
. 100 0
. 62 0
. 76 0
. 100 0
. 77 0
2.3025851 100 0
. 60 0
. 52 0
. 77 0
. 100 0
. 100 0
. 100 0
. 100 0
. 100 0
. 35 0
. 62 0
. 62 0
. 52 0
. 53 0
. 69 0
. 76 0
. 100 0
end
[/CODE]
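For reference, the transformation described above can be sketched roughly as follows. This is a minimal sketch, assuming wd17 is the natural log of d17aum (the listed values are consistent with that, e.g. ln(10) = 2.3025851). The name wd17_check is hypothetical and is used only to avoid overwriting the existing wd17.

[CODE]
* Sketch only: rebuild the logged DV and inspect its distribution
* wd17_check is a hypothetical name; missing d17aum stays missing
generate double wd17_check = ln(d17aum)

* Histogram with a normal density overlaid, and a boxplot
histogram wd17_check, normal
graph box wd17_check

* Shapiro-Wilk test; the null hypothesis is that the variable is normally distributed
swilk wd17_check
[/CODE]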
However, when checking for normality with swilk, I still reject the null hypothesis, so the DV is still not normally distributed. I've also been told to use winsor to solve the issue, but no matter how I use it [p(0.1 to 0.5), highonly, lowonly, or the default], the Shapiro-Wilk test still rejects normality. Any regression I try to run (524 obs), including the other control variables, turns out to be not significant (F > 0.5), with an extremely low R-squared. The linearity and homoskedasticity assumptions do not hold either.

At this point, I started wondering whether linear regression is the right model for my data. I've been reading a lot on the forum and on the web about analysis with percentages as dependent variables. Since my statistics knowledge is not at its best, I got even more confused. Some say I could treat the percentage as a continuous variable, with linear regression being the best model to use. Some say I could treat it as a count variable (even though the variable doesn't count anything), since it has a right-skewed distribution, so Poisson would be more suitable. Others suggest splitting my DV into categories according to its percentiles and using a logistic regression. A beta regression would not be possible, since I would have some values equal to 0 and 1.

Since the model has to be useful for hypothesis testing, which model might fit my data best?
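To make the options concrete, here is a rough sketch of how the steps and candidate models mentioned above might be specified. It is only an illustration under assumptions: the control variables are omitted, winsor is the user-written command from SSC, the cutoffs p(0.05) and nquantiles(4) are arbitrary, ordered logit is just one reading of "logistic regression on categories", and all generated names (d17_w, d17_wh, d17_q, d17_prop) are hypothetical.

[CODE]
* Winsorizing (user-written command; install once with: ssc install winsor)
winsor d17aum, generate(d17_w) p(0.05)            // both tails
winsor d17aum, generate(d17_wh) p(0.05) highonly  // upper tail only

* Linear regression on the logged DV, followed by standard diagnostics
regress wd17 fam_own2 i.external_man
estat hettest        // Breusch-Pagan test for heteroskedasticity
rvfplot, yline(0)    // residuals versus fitted values

* Poisson regression on the raw percentage, with robust standard errors
poisson d17aum fam_own2 i.external_man, vce(robust)

* DV split into quartile categories, then an ordered logit
xtile d17_q = d17aum, nquantiles(4)
ologit d17_q fam_own2 i.external_man

* Beta regression needs a DV strictly between 0 and 1, so boundary
* values have to be excluded (or handled in some other way)
generate double d17_prop = d17aum / 100
betareg d17_prop fam_own2 i.external_man if d17_prop > 0 & d17_prop < 1
[/CODE]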