Dear Statalist Members,

I am analyzing a balanced panel of around 2400 firms over 12 years (Stata 13). The output I present here is based on test data, as I am neither allowed nor able to extract the original files. The only differences are that the original dataset contains more firms and that most of my explanatory variables turn out to be significant there, unlike in this sample. In the original data, the F-statistic is F(11,13432) with Prob > F = 0.0000, and the overall R-squared is 0.9639.

My goal is to analyze the effect of investments in computers (investict) and of product and process innovations (product_inno, process_inno) on the demand for high-skilled workers. Controls include the size of the firm in terms of employees (total), the industry, a dummy for West Germany (west), a dummy for a collective bargaining agreement (collective), the state of the art of the production equipment (tech), whether the firm engages in R&D (rnd), and some more.

I have used -xtserial- and -xttest3-, which have led me to use cluster-robust standard errors. The result of -xtoverid- made me decide to use fixed effects, and -testparm- made me include year fixed effects. So my regression is now:

  xtreg highskill investict product_inno process_inno total west industry collective exportshare investment turnover rnd tech i.year, fe vce(cluster idnum)
note: west omitted because of collinearity

Fixed-effects (within) regression               Number of obs      =      4344
Group variable: idnum                           Number of groups   =       498

R-sq:  within  = 0.1005                         Obs per group: min =         1
       between = 0.5034                                        avg =       8.7
       overall = 0.4393                                        max =        11

                                                F(21,497)          =      2.60
corr(u_i, Xb)  = 0.3892                         Prob > F           =    0.0001

                                (Std. Err. adjusted for 498 clusters in idnum)
             |               Robust
   highskill |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
   investict |   .7032893   .2711382     2.59   0.010      .170571    1.236008
product_inno |   .2723859   .6988765     0.39   0.697    -1.100731    1.645503
process_inno |  -.3938082   .4501978    -0.87   0.382    -1.278334    .4907173
       total |    .101938   .0245108     4.16   0.000     .0537805    .1500954
        west |          0  (omitted)
    industry |   .1624997   .1911486     0.85   0.396    -.2130592    .5380586
  collective |  -.2838042   .5861356    -0.48   0.628    -1.435413    .8678049
 exportshare |   .8483747   2.351452     0.36   0.718    -3.771638    5.468387
  investment |   1.44e-06   5.98e-07     2.41   0.016     2.68e-07    2.62e-06
    turnover |  -1.99e-07   1.39e-07    -1.43   0.153    -4.73e-07    7.46e-08
         rnd |  -1.103514   .9824249    -1.12   0.262    -3.033732    .8267042
        tech |  -.6756037   .2828397    -2.39   0.017    -1.231313   -.1198947
        year |
       2008  |   .0310991   .3815399     0.08   0.935    -.7185309    .7807291
       2009  |   .4981931   .3197414     1.56   0.120    -.1300184    1.126405
       2010  |   .7890588   .4913133     1.61   0.109    -.1762483    1.754366
       2011  |   1.109093   .5630923     1.97   0.049     .0027585    2.215428
       2012  |   1.189345   .5407669     2.20   0.028      .126874    2.251816
       2013  |   .0965383   .7094676     0.14   0.892    -1.297387    1.490464
       2014  |   .4120097   .6609871     0.62   0.533    -.8866637    1.710683
       2015  |  -.1867301   .7267681    -0.26   0.797    -1.614647    1.241187
       2016  |   .1137137   .5447759     0.21   0.835     -.956634    1.184061
       2017  |  -.4267298   .7349041    -0.58   0.562    -1.870632    1.017172
       _cons |   4.706464   2.350515     2.00   0.046     .0882924    9.324636
     sigma_u |  22.632204
     sigma_e |  7.5596268
         rho |  .89962854   (fraction of variance due to u_i)
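
For reference, the sequence of pre-tests mentioned before the regression could be reproduced roughly as in the sketch below. It is not the exact do-file behind the results: -xtserial-, -xttest3- and -xtoverid- are user-written commands that must be installed first, the variable list is copied from the regression above (without west, which is dropped anyway), and the year dummies are left out of the -xtserial- call because that command does not accept factor variables.

    * Wooldridge test for serial correlation in panel data (user-written)
    xtserial highskill investict product_inno process_inno total industry collective exportshare investment turnover rnd tech

    * modified Wald test for groupwise heteroskedasticity, run after a plain fixed-effects fit (user-written)
    quietly xtreg highskill investict product_inno process_inno total industry collective exportshare investment turnover rnd tech i.year, fe
    xttest3

    * overidentification test of fixed vs. random effects, run after a random-effects fit (user-written)
    * assumption: older versions of xtoverid may reject factor variables, hence the xi: prefix
    quietly xi: xtreg highskill investict product_inno process_inno total industry collective exportshare investment turnover rnd tech i.year, re vce(cluster idnum)
    xtoverid

    * joint significance of the year dummies in the fixed-effects model
    quietly xtreg highskill investict product_inno process_inno total industry collective exportshare investment turnover rnd tech i.year, fe vce(cluster idnum)
    testparm i.year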

I originally intended to use the share of high-skilled employees as my dependent variable, but after reading the paper by Kronmal (1993) and several posts in this forum about the problems with ratios, I switched to the absolute number of high-skilled employees (highskill) and now include the total number of employees as a control. This has increased my R-squared considerably (it was only 0.016 before).

On the other hand, I tested my model specification using:

  predict fitted, xb
  g sq_fitted=fitted^2
  xtreg highskill fitted sq_fitted
  test sq_fitted

The p-value of this test was 0.8 when I used the share; now it is 0.0000, which tells me that my model is misspecified. My question is whether this is the right way to test for misspecification here and, if so, what else I can do about my specification. Or is a high R-squared enough to argue that my model fits?
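
A variant of this check that keeps the estimator and standard errors of the main model in the auxiliary regression might look like the sketch below. The commands above rerun -xtreg- with its defaults, that is, random effects without clustering. The names fitted_fe and sq_fitted_fe are illustrative, and whether this variant changes the conclusion is an open question.

    * re-estimate the main model, then store the linear prediction xb (without the firm effect) and its square
    quietly xtreg highskill investict product_inno process_inno total west industry collective exportshare investment turnover rnd tech i.year, fe vce(cluster idnum)
    predict double fitted_fe, xb
    generate double sq_fitted_fe = fitted_fe^2

    * auxiliary regression with the same estimator and clustering as the main model
    xtreg highskill fitted_fe sq_fitted_fe, fe vce(cluster idnum)

    * a significant squared term points to misspecification
    test sq_fitted_fe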

I also do not understand why the dummy for west is omitted, since none of the regressors are highly correlated with each other.
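
One thing that can be checked here is whether west varies within firms at all. In a fixed-effects regression, a regressor that is constant over time for every firm is perfectly collinear with the firm effects and is dropped, regardless of its correlation with the other regressors. A minimal sketch of that check, assuming the panel is identified by idnum and year (west_varies and firm_tag are hypothetical helper variables):

    * declare the panel; a within Std. Dev. of 0 for west means it never changes inside a firm
    xtset idnum year
    xtsum west

    * flag firms whose west value differs across years (missing values sort last and would also be flagged)
    bysort idnum (west): generate byte west_varies = west[1] != west[_N]
    egen byte firm_tag = tag(idnum)

    * number of firms with any change in west; 0 would explain the omission
    count if west_varies & firm_tag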

I have read many posts in this forum and run several tests that led me to this fixed-effects regression model, so I am confused about the result of the specification test. I have also tried -areg- with absorb(idnum) and vce(cluster idnum), which gives slightly different coefficients and a higher R-squared (as is normal) than -xtreg, fe-, but the result of the misspecification test is the same.

To test for normality, I ran
  xtreg highskill investict product_inno process_inno total west industry collective exportshare investment turnover rnd tech, re vce(cluster idnum)
(re because the test is not possible with fe), followed by -xtsktest-, which gave me the following:

(running _xtsktest_calculations on estimation sample)

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Tests for skewness and kurtosis                 Number of obs      =      4344
                                                Replications       =        50

                                 (Replications based on 498 clusters in idnum)
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Skewness_e |  -1805.438   1230.613    -1.47   0.142    -4217.396    606.5195
  Kurtosis_e |   456552.4   194447.7     2.35   0.019     75441.97    837662.8
  Skewness_u |    12182.3   2960.393     4.12   0.000     6380.038    17984.56
  Kurtosis_u |    1510700   274557.2     5.50   0.000     972577.4     2048822
Joint test for Normality on e:        chi2(2) =   7.67    Prob > chi2 = 0.0217
Joint test for Normality on u:        chi2(2) =  47.21    Prob > chi2 = 0.0000

Does this mean I should log-transform my data because of the normality issues? If not, what are the consequences of leaving the data as they are?
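
The log specification in question could be sketched as follows. This is only an illustration under assumptions: ln_highskill and ln_total are hypothetical new variables, only the two head counts are logged, and ln() returns missing for zeros, so firm-years without any high-skilled employees would drop out of the estimation sample. Whether this resolves the normality issue is not established here.

    * logs of the head counts; observations with highskill == 0 become missing
    generate double ln_highskill = ln(highskill)
    generate double ln_total = ln(total)

    * same fixed-effects model with the logged dependent variable and logged firm size
    xtreg ln_highskill investict product_inno process_inno ln_total industry collective exportshare investment turnover rnd tech i.year, fe vce(cluster idnum)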

I would appreciate any input on these issues. Thanks in advance,
