Dear Statalist members,

I am analysing a balanced panel of about 2400 firms over 12 years (Stata 13). My main goal is to estimate the effect of three dummy variables that proxy for technological innovation (investict, product_inno, process_inno) on either the number or the share of highly skilled employees. My control variables include investment (absolute numbers, many zeros), the total number of employees, average wages over all employees, the export share (as a share of total sales), a dummy for a collective bargaining agreement (collective), how up to date the production equipment is (tech), a dummy for whether the firm engages in R&D (rnd), and some more.

Several tests, conducted with the help of Statalist members, have led me to conclude that my data are not normally distributed (high skilled employees, investment, total employees and export share have many small values), and scatter plots show a non-linear relationship between my dependent variable and the independent ones. Because of the high number of meaningful (not censored) zeros in high skilled employees, investment and export share, I cannot use a log transformation for these variables.

After extensive research and the aforementioned help of Statalist members (C. Lazzaro and J. Wooldridge, thanks a lot again), I have concluded that I am basically left with two options: using the absolute number of high skilled employees with -xtpoisson, fe vce(robust)-, or using the share of high skilled employees (highskill/total) with a fractional response model like the one used in Papke and Wooldridge (2008).
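
For the second option, this is roughly how the share can be constructed and checked before estimation. It is only a sketch: share_chk is a hypothetical name I use here so that the existing share_high is not overwritten, and I assume total is the raw head count.

```stata
* Sketch only: share_chk is a hypothetical name, not a variable in my data
generate double share_chk = highskill / total if total > 0

* A fractional response must lie in [0, 1]
assert inrange(share_chk, 0, 1) if !missing(share_chk)

* How many observations sit exactly at the boundaries?
count if share_chk == 0
count if share_chk == 1
```

The two counts show how much mass sits at the corners, which is the reason for not using a log-odds transformation of the share.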

First Question:

The problem with the Poisson model is that (besides the insignificant coefficients, which may well be genuine) a RESET-type misspecification test of the form

Code:
* Step 1: baseline fixed-effects Poisson model
xtpoisson highskill investict product_inno process_inno lntotal avwages collective exportshare investment rnd tech i.industry i.year, fe vce(robust)
* Step 2: save the linear index and build its square and cube
predict xbhat, xb
generate xbhatsq = xbhat^2
generate xbhatcu = xbhat^3
* Step 3: re-estimate with the added terms and test them jointly
xtpoisson highskill investict product_inno process_inno lntotal avwages collective exportshare investment rnd tech xbhatsq xbhatcu i.industry i.year, fe vce(robust)
test xbhatsq xbhatcu
rejects the null with a p-value of 0.0000, which suggests that the model is misspecified. Is there any other specification that might work better?


Second Question:
To try the fractional response model, I found the Stata code on Papke's website. Applying their -glm- and -xtgee- code to my data yields the following:

Code:
 glm share_high investict product_inno process_inno total avwages collective exportshare investment rnd
>  tech industry i.year, fa(bin) link(probit) cluster(idnum)
note: share_high has noninteger values

Iteration 0:   log pseudolikelihood = -1672.9517  
Iteration 1:   log pseudolikelihood = -1653.9288  
Iteration 2:   log pseudolikelihood = -1653.8772  
Iteration 3:   log pseudolikelihood = -1653.8772  

Generalized linear models                          No. of obs      =      5582
Optimization     : ML                              Residual df     =      5560
                                                   Scale parameter =         1
Deviance         =  1564.744217                    (1/df) Deviance =  .2814288
Pearson          =  1922.417319                    (1/df) Pearson  =  .3457585

Variance function: V(u) = u*(1-u/1)                [Binomial]
Link function    : g(u) = invnorm(u)               [Probit]

                                                   AIC             =  .6004576
Log pseudolikelihood = -1653.877174                BIC             = -46403.06

                                (Std. Err. adjusted for 623 clusters in idnum)
------------------------------------------------------------------------------
             |               Robust
  share_high |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   investict |  -.0319885   .0510072    -0.63   0.531    -.1319608    .0679837
product_inno |   .0525227   .0703238     0.75   0.455    -.0853094    .1903547
process_inno |  -.0901038   .0505448    -1.78   0.075    -.1891698    .0089622
       total |   .0011193   .0002498     4.48   0.000     .0006296     .001609
     avwages |   .0000165   .0000207     0.80   0.423     -.000024     .000057
  collective |  -.0788953   .0695357    -1.13   0.257    -.2151827    .0573922
 exportshare |  -.2787724   .1815199    -1.54   0.125    -.6345448    .0769999
  investment |  -1.62e-08   1.52e-08    -1.07   0.286    -4.60e-08    1.35e-08
         rnd |   .2635743   .0984232     2.68   0.007     .0706684    .4564803
        tech |   .0203529   .0391601     0.52   0.603    -.0563995    .0971053
    industry |   .0011611   .0112536     0.10   0.918    -.0208956    .0232178
             |
        year |
       2008  |   .0228136   .0298333     0.76   0.444    -.0356587    .0812858
       2009  |   .0332193    .025811     1.29   0.198    -.0173693    .0838079
       2010  |   .0259046   .0285992     0.91   0.365    -.0301488    .0819581
       2011  |    .029456   .0303086     0.97   0.331    -.0299477    .0888598
       2012  |   .0636053   .0303213     2.10   0.036     .0041766    .1230339
       2013  |   .0067182   .0336056     0.20   0.842    -.0591476    .0725841
       2014  |  -.0376976   .0365879    -1.03   0.303    -.1094086    .0340133
       2015  |  -.0118872   .0332579    -0.36   0.721    -.0770715    .0532971
       2016  |  -.0473058    .040181    -1.18   0.239    -.1260592    .0314476
       2017  |  -.0099275    .040348    -0.25   0.806    -.0890082    .0691532
             |
       _cons |  -1.306921   .1410517    -9.27   0.000    -1.583377   -1.030465
------------------------------------------------------------------------------

. mat b = e(b)

. xtgee share_high investict product_inno process_inno total avwages collective exportshare investment r
> nd tech industry i.year, fa(bi) link(probit) corr(exch) robust from(b,skip)

Iteration 1: tolerance = .80279569
Iteration 2: tolerance = .10893995
....

GEE population-averaged model                   Number of obs      =      5582
Group variable:                      idnum      Number of groups   =       623
Link:                               probit      Obs per group: min =         1
Family:                           binomial                     avg =       9.0
Correlation:                  exchangeable                     max =        11
                                                Wald chi2(21)      =     35.94
Scale parameter:                         1      Prob > chi2        =    0.0222

                                  (Std. Err. adjusted for clustering on idnum)
------------------------------------------------------------------------------
             |             Semirobust
  share_high |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   investict |   .0188805    .010901     1.73   0.083    -.0024849     .040246
product_inno |   .0286437   .0184118     1.56   0.120    -.0074427      .06473
process_inno |   .0073737   .0122423     0.60   0.547    -.0166208    .0313682
       total |  -.0007271   .0005181    -1.40   0.160    -.0017425    .0002883
     avwages |   8.26e-06   6.36e-06     1.30   0.194    -4.20e-06    .0000207
  collective |  -.0394644   .0214855    -1.84   0.066    -.0815753    .0026465
 exportshare |  -.0250987   .0709382    -0.35   0.723     -.164135    .1139377
  investment |   7.70e-09   4.54e-09     1.70   0.090    -1.19e-09    1.66e-08
         rnd |  -.0133706   .0272105    -0.49   0.623    -.0667022    .0399611
        tech |  -.0206219   .0099042    -2.08   0.037    -.0400337   -.0012101
    industry |  -.0033315    .009638    -0.35   0.730    -.0222216    .0155586
             |
        year |
       2008  |   .0196429   .0173751     1.13   0.258    -.0144117    .0536975
       2009  |   .0229043   .0164129     1.40   0.163    -.0092645     .055073
       2010  |   .0192822   .0175307     1.10   0.271    -.0150773    .0536418
       2011  |   .0157518   .0203442     0.77   0.439    -.0241221    .0556257
       2012  |   .0207046   .0195638     1.06   0.290    -.0176397    .0590489
       2013  |   -.000057   .0229038    -0.00   0.998    -.0449477    .0448336
       2014  |  -.0184354   .0216876    -0.85   0.395    -.0609424    .0240715
       2015  |  -.0222338   .0219286    -1.01   0.311     -.065213    .0207455
       2016  |  -.0241677   .0242468    -1.00   0.319    -.0716906    .0233551
       2017  |  -.0246466   .0248686    -0.99   0.322    -.0733881     .024095
             |
       _cons |  -1.026956   .0853745   -12.03   0.000    -1.194287   -.8596252
------------------------------------------------------------------------------

I have to admit that I have had difficulties with the Papke and Wooldridge (2008) paper, especially in terms of interpretation. I have read it several times, and all threads on Statalist refer people to this paper. My questions are: What is the difference between the -glm- and the -xtgee- specification here? What do the coefficients tell me? Do I have to add dummies for my firms, since -fe- is not an option here? That would be practically impossible because I have too many firms. I have read that time averages of the explanatory variables should be added instead; how is that done, and in which of the two specifications are they included?
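
To make the last question concrete, this is my current understanding of the time-averages idea, written as a rough sketch and not taken from the authors' code. The m_ prefix and the local xvars are names I made up. As far as I understand, the paper works with a balanced panel, whereas my estimation sample above is unbalanced (1 to 11 observations per firm), so I am not sure the averages are valid as computed here.

```stata
* Sketch only; run as one block in a do-file. Assumes: xtset idnum year
* The m_ prefix is hypothetical and must not clash with existing variables
local xvars investict product_inno process_inno total avwages collective exportshare investment rnd tech

* Firm-level time averages of each covariate
foreach v of local xvars {
    bysort idnum: egen double m_`v' = mean(`v')
}

* Pooled fractional probit with the time averages added
glm share_high `xvars' m_* i.year, family(binomial) link(probit) vce(cluster idnum)

* Average partial effects of the innovation dummies (treated as continuous here)
margins, dydx(investict product_inno process_inno)

* GEE variant with the same regressors and an exchangeable working correlation
xtgee share_high `xvars' m_* i.year, family(binomial) link(probit) corr(exchangeable) vce(robust)
```

Is this the right way to include the averages, and should they enter both the -glm- and the -xtgee- specification?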

I would really appreciate some help in understanding the fractional model.

Literature:
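Papke, L. E., and J. M. Wooldridge, “Econometric Methods for Fractional Response Variables with an Application to 401(k) Plan Participation Rates,” Journal of Applied Econometrics 11 (1996), 619–632.
Papke, L. E., and J. M. Wooldridge, “Panel Data Methods for Fractional Response Variables with an Application to Test Pass Rates,” Journal of Econometrics 145 (2008), 121–133.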