Sunday, March 31, 2019

Omitted Dummy Variable on -spxtregress-

Hello, I think I have an identification problem after running a random-effects panel data regression using spxtregress. I ran spxtregress for a panel spatial error model and am not quite sure why my year dummy variable (y2005) was omitted. I was trying to examine, in a simple way, the effect of a natural disaster by looking at the difference in log GDP per capita between the base year (2004) and the rest of the period (2005-2007).

I ran xtreg for comparison and it did just fine. Would anyone help me identify what went wrong? Thank you!

Additional information: I use Stata 15.0 for Windows.

Code:
. spxtregress lgdp_pcp y2005 y2006 y2007 
>         dead_prc 
>         y2005_dead_prc y2006_dead_prc y2007_dead_prc 
>         year, re errorlag(wmat);
  (294 observations)
  (126 observations excluded due to missing values)
  (168 observations used)
  (data contain 21 panels (places) )
  (weighting matrix defines 21 places)
note: lgdp_pcp:y2005 omitted because of collinearity
note: lgdp_pcp:y2005_dead_prc omitted because of collinearity
note: lgdp_pcp:y2007_dead_prc omitted because of collinearity
note: lgdp_pcp:_cons omitted because of collinearity

Fitting starting values:

Iteration 0:   log likelihood =  57.911505  
Iteration 1:   log likelihood =   57.94042  
Iteration 2:   log likelihood =  57.940496  
Iteration 3:   log likelihood =  57.940496  

Optimizing concentrated log likelihood:

initial:       log likelihood =  26.544279
improve:       log likelihood =  26.544279
rescale:       log likelihood =  26.544279
rescale eq:    log likelihood =   37.92344
Iteration 0:   log likelihood =   37.92344  
Iteration 1:   log likelihood =  42.978376  
Iteration 2:   log likelihood =  43.020721  
Iteration 3:   log likelihood =  43.020755  
Iteration 4:   log likelihood =  43.020755  

Optimizing unconcentrated log likelihood:

Iteration 0:   log likelihood =  43.020755  
Iteration 1:   log likelihood =  43.020755  (backed up)

Random-effects spatial regression               Number of obs     =        168
Group variable: _ID                             Number of groups  =         21
                                                Obs per group     =          8

                                                Wald chi2(5)      =     489.17
                                                Prob > chi2       =     0.0000
Log likelihood =    43.0208                     Pseudo R2         =     0.0944

--------------------------------------------------------------------------------
      lgdp_pcp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
lgdp_pcp       |
         y2005 |          0  (omitted)
         y2006 |   19.68171    11.2663     1.75   0.081    -2.399835    41.76326
         y2007 |   19.68406   11.28865     1.74   0.081    -2.441293    41.80941
      dead_prc |   .0115508   .0070841     1.63   0.103    -.0023337    .0254353
y2005_dead_prc |          0  (omitted)
y2006_dead_prc |  -.0012556   .0032023    -0.39   0.695     -.007532    .0050208
y2007_dead_prc |          0  (omitted)
          year |  -.0089161   .0056161    -1.59   0.112    -.0199234    .0020913
         _cons |          0  (omitted)
---------------+----------------------------------------------------------------
wmat           |
    e.lgdp_pcp |    2.14346   .1849234    11.59   0.000     1.781016    2.505903
---------------+----------------------------------------------------------------
      /sigma_u |    .377704   .0592981                      .2776618    .5137918
      /sigma_e |   .1512049   .0088053                      .1348952    .1694864
--------------------------------------------------------------------------------
Wald test of spatial terms:          chi2(1) = 134.35     Prob > chi2 = 0.0000

. xtreg lgdp_pcp y2005 y2006 y2007 
>         dead_prc 
>         y2005_dead_prc y2006_dead_prc y2007_dead_prc 
>         year, re;

Random-effects GLS regression                   Number of obs     =        257
Group variable: _ID                             Number of groups  =         21

R-sq:                                           Obs per group:
     within  = 0.2149                                         min =          8
     between = 0.0072                                         avg =       12.2
     overall = 0.0781                                         max =         14

                                                Wald chi2(8)      =      63.04
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

--------------------------------------------------------------------------------
      lgdp_pcp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
         y2005 |   .1517669    .094186     1.61   0.107    -.0328343     .336368
         y2006 |   .1831134   .0954493     1.92   0.055    -.0039638    .3701905
         y2007 |   .0616214   .1022244     0.60   0.547    -.1387349    .2619776
      dead_prc |    .000789   .0099109     0.08   0.937     -.018636    .0202141
y2005_dead_prc |  -.0021991   .0063445    -0.35   0.729    -.0146341    .0102359
y2006_dead_prc |   .0058923   .0063114     0.93   0.351    -.0064777    .0182624
y2007_dead_prc |   .0109464   .0034607     3.16   0.002     .0041636    .0177293
          year |   .0240922    .011762     2.05   0.041     .0010392    .0471453
         _cons |  -46.72111   23.55381    -1.98   0.047    -92.88572   -.5564968
---------------+----------------------------------------------------------------
       sigma_u |  .52632943
       sigma_e |  .31472541
           rho |  .73661597   (fraction of variance due to u_i)
--------------------------------------------------------------------------------
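
A minimal diagnostic sketch, assuming it is run immediately after the spxtregress call so that e(sample) marks its 168-observation estimation sample: _rmcoll reports which of these terms become collinear once the 126 observations with missing values are dropped.

Code:
_rmcoll y2005 y2006 y2007 dead_prc y2005_dead_prc y2006_dead_prc ///
    y2007_dead_prc year if e(sample)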

Problem with Multiple Imputation

Hi all, I am (seemingly) having some trouble with my imputation model. I am trying to impute values for missing data for variables related to cancer and demographics such as stage, grade, receptor status, and deprivation.

I am using multiple imputation with chained equations, and am correctly specifying the model for each imputed variable as far as I am aware. All the separate models converge, but when I test the imputed values vs observed values after imputing, I am getting some wild differences between the two. I know there is no test to ascertain whether this is a problem, but the differences between the observed and imputed values are concerning me.

Can anyone point me in a sensible direction so I can go about correcting my imputation model?
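
A hedged sketch of one way to eyeball observed versus completed distributions; the variable names and imputation models are hypothetical stand-ins, and the data are assumed to be already mi set with the relevant variables registered.

Code:
* stage and grade stand in for the ordinal imputed variables; age and
* deprivation stand in for complete predictors
mi impute chained (ologit) stage grade = age i.deprivation, add(10) rseed(1234)
mi xeq 0: tabulate stage      // observed data only
mi xeq 1/3: tabulate stage    // first three completed data sets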

Many thanks.

Plot confidence interval for median

I'd appreciate it if anyone could help me with this one. I need to plot medians and percentile confidence intervals (2.5 and 97.5) instead of means and CIs (I basically need an alternative to ciplot):

ciplot continuous_variable, by(time_period)

Here is the dataset.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int id float(time_period continuous_variable)
 1 0 18
 1 1 13
 1 2  5
 2 0 18
 2 1 57
 2 2 43
 3 0  0
 3 1 38
 3 2  0
 4 1 63
 4 2  5
 7 0 26
 7 1 11
 7 2  8
 7 3  6
 8 0 67
 8 1 39
 8 2 65
12 0  6
12 1 22
12 2 22
15 0  5
15 1 15
16 0 21
16 1 98
16 2 56
17 0 15
17 1 12
17 2 38
17 3 10
18 0 41
18 1 42
19 0 14
19 1 13
21 0  3
21 1  9
21 2  9
end
label values time_period a
label def a 0 "first", modify
label def a 1 "second", modify
label def a 2 "third", modify
label def a 3 "fourth", modify
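
A minimal sketch of an alternative to ciplot, assuming the medians are to be bracketed by the 2.5th and 97.5th percentiles: collapse to those quantiles per period with statsby and centile, then plot with rcap and scatter. (If the binomial confidence interval of the median is wanted instead, centile with centile(50) returns it in r(lb_1) and r(ub_1).)

Code:
preserve
statsby p50=r(c_2) p2_5=r(c_1) p97_5=r(c_3), by(time_period) clear: ///
    centile continuous_variable, centile(2.5 50 97.5)
twoway (rcap p2_5 p97_5 time_period) (scatter p50 time_period), ///
    xlabel(0/3) legend(off) ytitle(continuous_variable)
restore
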
Thank you

Seasonally adjusted GDP

I guess this is a simple question.

How do I get the continuously compounded growth rate of seasonally adjusted GDP?

Without continuous compounding, it's simply

gen rgdp = gdp[_n+4]/gdp[_n] - 1

The data is quarterly U.S. GDP.


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double gdp float yq
1934.5 -52
1932.3 -51
1930.3 -50
1960.7 -49
1989.5 -48
2021.9 -47
2033.2 -46
2035.3 -45
2007.5 -44
2000.8 -43
2022.8 -42
2004.7 -41
2084.6 -40
2147.6 -39
2230.4 -38
2273.4 -37
2304.5 -36
2344.5 -35
2392.8 -34
2398.1 -33
2423.5 -32
2428.5 -31
2446.1 -30
2526.4 -29
2573.4 -28
2593.5 -27
2578.9 -26
2539.8 -25
  2528 -24
2530.7 -23
2559.4 -22
2609.3 -21
2683.8 -20
2727.5 -19
2764.1 -18
2780.8 -17
  2770 -16
2792.9 -15
2790.6 -14
2836.2 -13
2854.5 -12
2848.2 -11
2875.9 -10
2846.4  -9
2772.7  -8
2790.9  -7
2855.5  -6
2922.3  -5
2976.6  -4
  3049  -3
3043.1  -2
3055.1  -1
3123.2   0
3111.3   1
3119.1   2
3081.3   3
3102.3   4
3159.9   5
3212.6   6
3277.7   7
3336.8   8
3372.7   9
3404.8  10
  3418  11
3456.1  12
3501.1  13
3569.5  14
  3595  15
3672.7  16
3716.4  17
3766.9  18
3780.2  19
3873.5  20
3926.4  21
4006.2  22
4100.6  23
4201.9  24
4219.1  25
4249.2  26
4285.6  27
4324.9  28
4328.7  29
4366.1  30
4401.2  31
4490.6  32
4566.4  33
4599.3  34
4619.8  35
4691.6  36
4706.7  37
4736.1  38
4715.5  39
4707.1  40
4715.4  41
4757.2  42
4708.3  43
4834.3  44
4861.9  45
  4900  46
4914.3  47
end
format %tq yq
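
A minimal sketch, assuming the series above is already seasonally adjusted (as published U.S. real GDP is) and that "continuously compounded" means log differences; the factors 400 and 100 are the usual annualization and percentage conventions for quarterly data.

Code:
tsset yq
gen rgdp_cc  = 400*(ln(gdp) - ln(L.gdp))   // annualized quarter-on-quarter growth
gen rgdp_yoy = 100*(ln(gdp) - ln(L4.gdp))  // log-difference analogue of gdp[_n+4]/gdp[_n]-1
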
Thanks a lot!

How to filter data using keywords

Hi everyone,

I am a novice user of Stata. I have collected Execucomp data in Excel, where manager data are available by title and company name. Out of all the managers listed, I only want to work with the CEOs, so only the data points with the titles "CEO" or "executive" or "Chief Executive Officer" would suffice for me. Can anyone help me out with the appropriate code to filter data with keywords?
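
A hedged sketch, assuming the titles sit in a string variable called title (a hypothetical name); note that matching "executive" also catches "Chief Executive Officer".

Code:
keep if strpos(title, "CEO") | strpos(lower(title), "executive")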

Thanks a lot!

Wali

how to suppress the "graphs by group" note when graphing a histogram with the by option

I do not want to display "Graphs by group". Here is the code:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(tenaciousGoalPursuit group)
2.9 2
2.1 2
3.5 2
2.1 1
3.9 2
  4 2
3.4 2
2.4 1
3.3 2
3.2 2
3.3 1
3.2 1
2.2 1
3.7 1
  4 2
2.8 1
2.9 2
2.8 1
3.2 2
3.9 1
2.8 1
2.7 1
3.5 2
2.9 2
3.5 2
2.5 1
2.6 1
3.5 1
3.4 1
2.8 1
3.2 2
3.4 1
3.3 2
2.7 1
2.9 2
3.1 2
3.3 1
2.5 1
4.5 1
3.2 1
4.5 2
3.5 1
  2 1
2.1 1
3.4 2
2.6 2
2.6 2
3.3 2
2.8 1
3.5 2
3.6 1
3.3 2
  3 2
3.5 2
2.5 1
4.1 2
2.8 2
3.1 2
2.5 1
3.1 1
2.9 2
3.1 2
3.4 1
2.6 1
1.7 1
3.8 2
3.3 1
2.3 1
3.5 1
2.3 1
2.6 1
2.2 1
2.7 1
3.7 1
2.3 1
3.2 2
3.6 2
3.8 1
3.8 1
2.7 1
3.2 1
3.5 2
3.3 1
  4 1
2.9 2
3.4 2
2.4 1
2.5 2
2.8 1
  3 1
3.6 2
2.2 1
3.1 1
2.4 1
2.9 2
3.3 1
3.2 2
3.4 1
3.3 1
  3 1
end
hist tenaciousGoalPursuit, by(group) normal
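
A minimal sketch: the "Graphs by group" text is the note() that by() adds, so blanking it inside by() suppresses it.

Code:
hist tenaciousGoalPursuit, by(group, note("")) normal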

METAN command on log-transformed odds ratios

Dear Statalist Member,

I have several effect sizes (ES) with accompanying 95% confidence intervals (ES_lci = lower confidence limit of the estimate and ES_uci = upper confidence limit of the estimate) from different studies and want to get a summary estimate using a random-effects model.

I now used the following commands:
Code:
metan ES ES_lci ES_uci, random

*** Getting a model with log-transformed ES
gen ln_ES = ln(ES)
gen ln_lci = ln(ES_lci)
gen ln_uci = ln(ES_uci)

metan ln_ES ln_lci ln_uci, random eform
First, I was surprised that although the forest plots are quite similar, the weights, and thus the summary estimate, differ between the "log-transformed-effect-size" model and the "normal effect-size" model.

Does anyone know which is the correct way, or whether there is an advantage to using the log-transformed model? (At first glance, its weights seem closer to each other, so studies with smaller samples are weighted more heavily and studies with larger samples less.)

Thank you in advance for your opinion.
Martin

Stata file to learn from

Hello all, I am trying to learn Stata, and I was wondering if you have a Stata file (containing sample data) that you prepared and are willing to share along with the do-file, or if you are aware of such files that can be downloaded. I would like to learn more about how to do regressions, means, SDs, t-tests, etc. by seeing already prepared files/projects and trying to replicate them on my own data. I think that would make learning much easier.
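
A minimal sketch of a ready-made starting point: Stata ships with practice datasets loadable via sysuse, which can be explored with exactly these commands.

Code:
sysuse auto, clear             // built-in practice dataset
summarize price mpg            // means and SDs
ttest price, by(foreign)       // two-group t test
regress price mpg weight       // linear regression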

Thanks a lot for any help or advice.
Maen

Instrumental Variables

Hello Experts,
I am employing a GMM approach on a dynamic panel model. There are three endogenous variables, so how many instruments do I need to use, and can I use lags of the endogenous variables as instruments? Thanks
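
A hedged sketch of the usual convention with xtabond2 (from SSC); y, x1-x3, and z1 are hypothetical stand-ins for the dependent variable, the three endogenous regressors, and an exogenous control, and the lag ranges are illustrative.

Code:
* lags 2 and deeper of endogenous regressors (equivalently, lags 1 and
* deeper of L.y) are valid GMM-style instruments; the collapse suboption
* of gmm() can curb the instrument count
xtabond2 y L.y x1 x2 x3 z1, gmm(L.y, lag(1 3)) gmm(x1 x2 x3, lag(2 4)) ///
    iv(z1) twostep robust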

Moderating effects in VAR possible

Dear all,

I am analyzing panel data covering 52 firms over one year and thinking about employing a VAR-model including exogenous control variables.

However, my conceptual model actually also includes two moderating variables. Now I am wondering whether such a relationship is actually detectable within a VAR model? Is there a special way to enter moderators into the code in Stata or would I recognize such effects somewhere in the output?

Thank you very much and best wishes
Mona

bootstrapping clustered standard errors in system GMM help!

Hi guys,
I am running a diff-in-diff fixed effects model with 16 countries between 2002 and 2017, testing whether the Spanish labour reform (treatment) had an impact on FDI after 2012 (treatment point).

In my first regression I had to bootstrap my standard errors as follows:

- bootstrap, reps(500) bca seed(1): xtreg netfdiinflow reformspain reform_others tradegdp inflation humancapital xrate investmentratio realgdppercapita corptax c.time_trend#i.country_n i.year, fe

I now want to undertake a system GMM, as my reformspain independent variable is endogenous, and I want to take only the 1st and 2nd lags of the dependent variable for IV estimation. This is the regression I ran:
- xtabond2 netfdiinflow L.netfdiinflow L2.netfdiinflow reformspain reform_others tradegdp inflation humancapital xrate investmentratio realgdppercapita corptax c.time_trend#i.country_n, gmm(L.netfdiinflow L2.netfdiinflow) iv(reformspain reform_others tradegdp inflation humancapital xrate investmentratio realgdppercapita corptax) twostep

My questions are:
1. Is taking the lag of the dependent variable enough, given that it's my reformspain dummy (the variable of interest) which is endogenous?
2. Given I had to bootstrap my first equation because my standard errors were clustered, do I have to bootstrap the system GMM as well, or does GMM deal with clustered standard errors itself (see the sketch below)? If I do have to bootstrap, how would I implement that in Stata in terms of commands?
3. Could someone help me out with translating the system GMM regression I have done into a model, as I don't understand the associated algebra too well.
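
On question 2, a hedged note: in xtabond2, robust standard errors are clustered on the panel variable, and combined with twostep they apply the Windmeijer finite-sample correction, so a separate bootstrap is not normally needed. As a sketch, the same command with robust added:

Code:
xtabond2 netfdiinflow L.netfdiinflow L2.netfdiinflow reformspain reform_others ///
    tradegdp inflation humancapital xrate investmentratio realgdppercapita ///
    corptax c.time_trend#i.country_n, gmm(L.netfdiinflow L2.netfdiinflow) ///
    iv(reformspain reform_others tradegdp inflation humancapital xrate ///
    investmentratio realgdppercapita corptax) twostep robust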

Thank you!

Foreach in mata

Dear All,

just a quick question. Is it possible to use foreach within a mata function?

Suppose I have something like:

Code:
mata:
foreach v of global regressors{
    D = 1 :/ W_`v'
    W_`v'2 =  editmissing(1 :/ (D:^2), 0)
}
end
I get the following error message:

Code:
. mata:
------------------------------------------------- mata (type end to exit) -----------------------------------
: foreach v of global regressors{
'global' found where almost anything else expected
(3 lines skipped)
-------------------------------------------------------------------------------------------------------------
Is there any mistake in my code, or is it not possible to include foreach in Mata? If the latter, is there a way to work around this?
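
One hedged workaround sketch: foreach belongs to Stata's ado language, not Mata, so keep the loop in Stata and issue one-line mata: calls; the macro expands before Mata parses each line (this assumes the W_* matrices already exist in Mata's global workspace).

Code:
foreach v of global regressors {
    mata: D = 1 :/ W_`v'
    mata: W_`v'2 = editmissing(1 :/ (D:^2), 0)
}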

Thanks in advance.

reshape the dataset

Hello

I have the dataset below. It is just one long row of variables named like bV_1998_g1, where 1998 is the year (the years run through 2010) and g1-g17 is the group. I want something like reshape long bV, i(group) j(year), but there are no year and group variables yet, so the data are not set up as a time series.
How can I reshape this? Thank you!
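
A hypothetical sketch, assuming a single observation with variables named bV_1998_g1 through bV_2010_g17: reshape on the combined suffix first, then split it into year and group.

Code:
gen id = _n
reshape long bV_, i(id) j(year_group) string
split year_group, parse(_)
rename (year_group1 year_group2) (year group)
destring year, replace
drop id year_group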

RESET test for omitted variables

Hello,

I am trying to run diagnostics on my model and I would like to clarify the following uncertainty. I have panel data and I am using fixed effects. Initially I estimated a model without the lagged dependent variable as a regressor, but because I suspected autocorrelation I later included it. When running Ramsey's RESET test, should I run it on the initial model or the second one? The result for the first suggests omitted variables; however, when the lagged variable is included, there are no omitted variables. How can I interpret this?

Fill missing observations of a variable with zero

Dear Stata Users,
Please help me to resolve the following issue. Below I have a dataset. I need to fill "short" with zero for all missing observations (based on "fdate" – year/month) where there is a gap in "fdate". For example, for firm gvkey = 001004 the fdate 1989m5 is missing, so I need to create this missing row and fill "short" with zero. For gvkey = 002020, the dates 1994m8 and 1994m9 are missing; as before, I need two extra rows with all the variables and short == 0. The only restriction is that short should equal zero for the rows created for missing "fdate" values that lie between the minimum and maximum "fdate" present in the data sample.
Please, help me with this issue.



Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 gvkey float(year month fdate short)
"001004" 1989  1 348  11428
"001004" 1989  2 349   9300
"001004" 1989  3 350  48789
"001004" 1989  4 351  15009
"001004" 1989  6 353  46188
"001004" 1989  7 354  56251
"001004" 1989  8 355 121489
"001004" 1989  9 356 136398
"001004" 1989 10 357 108341
"001004" 1989 11 358 119100
"001004" 1989 12 359  81300
"001004" 1990  1 360  35871
"001004" 1990  2 361  21070
"001004" 1990  3 362  71900
"001004" 1990  4 363  81941
"001004" 1990  5 364  68870
"001004" 1990  6 365  95258
"001004" 1990  7 366 100808
"001004" 1990  8 367 147758
"001004" 1990  9 368 124800
"001004" 1990 10 369 103875
"001004" 1990 11 370  86360
"001004" 1990 12 371 160050
"001004" 1991  1 372  82650
"001004" 1991  2 373 329988
"001004" 1991  3 374  87426
"001004" 1991  4 375  60990
"001004" 1991  5 376  27297
"001004" 1991  6 377  20640
"001004" 1991  7 378  17940
"001004" 1991  8 379  35498
"001004" 1991  9 380  76827
"001004" 1991 10 381  96424
"001004" 1991 11 382  44915
"001004" 1991 12 383 108947
"001004" 1992  1 384 106710
"001004" 1992  4 387   3005
"001004" 1992  5 388  35181
"001004" 1992  6 389  40041
"001004" 1992  7 390  11400
"001004" 1992  8 391  15497
"001004" 1992  9 392  13950
"001004" 1992 10 393  27204
"001004" 1992 11 394  19965
"001004" 1992 12 395   9452
"001004" 1993  1 396  41250
"001004" 1993  2 397   4538
"001004" 1993  3 398  58290
"002020" 1993  4 399  39700
"002020" 1993  5 400  60600
"002020" 1993  6 401    875
"002020" 1993  7 402  51410
"002020" 1993  8 403  47500
"002020" 1993  9 404  46900
"002020" 1993 10 405  59200
"002020" 1993 11 406  47800
"002020" 1993 12 407  39100
"002020" 1994  1 408  25309
"002020" 1994  2 409  39342
"002020" 1994  3 410  21600
"002020" 1994  4 411  26100
"002020" 1994  5 412  21900
"002020" 1994  6 413   6100
"002020" 1994  7 414   6600
"002020" 1994 10 417   6150
"002020" 1994 11 418  19450
"002020" 1994 12 419  42550
"002020" 1995  1 420  26722
"002020" 1995  2 421   3418
"002020" 1995  3 422   8177
"002020" 1995  4 423   3942
"002020" 1995  5 424  80142
"002020" 1995  6 425 103277
"002020" 1995  7 426  31650
"002020" 1995  8 427  14950
"002020" 1995  9 428  14050
"002020" 1995 10 429  38450
"002020" 1995 11 430  14270
"002020" 1995 12 431   4700
"002020" 1996  1 432   7300
"002020" 1996  2 433      0
"002020" 1996  3 434  18194
"002020" 1996  4 435  14785
"002020" 1996  5 436  40285
"002020" 1996  6 437  15685
"002020" 1996  7 438  15285
"002020" 1996  8 439  29085
"002020" 1996  9 440  69285
"002020" 1996 10 441 111685
"002020" 1996 11 442 158782
"002020" 1996 12 443 106562
"002020" 1997  1 444 161350
"002020" 1997  2 445  99460
"002020" 1997  3 446 249823
"002020" 1997  4 447 405050
"002020" 1997  5 448 451523
"002020" 1997  6 449 364347
"002020" 1997  7 450 325880
end
format %tm fdate
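
A minimal sketch using tsfill, which creates the missing monthly rows within each gvkey's observed range (tsset needs a numeric panel variable, hence the temporary group id).

Code:
egen long gid = group(gvkey)
tsset gid fdate
tsfill
bysort gid (fdate): replace gvkey = gvkey[1] if gvkey == ""
replace short = 0 if missing(short)
replace year  = year(dofm(fdate))  if missing(year)
replace month = month(dofm(fdate)) if missing(month)
drop gid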

Why don't logit regression and marginal effects generate the same z-statistics?

Dear Stata experts,

I've run a logit regression and its marginal effects as follows. I thought they should have the same z-statistics, but I couldn't manage to get that; do you possibly know why? Can you please suggest a fix if possible? Many thanks to you in advance!


Code:
logit delist inv profit book_lev MTT, cluster(permno)

Iteration 0:   log pseudolikelihood = -18638.798  
Iteration 1:   log pseudolikelihood = -17116.388  
Iteration 2:   log pseudolikelihood = -13925.904  
Iteration 3:   log pseudolikelihood = -13667.722  
Iteration 4:   log pseudolikelihood = -13661.271  


Logistic regression                             Number of obs     =    102,384
                                                Wald chi2(84)     =    4529.30
                                                Prob > chi2       =     0.0000
Log pseudolikelihood =  -13661.16               Pseudo R2         =     0.2671

                                                         (Std. Err. adjusted for 10,947 clusters in permno)
-----------------------------------------------------------------------------------------------------------
                                          |               Robust
                                   delist |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------+----------------------------------------------------------------
                                     inv |   .5335151   .2469513     2.16   0.031     .0494995    1.017531
                                   profit |  -1.891079   .0930199   -20.33   0.000    -2.073395   -1.708763
                                 book_lev |   1.609226   .1046438    15.38   0.000     1.404128    1.814324
                                      MTT |  -.5021871   .0129645   -38.74   0.000     -.527597   -.4767773



. margins, dydx(*)

Average marginal effects                        Number of obs     =    102,384
Model VCE    : Robust

Expression   : Pr(delist), predict()
dy/dx w.r.t. : inef profit book_lev MTT

-----------------------------------------------------------------------------------------------------------
                                          |            Delta-method
                                          |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------+----------------------------------------------------------------
                                     inef |    .018774   .0086892     2.16   0.031     .0017435    .0358045
                                   profit |  -.0665456   .0033517   -19.85   0.000    -.0731148   -.0599764
                                 book_lev |   .0566274   .0037265    15.20   0.000     .0493236    .0639312
                                      MTT |  -.0176716   .0004976   -35.51   0.000    -.0186468   -.0166963

How to find best similar variables using overlaid graph or histograms or else?

I have 51 variables, each containing 6,120 values: resid1, resid2, ..., resid51.
These are predicted values with different residuals from a set of regressions.
What I want to do here is find which of the other 50 residual series best matches resid1.
If I make only 10 graphs using the following code, it is too hard to find the one with the most similar trend to resid1, because the lines are overlaid too closely.

Code:
line resid resid2 resid3 resid4 resid5 resid6 resid7 resid8 resid9 resid10 fips, legend(label(1 "1") label(2 "2") label(3 "3") label(4 "4") label(5 "5") label(6 "6") label(7 "7") label(8 "8") label(9 "9") label(10 "10"))
So if I drew all 51 graphs, it would be even harder to tell them apart and find the most similar one.
Is there a different way to find this trend?
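
A minimal sketch: rather than eyeballing 51 overlaid lines, rank the candidates by their correlation with resid1.

Code:
forvalues i = 2/51 {
    quietly correlate resid1 resid`i'
    display "resid`i'  r = " %7.4f r(rho)
}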

Thanks.

Interaction between categorical and continous variable - MNL

Dear Stata users,

I would like to generate interactions between a categorical and a continuous variable in an MNL model. This is the code I am using:
xi: clogit choice Mov MovS Rst i.obs_MS#c.day1 i.obs_M#c.day1 i.obs_R#c.day1 , group(lp1)

Unfortunately this does not work.

This is the output I am getting:
"interactions not allowed"

Is there a short way to generate such interactions?
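
A hedged sketch: the error most likely comes from the old xi prefix, which predates factor-variable # notation; clogit accepts factor variables directly in Stata 11 or later, so dropping xi: may be all that is needed.

Code:
clogit choice Mov MovS Rst i.obs_MS#c.day1 i.obs_M#c.day1 i.obs_R#c.day1, group(lp1)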





SFA translog functional form

Hello everyone.
I am doing some technical efficiency analysis using cross-sectional data. I am looking for syntax that can estimate a translog functional form for stochastic frontier analysis. I will appreciate your feedback.
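
A hypothetical sketch with Stata's frontier command for cross-sectional stochastic frontier models; lny, lnk, and lnl stand in for log output and two log inputs.

Code:
* translog terms: squares and the cross-product of the log inputs
gen lnk2 = 0.5*lnk^2
gen lnl2 = 0.5*lnl^2
gen lnkl = lnk*lnl
frontier lny lnk lnl lnk2 lnl2 lnkl
predict te, te     // technical efficiency estimates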

Missing values, but no gaps in the data

Dear All,

I have the following data:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte _ID str52 regionname float(trust trust2)
60 "ABRUZZI - MOLISE - BASILICATA"                        29.31034 29.31034
46 "ALENTEJO"                                             17.77778 17.77778
44 "ALGARVE"                                              26.08696 26.08696
15 "ANDALUCIA"                                            24.47873 24.47873
12 "ARAGON"                                               55.93221 55.93221
62 "ASTURIAS-CANTABRIA"                                   31.86275 31.86275
48 "AZORE ISLANDS"                                        19.54023 19.54023
 4 "BADEN-WUERTTEMBERG"                                   44.13265 44.13265
17 "BALEARES"                                             23.65591 23.65591
 9 "BAYERN"                                               34.59119 34.59119
64 "BREMEN HAMBURG"                                       35.52631 35.52631
42 "CALABRIA"                                              37.3494  37.3494
33 "CAMPANIA"                                             28.01303 28.01303
18 "CANARIAS"                                             23.56688 23.56688
21 "CASTILLA-LA MANCHA"                                   32.59669 32.59669
19 "CASTILLA-LEON"                                        42.19949 42.19949
22 "CATALUNA"                                             34.40135 34.40135
47 "CENTRO (P)"                                           19.70803 19.70803
14 "COMUNIDAD VALENCIANA"                                 24.49941 24.49941
68 "EAST ANGLIA"                                          47.41379 47.41379
27 "EAST FR"                                              19.19192 19.19192
54 "EAST MIDLANDS"                                        33.57664 33.57664
41 "EMILIA-ROMAGNA"                                       30.84112 30.84112
13 "EXTREMADURA"                                          26.38889 26.38889
10 "GALICIA"                                              31.03448 31.03448
 7 "HESSEN"                                               32.77311 32.77311
30 "ILE DE FRANCE"                                        26.21951 26.21951
38 "LAZIO"                                                 27.7027  27.7027
32 "LIGURIA"                                              37.69231 37.69231
45 "LISBOA E VALE DO TEJO"                                22.77228 22.77228
31 "LOMBARDIA"                                             44.3038  44.3038
49 "MADEIRA"                                              23.85321 23.85321
20 "MADRID"                                               41.71779 41.71779
23 "MEDITERREAN FR"                                             22       22
16 "MURCIA"                                               34.16667 34.16667
63 "NAVARRA - RIOJA"                                      41.13475 41.13475
 8 "NIEDERSACHSEN"                                        43.79085 43.79085
36 "NOORD NEDERLAND - GRONINGEN"                          47.29729 47.29729
 5 "NORDRHEIN-WESTFALEN"                                  39.70037 39.70037
43 "NORTE"                                                21.14804 21.14804
26 "NORTH FR"                                             17.04545 17.04545
66 "NORTH UK"                                             26.44628 26.44628
67 "NORTH WEST UK"                                        32.37705 32.37705
51 "NORTHERN IRELAND"                                     43.62416 43.62416
34 "OOST NEDERLAND"                                       64.13793 64.13793
11 "PAIS VASCO"                                           40.33796 40.33796
29 "PARIS BASIN EAST/WEST"                                 14.1791  14.1791
57 "PIEMONTE - VALLLE D'AOSTA"                            37.76224 37.76224
39 "PUGLIA"                                               29.16667 29.16667
 2 "REG.BRUXELLES-CAP./BR"                                26.89805 26.89805
 3 "REGION WALLONNE"                                      28.87218 28.87218
65 "RHEINLAND-PFALZ SAARL"                                42.51208 42.51208
 6 "SCHLESWIG-HOLSTEIN"                                   34.18803 34.18803
50 "SCOTLAND"                                             39.24051 39.24051
61 "SICILIA - SARDEGNA"                                   26.86567 26.86567
25 "SOUTH EAST FR"                                        24.78633 24.78633
69 "SOUTH EAST UK"                                        39.62617 39.62617
24 "SOUTH WEST FR"                                        30.18868 30.18868
53 "SOUTH WEST UK"                                        34.25414 34.25414
40 "TOSCANA"                                              35.52631 35.52631
58 "TRENTINO ALTO ADIGE - VENETO - FRIULI VENEZIA GIULIA" 48.96265 48.96265
59 "UMBRIA - MARCHE"                                       35.9375  35.9375
 1 "VLAAMS GEWEST"                                        37.72414 37.72414
52 "WALES"                                                40.75235 40.75235
28 "WEST FR"                                              26.71756 26.71756
56 "WEST MIDLANDS"                                        41.42259 41.42259
35 "WEST NEDERLAND"                                       53.16456 53.16456
55 "YORKSHIRE AND HUMBERS"                                 34.0708  34.0708
37 "ZUID NEDERLAND"                                             50       50
end
I define the following function in Mata:

Code:
mata:
function SinvD(vi, vj)
{
return (1/sqrt((vi-vj)*(vi-vj)'))
}
end
Then I type:

Code:
spmatrix userdefined W`v'=SinvD(trust trust2)
When running it, I get the following error message:

Code:
. spmatrix userdefined Wtrust=SinvD(trust trust2)
user-defined function returned inappropriate value
    When working on row 60 of the matrix, the user-defined function returned missing values.
r(459);
However, I cannot understand where this error comes from. I do not have any missing values in the data. Notice that if I use the following data (the structure is the same as above), it runs smoothly:


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte _ID str52 regionname float(control control2)
60 "ABRUZZI - MOLISE - BASILICATA"                         50.2924  50.2924
46 "ALENTEJO"                                             61.23738 61.23738
44 "ALGARVE"                                              63.16425 63.16425
15 "ANDALUCIA"                                            66.64336 66.64336
12 "ARAGON"                                               60.03899 60.03899
62 "ASTURIAS-CANTABRIA"                                   61.26402 61.26402
48 "AZORE ISLANDS"                                        58.78553 58.78553
 4 "BADEN-WUERTTEMBERG"                                   67.37697 67.37697
17 "BALEARES"                                               53.125   53.125
 9 "BAYERN"                                               67.61789 67.61789
64 "BREMEN HAMBURG"                                       70.76023 70.76023
42 "CALABRIA"                                             54.33604 54.33604
33 "CAMPANIA"                                             62.90082 62.90082
18 "CANARIAS"                                               67.212   67.212
21 "CASTILLA-LA MANCHA"                                   63.17967 63.17967
19 "CASTILLA-LEON"                                        60.98516 60.98516
22 "CATALUNA"                                             60.86318 60.86318
47 "CENTRO (P)"                                           63.31699 63.31699
14 "COMUNIDAD VALENCIANA"                                 64.48045 64.48045
68 "EAST ANGLIA"                                          69.62963 69.62963
27 "EAST FR"                                              59.93266 59.93266
54 "EAST MIDLANDS"                                        70.89948 70.89948
41 "EMILIA-ROMAGNA"                                       62.22223 62.22223
13 "EXTREMADURA"                                          65.83125 65.83125
10 "GALICIA"                                              62.91752 62.91752
 7 "HESSEN"                                               65.68987 65.68987
30 "ILE DE FRANCE"                                        58.97898 58.97898
38 "LAZIO"                                                64.07407 64.07407
32 "LIGURIA"                                              64.34853 64.34853
45 "LISBOA E VALE DO TEJO"                                63.20115 63.20115
31 "LOMBARDIA"                                            63.11858 63.11858
49 "MADEIRA"                                              60.85627 60.85627
20 "MADRID"                                                64.7702  64.7702
23 "MEDITERREAN FR"                                       59.91285 59.91285
16 "MURCIA"                                               61.51111 61.51111
63 "NAVARRA - RIOJA"                                      68.47912 68.47912
 8 "NIEDERSACHSEN"                                        67.83459 67.83459
36 "NOORD NEDERLAND - GRONINGEN"                                60       60
 5 "NORDRHEIN-WESTFALEN"                                  67.86022 67.86022
43 "NORTE"                                                59.54501 59.54501
26 "NORTH FR"                                             58.55181 58.55181
66 "NORTH UK"                                             63.74269 63.74269
67 "NORTH WEST UK"                                        65.56403 65.56403
51 "NORTHERN IRELAND"                                     71.61314 71.61314
34 "OOST NEDERLAND"                                       53.69176 53.69176
11 "PAIS VASCO"                                           63.40095 63.40095
29 "PARIS BASIN EAST/WEST"                                57.61317 57.61317
57 "PIEMONTE - VALLLE D'AOSTA"                            60.84291 60.84291
39 "PUGLIA"                                               65.39683 65.39683
 2 "REG.BRUXELLES-CAP./BR"                                64.07563 64.07563
 3 "REGION WALLONNE"                                      63.26399 63.26399
65 "RHEINLAND-PFALZ SAARL"                                65.95441 65.95441
 6 "SCHLESWIG-HOLSTEIN"                                    66.2458  66.2458
50 "SCOTLAND"                                             67.74318 67.74318
61 "SICILIA - SARDEGNA"                                   59.67658 59.67658
25 "SOUTH EAST FR"                                        56.64845 56.64845
69 "SOUTH EAST UK"                                         67.5137  67.5137
24 "SOUTH WEST FR"                                        56.08466 56.08466
53 "SOUTH WEST UK"                                        65.85366 65.85366
40 "TOSCANA"                                              49.69136 49.69136
58 "TRENTINO ALTO ADIGE - VENETO - FRIULI VENEZIA GIULIA"   65.722   65.722
59 "UMBRIA - MARCHE"                                      55.99647 55.99647
 1 "VLAAMS GEWEST"                                        60.17209 60.17209
52 "WALES"                                                66.30566 66.30566
28 "WEST FR"                                              58.45411 58.45411
56 "WEST MIDLANDS"                                        60.98766 60.98766
35 "WEST NEDERLAND"                                       58.64074 58.64074
55 "YORKSHIRE AND HUMBERS"                                68.10857 68.10857
37 "ZUID NEDERLAND"                                       57.18254 57.18254
end
Any advice?
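
One observation, with a hedged fix: two regions above share the same trust value (BREMEN HAMBURG and TOSCANA, both 35.52631), so for that pair vi - vj is zero and 1/sqrt(0) returns missing, which matches the error. A guarded version of the function (returning a 0 weight at zero distance is only one possible choice) might look like:

Code:
mata:
function SinvD2(vi, vj)
{
    d = sqrt((vi-vj)*(vi-vj)')
    return (d == 0 ? 0 : 1/d)
}
end
spmatrix userdefined Wtrust = SinvD2(trust trust2)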

Thanks in advance.

Export browse table to Excel

Hi there,

is there a command that exports the whole dataset currently in memory to Excel? Practically, I'm looking for something that will create a table identical to what I see in the Browse window. I know I can just copy it, but I want to do it as part of a program for multiple datasets. Plus, these datasets might change, and I want a command to overwrite the Excel file.
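
A minimal sketch (the file name is hypothetical): export excel writes the dataset in memory to a spreadsheet, and replace overwrites it on later runs.

Code:
export excel using "mydata.xlsx", firstrow(variables) replace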

Thanks in advance,
Ben

Rolling regression or loops - missing values

Dear All

I have a peculiar issue I wish to share with you, in the hope of receiving some advice.
I have the following panel dataset (unbalanced), with four variables:

Industry-code Year Industry-sale Number of firms in industry
12 2001 34014 5
12 2002 35402 4
12 2003 29473 5
12 2004 . 5
12 2005 29044 7
12 2006 31024 7
12 2007 32209 10
12 2008 33218 9
13 2004 5162 5
13 2005 .
13 2006 5234 6
… … … …

I have to run this regression:

Industry-sale = a + year + error

Specifically, I want this regression to be run for each year, based on data from the previous 5 years. In other words: I want to create a rolling regression for each industry-year, in the following way and under the following conditions:
  • within each industry, for each year calculate the regression: industry-sale = a + year + error
Note: The calculation must be based on the information from the previous 5 years. For example, for the year 2008, the regression is based on data from years 2003, 2004, 2005, 2006, 2007. The window of 5 years is fixed and does not change, as the focal year of the regression moves ahead (i.e., 2002, 2003, etc.)
  • condition A: if in any of the previous 5 years there is a missing value in either the depvar or indepvar, no estimate is given; that is, Stata should return just a missing value;
  • condition B: if in any of the previous 5 years the number of firms in the industry is below 5 (i.e., 4 or fewer), then no estimate is given; that is, Stata should return just a missing value.
Once these regressions are calculated for the whole dataset, I wish the standard error of the coefficient “year” to be stored in the dataset. I am aware of the command rolling. For example, the command:

rolling regress_SE = _se[year], window(5): regress industry-sale year

does this job. However, the problem is that this command does not account for missing values. That is, if there is a missing value, it will calculate the regression over 4 years, whereas I wish Stata not to calculate/store that estimate (condition A above); and if the number of firms is below 5, it will still run the regression, whereas I wish it not to (condition B above).

Can anyone help me?
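
A hedged sketch with rangestat (from SSC); industry, year, sale, and nfirms are stand-in names for the four variables above.

Code:
rangestat (reg) sale year (min) nfirms, interval(year -5 -1) by(industry)
* condition A: all 5 prior years present with non-missing sale -> reg_nobs == 5
* condition B: at least 5 firms in every one of those years -> nfirms_min >= 5
gen se_year_ok = se_year if reg_nobs == 5 & nfirms_min >= 5
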
Thanks





stripplot specifications

Dear colleagues

I have a variable named TotalCompensation1 and another called AffiliationIndex1 (from which I created a variable named AffliationIndex1q using xtile to generate 4 quartiles), and I want to plot the trend between the two using the command stripplot.

I ran the following code:

Code:
stripplot TotalCompensation1, over(AffliationIndex1q)  cumul cumprob box centre vertical refline  yla( , ang(h)) xla(, noticks) xsc(titlegap(*5))
and generated this:


[graph attached in the original post]


Although the trend is somewhat clear (if one looks at the grey mean lines), I wonder if any of you have an idea about changing the options of stripplot in order to zoom in and clearly show the trend, maybe by controlling the height so the max is 50k. I tried something like this:

Code:
stripplot TotalCompensation1, over(AffliationIndex1q)  cumul cumprob box centre vertical refline  yla(0 10000 20000 30000 40000 50000 , ang(h)) xla(, noticks) xsc(titlegap(*5))
but it did not work. I also tried the ceiling and floor options (I guess I did not understand how they work).
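
Two hedged alternatives, as a sketch (axis labels alone cannot clip the data): restrict the sample to zoom, or compress the tail with a log scale.

Code:
* option 1: plot only observations up to 50k
stripplot TotalCompensation1 if TotalCompensation1 <= 50000, ///
    over(AffliationIndex1q) cumul cumprob box centre vertical refline ///
    yla(0(10000)50000, ang(h)) xla(, noticks) xsc(titlegap(*5))
* option 2: keep every point but compress the tail
stripplot TotalCompensation1, over(AffliationIndex1q) box centre vertical ///
    ysc(log) yla(, ang(h))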

Any Idea?

Thanks

graph pie with only some percentages

How do I display the percentage only when it is above 20%? I know how to display all percentages:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte employeeNumFull
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
4
1
1
4
1
1
1
1
4
4
4
1
1
4
1
1
1
1
4
4
4
4
end
graph pie, over(employeeNumFull) plabel(_all percent) title("employeeNumFull") legend(ring(0) pos(3) col(1)) pie(3, explode)
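
A minimal sketch: plabel() also accepts individual slice numbers and may be repeated, so one can label just the slices known to exceed 20% (slices 2 and 3 here are purely illustrative).

Code:
graph pie, over(employeeNumFull) plabel(2 percent) plabel(3 percent) ///
    title("employeeNumFull") legend(ring(0) pos(3) col(1)) pie(3, explode)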

Oaxaca decomposition for nonlinear regression (nldecompose)

Hello readers,
I am trying to decompose (using Ben Jann’s nldecompose) differences in the probability of having more liberal gender views into differences in characteristics (cohort replacement) and differences in coefficients (intra-cohort change).

The commands and output are as follows:

Code:
. nldecompose, by(Period): logit ideology1 Age [pweight=weight]

                                                   Number of obs (A) =    1872
                                                   Number of obs (B) =    1403

------------------------------------------------------------------------------
      Results |      Coef.  Percentage
--------------+---------------------------------------------------------------
 Omega = 1    |
         Char |  -.0078933   7.860032%
         Coef |  -.0925304   92.13997%
--------------+---------------------------------------------------------------
 Omega = 0    |
         Char |  -.0064641   6.436787%
         Coef |  -.0939597   93.56321%
--------------+---------------------------------------------------------------
          Raw |  -.1004237        100%
------------------------------------------------------------------------------

.
I have read the Stata manual, but I have a few questions if anyone could kindly clarify:

1. What is the difference, or the justification, for using twofold over threefold? I understand the difference between them but not when to use each one (i.e. in twofold you are saying that if the observed variables had the same effect in each period, they would explain x% of the observed disparity in gender views; in threefold you add the disparity in the returns to these observed variables).

2. In a number of academic papers I have read which use nldecompose, the authors have decomposed change into:
  • Differences attributable to observable characteristics (“Char”)
  • Differences not attributable to observable characteristics (“Coef)
  • Total difference
I can see the first two parts in my output (char and coef) but how does one get the “total difference”?

For example in:

Arndt, B.J., 2017. Explaining Income-Related Disparities in Pap Smear Utilization: A regression-based decomposition analysis of differences in Pap smear utilization following implementation of the Affordable Care Act (No. 1694-2017-5829). – TABLE A3

Kelly, E., McGuinness, S., O’connell, P.J., Haugh, D. and Pandiella, A.G., 2014. Transitions in and out of unemployment among young people in the Irish recession. Comparative Economic Studies, 56(4), pp.616-634. – TABLE 7

An example of my data:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(Period ideology1) int Age
0 1 43
0 1 48
1 1 34
1 1 24
0 0 27
1 1 24
0 0 32
1 1 36
1 0 38
1 1 38
0 1 71
0 1 39
0 0 60
0 1 25
0 1 56
0 1 51
0 0 34
0 1 19
0 1 70
0 1 62
0 1 36
1 0 45
1 1 27
1 1 16
0 1 37
1 1 38
0 1 30
1 1 52
0 1 38
1 1 17
0 1 58
0 1 42
0 1 16
0 1 41
0 1 47
0 1 41
0 1 16
0 1 40
0 1 33
0 0 51
0 1 44
0 1 53
0 0 36
0 0 58
0 1 65
0 0 52
0 1 17
0 1 20
0 1 44
0 0 38
0 1 47
0 1 65
0 1 41
0 1 41
0 1 19
0 0 56
0 1 58
0 1 50
0 1 65
0 1 51
0 1 32
0 1 18
0 1 38
0 0 51
0 0 56
0 0 77
0 0 31
0 1 40
0 1 49
0 0 53
0 1 40
0 0 63
0 0 43
0 1 51
0 1 52
0 1 49
0 1 60
0 1 36
0 1 35
0 1 76
0 1 36
0 1 46
0 1 69
0 1 55
0 0 35
0 1 25
0 1 46
0 1 60
0 0 38
0 0 50
0 1 16
0 1 62
0 1 18
0 0 35
0 1 31
0 0 73
0 1 68
0 1 50
0 1 60
0 1 33
end
label values Period per
label def per 0 "1999-2004", modify
label def per 1 "2005-2009", modify
label values ideology1 edc
label def edc 0 "Agree", modify
label def edc 1 "Disagree", modify
label values Age X003


Generate

I am using two data sets: one contains data for children and the other for women. Both data sets have key variables for household id (HH) and person id (ln). Each id is unique; for example, if an id has been given to a woman in a household, it won't be assigned to any child in the children data set, and vice versa. I have merged these two data sets, so the merged data set contains only women and children. The problem I am facing concerns the children's primary caretaker variable. Each child is assigned a primary caretaker from the household the child lives in, and that caretaker can be male or female. How can I tell how many children have a female caretaker? Is there a command I can use to do something like
bys HH: gen caretaker = "female" if the caretaker's id equals any value of the female person id variable within the household?
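
A hedged sketch of one way to flag such children; the file names and the caretaker-id variable name (caretaker_ln) are hypothetical.

Code:
* build a lookup of female person ids, keyed by household and person id
use women, clear
keep HH ln
rename ln caretaker_ln
tempfile females
save `females'
* a child's caretaker is female exactly when the caretaker id matches
use children, clear
merge m:1 HH caretaker_ln using `females', keep(master match)
gen byte female_caretaker = (_merge == 3)
count if female_caretaker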

regression result missing for the first variable

Hello

When I regress as below, why don't I get a result for the first variable, and how can I fix this?
It doesn't even say that it is because of an omitted variable...

Year fixed effects

Hello everyone,
I am running a panel data regression model in which I observe 4 time periods and 66 observations, using fixed effects, and I am currently including i.year to control for year fixed effects. However, when I run the regression without i.year, the results change substantially, and coefficients that were statistically insignificant become significant. I ran testparm i.year to see whether the year dummies are needed, and the result said yes. I was just wondering whether there is any reason not to include them, or if it is always wise to include i.year.
Many thanks
Pepito

survey data help: dropping observations

Hello,

I am new to Stata and would appreciate your help. I have ten crops under var2. I want to drop 8 of the crops and analyse only two of them, making sure that I still keep the other variables, like household id and sex, corresponding to the two crops of interest. I can't seem to find the right drop command. The crops are listed by name, e.g. rice, sugar cane, etc.
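
A minimal sketch, assuming var2 is a string variable holding the crop names exactly as listed.

Code:
keep if inlist(var2, "rice", "sugar cane")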

Thanks in advance for your help

GMM: Multiple lagged moment conditions


Hello everyone! I am a complete newbie to Stata and am thus really grateful for any help. I am currently trying to implement a highly non-linear model in Stata using GMM. I am struggling to understand how to

1) Incorporate lagged moment conditions
2) Implement this highly non-linear regression.

My moment conditions should be:
E[price_s * error_t] = 0 for s < t, t = 2,3,...,T

The form of the error term is highly non-linear, i.e.
error_t = demand_t - ((1 - {theta}*z_t)*({alpha} + {beta}*controlvar + {delta}*AvPrice_t) + {theta}*z_(t-1)*({alpha} + {beta}*controlvar + {delta}*AvPrice_t))

My biggest issue right now is the fact that I only have a single moment condition and do not know how to incorporate the lagged condition mentioned above. I have looked at other non-linear GMM specifications, but failed to understand how they obtain multiple moment conditions. I know that my lack of understanding runs deep and would appreciate either some help on my question or someone pointing me to a resource that I can use.
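
A hedged sketch of how the extra conditions enter Stata's gmm command: each variable listed in instruments() is interacted with the residual, so lags of price supply the E[price_s * error_t] = 0 conditions. The lag depth and the time variable t are illustrative, and starting values may need from().

Code:
tsset t
gmm (demand - (1 - {theta}*z)*({alpha} + {beta}*controlvar + {delta}*AvPrice) ///
    - {theta}*L.z*({alpha} + {beta}*controlvar + {delta}*AvPrice)), ///
    instruments(L.price L2.price controlvar) twostep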

Thank you very much in advance! (and if there is any other information you need, I am sorry for not providing it and will try to do so asap).

Controlling for averages of lags

Dear Stata community,

I'm running time-series regressions using data on 100 countries over 22 years to estimate the effect of financial openness on economic development. However, I'm running into an endogeneity problem: openness may lead to higher growth, but countries that expect to grow a lot may tend to open up their economies. To deal with this, I want to control for recent growth trends; specifically, the average growth rate over the past three years. I'm not sure if there is an easy way to add this to the regression or whether I have to create new variables. Does anyone have an idea how to do this?

Thank you in advance for your time.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str42 country int year float loggdppc double(ka gdppcg)
"Algeria" 1995 9.137277                 1 1.853423792861392
"Algeria" 1996 9.160067               .75 2.305087244845765
"Algeria" 1997 9.154971                 1 -.508245295794012
"Algeria" 1998 9.189805               .75 3.544786287223985
"Algeria" 1999 9.207232               .75 1.758011109300668
"Algeria" 2000 9.231221              .875 2.427875700928467
"Algeria" 2001 9.247847              .875 1.676475843908534
"Algeria" 2002 9.289751 .9285714285714286 4.279453026560546
"Algeria" 2003 9.346609 .9285714285714286 5.850520833432341
"Algeria" 2004 9.375616 .9285714285714286 2.943232296389624
end

rangestat/runby

Dear All, I wonder if the following question can be solved by (ssc install) rangestat/runby? The data set is:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long stkcd int year double sale
2 2008 40991779215
2 2009 48881013143
2 2010 50713851443
2 2011 71782749801
2 2012 1.03116e+11
2 2013 1.35419e+11
2 2014 1.46388e+11
2 2015 1.95549e+11
2 2016 2.40477e+11
2 2017 2.42897e+11
4 2008 43314824.87
4 2009 60080600.35
4 2010 131331494.7
4 2011 74503718.53
4 2012 97363301.61
4 2013 72784567.16
4 2014    80608820
4 2015 120454422.5
4 2016 287670026.6
4 2017   138605842
5 2008 108720264.8
5 2009 43192476.48
5 2010 137234232.9
5 2011 63534839.01
5 2012 99421227.75
5 2013 52061702.69
5 2014 52776994.29
5 2015 84125970.69
5 2016 481863433.3
5 2017 530922230
For each stkcd, I'd like to perform a rolling regression using the most recent 5 years of data (including the current year), regressing sale on a constant and a trend term (1,2,3,4,5). Then collect the five residuals and calculate their standard deviation. Any suggestions? Thanks.
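
A hedged sketch with rangestat (from SSC): regress sale on year over the current and previous four years (a trend of 1-5 gives identical residuals), then recover the residual SD from R-squared, since RSS = (1 - R2)*TSS.

Code:
rangestat (reg) sale year (sd) sale, interval(year -4 0) by(stkcd)
* SD of residuals (n-1 denominator) = sd(sale)*sqrt(1 - R2)
gen sd_resid = sale_sd*sqrt(1 - reg_r2) if reg_nobs == 5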

In the loop

The loop below doesn't work with $, which recalls a global. Why? Thank you.

foreach i of num 1/17 {
global Pi2016_`i'=$Ng`i'/$NgT
}
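
A minimal sketch of the likely fix: in $Ng`i', Stata resolves the global Ng before appending the value of `i', so wrap the composite name in braces.

Code:
foreach i of numlist 1/17 {
    global Pi2016_`i' = ${Ng`i'}/$NgT
}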

Saturday, March 30, 2019

Execute R packages within Stata?

Dear All,
  1. Suppose that (for personal preference) I'd like to execute some R commands/packages (say, PanelMatch) from https://github.com/insongkim/PanelMatch within Stata.
  2. Can anyone provide a sample do-file for this purpose? I checked out the threads and found that rcall and rsource are probably the way to this end, but I have no clue how to do that (see the sketch after this list). Thus, your help is needed and highly appreciated.
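
A hedged sketch with rcall (by E. F. Haghish; installation instructions are on its GitHub page). As far as I know, st.data() is rcall's helper for copying the dataset in memory into R; treat the details as assumptions to verify against rcall's help file.

Code:
rcall: library(PanelMatch)
* copy the Stata dataset into an R data frame named df
rcall: df <- st.data()
rcall: print(head(df))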

Covariate measurement error with binary outcomes

While it is possible to get parameter estimates corrected for measurement error using -sem- like:

Code:
use http://www.stata-press.com/data/r15/sem_rel.dta, clear
sem (y <- x1)
sem (x1 <- X)(y <- X), reliability(x1 0.5)
It isn't possible to use the reliability option with the -gsem- command. Additionally, the command -eivreg- is known to incorrectly estimate the standard errors for the parameter(s) of interest (see Lockwood, J. R., McCaffrey, D. F., & Savage, C. (2017). Errors-in-variables: Why Stata's -eivreg- is wrong and what to do instead. Presented at 2017 Society for Research on Effective Education Conference. Retrieved from: https://www.sree.org/conferences/201...pdf&item=slide):

Code:
eivreg y x1, r(x1 0.5)
Additionally, the reliability option generalizes to the multilevel context as long as the model specified uses an identity link and gaussian family:

Code:
g byte school = rbinomial(4, 0.5)
// This model will fail to converge nearly immediately, but at least illustrates that it is possible to deal with covariate measurement error in the context of mixed effects
gsem (x1 <- X)(y <- X RE[school]), reliability(x1 0.65) link(identity) family(gaussian) difficult
So, I am wondering if anyone is aware of any ways to deal with covariate measurement error when the outcome is binary and the model includes random intercepts and a random coefficient? If context helps, we are trying to estimate the probability of students being retained in grade conditional on test scores (which are measured with a non-trivial amount of error), demographics, school-level random intercepts, and prior schooling indicators (random coefficients). I imagine it might be possible to deal with this in a Bayesian framework (as well as being able to specify priors that could keep estimates away from the boundaries), but don't have much experience with Bayesian modeling.

Predicting values which incorporate fixed and random effects in linear mixed model

Hi all,

I am using Stata/SE 14.2 to analyse a randomized controlled trial. Respondents filled in the Beck Depression Inventory (bdi) at times 0, 1 and 2. To control for baseline values of the outcome (i.e. to control for regression to the mean), I am building the model without the treatment variable and using id: as the second level (following Twisk 2018). The data are in long format.

mixed bdi i.time i.intervention#i.time || id:, var
The interaction intervention*time gives me the mean difference in bdi at time 1 and 2. But now I would like to obtain the predicted means of bdi at times 0, 1 and 2 to make a graph for the publication. For this, I am using predict with the option fitted to obtain means that incorporate both fixed and random effects.

predict predri, fitted
mean predri if time==0 & intervention==1
mean predri if time==0 & intervention==0
mean predri if time==1 & intervention==1
mean predri if time==1 & intervention==0
mean predri if time==2 & intervention==1
mean predri if time==2 & intervention==0
However, I obtain exactly the same means as the unadjusted ones using tabstat. In the data, original and predicted values are different, but when I extract the means, these are the same.

Using margins has the same problem (as only fixed effects are incorporated)

margins i.time#i.intervention
I am wondering what is wrong and how I could calculate predictive means that incorporate both fixed and random effects in order to produce a graph.

Thanks in advance for your time

Reference:
Twisk J, Bosman L, Hoekstra T, Rijnhart J, Welten M, Heymans M. Different ways to estimate treatment effects in randomised controlled trials. Contemporary clinical trials communications 2018; 10: 80-5.

Help Generating a Feeling Thermometer

Hi all,

I have some experience with Stata but I am by no means an expert. I have a set of eight variables (from a larger dataset that I did not code), five of which are from one survey question asking people to assess different aspects of the US government and its response to Hurricane Maria. The responses to these five variables are separated into five categories from excellent (coded 1) to poor (coded 5). The other three are formatted differently but still ask about the government's response: two have the response options Better, Worse, or About the same; the other is dichotomous.

What I was hoping I would be able to figure out on my own was how to create a single feeling thermometer using these responses as one is not included in this dataset. Ideally the thermometer would descend from the top score for all eight variables where an individual had responded positively to all the questions. For the second category it would include the positive responses from the three differently formatted questions, the middle category would include the ambivalent response option only etc. Those that have answered in varied ways so as not to lose them would then be in between their associated categories. I realize this would not be a feeling thermometer as done traditionally but more of like an assigned score that is being curved where someone scoring 1's across the board have the most affinity for the government working its way down.

Now, I know I could use egen to create a variable from the row totals or its other options, but unless I'm mistaken there is no way to create a thermometer from this. I was also unable to find a post about this in the forum. I could compare people who answered one way across all the questions, but that would require me to reduce the variability in the responses to the first five categories, which I don't want to do. I am trying to make a scaled set of responses in order to later use a linear regression model to predict the likelihood of a statehood vote in a future referendum (yes, I am aware of the proposed bill from US Congress just the other day; times change but submitted abstracts are forever).

So using the dataex command for these variables I get the following output. I am unfortunately unsure where to start, as I have realized that this is over my head and outside of what I have been taught. If this is far too difficult to answer, please let me know so I can adjust my plans accordingly.

Variables q16a-e are the first five mentioned above, q17 is the dichotomous variable, and q18 and q19 are the three-option questions. The count here is only ten since the variable coding is so long, but I can always provide more observations. I apologize in advance if I haven't articulated my question very well.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(q16a q16b q16c q16d q16e q17 q18 q19)
6 6 3 3 4 2 3 1
5 5 5 5 5 1 3 3
4 4 4 4 5 2 2 1
4 5 5 4 5 2 2 1
2 4 2 3 2 1 2 1
4 4 4 3 5 2 2 1
3 3 3 3 5 1 2 1
1 1 1 1 4 2 3 1
5 5 5 5 5 1 2 1
1 4 3 4 4 2 2 1
end
label values q16a q16a
label def q16a 1 "Excellent", modify
label def q16a 2 "Very good", modify
label def q16a 3 "Good", modify
label def q16a 4 "Fair", modify
label def q16a 5 "Poor", modify
label def q16a 6 "Don't know", modify
label values q16b q16b
label def q16b 1 "Excellent", modify
label def q16b 3 "Good", modify
label def q16b 4 "Fair", modify
label def q16b 5 "Poor", modify
label def q16b 6 "Don't know", modify
label values q16c q16c
label def q16c 1 "Excellent", modify
label def q16c 2 "Very good", modify
label def q16c 3 "Good", modify
label def q16c 4 "Fair", modify
label def q16c 5 "Poor", modify
label values q16d q16d
label def q16d 1 "Excellent", modify
label def q16d 3 "Good", modify
label def q16d 4 "Fair", modify
label def q16d 5 "Poor", modify
label values q16e q16e
label def q16e 2 "Very good", modify
label def q16e 4 "Fair", modify
label def q16e 5 "Poor", modify
label values q17 q17
label def q17 1 "Priority", modify
label def q17 2 "Not a priority", modify
label values q18 q18
label def q18 2 "Worse", modify
label def q18 3 "About the same", modify
label values q19 q19
label def q19 1 "Better", modify
label def q19 3 "About the same", modify
label var q16a "16a. How would you rate the job The federal government has done in responding to" 
label var q16b "16b. How would you rate the job President Trump has done in responding to Hurric" 
label var q16c "16c. How would you rate the job The Puerto Rican government has done in respondi" 
label var q16d "16d. How would you rate the job Governor Rossello has done in responding to Hurr" 
label var q16e "16e. How would you rate the job Your municipal government and mayor has done in " 
label var q17 "17. Do you think the rebuilding of Puerto Rico is a priority for the U.S. federa" 
label var q18 "18. In your opinion, was the federal government's response to Hurricane Maria in" 
label var q19 "19. In your opinion, do you think the federal government's response to Hurricane"
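
One possible starting point, sketched under loud assumptions: treat "Don't know" (code 6) as missing, rescale each item to 0-1 with 1 as the most positive answer, and stretch the row mean to a 0-100 thermometer. The mappings for q17-q19 are guesses from the value labels shown above and should be checked against the codebook.
Code:
* rescale q16a-q16e (1 = Excellent ... 5 = Poor, 6 = Don't know)
recode q16a-q16e (6 = .), prefix(r)
foreach v of varlist rq16a-rq16e {
    replace `v' = (5 - `v')/4                    // 1 Excellent -> 1, 5 Poor -> 0
}
gen s17 = 2 - q17 if !missing(q17)               // assumed: 1 Priority -> 1
gen s18 = cond(q18 == 1, 1, cond(q18 == 3, .5, 0)) if !missing(q18)
gen s19 = cond(q19 == 1, 1, cond(q19 == 3, .5, 0)) if !missing(q19)
egen therm = rowmean(rq16a-rq16e s17 s18 s19)    // ignores missing items
replace therm = 100*therm                        // 0 (least) to 100 (most positive)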


Is there any way to absorb time dummies in xtreg, fe?

Hello

I want to do

Code:
xtreg depvar indepvar i.day, fe vce(robust)

using panel data (day is the time variable).

But the data are too big, so I cannot use the code above.

Is there any way to absorb time dummies in - xtreg, fe - like - areg, absorb() - ?

I tried reghdfe, but I cannot use it because the coefficients and t-values are too different compared with xtreg.
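
For reference, a minimal hedged sketch of the two usual routes (panelid stands in for the poster's panel identifier; reghdfe is from SSC). Note that reghdfe drops singleton groups by default, which is a common reason its coefficients and t-values differ from xtreg, fe:
Code:
areg depvar indepvar i.day, absorb(panelid) vce(cluster panelid)
reghdfe depvar indepvar, absorb(panelid day) vce(cluster panelid)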


Thank you for reading.

Looping using forvalue in winsor2

I am using the following command in Stata 14.2, with data stacked by industry-year. There are 357 industry-year groups; for groups 1 to 271 I used the following syntax:

Code:
forval i = 1/271 {
    capture {
        winsor cfo_sc if group == `i', gen(work) h(1)
        replace cfo_sc_w = work if group == `i'
        drop work
    }
}

This worked fine for winsorizing the first 271 groups, but when I use the following syntax for the remaining groups, I get winsorized values for group 272 only; for the other groups I get blank cells. Furthermore, the temporary variable cfo_sc_t that is created does not get dropped:

Code:
forval i = 272/357 {
    capture {
        winsor2 cfo_sc if group == `i', suffix(_t) cuts(1,99)
        replace cfo_sc_s = cfo_sc_t if group == `i'
        drop cfo_sc_t
    }
}

Kindly guide what's wrong in the syntax.
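
A likely culprit, offered as a hedge: winsor2 (from SSC) expects a space-separated numlist, so cuts(1,99) with a comma raises a syntax error that the capture block silently swallows; nothing is generated, the replace has nothing to copy, and cfo_sc_t never exists to be dropped. winsor2 also offers a by() option, so the whole loop could collapse into one call, sketched below:
Code:
winsor2 cfo_sc, suffix(_w) cuts(1 99) by(group)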

Thank you.

Full calculation of R^2 when using ivreghdfe and reghdfe

In ivreghdfe and reghdfe, it is fast to just absorb the fixed effects, but if I add them to absorb() they are not counted in the R^2. Only when I include the fixed effects as i.FE are they included in the R^2. So when I just absorb them, the R^2 appears very small when it is actually large.

Is there any way to put the fixed effects into absorb(), so that estimation stays fast, while at the same time calculating an R^2 that includes those fixed effects?
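
As a hedged pointer (assuming reghdfe from SSC and placeholder variable names): reghdfe stores several R-squared measures in e(), so it may be worth checking directly which one the displayed output corresponds to:
Code:
reghdfe y x1 x2, absorb(firm year)
display e(r2)            // overall R-squared, absorbed FEs included
display e(r2_within)     // within R-squared, absorbed FEs swept out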

Help with nonstationarity and Cointegration (Phillips-Perron and Pedroni)

Dear All,

I am looking for some help as I am quite new to this. Any help or advice would be greatly appreciated. I thank you in advance for taking the time to read this post.



I am currently working with a panel set of global bilateral FDI data from UNCTAD. It covers 12 years, which I know is not a long time frame, but it is all that is available through UNCTAD.

I am attempting to run a standard gravity equation, with LogFDI as my dependent variable. I have a number of institutional variables whose effects on FDI I am testing. It has come to my attention that nonstationarity and cointegration could pose a problem for my estimates, so it was suggested that I undergo unit-root tests in the form of Phillips-Perron, and cointegration tests on the nonstationary variables with Pedroni.

This is where the problem begins. All my variables work fine, with the exception of my dependent variable, LogFDI. Whenever I attempt to run the unit-root test, I receive about a thousand errors such as "could not compute test for panel 3, could not compute test for panel 5, could not compute test for panel 7," etc.

I am assuming this has to do with the large number of missing values in bilateral FDI data; apparently this is just the nature of this type of data. However, when I compare a panel for which the test cannot be computed with one for which it can, I see no difference between the panels. Sometimes both panels are missing data, and the test works for one but not the other.
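
One way to investigate this, sketched under the assumption that the panel identifier is called panelid: count the usable observations per panel, since xtunitroot needs enough (consecutive) periods within each panel to compute the test.
Code:
bysort panelid: egen Tobs = count(ln_FDIInflows_Host)
tabulate Tobs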

I am running the command:

xtunitroot fisher ln_FDIInflows_Host, pperron lags(1)

and I get the following results:

Fisher-type unit-root test for ln_FDIInflows_Host
Based on Phillips-Perron tests
------------------------------------------------------------------------------
Ho: All panels contain unit roots           Number of panels       =   5167
Ha: At least one panel is stationary        Avg. number of periods =   6.23

AR parameter: Panel-specific                Asymptotics: T -> Infinity
Panel means:  Included
Time trend:   Not included
Newey-West lags: 1 lag
------------------------------------------------------------------------------
                                     Statistic      p-value
------------------------------------------------------------------------------
 Inverse chi-squared(7082)    P       1.99e+04       0.0000
 Inverse normal               Z       -38.8801       0.0000
 Inverse logit t(14104)       L*      -73.1777       0.0000
 Modified inv. chi-squared    Pm      107.6729       0.0000
------------------------------------------------------------------------------
 P statistic requires number of panels to be finite.
 Other statistics are suitable for finite or infinite number of panels.
------------------------------------------------------------------------------




According to the literature I have been reading, this is not really what I should expect. This also leads me to my next problem of setting up the Pedroni test. The assumption is that FDI should be nonstationary and cointegrated with several of my time-varying variables. However, if I ignore the Phillips-Perron results for now and attempt to run a Pedroni test, whenever I try to use LogFDI I receive an error.

I am running the command:

xtcointtest pedroni ln_FDIInflows_Host variable1 variable2 variable3

I receive the following:

_hqrdp_la(): 3200 conformability error
_hqrdp(): - function returned error
_hqrd(): - function returned error
hqrd(): - function returned error
getresd(): - function returned error
<istmt>: - function returned error
r(3200);


Now, it says it is a conformability error, but I fail to see how, as all my data are formatted exactly the same. I am obviously missing something.



Any help would be greatly appreciated.

Thank you for your time,

Wes Josefsson

Export to single Excel file with multiple sheets (based on variables)

Code:
 +---------------------------------------------------------------------------------------------------------+
  | country   year     NGDPRPC     NGDPDPC   NGSD_N~P   PCPIPCH   rank~RPC   rank~DPC   rank_N~P   rank_P~H |
  |---------------------------------------------------------------------------------------------------------|
  |  Greece   2010   20327.829   26972.873      5.664     4.704          2          2          4          1 |
  +---------------------------------------------------------------------------------------------------------+
In my data set, each "main" variable, e.g. NGDPRPC, has an associated rank variable. I would like to export the data into a single Excel file with 4 sheets (there are 4 "major" variables in this example); each sheet will contain country, year, and one "major" variable with its rank variable, i.e. four variables in total. In addition, I would like to name the sheets after the "major" variables they contain.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 country int year double(NGDPRPC NGDPDPC NGSD_NGDP PCPIPCH) float(rank_NGDPRPC rank_NGDPDPC rank_NGSD_NGDP rank_PCPIPCH)
"Greece"  2010 20327.829 26972.873  5.664  4.704 2 2 4 1
"Estonia" 2010  11052.81 14672.328 23.075  2.741 4 4 1 2
"Malta"   2010 15939.801 21150.423 18.954  2.041 3 3 2 4
"Cyprus"  2010 20490.637 31261.236 12.558  2.558 1 1 3 3
"Estonia" 2011 11928.681 17470.842 26.442   5.08 4 4 1 1
"Malta"   2011 16113.851 22933.749 19.254  2.513 3 3 2 4
"Cyprus"  2011  20051.42 32692.698 14.706  3.481 1 1 3 2
"Greece"  2011 18464.601 25896.934  5.099  3.118 2 2 4 3
"Estonia" 2012 12487.099  17431.54 27.143  4.219 4 4 1 1
"Cyprus"  2012 18936.417 29066.268  10.16  3.089 1 1 3 3
"Greece"  2012  17173.71 22171.911  8.972  1.035 2 2 4 4
"Malta"   2012 16450.882  22069.52 19.718  3.227 3 3 2 2
"Estonia" 2013  12774.29 19078.213 27.426  3.247 4 4 1 1
"Greece"  2013  16742.04 21805.257   9.56  -.854 3 3 3 4
"Cyprus"  2013 17732.983  27825.87  8.246   .381 1 1 4 3
"Malta"   2013 17005.342 24021.743 21.781   .979 2 2 2 2
"Cyprus"  2014 17644.755 27267.428  8.022  -.268 2 1 4 3
"Estonia" 2014 13170.071 19969.128 26.937   .476 4 4 1 2
"Greece"  2014 16984.502  21726.89  10.28 -1.394 3 3 3 4
"Malta"   2014 18097.074 26202.597 26.674   .771 1 2 2 1
end
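
A minimal sketch of one way to do this with export excel (the file name ranks.xlsx is an assumption): loop over the "major" variables, keep the relevant columns, and write each subset to its own named sheet. Note that replace would delete the whole workbook, so sheetreplace is used after the first pass.
Code:
local first 1
foreach v in NGDPRPC NGDPDPC NGSD_NGDP PCPIPCH {
    preserve
    keep country year `v' rank_`v'
    if `first' {
        export excel using "ranks.xlsx", sheet("`v'") firstrow(variables) replace
    }
    else {
        export excel using "ranks.xlsx", sheet("`v'") firstrow(variables) sheetreplace
    }
    restore
    local first 0
}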

How to use nonlinear least squares regression for a vector function?

I am trying to fit the function:
Revenue_i(time) = A0/(1 + exp(-A1*(time - A2))) + B*Costs_i

I have 5,000 observations of Revenue and Costs for 8 time periods (time = 0,1,...7).

A0, A1, A2, and B are constants that I am trying to fit.

I am unsure of how to fit this function in Stata using the nl command.

Does anybody have suggestions?
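
A minimal sketch of the nl syntax for this curve (the starting values written inside the braces are assumptions and will likely need tuning to the scale of Revenue):
Code:
nl (Revenue = {A0=1}/(1 + exp(-{A1=0.5}*(time - {A2=3}))) + {B=1}*Costs), nolog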

Indicator for more than one marital change

Dear Stata Users,

I have a panel dataset and I would like to identify marital changes (variable marital).
Suppose that 1 represents single, 2 marriage, and 4 divorce. I would like to create a single indicator
showing that individual 1002352* had a transition from single to marriage AND then to divorce.
So far I used the lag operator, because I was interested in just one transition and not in the subsequent one.

But I am not so sure on how to add the second change, Do you have any suggestion?

ID         t   marital
1002352*   1   1
1002352*   2   2
1002352*   3   4
1002352*   5   4
1002352*   6   4
1002352*   7   4
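
A minimal sketch of one approach: flag every change of status within a person, cumulate the flags, and mark people with two or more transitions (the specific single-to-married-to-divorced path can then be checked against the observed states):
Code:
bysort ID (t): gen byte change = marital != marital[_n-1] & _n > 1
bysort ID (t): gen ntrans = sum(change)
bysort ID (t): gen byte two_changes = ntrans[_N] >= 2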

Thank you, Lydia

Any way to recover results after a crash?

Hi,

This might sound silly as I had no idea about the -preserve- command until I started looking for a solution.

So I was bootstrapping for an entire day and left my PC for only an hour, only to come back and find that my results had churned out but Stata was not responding. I tried everything I know, but since it's a crash, is there any way to recover the lost results? I can't stand bootstrapping for another entire day. Any help would be very much appreciated.

making loop with the values within the variable

Hello,

I have a variable, G1, which has values from 1 to 17.
Then I want to make a loop as below, but it didn't work out.
Can anyone help me see what the problem is? Thanks in advance.

foreach i in G1 {
sum sw20h if G1==`i'
global Ng`i'=r(sum)
}
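
For reference, a minimal sketch of the usual fix: foreach i in G1 loops over the literal token "G1", not over the values of the variable, so `i' is never a number. levelsof collects the observed values first:
Code:
levelsof G1, local(glevels)
foreach i of local glevels {
    summarize sw20h if G1 == `i'
    global Ng`i' = r(sum)
}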

Splitting a data set into four data sets over all observations but keeping all variables for each split data set

Hi All,
I am using a data set which contains more than 50 variables and 0.36 million observations. I need to split the observations into four different data sets based on one of the variables, named "province". The variable province has one digit for each province, e.g. 1 is used for Punjab and 2 is used for Sindh.
How can I split this data set into four data sets, one for each of the provinces? Thanking you in anticipation.
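
A minimal sketch of one way to do this (the file names are assumptions); note that keeping one data set and using if conditions is often preferable to physically splitting:
Code:
levelsof province, local(provs)
foreach p of local provs {
    preserve
    keep if province == `p'
    save "province_`p'.dta", replace
    restore
}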

Generating categorical dummy variable

Hi all,

I need some help with this.

I would like to generate a new binary variable, EDUCATION, equal to 1 if HIGH SCHOOL (an existing variable) is above its median value and 0 otherwise.
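
A minimal sketch, assuming the existing variable is named highschool:
Code:
summarize highschool, detail
gen byte education = highschool > r(p50) if !missing(highschool)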

Thankful for any help.

Frank

Clustering Standard Errors help

Hi, I want to check whether there is a need to cluster my standard errors. I asked this question before with no response.

I am doing a fixed-effects panel regression, regressing recycling rates on income, population density and method of recycling over 20 quarters for 310 local authorities in England. I expect observations from the same local authority to be related to each other.

Am I correct to use the ANOVA 'loneway' command to check whether I should be clustering my standard errors? Here acode is the code for each local authority.

loneway recycling acode

Do I understand correctly that this tells me the % of the variation in recycling rates that comes from between local authorities?
And that if this is high, there is a high correlation of recycling rates due to local-authority-specific changes?

Thank You

Problem with opening the dataset

Hi Stata community. I am trying to open a .dta dataset which I have been working on for the last four months (and as recently as a few hours ago). Now I cannot open it, and Stata tells me that the file is "not Stata format." Meanwhile, I am quite sure that I have not changed the dataset format (I only use Stata for statistical analysis and no other software). Any idea what might have happened to my dataset file and how to open it? Thanks in advance.

By the way, I was trying to attach the dataset here for anyone who might want to check it out, and it gave me this message: Invalid file data.

Group and label variables (perhaps using -rangejoin-)

Hello,

my following problem seems to me to be pretty complicated. However, I will try to explain it as clearly and structured as possible. I hope you can help me.

My dataset consists of the following variables: fund (identifier for each fund), mdate (year-month), number (number of stocks the fund holds), stock_id (identifier for each stock), and markcap (market capitalization of each stock, telling me basically how much the stock is worth on the stock market).
So if fund A holds, let's say, 5 stocks in January 2005, there will be 5 observations for fund A in January 2005, given that each stock is considered separately within the stock_id column.

My goal: each year in June (the overall period is January 2005 till December 2018), use the median of markcap to split the stocks into two groups, named "small" and "big": if markcap > median, the stock is "big"; if markcap < median, it is "small".

If markcap > median in June(t), the stock should then be labelled "big" from July(t) till June(t+1). In June(t+1) the stocks are divided again into "small"/"big" and then labelled from July(t+1) till June(t+2), and so on.

However, since one stock can be held by multiple funds at each date, there are duplicates of stock_id mdate markcap in my dataset. Therefore, before dividing the stocks into "small" and "big", I need to exclude these duplicates so that every stock is considered only once per mdate. I would either need to mark the duplicates and exclude them when computing the median (but I do not know how to do this), or copy the dataset, delete all duplicates, do the median calculation and labelling, and then merge the datasets back together.

I am sorry that I cannot provide a single piece of code so far, but my Stata knowledge is simply not yet good enough to solve this problem. I guess for the labelling from July(t) till June(t+1) I could use rangejoin, or code that looks something like this (assuming I have already defined a small_big variable indicating to which group each stock_id belongs):

Code:
by stock_id (mdate), sort: replace small_big = small_big[_n-1] if missing(small_big)

And the creation of the small_big variable could perhaps look something like this (assuming I have already created a markcap decile for each stock_id per year in June), even though I am not sure that using deciles is the smartest way:
Code:
label define small_big 0 "small" 1 "big"
gen byte small_big = 1 if inlist(markcap_decile_june, 6, 7, 8, 9, 10)
replace small_big = 0 if inlist(markcap_decile_june, 1, 2, 3, 4, 5)
label values small_big small_big
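
A minimal sketch of the dedup-then-merge route described above (variable names are from the post; the tempfile is an assumption). It computes the June median over each stock counted once, assigns small/big, merges the flag back, and carries it forward until the next June:
Code:
preserve
keep stock_id mdate markcap
duplicates drop stock_id mdate, force
keep if month(dofm(mdate)) == 6
bysort mdate: egen medcap = median(markcap)
gen byte small_big = markcap > medcap
keep stock_id mdate small_big
tempfile junes
save `junes'
restore
merge m:1 stock_id mdate using `junes', keep(master match) nogenerate
* note: June(t) itself carries the new flag here; shift by one month if the
* window must start strictly in July(t)
by stock_id (mdate), sort: replace small_big = small_big[_n-1] if missing(small_big)
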
I would really appreciate any help.
Thank you very much.

Tim Wolf

Appropriateness of Fixed Effects when using a dummy that is only time-variant for some panel units

I am looking to estimate the impact of the presence of a fiscal council on a country's real growth forecasting error. I have run the following fixed effects model:
Code:
xtreg rgfe L.fc L.fri L.rgog L.debt i.pubdate, fe robust
Where rgfe is the real growth forecasting error and L.fc is a lag of a fiscal-council dummy variable, equal to 1 in the presence of a fiscal council and 0 otherwise. I am aware that time-invariant explanatory variables are removed by the transformation involved in fixed-effects estimation. For some countries, the fiscal-council dummy is time-invariant, i.e. it is always equal to either 0 or 1 for all time periods in my sample. For others, it varies over time; for the UK, for example, FC = 1 in all time periods after 2009 but FC = 0 before.

Is it appropriate to use fixed effects estimation for this model if I am primarily concerned with the fiscal council as an explanatory variable?

Sorted bar graphs by year for multiple variables

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 country int year float(rank_NGDPRPC rank_NGDPDPC rank_NGSD_NGDP rank_PCPIPCH)
"Greece"  2010 2 2 4 1
"Estonia" 2010 4 4 1 2
"Cyprus"  2010 1 1 3 3
"Malta"   2010 3 3 2 4
"Greece"  2011 2 2 4 3
"Malta"   2011 3 3 2 4
"Estonia" 2011 4 4 1 1
"Cyprus"  2011 1 1 3 2
"Malta"   2012 3 3 2 2
"Greece"  2012 2 2 4 4
"Cyprus"  2012 1 1 3 3
"Estonia" 2012 4 4 1 1
"Estonia" 2013 4 4 1 1
"Greece"  2013 3 3 3 4
"Malta"   2013 2 2 2 2
"Cyprus"  2013 1 1 4 3
"Cyprus"  2014 2 1 4 3
"Estonia" 2014 4 4 1 2
"Malta"   2014 1 2 2 1
"Greece"  2014 3 3 3 4
end
Q1. I am trying to plot the rank_* variables of countries by year. For example, the following code does the job (partially) for the variable rank_NGDPRPC:
Code:
graph hbar (asis) rank_NGDPRPC, over(country) by(year)
This plot, however, is not sorted. How could I obtain a plot that always puts the country with rank 1 on top, then the country with rank 2, and so on?

Q2. How do I apply the above code to all the rank_* variables using a loop?

Q3. Instead of showing all the plots in one graph, how do I generate individual plots for each year for all the rank_* variables?
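
A minimal sketch addressing all three questions (the graph names are assumptions): sort() inside over() orders the bars by the plotted rank, and nested loops produce one named graph per rank variable and year.
Code:
levelsof year, local(years)
foreach v of varlist rank_* {
    foreach y of local years {
        graph hbar (asis) `v' if year == `y', ///
            over(country, sort(`v')) title("`v', `y'") ///
            name(g`v'_`y', replace)
    }
}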

Problem with specifying a range of year or listing years

Hello,

I am quite new to Stata and am trying to analyse some time-series data, but I am having some issues with getting Stata to understand what I am asking.

I have a variable - "Year" - which I generated from another variable - "Y" - using the command: generate Year = date(Y, "Y"), after(Y)

This gave a list of days since Stata's base date, as expected, but I could only format it to look like Y's original form (a 4-digit year, e.g. 1988) by using the command: format Year %tdcY,

which seemed odd but worked.



Now my problem is that when trying to list the observations, if I use the command list Year if Year>2000, it gives me all years from 1966 to 2015, so I guess Stata still understands 2000 as days since 1 January 1960. Is there a way to change this so that when I list observations from 2000 onward it shows them from the year 2000, not the days?
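
A minimal sketch of one fix: keep the daily date for display, but create a plain numeric year for comparisons (year() extracts the calendar year from a daily date):
Code:
generate Yearnum = year(Year)
list Year if Yearnum > 2000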



Any help would be much appreciated

Many thanks, Archie

Average and Median values by group

Dear All

I will appreciate it if you can help me with the following query.

The following data show firms (ID), year, reform year (SSRCOMPLETEYR) and the variable NTS (ownership proportion).

I want to create variables for the mean and median of NTS with respect to the reform year. For instance, for firms whose reform year (SSRCOMPLETEYR) is 2005, I want the mean and median of NTS across all these firms three years prior (2002), two years prior (2003) and one year prior (2004) to the reform year (2005). I also want the mean and median of NTS for these firms at t+1, t+2 and t+3, where t is the reform year (SSRCOMPLETEYR).

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID float YEAR int SSRCOMPLETEYR float NTS
2 2003 2005 .1583336
2 2004 2005 .14580873
2 2005 2005 .13422763
2 2006 2005 .129922
2 2007 2005 .02426268
2 2008 2005 .026378496
2 2009 2005 .002197306
2 2010 2005 .0017612013
2 2011 2005 .0016758556
2 2012 2005 .001673647
2 2013 2005 .0021632616
2 2014 2005 .0014382304
4 2014 2006 .001205394
6 2003 2006 .37771195
6 2004 2006 .37771195
6 2005 2006 .37771195
6 2006 2006 .2533447
6 2007 2006 .13797276
6 2008 2006 .08924162
6 2009 2006 .02732855
6 2010 2006 .02611745
6 2011 2006 .021394266
6 2012 2006 .01581895
6 2013 2006 .0087063005
6 2014 2006 .008220699
end
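
A minimal sketch of an event-time approach (the new variable names evt, mean_nts and med_nts are assumptions): measure each year relative to the reform year, then take means and medians across firms by reform-year cohort and event time.
Code:
gen evt = YEAR - SSRCOMPLETEYR
bysort SSRCOMPLETEYR evt: egen mean_nts = mean(NTS) if inrange(evt, -3, 3)
bysort SSRCOMPLETEYR evt: egen med_nts = median(NTS) if inrange(evt, -3, 3)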


Many thanks

Yahya Ghazali

Friday, March 29, 2019

Maki's Cointegration

I was wondering if anyone has written an ado-file for Maki's cointegration test:
Maki, Daiki (2012). Tests for cointegration allowing for an unknown number of breaks. Economic Modelling, 29(5), 2011-2015.

How can I select observations containing specific words?

Dear all,

I need to pick out persons A and B, whose positions include the exact word "Independent", from the following list:

Directors   Position
A           Non-executive Independent Director
B           Independent Non-executive Director
C           Non-executive Director
D           Non-executive Chairman of Board
E           Non-Independent Non-executive Director

How can I do that? Because if I use the command

Code:
gen InD=regexm(CurrentPosition,"Independent")
then person E is chosen too.
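
One hedged workaround: regexm has no word-boundary operator, but ustrregexm (Stata 14 and later) does. Note that the hyphen in "Non-Independent" also counts as a word boundary, so that case still needs an explicit exclusion:
Code:
gen InD = ustrregexm(CurrentPosition, "\bIndependent\b") & ///
    !ustrregexm(CurrentPosition, "Non-Independent")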

Thank you

Two limit Tobit with robust and clustered standard errors

Hello, would someone be able to help with how to run a two-limit tobit model with robust, clustered standard errors? I want to cluster on region (I also include region dummies, i.region, as fixed effects). My data are N = ~360,000 individuals nested within 310 regions, with two years of data, 2000 and 2010 (not the same individuals across years). I have a count dependent variable with a range of 0 to 30 (the number of days a person has experienced poor health over the past month), with a lower limit of 0 (comprising about 64% of the observations) and an upper limit of 30 (about 5-6% of observations). Thank you very much for any help.
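
A minimal sketch of one possible specification (the covariate names are placeholders; the rewritten tobit in Stata 15 accepts vce(cluster)):
Code:
tobit days_poorhealth x1 x2 i.year i.region, ll(0) ul(30) vce(cluster region)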

Opposite sign for ivprobit marginal effect compared to ivprobit coefficient, ivreg, probit marginal effect

Data and code download link (Just 91KB): https://drive.google.com/drive/folde...2M?usp=sharing

In this data example I came up with, I get the following results.

    Command I ran              Result
1   reg                        positive & significant
2   ivreg                      positive & significant
3   probit coefficient         positive & significant
4   probit marginal effect     positive & significant
5   ivprobit coefficient       positive & significant
6   ivprobit marginal effect   negative & significant

This is strange on so many levels: #5 and #6 having different signs is strange, and #2 and #6 having different signs is strange.

And if this is possible, in what situations would it occur?

Fixed effects regression drops a variable of interest

Right now I have a standard OLS model with one year of data, and I am interested in the coefficient for charter schools. My dependent variable is test scores. I currently have Charter school = 1 for charter schools and = 0 for public schools. I control for other variables like demographics, funding, etc. I thought I could make things more interesting with panel data, but I am getting pretty stuck.

First off, I am focused on my independent variable for charter schools, and since it is time-invariant, it is dropped by fixed effects. I think this is an issue because it is my most important independent variable. I have tried a few things:

encode SchoolName, gen(SchoolName_2)

xtset SchoolName_2 Year, yearly


Then I tried these different regressions:

xtreg Y_dep X_ind i.Year, re   (I don't think random effects is appropriate; I ran the Hausman test)


xtreg Y_dep X_ind i.Year, fe


xtreg Y_dep X_ind i.Year, fe vce(cluster SchoolName_2) and xtreg Y_dep X_ind i.Year, fe vce(robust)


reg Y_dep X_ind i.Year i.SchoolName_2, vce(cluster SchoolName_2)




I am not sure if running a regular regression and just clustering errors by school ID (SchoolName_2) would be the best route, or if it even leads to anything that makes sense. Is it possible to run a regression using panel data when my binary independent variable, Charter school, is my variable of interest? I have data for 4 years and about 1,200 schools.

Does it make sense to try to find a coefficient for each individual school? I notice a lot of independent variables also become insignificant with my panel data, compared with my single-year OLS where most variables are significant. I am aware there are a few limitations to my model, but I feel like there might be a serious issue with it that I am not grasping.

Thanks in advance for your advice.

Calculating the length

Dear All, I was asked this question. First, the data is
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long code int year double inve3 float(medianinve3 spike)
2 2000   .011212207889467666 .0026942156 1
2 2001   .004652664322401818 .0026942156 0
2 2002  .0063057824872456165 .0026942156 1
2 2003   .003929370970468741 .0026942156 0
2 2004  .0021902371095429227 .0026942156 0
2 2005   .002157089236485528 .0026942156 0
2 2006    .00280404201049823 .0026942156 0
2 2007  .0025765438461223937 .0026942156 0
2 2008  .0018055175281241453 .0026942156 0
2 2009   .005857645680091422 .0026942156 1
2 2010   .001214716774069126 .0026942156 0
2 2011   .000883029842004046 .0026942156 0
2 2012 .00039774813684472204 .0026942156 0
2 2013  .0050904941091668736 .0026942156 0
2 2014  .0036009324302514667 .0026942156 0
2 2015  .0033748017269586757 .0026942156 0
2 2016   .002584389329396051 .0026942156 0
2 2017   .002025560938915098 .0026942156 0
4 2000   .027700292071658875  .022264717 0
4 2001   .035246372446361925  .022264717 0
4 2002   .022780405552736704  .022264717 0
4 2003    .01775822776250611  .022264717 0
4 2006   .009510167325045489  .022264717 0
4 2009   .006243714789496335  .022264717 0
4 2010    .12440163880059046  .022264717 1
4 2011    .03739443633018239  .022264717 0
4 2012    .09511299375377695  .022264717 1
4 2013   .008926191397906697  .022264717 0
4 2014    .03314476897651029  .022264717 0
4 2015   .010371283584058222  .022264717 0
4 2016   .010034764879193006  .022264717 0
4 2017    .02174902999552444  .022264717 0
5 2006  .0006019977669674259  .006625881 0
5 2007  .0008470770404338991  .006625881 0
5 2008   .004311784985972088  .006625881 0
5 2011   .007327270066669626  .006625881 0
5 2012    .01432647805443116  .006625881 1
5 2013   .008980868845271842  .006625881 0
5 2014    .21303528572685007  .006625881 1
5 2016  .0061001127192159945  .006625881 0
5 2017   .006625881166223293  .006625881 0
6 2000   .012802010792224068   .01280201 0
6 2001    .08606658218795088   .01280201 1
6 2002     .0755770433061105   .01280201 1
6 2004     .0030340841144267   .01280201 0
6 2005    .01762134072401256   .01280201 0
6 2006       .25411283184605   .01280201 1
6 2007  .0010882708206108927   .01280201 0
6 2008   .002493299631683462   .01280201 0
6 2009  .0008146183401212614   .01280201 0
end
In particular, code denotes the firm and `inve3' is the investment ratio. In addition, `medianinve3' is the median investment and `spike' (an investment-spike indicator) is defined as
Code:
bys code: egen medianinve3 = median(inve3)
gen spike = (inve3 > 2*medianinve3)
The purpose is to calculate `Spell Length_it', which is our proxy for investment timing and is defined as the number of years since firm i's last investment spike. Any suggestions? Thanks.
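
A minimal sketch of one common construction: carry the year of the most recent spike forward within the firm and subtract. The spell is missing before a firm's first spike and resets to 0 in a spike year; adjust if the definition should exclude the current year.
Code:
bysort code (year): gen lastspike = year if spike == 1
bysort code (year): replace lastspike = lastspike[_n-1] if missing(lastspike)
gen spell_length = year - lastspike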

Coefplot - option rename()

Dear Statalist members,

In fact, I have three questions concerning coefplot (Ben Jann, 2014), but the main one deals with rename().
I plot four models with AMEs for two waves using the following code:

Code:
coefplot (SE1, label(Settler)) (TE1, label(Target earner)) (YL1, label(Young & Learning)) ///
         (FR1, label(Family remigrants)), bylabel(Wave 1) ///
         || (SE2, label(Settler)) (TE2, label(Target earner)) (YL2, label(Young & Learning)) ///
         (FR2, label(Family remigrants)), bylabel(Wave 2) ///
         ||, rename(ident2 = ident1 PPRC2 = PPRC1 LANG2 = LANG1 ACT2 = ACT1) ///
         drop (durStay1 durStay2 age1 age2 *.sex) legend(row(2)) ///
         xline(0) ///
 coeflabels(ident1 = "{bf:Identification with Germany}" PPRC1 = "{bf:Time spent with Germans}" LANG1 = "{bf:Language proficiency}" ///
 , labsize(small)) legend(size(small)rows(2)) xlabel(,labsize(small)) ///
 headings(1.ACT* = `""{bf:Employment status}" "{it:(Ref.: unemployed)}""' ///
          2.motive = `""{bf:Migration motive}" "{it:(Ref.: Family)}""' ///
          1.EG = `""{bf:Ethnic group}" "{it:(Ref.: Poles)}""')
This is my result:

[graph attachment from the original post]

#1:
As ACT* (employment status) is measured in wave 1 and wave 2, there are different variable names in the wave 1 (ACT1) and wave 2 (ACT2) models.
That is why, at first glance, employment status appears as the last coefficient in the second subgraph.
I tried to fix the problem, also for identification (ident*), time spent with Germans (PPRC*) and language proficiency (LANG*), by using rename().
It worked for all coefficients except ACT*. I have no clue why this is the case.
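
A hedged guess at the cause: ACT* enters the models as a factor variable, so the stored coefficient names carry level prefixes (1.ACT2, 2.ACT2, ...), which a pattern like ACT2 = ACT1 does not match. Rewriting the rename() option in the posted call with the full names (the levels used here are assumptions) might work:
Code:
rename(ident2 = ident1 PPRC2 = PPRC1 LANG2 = LANG1 ///
    1.ACT2 = 1.ACT1 2.ACT2 = 2.ACT1)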

#2:
How can I change the font size of the reference categories (the part of the heading that is in italics)?

#3:
How can I change the size of the gaps between the first three coefficients? I know the code is gap, but I don't know where to put it in my code and how to specify the coefficients.

Thank you for any suggestions.


mi estimate not using all observations

Hi all,
I am trying out multiple imputation in STATA and I noticed something wondering if people can give feedback on.
I had N=980, 770 of which had complete data-- for the 210 with missing, I used multiple imputation (tried both mvn and pmm, resulted the same problem) and stata imputed 110 of them.
So, I thought when I run mi estimate, I should see analyses done on close to 880 cases. However, when I run mi estimate I saw number of observations around 800.

So, looks like stata is not using all the imputed data during estimation. I wonder if people know where in the process stata might have dropped cases.
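
Two hedged diagnostics: mi estimate omits any observation with a missing value in a model variable that was not imputed (and imputation itself can fail for some cases), so it can help to inspect completeness after imputation:
Code:
mi describe
mi misstable summarize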

Thanks much in advance!

Lasso for binary outcome

I am using lassopack; do I need to specify anything if my outcome is binary?
Thanks!

Embedded quotes question

Hello,

I'm trying to auto-generate do-files by copying and pasting from datasets created by the do-file this question pertains to.

I want this as a variable value in the dataset in row 8:
Code:
cap mkdir "`projectfolder'/`projectname' QC"
I tried to do this with this line of code:
Code:
replace preface = `"cap mkdir "`projectfolder'/`projectname' QC""'  if _n ==8
However, as you might guess, I instead get this value:
Code:
cap mkdir "/ QC"
How do I fix this?
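
A minimal sketch of one fix: build the string from char() codes so that Stata never sees a live left quote or an embedded double quote (char(34) is the double quote, char(96) is the left single quote):
Code:
replace preface = "cap mkdir " + char(34) + char(96) + "projectfolder'/" ///
    + char(96) + "projectname' QC" + char(34) if _n == 8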

Thank you, brilliant Statalist!

v.14.2

Creating a Regression Table from 2SLS Regressions

Hi everyone!

I would like to create two regression tables from nine 2SLS regressions.

I have run my regressions using the ivreg command. However, when I export my results to a table in a .doc file, the outreg2 command gives only the second-stage coefficients.
How can I create a regression table that contains both first-stage and second-stage coefficients?
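
A minimal sketch of one workaround (the variable names are placeholders): run the first stage explicitly with regress, export it, then append the second stage from ivreg. (ivreg2's first/savefirst options are another route to stored first-stage estimates.)
Code:
regress x z1 z2 controls
outreg2 using results, word replace ctitle(First stage)
ivreg y (x = z1 z2) controls
outreg2 using results, word append ctitle(Second stage)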

I thank you!

Guillaume

Why has Stata dropped observations from my fixed effects model?

Hi,

I have a dataset of 318 local authorities and am running a fixed-effects regression.
However, when I run the regression, Stata only performs it for 311 observations.
I ran the command
tab acode if !e(sample)
However, it is still not obvious why 7 of these local authorities (acode) have been excluded. Is there a way to find out why, or are there common reasons why this happens? I have looked at the dataset and it is not missing any observations.
Thank You

-tabout- -show(all)- and -mi- not working

Hello all,

First and foremost, thank you in advance to anybody who is able to help out. I tried searching Statalist and Google for a solution or something that could point me in the right direction, but I have had no luck. Now to the issue.

Version information: Stata version 15, current update level is 21 Mar 2019; tabout version 3

Data background: This is a single year survey that has been weighted using the -svy- command. I cannot upload it or portions of it due to contractual and data sensitivity reasons.

Issue: I am attempting to produce survey-weighted cross-tabulation tables containing an outcome variable (across the top of the table) by 6 cross variables (rows). For some of the cross variables, no respondent selected a particular response option; for the other cross variables all options were selected, so there are no missing data. This may be hard to visualize, so I create a fake data table below to show what is occurring. -tabout- has a -show(all)- option that should fix this issue, but it is not working; Stata also returns an error telling me to use this option, which is already part of my code, so the option appears not to be processed. I have also tried including the -mi- option, but this does not work either. I have consulted the manual and have not had any luck.

My code
Code:
foreach var of varlist been_homeless {
    quietly tabout age_cat female race_eth fpl region any_chronic `var' ///
        using `var'.tex, ///
        show(all) cells(row) format(1) clab(Row%) ///
        svy percent stats(chi2) npos(lab) ptotal(single) replace ///
        style(tex) bt font(bold) topstr(18cm) topf(top.tex) botf(bot_Sweave.tex)
}
Note: there are actually about 20 variables in this loop; I shortened it to the single variable with which this issue occurs, for cleanliness.

Error received
Code:
Warning: not all panels have the same number of columns.
Include show(all) in your syntax to view panels.
And now for the fake table to visually show what is occurring.

Table 1. Fake Table (Row %) - these percentages are the actual output
                        Homeless in past 3 years
            Yes    No    Don't know   Refused   Total
Age
  19-35     20%    60%   2%           18%       100%
  36-49      5%    84%   0%            1%       100%
  50-64      1%    95%   2%            1%       100%
Gender
  Male      34%    66%   0%           100%
  Female    34%    66%   0%           100%

As one can see, the 'Refused' column shows the percentages from the 'Total' column instead of displaying 0s or being blank. When I run the tabulation within Stata without exporting via -tabout-, it executes as it should. Any help identifying the mechanism causing this behavior, or a workaround, would be greatly appreciated.


I apologize if I did not include something that I should have. Thank you again and I hope you have a great day!