Monday, November 30, 2020

cluster 2 varlist not allowed

Hi, I'm replicating a report, and this is the attached do-file
use wintertime_final, clear

gen snodas_weighted=snodas_max+(.25+.125*west)*(fsnodas_max-snodas_max)

xi i.wk i.sn

cluster2 resort wk2 _I*, tcluster(d) fcluster(sn)
outreg2 using table2, replace excel dec(3) keep(wk2)
cluster2 resort wk2 _I* if snodas_max~=., tcluster(d) fcluster(sn)
outreg2 using table2, excel dec(3) keep(wk2)

cluster2 resort wk2 snodas_max _I*, tcluster(d) fcluster(sn)
outreg2 using table2, excel dec(3) keep(wk2 snodas_max)
cluster2 resort wk2 snodas_weighted _I*, tcluster(d) fcluster(sn)
outreg2 using table2, excel dec(3) keep(wk2 snodas_weighted)
cluster2 resort wk2 fsnodas_max snodas_max lsnodas_max _I*, tcluster(d) fcluster(sn)
outreg2 using table2, excel dec(3) keep(wk2 fsnodas_max snodas_max lsnodas_max)

But I cannot use cluster2 (I installed cluster2.ado); each time I run it, it shows "varlist not allowed".

Can anyone help me with this one?
Thank you!

Specific Values of Variables and creating a new Variable

Hello,

I am trying to take 3 variables, take a specific value from each, and create a new variable from them. I am trying to take LIVING ALONE from each variable and create one variable, but they each use different values: SZFFHERE = 1 for living alone, FFRESIDE = 3 for living alone, and SZFFONLY = 0 for living alone. I have been trying for hours to combine these three values and make one variable called LIVINGALONE. Any advice please!
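
One possible approach, offered only as a rough sketch (it assumes the three variables are numeric and that any other nonmissing code means not living alone):

Code:
gen byte LIVINGALONE = (SZFFHERE == 1) | (FFRESIDE == 3) | (SZFFONLY == 0) ///
    if !missing(SZFFHERE, FFRESIDE, SZFFONLY)
label variable LIVINGALONE "Living alone according to any of the three source variables"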

Terrible parallelization performance in Mata

I just ran an example from the SJ paper about boottest ("Fast and Wild") and was surprised at its poor performance. I'm fortunate enough to be working in Stata/MP, and I discovered that the more processors I enabled it to use, the slower it got. I'm using a Dell XPS 17 9700, which has a pretty good cooling system. Its CPU is an Intel i7-10875H, which has 8 cores and hyperthreading. I'm running Stata/MP 12-core 16.1. It's got 64GB of RAM and Windows 10 Pro.

Here is the log from a distilled demonstration. It sets the number of cores to 1, 2, ..., 12. On each iteration it calls a program that creates a 2500 x 1 matrix X and then computes X + X :* X 10,000 times. Simplifying that calculation to X + X or X :* X makes the problem go away.

I'm wondering if anyone else with access to Stata/MP gets similar results, or has insights. Possibly it doesn't happen on all computers. I understand that implementing invisible parallelization in a compiler is a tricky business. But Stata/MP doesn't come cheap!

Code:
cap mata mata drop demo()

mata
mata set matastrict on
mata set matalnum off
mata set mataoptimize on

void demo() {
    real matrix X; real scalar i
    X = runiform(2500,1)
    for (i=10000; i; i--)
        (void) X + X :* X
}
end

timer clear
forvalues p=1/12 {
  qui set processors `p'
  set seed 1202938431
  timer on `p'
  mata demo()
  timer off `p'
}
timer list
Output:
Code:
. timer list
   1:      0.14 /        1 =       0.1390
   2:      0.16 /        1 =       0.1640
   3:      1.63 /        1 =       1.6330
   4:      2.02 /        1 =       2.0150
   5:      2.47 /        1 =       2.4680
   6:      2.92 /        1 =       2.9210
   7:      3.38 /        1 =       3.3780
   8:      3.84 /        1 =       3.8370
   9:      4.26 /        1 =       4.2640
  10:      4.70 /        1 =       4.7040
  11:      5.21 /        1 =       5.2100
  12:      5.63 /        1 =       5.6260
That's right: using 1 core takes 0.14 seconds. Using 8 cores takes 3.84 seconds. Using 12 (with hyperthreading) takes 5.63 seconds.

Here's output I get from Stata 15.0--it's actually better! But still bad:
Code:
   1:      0.13 /        1 =       0.1280
   2:      0.14 /        1 =       0.1440
   3:      1.09 /        1 =       1.0900
   4:      1.35 /        1 =       1.3460
   5:      1.55 /        1 =       1.5540
   6:      1.78 /        1 =       1.7810
   7:      2.10 /        1 =       2.1010
   8:      2.35 /        1 =       2.3540
   9:      2.59 /        1 =       2.5920
  10:      2.89 /        1 =       2.8890
  11:      3.16 /        1 =       3.1570
  12:      3.41 /        1 =       3.4120
I monitored CPU usage during these tests and saw no evidence of throttling.

I'm worried that my Mata-based programs are getting seriously slowed down.

If you've got MP and can run this test, I'd be interested in the results.

Replace missing values with those of previous values or those of the partner

Hi Statalist.

I am using annual survey responses on marital status from a panel study (for each partner in a couple - married/cohabiting) in my analysis. Around 0.8% of responses are missing, and I want to replace a missing value with the respondent's lagged value or, failing that, with the other partner's value - it is not uncommon that only one partner in a couple answers this question, and sometimes both miss it in a given year. Is this approach reasonable, particularly when the lagged and future values are the same?

I'm sure my draft code does not achieve what I want it to, so help here is appreciated (mrcurr1 = marital status of male partner; mrcurr2 = marital status of female partner).
Code:
bys id (wave): replace mrcurr1 = L.mrcurr1 if missing(mrcurr1)  
bys id (wave): replace mrcurr1 = L.mrcurr2 if missing(mrcurr1) // in case the current wave response is missing for both
bys id (wave): replace mrcurr2 = L.mrcurr2 if missing(mrcurr2)
bys id (wave): replace mrcurr2 = L.mrcurr1 if missing(mrcurr2)
Sample data:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte(wave mrcurr1 mrcurr2)
11  61  8 . 2
11  61 12 . 2
11  61 13 . 2
11  61 14 . 2
11  61 15 . 2
11  61 16 . 2
11  61 17 . 2
11  61 18 . 2
13  60 12 . 1
13  60 13 . 1
13  60 14 . 1
13  60 15 . 1
13  60 16 . 1
13  60 17 . 1
13  60 18 . 1
16  64  8 2 .
16  64  9 2 .
16  64 10 2 .
16  64 11 1 .
16  64 12 1 .
16  64 13 1 .
16  64 14 1 .
16  64 15 1 .
16  64 16 1 .
16  64 17 1 .
16  64 18 1 .
23  24  6 . .
23  24  7 . .
23  24  8 . .
23  24  9 . .
23  24 10 . .
23  24 11 . .
23  24 12 . .
end
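
On the mechanics only (not on whether the imputation is advisable), one sketch of the carry-forward-then-borrow idea, assuming one row per id-wave:

Code:
* carry the respondent's own last nonmissing answer forward within id
bysort id (wave): replace mrcurr1 = mrcurr1[_n-1] if missing(mrcurr1)
bysort id (wave): replace mrcurr2 = mrcurr2[_n-1] if missing(mrcurr2)
* then borrow the partner's same-wave answer where still missing
replace mrcurr1 = mrcurr2 if missing(mrcurr1) & !missing(mrcurr2)
replace mrcurr2 = mrcurr1 if missing(mrcurr2) & !missing(mrcurr1)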

Lincom for interpreting interactions in logistic regression

Hello,

I am estimating, for example, the following double-difference / difference-in-difference logistic regression where opt_dum, depstat (Mild or None), and treat (HT or Con) are all binary variables.

The output of the model is as follows:
Code:
logit opt_dum i.depstat##i.treat, or 
Iteration 0:   log likelihood = -166.14693  
Iteration 1:   log likelihood = -121.37891  
Iteration 2:   log likelihood = -120.88007  
Iteration 3:   log likelihood = -120.87909  
Iteration 4:   log likelihood = -120.87909  

Logistic regression                             Number of obs     =        240
                                                LR chi2(3)        =      90.54
                                                Prob > chi2       =     0.0000
Log likelihood = -120.87909                     Pseudo R2         =     0.2725

-------------------------------------------------------------------------------
      opt_dum | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      depstat |
        Mild  |   .3891403   .1851221    -1.98   0.047     .1531694    .9886448
              |
        treat |
          HT  |   19.15126   9.456233     5.98   0.000     7.276179    50.40706
              |
depstat#treat |
     Mild#HT  |   .5459965   .3697249    -0.89   0.372     .1448083    2.058668
              |
        _cons |   .3953488   .1132654    -3.24   0.001     .2254839     .693179
-------------------------------------------------------------------------------
I then use the margins command to obtain the marginal estimates
Code:
. margins i.depstat##i.treat, post coeflegend

Predictive margins                              Number of obs     =        240
Model VCE    : OIM

Expression   : Pr(opt_dum), predict()

-------------------------------------------------------------------------------
              |     Margin  Legend
--------------+----------------------------------------------------------------
      depstat |
         Non  |   .5833333  _b[1bn.depstat]
        Mild  |       .375  _b[2.depstat]
              |
        treat |
         Con  |   .2083333  _b[1bn.treat]
          HT  |        .75  _b[2.treat]
              |
depstat#treat |
     Non#Con  |   .2833333  _b[1bn.depstat#1bn.treat]
      Non#HT  |   .8833333  _b[1bn.depstat#2.treat]
    Mild#Con  |   .1333333  _b[2.depstat#1bn.treat]
     Mild#HT  |   .6166667  _b[2.depstat#2.treat]
-------------------------------------------------------------------------------
My question is whether the following lincom code is appropriate for testing the double difference, i.e., to test whether the association between depstat and opt_dum varies by the treat variable.
Code:
. lincom ( _b[2.depstat#2.treat]- _b[2.depstat#1bn.treat])-(_b[1bn.depstat#2.treat]-_b[1bn.depstat#1bn.treat])

 ( 1)  1bn.depstat#1bn.treat - 1bn.depstat#2.treat - 2.depstat#1bn.treat + 2.depstat#2.treat = 0

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.1166667   .1047263    -1.11   0.265    -.3219264    .0885931
------------------------------------------------------------------------------
Thank you,
Caroline

Plot Transition Matrix on a Graph

Hi all,

I would like to plot the transition matrix from employment to employment (EtoE), unemployment to employment (UtoE), unemployment to unemployment (UtoU), etc. over the period 2010 to 2014.

I would like to have the year in the x-axis and the transition probability on the y-axis. Below is the code that I wrote:

Code:
set more off
forvalues i=2010/2014{
gen em1`i'=labor==1 if year==`i'
gen em2`i'=labor==2 if year==`i'
gen em3`i'=labor==3 if year==`i'
}
forvalues i=2010/2013{
local j=`i'+1
gen UtoN`i'=(em2`i'==1 & em3`j'==1) if em2`i'==1
gen UtoE`i'=(em2`i'==1 & em1`j'==1) if em2`i'==1
gen UtoU`i'=(em2`i'==1 & em2`j'==1) if em2`i'==1
gen NtoE`i'=(em3`i'==1 & em1`j'==1) if em3`i'==1
gen NtoU`i'=(em3`i'==1 & em2`j'==1) if em3`i'==1
gen NtoN`i'=(em3`i'==1 & em3`j'==1) if em3`i'==1
gen EtoU`i'=(em1`i'==1 & em2`j'==1) if em1`i'==1
gen EtoN`i'=(em1`i'==1 & em3`j'==1) if em1`i'==1
gen EtoE`i'=(em1`i'==1 & em1`j'==1) if em1`i'==1
}

collapse (mean) UtoN2010-EtoE2013, by(gender state)
reshape long UtoN UtoE UtoU NtoE NtoU NtoN EtoU EtoN EtoE, i(gender state) j(year)
graph twoway (line UtoE year, lpattern(dash) lcolor(red)) (line UtoU year, lpattern(dash) lcolor(black)) (line EtoE year, lcolor(black)), ///
legend(label(1 "U->E") label(2 "U->U") label(3 "E->E")) by(state)
When I run this code, I obtain 0 for all the transition probabilities. Can someone tell me where I made a mistake?

My data looks like the following:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int ID double(year labor gender state)
11112 2010 3 1 1
11112 2011 1 1 1
11112 2012 3 1 2
11112 2013 2 1 1
11112 2014 1 1 1

11113 2010 2 1 1
11113 2011 2 1 1
11113 2012 1 1 2
11113 2013 2 1 2
11113 2014 1 1 1

11114 2010 1 1 2
11114 2011 3 1 2
11114 2012 3 1 1
11114 2013 2 1 1
11114 2014 1 1 1

11115 2010 2 0 1
11115 2011 2 0 1
11115 2012 1 0 1
11115 2013 3 0 2
11115 2014 1 0 1

11116 2010 1 0 2
11116 2011 1 0 2
11116 2012 3 0 2
11116 2013 2 0 2
11116 2014 3 0 2
end
label values labor Employment
label def Employment 1 "Employed", modify
label def Employment 2 "Unemployed", modify
label def Employment 3 "Inactive", modify
label values gender Female
label def Female 1 "Female", modify
label def Female 0 "Male", modify
label values state Region
label def Region 1 "North", modify
label def Region 2 "South", modify
Thanks in advance for your help.
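
A possible reason for the zeros (my reading of the code, so treat it as a guess): em1`j'-em3`j' are nonmissing only on rows from year `j', while em1`i'-em3`i' are nonmissing only on rows from year `i', so the row-by-row comparisons never see both years at once. A sketch of a row-per-person-year alternative using lead operators:

Code:
xtset ID year
gen UtoE = (F.labor == 1) if labor == 2 & !missing(F.labor)
gen UtoU = (F.labor == 2) if labor == 2 & !missing(F.labor)
gen EtoE = (F.labor == 1) if labor == 1 & !missing(F.labor)
collapse (mean) UtoE UtoU EtoE, by(gender state year)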

Twoway graph with two bars in one diagram

Hello everyone,

I am new to Stata and currently trying to create a bar chart that shows GDP data from two different countries over a certain period of time. I figured out how to create a bar chart for only one country, but now I would like to add bars that show the values for another country in the same graph. What I would like to show is a comparison of quarterly GDP values for two different countries. Here is my code:

graph twoway (bar obsvalue qdate if country == "DEU", color(gs8)) ///
(bar obsvalue qdate if country == "ITA", color(black)) ///
ytitle("Gross Value Added in Million US$") xtitle("Period") ///
xlabel(232 "Q1/2018" 233 "Q2/2018" 234 "Q3/2018" 235 "Q4/2018" 236 "Q1/2019" ///
237 "Q2/2019" 238 "Q3/2019" 239 "Q4/2019" 240 "Q1/2020" 241 "Q2/2020", angle(45)) ///
legend(order(1 "Germany" 2 "Italy"))

When I execute this code I always get the error message ") required". I really cannot figure out where in the code the parentheses are missing.

I would really appreciate any advice on how to fix this problem.
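
For what it is worth, one common cause of ") required" here (my guess, not verified against the data): overall graph options such as ytitle() must be separated from the plot specifications by a comma. A sketch of the same command with that comma added:

Code:
graph twoway (bar obsvalue qdate if country == "DEU", color(gs8)) ///
    (bar obsvalue qdate if country == "ITA", color(black)), ///
    ytitle("Gross Value Added in Million US$") xtitle("Period") ///
    xlabel(232 "Q1/2018" 233 "Q2/2018" 234 "Q3/2018" 235 "Q4/2018" 236 "Q1/2019" ///
    237 "Q2/2019" 238 "Q3/2019" 239 "Q4/2019" 240 "Q1/2020" 241 "Q2/2020", angle(45)) ///
    legend(order(1 "Germany" 2 "Italy"))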


Sum of dummy variable up until the point of observation

Dear STATALIST forum,

I have a question about generating a variable that holds the running sum of a dummy variable up to point x. To help you understand, here is what my data look like right now:
platform   year  month  genre_id  Genre1_dummy  Genre2_dummy  Genre3_dummy
microsoft  2003  4      1         1             0             0
microsoft  2003  8      3         0             0             1
macOS      2002  1      1         1             0             0
macOS      2002  4      2         0             1             0
macOS      2003  8      1         1             0             0
And here is how I would like to have it:
platform   year  month  genre_id  Genre1_dummy  Genre2_dummy  Genre3_dummy  genre_1  genre_2  genre_3
microsoft  2003  4      1         1             0             0             1        0        0
microsoft  2003  8      3         0             0             1             1        0        1
macOS      2002  1      1         1             0             0             1        0        0
macOS      2002  4      2         0             1             0             1        1        0
macOS      2003  8      1         1             0             0             2        1        0
To conclude: I would like to generate a running sum of each genre dummy up to the point at which the observation was made, without crossing platforms.
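
A minimal sketch of one way to get this (assuming year and month order the observations within each platform):

Code:
bysort platform (year month): gen genre_1 = sum(Genre1_dummy)
bysort platform (year month): gen genre_2 = sum(Genre2_dummy)
bysort platform (year month): gen genre_3 = sum(Genre3_dummy)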

Thanks for your help. It is very much appreciated.

Best,

Sebastian

Let values continue

Hello, can someone tell me the command for doing what is shown in the screenshot?
You copy the values of another variable and then pull them down for the same mergeid.
Sorry for the bad Paint mock-up; I hope it makes clear what I want to express. :D
Thanks in advance!
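
Since the screenshot is not reproduced here, only a guess at a sketch (sourcevar, newvar, and wave are placeholder names): copy the other variable and fill it downwards within mergeid.

Code:
gen newvar = sourcevar
bysort mergeid (wave): replace newvar = newvar[_n-1] if missing(newvar)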

Intercepts at knots in mixed

Dear all,

I am currently working on a mixed model with incorporation of knots at specific timepoints to compare the course over time between 5 different groups (cohorts).
Yet, I cannot figure out how to retrieve the estimates of the intercepts at these knots, or how to compare the intercepts at these knots between the 5 groups.

Can anyone help me with this?

My mixed model:
mixed DAS28ESR i.cohort##c.years c.Age_incl i.sex ||EACNUMM: c.years, covariance(unstructured) reml

In which DAS28ESR is a measure for disease activity (continuous) and EACNUMM are the individuals.
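
For the model as shown (with c.years entering linearly), one hedged possibility: margins after mixed can give the fitted mean at chosen timepoints for each cohort, and pwcompare can contrast cohorts at a given timepoint. The times in at() below are placeholders; if the knots enter the fixed part through spline variables instead, the at() specification would need to refer to those variables.

Code:
margins i.cohort, at(years = (0 0.5 1 2)) predict(xb)
margins i.cohort, at(years = (1)) predict(xb) pwcompare(effects)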

Thanks in advance!

Kind regards,
Marloes

Distribution of explanatory variable for different birth cohorts

Hi there,

I'm struggling to understand part of an assignment that I need to do.

I have to examine whether education has an effect on the total number of kids a woman has.
The data are taken from a survey fielded between 1985 and 2009. Each woman is observed only once in the data, so the women differ in age and may be observed in different years.
The women are divided into 10 10-year birth cohorts (e.g. a woman born in 1915 is in birth cohort 1911-1920). Education is measured as years of schooling.

"I have to check whether the distribution of education looks reasonable for different birth cohorts. I should conclude that birth cohorts 1, 2, 3, and 10 should not be used in the analysis."

Below is the distribution of education in years of schooling for the different birth cohorts. I can't figure out on what basis I should remove birth cohorts 1, 2, 3, and 10. Could anyone help me out? Thank you in advance (:


[attachment: distribution of years of schooling by birth cohort]

tabstat of multiple dummy variables

Hello,

I have a large dataset of N = 59,000,000.

I have a set of diagnoses dummy variables a00_a09 - u_unclassified, and a set of service code dummy variables, the service code dummy variables begin with ncb*

I need to find how many times a diagnoses code occurs for a particular service code.

I thought something like the below would work, I simply count where there is cost, for each diagnosis code, by service code

foreach var of varlist a00_a09 - u_unclassified {
tabstat total_ss_costs if `var' ==1, s(n) by (ncb*) }

Every record has a cost.

However, I think I need to specify the ncb* differently.

Also, as the code is written above, Stata would produce a new output table for each `var', but would not label the output table with the `var' name. Is it possible to amend the code so that this happens?

Can anyone help?
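
In case it helps, a sketch of a nested loop that labels each diagnosis/service pair as it goes. It counts observations with both dummies equal to 1; whether that matches the intended definition of "occurs for" is an assumption on my part.

Code:
foreach var of varlist a00_a09 - u_unclassified {
    foreach svc of varlist ncb* {
        quietly count if `var' == 1 & `svc' == 1
        display "`var' x `svc': " r(N)
    }
}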

sample data is below:


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float total_ss_costs byte(r69 r50_r68 r47_r49 r40_r46 ncbps29m_hist ncbps29b_hist ncbps27b_hist)
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 1
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 1
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 1
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
 7143.572 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
141.99995 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
 6460.776 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 1
        0 0 1 0 0 0 0 1
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 1
        0 0 0 0 0 0 0 0
 903.4579 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
 4346.248 0 0 0 1 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 1
        0 0 1 0 0 0 0 1
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 1
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
 426.3519 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
13727.894 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
284.62573 0 0 0 0 0 0 0
110.00005 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
      529 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
end

Confidence-interval plots / any plot : resize possible?

Dear Statalisters

My question, basically, is whether there is a 'correct' way to resize Stata graphs (including the background region), so that the graph becomes narrower.


This is the basic graph I start with, using the command ciplot:
Code:
ciplot female, by(resgroup_march) scheme(lean1)
[attachment: ciplot output]

I'd prefer if this plot were less wide. For example, as shown in the photo below (not my slide):

[attachment: example of the desired narrower plot]


One option is to tinker with the aspect ratio of the overall graph. An aspect ratio of 3, for example, makes the graph narrower - but the background region remains the same.

[attachment: plot with aspect ratio 3]

If I were to
Code:
graph combine
the above plot (aspect ratio 3) with another plot (same aspect ratio of 3), this is the result:

[attachment: result of graph combine]

Although it would certainly do the job, I'd prefer less white space between the two graphs. Basically, I'd like to replicate the look of the 5 plots sitting closely next to each other as in my photo.

The 'incorrect' option would be to crop my individual plots in a separate tool (eg. Word) and then paste them next to each other, but I've decided to 'dig deep' into this matter and find out whether there is a 'correct' way to resize this using Stata.



Finally, please note that I've also tried
Code:
statsby mean=r(mean) ub=r(ub) lb=r(lb), by(resgroup_march) clear: ci means female
twoway (rcap ub lb resgroup_march) (scatter mean resgroup_march), ytitle(Share of female respondents) ylabel(0.30(0.10)0.70, gmin) xlabel(0(1)1, valuelabel)
[attachment: rcap/scatter plot from the statsby approach]

This works fine too, but my issue regarding the white space between the two groups remains.
Editing the aspect ratio makes the white space between the groups smaller, but leaves the background region unchanged, similar to the ciplot command.
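
One more thing that may be worth trying (just a suggestion, not tested on these graphs): xsize() and ysize() change the whole graph region, background included, unlike aspect(), and they can be applied to the current graph with graph display.

Code:
ciplot female, by(resgroup_march) scheme(lean1)
graph display, xsize(2) ysize(4)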


Any help is much appreciated.

Stata version 16.1

How to perform NLS estimation with a complex function

Dear Stata users,


I need to use NLS to estimate the following equation describing the so-called Bass diffusion model:

y_t = m * [ (1 - exp(-(p+q)*t)) / (1 + (q/p)*exp(-(p+q)*t)) - (1 - exp(-(p+q)*(t-1))) / (1 + (q/p)*exp(-(p+q)*(t-1))) ] + u_t

where m,p, and q are parameters and t indicates time. As initial values for these parameters, I am to use those obtained with OLS applied to:

y_t = β0 + β1*x_{1,t-1} + β2*x_{2,t-1} + ε_t

where β0 = m*p, β1 = q - p, and β2 = -q/m.

Note that m = 41302528
p = 0.024
q = 0.194


To perform this NLS estimation, I tried to follow the example given on the Stata website (https://www.stata.com/features/overv...ar-regression/) to create a program that defines my function. Then, I used the nl command and specified the initial values.

When I run the command, I get the following message:

verify that nlces is a function evaluator program
r(198);

Would you please help me figure out what the problem is? I am a beginner in Stata software.
Thanks in advance.
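
For reference, a sketch of a function-evaluator program along the lines of the manual's nlces example; y and t are placeholder variable names. The program name must carry the nl prefix (here nlbass), and it is then called as nl bass @:

Code:
program nlbass
    syntax varlist(min=2 max=2) if, at(name)
    local y : word 1 of `varlist'
    local t : word 2 of `varlist'
    // pull the current parameter values out of the at() matrix
    tempname m p q
    scalar `m' = `at'[1,1]
    scalar `p' = `at'[1,2]
    scalar `q' = `at'[1,3]
    // cumulative adoption share at t and at t-1
    tempvar Ft Ftm1
    gen double `Ft'   = (1 - exp(-(`p'+`q')*`t'))     / (1 + (`q'/`p')*exp(-(`p'+`q')*`t'))     `if'
    gen double `Ftm1' = (1 - exp(-(`p'+`q')*(`t'-1))) / (1 + (`q'/`p')*exp(-(`p'+`q')*(`t'-1))) `if'
    // fill the first variable with the function's value
    replace `y' = `m'*(`Ft' - `Ftm1') `if'
end

nl bass @ y t, parameters(m p q) initial(m 41302528 p 0.024 q 0.194)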

Endogeneity & dropping of constraints

Hi,

I am running the Durbin-Wu-Hausman test to check my independent variables for endogeneity.

My data is cross sectional and I have a matched pairs design. For the purpose of analysis, I would use either logistic regression or probit or conditional logistic regression.

I am using the following steps to test for endogeneity:

1) regress X1 X2 X3 X4 X5 X6 X7 (X1 to X7 are all independent variables and X1 is being tested for endogeneity)

2) predict X1_res, res

3) regress DV X1 X2 X3 X4 X5 X6 X7 X1_res (DV is the dependent variable)

4) test X1_res

For a lot of the independent variables, Stata returns the message that the constraint is dropped. For example:
o.CDualy_res = 0
Constraint 1 dropped

Could you please help me understand why the constraints are getting dropped?

Hausman test/Estimating pmg,mg, dfe


Good afternoon

I am estimating a dynamic model with pmg, mg and dfe. The study applies the Kuznets curve to CO2: co2 = f(primary energy, gdp, gdp^2, trade openness).
I found the optimal lags (1,1,0,0,0), in the order of the variables described above. However, I don't know whether I am using the right commands to run the Hausman test and estimate the model.
xtpmg d.lco2pc d.lceppc d.lpibpc d.lpib2pc d.labr, lr( l.lco2pc l.lceppc lpib2pc labr) ec(ECT) replace mg
xtpmg d.lco2pc d.lceppc d.lpibpc d.lpib2pc d.labr, lr( l.lco2pc l.lceppc lpib2pc labr) ec(ECT) replace pmg
hausman mg pmg

best regards

PS: co2=co2
primary energy=ceppc
GDP=pib
GDP^2=pib^2
trade openness=abr
l(var)=ln(var)
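
For what it is worth, the usual sequence stores each set of estimates before calling hausman (a sketch built on the commands above; whether the lag specification itself is right is a separate question, and sigmamore is optional):

Code:
xtpmg d.lco2pc d.lceppc d.lpibpc d.lpib2pc d.labr, lr(l.lco2pc l.lceppc lpib2pc labr) ec(ECT) replace pmg
estimates store pmg
xtpmg d.lco2pc d.lceppc d.lpibpc d.lpib2pc d.labr, lr(l.lco2pc l.lceppc lpib2pc labr) ec(ECT) replace mg
estimates store mg
hausman mg pmg, sigmamore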

Subsample dataset under multiple conditions

Windows 10
Stata 16

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long region str6 gender float age
 8 "male"   4
 2 "male"   5
13 "male"   5
10 "male"   3
 4 "male"   5
14 "female" 4
 2 "male"   4
12 "male"   3
13 "male"   4
 1 "male"   5
 8 "male"   1
11 "female" 3
14 "female" 4
 5 "male"   5
 8 "male"   3
 5 "female" 4
 7 "male"   2
 9 "male"   4
11 "male"   5
 8 "female" 5
 4 "female" 4
 2 "male"   4
 6 "male"   5
 9 "male"   3
 2 "male"   4
 2 "male"   2
11 "female" 5
 9 "male"   5
 8 "male"   4
13 "female" 1
 2 "female" 1
 9 "male"   3
 2 "male"   5
 2 "male"   4
 9 "male"   5
 9 "male"   4
 2 "female" 4
 2 "male"   4
 2 "male"   5
 4 "male"   4
 2 "male"   5
11 "female" 5
 2 "female" 5
 5 "female" 4
11 "male"   5
 8 "male"   5
10 "male"   3
11 "female" 1
 2 "female" 5
 2 "male"   4
 4 "male"   3
13 "male"   5
 8 "male"   4
 1 "male"   5
 5 "male"   3
 2 "male"   1
 5 "female" 3
10 "male"   2
14 "male"   4
12 "male"   4
 2 "male"   5
 6 "female" 3
 2 "male"   4
11 "female" 1
11 "male"   5
12 "female" 4
 2 "male"   2
14 "male"   3
 2 "female" 4
 1 "female" 3
12 "female" 4
 9 "female" 4
 2 "female" 3
 8 "male"   3
 2 "female" 4
11 "male"   3
 8 "female" 4
 5 "male"   5
 9 "male"   4
10 "female" 5
13 "male"   4
10 "male"   4
13 "female" 5
 2 "female" 4
13 "male"   3
 5 "male"   1
 4 "male"   4
13 "male"   1
 8 "male"   3
 1 "male"   3
 8 "male"   1
13 "male"   5
 2 "male"   4
 8 "male"   3
13 "female" 4
 8 "male"   3
 1 "female" 3
 4 "female" 1
 7 "female" 1
 2 "male"   5
end
label values region periph_res
label def periph_res 1 "Anatolikis Macedonias kai Thrakis", modify
label def periph_res 2 "Attikis", modify
label def periph_res 4 "Dytikis Elladas", modify
label def periph_res 5 "Dytikis Makedonias", modify
label def periph_res 6 "Ionion Nison", modify
label def periph_res 7 "Ipeirou", modify
label def periph_res 8 "Kentrikis Makedonias", modify
label def periph_res 9 "Kritis", modify
label def periph_res 10 "Notiou Aigaiou", modify
label def periph_res 11 "Peloponnisou", modify
label def periph_res 12 "Stereas Elladas", modify
label def periph_res 13 "Thessalias", modify
label def periph_res 14 "Voreiou Aigaiou", modify
label values age agel
label def agel 1 "18-24", modify
label def agel 2 "25-34", modify
label def agel 3 "35-44", modify
label def agel 4 "45-54", modify
label def agel 5 "55 plus", modify

Hello!

I would like to ask for help on how to balance my dataset under several constraints simultaneously.
I have collected a sample of 5946 respondents, with the following demographic characteristics: region, gender and age.
The sample is not representative at the national level based on these characteristics. These are the percentages in the real population (from the census) and in my sample:
REGION CODE   Census population (%)   Sample (%)
R1            5.62                    5.8
R2            35.40                   33.54
R3            6.28                    4.44
R4            2.62                    2.24
R5            1.92                    1.66
R6            3.11                    3.21
R7            17.40                   19.07
R8            5.76                    5.89
R9            2.86                    3.55
R10           5.34                    4.89
R11           5.06                    4.86
R12           6.77                    6.63
R13           1.84                    4.22

GENDER                   Males   Females
Census population (%)    49.03   50.97
Sample (%)               55.63   44.37

AGE GROUP                18-24   25-34   35-44   45-54   55 over
Census population (%)    9.63    17.32   18.43   16.58   38.03
Sample (%)               9.2     14.3    24      26.86   25.65

I would like to balance my sample so that it matches all the population quotas in the tables above. If more than one dataset (i.e., subsample) satisfies the conditions, I would like to keep the largest one; if several satisfy the conditions and are of equal size, I would like to keep all of them.
I only need the sample to be balanced on a national level, not on a regional level. For instance, I need 49% males in the overall subsample but I don’t necessarily need 49% in each region. That means that regions that have more women can compensate for those who have a deficiency in women etc.
The constraints are not really that strict. For example, I don’t need to get exactly 49.03% males. I could do 49.04%, or any other percentage that would not be significantly different from 49.03%.
The test we use to assess significance is a two-sample test of proportions (prtesti in Stata).
Is there any way to do this? I considered gsample, but I am not sure how to move forward with it or whether it is even relevant to my case.

Thank you for your time in advance

Best,
Eleni

Looping to create a difference score

Hello all,

I am working on a project where I need difference scores. I will have a number of difference scores, so typing the command over-and-over will become tedious. I am trying to develop a loop that will provide me with ONLY the difference scores for my variables of interest. I realize there is probably some process that I am overlooking, so I need some assistance, please. I am working with the following code:

** setting up data **
set obs 10000
set seed 40245
*set trace on
gen x1 = rnormal(25, 10)
gen x2 = rnormal(12, 3)
gen x3 = rnormal(10, 5)
gen x4 = rnormal(15, 8)
gen x5 = rnormal(5, 2)
gen x6 = rnormal(3, 1)

** actual difference scores that I desire **
gen d1 = x2 - x1
gen d2 = x4 - x3
gen d3 = x6 - x5

**loop to replicate the difference scores **

local t1 x1 x3 x5
local t2 x2 x4 x6

local counter = 2
foreach v1 of varlist `t1' {
foreach v2 of varlist `t2' {
local ++counter
gen d`counter' = `v2'-`v1'

}
}

sum

drop *

At the moment, the loop gives me every possible combination of `v1' and `v2'. I only want the three pairs. How do I "tame" the loop? Thank you in advance.
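
One sketch of a paired loop that walks the two lists in step rather than crossing them:

Code:
local t1 x1 x3 x5
local t2 x2 x4 x6
local npairs : word count `t1'
forvalues i = 1/`npairs' {
    local v1 : word `i' of `t1'
    local v2 : word `i' of `t2'
    gen d`i' = `v2' - `v1'
}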

Cross tabulations based on multiple dummy variables

Hello,

I have a large dataset for N= 59,000,000 observations.

I have 154 dummy variables relating to a set of diagnosis codes, and 173 dummy variables relating to service codes ( e.g. chemotherapy, orthopedics etc). The service codes all begin with ncb*. I have one string variable that identifies individuals and one cost variable (float).

I need to produce a cross-tab of diagnosis code (columns) by service code (rows) and count the number of individuals in each cell. Some people will have no diagnosis and no service code. I would then like to repeat this with a sum of cost rather than a count of individuals.

In the past I have used collapse and reshape, to get the output I would like, but with columns being dummy variables I am unsure of how to do this.

Ideally I would like to produce a new Stata file as this will be easier to export to excel I think.

Below is a sample of the data but with the id variable not included - because it is a long string. I have included the cost variable. I am assuming I could replace a sum with a count to count the ID rather than sum the cost.

Can anyone advise please?

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float total_ss_costs byte(r69 r50_r68 r47_r49 r40_r46 ncbps29m_hist ncbps29b_hist ncbps27b_hist)
        0 0 0 0 0 1 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 1 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
 7143.572 0 1 0 0 1 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
141.99995 1 0 0 0 0 0 1
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
 6460.776 0 1 0 0 1 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
 903.4579 0 1 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
 4346.248 0 0 0 1 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 1 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
 426.3519 0 1 0 0 1 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
13727.894 0 1 0 0 0 0 1
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0
284.62573 0 0 0 0 0 0 1
110.00005 0 1 0 0 1 0 0
        0 0 0 0 0 0 0 0
      529 0 0 1 0 0 0 1
        0 0 0 0 0 0 0 0
end
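
In case the mechanics help, a rough sketch using postfile with only the dummies in this excerpt; it counts observations rather than distinct individuals (the id variable is omitted above), and crosstab.dta is a placeholder file name:

Code:
tempname h
postfile `h' str32 diag str32 svc long n double cost using crosstab, replace
foreach d of varlist r69 r50_r68 r47_r49 r40_r46 {
    foreach s of varlist ncbps29m_hist ncbps29b_hist ncbps27b_hist {
        quietly count if `d' == 1 & `s' == 1
        local n = r(N)
        quietly summarize total_ss_costs if `d' == 1 & `s' == 1, meanonly
        post `h' ("`d'") ("`s'") (`n') (cond(`n' > 0, r(sum), 0))
    }
}
postclose `h'
use crosstab, clear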

Ordered Probit with 2 continuous interaction term

Hello all,
I am doing an ordered probit regression as follows:
Y takes the value from 1 to 10
X is a continuous variable
Z is also a continuous variable
My regression is: oprobit Y X c.X#c.Z
Please tell me how I can interpret the interaction term, since both X and Z are continuous, so the command margins X, at(Z=1) does not work.
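
A sketch of one way to summarize it (the Z values in at() are placeholders, outcome(1) picks one of the ten outcome levels, and the full factorial c.X##c.Z is used here, which differs slightly from the command shown above): with two continuous variables, the interaction is usually described by how the marginal effect of X changes across values of Z.

Code:
oprobit Y c.X##c.Z
margins, dydx(X) at(Z = (1 2 3)) predict(outcome(1))
marginsplot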

Interaction terms on robust standard errors result in no F-stat and p-value

Hi all,
I am currently experiencing some difficulties with my model, and I hope you will be able to help me.

When I run my regression I get no F-statistic or p-value in the output. The problem occurs when I include dummy variables (to handle missing values, MIM), robust standard errors (to address MLR.5, homoskedasticity), and several interaction terms (included to achieve linearity in parameters, per the RESET test) in my regression - in fact, it already occurs when running the regression with just the interaction terms and robust standard errors.

Do you have any apparent explanation for why I do not get any F-stat and p-value output?

Thanks in advance.

Best regards,
Marcus Frellsen.

marginal effects in hurdle model

Dear All

I'm fitting a hurdle model (with the command churdle linear) and then computing marginal effects (margins, dydx(CategoricalVariable)). I wonder why the statistical significance differs between the churdle coefficients and the margins output.

Hurdle linear part

    SED_wkday |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
    SED_wkday |
  speaking_wk |  -13.15755   16.96346    -0.78   0.438    -46.40532    20.09022
   texting_wk |  -22.34044   11.64429    -1.92   0.055    -45.16282    .4819424
social_net_wk |  -50.51344   8.730203    -5.79   0.000    -67.62433   -33.40256
  browsing_wk |  -24.73634   11.57269    -2.14   0.033    -47.41838   -2.054286


Margin

              |            Delta-method
              |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
  speaking_wk |  -13.60962    10.8538    -1.25   0.210    -34.88268    7.663439
   texting_wk |  -9.302818   7.627541    -1.22   0.223    -24.25252    5.646887
social_net_wk |  -19.89934   5.714631    -3.48   0.000    -31.09981   -8.698867
  browsing_wk |  -.7922133   7.934503    -0.10   0.920    -16.34355    14.75913


Please could someone help me understand what is going on and why the statistical significance differs?

In addition, how could I fit an interaction in a hurdle model?
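
On the interaction question, only a hedged sketch (the select() varlist, ll(0), and the binary moderator sex are placeholders, not taken from the post): factor-variable notation works inside churdle, and margins can then summarize the interaction.

Code:
churdle linear SED_wkday c.texting_wk##i.sex speaking_wk, select(speaking_wk texting_wk) ll(0)
margins, dydx(texting_wk) at(sex = (0 1))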

Thank you
Ellie

Intra-Day Data Differences

Hi,

I am working with the following intra-day data.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id long date double(time price)
1 21823           56100000    102.387
1 21824           33596000    102.288
1 21824           47710000    102.397
1 21825           50009000    102.742
1 21825           60499000 102.702736
1 21826           12119000    102.582
1 21826           32789000    102.565
1 21826           43951000    102.609
1 21826           54412000    102.617
1 21826           54412000  102.55451
1 21829           37797000    102.545
1 21829           45684000    102.522
1 21830           44628000    102.536
1 21831           35067000    102.645
1 21831           37511000    102.542
1 21831           48165000    102.174
1 21831           48165000    102.474
1 21831           64399000   102.4974
1 21831           64399000   102.4974
1 21832           36633000    102.355
1 21832           57553000    102.299
1 21832           57995000 102.338645
1 21833           40237000    102.319
1 21836           57954000 102.080452
1 21837           30188000    102.332
1 21837 32455999.999999996    102.268
1 21837           44701000    102.073
1 21838 32155999.999999996    102.259
1 21838           33861000    102.403
1 21838           33861000    102.403
1 21838           33861000    102.403
1 21839           40823000    102.237
1 21839           40823000    102.237
1 21840           10127000    102.424
1 21840           28777000    102.423
1 21840           42387000    102.453
1 21840           42387000    102.453
1 21843           36383000     102.36
1 21843           42529000    102.327
1 21843           44051000     102.39
1 21843           46214000    102.356
1 21843           46214000    102.206
1 21843           46214000    102.206
1 21843           47160000    102.287
1 21843           47160000    102.137
1 21843           47160000    102.137
1 21843           54017000    102.389
1 21844           42487000    102.378
1 21844           51397000    102.388
1 21844           56163000    102.403
1 21845           31133000    102.336
1 21845           39218000     102.36
1 21845           40921000    102.353
1 21845           57404000    102.357
1 21846           33927000    102.367
1 21846           33927000 102.315821
1 21846           37407000    102.452
1 21846           37795000    102.537
1 21846           40807000    102.437
1 21846           40807000    102.437
1 21846           43021000    102.497
1 21847           21949000    102.467
1 21847           29852000    102.238
1 21850 32089000.000000004    102.146
1 21850           33233000    102.087
1 21851           35148000    102.856
1 21851           35149000    102.189
1 21851           37061000    102.181
1 21851           41673000    102.189
1 21851           63717000    102.268
1 21852 31804000.000000004    102.134
1 21852           31821000    102.095
1 21852           38297000    102.235
1 21852           38310000    102.235
1 21852           40491000    102.155
1 21852           55524000    102.261
1 21853           34598000    102.172
1 21853           34598000    102.272
1 21853           42468000    102.357
1 21854           49875000  102.38718
1 21854           49875000  102.35594
1 21857           35252000    102.308
1 21857           53270000    102.244
1 21858           36819000    102.148
1 21858           40646000    102.207
1 21858           55883000    102.214
1 21858           59147000 102.186285
1 21859           36765000    102.232
1 21859           37575000    102.278
1 21859           39973000    102.165
1 21859           51312000    102.197
1 21860           40842000     102.07
1 21861           54688000   101.9095
1 21861           54688000   102.0095
1 21865           36402000    102.038
1 21865           43839000    101.891
1 21865           43839000    101.954
1 21865           43978000    101.948
1 21865           52278000    102.017
1 21865           58523000    101.968
end
format %td date
format %tcHH:MM:SS time
I want to take the difference between two successive prices (the intra-day intervals are not of fixed length) over a window of 10 days, and as a second step I want to sum up those differences over the same interval. The problem I am facing in taking the differences is that the window is defined over a number of days whereas the data are intra-day. I would appreciate any help in this regard. Thank you.
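
A sketch of one possible route (rangestat is a user-written command from SSC, and the window below means the current date plus the previous 9 calendar days, which may or may not be the intended definition):

Code:
* successive price differences within each id, in date-time order
sort id date time
by id: gen dprice = price - price[_n-1]

* sum of those differences over a rolling 10-calendar-day window
* ssc install rangestat
rangestat (sum) dprice, interval(date -9 0) by(id)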

Merging data

Dear all,

I want to merge data from two different datasets. The first has only one observation per firm(CUSIP) for a given year, as shown.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 CUSIP double year
"00036020" 2009
"00036020" 2010
"00036020" 2011
"00036020" 2012
"00036020" 2013
"00036020" 2014
"00036020" 2015
"00036020" 2016
"00036020" 2017
"00036020" 2018
"00036110" 2007
"00036110" 2008
"00036110" 2009
"00036110" 2010
"00036110" 2011
"00036110" 2012
"00036110" 2013
"00036110" 2014
"00036110" 2015
"00036110" 2016
"00036110" 2017
"00036110" 2018
"00088630" 2007
"00088630" 2008
"00088630" 2009
"00095710" 2007
"00095710" 2008
"00095710" 2009
"00095710" 2010
"00095710" 2011
"00095710" 2012
"00095710" 2013
"00095710" 2014
"00095710" 2015
"00095710" 2016
"00095710" 2017
"00095710" 2018
"00101J10" 2013
"00101J10" 2014
"00101J10" 2015
"00105510" 2007
"00105510" 2008
"00105510" 2009
"00105510" 2010
"00105510" 2011
"00105510" 2012
"00105510" 2013
"00105510" 2014
"00105510" 2015
"00105510" 2016
"00105510" 2017
"00105510" 2018
"00108410" 2007
"00108410" 2008
"00108410" 2009
"00108410" 2010
"00108410" 2011
"00108410" 2012
"00108410" 2013
"00108410" 2014
"00108410" 2015
"00108410" 2016
"00108410" 2017
"00108410" 2018
"00130H10" 2007
"00130H10" 2008
"00130H10" 2009
"00130H10" 2010
"00130H10" 2011
"00130H10" 2012
"00130H10" 2013
"00130H10" 2014
"00130H10" 2015
"00130H10" 2016
"00130H10" 2017
"00130H10" 2018
"00130H10" 2019
"00154710" 2008
"00154710" 2009
"00154710" 2010
"00154710" 2011
"00154710" 2012
"00154710" 2013
"00154710" 2014
"00154710" 2015
"00154710" 2016
"00154710" 2017
"00154710" 2018
"00163T10" 2007
"00163T10" 2008
"00163T10" 2009
"00163T10" 2010
"00163U10" 2016
"00163U10" 2017
"00163U10" 2018
"00163U10" 2019
"00164V10" 2012
"00164V10" 2013
"00164V10" 2014
"00164V10" 2015
end
The second data set has multiple observations per firm(CUSIP) per year, as it contains information on the individual directors of the firms in the first dataset.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 CUSIP double year
"00846U101" 2007
"00846U101" 2007
"00846U101" 2007
"00846U101" 2007
"00846U101" 2007
"00846U101" 2007
"00846U101" 2007
"00846U101" 2007
"00846U101" 2008
"00846U101" 2008
"00846U101" 2008
"00846U101" 2008
"00846U101" 2008
"00846U101" 2008
"00846U101" 2008
"00846U101" 2008
"00846U101" 2008
"00846U101" 2009
"00846U101" 2009
"00846U101" 2009
"00846U101" 2009
"00846U101" 2009
"00846U101" 2009
"00846U101" 2009
"00846U101" 2009
"00846U101" 2009
"00846U101" 2010
"00846U101" 2010
"00846U101" 2010
"00846U101" 2010
"00846U101" 2010
"00846U101" 2010
"00846U101" 2010
"00846U101" 2010
"00846U101" 2011
"00846U101" 2011
"00846U101" 2011
"00846U101" 2011
"00846U101" 2011
"00846U101" 2011
"00846U101" 2011
"00846U101" 2011
"00846U101" 2011
"00846U101" 2012
"00846U101" 2012
"00846U101" 2012
"00846U101" 2012
"00846U101" 2012
"00846U101" 2012
"00846U101" 2012
"00846U101" 2012
"00846U101" 2012
"00846U101" 2013
"00846U101" 2013
"00846U101" 2013
"00846U101" 2013
"00846U101" 2013
"00846U101" 2013
"00846U101" 2013
"00846U101" 2013
"00846U101" 2013
"00846U101" 2014
"00846U101" 2014
"00846U101" 2014
"00846U101" 2014
"00846U101" 2014
"00846U101" 2014
"00846U101" 2014
"00846U101" 2014
"00846U101" 2015
"00846U101" 2015
"00846U101" 2015
"00846U101" 2015
"00846U101" 2015
"00846U101" 2015
"00846U101" 2015
"00846U101" 2015
"00846U101" 2015
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2016
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2017
"00846U101" 2018
"00846U101" 2018
end
I tried merging 1:m, but tabulating afterwards showed that the first dataset had increased significantly in observations. How would I go about merging while keeping the observations in the first dataset at their original level, since I only need one observation per year per firm for those variables? The basic idea is that within one dataset I would have observations for all the different directors, but for some variables only for the company as a whole.
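
One sketch (firmpanel.dta and directors.dta are placeholder file names): aggregate the director-level file to one row per CUSIP-year first, then merge 1:1, so the firm-year file keeps its original number of observations. It is also worth noting that the two excerpts show 8-character and 9-character CUSIPs, which would need to be harmonized before anything can match.

Code:
use directors, clear
collapse (count) n_directors = year, by(CUSIP year)   // count of director rows per firm-year
tempfile byfirmyear
save `byfirmyear'

use firmpanel, clear
merge 1:1 CUSIP year using `byfirmyear', keep(master match) nogenerate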

Anyone want to do my homework?

Anyone want to do a statistics assignment for me? I am strugglinggg. It's an elementary course using Stata and I'm supposed to analyze political data and make sense of it. My research question is "are unhappy people more likely to engage in political rallies/marches?". I have census data that I'm supposed to manipulate to answer this, looking at standard deviations, differences of means, standard errors, p-values, etc.

expression too long r(130) while specifying starting values

I am using the sfpanel command and I need to set initial values for the variables before running the regression.
To set the initial values, I am using the svfrontier() option of sfpanel.

Since I’m including user fixed effects (and there are ~2000 Users in the dataset), I need to specify around 2000 starting values. This becomes a very long expression!

Command: sfpanel y dummy2 dummy3 i.userid, model(tfe) dist(tnormal) svfrontier( 1 2 3 … 2000 )

The numbers inside the bracket (i.e. 1 2 3…) are the starting values here (say).

I’m getting the following Error:

expression too long
r(130);

Things I've tried:
1) I tried this on a smaller dataset (with 15 users) and it worked perfectly fine, but for the complete dataset (~2000 users) the expression is much longer and this becomes a problem.
2) I tried #delimit in the do-file as well:
#delimit ;
sfpanel y dummy2 dummy3 i.userid, model(tfe) dist(tnormal) svfrontier(1 2 3 ... 2000);
#delimit cr
This didn't work for the complete dataset (although it did work for the smaller dataset, so it shouldn't be a syntax error).
Any help will be greatly appreciated! Thanks, Aishvarya

Sunday, November 29, 2020

Combining multiple variables

Hello!

I just started learning Stata. I am using a census data set and trying to combine a bunch of different variables to make one "Happy People" vs "Unhappy People" variable and then see which group is more active in political rallies. The variables I want to use are:

p_psych 1 (Self-Described Anxious, rated 1 - 5)
p_mental (Mental Health, rated 1 - 5)
p_life_ideal (In most ways my life is ideal, rated 1 - 5)
p_life_sat (I am satisfied with my life, rated 1 - 5)

If there's a way to combine these I would love to hear your thoughts! I have no idea what I'm doing so anything would help!
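
A rough sketch of one way to do it (p_psych1 is a guess at the actual name listed as "p_psych 1"; this also assumes higher values mean "happier" on every item, so any item coded the other way would need reverse-coding first):

Code:
egen happy_score = rowmean(p_psych1 p_mental p_life_ideal p_life_sat)
summarize happy_score, detail
gen byte happy_group = happy_score > r(p50) if !missing(happy_score)
label define happy_group 0 "Unhappy" 1 "Happy"
label values happy_group happy_group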

Thank you!

Split file by group i.e. group 1 (non clinical) vs group 2 (clinical)

Hi there,

Might someone please advise how I can split my Stata dataset by one variable, "Group", which has two levels: 1 (non-clinical) vs 2 (clinical)? I would like to be able to compare those two groups.

I think I'm overthinking this - many thanks in advance for your time and expertise.
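
Usually there is no need to physically split the file; most commands can run by group or take the group as an argument (outcome below is a placeholder variable name):

Code:
bysort Group: summarize outcome
ttest outcome, by(Group)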

Control variable that varies across time but not across countries: how to add it to the dataset and regression

Hello everyone,

For my regression I am using the VIX as a proxy for volatility and the TED spread as a proxy for the global risk-free rate. My data consist of around 85 countries over 20 years in long format. These two rates vary across time but not between countries: they change from year to year over the 20 years but are the same for all countries in each individual year, as they proxy for global factors. I am wondering how to include these two variables in my dataset. For now I have added them by copying them across all countries, so they are exactly the same for each country, and I was wondering whether this is the correct approach or whether there is some other way to add them to the regression or data.
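
Copying the same yearly value to every country is one workable way; for reference, a sketch of the merge-based alternative, with vix_ted.dta as a placeholder file holding one row per year:

Code:
* country-year panel in memory; attach the year-level global series
merge m:1 year using vix_ted, keep(master match) nogenerate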

Kind regards,

Max Verheijen

Quintile Sorting two variables

I have monthly observations from various funds over a long period of time, but the time series are unbalanced. For each fund I have an average alpha per fund and a turnover ratio per year, and now I want to divide the turnover ratio into quintiles and then sort the mean alphas into the corresponding turnover quintiles, to see in the end which quintile has the best alphas (i.e., do more active funds perform better?).
In the end I need two panels: Panel A, which assigns every observation to a quintile, so that one fund can move between quintiles, and Panel B, where I measure a mean turnover and a mean alpha for each fund and assign each fund to one fixed quintile based on its mean turnover ratio and alpha.
My data in dataex:

clear
input double(portfolioid Turnover) float mean_Turnover double AlphaCAPM float mean_AlphaCAPM
4 1.6770000457763672 1.417865 .0011276676681867825 -.0006093712
4 1.6770000457763672 1.417865 -.00066037569874813 -.0006093712
4 1.6770000457763672 1.417865 -.003203570453383219 -.0006093712
4 1.6770000457763672 1.417865 -.005067735517733929 -.0006093712
4 1.6770000457763672 1.417865 -.004276204678426189 -.0006093712
4 1.6770000457763672 1.417865 -.00264340576862393 -.0006093712
4 1.6770000457763672 1.417865 -.004977918697552257 -.0006093712
4 1.6770000457763672 1.417865 -.0013603183772131176 -.0006093712
4 1.6770000457763672 1.417865 -.0020266099299921167 -.0006093712
4 1.6770000457763672 1.417865 -.0011593780606042707 -.0006093712
4 1.6770000457763672 1.417865 -.0017615356820123058 -.0006093712
4 1.6770000457763672 1.417865 -.0024929980839788132 -.0006093712
4 1.6770000457763672 1.417865 .0028820224147934476 -.0006093712
4 1.6770000457763672 1.417865 .004593126737425821 -.0006093712
4 1.6770000457763672 1.417865 .003565814453837919 -.0006093712
4 1.6770000457763672 1.417865 .0015064737708275246 -.0006093712
4 1.6770000457763672 1.417865 -.001450077625743531 -.0006093712
4 1.3890000581741333 1.417865 .0025399518285950835 -.0006093712
4 1.3890000581741333 1.417865 -.00008314149448057293 -.0006093712
4 1.3890000581741333 1.417865 -.00264706689732443 -.0006093712
4 1.3890000581741333 1.417865 -.0046777288729113395 -.0006093712
4 1.3890000581741333 1.417865 -.00914245381680592 -.0006093712
4 1.3890000581741333 1.417865 -.00700097423044709 -.0006093712
4 1.3890000581741333 1.417865 -.008433296568919339 -.0006093712
4 1.3890000581741333 1.417865 -.013204861540199477 -.0006093712
4 1.3890000581741333 1.417865 -.014364183593742185 -.0006093712
4 1.3890000581741333 1.417865 -.015444051513141611 -.0006093712
4 1.3890000581741333 1.417865 -.01279344256693269 -.0006093712
4 1.3890000581741333 1.417865 -.009466272067912462 -.0006093712
4 1.3890000581741333 1.417865 -.007383238950604474 -.0006093712
4 1.3890000581741333 1.417865 -.006018720126566169 -.0006093712
4 1.3890000581741333 1.417865 -.006023297375925912 -.0006093712
4 1.3890000581741333 1.417865 -.0032789059631669858 -.0006093712
4 1.3890000581741333 1.417865 -.004582010000065449 -.0006093712
4 1.3890000581741333 1.417865 -.0033608506502379766 -.0006093712
4 1.100000023841858 1.417865 .03921126931429379 -.0006093712
4 1.100000023841858 1.417865 .037104918792983395 -.0006093712
4 1.100000023841858 1.417865 .036233238601829657 -.0006093712
4 1.100000023841858 1.417865 .03443272530766962 -.0006093712
4 1.100000023841858 1.417865 .02526996563362867 -.0006093712
4 1.100000023841858 1.417865 .019073450673706516 -.0006093712
4 1.100000023841858 1.417865 .018588612534819627 -.0006093712
4 1.100000023841858 1.417865 .007151904262588452 -.0006093712
4 1.100000023841858 1.417865 .014357413609561769 -.0006093712
4 1.100000023841858 1.417865 .014499444223885 -.0006093712
4 1.100000023841858 1.417865 .013921937092171734 -.0006093712
4 1.100000023841858 1.417865 .01271685737387346 -.0006093712
4 1.5399999618530273 1.417865 .012079164666963182 -.0006093712
4 1.5399999618530273 1.417865 .00716736984339204 -.0006093712
4 1.5399999618530273 1.417865 .00788063315810916 -.0006093712
4 1.5399999618530273 1.417865 .00681633879941004 -.0006093712
4 1.5399999618530273 1.417865 .006783476955592701 -.0006093712
4 1.5399999618530273 1.417865 .0008174012149019787 -.0006093712
4 1.5399999618530273 1.417865 .00277341205796916 -.0006093712
4 1.5399999618530273 1.417865 .0017975961479823905 -.0006093712
4 1.5399999618530273 1.417865 .002657955196104919 -.0006093712
4 1.5399999618530273 1.417865 .0009851211820280752 -.0006093712
4 1.5399999618530273 1.417865 .0048100191602722 -.0006093712
4 1.5399999618530273 1.417865 .005723510274338288 -.0006093712
4 1.3799999952316284 1.417865 .005084054580076671 -.0006093712
4 1.3799999952316284 1.417865 .0074259662326077325 -.0006093712
4 1.3799999952316284 1.417865 .01017841319013133 -.0006093712
4 1.3799999952316284 1.417865 .011344522290153904 -.0006093712
4 1.3799999952316284 1.417865 .0119847233815383 -.0006093712
4 1.3799999952316284 1.417865 .00846514544659567 -.0006093712
4 1.3799999952316284 1.417865 .00872006067702675 -.0006093712
4 1.3799999952316284 1.417865 .007937297081933515 -.0006093712
4 1.3799999952316284 1.417865 .008264231995338047 -.0006093712
4 1.3799999952316284 1.417865 .002626290973518384 -.0006093712
4 1.3799999952316284 1.417865 .004769733071080048 -.0006093712
4 1.3799999952316284 1.417865 .004288828037580048 -.0006093712
4 1.5099999904632568 1.417865 .0030103450348373043 -.0006093712
4 1.5099999904632568 1.417865 .0010676980608890965 -.0006093712
4 1.5099999904632568 1.417865 -.00198054981620948 -.0006093712
4 1.5099999904632568 1.417865 -.0002576709104457238 -.0006093712
4 1.5099999904632568 1.417865 .001719370536849455 -.0006093712
4 1.5099999904632568 1.417865 -.0010946970159981077 -.0006093712
4 1.5099999904632568 1.417865 -.00041203850616485654 -.0006093712
4 1.5099999904632568 1.417865 .0008502249132824116 -.0006093712
4 1.5099999904632568 1.417865 -.0010846949403825468 -.0006093712
4 1.5099999904632568 1.417865 -.003410998505710562 -.0006093712
4 1.5099999904632568 1.417865 -.0031783749851852297 -.0006093712
4 1.5099999904632568 1.417865 -.002886334290051005 -.0006093712
4 1.5499999523162842 1.417865 -.0023420224475766208 -.0006093712
4 1.5499999523162842 1.417865 -.0011567322096284784 -.0006093712
4 1.5499999523162842 1.417865 -.002413825004820118 -.0006093712
4 1.5499999523162842 1.417865 -.003561881440058567 -.0006093712
4 1.5499999523162842 1.417865 -.004437255144742497 -.0006093712
4 1.5499999523162842 1.417865 -.001321013824925759 -.0006093712
4 1.5499999523162842 1.417865 .0009930816087858965 -.0006093712
4 1.5499999523162842 1.417865 -.000030053225495138558 -.0006093712
4 1.5499999523162842 1.417865 -.00025366199256105816 -.0006093712
4 1.5499999523162842 1.417865 .0031576600593439044 -.0006093712
4 1.5499999523162842 1.417865 .0017091193316435445 -.0006093712
4 1.5499999523162842 1.417865 .002972087483720765 -.0006093712
4 1.2799999713897705 1.417865 .0015071261897042498 -.0006093712
4 1.2799999713897705 1.417865 -.000771124694131986 -.0006093712
4 1.2799999713897705 1.417865 -.002834069970186823 -.0006093712
4 1.2799999713897705 1.417865 -.0039409218708400025 -.0006093712
4 1.2799999713897705 1.417865 -.004219105288234017 -.0006093712
end
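
A sketch of both panels using xtile (quintiles over all observations for Panel A, and over one row per fund for Panel B; whether the quintiles should instead be formed within each year is a separate choice):

Code:
* Panel A: quintiles over all fund-month observations
xtile turn_q5_A = Turnover, nq(5)
tabstat AlphaCAPM, by(turn_q5_A) statistics(mean)

* Panel B: one row per fund, quintiles of the fund-level means
preserve
collapse (mean) mean_Turnover mean_AlphaCAPM, by(portfolioid)
xtile turn_q5_B = mean_Turnover, nq(5)
tabstat mean_AlphaCAPM, by(turn_q5_B) statistics(mean)
restore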



Displaying results in full form - double format

Hi All,

I am using the following command in Stata 13.1

total subsidyamt, over(fiscal_year1)

to calculate the total amount of subsidy provided in each fiscal year. However, the results display the total subsidy amounts in scientific notation. Is there a way to display the results as full integers? The display needs to be in double format.

At present the results are displayed as below

Over Total Std. Err. [95% Conf. Interval]

subsidyamt
1993 566000 12070.11 542342.9 589657.1
1994 4.54e+07 107917.5 4.52e+07 4.56e+07
1995 6.38e+07 128991.8 6.36e+07 6.41e+07
1996 7.34e+07 141826.7 7.31e+07 7.36e+07
1997 8.54e+07 158783.3 8.51e+07 8.58e+07
1998 9.23e+07 168680.5 9.20e+07 9.26e+07
1999 1.10e+08 191806.5 1.09e+08 1.10e+08
2000 1.28e+08 203604 1.28e+08 1.29e+08
2001 1.15e+08 200214.9 1.14e+08 1.15e+08
2002 1.14e+08 197153.4 1.14e+08 1.14e+08
2003 8.37e+07 175729.6 8.33e+07 8.40e+07
2004 1.16e+08 207778.1 1.15e+08 1.16e+08
2005 1.16e+08 208659.8 1.15e+08 1.16e+08
2006 1.29e+08 199145.2 1.28e+08 1.29e+08
2007 1.18e+08 205689.4 1.18e+08 1.19e+08
2008 2.10e+08 218814.2 2.10e+08 2.11e+08
2009 2.56e+08 247226.8 2.56e+08 2.57e+08
2010 2.18e+08 218655 2.17e+08 2.18e+08
2011 1.84e+08 204790.8 1.84e+08 1.84e+08
2012 5.70e+08 534520 5.69e+08 5.71e+08
2013 1.04e+09 720283.5 1.04e+09 1.04e+09
2014 6.10e+08 561786.4 6.09e+08 6.11e+08
2015 3.57e+08 441164.8 3.56e+08 3.58e+08
2016 2.55e+07 214876.5 2.50e+07 2.59e+07
2018 9.39e+07 163531.4 9.36e+07 9.42e+07


----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long subsidyamt float fiscal_year1
30000 1953
30000 1969
30000 1982
10000 1993
10000 1993
....
10000 1993
 7000 1993
 7000 1993
 7000 1993
 7000 1994
10000 1994
10000 1994
...
10000 1994
10000 1994
 7000 1994
10000 1994
10000 1994
10000 1994
10000 1994
end
------------------ copy up to and including the previous line ------------------

Listed 100 out of 366242 observations
Use the count() option to list more

Thank you.
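For what it's worth, a hedged sketch of two ways to see the totals in fixed-point rather than scientific notation (assuming total accepts the standard cformat() display option in this Stata version):

Code:
* sketch only: fixed-point display of the estimated totals
total subsidyamt, over(fiscal_year1) cformat(%14.0fc)

* or redisplay the stored totals from e(b) with an explicit format
total subsidyamt, over(fiscal_year1)
matrix b = e(b)
matrix list b, format(%14.0fc)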

Very weird spmap problem: "master data not sorted"

Hello all

Suddenly I am having an unexplained problem with spmap: syntax that previously worked now fails, reporting the error "master data not sorted". As far as I know, sorting _ID in the master data is not normally required, since I can run spmap regardless of how _ID is sorted in other .dta files that still work. The problem seems quite random and sudden: some .dta files no longer work with spmap, always reporting this error.

Do you have any idea? I am quite lost now.

Thanks in advance, kind regards.
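A hedged workaround that sometimes clears this error is simply to sort the master data on _ID before calling spmap. The dataset, variable and coordinate-file names below are placeholders, not taken from the post:

Code:
* minimal sketch, assuming the usual spmap setup with an _ID variable in the
* master data and a coordinates file produced by shp2dta
use master_data, clear
sort _ID                        // make sure the master data are sorted on _ID
spmap outcome using coords.dta, id(_ID) fcolor(Blues)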

R2 difference from different regressions

Hello, I'm working on replicating Peters and Taylor (2016).
I want to know how to compare R^2 values from different regressions, as in their Panel C.

for example
panel A regression
xi: xtreg y x_a i.fyear, fe cluster(gvkey)

panel B regression
xi: xtreg y x_b i.fyear, fe cluster(gvkey)

How can I test whether the R2 values in Panels A and B are different, taking into account the correlation across regressions and again clustering by firm?
thanks!
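One hedged way to approach this (a sketch, not the Peters and Taylor procedure; y, x_a, x_b, fyear and gvkey follow the post, everything else is an assumption) is to bootstrap the difference in R-squared, resampling whole firms:

Code:
* sketch: bootstrap the difference in R-squared between the two specifications,
* resampling firms (clusters); areg stands in here for the fixed-effects estimator
capture program drop r2diff
program define r2diff, rclass
    areg y x_a i.fyear, absorb(gvkey)     // Panel A specification
    local r2a = e(r2)
    areg y x_b i.fyear, absorb(gvkey)     // Panel B specification
    return scalar diff = `r2a' - e(r2)
end
bootstrap diff=r(diff), cluster(gvkey) reps(500) seed(12345): r2diff
* note: strictly, repeated clusters in a bootstrap draw may call for idcluster() as well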

Year and industry fixed effects

Hello;

I have a panel dataset of the S&P 500 from 2002 to 2019 and need to implement industry and year fixed effects.

However, I run into a problem when I use the xtset command.

My regression command is xtreg firm_beta esg_single (certain control variables), fe

with xtset company_key year; however, this gives firm fixed effects.

Could you help me out?

Thanks, Patrick!
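A hedged sketch of the two usual routes (industry and the control variable names are placeholders I am assuming exist in the data):

Code:
* option 1: keep firm fixed effects (which already absorb time-invariant
* industry membership) and add year dummies
xtset company_key year
xtreg firm_beta esg_single control1 control2 i.year, fe vce(cluster company_key)

* option 2: industry and year fixed effects instead of firm fixed effects
regress firm_beta esg_single control1 control2 i.industry i.year, vce(cluster company_key)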

An issue using esttab syntax

Hello,
I would like to show the `y' variable in Excel when running each of the following regressions with the "esttab" syntax. In other words, I would like to show in Excel which y variable was used to divide the sample for each regression (I have over 50 y variables).

Code:
foreach y of varlist H_* {
    eststo: quietly reg `ylist' `xlist'* if `y'==1, vce(cluster firm)
}
Code:
esttab using AAA, csv label wide onecell numbers scalars (F) replace
Thank you very much.
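A hedged sketch (keeping the post's `ylist' and `xlist' macros, which I assume are defined elsewhere): naming each stored estimate after the splitting variable lets esttab print that name as the column title via mtitles.

Code:
foreach y of varlist H_* {
    eststo `y', title("`y'"): quietly reg `ylist' `xlist'* if `y'==1, vce(cluster firm)
}
esttab using AAA, csv label wide onecell numbers mtitles scalars(F) replace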

How to rename and format variables from appended tempfile?

Hi Statalist,

I'm trying to append, rename, format and generate new variables. My goals are to:
1. rename the variables "SCH_ID", "School_id" and "SchoolID" into "schoolid", since they hold the same content;
2. generate a data-source variable from each of the tempfiles;
3. destring the variable "PSCGC_MUN", which is string in some of the sources. My code is below:

clear all
import excel "...\SY2017-2018 EnrollmentDB.xlsx", sheet("ES") cellrange(A2:AU2366) firstrow
tempfile sy2017_18enrol_es
save `sy2017_18enrol_es', replace
clear all
import excel "...\SY2017-2018 EnrollmentDB.xlsx", sheet("SS") cellrange(A2:BC483) firstrow
tempfile sy2017_18enrol_ss
save `sy2017_18enrol_ss', replace
clear all
import excel "...\SY2018-2019 EnrollmentDB.xlsx", sheet("ES") cellrange(A2:BN2320) firstrow
tempfile sy2018_19enrol_es
save `sy2018_19enrol_es', replace
clear
import excel "...\SY2018-2019 EnrollmentDB.xlsx", sheet("SS") cellrange(A2:BF472) firstrow
tempfile sy2018_19enrol_ss
save `sy2018_19enrol_ss', replace
clear
import excel ...\SY 2019-2020.xlsx", sheet("BARMM") cellrange(A5:CA2723) firstrow
tempfile sy2019_20enrol_es_ss
save `sy2019_20enrol_es_ss', replace
clear
set more off
import excel "...\SY 2020-2021 Enrolment.xlsx", sheet("Sheet1") firstrow
tempfile sy2020_21enrol_es_ss
save `sy2020_21enrol_es_ss', replace
clear all
set more off
local forms `sy2017_18enrol_es' `sy2017_18enrol_ss' `sy2018_19enrol_es' `sy2018_19enrol_ss' `sy2019_20enrol_es_ss' `sy2020_21enrol_es_ss'
foreach file of local forms {
local datasource "`file'"
foreach v in "`file'" {
capture confirm var SCH_TYPE
capture confirm var SCH_ID
capture confirm var School_id
capture confirm numeric var PSCGC_MUN
if !_rc {
capture ren SCH_TYPE SchoolType
capture ren SCH_ID SchoolID
capture ren School_id SchoolID
capture destring PSCGC_MUN, replace
}
gen data=`datasource'
append using "`file'" //don't want to use force here
}
}

Thank you in advance,
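For what it's worth, a hedged sketch of one way to restructure the final loop (it assumes the tempfile locals saved above are still defined, and that the common name wanted in goal 1 is lowercase schoolid):

Code:
* sketch only: load each tempfile, harmonise names, tag the source, then append
local forms sy2017_18enrol_es sy2017_18enrol_ss sy2018_19enrol_es sy2018_19enrol_ss sy2019_20enrol_es_ss sy2020_21enrol_es_ss
local i = 0
foreach f of local forms {
    use "``f''", clear                       // f holds the name of the tempfile local
    capture rename SCH_TYPE SchoolType
    capture rename SCH_ID schoolid           // goal 1: one common name
    capture rename School_id schoolid
    capture rename SchoolID schoolid
    capture confirm string variable PSCGC_MUN
    if !_rc destring PSCGC_MUN, replace      // goal 3: destring only where it is string
    gen data = "`f'"                         // goal 2: record the data source
    local ++i
    tempfile fixed`i'
    save `fixed`i''
}
use `fixed1', clear
forvalues j = 2/`i' {
    append using `fixed`j''
}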

Comparing independent variables in different models with the same dependent variable

Hello,

I am running three multiple linear regression models. In all three, the dependent variable is the democracy index score of a range of countries over several years that the EU has agreements with. I use three independent variables in each model. Two of those are the same in each model: a sum of imports and exports from and to the EU divided by the GDP of the country, and a score on network governance with the EU. However, the third independent variable is a binary variable which tells whether the country has a particular agreement with the EU in force or not. In each model, this is a different agreement type (there are 3 of those agreements possible (an SAA, an AA, and a CPA), hence 3 models). A country can only have one of the particular agreements in force at a time, or it can have no agreement at all in force. The three models thus look like this:
  • M1:
    • DV: democracy score
      • IV1: SAA
      • IV2: (imports+exports)/gdp score
      • IV3: network governance score
  • M2:
    • DV: democracy score
      • IV1: AA
      • IV2: (imports+exports)/gdp score
      • IV3: network governance score
  • M3:
    • DV: democracy score
      • IV1: CPA
      • IV2: (imports+exports)/gdp score
      • IV3: network governance score
My question is: how can I compare the independent variables of the different models among each other? For example: if both SAA and AA are positive and significant, but the SAA in model 1 has a higher beta-value than the AA in model 2, is it possible to say that the SAA accounts for a higher variance in the democracy scores overall?

If anything is unclear or you need more information, please let me know.

Best, Harry
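One hedged way to put a formal test on the comparison of the agreement coefficients (a sketch only: variable names are placeholders, and this ignores any panel structure) is to estimate the models jointly with suest and test the two coefficients against each other:

Code:
* sketch: seemingly unrelated estimation, then a Wald test of equal coefficients
regress democracy saa trade_gdp netgov
estimates store m1
regress democracy aa trade_gdp netgov
estimates store m2
suest m1 m2, vce(cluster country)
test [m1_mean]saa = [m2_mean]aa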

Latent class analysis - constraining categorical variables

Hi Statalist!

I am trying to perform a latent class analysis and I am very new to this method, so I hope you will understand if my question is a bit naïve.
I would like to perform the analysis using nine categorical variables (each with five categories) and obtain three classes. Here is the code I am using:

Code:
gsem (e4_1b e4_2b e4_3b e4_4b e4_5b e4_6b e4_7b e4_8b e4_9b <-, ologit), lclass(C 3)
When I try to estimate the marginal predicted means of the outcomes within each latent class (estat lcmean), the command seems to run forever without giving any output. From what I have read on the forum, I suspect the issue is related to one coefficient exceeding 15 in absolute value. Below you can find part of the output; the extreme coefficient is the first cutpoint of /e4_4b in class 2.

Code:
Class          : 2

Response       : e4_1b
Family         : ordinal
Link           : logit

Response       : e4_2b
Family         : ordinal
Link           : logit

Response       : e4_3b
Family         : ordinal
Link           : logit

Response       : e4_4b
Family         : ordinal
Link           : logit

Response       : e4_5b
Family         : ordinal
Link           : logit

Response       : e4_6b
Family         : ordinal
Link           : logit

Response       : e4_7b
Family         : ordinal
Link           : logit

Response       : e4_8b
Family         : ordinal
Link           : logit

Response       : e4_9b
Family         : ordinal
Link           : logit

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
/e4_1b       |
        cut1 |  -6.207387   .2099544                      -6.61889   -5.795884
        cut2 |  -4.585708     .09238                     -4.766769   -4.404646
        cut3 |  -1.894731   .0285634                     -1.950714   -1.838748
        cut4 |   .7498389   .0206396                      .7093861    .7902917
-------------+----------------------------------------------------------------
/e4_2b       |
        cut1 |  -3.075159   .0378004                     -3.149246   -3.001071
        cut2 |  -.6201423   .0158303                     -.6511692   -.5891154
        cut3 |   1.625804   .0206937                      1.585246    1.666363
        cut4 |   4.059866   .0600725                      3.942126    4.177606
-------------+----------------------------------------------------------------
/e4_3b       |
        cut1 |  -4.235327   .0753749                     -4.383059   -4.087595
        cut2 |  -2.243794   .0294187                     -2.301453   -2.186134
        cut3 |   -.249513   .0220957                     -.2928197   -.2062063
        cut4 |   2.190033   .0348925                      2.121645    2.258421
-------------+----------------------------------------------------------------
/e4_4b       |
        cut1 |  -24.39119   2397.723                     -4723.841    4675.059
        cut2 |  -3.252751   .0575151                     -3.365479   -3.140024
        cut3 |  -.6396046   .0260331                     -.6906285   -.5885807
        cut4 |    2.04866   .0369879                      1.976165    2.121155
-------------+----------------------------------------------------------------
/e4_5b       |
        cut1 |  -5.785578   .1817201                     -6.141743   -5.429414
        cut2 |   -2.87042   .0414458                     -2.951652   -2.789188
        cut3 |  -.8560699   .0229513                     -.9010535   -.8110862
        cut4 |   1.127883   .0237217                       1.08139    1.174377
-------------+----------------------------------------------------------------
/e4_6b       |
        cut1 |  -.2159903   .0155038                     -.2463771   -.1856035
        cut2 |   1.242691   .0190969                      1.205261     1.28012
        cut3 |   2.667265   .0303306                      2.607818    2.726711
        cut4 |   5.438793   .1438952                      5.156764    5.720823
-------------+----------------------------------------------------------------
/e4_7b       |
        cut1 |   -1.62553   .0205138                     -1.665736   -1.585324
        cut2 |   .0296444   .0151189                      .0000119    .0592768
        cut3 |   1.490041   .0191115                      1.452583    1.527499
        cut4 |   3.487356   .0482501                      3.392788    3.581925
-------------+----------------------------------------------------------------
/e4_8b       |
        cut1 |   1.331648    .023303                      1.285975    1.377321
        cut2 |   2.582377   .0354519                      2.512892    2.651861
        cut3 |    3.56694    .043742                      3.481207    3.652673
        cut4 |   6.136812   .1994962                      5.745807    6.527817
-------------+----------------------------------------------------------------
/e4_9b       |
        cut1 |   .0988198   .0149272                      .0695631    .1280765
        cut2 |   1.434078   .0194443                      1.395968    1.472188
        cut3 |   2.952968   .0329211                      2.888444    3.017492
        cut4 |   5.747459   .1649939                      5.424077    6.070841
------------------------------------------------------------------------------
As I understand it, I should constrain that coefficient to 15. I tried the following code, but it is not adequate for my categorical variables.

Code:
gsem (e4_1b e4_2b e4_3b e4_4b e4_5b e4_6b e4_7b e4_8b e4_9b <- ) (2: e4_4b  <- _cons@15) , ologit lclass(C 3)
Could anyone please suggest the right specification? Any suggestion is more than welcome.

Kind regards,
Marla


Testing significance for Alphas CAPM

Hello everybody,

I have calculated monthly alphas and betas for monthly fund data from different portfolioids. How can I test the statistical significance of all calculated alphas? Can I do it with the Fama-MacBeth regression?
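If the estimated alphas are stored one per portfolioid in a variable (call it alpha, which is an assumption here), a hedged starting point is simply to test whether their mean differs from zero; a Fama-MacBeth-style second stage is another option when the alphas vary by month as well.

Code:
* sketch: test whether the average alpha across portfolios differs from zero
ttest alpha == 0
* equivalently, regress alpha on a constant with robust standard errors
regress alpha, vce(robust)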

Survey data- Pool Cross Sectional data with weights- What to do with weights?

Dear all,

I have pooled cross-sectional data (weakly balanced: 9 different years of wealth and income data for individuals belonging to 4 races, 2 genders, and three categories of education). I have been referring to the Stata documentation about this; there is a whole PDF manual on svy.
What I gathered is that with svy we don't need to do any manual calibration of the variables using the weights given in the data. For instance, we don't need to calculate a weighted mean like (X1*W1 + X2*W2 + ...)/(W1 + W2 + ...); we just specify the survey design with svyset and put "svy:" before any estimation command. The data look somewhat like this: each year has multiple individuals in each race category, and probably different individuals are sampled in each year rather than the same individuals every year. I think this is sampling with replacement (since there is a new set of individuals in each year — or not? How would I know?), but I am not sure.

weight Year Race Gender Age Income
1678.4 1990 Black M 55
.2 1990 White M 25
6546 1990 Black M 44
151.55 1990 White F 56
564.55 1991 White F 60
54.66 1991 White M 30
1483.08 1991 Black M 29
452.6 1991 Black F 48
111.56 1992 White M 65
My questions are:
1. How do I know whether the data come from sampling with or without replacement?
2. How do I know whether it is a one-stage or a two-stage design? (This is necessary when specifying the survey design.)
3. Which approach is better for plotting trends in income: using svyset, running regressions and marginsplot; or, without svyset, calculating weighted means and doing a line plot; or
collapse (p50) income [pw=weight], by(year race)?
4. If we use svyset, how do I know the type of weights (aweight, pweight, etc.)?
5. For svyset PSU [pweight=pw], strata(strata) — is it alright if I use year as strata? (See the sketch below.)
6. What is the primary sampling unit here? Each individual, I think?

Any help would be greatly appreciated.
Thank you.
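As a hedged illustration only (the real design variables depend on the survey documentation; here I assume there are no PSU or strata variables beyond the weight):

Code:
* sketch: single-stage design with sampling weights only
svyset _n [pweight=weight]
svy: mean income, over(year)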

Saturday, November 28, 2020

Loops to make variables compatible

Hi, I have a variable which has been coded differently in two datasets. (old one for year 2000 and new one for year 2005 )

Old codes are { 2 4 5 7 8 9 10 11 12 13 14 15 16 18 19 20 21 22 23 24 25 26 27 3 28 29 31 6 32 17 33 30 }

New codes are { 28 18 10 24 6 2 1 29 32 23 27 14 17 13 21 3 8 11 33 16 9 19 35 12 4 26 7 30 31 15 34 25}

So, for example, the variable (let's call it state) took the value 2 in the old coding, but in the new dataset that value corresponds to 28; similarly 4 became 18, and so on.


Both datasets have been appended, with a variable called year taking the value 2000 for the old dataset and 2005 for the new one.


How do I write a loop to convert the old codes of the variable state to the new ones for the year 2000, so that state is compatible across years?
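A hedged sketch of such a loop, using the two code lists exactly as given and writing the result to a new variable so values are not re-mapped twice:

Code:
local old 2 4 5 7 8 9 10 11 12 13 14 15 16 18 19 20 21 22 23 24 25 26 27 3 28 29 31 6 32 17 33 30
local new 28 18 10 24 6 2 1 29 32 23 27 14 17 13 21 3 8 11 33 16 9 19 35 12 4 26 7 30 31 15 34 25
gen state_new = state
local n : word count `old'
forvalues i = 1/`n' {
    local o : word `i' of `old'     // old code
    local m : word `i' of `new'     // matching new code
    replace state_new = `m' if state == `o' & year == 2000
}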

IPTW - strategies

Hi everyone,

I'm evaluating whether increased dosage (number of completed sessions) of a behaviour-change therapeutic program is associated with a disappearance of that behaviour in the 12 months after the program ends. The outcome is binary: did engage in the behaviour vs. did not engage in the behaviour.

Unfortunately, the program is not an RCT; I only have a naturalistic/observational design: another group - those who were referred but never commenced the program - will act as a control group. I'm planning to use IPTW, developing propensity scores to weight the groups so that both are balanced in terms of the likelihood of participating in the program to begin with. I am seeking the average treatment effect on the treated (ATT), relative to never having participated in the program.

I have two questions:

1. I was planning to run an xtmelogit of program participation, followed by predict prob_participation, and then

gen iptw=1/prob_participation
summ iptw


However, I've also seen people use the -pscore- command to develop propensity scores. Is it okay to use the former method, or is it less efficient than pscore?

2. When choosing the covariates to include in propensity score, is it better practice to include only personal characteristics of the sample - e.g., age, gender, education, etc. or is it also okay to include some broader operational factors in the model, - e.g., program trainer identity, timing of the start of the program, etc.? I've seen written in some places that choosing more 'operational factors' can be problematic because they are often more related to treatment than to the outcome and may be confounders.


I'm not sure if it's helpful information, but my data are in wide format.


Thanks in advance! I've never completed IPTW before so it's very new to me

Marlee
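On question 1, a hedged sketch (treat, outcome and the covariates x1-x3 are placeholder names; note that for an ATT the usual weights are 1 for the treated and p/(1-p) for the controls, rather than 1/p):

Code:
* sketch: propensity scores by logit, then ATT-style IPT weights
logit treat x1 x2 x3
predict pscore, pr
gen iptw_att = cond(treat==1, 1, pscore/(1 - pscore))
summarize iptw_att, detail

* built-in alternative that reports the ATT directly
teffects ipw (outcome) (treat x1 x2 x3, logit), atet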



creating a loop

Hi guys:
As you can see, below are four repetitive blocks of code; the only difference is the value of ccode2 (31, 40, 41, 42). I am wondering if there is any way to write a loop to simplify this repetition (see the sketch after the code). Thanks in advance and happy holidays!
Code:
use "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", clear
drop edtc
sort ccode1 year
by ccode1 year: egen edtc= sum(trade_sha1) if ccode1==2 & ccode2 != 31
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace
drop if edtc == .
sort ccode1 year
quietly by ccode1 year: gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
keep ccode1 ccode2 year edtc
rename edtc edtc1
replace ccode2 = 31
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_edtc.dta", replace
merge ccode1 ccode2 year using "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta"
drop _merge
drop if ccode1_1 ==.
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace


drop edtc
sort ccode1 year
by ccode1 year: egen edtc= sum(trade_sha1) if ccode1==2 & ccode2 != 40
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace
drop if edtc == .
sort ccode1 year
quietly by ccode1 year: gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
keep ccode1 ccode2 year edtc
rename edtc edtc1
replace ccode2 = 40
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_edtc.dta", replace
merge ccode1 ccode2 year using "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta"
drop _merge
drop if ccode1_1 ==.
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace

drop edtc
sort ccode1 year
by ccode1 year: egen edtc= sum(trade_sha1) if ccode1==2 & ccode2 != 41
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace
drop if edtc == .
sort ccode1 year
quietly by ccode1 year: gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
keep ccode1 ccode2 year edtc
rename edtc edtc1
replace ccode2 = 41
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_edtc.dta", replace
merge ccode1 ccode2 year using "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta"
drop _merge
drop if ccode1_1 ==.
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace

drop edtc
sort ccode1 year
by ccode1 year: egen edtc= sum(trade_sha1) if ccode1==2 & ccode2 != 42
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace
drop if edtc == .
sort ccode1 year
quietly by ccode1 year: gen dup = cond(_N==1,0,_n)
drop if dup>1
drop dup
keep ccode1 ccode2 year edtc
rename edtc edtc1
replace ccode2 = 42
sort ccode1 ccode2 year
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_edtc.dta", replace
merge ccode1 ccode2 year using "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta"
drop _merge
drop if ccode1_1 ==.
save "/Users/jack/Library/Mobile Documents/com~apple~CloudDocs/data/COW_Trade_4.0/cow_trade_4.0_rr.dta", replace

Panel index within group

Hey there, I'd like to create a new column indicating the occurrence of non-missing BAM within each id. For example, a new column named "BAM_panel" would run from 1 to 6 for id 3. Simply put, I want a running index for BAM within each id whenever it is not missing. Would you help out? Thanks very much!

Code:
clear
input int id float BAM
3   5
3   2
3   7
3  -6
3   1
3  -4
3   .
4   8
4  21
4   6
4  11
4   2
4  -1
4  -2
4 -15
4  -1
4   4
4   4
4   6
4  11
4  11
4  22
4   7
4  28
4   4
4   8
4   7
4   7
4  35
4  16
4   4
4   7
4  10
4   1
4  -7
4  -5
4   3
4   7
4   8
4  35
4   6
4  10
4   3
4   1
4  -7
4   4
4  -5
4   7
4  -5
4   1
4  -1
4   3
4  -2
4   .
7   .
end
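A hedged sketch that keeps the original row order within each id (BAM_panel is the requested new column):

Code:
gen long obsno = _n                                   // remember the original order
bysort id (obsno): gen BAM_panel = sum(!missing(BAM)) // running count of non-missing BAM
replace BAM_panel = . if missing(BAM)                 // leave missing BAM unindexed
drop obsno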

Question: Generating summary statistics table with tabstat and table

Dear all,

I have recently attempted to generate a descriptive statistic table with the command

Code:
tabstat govtdebt metrde, statistics( mean median ) by(incomelevel)
and it generates the following:



The mean and median are currently shown in row format.

I was hoping to change it to a column format where the mean and median are displayed directly under each variable. For example:



Is tabstat or table able to do this? Otherwise what command should I use? Thank you.
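A hedged guess: tabstat's columns() option may already do this, putting the statistics in the columns with one row per variable within each income level:

Code:
tabstat govtdebt metrde, statistics(mean median) by(incomelevel) columns(statistics)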

Fitting two lines using aaplot

Dear All,
Is it possible to fit two lines (i.e., involve more than two variables) using the aaplot command? If yes, I will appreciate any possible help.

Thanks,
Dapel
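I don't know whether aaplot itself supports two fits, but as a hedged fallback with official graphics (y, x1 and x2 are placeholders), two scatter-plus-lfit overlays give two fitted lines on one plot:

Code:
twoway (scatter y x1) (lfit y x1) (scatter y x2) (lfit y x2), legend(order(2 "fit on x1" 4 "fit on x2"))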

Time fixed effects as dummies change significance

Dear Statalist,
I am doing research on using nighttime light (radiance per area) to predict real regional GDP (random-effects model). I have a dataset of 34 provinces in Indonesia for the period 2012-2020 (quarterly data). My mentor suggested I include time fixed effects (i.qdate) and province fixed effects (i.id) as dummies. But when I do so, my variable of interest (ln_nli) becomes insignificant. Here are my results before and after adding the time and province dummies:


. xtreg ln_pdrb ln_nli, re ro

Random-effects GLS regression Number of obs = 1,122
Group variable: id Number of groups = 34

R-sq: Obs per group:
within = 0.3997 min = 33
between = 0.5016 avg = 33.0
overall = 0.4384 max = 33

Wald chi2(1) = 164.50
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

(Std. Err. adjusted for 34 clusters in id)
------------------------------------------------------------------------------
| Robust
ln_pdrb | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_nli | .1363934 .0106343 12.83 0.000 .1155505 .1572363
_cons | 17.57819 .1798055 97.76 0.000 17.22578 17.9306
-------------+----------------------------------------------------------------
sigma_u | .82280392
sigma_e | .09782348
rho | .98606209 (fraction of variance due to u_i)
------------------------------------------------------------------------------

. xtreg ln_pdrb ln_nli i.qdate i.id, re ro

Random-effects GLS regression Number of obs = 1,122
Group variable: id Number of groups = 34

R-sq: Obs per group:
within = 0.8764 min = 33
between = 1.0000 avg = 33.0
overall = 0.9985 max = 33

Wald chi2(33) = .
corr(u_i, X) = 0 (assumed) Prob > chi2 = .

(Std. Err. adjusted for 34 clusters in id)
------------------------------------------------------------------------------
| Robust
ln_pdrb | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_nli | .0071749 .0078816 0.91 0.363 -.0082727 .0226225
|
qdate |
210 | .0281132 .0042352 6.64 0.000 .0198124 .036414
211 | .0285663 .012437 2.30 0.022 .0041903 .0529424
212 | .024168 .0056676 4.26 0.000 .0130597 .0352764
213 | .0488249 .0062365 7.83 0.000 .0366016 .0610482
214 | .0875023 .0052677 16.61 0.000 .0771778 .0978268
215 | .0941612 .0085182 11.05 0.000 .0774657 .1108566
216 | .0838755 .006852 12.24 0.000 .0704458 .0973053
217 | .1006991 .0076502 13.16 0.000 .085705 .1156933
218 | .1361169 .007639 17.82 0.000 .1211449 .151089
219 | .1439286 .0098438 14.62 0.000 .1246351 .1632221
220 | .1285819 .0110317 11.66 0.000 .1069601 .1502037
221 | .1559425 .0115845 13.46 0.000 .1332374 .1786476
222 | .1867966 .0139732 13.37 0.000 .1594097 .2141835
223 | .1943102 .0125887 15.44 0.000 .1696368 .2189835
224 | .1819726 .013399 13.58 0.000 .1557111 .2082341
225 | .2012336 .0149958 13.42 0.000 .1718423 .2306248
226 | .2399893 .0161248 14.88 0.000 .2083854 .2715933
227 | .2516112 .0159806 15.74 0.000 .2202897 .2829326
228 | .226623 .0141486 16.02 0.000 .1988923 .2543537
229 | .2448326 .0168339 14.54 0.000 .2118388 .2778264
230 | .2867302 .0181616 15.79 0.000 .2511341 .3223263
231 | .2939583 .0181339 16.21 0.000 .2584166 .3295001
232 | .2795545 .0158775 17.61 0.000 .2484351 .3106738
233 | .3027815 .018904 16.02 0.000 .2657303 .3398327
234 | .3326927 .0195459 17.02 0.000 .2943834 .371002
235 | .3371736 .0176719 19.08 0.000 .3025373 .3718099
236 | .3191344 .0178491 17.88 0.000 .2841509 .3541179
237 | .3431905 .0197575 17.37 0.000 .3044666 .3819144
238 | .378396 .0190506 19.86 0.000 .3410574 .4157345
239 | .3853209 .0203378 18.95 0.000 .3454596 .4251822
240 | .3490189 .0190272 18.34 0.000 .3117262 .3863116
241 | .3074161 .02012 15.28 0.000 .2679816 .3468506
|
id |
2 | 1.349615 .0033258 405.80 0.000 1.343097 1.356134
3 | .2081012 .0001396 1490.41 0.000 .2078276 .2083749
4 | 1.35338 .0031468 430.08 0.000 1.347212 1.359547
5 | .080414 .0020315 39.58 0.000 .0764323 .0843958
6 | .8071257 .0031772 254.04 0.000 .8008985 .8133529
7 | -1.094362 .0033559 -326.10 0.000 -1.100939 -1.087784
8 | .5555607 .0047317 117.41 0.000 .5462867 .5648347
9 | -.9177501 .0002953 -3108.17 0.000 -.9183288 -.9171714
10 | .273506 .0109266 25.03 0.000 .2520903 .2949218
11 | 2.518404 .03778 66.66 0.000 2.444356 2.592451
12 | 2.349943 .0151691 154.92 0.000 2.320212 2.379674
13 | 1.945931 .0119341 163.06 0.000 1.922541 1.969321
14 | -.3214616 .0144193 -22.29 0.000 -.349723 -.2932003
15 | 2.446185 .0140759 173.79 0.000 2.418597 2.473773
16 | 1.158497 .0175069 66.17 0.000 1.124184 1.19281
17 | .1139387 .013109 8.69 0.000 .0882455 .1396319
18 | -.3446538 .0030191 -114.16 0.000 -.3505712 -.3387365
19 | -.6924587 .0047129 -146.93 0.000 -.7016958 -.6832216
20 | -.0098659 .0049347 -2.00 0.046 -.0195378 -.0001941
21 | -.3577425 .0049798 -71.84 0.000 -.3675027 -.3479824
22 | 1.333831 .002178 612.41 0.000 1.329562 1.3381
23 | 1.334234 .0017355 768.79 0.000 1.330832 1.337635
24 | -.8315154 .0046088 -180.42 0.000 -.8405486 -.8224823
25 | -.4790523 .0026143 -183.24 0.000 -.4841762 -.4739284
26 | -.3125605 .0025476 -122.69 0.000 -.3175537 -.3075674
27 | .7999971 .0031723 252.18 0.000 .7937795 .8062147
28 | -.4333453 .0034614 -125.19 0.000 -.4401296 -.4265611
29 | -1.630576 .0010813 -1508.04 0.000 -1.632696 -1.628457
30 | -1.477908 .0055417 -266.69 0.000 -1.48877 -1.467047
31 | -1.512727 .0054251 -278.84 0.000 -1.52336 -1.502094
32 | -.7831341 .0059699 -131.18 0.000 -.794835 -.7714333
33 | -.7823274 .0068561 -114.11 0.000 -.7957651 -.7688898
34 | .1184597 .0086356 13.72 0.000 .1015342 .1353851
|
_cons | 17.0143 .0191901 886.62 0.000 16.97669 17.05191
-------------+----------------------------------------------------------------
sigma_u | 0
sigma_e | .04505634
rho | 0 (fraction of variance due to u_i)
------------------------------------------------------------------------------

.