Tuesday, January 31, 2023

How to create a weighted average for multiple observations for a variable by borrower_id and year

Borrower_id Year Loan amount Loan maturity Loan interest
1 2011 101 60 8.5
1 2011 95 55 5.7
2 2011 85 55 8.6
3 2011 90 44 6.5
3 2012 82 46 7.5
4 2011 65 60 6.2
4 2012 67 90 5.4
5 2011 85 45 6.5
5 2011 62 78 4.5
6 2012 55 23 6.5


I would like to calculate the weighted average of loan amount, loan maturity, and loan interest, where the weights are each loan amount relative to the total loan amount within each borrower_id.
For instance, for borrower 1 the weights are 101/196 and 95/196, roughly 0.51 and 0.48, and the weighted average loan amount is 101*0.51 + 95*0.48 = 97.11.
Using the same weights derived from borrower 1's loans (0.51, 0.48), I would like to apply them to the loan maturity and loan interest as well.

thank you so much in advance for your help.
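
A minimal sketch of one possible approach, assuming the variables are named borrower_id, year, loan_amount, loan_maturity, and loan_interest (hypothetical names): an analytic-weighted mean with the loan amount as the weight reproduces the weighted average described above.

Code:
* weights are each loan's amount relative to the borrower-year total (aweights normalize automatically)
collapse (mean) loan_amount loan_maturity loan_interest ///
    [aweight = loan_amount], by(borrower_id year)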

Calculating differences between adjacent elements

Dear all,
I wonder if there is any easy solution available in Mata for calculating the difference between adjacent elements of a (column) vector? For example, the difference of vector Y should return (Y[2]-Y[1], Y[3]-Y[2], ..., Y[n]-Y[n-1]).
Thank you,
Eilya.
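
A minimal Mata sketch (with a made-up example vector) of one way to do this with range subscripts:

Code:
mata:
Y = (2 \ 5 \ 9 \ 16)                  // example column vector
n = rows(Y)
D = Y[|2 \ n|] - Y[|1 \ n-1|]         // (Y[2]-Y[1], Y[3]-Y[2], ..., Y[n]-Y[n-1])
D
end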

save regression results for margins plotting

My question is straightforward. Could Stata save regression results so that I do not have to re-run the model when trying to use margins and marginsplot commands?
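
A minimal sketch of the usual approach, assuming a hypothetical model with outcome y and covariates x1 and x2: store (or save) the estimation results and restore them later; margins then works without re-fitting, provided the estimation data are still in memory.

Code:
regress y x1 x2
estimates store m1                  // keep results in memory
estimates save m1_results, replace  // or save them to disk
* ... later, with the same data in memory ...
estimates restore m1                // or: estimates use m1_results
margins, dydx(x1)
marginsplot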

Estimate the difference-in-differences estimator

Hi,

I restricted my wave sample from 1-10 to waves 9-10, and I was wondering how I would use the bysort command to create a covid variable that shows the difference between the level of wellbeing before (wave 9) and after (wave 10).


I would appreciate any clarifications as I do not have much knowledge of the bysort command.


Kind regards,

Iani
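
A minimal sketch of one way to build the indicator, assuming the wave variable is named wave (hypothetical name); bysort is not actually needed for this step.

Code:
keep if inlist(wave, 9, 10)
gen byte covid = (wave == 10)   // 0 = before (wave 9), 1 = after (wave 10)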



Combining variables in the variable list by country

I am looking at a dataset with three variable lists: life expectancy by country for specific age groups. The age interval is currently four years, and I would like to combine the age groups by country (e.g., combine the first groups into 0-24) so that the life expectancy is combined as well. Could you help me with what the easiest way would be? Thank you!

Interpretation of Competing Risk with stcrreg

I'm new to using competing risk analysis and want to make sure I'm interpreting and using it correctly (given that I'm getting a good result, I don't want to misrepresent the data).
I am trying to use a competing risk (Stcrreg) analysis for an intervention in a clinical study. I am interested in the first 48 hours after randomization. I’ve coded my desired outcome as event=0 and have two competing risk events (event=2, precluding any further possibility of the desired outcome; event=1, impeding the desired outcome). So, if one group has the intervention (=1) and the other group does not get the intervention (=0) can I interpret the subhazard ratio as the probability of experiencing the desired outcome during the time period for the events coded?

Code:
stset time, failure(event==0)
stcrreg intervention, compete(event==1 2)

I come up with SHR = 2.997 (95% CI 1.46 – 6.15)

Is it correct to say that the probability of the desired outcome is 3 times more likely during the first 48 hours with the intervention?

Monday, January 30, 2023

Multiplying observations in a variable list

In a dataset with multiple variable lists, I would like to multiply the values of only selected variables in one variable list by 4. Specifically, I want to keep the observations unchanged if they belong to the age_group "<1 year" or "95 plus" (the age_group variable is not numeric). For all other cases, I want to multiply the observation values by 4. I initially used this command: gen new_var = old_var * 4 if age_group != "<1 year", but I cannot incorporate the 95 plus age group in this command. Also, in the new variable, observations for the <1 age group are set to missing, whereas I would just like to keep their previous values. Thanks!
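
A minimal sketch, assuming the two categories to leave unchanged are stored exactly as "<1 year" and "95 plus" (check the exact strings with tab age_group):

Code:
gen new_var = old_var
replace new_var = old_var * 4 if !inlist(age_group, "<1 year", "95 plus")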

Zero event meta analysis with ipdmetan

Hey everyone, I'm trying to run a two-stage meta-analysis to generate a forest plot with individual participant data using ipdmetan to plot the odds of an event across several trials. Unfortunately, for the majority of trials, in one of the subgroups of interest, there were no events. Is there any way ipdmetan can adjust for zero-event meta-analysis to include these trials in the analysis? I can only see an option "keepall" to include the excluded studies in the forest plot. Reading the literature for zero event meta-analysis, some propose to put in 0.5 events in the subgroup of interest. Is there any way ipdmetan can handle this? Thank you!

Traj command (dropout model)

Hello everyone,

I am using the traj command to examine gait speed trajectories and I have a question about the dropout model option. When I use this option, some of my models have positive BICs and some have negative BICs. However, when I remove the option, all BICs are positive. What is the benefit of using this option? Can I omit it?


Here’s my code and BIC values:

2 groups (with dropout model)

traj, var(walkspeed_*) indep(time_*) model(cnorm) order(0 0) min(0.07533158) max(1.9544182) dropout(0 0)

BIC= -873.03 (N=9525) BIC= -869.39 (N=2836) AIC= -851.54 ll= -845.54

2 groups (without dropout model)

traj, var(walkspeed_*) indep(time_*) model(cnorm) order(0 0) min(0.07533158) max(1.9544182)

BIC= 2004.89 (N=9525) BIC= 2007.31 (N=2836) AIC= 2019.21 ll= 2023.21

Thanks in advance.

Making a dummy for threshold


Hi,

I want to create a dummy I_i (equal to 1 if Em(i) <= median(Em)) that selects firms whose employment Em(i) is below a certain threshold, namely the median of employment over the sample.
firms ID: NPC_FIC
Employment: Em

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input double NPC_FIC float Em
500008310   1
500027353   3
500049799   2
500050215   2
500135017   7
500139105   2
500154496  28
500206942   3
500251226   2
500265754   1
500321035   3
500369375   2
500392001   1
500392920   1
500422489   1
500458395   1
500979142 145
500979162  16
500979181   8
500979184  12
500979206  34
500979220  32
500979239   7
500979255  32
500979269  68
500979271  10
500979282   4
500979284   7
500979293  63
500979319   3
500979326   8
500979329  27
500979360   4
500979373  33
500979374  10
500979383  78
500979450  14
500979456  51
500979474  89
500979489  14
500979491  22
500979522  13
500979524   5
500979526  16
500979530  22
500979531   5
500979532  59
500979537  31
500979543  18
500979551  11
500979559   2
500979561   6
500979563  18
500979564  27
500979566  34
500979567  26
500979569  30
500979574  10
500979596  53
500979625   8
500979634  17
500979645   2
500979673   3
500979679  10
500979689   2
500979712   3
500979737  12
500979740  46
500979749   1
500979750  66
500979758  21
500979809  22
500979810   5
500979812   1
500979820  55
500979838  22
500979871   1
500979874   8
500979890   9
500979935   6
500979937   1
500979940   6
500979959 100
500979962   6
500979964 310
500979981 227
500979987 565
500979989   2
500980043  17
500980048 204
500980053  76
500980077   4
500980081  39
500980087   1
500980105   3
500980109   5
500980110  15
500980146   4
500980167   9
500980175  23
end

Listed 100 out of 24599 observations

First, I made the median:
Code:
egen median_Em = median(Em)
Now, I need to make a dummy that selects firms whose employment is below the median of employment over the sample.
Any suggestion is appreciated.

Cheers,
Paris
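
A minimal sketch using the variables shown in the dataex above:

Code:
egen median_Em = median(Em)
gen byte I_below = (Em <= median_Em)   // 1 if employment is at or below the sample median, 0 otherwise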


Selecting Date Range

I have data where one of the variables is CloseDate, whose variable type is int and whose format is %tdnn/dd/CCYY. I would simply like to restrict the date range when totaling the sum of a variable; in other words, sum the amounts between, for example, 1/1/2022 and 12/31/2022. As a new Stata user, I would appreciate help with what I think should be an easy task. Thanks for your help.
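
A minimal sketch, assuming the variable to be summed is named amount (hypothetical name):

Code:
summarize amount if inrange(CloseDate, td(01jan2022), td(31dec2022))
display r(sum)
* or, for the total with a standard error:
total amount if inrange(CloseDate, td(01jan2022), td(31dec2022))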

How are p-values calculated in an Oaxaca-Blinder Decomposition

Hi all,

I am trying to interpret the results of an Oaxaca-Blinder Decomposition. I am using the popular Oaxaca command. For my project I am just interested in the results of a two-way decomposition (i.e. I don't want the interaction terms), so I am using the pooled option. These results are based off of a logistic regression. Here is my output:

Code:
.
svyset [pweight = weight], str(stratum_var) psu(cluster_var)

tab race, gen(race)
tab educ, gen(educ)
tab hhinc, gen(hhinc)
tab age_g5, gen(age_g5)

oaxaca cohab_par age_g52 age_g53 age_g54 age_g55 race2 race3 race4 imm ///
>          educ2 educ3 educ4 hhinc2 hhinc3 hhinc4, ///
>          by(rural) svy logit pooled

Blinder-Oaxaca decomposition

Number of strata = 18                             Number of obs   =      3,268
Number of PSUs   = 72                             Population size = 38,244,320
Design df       =         54
Model              =     logit
Group 1: rural = 0                              N of obs 1         =      2116
Group 2: rural = 1                              N of obs 2         =       468


Linearized
cohab_par  Coefficient  std. err.      t    P>t     [95% conf. interval]

overall      
group_1    .1405385   .0141326     9.94   0.000     .1122043    .1688727
group_2    .2199063   .0242979     9.05   0.000     .1711919    .2686207
difference   -.0793678   .0276199    -2.87   0.006    -.1347424   -.0239933
explained   -.0197445   .0107814    -1.83   0.073    -.0413599    .0018708
unexplained   -.0596233   .0273862    -2.18   0.034    -.1145294   -.0047172

explained    
age_g52   -.0005991      .0012    -0.50   0.620    -.0030048    .0018067
age_g53       -.001    .001295    -0.77   0.443    -.0035962    .0015963
age_g54    .0000969   .0009533     0.10   0.919    -.0018143    .0020081
age_g55   -.0004468   .0010476    -0.43   0.671    -.0025472    .0016535
race2    -.001195   .0013982    -0.85   0.396    -.0039982    .0016081
race3    .0009479     .00266     0.36   0.723    -.0043851    .0062809
race4    .0005973   .0013599     0.44   0.662     -.002129    .0033237
imm    .0003432   .0019318     0.18   0.860    -.0035298    .0042161
educ2   -.0002301   .0009301    -0.25   0.806    -.0020949    .0016347
educ3    .0008762   .0017989     0.49   0.628    -.0027304    .0044828
educ4    -.012033    .005119    -2.35   0.022     -.022296     -.00177
hhinc2   -.0003015   .0006636    -0.45   0.651    -.0016318    .0010289
hhinc3   -.0001882   .0006291    -0.30   0.766    -.0014493     .001073
hhinc4   -.0066125   .0039993    -1.65   0.104    -.0146306    .0014057

unexplained  
age_g52    -.028689   .0165345    -1.74   0.088    -.0618386    .0044607
age_g53   -.0297615   .0290941    -1.02   0.311    -.0880916    .0285686
age_g54   -.0556364   .0411148    -1.35   0.182    -.1380666    .0267938
age_g55   -.0026016   .0261534    -0.10   0.921     -.055036    .0498328
race2    .0048568   .0077741     0.62   0.535    -.0107293    .0204428
race3    .0176802   .0161501     1.09   0.278    -.0146989    .0500593
race4    .0041202   .0058635     0.70   0.485    -.0076355    .0158759
imm   -.0093887    .012094    -0.78   0.441    -.0336357    .0148583
educ2    .0047812   .0141557     0.34   0.737    -.0235994    .0331617
educ3   -.0291882    .026426    -1.10   0.274    -.0821691    .0237926
educ4   -.0366389   .0235373    -1.56   0.125    -.0838284    .0105505
hhinc2    .0017072   .0118416     0.14   0.886    -.0220339    .0254482
hhinc3    .0011674   .0093851     0.12   0.901    -.0176486    .0199834
hhinc4   -.0020982   .0118124    -0.18   0.860    -.0257806    .0215842
_cons    .1000664    .098881     1.01   0.316    -.0981782    .2983109
I am seeking help because, within these results, there is an overall significant difference between the two groups (urban and rural in this case), with roughly 24.9% of the difference coming from differences in composition (explained; p-value not statistically significant) and 75.1% coming from differences in coefficients (unexplained; p-value statistically significant). On the surface this makes sense, but when you look at the individual variables within the explained and unexplained portions, these findings don't line up.

Specifically, the only significant variables are found in the explained portion (educ4), and there are no significant variables in the unexplained portion (despite unexplained being significant overall).

Am I misinterpreting the results? Or, on a more technical level, how are standard errors and p-values calculated within an Oaxaca-Blinder Decomposition? Does the calculation differ when trying to estimate the significance of the overall explained/unexplained components than when trying to calculate the effects of individual variables?

Please let me know if I can clarify anything

Solve dynamic non linear deterministic system

Dear forum,
After several hours of work, I am looking for the forum to help me to find a solution to my problem.
I am used to working with GAMS to solve systems of nonlinear equations, and I would like to know whether this step can be done directly in Stata. This resolution is only a first step in my project, since the results obtained are then integrated into a microsimulation model based on a household survey, and that work is conducted in Stata. I would like to find a way to do everything in Stata if possible.
What I need to know is whether Stata can solve a system of deterministic dynamic nonlinear equations (without random shocks). The model is calibrated on national accounting data and therefore does not need to be estimated, since we are not using time-series (or panel) data. I tried the dsgenl command without success.
Thanks in advance for your answers.

Sunday, January 29, 2023

Calculating summary statistics for correlation coefficients

Hello,
I have calculated pairwise correlation coefficients between observations in a panel data set within a grouping variable, and I have successfully transferred the correlation coefficient matrix into a data set. I am now looking to calculate summary statistics for these correlation coefficients (mean, percentiles, sd); however, I do not want to double count coefficients in these calculations. For example, since the correlation coefficient at column C1 row 2 is the same as the one at column C2 row 1, if I collapse and sum by column I will double count these coefficients. Similarly, I do not want to include correlation coefficients between a variable and itself. How can I avoid double counting correlation coefficients?

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float(C1 C2 C3 C4 C5 C6 C7 C8)
       1 .9937328 .9937328  .  . . 1 .9032351
.9937328        1        1  .  . . 1  .946541
.9937328        1        1  .  . . 1  .946541
       .        .        .  1  1 . .       -1
       .        .        .  1  1 . .       -1
       .        .        .  .  . . .        .
       1        1        1  .  . . 1        .
.9032351  .946541  .946541 -1 -1 . .        1

end
Thanks for any help!
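
A minimal sketch of one way to avoid double counting: blank out the diagonal and upper triangle (row i holds variable i, matching the dataex above), then stack what is left into a single variable for the summary statistics.

Code:
gen long row = _n
forvalues j = 1/8 {
    replace C`j' = . if `j' >= row     // drop the diagonal and upper triangle
}
drop row
stack C1-C8, into(corr) clear          // one long column of unique coefficients
summarize corr, detail                 // mean, sd, percentiles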

Country-pair specific id for gravity model

Hello all,

I'm trying to estimate a gravity model with trade data disaggregated at the industry level, and I want to include country-pair fixed effects in my regression. I'm aware that when you want to generate a country-pair id, you use egen pairid = group(importer exporter). But what I have been wondering is that when you generate a pair id like that, a pair of USA (importer) and Canada (exporter) is different from a pair of Canada (importer) and USA (exporter).

So my question is: is it right to run the regression with country-pair fixed effects using pair ids generated like that? From my understanding, we include country-pair fixed effects to capture heterogeneity between country pairs. So wouldn't it be more appropriate for me to have the same pair id for the pair USA and Canada regardless of which of the two countries is the importer or exporter? (USA(imp)-Canada(exp) and Canada(imp)-USA(exp) should have the same pair id, since both pairs comprise the same countries.) Secondly, if so, what command can I use to generate the same pair id for USA and Canada regardless of which is the importer or exporter?

Thank you
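
A minimal sketch of an undirected pair id, assuming importer and exporter are string variables:

Code:
gen pair_lo = cond(importer < exporter, importer, exporter)   // alphabetically first country
gen pair_hi = cond(importer < exporter, exporter, importer)   // alphabetically second country
egen pairid = group(pair_lo pair_hi)                          // same id for USA-Canada and Canada-USA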

How to merge datasets using joinby?

I have the following data. In Data Set A, I have rows of children with "h13hhidc" indicating their household ID, including the relevant sub-household ID information. "h13hhidc" is not unique to the children, since multiple children of the same household appear in the data. Data Set B includes parents (some single households and some married households); these are also uniquely identified by h13hhidc. I would like to merge Data A and Data B so that the final result gives every parent in Data B a new row for each child in Data A with a matching h13hhidc. I want to keep people in Data B who do not have children in Data A (as I later have more child data sets to import), but, again, with several rows for parents who did have matching children in Data A.

I have searched the forum and feel like this would be the proper time for the joinby command, but I am not getting the desired result. Any help is greatly appreciated.

Data A
Code:


. list hhid opn h13hhidc psubhh qsubhh education  in 1/500 , sepby (hhid)

     +------------------------------------------------------+
     |   hhid   opn   h13hhidc   psubhh   qsubhh   educat~n |
     |------------------------------------------------------|
  1. | 010004   101    0100040        0        0          . |
  2. | 010004   201    0100040        0        0          . |
  3. | 010004   202    0100040        0        0          . |
  4. | 010004   604    0100040        0        0          . |
  5. | 010004   605    0100040        0        0          . |
Data B

Code:
. list hhid pn h13hhidc in 6/20, sepby (hhid)

     +-------------------------+
     |   hhid    pn   h13hhidc |
     |-------------------------|
  6. | 010003   020          . |
  7. | 010003   030    0100030 |
     |-------------------------|
  8. | 010004   010          . |
  9. | 010004   040    0100040 |
     |-------------------------|
 10. | 010013   010          . |
 11. | 010013   040    0100131 |
     |-------------------------|
 12. | 010038   010    0100380 |
 13. | 010038   040    0100380 |
     +-------------------------+
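
A minimal sketch of the joinby call, assuming Data A is saved as dataA.dta and Data B as dataB.dta (hypothetical file names):

Code:
use dataB, clear
joinby h13hhidc using dataA, unmatched(master)   // keeps Data B parents with no matching child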

Describe variables

I have a dataset (example below) and I would like to know the frequency of males (sex==1) and females (sex==2) who mentioned (resinsulted3_3==1) and did not mention (resinsulted3_3==0) the item captured by resinsulted3_3.
Many thanks in advance
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input long pidp byte(sex resinsulted3_3)
 68043527 1 0
 68344767 2 0
 68395767 1 0
 68395771 2 0
 68467847 2 0
 68990767 2 0
 69307651 1 0
 69443647 1 1
 69593251 1 0
 69734695 1 0
 69848927 2 1
 69848931 1 0
 69931887 2 0
 70435091 2 0
137668727 2 0
137675535 1 0
138035931 1 1
138292287 2 0
204551495 2 0
204551499 1 1
205473571 1 0
205566727 1 1
205722447 2 1
205857091 1 0
205995807 1 0
206348055 2 0
272116287 2 1
272159131 1 1
272395771 1 0
273058775 1 0
273729247 1 0
273973371 2 0
274289571 1 0
340490971 1 0
341507567 1 1
341636091 1 0
341678935 2 0
341895167 1 1
341901971 1 1
342167847 1 0
346745684 1 0
408627647 2 0
408983291 1 0
409210407 2 0
409330767 2 0
409447727 1 1
409962499 2 0
410297047 1 0
410339207 2 0
410339223 1 0
410355527 1 0
410454807 2 0
416520484 2 1
476163891 2 0
476351567 2 0
477203615 1 0
477457927 1 0
477703411 2 1
477945499 1 1
478246051 2 0
478263727 1 0
478468407 1 0
478513291 1 0
545284535 2 0
545702047 2 1
545714287 2 0
545789087 1 0
545795903 2 0
546102567 2 0
546226327 2 1
546445295 1 1
546720687 2 0
546764207 2 0
612737127 2 0
613139007 2 0
613487171 1 1
613516407 2 1
613518455 1 0
613555167 2 0
613878167 2 0
614307251 2 0
680962211 2 0
681007099 2 0
681717687 1 1
681959767 1 1
687758964 1 0
748383527 2 0
748565087 1 1
749221971 1 0
749427331 1 1
749433451 1 1
749436167 1 1
749776171 1 1
750230407 2 0
816190411 2 0
816391011 2 0
817014571 1 0
817370211 1 0
817714967 2 0
817917607 1 0
end
label values sex c_sex
label def c_sex 1 "male", modify
label def c_sex 2 "female", modify
label values resinsulted3_3 c_resinsulted3_3
label def c_resinsulted3_3 0 "not mentioned", modify
label def c_resinsulted3_3 1 "mentioned", modify
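
A simple cross-tabulation should give these frequencies directly:

Code:
tab sex resinsulted3_3
tab sex resinsulted3_3, row nofreq   // row percentages instead of counts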

How to perform relative mortality analysis in Stata?

Hi everyone. I have a dataset of cancer patients from 2004-2014 with survival status until 2015. I've now been asked to do a relative mortality analysis and actuarial survival. Could you please give me the syntax for performing these analyses?

Regards
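
For the actuarial (life-table) part, one possible starting point with official commands is sketched below, assuming survival time in years is stored in time_yrs and the death indicator in died (hypothetical names); for relative survival, user-written commands such as strs are commonly used (try search strs).

Code:
ltable time_yrs died, intervals(1) survival   // actuarial (life-table) survival estimates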

Scatter plot with different groups while adding a label for one particular observation (sepscatter command)

Hi everyone,

I have cross-section data for a group of countries and three variables: government spending, test scores, and region (please find an example of the data below). I would like to create a scatter plot of two variables (government spending and test scores) with a different color for each region, while at the same time adding a label for one particular observation.

Example of the data

* Example generated by -dataex-. For more info, type help dataex
clear
input int year str3 code float harmonizedtestscores double govexpstudent str26 region
2020 "ABW" . 17.08119 "Latin America & Caribbean"
2020 "AFG" 354.7588 10.25174 "South Asia"
2020 "ALB" 434.1276 34.18148 "Europe & Central Asia"
2020 "AND" . 12.30702 "Europe & Central Asia"
2020 "ARE" 448.0407 6.04318 "Middle East & North Africa"
2020 "ARG" 408.1726 14.97726 "Latin America & Caribbean"
2020 "ARM" 442.96945 10.15365 "Europe & Central Asia"
2020 "ATG" 406.9974 8.05843 "Latin America & Caribbean"
2020 "AUS" 515.68536 19.30184 "East Asia & Pacific"
2020 "AUT" 507.6394 23.48041 "Europe & Central Asia"
2020 "BDI" 422.7478 12.92904 "Sub-Saharan Africa"
2020 "BEL" 516.80786 21.87767 "Europe & Central Asia"
2020 "BEN" 383.92285 10.25321 "Sub-Saharan Africa"
2020 "BFA" 403.6543 16.13029 "Sub-Saharan Africa"
2020 "BGD" 368.3153 7.56853 "South Asia"
2020 "BGR" 441.0933 22.99728 "Europe & Central Asia"
2020 "BHR" 451.706 11.16362 "Middle East & North Africa"
2020 "BLZ" . 16.55987 "Latin America & Caribbean"
2020 "BMU" . 8.05056 "North America"
2020 "BRA" 413.24475 20.05496 "Latin America & Caribbean"
2020 "BRB" . 20.75168 "Latin America & Caribbean"
2020 "BRN" 437.5159 8.87354 "East Asia & Pacific"
2020 "BTN" 386.7394 12.84915 "South Asia"
2020 "BWA" 391.3183 10.0503 "Sub-Saharan Africa"
2020 "CAF" 368.7302 4.10032 "Sub-Saharan Africa"
2020 "CAN" 533.998 17.56737 "North America"
2020 "CHE" 515.13904 24.78876 "Europe & Central Asia"
2020 "CHL" 452.2182 18.38536 "Latin America & Caribbean"
2020 "CHN" 441 5.90241 "East Asia & Pacific"
2020 "CIV" 373.2305 13.27872 "Sub-Saharan Africa"
2020 "CMR" 378.8688 5.44964 "Sub-Saharan Africa"
2020 "COD" 310.1971 7.19307 "Sub-Saharan Africa"
2020 "COG" 370.6141 11.68812 "Sub-Saharan Africa"
2020 "COL" 419.0275 16.98989 "Latin America & Caribbean"
2020 "COM" 392.2255 9.75717 "Sub-Saharan Africa"
2020 "CPV" . 16.65962 "Sub-Saharan Africa"
2020 "CRI" 428.5557 20.82721 "Latin America & Caribbean"
2020 "CUB" . 49.06773 "Latin America & Caribbean"
2020 "CUW" . 10.00974 "Latin America & Caribbean"
2020 "CYM" . 17.06149 "Latin America & Caribbean"
2020 "CYP" 502.1622 31.88086 "Europe & Central Asia"
2020 "CZE" 512.2216 13.8679 "Europe & Central Asia"
2020 "DEU" 517.28143 17.44257 "Europe & Central Asia"
2020 "DJI" . 37.32327 "Middle East & North Africa"
2020 "DMA" 403.7832 14.64306 "Latin America & Caribbean"
2020 "DNK" 517.8778 25.05246 "Europe & Central Asia"
2020 "DOM" 345.2165 16.18636 "Latin America & Caribbean"
2020 "DZA" 374.0891 11.47531 "Middle East & North Africa"
2020 "ECU" 420.1486 9.51098 "Latin America & Caribbean"
2020 "EGY" 355.9865 9.84518 "Middle East & North Africa"
2020 "ERI" . 5.26822 "Sub-Saharan Africa"
2020 "ESP" 506.6212 17.11832 "Europe & Central Asia"
2020 "EST" 543.20605 20.29085 "Europe & Central Asia"
2020 "ETH" 348.199 7.87686 "Sub-Saharan Africa"
2020 "FIN" 533.70764 21.50883 "Europe & Central Asia"
2020 "FJI" 383.336 12.35277 "East Asia & Pacific"
2020 "FRA" 510.2606 17.3647 "Europe & Central Asia"
2020 "GAB" 456.1277 4.71526 "Sub-Saharan Africa"
2020 "GBR" 520.35645 24.17334 "Europe & Central Asia"


I have used the following code and it works perfectly (please find below an image):
separate harmonizedtestscores ,by( region ) veryshortlabel
scatter harmonizedtestscores? govexpstudent

I just would like to add a label for one country "EGY". I would be grateful if you could advise.

Many thanks in advance.

Jala
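
A minimal sketch of one way to add the label: overlay an extra scatter layer for the one country, with an invisible marker and a marker label (the legend order assumes seven region variables were created by separate; adjust as needed).

Code:
separate harmonizedtestscores, by(region) veryshortlabel
scatter harmonizedtestscores? govexpstudent || ///
    scatter harmonizedtestscores govexpstudent if code == "EGY", ///
    msymbol(none) mlabel(code) mlabposition(3) legend(order(1 2 3 4 5 6 7))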

Creating a combined averaged using a panel dataset

I have a panel dataset in vertical (long) form with 6 countries; each shows values for 14 indicators between 2000 and 2020. There is a column that shows the indicator "value" for each "year" and another column that takes the most recent value ("lastvalue") observed between 2015 and 2020 (in "lastyear"); this value repeats itself across the rows for each country-indicator.

Here is an example for only one country Antigua and Barbuda (ATG)
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str3 countrycode str20 indicatorcode int year double value int lastyear double lastvalue "ATG" "FX.OWN.TOTL.FE.ZS" 2000 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2001 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2002 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2003 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2004 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2005 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2006 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2007 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2008 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2009 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2010 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2011 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2012 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2013 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2014 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2015 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2016 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2017 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2018 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2019 . . . "ATG" "FX.OWN.TOTL.FE.ZS" 2020 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2000 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2001 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2002 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2003 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2004 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2005 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2006 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2007 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2008 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2009 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2010 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2011 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2012 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2013 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2014 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2015 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2016 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2017 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2018 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2019 . . . "ATG" "FX.OWN.TOTL.MA.ZS" 2020 . . . "ATG" "IC.FRM.FEMO.ZS" 2000 . . . "ATG" "IC.FRM.FEMO.ZS" 2001 . . . "ATG" "IC.FRM.FEMO.ZS" 2002 . . . "ATG" "IC.FRM.FEMO.ZS" 2003 . . . "ATG" "IC.FRM.FEMO.ZS" 2004 . . . "ATG" "IC.FRM.FEMO.ZS" 2005 . . . "ATG" "IC.FRM.FEMO.ZS" 2006 . . . "ATG" "IC.FRM.FEMO.ZS" 2007 . . . "ATG" "IC.FRM.FEMO.ZS" 2008 . . . "ATG" "IC.FRM.FEMO.ZS" 2009 . . . "ATG" "IC.FRM.FEMO.ZS" 2010 21.3 . . "ATG" "IC.FRM.FEMO.ZS" 2011 . . . "ATG" "IC.FRM.FEMO.ZS" 2012 . . . "ATG" "IC.FRM.FEMO.ZS" 2013 . . . "ATG" "IC.FRM.FEMO.ZS" 2014 . . . "ATG" "IC.FRM.FEMO.ZS" 2015 . . . "ATG" "IC.FRM.FEMO.ZS" 2016 . . . "ATG" "IC.FRM.FEMO.ZS" 2017 . . . "ATG" "IC.FRM.FEMO.ZS" 2018 . . . "ATG" "IC.FRM.FEMO.ZS" 2019 . . . "ATG" "IC.FRM.FEMO.ZS" 2020 . . . "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2000 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2001 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2002 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2003 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2004 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2005 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2006 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2007 . 
2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2008 96.2264175415039 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2009 78.1376495361328 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2010 107.908851623535 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2011 112.236839294434 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2012 106.744491577148 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2013 95.4545516967773 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2014 91.2582778930664 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2015 99.1847763061523 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2016 115.650970458984 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2017 104.237289428711 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2018 102.011489868164 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2019 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.FE.ZS" 2020 . 2018 102.01148986816406 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2000 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2001 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2002 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2003 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2004 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2005 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2006 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2007 . 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2008 79.2866897583008 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2009 88.0920181274414 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2010 83.8666687011719 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2011 93.8642272949219 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2012 99.3573303222656 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2013 80.2564086914063 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2014 85.8257522583008 2018 95.35211181640625 "ATG" "SE.SEC.CMPT.LO.MA.ZS" 2015 92.4202117919922 2018 95.35211181640625 end
I would like to average the values for all the 6 countries and create a new combined countrycode. This means that I want a new set of rows for the new "averagedcountries" with
1. average "value" for each indicator (14) for each "year" (between 2000-2020).
2. average "lastvalue" for each indicator for the "recentvalue" (between 2015-2020).

I would appreciate any guidance! Thank you
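
A minimal sketch of one possible approach: build the cross-country averages in a temporary file and append them as an extra, synthetic country (the code "AVG" is just a placeholder name).

Code:
preserve
collapse (mean) value lastvalue, by(indicatorcode year)
gen countrycode = "AVG"        // synthetic code for the combined average
tempfile avg
save `avg'
restore
append using `avg'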

How to deal with multicollinearity when adding fixed effects dummies in regression with cross sectional data?

Hello,

I have cross-sectional data with 26 groups. I estimated probit and fracreg regressions for my two research questions. Since my key explanatory variable varies at the group level, I added dummies for group and clustered errors at the group level as well. However, I computed VIFs by running the reg command with the same variables I used in the probit and fracreg regressions. The VIF for my key explanatory variable is very high (around 28,000) when I add the group dummies, but when I remove the dummies it is within an acceptable range (around 4). Why is multicollinearity high when I add the dummies? How can I fix this issue?

Saturday, January 28, 2023

scatter Y X || lfit Y X ||, by(variable)

Good evening,

I'd like to run scatter Y X || lfit Y X ||, by(variable)
but have everything on one graph instead of multiple graphs. I realize that there's going to be a lot of points and lines superimposed; that's OK. Is there a way to do it?

Thank you,
Stan
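
A minimal sketch using separate, assuming the grouping variable is named group and takes the values 1 and 2 (hypothetical names), so that separate creates Y1 and Y2:

Code:
separate Y, by(group) veryshortlabel
twoway (scatter Y1 X) (lfit Y1 X) ///
       (scatter Y2 X) (lfit Y2 X)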

Brant Test Significance Question

Hello everyone. Thank you all for taking the time to answer questions on this forum; it has been very helpful. This is my first time posting here. I would like to ask a question regarding oparallel and the Brant test in particular. As you can see, the assumption of parallel lines is violated.

Tests of the parallel regression assumption

| Chi2 df P>Chi2
-----------------+----------------------
Wolfe Gould | 63.74 22 0.000
Brant | 57.02 22 0.000
score | 60.3 22 0.000
likelihood ratio | 63.02 22 0.000
Wald | 58.48 22 0.000

The Brant test is also significant. However, I am confused: why doesn't the chi2 column of the table add up to 57.02? It is much less when I add it up on a calculator. I apologize if this is a really basic question, but I humbly request your help.

Brant test of parallel regression assumption

| chi2 p>chi2 df
-------------+------------------------------
All | 57.02 0.000 22
-------------+------------------------------
region | 0.02 0.881 1
sex | 0.03 0.870 1
age | 0.76 0.384 1
edu | 3.57 0.059 1
sonh | 1.28 0.258 1
dauh | 3.62 0.057 1
2.mainsrc | 0.22 0.641 1
3.mainsrc | 0.12 0.734 1
4.mainsrc | 0.86 0.355 1
savin | 0.33 0.568 1
gold | 1.02 0.314 1
bond | 2.08 0.150 1
house | 0.00 0.952 1
land | 2.11 0.147 1
car | 0.94 0.333 1
2.satisf | 1.43 0.232 1
3.satisf | 0.09 0.760 1
2.health | 0.78 0.377 1
3.health | 3.89 0.049 1
1.templ | 0.32 0.571 1
2.templ | 0.79 0.373 1
3.templ | 0.02 0.892 1

When I run gologit2 there are also no constraints found. Image attached.

Thank you all. I sincerely hope to get some answer.

Best way to improve processing speed for large data sets (~3gb)

How much will more RAM help me with processing speed?

I am working with a dataset of 87 million records, about 3 GB in size. It is a 20-year cohort of people, and my datasets contain 5-6 rows of data per person corresponding to their health events. It would be difficult to collapse any further (I could of course separate the data into years, but I would rather improve processing power than add these extra steps).

For example, today I tried to run bsample to draw 800,000 records out of my 80,000,000, and after an hour it still had not completed. I have similarly long wait times when trying to flag whether a string contains any of a list of 20 ICD codes, upwards of an hour while the computer sounds like it's about to take off like an airplane.

I have 7 GB of available RAM, and it is always running at max during these churns. I currently have only 50 GB of free C drive disk space (it's mostly full). Much of this I can't delete, as it holds records and datasets from other projects, and since this is a work computer, getting more space will require a series of requests to IT, etc.

What will make the most difference for me? Requesting more hard drive space would be very easy, and I seem to remember that Stata needs plenty of hard drive space to run big datasets.

Or do I need to double my RAM, and is that the only way?

This is an ongoing project and I will be working with these datasets for at least a year, so I need to fix this. I am using Stata 17 SE.

Thanks


Friday, January 27, 2023

Hausman Test Failed

Hi! My name is Karina. I ran a Hausman test in Stata using these commands:

quietly xtreg ecgrowth jubgrowth sukubunga inflation, fe
estimates store fe
quietly xtreg ecgrowth jubgrowth sukubunga inflation, re
estimates store re
hausman fe re

and it came up with this:

chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= -2.36 chi2<0 ==> model fitted on these
data fails to meet the asymptotic
assumptions of the Hausman test;
see suest for a generalized test

After searching for a little while, I found other options for doing the Hausman test: either using xtoverid or the sigmaless/sigmamore options. I then tried the following commands:

xtreg ecgrowth jubgrowth sukubunga inflation if t>=2004 & t<=2021, re
xtoverid
di r(j)
est store re
xtreg ecgrowth jubgrowth sukubunga inflation if t>=2004 & t<=2021, fe
hausman fe re, sigmaless

Is this the right way to run the Hausman test? If not, could anyone give me the right commands? I'm confused and on the edge of a thesis deadline. Thank you to anyone replying!

Running ANOVA in loops

Hi, I want to run a repeated-measures ANOVA of math scores on semester, but the analysis has to be done for all levels of the stress (high/mid/low) and sex (male/female) variables.

Please let me know if this is possible to do in a loop without having to subset manually.

Thank you!

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte student_id str6(income sex semester) float math_score
 1 "low"    "male"   "summer"  75.79707
 2 "low"    "male"   "winter"  78.42905
 3 "low"    "male"   "fall"    77.19025
 4 "low"    "male"   "summer"  86.13298
 5 "low"    "male"   "winter"  84.81785
 6 "low"    "male"   "fall"   72.808044
 7 "low"    "male"   "summer"  60.94741
 8 "low"    "male"   "winter"  93.07262
 9 "low"    "male"   "fall"    96.72986
10 "low"    "male"   "summer"  95.45274
11 "low"    "male"   "winter"  73.21725
12 "low"    "male"   "fall"    68.09888
13 "low"    "male"   "summer"  54.45422
14 "low"    "male"   "winter" 70.273636
15 "low"    "male"   "fall"     63.7756
16 "low"    "male"   "summer"  73.78484
17 "low"    "male"   "winter"  74.67471
18 "low"    "male"   "fall"    74.49361
19 "low"    "male"   "summer"  65.32313
20 "low"    "male"   "winter"  71.98983
21 "low"    "male"   "fall"    70.34425
22 "low"    "male"   "summer"  63.78717
23 "low"    "male"   "winter"  79.24786
24 "low"    "male"   "fall"     82.9422
25 "low"    "male"   "summer" 66.953354
26 "low"    "male"   "winter"    83.132
27 "low"    "male"   "fall"    45.36668
28 "low"    "male"   "summer"  81.48463
29 "low"    "male"   "winter"  85.73037
30 "low"    "male"   "fall"    79.11235
31 "low"    "male"   "summer"  81.41333
32 "low"    "male"   "winter"   82.6655
33 "low"    "male"   "fall"     62.9081
 1 "low"    "male"   "summer"  61.16171
 2 "low"    "male"   "winter"  55.96095
 3 "low"    "male"   "fall"    65.61567
 4 "low"    "male"   "summer"  57.84998
 5 "low"    "male"   "winter"  62.85389
 6 "low"    "male"   "fall"    86.14867
 7 "low"    "male"   "summer" 65.382805
 8 "low"    "male"   "winter"  78.90556
 9 "low"    "male"   "fall"    77.81376
10 "low"    "male"   "summer"  73.22633
11 "low"    "male"   "winter"  78.89809
12 "low"    "male"   "fall"    67.75861
13 "low"    "male"   "summer"  63.12998
14 "low"    "male"   "winter" 64.939606
15 "low"    "male"   "fall"    68.52872
16 "low"    "male"   "summer"  82.71146
17 "middle" "male"   "winter"  89.11002
18 "middle" "male"   "fall"    52.11129
19 "middle" "male"   "summer"  65.83049
20 "middle" "male"   "winter"  68.84013
21 "middle" "male"   "fall"   69.065605
22 "middle" "male"   "summer"  74.18057
23 "middle" "male"   "winter"  82.78862
24 "middle" "male"   "fall"    67.25807
25 "middle" "male"   "summer" 69.853386
26 "middle" "male"   "winter"  91.95356
27 "middle" "male"   "fall"    76.73396
28 "middle" "male"   "summer"  60.11725
29 "middle" "male"   "winter"  76.10474
30 "middle" "female" "fall"    70.36514
31 "middle" "female" "summer"  67.02353
32 "middle" "female" "winter"  59.18249
33 "middle" "female" "fall"     79.9767
 1 "middle" "female" "summer"  79.48432
 2 "middle" "female" "winter"  64.14967
 3 "middle" "female" "fall"    84.50146
 4 "middle" "female" "summer"  63.50959
 5 "middle" "female" "winter"  73.36041
 6 "middle" "female" "fall"    81.75935
 7 "middle" "female" "summer"  71.10308
 8 "middle" "female" "winter"  56.25582
 9 "middle" "female" "fall"    77.83181
10 "middle" "female" "summer"  60.94855
11 "middle" "female" "winter"  78.57565
12 "middle" "female" "fall"     83.1645
13 "middle" "female" "summer"   79.3857
14 "high"   "female" "winter"  74.47901
15 "high"   "female" "fall"    56.33062
16 "high"   "female" "summer"  66.48232
17 "high"   "female" "winter"  63.08831
18 "high"   "female" "fall"    60.62999
19 "high"   "female" "summer"  76.90691
20 "high"   "female" "winter"  69.44049
21 "high"   "female" "fall"    56.89666
22 "high"   "female" "summer"  79.77268
23 "high"   "female" "winter" 68.887505
24 "high"   "female" "fall"    71.54644
25 "high"   "female" "summer"  65.84627
26 "high"   "female" "winter"  77.20901
27 "high"   "female" "fall"    66.96027
28 "high"   "female" "summer" 67.319885
29 "high"   "female" "winter" 74.991455
30 "high"   "female" "fall"    70.49643
31 "high"   "female" "summer"  71.18354
32 "high"   "female" "winter"   56.2911
33 "high"   "female" "fall"    69.46329
end
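
A minimal sketch of one way to loop over the groups without subsetting manually, using the income and sex variables exactly as they appear in the dataex above (anova needs numeric factor variables, hence the encode step):

Code:
encode semester, gen(sem)               // string semester -> numeric factor
levelsof income, local(incomes)
levelsof sex, local(sexes)
foreach i of local incomes {
    foreach s of local sexes {
        display as text _n "income = `i', sex = `s'"
        capture noisily anova math_score student_id sem ///
            if income == "`i'" & sex == "`s'", repeated(sem)
    }
}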

Time invariant dummies.

Hi,

I am estimating a fixed-effects model and I am adding a series of country dummies to control for country fixed effects.
Unfortunately, all the country dummies are omitted due to multicollinearity (they do not change over time). Is there a way to overcome this problem?

Many thanks, Chiara

How can I check if a string has repeated words?

I know the moss package is related to this, but I cannot make it work yet (one similar post can be found here). For example, for the string "John said John" the result should be "John" at positions 1 and 11.


Extra:
A nice extra would be to handle multiple words as well; for example, "John Williams thanked John Williams" should return "John Williams". Best of all would be the longest repeated substring; for example, "John loves Joh" would give "Joh", and "John Williams thanked John Will" should give "John Will".
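
A minimal sketch of one possible route via Stata's ICU regular expressions (ustrregexm), which allow backreferences; this only finds a repeated whole word, not the longest repeated substring:

Code:
local s "John said John"
if ustrregexm("`s'", "\b(\w+)\b.*\b\1\b") {
    display "repeated word: " ustrregexs(1)
    display "first occurrence at position: " strpos("`s'", ustrregexs(1))
}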

forvalue loop for discontinuous variable

How can I use a forvalues loop over a variable whose values are not sequentially continuous? I want to loop over industry codes, which are not continuous and are huge in number.
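
A common alternative to forvalues is sketched below: collect the distinct codes with levelsof and loop over them with foreach (industry_code is a hypothetical variable name).

Code:
levelsof industry_code, local(codes)
foreach c of local codes {
    display as text "processing industry code `c'"
    * ... commands that use `c' ...
}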

Panel Data

Hi,

I am using panel data with a wave variable ranging from 1 to 10; however, I would like to focus on waves 1 to 8 instead when estimating pooled OLS. How would I be able to do this?

I would appreciate any help

Kind regards,

Denis
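
A minimal sketch, assuming the wave variable is named wave, the panel id pid, the outcome y, and the covariates x1 and x2 (all hypothetical names): pooled OLS is simply regress on the restricted sample.

Code:
regress y x1 x2 if inrange(wave, 1, 8), vce(cluster pid)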


Testing stability of regression discontinuity model (TED)

Hello Statalisters,

I need your valuable advice.
Having applied a regression discontinuity (RD) design in my analysis, I want to check the validity of its results with the ted command, which tests the stability of the regression discontinuity model.
Can someone guide me on how to install this command in Stata 17?

Any help in this regard would be extremely helpful to me.


Thank you in advance.

Best,
Angeliki
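
A generic way to locate and install a user-written command from within Stata is sketched below; the exact package name and URL for ted are not shown here and would come from the search results.

Code:
search ted, all        // or: findit ted
* then follow the link of the matching package, or:
* net install <pkgname>, from(<URL shown on the package page>)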

Thursday, January 26, 2023

matching single county with unique congressional district

According to us house election data a single county for a specific state belongs to multiple congressional districts. However, based on the total geographical area I can match unique county to a specific district for that state.

In the following data, nhgisnam is the name of the county, district is the congressional district, cnty_area is the total geographical area of the county, and cnty_part_area is the portion of the county's total area that belongs to that particular congressional district. For example, in the first line for Autauga county, the value of cnty_part_area is 1564828723; that means that in district 2, Autauga county's area is 1564828723 out of its total area (cnty_area) of 1565529773. Autauga county also belongs to districts 6 and 7 of state 1, but the highest portion of its area belongs to district 2, which I can figure out from the cnty_part_area variable.

Can anyone kindly guide me on how to code this so that, for each county in a particular state, I keep the observation where the county is assigned to a single district based on the highest value of cnty_part_area for that specific county?

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str36 nhgisnam str14 statenam byte(cd_statefip district) float county double(cnty_area cnty_part_area)
"Autauga"       "Alabama" 1 2 1001 1565529773  1564828723
"Autauga"       "Alabama" 1 6 1001 1565529773 272866.6476
"Autauga"       "Alabama" 1 7 1001 1565529773 428181.5623
"Baldwin"       "Alabama" 1 1 1003 4232265763  4228366412
"Baldwin"       "Alabama" 1 7 1003 4232265763 149732.2219
"Barbour"       "Alabama" 1 2 1005 2342716428  2341574170
"Barbour"       "Alabama" 1 3 1005 2342716428 601774.0707
"Bibb"          "Alabama" 1 6 1007 1621774445  1621293818
"Bibb"          "Alabama" 1 7 1007 1621774445 480626.7196
"Alameda"      "California" 6  9 6001  1928023778 347196605.8
"Alameda"      "California" 6 10 6001  1928023778 350508415.5
"Alameda"      "California" 6 11 6001  1928023778 654388620.1
"Alameda"      "California" 6 13 6001  1928023778 571141686.4
"Alameda"      "California" 6 15 6001  1928023778  77415.0506
"Alpine"       "California" 6  3 6003  1924880954  1923718768
"Alpine"       "California" 6  4 6003  1924880954 194564.3245
"Alpine"       "California" 6 19 6003  1924880954 163767.2729
"Alpine"       "California" 6 25 6003  1924880954 326582.7162
"Amador"       "California" 6  3 6005  1566159667  1565521028
"Amador"       "California" 6  4 6005  1566159667 637209.1726
"Amador"       "California" 6 11 6005  1566159667 1425.507371
"Glenn"        "California" 6  1 6021  3437311730 146543.9293
"Glenn"        "California" 6  2 6021  3437311730  3437165187
"Humboldt"     "California" 6  1 6023  9286259251  9282075925
"Humboldt"     "California" 6  2 6023  9286259251 589746.8231
"Los Angeles"  "California" 6 22 6037 10591484333  1402787367
"Los Angeles"  "California" 6 24 6037 10591484333 350751.8175
"Los Angeles"  "California" 6 25 6037 10591484333  4108619373
"Los Angeles"  "California" 6 26 6037 10591484333  1272584266
"Los Angeles"  "California" 6 27 6037 10591484333 392578311.2
"Los Angeles"  "California" 6 28 6037 10591484333 201732469.2
"Los Angeles"  "California" 6 29 6037 10591484333 262922748.4
"Los Angeles"  "California" 6 30 6037 10591484333 742182418.1
end
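
A minimal sketch using the variables shown above: sort each state-county group by cnty_part_area and keep the last (largest) row.

Code:
bysort cd_statefip county (cnty_part_area): keep if _n == _N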

Jackknife xtqreg

Using the Grunfeld data, this works:
Code:
bootstrap, reps(50) cluster(company) idcluster(comp): xtqreg invest mvalue kstock i.comp#c.time, quantile(.25) i(company)
But if I use jackknife instead of bootstrap it does not work. What needs to be different for jackknife? I took out reps and changed i(company) to the jackknife equivalent.

Combine multiple rows into one when end_date = start_date

Hi there,

I am looking for code that combines rows when end_date = start_date for a given ID and all the variables have the same values. For example, in the table below, ID 1 has 4 rows, and variables x and y have the same values in each row for ID 1. ID 1 has end_date = start_date across its consecutive rows. In that case, ID 1 should be combined into one row showing the very first start_date and the very last end_date.

id start_date end_date x y
1 10 12 10 12
1 12 14 10 12
1 14 16 10 12
1 16 21 10 12
2 11 13 15 17
2 13 15 15 17
3 16 18 20 22
3 18 20 20 22
3 20 25 20 22



Result will look like this:

id start_date end_date x y
1 10 21 10 12
2 11 15 15 17
3 16 25 20 22

I would appreciate help with the code.
Thank you.

Kind regards,
Rakib
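
A minimal sketch, assuming that within each id the rows always chain (each end_date equals the next start_date) and that x and y are constant within id, as in the example:

Code:
collapse (min) start_date (max) end_date, by(id x y)
order id start_date end_date x y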

Wild bootstrap with ML

Dear users, I need to estimate a maximum likelihood model with a wild bootstrap, as I have a few-clusters issue. I wonder if I can get any advice on how to implement the wild bootstrap. To be specific, I did the following to use boottest, and it failed.

ml model lf ML_eut_mu (CRRA: choice p0b p1b pay0b pay1b = post trt postxtrt ///
) (mu: = )
ml max, diff iterate(100)
boottest postxtrt

Thank you!

Bug in the dir extended macro


Dear All,

I wonder if anyone can interpret the below syntax line for me.

Thank you, Sergiy

[screenshot of the syntax line not reproduced]





New Stata package: ddml for Double/debiased Machine Learning

I am happy to announce a new package that I have written together with Christian Hansen, Mark Schaffer and Thomas Wiemann.

We introduce the package ddml for Double/Debiased Machine Learning (DDML) in Stata. Estimators of causal parameters for five different econometric models are supported, allowing for flexible estimation of causal effects of endogenous variables in settings with unknown functional forms and/or many exogenous variables.

The main features are:
  1. ddml supports flexible estimators of causal parameters in five econometric models: (1) the Partially Linear Model, (2) the Interactive Model (for binary treatment), (3) the Partially Linear IV Model, (4) the Flexible Partially Linear IV Model, and (5) the Interactive IV Model (for binary treatment and instrument).
  2. ddml supports data-driven combinations of multiple machine learners via stacking by leveraging pystacked, our complementary Stata frontend relying on the Python library scikit-learn. See our separate Working paper.
  3. Aside from pystacked, ddml can be used in combination with many other existing supervised machine learning programs available in or via Stata. ddml has been tested with lassopack, rforest, svmachines, and parsnip. Indeed, the requirements for compatibility with ddml are minimal: any eclass program with the Stata-typical "reg y x" syntax, support for if conditions, and post-estimation predict is compatible with ddml.
  4. ddml provides flexible multi-line syntax and short one-line syntax. The multi-line syntax offers a wide range of options, guides the user through the DDML algorithm step-by-step, and includes auxiliary programs for storing, loading and displaying additional information. We also provide a complementary one-line version called qddml (`quick' ddml), which uses a similar syntax as pdslasso and ivlasso.
You can find more information---including many examples---on our Website or in our Working paper.

You can install ddml from our Github or SSC.

Code:
ssc install ddml
net install ddml, from(https://raw.githubusercontent.com/aahrens1/ddml/master)
Many thanks to Kit Baum for uploading the package to SSC.

Predictive margins and marginsplot for a continuous variable

Hello everyone,

I am completely new to margins and marginsplot and have the following question:

I have this regression:

Code:
reghdfe c.Change_NII c.GGDEBT##c.Total_domestic_exposure c.GGDEBT_lag1 c.Long_term_rate c.Unemployment c.log_totalassets c.log_ownfunds c.CET_Actual c.NII_Actual, absorb(i.ST_Year i.Country_Code2)
and I need the predictive margins and a margins plot where GGDEBT is held constant at certain values (30, 60, 80, 100, 120),
with predicted Change_NII on the y-axis and Total_domestic_exposure on the x-axis.
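
A sketch of one possible margins call after the model above; the grid of Total_domestic_exposure values (0 to 100 in steps of 10) is only a placeholder to be adapted to the data.

Code:
margins, at(GGDEBT = (30 60 80 100 120) Total_domestic_exposure = (0(10)100))
marginsplot, xdimension(Total_domestic_exposure)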





Thanks in advance and kind regards,
Natalie

Wednesday, January 25, 2023

Merging problem of congressional district data with counties

Here is one set of my data, where I have statefip, district (which stands for congressional district), and county. I want to merge this data with the dataset posted second below. The problem is that in my master data (posted first here), the same congressional district spans multiple counties, and I don't know how to resolve this in order to merge with the using data.

Can anyone help?

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(cd_statefip district) float county
6  9 6001
6 10 6001
6 11 6001
6 13 6001
6 15 6001
6  3 6003
6  3 6003
6  3 6003
6  4 6003
6 19 6003
6  4 6005
6  4 6005
6 11 6009
6 19 6009
6 19 6009
6  1 6011
6  1 6011
6  2 6011
6  2 6011
48  5 48001
48  6 48001
48 11 48003
48 19 48003
48  1 48005
48  1 48005
48  5 48005
48 27 48007
48 15 48055
48 25 48055
48 25 48055
48 28 48055
end
The data I'm trying to merge with is like the following.

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(state_fips district) float democrat_pres byte incumbent_pres_match float(democrat_voteperc competitive)
6  1 1 0  42.61742 2
6  1 1 0  42.61742 2
6  2 1 1   71.2433 8
6  2 1 1   71.2433 8
6  3 1 1  54.23049 4
6  3 1 1  54.23049 4
6  4 1 0  38.89085 1
6  4 1 0  38.89085 1
6  5 1 1  74.47112 8
6  5 1 1  74.47112 8
6  6 1 1  75.05244 9
6  6 1 1  75.05244 9
6  7 1 1  51.68154 4
6  7 1 1  51.68154 4
6  8 1 0         . .
6  8 1 0         . .
6  9 1 1   55.5541 5
6  9 1 1   55.5541 5
6 10 1 0  47.29181 3
6 10 1 0  47.29181 3
6 11 1 1  69.73173 8
6 11 1 1  69.73173 8
6 12 1 1  85.08385 .
6 12 1 1  85.08385 .
6 13 1 1  86.78157 .
6 13 1 1  86.78157 .
6 14 1 1  78.91653 9
48  1 0 0  56.44611  5
48  1 0 0  56.44611  5
48  2 0 0  60.84797  6
48  2 0 0  60.84797  6
48  2 0 0  60.84797  6
48  3 0 1 24.331583  .
48  3 0 1 24.331583  .
48  3 0 1 24.331583  .
48  4 0 0  57.82096  5
48  4 0 0  57.82096  5
48  4 0 0  57.82096  5
48  5 0 1  40.26217  2
48  5 0 1  40.26217  2
48  5 0 1  40.26217  2
48  5 0 1  40.26217  2
48 17 0 0  51.36383  4
48 17 0 0  51.36383  4
48 17 0 0  51.36383  4
48 18 0 0  76.91312  9
48 18 0 0  76.91312  9
48 18 0 0  76.91312  9
end
label values competitive competitivb
label def competitivb 0 "30-", modify
label def competitivb 1 "35-", modify
label def competitivb 2 "40-", modify
label def competitivb 3 "45-", modify
label def competitivb 4 "50-", modify
label def competitivb 5 "55-", modify
label def competitivb 6 "60-", modify
label def competitivb 7 "65-", modify
label def competitivb 8 "70-", modify
label def competitivb 9 "75-", modify
label def competitivb 10 "80-", modify
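
A minimal sketch of one possible merge, assuming the using data has one row per state-district once exact duplicates are dropped (usingdata.dta is a hypothetical file name):

Code:
preserve
use usingdata, clear
duplicates drop state_fips district, force
rename state_fips cd_statefip
tempfile districts
save `districts'
restore
merge m:1 cd_statefip district using `districts'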

Is there a way to make Python stop on error when running PyStata?

Suppose I have the following (toy) code
Code:
import stata_setup
stata_setup.config("C:/Program Files/Stata17", "mp")

for i in range(1000):
 %stata replace y=2
and I run the whole thing in Python. Suppose Stata throws an error because variable y does not exist. If this were run in Stata itself, the error would stop the program, and I would be happy. But in Python, the loop will be executed all 1000 times! Is there any way to make it stop on its own?

Create empty graph in a loop

Dear Statlisters,
I have a series of graphs made in a loop which are then combined using grc1leg (a wrapper for graph combine). A number of these combined graphs need to be compared. When running the loop, if a graph has no observations, Stata throws an error and stops. This can be circumvented by prefixing the catplot command with capture. A further error then occurs when trying to save the graph that has no data. The potential problems are (1) failure to save anything if no graph is in memory, so the code stops, or (2) saving the last graph left in memory, so the code continues with the wrong graph. The first of these can also be circumvented by introducing an if condition that counts the number of non-missing data points and opts not to save when there is no graph. However, there is then no file to combine with grc1leg.

For ease of coding, reading, and comparability, I can manually remove the missing graph's code and introduce a hole into the grc1leg matrix where the missing graph would be, but this is a suboptimal fix. Instead, it would be easier if the original graph were saved as an empty graph with no data points, but I am unclear how to ask Stata to do this (if indeed it can).

initial code:
Code:
foreach var of local systemic {
    colorpalette red yellow, ipolate(4) nograph
    capture catplot Overall_`var' , over(Armcode) stack asyvars  bar(2, col("`r(p4)'")) bar(3, col("`r(p3)'")) bar(4, col("`r(p2)'")) bar(5, col("`r(p1)'")) percent graphregion(color(white)) bgcolor(white) allcategories legend(order(2 3 4))
    graph save "`var'.gph", replace
}
grc1leg  "1.gph" "2.gph" "3.gph" "4.gph" "5.gph" "6.gph" "7.gph" "8.gph"  graphregion(color(white))
graph save "combine.gph", replace
with if command:
Code:
foreach var of local systemic {
    colorpalette red yellow, ipolate(4) nograph
    capture catplot Overall_`var' , over(Armcode) stack asyvars  bar(2, col("`r(p4)'")) bar(3, col("`r(p3)'")) bar(4, col("`r(p2)'")) bar(5, col("`r(p1)'")) percent graphregion(color(white)) bgcolor(white) allcategories legend(order(2 3 4))
    qui count if !(missing(Overall_`var'))
    if(`r(N)'>0) {
        graph save "`var'.gph", replace
    }
}
manual solution - removing the problematic graph eg 1.gph and introducing a hole
Code:
foreach var of local systemic {
    colorpalette red yellow, ipolate(4) nograph
    capture catplot Overall_`var' , over(Armcode) stack asyvars  bar(2, col("`r(p4)'")) bar(3, col("`r(p3)'")) bar(4, col("`r(p2)'")) bar(5, col("`r(p1)'")) percent graphregion(color(white)) bgcolor(white) allcategories legend(order(2 3 4))
    graph save "`var'.gph", replace
}
grc1leg  "2.gph" "3.gph" "4.gph" "5.gph" "6.gph" "7.gph" "8.gph"  graphregion(color(white)) holes(1)
graph save "combine.gph", replace
Any help gratefully received.
Rob
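
One possible workaround, sketched below with the same macro names as the code above: when a variable has no non-missing values, save a deliberately empty placeholder graph so that grc1leg still has a file for that position.

Code:
qui count if !missing(Overall_`var')
if r(N) > 0 {
    * ... catplot call as above ...
}
else {
    * draw an empty plot region as a placeholder
    twoway scatteri 1 1, msymbol(none) ylabel(none) xlabel(none) ///
        ytitle("") xtitle("") note("no data") graphregion(color(white))
}
graph save "`var'.gph", replace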

Changing an estimation stored macro with a previously stored macro

I am trying to change an estimation stored macro with the stored result from the previous estimation. I manage to do so with `e(cmd)' but not `e(inexog)' below.

Code:
. sysuse auto, clear
(1978 Automobile Data)

. 
. qui ivreg2 price (weight=length) i.turn headroom, cluster(turn) small

. display e(inexog)
31b.turn 32.turn 33.turn 34.turn 35.turn 36.turn 37.turn 38.turn 39.turn 40.turn 41.turn 42.t
> urn 43.turn 44.turn 45.turn 46.turn 48.turn 51.turn headroom

. local x = "`e(inexog)'"

. display "`x'"
31b.turn 32.turn 33.turn 34.turn 35.turn 36.turn 37.turn 38.turn 39.turn 40.turn 41.turn 42.t
> urn 43.turn 44.turn 45.turn 46.turn 48.turn 51.turn headroom

. 
. cap prog drop pretend_to_be_ivreg2

. prog pretend_to_be_ivreg2, eclass
  1.         ereturn local cmd = "ivreg2"
  2.         ereturn local inexog = "`x'"
  3. end

. display "`x'"
31b.turn 32.turn 33.turn 34.turn 35.turn 36.turn 37.turn 38.turn 39.turn 40.turn 41.turn 42.t
> urn 43.turn 44.turn 45.turn 46.turn 48.turn 51.turn headroom

. 
. qui ivreghdfe price (weight=length) headroom, absorb(turn) cluster(turn) // coeff/SE: 4.509
> 739   .7761883, Fstat: 69.125

. display e(cmd)
ivreghdfe

. display e(inexog)
headroom

. pretend_to_be_ivreg2

. display e(cmd)
ivreg2

. display e(inexog) // DOES NOT WORK
.

.
I only managed to get `e(inexog)' to return the desired list of variables by hard-coding, within the program, ereturn local inexog = "...." (the list of variables). I would like to do this automatically by referring to a local macro, as I tried above.

Code:
. cap prog drop pretend_to_be_ivreg2

. prog pretend_to_be_ivreg2, eclass
  1.         ereturn local cmd = "ivreg2"
  2.         ereturn local inexog = "31b.turn 32.turn 33.turn 34.turn 35.turn 36.turn 37.turn
>  38.turn 39.turn 40.turn 41.turn 42.turn 43.turn 44.turn 45.turn 46.turn 48.turn 51.turn he
> adroom"
  3. end

. qui ivreghdfe price (weight=length) headroom, absorb(turn) cluster(turn) // coeff/SE: 4.509
> 739   .7761883, Fstat: 69.125

. display e(cmd)
ivreghdfe

. display e(inexog)
headroom

. pretend_to_be_ivreg2

. display e(cmd)
ivreg2

. display e(inexog) // NOW, IT WORKS
31b.turn 32.turn 33.turn 34.turn 35.turn 36.turn 37.turn 38.turn 39.turn 40.turn 41.turn 42.t
> urn 43.turn 44.turn 45.turn 46.turn 48.turn 51.turn headroom

.
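
A minimal sketch of one possible fix: a local defined at the command line is not visible inside a program, because programs have their own local scope, so one option is to pass the saved list to the program as an argument and pick it up via `0':
Code:
sysuse auto, clear
qui ivreg2 price (weight=length) i.turn headroom, cluster(turn) small
local x "`e(inexog)'"                    // save the list before it gets replaced

cap prog drop pretend_to_be_ivreg2
prog pretend_to_be_ivreg2, eclass
        // `0' holds everything typed after the program name
        ereturn local cmd "ivreg2"
        ereturn local inexog "`0'"
end

qui ivreghdfe price (weight=length) headroom, absorb(turn) cluster(turn)
pretend_to_be_ivreg2 `x'                 // hand the saved list to the program
display e(cmd)
display e(inexog)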

Discrete Choice Experiment - Fractional factorial design

Dear collegues,

As a PhD student, I want to set up a Discrete Choice Experiment in a consumer survey. The main objective is to measure consumers' willingness to pay for multi-labelled seafood products. Following focus groups and expert opinion, my attributes and their corresponding levels will be the following:
Attribute: Levels
Price: market price; +10%; +20%; +30%
Labels: none; ecolabel; health claims; both labels
Production method: wild cod; raised salmon; wild seabass; raised seabass

My full factorial design consists of 64 attribute combinations (4*4*4). To generate it, I used the following code:

Code:
matrix attrib = 4,4,4
genfact, levels (attrib)
list, separator (4)
I obtained a table presenting all the combinations.

I therefore have to select a fractional factorial design for my survey. To do so, I used the following code:

Code:
rename x1 price
rename x2 labels
rename x3 species

recode price (1=0) (2=1) (3=2) (4=3)
recode labels (1=0) (2=1) (3=2) (4=3)
recode species (1=0) (2=1) (3=2) (4=3)

******* D-efficiency test *******

matrix b = J(1,9,0)
dcreate i.price i.labels i.species, nalt(2) nset(12) bmat(b)
I intend to ask each individual in my survey to choose between two options plus one opt-out option. Each individual will make this choice three times.


I don't understand how to select the fractional design with the highest D-efficiency. Each time I run the last two lines of my code, the D-efficiency score changes, and I am confused about the procedure to follow. I also haven't introduced the opt-out into my design yet.


Sorry for these questions; I am just starting to work with DCEs.

If you need more information to clarify my setup, I remain available.

Best Regards

Jean-François DEWALS
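
One likely reason the D-efficiency changes on every run is that dcreate starts its search from a random candidate design; fixing the random-number seed beforehand should make the result reproducible, e.g.:
Code:
set seed 12345            // any fixed seed makes the dcreate search reproducible
matrix b = J(1,9,0)
dcreate i.price i.labels i.species, nalt(2) nset(12) bmat(b)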


Tuesday, January 24, 2023

10000 posts by William Lisowski

Congratulations William Lisowski on reaching the milestone of 10,000 posts on Statalist! Your contributions have greatly enriched the community by providing valuable insights, helpful advice, and interesting perspectives on various topics. Your dedication and willingness to share your knowledge and experience have made the forum a better place for everyone. Thank you for being an active and engaged member of the community, and we look forward to reading your future contributions. Your hard work is appreciated.

Need some help dealing with duplicates

Hi all, I need some help with dealing with duplicates in my data.

I have something like this:
var1 var2 var3 var4
a x Red 1
a x Green 2
b y Red 3
b y Green 4

This is a simplified version of my data. Most observations have a unique var1/var2 combination, which I intend to use for merging data. But for some of them there are different values of var3, which splits them into separate rows.
What I want to achieve is to add up var4 when var1 and var2 are the same, and drop every row where var3 != "Red".

So the effect I want is:
var1 var2 var3 var4
a x Red 3
b y Red 7

I have a feeling I should use a loop, but I'm not sure where to start.

Can anyone help please?
Thank you sooooo much!!!
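
No loop is needed here; a minimal sketch using the example variable names (total var4 within var1/var2, then keep only the Red rows):
Code:
bysort var1 var2: egen var4_sum = total(var4)   // sum over all colours in the group
keep if var3 == "Red"                           // drop the non-Red rows
drop var4
rename var4_sum var4
list var1 var2 var3 var4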

Mediation effects with MLM?

I'm trying to do a somewhat tricky mediation test. Basically, I'm looking at the relationship between parental education (4 levels) and the risk of self-harm (binary) in college students, and I want to know if the "academic rank" of the student's college explains part of this association. For those curious, academic rank is measured by the NCES-Barron's Admissions Competitiveness Index. Here, I'm testing mediation by comparing the bottom category (High school degree) to the top (Advanced degree).

I need to adjust for clustering by school but also account for the fact that the mediator is school-specific, not student-specific. Here's what I originally tried:

Code:
svyset schoolID [pw=nrweight] // students are nested in schools, and nrweight adjusts for nonresponse based on each school's demographics

local cov "i.survey_year i.gender i.race age_10cat"

* in the following, sib_any is the outcome, and it stands for self-injurious behavior
* inst_acarank is a 6-level ordinal measure of academic rank (basically, how hard it is to get into that university)
* maxed4_merged is a 4-level measure of parental education
* below, subpop(deg_bach) restricts my test to bachelor's degree students

gen sib_any2 = sib_any
svy, subpop(deg_bach): gsem (sib_any <- i.maxed4_merged `cov', logit) ///
           (sib_any2 <- i.maxed4_merged `cov' i.inst_acarank, logit) ///
           if inst_acarank!=.

                                      
margins, dydx(4.maxed4_merged) post
mlincom 1-2
But I'm thinking I need to do something that actually addresses the multilevel nature of the data. Something like:

Code:
eststo m2: mixed sib_any i.maxed4_merged i.gender i.race i.survey_year age_10cat [pw=nrweight] ///
           || inst_acarank: || schoolID_new:
eststo m1: mixed sib_any i.maxed4_merged i.gender i.race i.survey_year age_10cat [pw=nrweight] || ///
           schoolID_new: if e(sample)
... and then compare the coefficient on 4.maxed4_merged (High school vs. Advanced degree). Btw, I don't know if I need to say
Code:
 ,vce(cluster schoolID)
for these mixed effects models.

What do you think is the best approach? I'm open to all possibilities.

Very large T- Statistics

Hi Stata Community, I'm running some regressions using a fixed-effects methodology. The issue is that my regression output shows very large t statistics for some of the variables. I would like your thoughts on whether t statistics above 100 are plausible or whether there is an issue I need to fix. I also noticed that the F test is not reported and only a dot is shown. Could anyone explain what the issue is and how it can be solved?
Thanks a lot for your help in advance.

Code:
xtreg data38350 w_l_data4024 w_l_asset w_l_net_int_margin w_l_fee_income wholefund w_l_non_int_exp w_l_mktshare RESTINC_rest RESTDEC_rest, fe cluster(Country)
Code:
Fixed-effects (within) regression               Number of obs      =     52406
Group variable: Bvdidnum                        Number of groups   =      8897
 
R-sq:  Within  = 0.0314                         Obs per group: min =         1
       Between = 0.0540                                        avg =       5.9
       Overall = 0.0462                                        max =        17
 
                                                F(8,10)            =         .
corr(u_i, Xb)  = -0.4073                        Prob > F           =         .
 
                                (Std. err. adjusted for 11 clusters in Country)
-------------------------------------------------------------------------------
              |               Robust
    data38350 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
 w_l_data4024 |   .3722737    .001208   308.18   0.000     .3695822    .3749653
    w_l_asset |  -4.095435   .0018257 -2243.23   0.000    -4.099503   -4.091367
w_l_net_int~n |  -.0930606   .0012431   -74.86   0.000    -.0958303   -.0902909
w_l_fee_inc~e |   .2234046   .0134468    16.61   0.000     .1934434    .2533658
    wholefund |  -1.636517   .8602466    -1.90   0.086    -3.553265    .2802323
w_l_non_int~p |   .0662385   .0014957    44.29   0.000     .0629059     .069571
 w_l_mktshare |   .4508283   .0805129     5.60   0.000     .2714344    .6302223
 RESTINC_rest |   .0947936   .0000537  1765.90   0.000      .094674    .0949132
 RESTDEC_rest |   .0178919   .0015936    11.23   0.000     .0143413    .0214426
        _cons |   32.68893   .0093108  3510.87   0.000     32.66818    32.70967
--------------+----------------------------------------------------------------
      sigma_u |  9.7930902
      sigma_e |  4.4650662
          rho |  .82789564   (fraction of variance due to u_i)
-------------------------------------------------------------------------------

Calculate economic significance for censored (Tobit) regression with multiple imputation

Hi,

I'm trying to calculate economic significance from a censored (Tobit) regression with multiple imputation according to this definition:

"Economic significance is indicated by the predicted change in the dependent variable from a movement from the tenth to the ninetieth percentile in the independent variable as a percentage of the mean of the dependent variable".



So far I have the following code (with y being my dependent variable, and x being my independent variable(s)):

mi set wide                     // or mlong/flong; mi register requires the data to be mi set first
mi register imputed y x
mi impute mvn y x, add(10)
mi estimate, cmdok post: tobit y x, ll(0) ul(1) vce(robust)



I have tried using the following code (which works without mi estimate) immediately after the above:

sum y, meanonly
local y_m = r(mean)             // store the mean under the name used in the margins expression
margins, expression(predict(ystar(0,1))/`y_m') at((p10) x) at((p90) x) contrast(atcontrast(ar._at) marginswithin) atmeans



But I get the error:

e(sample) does not identify the estimation sample



Any help would be very much appreciated! :-)


Best regards,
Valdemar

creating table following multiple imputation with svy suite

Hello.

Is it possible to create tables using the "collect" series of commands with mi estimate: svy, subpop(subpopulation): logit x a b c d?

I use the syntax below without the imputation and it works; however, it did not generate the table I wanted when I used it after multiple imputation.

Thank you.

"quietly: collect _r_b _r_se, tag(model[(1)]):svy, subpop (lostemp2008_V7Chp3): logit lostemp0810 i.age5cat08 female i.race4 i.educ4 married08 r10iwendm"
Note: I have 5 models, so I run this command 4 more times for the other models, and at the end I get a single table with columns for each model.

Monday, January 23, 2023

Computing a cumulative score

Hi,

I have the following sample of data (attached as an image). I have an indicator of 1s and 0s in one column, and I want to compute the sum of consecutive 1s (i.e., the cumulative column) for each ID. How can I do that in Stata?
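
Because the sample data are only attached as an image, the sketch below uses hypothetical variable names (id for the identifier, time for the sort order, flag for the 0/1 indicator); the running total is reset whenever the indicator is 0:
Code:
* hypothetical names: id = identifier, time = sort order, flag = 0/1 indicator
bysort id (time): gen run = flag if _n == 1
by id: replace run = cond(flag == 1, run[_n-1] + 1, 0) if _n > 1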

Adding Number of Observations Label to Stacked Bar Chart

Hi all,

I'm trying to make a stacked bar chart that shows the number of observations within each group. My current code looks something like this:

graph bar lab risk gov, over(lt) ytitle("Proportions") ylabel(,labsize(small)) stack

But I would like there to be a label above each bar that shows the number of observations for that particular bar. Using the following lines:

sort lt
by lt: count

I see that there are 679, 678, and 680 observations in the three groups. Is there a way to show these counts in the chart?

Many thanks,
Cora

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float(lab gov risk lt)
    .1700374    .7782796             . 1
   .43530655    .2523498 -.00009400777 1
    .2423384  .035211213    -.27288932 2
   .14841408   1.0353048   -.003870568 1
    .3139711   .01862362     -.3500698 2
   .28238416   .07640956        -.2502 2
    .4400993    .0670787 -.00056580675 1
    .4450542   .23286644             . 1
   .12029877   .06753071     -.3343936 1
    .5082648  .008052898  -.0031476584 1
   .51327187   .04179808   -.001943114 2
    .4545794    .1239085    -.24128672 1
    .5322632  .013264816             . 1
    .4979768  .023063127     -.1731898 1
    .3181508   .06416951    -.22944584 3
   .13223591    .7514665    -.08747536 2
     .243767           0     -.3631753 2
    .3984841   .09781377    -.25557485 2
    .2594303    .6841453    -.02333964 1
 .0002424968   1.3886087 -.00016629793 1
.00020155523   1.4744323             . 1
 .0002928973    1.482304             . 1
   .25241518    .7487788             . 1
    .5009559   .09423032             . 1
   .22404113   .17477986     -.3677354 2
   .24345666           0     -.3581009 1
    .1634283   .10225035     -.3377799 2
   .50319344   .06749732  -.0008835515 1
    .3006188  .008035306    -.26575184 3
   .12657124    .6291607    -.28722692 1
   .53362954           0             . 1
    .4184705   .01314309     -.2038869 1
    .5343853  .015596353             . 1
    .3781637           0     -.3169422 1
   .50177145   .08583656             . 1
    .3970145   .27546874     -.1764103 1
  .025122266    .7652866     -.0876677 1
    .5389655           0             . 1
    .4668964   .10456214    -.17507926 1
   .53362954           0             . 1
   .30558315   .05589489     -.1812089 2
   .04403041    .3396863     -.3529391 2
    .3763588    .4286624             . 1
    .2673638    .7169658             . 1
   .26665908    .4362247    -.22927435 1
     .264844   .10131685     -.2486496 2
    .3497822   .03381969    -.12766492 2
    .4957976   .05695531             . 1
     .454442           0             . 1
    .4964611           0     -.1172438 2
   .14419264   .21328416     -.2944041 3
  .017567163   1.1927178    -.12055212 1
   .23156355           0     -.3671024 1
   .24158622      .26593     -.3394998 2
    .4820203   .04930139    -.06837179 1
         .54           0             . 1
    .2487957   .21417157    -.13990873 1
    .5003201  .036252417     -.1072251 1
   .04380105    1.280126             . 1
    .3979877   .01832301    -.04471144 1
   .18135896    .5702492   -.006220731 1
   .18662126           0     -.1729077 3
    .1840557   .50763726    -.27756977 1
   .41636425           0    -.25866598 2
    .4459521           0    -.02916088 1
    .3315545           0     -.3131432 1
   .10622215    .5709492             . 1
    .2908345  .031559385    -.15997745 2
    .4678009   .04758657    -.15388933 1
    .4751051   .10665444             . 1
   .09704377    .4514135     -.3623746 2
   .27230695    .4651364    -.05074546 1
    .5260047   .03887583             . 1
   .43815535    .0918106    -.23629928 1
    .3128505    .2115928    -.12426014 1
    .4359849    .0975452    -.13636968 1
    .4457981  .036428157    -.17247716 1
    .5343853  .015596353             . 1
    .4035726    .1919202    -.19471768 1
    .5160003   .01278616   -.005513604 2
    .3007862   .07709143    -.27725396 1
    .4477225   .25632626             . 1
    .2314591 .0040478515    -.29092264 2
     .397527  .006521621    -.20152047 1
    .4879967           0             . 1
    .2080003    .9212269             . 1
         .54           0             . 1
    .4816735   .13781506    -.02348751 1
    .4440349           0    -.08713304 1
    .4171881   .16578023    -.14507967 1
    .4358414    .2036193             . 1
   .42539865   .04657824    -.12620062 2
    .3682399    .4724361             . 1
    .3041541    .2263215     -.2994768 1
    .2044367    .7129219  -.0004181319 1
   .44044185   .13565966    -.15355474 1
    .3315545           0     -.3131432 1
    .2187026   .10570543     -.3050115 1
      .47968    .1424138             . 1
     .367069   .28755468     -.2607651 1
end
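
One possible workaround, sketched with the posted variables: fold each group's observation count into the value label used by over(), so the count appears with each bar's axis label rather than floating above the bar:
Code:
levelsof lt, local(groups)
foreach g of local groups {
    quietly count if lt == `g'
    label define ltN `g' "`g' (N=`r(N)')", modify
}
label values lt ltN
graph bar lab risk gov, over(lt) ytitle("Proportions") ylabel(, labsize(small)) stack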

Dates in Stata

I am trying to convert some dates to quarters. I am doing this by going into the Variables Manager and changing the format to quarterly, as seen in the screenshot I've attached. When I do that, though, the result is still not usable (e.g. 6973-Q1) instead of a format I can work with (e.g. 2021q1). What am I doing wrong?
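
Changing the display format only relabels the underlying number; a daily date first has to be converted to a quarterly date. A minimal sketch, assuming the existing daily (%td) variable is called mydate (a hypothetical name):
Code:
gen qdate = qofd(mydate)    // days since 01jan1960 -> quarters since 1960q1
format qdate %tq            // now displays as, e.g., 2021q1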


Dynamic Panel Equation

Hello
I am trying to build a Dynamic Panel Equation with the following information
- Variables: Y, X, and C (controlling variable)
- Time t1, t2, t3, ..., tn

At any time t, I propose the following possible relationships. Let t=t2
X(t2) → Y(t2)
Y(t1) → Y(t2)
Y(t1) → X(t2)

[image attachment omitted]

Taking the information above, the model is:

Y(t) = a + b*Y(t-1) + c*X(t) + d*X(t-1) + Σθ*C(t)

Where:
- C(t): are controlling variables at time t

I have two questions:
1. Is the model above correct?
2. If the model above is correct, then are the instrumental variables (following Anderson and Hsiao, 1981):
- IV1: Y(t-1) - Y(t-2)
- IV2: X(t-1) - X(t-2)

Is this correct?

Thank you
Phan

Sorting data

Hello,

I have the following data below:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(ID start end)
2 18080 18747
2 20702 21335
2 18748 23011
2 21336 23011
end
format %td start
format %td end
Within IDs, I want to order the data so that each previous end date is one day less than the next start date. The data should look like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(ID start end)
2 18080 18747
2 18748 23011
2 20702 21335
2 21336 23011
end
format %td start
format %td end
If this cannot be achieved within ID, then I would like to create separate groups for those cases.

I would appreciate any assistance.

Thanks,
Anoush K.
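
A minimal sketch using the posted variables: sorting on start within ID reproduces the desired order in this example, and a flag for breaks in the end+1 = start chain can be used to build the separate groups mentioned at the end:
Code:
sort ID start                                             // ascending start dates within ID
bysort ID (start): gen byte chain_break = _n > 1 & start != end[_n-1] + 1
by ID: gen spell = sum(chain_break) + 1                   // new group wherever the chain breaks
list, sepby(ID)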

Sunday, January 22, 2023

Graphing Functions

Hello, any idea on how to graph the following function:
y=35+2x^2-ln(x*3)

I tried this:
twoway function y=35+2x^2-ln(x^3), range (0 1)
error in expression: 35+2x^2-ln(x^3)

Thanks in advance
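
A hedged correction of the expression: Stata needs an explicit multiplication sign, and the range has to start above zero so that the logarithm is defined (the sketch uses ln(x^3), as in the attempted code):
Code:
twoway function y = 35 + 2*x^2 - ln(x^3), range(0.01 1)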

stpm2cr - Maximum Number of Iterations Reached

Hi everyone,

I'm using -stpm2cr- to model the effect of a treatment on a set of cardiovascular outcomes over a three year period. I can plot the cumulative incidence figures without any trouble. However, when I attempt to use the same predict code to obtain point estimates and confidence intervals of time points along the way (1 month, 1 year, 2 years, and 3 years), I occasionally get an error that says:

-Maximum number of iterations exceeded.-

Once the error hits, the code stops and I can do no more.

I can't find any reference to this in the documentation and Google isn't turning up much. The estimates must be in there somewhere for the plots to be generated, so I'd welcome suggestions on how to address this.

Thanks!

Line graph of percent of frequencies within categories instead of bar graph

Dear statalist,

I have a dataset of skin cancer cases over ten years. I want to plot a line graph (instead of a bar graph) of the percentage of frequencies within categories. Unfortunately, I cannot find a way to do this in Stata. Any help would be appreciated.

Thank you,
Abdullah

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int year
2020
2017
2019
2016
2019
2020
2019
2018
2020
2015
end
label values year labels5
label def labels5 2015 "2015", modify
label def labels5 2016 "2016", modify
label def labels5 2017 "2017", modify
label def labels5 2018 "2018", modify
label def labels5 2019 "2019", modify
label def labels5 2020 "2020", modify
[image attachment omitted]
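
One possible sketch with the posted year variable: collapse the data to percentages first with contract, then draw the line:
Code:
preserve
contract year, percent(pct)                     // one row per year with its percentage of cases
twoway connected pct year, ytitle("Percent of cases") xtitle("Year")
restore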

Interpreting Log transformed ITSA model

This may be not so much a programming question as an interpretation question.

I have a sample dataset as follows:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str8 date byte month_num int year byte seq float(oc_per10000 ec_per10000) byte period
1 "Jan 2022"  1 2022  1  632.7574 4.2058268 0
1 "Feb 2022"  2 2022  2 572.58765 4.0631905 0
1 "Mar 2022"  3 2022  3   633.962  4.245679 0
1 "Apr 2022"  4 2022  4 605.58405 3.6853316 0
1 "May 2022"  5 2022  5  616.6671 4.1147165 0
1 "Jun 2022"  .    .  .         .         . .
1 "Jul 2022"  7 2022  6  596.7175  5.263053 1
1 "Aug 2022"  8 2022  7  620.8705  5.038163 1
1 "Sep 2022"  9 2022  8  571.1929 4.3138437 1
1 "Oct 2022" 10 2022  9  577.1292  4.168524 1
1 "Nov 2022" 11 2022 10  560.5041 4.1477256 1
2 "Jan 2022"  1 2022  1  632.7574 4.2058268 0
2 "Feb 2022"  2 2022  2 572.58765 4.0631905 0
2 "Mar 2022"  3 2022  3   633.962  4.245679 0
2 "Apr 2022"  4 2022  4 605.58405 3.6853316 0
2 "May 2022"  5 2022  5  616.6671 4.1147165 0
2 "Jun 2022"  .    .  .         .         . .
2 "Jul 2022"  7 2022  6  596.7175  5.263053 1
2 "Aug 2022"  8 2022  7  620.8705  5.038163 1
2 "Sep 2022"  9 2022  8  571.1929 4.3138437 1
2 "Oct 2022" 10 2022  9  577.1292  4.168524 1
2 "Nov 2022" 11 2022 10  560.5041 4.1477256 1
end
Code:
* Declare data to be time-series data
tsset id monthly

* sort by ID and time
sort id monthly

di monthly("2022m7","YM")

*log for percentage point change;
gen logec_per10000=ln(ec_per10000)
gen logoc_per10000=ln(oc_per10000)

*ITSA initial model;
itsa ec_per10000, single treat(1) trperiod(750) posttrend replace
*test for autocorrelation;
actest ,lags(9) 
*ITSA final model;
itsa ec_per10000, single treat(1) trperiod(750) posttrend replace force
*ITSA Log;
itsa logec_per10000, single treat(1) trperiod(750) posttrend fig replace force
My question is how to interpret the coefficients on the log transformed itsa for _t (pre-trend), _b[_t]+_b[_x_t2022m7] (post-trend), and _x_t2022m7 (change in trend).
  1. When you log transform the dependent variable, is taking the natural log the appropriate method?
  2. Do I have to further exponentiate the coefficients to interpret it as a monthly growth rate? (i.e., (exp(coef)-1)*100)?
Thanks,
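
On question 2, a sketch using the coefficient names quoted above: after the log-scale itsa, a trend coefficient b translates into an approximate monthly growth rate of (exp(b)-1)*100 percent, which can be computed directly from the stored results:
Code:
* run immediately after the log-scale itsa call above
display "pre-trend monthly growth (%)  = " (exp(_b[_t]) - 1)*100
display "change in trend at 2022m7 (%) = " (exp(_b[_x_t2022m7]) - 1)*100
lincom _b[_t] + _b[_x_t2022m7]
display "post-trend monthly growth (%) = " (exp(r(estimate)) - 1)*100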

Why line, bar, and pie plot dominate publications

Dear Stata users,

There are so many plot types in the statistical world: line, bar, pie, box, histogram, violin, radar, mosaic, and diagnostic plots. More and more new types are being created: heat plots, joy plots, alluvial plots, and upset plots, just to name a few. However, line, bar, and pie charts still dominate publications, whether academic papers or popular research reports, especially in social science. Why?

Regression output featuring a period for one variable

I'm using logit regression to link state-level policies with an individual's probability of reemployment. As controls, I include a handful of state-by-month labor market measures. As in prior work that used the same data and explored the same outcome, the model also controls for squared and cubed values of these labor market slack measures: unemployment rate, expressed as a proportion (e.g., 0.065); unemployment rate^2 (0.004225); and unemployment rate^3 (0.000275). Some of the unsquared proportions are rather small, which means the squared and cubed values are even smaller. I assume that's why the regression output for those variables shows a dot, as below (see initrate3). Is there anything to be concerned about here? Should I remove the variable for which a dot shows up?

Code:
                      ur_sa |   8.80e-10   1.12e-08    -1.64   0.100     1.38e-20    56.01906
                     ur2_sa |   2.56e+64   2.57e+66     1.47   0.141     4.65e-22    1.4e+150
                     ur3_sa |   5.2e-175   1.3e-172    -1.56   0.119            0    1.28e+45
                        iur |   .1303132   .9846071    -0.27   0.787     4.83e-08      351892
                       iur2 |   1.09e+14   8.32e+15     0.42   0.673     7.56e-52    1.56e+79
                       iur3 |   3.96e-41   8.42e-39    -0.44   0.661     6.3e-222    2.5e+140
                   initrate |   .0017314    .083949    -0.13   0.896     9.25e-45    3.24e+38
                  initrate2 |   1.1e-221   3.5e-218    -0.16   0.872            0           .
                  initrate3 |          .          .     0.19   0.849            0           .
                  empgrowth |   1.156185   .0878403     1.91   0.056     .9962262    1.341828
                       emp2 |   .9770312   .0225311    -1.01   0.314     .9338542    1.022205
                       emp3 |   1.002295   .0017089     1.34   0.179     .9989514     1.00565

Filling in missing data from previous values?

Below I have included an example of a wide dataset in which children have ages reported at each wave (e.g., the wave 5 age is k5agebg). In some instances age is set to .m, but we know that interviews are 2 years apart. How can I make the .m values of my age variables equal to the next wave's value minus 2? In the example, k11agebg should be 35 on the first observation and 26 on the second, and k10agebg should be 33 on the first observation and 24 on the second. I've done similar things with panel data, but am a bit stumped on solving this in the wide format.

Code:
. list hhidpn kidid k1age k2agebg k3agebg k4agebg k5agebg k6agebg k7agebg k8agebg k9agebg k10agebg k11agebg k12agebg kabyearbg if hhidpn == 920672010 , sepby(hhidpn)

        +------------------------------------------------------------------------------------------------------------------------------------------------------------+
        |    hhidpn        kidid   k1age   k2agebg   k3agebg   k4agebg   k5agebg   k6agebg   k7agebg   k8agebg   k9agebg   k10agebg   k11agebg   k12agebg   kabyea~g |
        |------------------------------------------------------------------------------------------------------------------------------------------------------------|
128812. | 920672010   9206720101       .         .         .         .         .         .         .         .         .         .m         .m         37       1977 |
128813. | 920672010   9206720102       .         .         .         .         .         .         .         .         .         .m         .m         28       1986 |
        +------------------------------------------------------------------------------------------------------------------------------------------------------------+
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input long hhidpn str10 kidid byte(k1age k2agebg k3agebg k4agebg k5agebg k6agebg k7agebg k8agebg k9agebg k10agebg k11agebg k12agebg) int kabyearbg
920672010 "9206720101" . . . . . . . . . .m .m 37 1977
920672010 "9206720102" . . . . . . . . . .m .m 28 1986
end
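
A minimal sketch using the variable names in the extract: walk backwards across the waves and fill an .m age with the next wave's age minus 2 whenever that next value is observed (the wave-1 variable k1age is left alone here):
Code:
forvalues w = 11(-1)2 {
    local next = `w' + 1
    replace k`w'agebg = k`next'agebg - 2 if k`w'agebg == .m & !missing(k`next'agebg)
}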

Table1 command issue

Hello, when I write this command: table1, by (pes) vars(edad conts \ par_cat cat \ hta cat \ preprev cat \ diabm cat \ diabg cat \ imc conts) \ gadg conts \ edadpar conts ), Stata responds "varlist required" and I don't understand why. All of those variables have already been created.
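
The likely culprit is the parentheses: vars() is closed after "imc conts", so "gadg conts" and "edadpar conts" sit outside the option and Stata then looks for a varlist. A sketch of the corrected call:
Code:
table1, by(pes) vars(edad conts \ par_cat cat \ hta cat \ preprev cat \ diabm cat \ diabg cat \ imc conts \ gadg conts \ edadpar conts)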

Regression analysis with panel data

Hello Everyone,

I have a panel dataset with approximately 5,300 observations and I am analysing whether a firm's ESG score has a significant influence on its stock price, including several control variables.
Therefore, I use the following regression command:

reghdfe prcc_f ESG_score epspx bkvlps size leverage rd market_to_book invest_a, absorb(Ticker Year) vce(robust)

Does this command make sense, and how do I interpret it correctly? (With vce(robust) I want to address the problem of heteroskedasticity.)

Thank you very much in advance!

cannot run GMM

Dear Statalist,


I am using Stata 14. I have unbalanced panel data with T = 17 and N = 18. My main concern is reverse causality from two of my control variables (z1 and z2) and my main independent variable (m1). When I run two-step system GMM, the results change dramatically, eroding the significance of most, if not all, of the variables. However, the model is valid according to the AR(2), Sargan, and Hansen tests, as follows:
Code:
xtabond2 lny l.lny lnz1 x1 x2 x3 x4 x5 m1 m1sq m1_z2 z2 x6 x7, ///
gmm (l.lny , lag(1 3) collapse) iv ( lnz1 x1 x2 x3 x4 x5 m1 m1sq m1_z2 z2 x6 x7 ) twostep cluster(country) nodiffsargan



Dynamic panel-data estimation, two-step system GMM
------------------------------------------------------------------------------
Group variable: country                    Number of obs = 177
Time variable : time                          Number of groups = 17
Number of instruments = 17            Obs per group: min = 1
Wald chi2(13) = 1.40e+10               avg = 10.41
Prob > chi2 = 0.000                           max = 16
(Std. Err. adjusted for clustering on country)
------------------------------------------------------------------------------
| Corrected
lny |             Coef.     Std. Err.      z         P>|z|       [95% Conf. Interval]
-------------+----------------------------------------------------------------
lny|
L1. |        1.041343   .0277953  37.46    0.000      .9868654  1.095821
|
lnz1 |      .0005234   .0033601    0.16    0.876    -.0060623 .007109
x1|          .0000164   .0000536    0.31   0.760    -.0000887 .0001214
x2|          .0000259   .0001718    0.15   0.880    -.0003108 .0003625
x3|          .0002729   .0005714    0.48   0.633    -.0008469 .0013927
x4|         -.0000124   .0000428   -0.29   0.772    -.0000962 .0000714
x5|         -.0001993   .0002734   -0.73   0.466    -.0007351 .0003365
m1|         .0052374   .0120197    0.44    0.663    -.0183208 .0287955
m1sq |   -.0018499   .0028094   -0.66      0.510    -.0073563 .0036564
m1_z2 | -.0002276   .0034992   -0.07       0.948    -.0070859 .0066308
z2|         -.0019458   .0062422   -0.31   0.755    -.0141803 .0102886
x6|         -.002183     .0010165   -2.15  0.032    -.0041753 -.0001908
x7|          .0000406   .0000986    0.4    0.680    -.0001525 .0002338
_cons | -.1274088   .1413448    -0.90      0.367    -.4044396 .149622
------------------------------------------------------------------------------
Instruments for first differences equation
Standard
D.(lnz1 x1 x2 x3 x4 x5 m1 m1sq m1_z2 z2 x6 x7)
GMM-type (missing=0, separate instruments for each period unless collapsed)
L(1/3).L.lny collapsed
Instruments for levels equation
Standard
lnz1 x1 x2 x3 x4 x5 m1 m1sq m1_z2 z2 x6 x7
_cons
GMM-type (missing=0, separate instruments for each period unless collapsed)
D.L.lngini collapsed
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z = -3.09 Pr > z = 0.002
Arellano-Bond test for AR(2) in first differences: z = -0.89 Pr > z = 0.375
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(3) = 0.53 Prob > chi2 = 0.913
(Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(3) = 0.49 Prob > chi2 = 0.921
(Robust, but weakened by many instruments.)


I tried putting the endogenous variables in the GMM-style part; however, that did not change the significance problem much. I also tried changing the number of lags, but that did not help either. So what is wrong with my command, please?

What does "star" option in pwcorr mean?

I normally use pwcorr to obtain a correlation matrix, and this command has a star option.

From the description, they stated:

The star option in the pwcorr command in Stata is used to display a star (*) next to correlation coefficients that are significant at a certain level. The level of significance is determined by the user and is specified as an argument in the star option. For example, pwcorr, star(.05) will star all correlation coefficients that are significant at the 5% level or better. The option is used to quickly identify which correlation coefficients are statistically significant and which are not.
This confuses me. What does it mean for a correlation coefficient to be statistically significant?
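
A small illustration with a built-in dataset: the star marks every pair whose p-value from the test of zero correlation falls below the chosen threshold, and the sig option prints those p-values so the two can be compared:
Code:
sysuse auto, clear
pwcorr price mpg weight length, star(0.05) sig   // * flags correlations with p < 0.05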

Saturday, January 21, 2023

Why does the same code run on one machine but fail with r(123) on another?

Hi all,

I suspected this problem previously, but today I can confirm that some of my code works on my school's machine but not on my personal machine. They run almost the same Stata setup, and both are fully licensed.

The information of my school machine is:
[screenshot of the school machine's Stata details omitted]

And my personal machine is

[screenshot of the personal machine's Stata details omitted]

And when I run the code

Code:
 
 did_imputation lnq_r id2 ndate nbanned_date, horizons(0/10) pretrend(10)
My personal computer returns error r(123), while my school's machine runs the code successfully.

Could you please guide me to sort it out?

Thank you