Wednesday, March 31, 2021

Standard errors and 95% Confidence Intervals for Proportions - Differences between different Stata versions

Dear community:

I am having doubts about how Stata calculates standard errors and 95% confidence intervals for proportions, especially because I get different results from different versions of Stata. Here is my code example:

Code:
clear 
input x freq
0 47
1 53
end
expand freq
drop freq

proportion x
When I run the code above in Stata 13.1, I get a Std. Err. of .0501614 and a 95% Conf. Interval that goes from .4305964 to .6270792.

When I run the same code in Stata 15.1, I get a Std. Err. of .0499099 and a 95% Conf. Interval that goes from .4310876 to .6266107.

I have two questions:

1) Why are the standard errors different?

2) I have tried to "manually" calculate the 95% confidence intervals with the formula: CI(lower) = prop - (1.96*std. err.); CI(upper) = prop + (1.96*std. err.), but in both cases I don't get the same results as those provided by Stata.
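For reference, the two standard errors appear to be exactly what you get from dividing p(1-p) by n-1 (Stata 13) versus by n (Stata 14 and later), and neither interval is p ± 1.96·SE: -proportion- builds the interval on the logit scale with a t critical value and back-transforms it. A quick arithmetic check in Python (the t(99) critical value is hardcoded; treat this as a reconstruction consistent with the numbers above, not as Stata's documented source):

```python
from math import sqrt, log, exp

p, n = 0.53, 100       # 53 successes out of 100
t99 = 1.9842           # approx. invttail(99, .025)

se_v15 = sqrt(p * (1 - p) / n)        # divisor n     -> .0499099
se_v13 = sqrt(p * (1 - p) / (n - 1))  # divisor n - 1 -> .0501614

def logit_ci(p, se, t):
    # interval built on the logit scale, then back-transformed
    half = t * se / (p * (1 - p))
    lo, hi = log(p / (1 - p)) - half, log(p / (1 - p)) + half
    return 1 / (1 + exp(-lo)), 1 / (1 + exp(-hi))

print(round(se_v13, 7), round(se_v15, 7))
print(logit_ci(p, se_v15, t99))   # roughly (.43109, .62661)
```

Both reported intervals are reproduced this way (the 13.1 bounds follow from plugging in se_v13 instead), which is why the plain p ± 1.96·SE formula does not match either version.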

Am I doing anything wrong? Any advice and/or clarification is greatly appreciated.

Best regards,

Paolo Moncagatta

Identifying common observations between two groups

Dear Statalist,

I have a dataset with firms and owners. This is a minimum working example:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Firm_id str13 Owner_legal_name str7 Owner_last_name str19 Owner_address byte Share
"A" "John Smith "   "Smith"   "87 Granville street" 10
"A" "Maria Lopez"   "Lopez"   "87 Granville street" 20
"A" "Robert Brown"  "Brown"   "1022 Nelson street"   5
"A" "Ron Gilford"   "Gilford" "287 Howe street"     30
"A" "Rebecca Smith" "Smith"   "1022 Nelson street"  10
"A" "Joe Ramsey"    "Ramsey"  "503 Main street"     25
"B" "Anna Mancini"  "Mancini" "49 Rupert avenue"    25
"B" "David Bauer"   "Bauer"   "8 Cambie street"     25
"B" "Tessa Garcia"  "Garcia"  "8 Cambie street"     50
end
I want to reconstruct family relationships among owners as follows: individuals belong to the same sub-family if they have the same last name OR if they live at the same address. Different sub-families can be related to one another (forming a family) because they have members in common. For example, John Smith, Maria Lopez, Robert Brown, and Rebecca Smith all belong to one family. My desired output is as follows:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Firm_id str13 Owner_legal_name str7 Owner_last_name str19 Owner_address byte(Share last_name_id address_id Family_id Familyshare)
"A" "John Smith "   "Smith"   "87 Granville street" 10 1 1 1 45
"A" "Maria Lopez"   "Lopez"   "87 Granville street" 20 2 1 1 45
"A" "Robert Brown"  "Brown"   "1022 Nelson street"   5 3 2 1 45
"A" "Ron Gilford"   "Gilford" "287 Howe street"     30 4 3 2 30
"A" "Rebecca Smith" "Smith"   "1022 Nelson street"  10 1 2 1 45
"A" "Joe Ramsey"    "Ramsey"  "503 Main street"     25 5 4 3 25
"B" "Anna Mancini"  "Mancini" "49 Rupert avenue"    25 1 1 4 25
"B" "David Bauer"   "Bauer"   "8 Cambie street"     25 2 2 5 75
"B" "Tessa Garcia"  "Garcia"  "8 Cambie street"     50 3 2 5 75
end
I thought generating unique identifiers would be a good place to start:
Code:
egen last_name_id=group(Firm_id Owner_last_name)
egen address_id=group(Firm_id Owner_address)
Once I have Family_id, Familyshare (my ultimate variable of interest) is just
Code:
by Firm_id Family_id, sort: egen Familyshare=total(Share)
My problem is how to generate Family_id; in other words, how to identify observations that are common between two groups. I looked into -egen- and tried to browse Statalist, but I couldn't find anything useful.
I would greatly appreciate any suggestion.
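For what it's worth, the "members in common" step is a connected-components problem: owners are nodes, and two owners within a firm are linked if they share a last name or an address. A sketch of that grouping logic in Python (union-find, using the example data; in Stata one would typically loop, replacing each owner's family id with the minimum id among linked owners until nothing changes):

```python
from collections import defaultdict

owners = [  # (firm, name, last, address, share)
    ("A", "John Smith",    "Smith",   "87 Granville street", 10),
    ("A", "Maria Lopez",   "Lopez",   "87 Granville street", 20),
    ("A", "Robert Brown",  "Brown",   "1022 Nelson street",   5),
    ("A", "Ron Gilford",   "Gilford", "287 Howe street",     30),
    ("A", "Rebecca Smith", "Smith",   "1022 Nelson street",  10),
    ("A", "Joe Ramsey",    "Ramsey",  "503 Main street",     25),
    ("B", "Anna Mancini",  "Mancini", "49 Rupert avenue",    25),
    ("B", "David Bauer",   "Bauer",   "8 Cambie street",     25),
    ("B", "Tessa Garcia",  "Garcia",  "8 Cambie street",     50),
]

parent = list(range(len(owners)))
def find(i):                       # root of i's component, with path halving
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
def union(i, j):
    parent[find(i)] = find(j)

for i in range(len(owners)):
    for j in range(i + 1, len(owners)):
        same_firm = owners[i][0] == owners[j][0]
        linked = owners[i][2] == owners[j][2] or owners[i][3] == owners[j][3]
        if same_firm and linked:
            union(i, j)

familyshare = defaultdict(int)     # total Share per family (component)
for i, o in enumerate(owners):
    familyshare[find(i)] += o[4]
for i, o in enumerate(owners):
    print(o[1], "->", familyshare[find(i)])
```

This reproduces the desired Familyshare column: the four connected Firm A owners total 45, Ron Gilford stays at 30, Joe Ramsey at 25, and the two Cambie-street owners in Firm B at 75.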

Commands to sleep and retry if a file is being accessed at the moment

I have multiple instances of Stata accessing the same Excel file to record some information. Every once in a while they try to access the Excel file at exactly the same time, and I get the following error:
file C:/…/MaxMin.xlsx could not be loaded
r(603);


What I need is to first check whether the file is being accessed by another instance; if it is, sleep 5-10 seconds and retry. If not, proceed to write to the file.
However, I have no experience with the file commands I usually see in programs. Can anyone provide some guidance?
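In Stata the usual shape for this is a while loop around -capture- on the file operation, testing _rc (603 here) and calling -sleep- (which takes milliseconds, e.g. sleep 5000) before retrying. The retry logic itself, sketched in Python for illustration (the failing operation is a stand-in, not an Excel call):

```python
import time

def write_with_retry(write_fn, max_tries=10, wait_seconds=5):
    """Try write_fn(); on failure, sleep and retry, up to max_tries."""
    for attempt in range(1, max_tries + 1):
        try:
            write_fn()
            return attempt          # number of attempts it took
        except OSError:
            if attempt == max_tries:
                raise               # give up after max_tries
            time.sleep(wait_seconds)

# Stand-in for the Excel write: fails twice, then succeeds
state = {"calls": 0}
def flaky_write():
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("file could not be loaded")  # like r(603)

print(write_with_retry(flaky_write, wait_seconds=0))  # -> 3
```

Note that this only narrows the race window; it does not make the write atomic if two instances pass the check simultaneously.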

sqom subcost matrix problem?

Hi,

I am trying to run the latest version of sqom on Stata 16 using subcost(meanprobdistance) and am getting an error.
My command is:

sqom, subcost(meanprobdistance) full k(2)

and my data has 3 states (no study, part-time, full-time) for 684 periods and 5614 individuals.

When I run the command I get the following message:

symmetric SQsubcost[3,3]
c1 c2 c3
r1 0
r2 1.9972898 0
r3 1.9991412 1.9998221 0
Perform 3673405 Comparisons with Needleman-Wunsch Algorithm
Running mata function
needlemanwunschapproxmatrix(): 3001 expected 8 arguments but received 6
sqomfull(): - function returned error
<istmt>: - function returned error


I tried reindexing mata using:

mata: mata mlib index

but I still get the same error. I tried running the same command on the same data using the old version of the SQ package on Stata 15, and it seems to run.

Any ideas on how I might fix this error?

Thanks in advance,
Jane.


Histogram with lpattern(non-solid line) adds a weird slant line

Hi all, I have just found that when I specify 'lpattern(dash)', 'dot', or anything other than the solid line for the 'twoway histogram' command, I get a weird slanted line from the foot of the left-end bar to the head of the right-end bar. Here is the MWE:

Code:
sysuse auto
histogram mpg, d lp(dash)


The discrete option is not necessary, but it's easier to see with it. I have performed -update all- just now, so the software should be perfectly up-to-date. Does anyone have any idea why this is happening, and is there any work-around for this?

Restructuring data for survival analysis

Hello everyone,

I need help restructuring my data (Nigeria Demographic and Health Survey) for survival analysis.

I'm trying to generate the time of observation (duration in months) for each child, from month of birth until the end of the study observation period in 2018, for which I ran the commands below:

*setting ending time of observation
* s220bm: child's month of death
* s220by: child's year of death
tab1 s220by s220bm
recode s220by (9998=.) //don't know (0.16%)
recode s220bm (97/98=.) //inconsistent (97) = 1%, don't know (98) = 13%

gen mth_death= s220by *12+s220bm //to generate month of death
tab mth_death
order mth_death, after(b7)

//Dec(12) 2018 is the end of the survey month
gen mth_end=.
replace mth_end=mth_death if mth_death!=.
replace mth_end=12 if mth_death==. //child is still alive
tab mth_end, missing
sort mth_end
order mth_end mth_death
*check the browser data file to make sure all is reasonable

gen yr_end=.
replace yr_end= s220by if s220by!=.
replace yr_end= 2018 if s220by==. //end of the survey year if child is still alive (will be censored)
order yr_end, after(mth_death)

tab1 mth_end yr_end, m

* b1: child's month of birth
* b2: child's year of birth

*generate time of observation (duration in months) for each child from month of birth till the end of the study observation period in 2018
replace b1=12 if b1==.
gen mths_birth= b2*12+b1
tab mths_birth
order mths_birth, after(yr_end)

gen mths_observe=yr_end*12+mth_end
tab mths_observe
order mths_observe, after(mths_birth)

gen dur=mths_observe-mths_birth+1
browse if dur==. //no missing values
tab dur
order dur, after(mths_observe)




Running -dataex- on the variables created returned the output below:



Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id mth_death mth_end yr_end mths_birth mths_observe dur)
24696 . 12 2018 24197 24228 32
25986 . 12 2018 24195 24228 34
  496 . 12 2018 24200 24228 29
 7642 . 12 2018 24197 24228 32
30776 . 12 2018 24189 24228 40
 5260 . 12 2018 24181 24228 48
 5966 . 12 2018 24209 24228 20
16389 . 12 2018 24197 24228 32
 2278 . 12 2018 24208 24228 21
10239 . 12 2018 24218 24228 11
22916 . 12 2018 24209 24228 20
19681 . 12 2018 24209 24228 20
15325 . 12 2018 24202 24228 27
12416 . 12 2018 24210 24228 19
25760 . 12 2018 24198 24228 31
20777 . 12 2018 24204 24228 25
24014 . 12 2018 24212 24228 17
32472 . 12 2018 24173 24228 56
 7727 . 12 2018 24187 24228 42
   17 . 12 2018 24220 24228  9
19211 . 12 2018 24170 24228 59
 8561 . 12 2018 24184 24228 45
32324 . 12 2018 24208 24228 21
20246 . 12 2018 24172 24228 57
 3222 . 12 2018 24166 24228 63
 3917 . 12 2018 24187 24228 42
 3838 . 12 2018 24220 24228  9
 2899 . 12 2018 24211 24228 18
32490 . 12 2018 24165 24228 64
 5342 . 12 2018 24202 24228 27
  854 . 12 2018 24199 24228 30
 5289 . 12 2018 24173 24228 56
32329 . 12 2018 24191 24228 38
31267 . 12 2018 24172 24228 57
11971 . 12 2018 24217 24228 12
 2560 . 12 2018 24181 24228 48
11991 . 12 2018 24172 24228 57
32672 . 12 2018 24216 24228 13
31303 . 12 2018 24171 24228 58
 5615 . 12 2018 24183 24228 46
31260 . 12 2018 24203 24228 26
  444 . 12 2018 24218 24228 11
31258 . 12 2018 24202 24228 27
29164 . 12 2018 24185 24228 44
 5369 . 12 2018 24214 24228 15
25160 . 12 2018 24185 24228 44
30031 . 12 2018 24168 24228 61
  955 . 12 2018 24176 24228 53
19429 . 12 2018 24174 24228 55
30973 . 12 2018 24217 24228 12
17152 . 12 2018 24196 24228 33
14551 . 12 2018 24212 24228 17
27428 . 12 2018 24225 24228  4
30139 . 12 2018 24205 24228 24
17610 . 12 2018 24194 24228 35
 7779 . 12 2018 24214 24228 15
31942 . 12 2018 24175 24228 54
10185 . 12 2018 24181 24228 48
 4431 . 12 2018 24190 24228 39
 9942 . 12 2018 24221 24228  8
15298 . 12 2018 24224 24228  5
31239 . 12 2018 24209 24228 20
12379 . 12 2018 24201 24228 28
 7364 . 12 2018 24186 24228 43
 7384 . 12 2018 24183 24228 46
19049 . 12 2018 24193 24228 36
11634 . 12 2018 24213 24228 16
  337 . 12 2018 24174 24228 55
28952 . 12 2018 24205 24228 24
16368 . 12 2018 24212 24228 17
  812 . 12 2018 24217 24228 12
10026 . 12 2018 24225 24228  4
20664 . 12 2018 24174 24228 55
31666 . 12 2018 24177 24228 52
 7855 . 12 2018 24172 24228 57
26979 . 12 2018 24185 24228 44
20907 . 12 2018 24220 24228  9
20642 . 12 2018 24215 24228 14
26260 . 12 2018 24214 24228 15
24653 . 12 2018 24182 24228 47
31487 . 12 2018 24217 24228 12
 5028 . 12 2018 24208 24228 21
13361 . 12 2018 24204 24228 25
30680 . 12 2018 24202 24228 27
32319 . 12 2018 24173 24228 56
31181 . 12 2018 24198 24228 31
 8549 . 12 2018 24170 24228 59
 8597 . 12 2018 24172 24228 57
 6333 . 12 2018 24215 24228 14
22702 . 12 2018 24169 24228 60
31210 . 12 2018 24197 24228 32
17347 . 12 2018 24204 24228 25
25702 . 12 2018 24170 24228 59
25797 . 12 2018 24166 24228 63
31206 . 12 2018 24180 24228 49
 1336 . 12 2018 24171 24228 58
31204 . 12 2018 24199 24228 30
 5881 . 12 2018 24221 24228  8
26365 . 12 2018 24202 24228 27
15413 . 12 2018 24178 24228 51
end
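As a quick check on the month arithmetic above (duration = end-month index minus birth-month index, plus one; note this year*12+month encoding differs from Stata's ym() but is internally consistent), reproducing the first listed child in Python:

```python
def month_index(year, month):
    # same encoding as in the do-file: year*12 + month
    return year * 12 + month

mth_end, yr_end = 12, 2018                    # survey end: December 2018
mths_observe = month_index(yr_end, mth_end)
mths_birth = 24197                            # first child in the listing
dur = mths_observe - mths_birth + 1

print(mths_observe, dur)  # -> 24228 32
```

This matches the mths_observe and dur values shown for id 24696, so the arithmetic is at least self-consistent for the censored children.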

Please, am I on the right track before I run -stset-?

Thank you

Happy birthday to statalist.org

Statalist re-invented itself as a forum on March 31, 2014. That doesn't sound that long ago, but in the last 7 years, there have been more than 300,000 posts (an average of more than 100 per day) and over 48,000 members have registered.

Thank you to all of you who post, lurk, and make Statalist a great resource for Stata users worldwide. Whether you are a prolific participant, answering questions far and wide, or you quietly read what others have to say, or you are somewhere in between, you help make Statalist a forum like no other.

Combinations of binary variables

Hello,

I have a dataset with 10 variables, 5 of them binary (A, B, C, D, E). I'm trying to get all possible combinations of 2, 3, 4, and 5 of them, but I'm not sure how to go about this using a loop (or permin/combin) in Stata. Moreover, I want Stata to count each combination iteration and tell me the number of times each combination of variables is jointly 1.

The data is such that the number of A=1 instances (say 5) adds up to the A=1 counts across all other iterations (e.g. A=1 & B=1 & C=1 (2), A=1 & C=1 (3)). For example, I've been trying:

tab A if A !=0 & B !=0 & C !=0


I'm having difficulties getting the combinations with a loop (minimal code) and also tallying all iterations up to 5.
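The counting step can be written compactly with a loop over combinations; here is a sketch of the logic in Python (toy rows, not the posted data) that, for every combination of 2-5 of the indicators, counts rows where all indicators in the combination equal 1:

```python
from itertools import combinations

varnames = ["A", "B", "C", "D", "E"]
rows = [  # toy data: each row is (A, B, C, D, E)
    (1, 0, 0, 0, 1),
    (1, 1, 0, 0, 1),
    (0, 1, 1, 0, 0),
    (1, 0, 0, 0, 0),
]

counts = {}
for k in range(2, 6):                         # combination sizes 2..5
    for combo in combinations(range(5), k):
        n = sum(all(r[i] == 1 for i in combo) for r in rows)
        counts[tuple(varnames[i] for i in combo)] = n

print(counts[("A", "E")])       # rows with A==1 and E==1 -> 2
print(counts[("A", "B", "E")])  # rows with A, B, E all 1 -> 1
```

In Stata the analogous approach is a nested foreach over the varlist (or the community-contributed -tuples- command from SSC to enumerate the combinations) with count if on the conjunction of conditions.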

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(A B C D E)
1 0 0 0 1
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 1
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 1
1 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 0 0 0 1
1 0 0 0 1
0 0 0 0 0
1 0 0 0 0
0 0 0 0 0
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
0 0 0 0 0
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 1
1 0 0 0 1
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 1
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
0 0 0 0 0
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
1 0 0 0 0
end



convert hexadecimal to binary

I have variables whose values come as hexadecimal strings 32 characters long, the result of an MD5 hash of an original string. I need to convert those values into 128-bit-long binary numbers.
I have been searching the forum but don't seem to be able to find a command that performs this apparently simple task in Stata. Most likely I am overlooking it. Any help?
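For reference, the conversion itself is just base 16 to base 2 with zero padding; one caveat is that a 128-bit integer will not fit exactly in a Stata numeric (doubles carry about 53 bits of precision), so the result generally has to live in a string, built digit by digit (each hex character maps to a fixed 4-bit pattern). The conversion, illustrated in Python:

```python
import hashlib

h = hashlib.md5(b"an original string").hexdigest()  # 32 hex characters
bits = format(int(h, 16), "0128b")                  # 128-character binary string

print(len(h), len(bits))  # -> 32 128
```

In Stata, a foreach loop over substr() positions with a 16-entry hex-digit-to-4-bit lookup does the same thing without ever leaving string variables.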
Thanks, Luca

HGLM - Model 2 level variables not significant but postestimation plot shows variation

Good day,

I am using a two-level HGLM model.
Model 1 - Level 1 variables have a significant effect on the outcome.
Model 2 - Level 2 variables do not have any significant effect on the outcome variable. However, the postestimation plot shows some variation.

Is there a way to report this?

Thank you very much for taking time to respond to my question.

Robustness checks

Dear statalist,

I am running a bootstrap on an OLS model with the dependent variable being log(maximal grip strength) on a set of independent variables. However, I want to know if it is possible to run some robustness checks, such as a RESET test etc. Do I have to do this before or after running the bootstrap?

Comparing coefficients while keeping difference between other coefficients constant in mixed model

Dear Statalist users,

I would like to compare whether the slope is statistically different between two groups while pretending that the intercept was exactly the same between the groups.

A bit about the model: I run a growth curve model with economic resources (a continuous variable in euros) as my outcome measure. My explanatory variables are time since divorce ("divduration", in years) and whether a respondent actually experienced a divorce ("treat", with 1=divorced, 0=continuously married). So the base model looks as follows:
Code:
mi estimate, dots post: mixed wealth  c.divduration##i.treat $control1  || id: divduration if psmatched2 ==1, variance mle
I am now adding another interaction with a remarriage indicator:
Code:
mi estimate, dots post: mixed wealth c.divduration##i.treat##i.remar $control1 || id: divduration if psmatched2 ==1, variance mle
And get the following output:
Code:
Imputations (5):
  ..... done

Multiple-imputation estimates                   Imputations       =          5
Mixed-effects ML regression                     Number of obs     =      9,760

Group variable: id                              Number of groups  =      5,006
                                                Obs per group:
                                                              min =          1
                                                              avg =        1.9
                                                              max =          4
                                                Average RVI       =     6.4175
                                                Largest FMI       =     0.9839
DF adjustment:   Large sample                   DF:     min       =       4.18
                                                        avg       =      64.83
                                                        max       =     397.52
Model F test:       Equal FMI                   F(   7,  134.7)   =      11.17
                                                Prob > F          =     0.0000

-------------------------------------------------------------------------------------------
                   wealth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------------+----------------------------------------------------------------
              divduration |   2221.492   643.0812     3.45   0.003      874.101    3568.884
                          |
                    treat |
                 Treated  |  -39046.84   8871.472    -4.40   0.000    -56692.11   -21401.57
                          |
      treat#c.divduration |
                 Treated  |  -297.1171   1363.105    -0.22   0.828    -2976.913    2382.679
                          |
                  1.remar |    9413.24   16072.45     0.59   0.562    -23172.01     41998.5
                          |
      remar#c.divduration |
                       1  |   1394.398   2100.509     0.66   0.509    -2796.951    5585.746
                          |
              treat#remar |
               Control#1  |          0  (empty)
               Treated#1  |          0  (omitted)
                          |
treat#remar#c.divduration |
               Control#1  |          0  (empty)
               Treated#1  |          0  (omitted)
                          |
       1.flag_firstwealth |  -13878.67   4438.822    -3.13   0.003     -22757.7   -4999.634
         1.flag_impwealth |   14860.63   6932.212     2.14   0.067    -1366.448    31087.72
                    _cons |   81025.73   5520.197    14.68   0.000     69689.74    92361.72
-------------------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Independent              |
             sd(divduration) |   11011.74   1056.377      8637.166    14039.15
                   sd(_cons) |    99006.4   6453.131      84522.53    115972.3
-----------------------------+------------------------------------------------
                sd(Residual) |   133315.1   9924.388      108806.7      163344
------------------------------------------------------------------------------

My problem/my question: I would now like to test whether there is a statistical difference between the growth curves of divorcees who ever remarried and divorcees who never remarried, while keeping their initial differences constant. So at the moment, remarried divorcees have 1394 euros more right at divorce than never-remarried divorcees. I would like to pretend that there was no initial difference and then test whether the two differ in their growth rate after divorce.

Note: continuously married respondents can never be remarried and thus their interaction coefficient falls out of the model.

Data: SOEP
STATA: 16.1

I am not very experienced with postestimation commands and appreciate any advice on how I can solve this.

Thank you,
Nicole

How to calculate year in month with conditions

Dear all,

I have the year of birth and month of birth of each individual in a given dataset. Now I want to generate two variables as follows:
1) Date of birth expressed in months; and
2) An indicator for whether the individual was born after August 1980 (1=yes, 0=no).

Any help is much appreciated. Thanks!

Data
Code:
clear
input double id byte month int year
100202 12 1986
100302  3 1978
100402 12 1983
100405 10 1991
100503  2 1989
100504  4 1993
100505  3 1995
100602  8 1965
100702  1 1975
100703 11 1996
100902 11 1971
101104  2 1983
101203  5 1990
101304 10 1984
101305  2 1985
101404  8 1987
101502  1 1982
101506  8 1987
101507  6 1991
101602  9 1969
101704  8 1985
101804  5 1983
101905  6 1990
101907  9 1994
102002 12 1977
200105  7 1996
200402  9 1974
200403 10 1995
200501  8 1976
200603 12 1982
200704 11 1993
200705  6 1994
201102 10 1972
201104  6 1993
201203  2 1987
201301 11 1974
201303  6 1994
201401  1 1975
201502 10 1978
201506  5 1984
201604  9 1982
201605 12 1993
300104  3 1983
300106  1 1983
300204  2 1974
300205  5 1998
300304 10 1984
300502 12 1973
300503  9 1995
300604 10 1980
300704  5 1985
300709  2 1984
300803  5 1981
300902 10 1974
301003 11 1971
301004 11 1998
301103  3 1987
301204  9 1978
301302 11 1971
301303  3 1997
301404  8 1985
301604  6 1972
301606 10 1993
301704  3 1983
301801  3 1986
301902 12 1968
301903  3 1997
302001  3 1980
400104 12 1995
400204  3 1995
400302  8 1964
400503  3 1986
400602  2 1970
400902  5 1969
400903  8 1993
401002  9 1985
401303  1 1993
401501  3 1978
401602  7 1968
401704  7 1982
401902 12 1984
402003  7 1985
402006 12 1991
500102  6 1991
500202  9 1971
500203 10 1996
500302  2 1971
500402  3 1976
500501  1 1988
500702  1 1969
500703 12 1990
500802  4 1967
500803  5 1990
500804  3 1996
500904 11 1971
501002  3 1984
501003 11 1988
501104  7 1981
501302  5 1974
501403  5 1989
end
label values month wb1m
label values year wb1y
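If "year of birth in month" means a monthly date index, the usual encoding is Stata's ym() function, ym(Y, M) = (Y - 1960)*12 + M - 1, so that ym(1960, 1) == 0; the born-after-August-1980 indicator is then a single comparison. A sketch of the arithmetic in Python, assuming that reading of the question:

```python
def ym(year, month):
    # Stata's ym(): months elapsed since January 1960 (ym(1960, 1) == 0)
    return (year - 1960) * 12 + month - 1

cutoff = ym(1980, 8)  # August 1980

def born_after_aug1980(year, month):
    return 1 if ym(year, month) > cutoff else 0

print(ym(1986, 12))                  # first id in the data -> 323
print(born_after_aug1980(1978, 3))   # -> 0
print(born_after_aug1980(1983, 12))  # -> 1
```

In Stata itself: gen bmonth = ym(year, month), format bmonth %tm, and gen after_aug1980 = bmonth > ym(1980, 8).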

Interpreting Sargan-Hansen Tests

Hi all,

I am running diff GMM using Sebastian's xtdpdgmm command. However, I am not sure how to interpret the 2 different Sargan-Hansen tests that are produced when I run "estat overid". Is one more important than the other? What range should the p-values both be in?

Note that my instrument count is between 20-30, the number of groups I have is 52, I have about 1200 observations and 32 time periods. Instruments are collapsed.

I have tried looking at older threads to find the answer, but could not see one.

Best,
Jaspal

How to generate a variable which takes a different value for every combination?

Dear all
I have a question. I have a dataset with lots of different variables and want to generate a new variable which takes a different value for every possible combination of other variables. For example, if my variables were gender and under_30_years, then I would like a variable which takes the value 1 if male and not under_30_years, 2 if female and not under_30_years, 3 if male and under_30_years, and 4 if female and under_30_years. I've heard that a possible way to do this is with egen tag. But how exactly do I do it?
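The usual tool for this is egen's group() function, which assigns 1, 2, 3, ... to each distinct combination of the listed variables (tag() instead just marks one observation per group). The numbering idea, sketched in Python (Stata numbers groups in sort order of the values; this sketch numbers them by first appearance, which is enough to illustrate):

```python
combos = {}  # combination -> id, assigned as new combinations appear

def group_id(*values):
    # existing combination keeps its id; a new one gets the next integer
    return combos.setdefault(values, len(combos) + 1)

people = [("male", 0), ("female", 0), ("male", 1), ("female", 1), ("male", 0)]
print([group_id(g, u) for g, u in people])  # -> [1, 2, 3, 4, 1]
```

In Stata: egen cell = group(gender under_30_years), label.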
Best regards and thank you very much

sort row with string variables?

Dear All, Is it possible to sort a "string" variable? Suppose that the data set is
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str54 temp
"2089525.IB,2089526.IB,2089527.IB"                      
"2089502.IB,2089503.IB,2089504.IB"                      
"2089511.IB,2089512.IB,2089513.IB"                      
"2089499.IB,2089500.IB,2089501.IB"                      
"2089516.IB,2089517.IB,2089518.IB"                      
"2089481.IB,2089482.IB,2089483.IB"                      
"2089478.IB,2089479.IB,2089480.IB"                      
"2089442.IB,2089443.IB,2089444.IB"                      
"2089494.IB,2089495.IB,2089496.IB"                      
"2089467.IB,2089468.IB,2089469.IB"                      
"2089439.IB,2089440.IB,2089441.IB"                      
"2089455.IB,2089456.IB,2089457.IB"                      
"2089464.IB,2089465.IB,2089466.IB"                      
"2089489.IB,2089490.IB,2089491.IB,2089492.IB,2089493.IB"
"2089387.IB,2089388.IB,2089389.IB"                      
"2089408.IB,2089409.IB,2089410.IB,2089411.IB"           
"2089445.IB,2089446.IB,2089447.IB,2089448.IB"           
"2089430.IB,2089431.IB,2089432.IB"                      
"2089355.IB,2089356.IB,2089357.IB"                      
"2089412.IB,2089413.IB,2089414.IB"                      
end

split temp, p(",")
drop temp
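If the goal is to order the components within each observation (row-wise), one option is to sort the tokens before splitting rather than after; the idea, sketched in Python:

```python
rows = [
    "2089525.IB,2089526.IB,2089527.IB",
    "2089502.IB,2089504.IB,2089503.IB",  # deliberately out of order
]

# split on commas, sort the pieces, and rejoin each row
sorted_rows = [",".join(sorted(r.split(","))) for r in rows]
print(sorted_rows[1])  # -> "2089502.IB,2089503.IB,2089504.IB"
```

Note this is lexicographic (string) ordering, which coincides with numeric order here because the codes are fixed width. Ordinary -sort- in Stata handles string variables alphabetically across observations; within-row ordering of the temp1, temp2, ... variables after -split- needs a small loop or a community-contributed tool.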

outreg2 keep option does not work

Hello,

I am in desperate need of some help with Stata for my thesis.

I would like to make use of the outreg2 command. However, when I use this command and specify the variables that should be kept, it ignores these variables and creates a list of all variables, after which I get a note that I attempted to create a matrix with too many variables. I do not even want that many variables, but the command doesn't respect my request to keep only certain variables.

I used the following command:
outreg2 using x.doc, replace sum(log) keep(var x var y var z)

What is wrong with this way of typing which variables should be kept?

Thanks in advance!



Difference in Consecutive Values by Group Labels

Dear Statalisters,
I need to generate another variable called 'difference' which contains the difference between 'consecutive' median_gnp values based on year_cat.

Code:
clear
input float(median_gnp year year_cat)
 3730.3 1962 0
 3730.3 1962 0
 3730.3 1963 0
 3730.3 1963 0
   4441 1964 1
   4441 1964 1
   4441 1965 1
   4441 1965 1
5232.05 1966 2
5232.05 1966 2
5232.05 1967 2
5232.05 1967 2
 5912.6 1968 3
 5912.6 1968 3
 5912.6 1969 3
 5912.6 1969 3
 6833.95 1970 4
6833.95 1970 4
end
What I mean is that I need a variable diff that contains the values 4441 - 3730.3, 5232.05 - 4441, 5912.6 - 5232.05, and so on. (Basically, the 'diff' variable will show the growth in gnp year_cat-wise.)

This is a sample. My original dataset has around >50000 observations and >100 categories and my purpose is to make a 'growth' chart. The data is categorical - NOT time series or panel. I am on Stata/IC 16.1.
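Since each year_cat carries a single median_gnp, the computation reduces to: take one value per category, difference consecutive categories, and attach the result back. A sketch of that logic in Python with the sample values:

```python
# one (year_cat, median_gnp) pair per category
data = [(0, 3730.3), (1, 4441.0), (2, 5232.05), (3, 5912.6), (4, 6833.95)]

diffs = {}
for (c0, v0), (c1, v1) in zip(data, data[1:]):
    diffs[c1] = round(v1 - v0, 2)   # growth from the previous category

print(diffs)  # {1: 710.7, 2: 791.05, 3: 680.55, 4: 921.35}
```

One way to mirror this in Stata is to keep (or tag) one observation per year_cat, difference the sorted values, and then spread the result back to all observations in the category with egen and by year_cat.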

IV ordered probit using CMP (syntax help)

Hi there, I am estimating the causal effect of maternal education on child health.

My dependent variable ==> Breastfeeding duration. It is ordinal and categorical as having 6 categories from 1 to 6.

My endogenous variable ==> mother's education. It is a binary variable. 1= if mother has at least 8 years of education; 0=otherwise.

My instrument ==> Education reform exposure. It is a binary variable. 1= if mother born in 1987 or later; 0= 1986 or earlier.

Due to the ordinal and categorical nature of my dependent variable, I need to run IV-OPROBIT with CMP.

Here is my code:

cmp (dependent variable=endogenous variable mother_age_yrs i.region5 i.placeofresidence i.wealth_index) (endogenous variable=instrument mother_age_yrs i.region5 i.placeofresidence i.wealth_index), indicators($cmp_oprobit $cmp_cont) nolrtest

which is equal to:

cmp (bf_kategorik= completionof8years mother_age_yrs i.region5 i.placeofresidence i.wealth_index) (completionof8years=reformexposure mother_age_yrs i.region5 i.placeofresidence i.wealth_index), indicators($cmp_oprobit $cmp_cont) nolrtest

I placed the same control variables in both parentheses.

My dependent variable has 2,699 observations. Yet the estimates are for the full sample. Why? Also, what is "_cmp_y1", which appears in the ordered probit regression? How should I interpret those results? Or am I writing the syntax wrong?


Fitting individual models as starting point for full model fit.
Note: For programming reasons, these initial estimates may deviate from your specification.
For exact fits of each equation alone, run cmp separately on each.

Iteration 0: log likelihood = -4664.7438
Iteration 1: log likelihood = -4629.733
Iteration 2: log likelihood = -4629.7322
Iteration 3: log likelihood = -4629.7322

Ordered probit regression Number of obs = 2,699
LR chi2(11) = 70.02
Prob > chi2 = 0.0000
Log likelihood = -4629.7322 Pseudo R2 = 0.0075

------------------------------------------------------------------------------------
_cmp_y1 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
completionof8years | .0178139 .0500493 0.36 0.722 -.0802809 .1159088
mother_age_yrs | .018091 .0035856 5.05 0.000 .0110634 .0251187
|
region5 |
South | -.1883487 .0684742 -2.75 0.006 -.3225557 -.0541417
Central | -.086971 .0634316 -1.37 0.170 -.2112947 .0373528
North | -.173147 .070988 -2.44 0.015 -.3122809 -.0340131
East | .0030043 .0593384 0.05 0.960 -.1132969 .1193054
|
placeofresidence |
Rural | .126082 .0539364 2.34 0.019 .0203687 .2317954
|
wealth_index |
2 | -.1122657 .0629024 -1.78 0.074 -.2355522 .0110207
3 | -.0479112 .0707739 -0.68 0.498 -.1866255 .0908031
4 | -.1049022 .0765191 -1.37 0.170 -.2548768 .0450724
5 | -.2524081 .0874588 -2.89 0.004 -.4238243 -.080992
-------------------+----------------------------------------------------------------
/cut1 | -.7832895 .1329833 -1.043932 -.522647
/cut2 | -.3245831 .1319903 -.5832794 -.0658869
/cut3 | .2203479 .1316789 -.0377379 .4784337
/cut4 | .8320008 .1322858 .5727253 1.091276
/cut5 | 1.721697 .1350108 1.457081 1.986313
------------------------------------------------------------------------------------

Warning: regressor matrix for _cmp_y1 equation appears ill-conditioned. (Condition number = 50.918938. )
This might prevent convergence. If it does, and if you have not done so already, you may need to remove nearly
collinear regressors to achieve convergence. Or you may need to add a nrtolerance(#) or nonrtolerance option to the command line.
See cmp tips.

Source | SS df MS Number of obs = 5,571
-------------+---------------------------------- F(11, 5559) = 262.73
Model | 437.639922 11 39.7854474 Prob > F = 0.0000
Residual | 841.803625 5,559 .151430765 R-squared = 0.3421
-------------+---------------------------------- Adj R-squared = 0.3408
Total | 1279.44355 5,570 .229702612 Root MSE = .38914


----------------------------------------------------------------------------------
completionof8y~s | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------+----------------------------------------------------------------
reformexposure | .2868814 .016345 17.55 0.000 .2548389 .318924
mother_age_yrs | -.006512 .00108 -6.03 0.000 -.0086293 -.0043948
|
region5 |
South | .063416 .0187688 3.38 0.001 .0266219 .1002102
Central | .0551908 .0171079 3.23 0.001 .0216527 .0887289
North | .1037862 .0194318 5.34 0.000 .0656923 .1418801
East | -.0238842 .0158652 -1.51 0.132 -.0549861 .0072177
|
placeofresidence |
Rural | .0230901 .0138136 1.67 0.095 -.0039899 .0501701
|
wealth_index |
2 | .1180296 .015968 7.39 0.000 .086726 .1493332
3 | .270963 .0180196 15.04 0.000 .2356376 .3062884
4 | .4604087 .019482 23.63 0.000 .4222163 .498601
5 | .7657424 .0208734 36.69 0.000 .7248224 .8066625
|
_cons | .2080448 .0386628 5.38 0.000 .1322506 .2838391
----------------------------------------------------------------------------------

Warning: regressor matrix for completionof8years equation appears ill-conditioned. (Condition number = 46.94053.)
This might prevent convergence. If it does, and if you have not done so already, you may need to remove nearly
collinear regressors to achieve convergence. Or you may need to add a nrtolerance(#) or nonrtolerance option to the command line.
See cmp tips.

Fitting constant-only model for LR test of overall model fit.

Fitting full model.

Iteration 0: log likelihood = -7270.6442
Iteration 1: log likelihood = -7270.4032
Iteration 2: log likelihood = -7270.3934
Iteration 3: log likelihood = -7270.3934

Mixed-process regression Number of obs = 5,571
LR chi2(22) = 2394.36
Log likelihood = -7270.3934 Prob > chi2 = 0.0000

------------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
bf_kategorik |
completionof8years | .1900263 .2454854 0.77 0.439 -.2911162 .6711688
mother_age_yrs | .0204678 .0048412 4.23 0.000 .0109793 .0299563
|
region5 |
South | -.1981234 .0696177 -2.85 0.004 -.3345715 -.0616752
Central | -.0964643 .0646646 -1.49 0.136 -.2232045 .030276
North | -.1901399 .0745826 -2.55 0.011 -.336319 -.0439607
East | .006895 .0595097 0.12 0.908 -.1097418 .1235319
|
placeofresidence |
Rural | .1220701 .0541987 2.25 0.024 .0158426 .2282976
|
wealth_index |
2 | -.1308262 .067806 -1.93 0.054 -.2637234 .0020711
3 | -.0927723 .0943269 -0.98 0.325 -.2776495 .092105
4 | -.1806642 .1301963 -1.39 0.165 -.4358442 .0745159
5 | -.380619 .1983883 -1.92 0.055 -.769453 .0082149
-------------------+----------------------------------------------------------------
completionof8years |
reformexposure | .2868814 .0163274 17.57 0.000 .2548804 .3188825
mother_age_yrs | -.006512 .0010788 -6.04 0.000 -.0086265 -.0043975
|
region5 |
South | .063416 .0187486 3.38 0.001 .0266695 .1001625
Central | .0551908 .0170894 3.23 0.001 .0216961 .0886854
North | .1037862 .0194109 5.35 0.000 .0657416 .1418308
East | -.0238842 .0158481 -1.51 0.132 -.0549459 .0071775
|
placeofresidence |
Rural | .0230901 .0137987 1.67 0.094 -.0039549 .050135
|
wealth_index |
2 | .1180296 .0159508 7.40 0.000 .0867666 .1492926
3 | .270963 .0180002 15.05 0.000 .2356834 .3062427
4 | .4604087 .019461 23.66 0.000 .4222658 .4985516
5 | .7657424 .0208509 36.72 0.000 .7248754 .8066095
|
_cons | .2080448 .0386211 5.39 0.000 .1323488 .2837409
-------------------+----------------------------------------------------------------
/cut_1_1 | -.6980257 .1801493 -3.87 0.000 -1.051112 -.3449395
/cut_1_2 | -.2403788 .1773734 -1.36 0.175 -.5880243 .1072668
/cut_1_3 | .3034286 .1748699 1.74 0.083 -.0393101 .6461673
/cut_1_4 | .9136664 .172688 5.29 0.000 .5752041 1.252129
/cut_1_5 | 1.801126 .1709347 10.54 0.000 1.4661 2.136151
/lnsig_2 | -.9448915 .0094737 -99.74 0.000 -.9634596 -.9263235
/atanhrho_12 | -.0697405 .0976723 -0.71 0.475 -.2611747 .1216937
-------------------+----------------------------------------------------------------
sig_2 | .3887217 .0036826 .3815705 .396007
rho_12 | -.0696276 .0971988 -.2553939 .1210965
------------------------------------------------------------------------------------

.

Reshape data in order to calculate gini coefficient panel data using income groups

Dear Statalist,

I'm trying to calculate the Gini coefficient of counties using the population of 20+ income groups and the median income of those groups. An extract of the data I'm working on is as follows:

Code:
 
* Example generated by -dataex-. To install: ssc install dataex
clear
input float countyid int(year pop_1_19 pop_20_39) double(median_1_19 median_20_39)
 1 2015 1004  349  5.8 29.2
 1 2016  926  380  5.9   30
 1 2017  881  382  5.9 29.2
 1 2018  931  381  6.1 29.3
 1 2019  933  400  6.5 29.1
 2 2015 1068  508  5.4 29.9
 2 2016 1031  517  5.4 30.3
 2 2017  959  489  6.5 29.1
 2 2018  890  557  6.7 29.3
 2 2019  860  490  7.1 29.7
 3 2015  626  305  6.7 29.2
 3 2016  586  292  6.9 29.5
 3 2017  534  293  6.6 29.5
 3 2018  638  294  7.1 29.8
 3 2019  627  289  6.8 29.1
 4 2015  143   95  8.2 28.4
 4 2016  160   92  8.4 28.7
 4 2017  123   88  8.3   27
 4 2018  125   85  7.4 27.8
 4 2019  134   97    7 30.2
 5 2015  408  228  7.2 29.2
 5 2016  388  229  6.5 29.9
 5 2017  402  219    7 30.8
 5 2018  411  209  7.8 28.6
 5 2019  387  229  7.5 29.4
 6 2015   78   37  7.1 26.7
 6 2016   47   41  9.3 31.4
 6 2017   55   29    8 30.3
 6 2018   60   36  7.5 29.7
 6 2019   63   30  8.2 32.2
 7 2015  179   87  5.8 28.3
 7 2016  159   77  5.6 31.1
 7 2017  157   66  5.4 28.9
 7 2018  149   71  7.6 29.7
 7 2019  139   64  7.9 28.2
 8 2015  702  413  7.5 28.9
 8 2016  641  422  7.6 29.8
 8 2017  588  383    8 29.5
 8 2018  591  415  7.6 28.8
 8 2019  570  393  7.4 30.3
 9 2015  251  139  6.8 30.2
 9 2016  248  116  7.3 28.3
 9 2017  233  132    8 30.4
 9 2018  220  140  7.8 29.2
 9 2019  208  149  7.9 29.3
10 2015  533  326  9.1 29.2
10 2016  570  336  9.3   30
10 2017  631  325  9.1 28.5
10 2018  631  325  9.4 27.8
10 2019  551  306  9.6 29.7
end
I've tried to find a solution for calculating the Gini with the data shaped the way it is, but being the Stata novice I am, I haven't had any luck. However, I believe I could use the ginidesc command if I reshape my data the following way:
Code:
countyid year group population median
1 2015  1 xxx xxx
1 2015  2 xxx xxx
1 2015  3 xxx xxx
1 2015  4 xxx xxx
1 2015  5 xxx xxx
1 2015  6 xxx xxx
1 2015  7 xxx xxx
1 2015  8 xxx xxx
1 2015  9 xxx xxx
1 2015 10 xxx xxx
1 2015 11 xxx xxx
1 2015 12 xxx xxx
1 2015 13 xxx xxx
1 2015 14 xxx xxx
1 2015 15 xxx xxx
1 2015 16 xxx xxx
1 2015 17 xxx xxx
1 2015 18 xxx xxx
1 2015 19 xxx xxx
1 2015 20 xxx xxx
1 2015 21 xxx xxx
1 2015 22 xxx xxx
1 2015 23 xxx xxx
1 2015 24 xxx xxx
1 2015 25 xxx xxx
I.e. I would like to reshape my data so that group (income group) becomes a variable: 1 if the income group is group 1-19 (pop_1_19 in my original data), 2 if it is group 20-39 (pop_20_39 in my original data), etc. population is the value of the corresponding pop_xx_xx variable in my original data, and median is the median wage of the corresponding income group. Any ideas on how I could proceed with the reshaping? Or is there any way to calculate the Gini of each county for each year using the current shape of my data? Thanks in advance.
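One possible route to the long layout described above, sketched here for just the two groups in the data extract (the rename lists would need one entry per pop_*/median_* pair):

```stata
* give each pair a common stub ending in the group number
rename (pop_1_19 pop_20_39) (population1 population2)
rename (median_1_19 median_20_39) (median1 median2)

* one row per county-year-group
reshape long population median, i(countyid year) j(group)
```

After that, a Gini per county-year could be computed from the grouped data, e.g. by looping over counties and years with population as a frequency weight; note that a Gini based only on group medians is an approximation of the true coefficient.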

Series 0 not found using grc1leg2

Dear all, I am using Stata 14 on Windows 10. The following code is supposed to create a combined graph of several scatter plots of two variables. I used almost the same code several times for other graphs and it worked. However, this time I always get the error message "Series 0 not found" after all the single graphs for the respective countries have been generated, i.e. no combined graph is created (I only get an empty combined graph). After spending several hours trying to fix the problem, I am out of ideas...

Code:
levelsof country, local(levels)

local graph ""

foreach i of local levels {
    summarize net_tradevalue_m if country == "`i'", meanonly
    local y_min = `r(min)'
    local y_max = `r(max)'
    
    local y_min_r = round(`y_min')
    local y_max_r = round(`y_max')
    
    scatter net_tradevalue_m instal if country == "`i'", ///
    xtitle("Number of items installed") ///
    ytitle("Import value (US$) net of re-exports") ///
    yscale(range(`y_min' `y_max')) ylabel(`y_min_r' `y_max_r') ///
    title("`i'") name(g_`i', replace)
    local graph "`graph' g_`i'"
}

grc1leg2 `graph', xtob1title ytol1title ytsize(vsmall) labsize(vsmall)
Code:
* Example generated by -dataex-.
input int year str2 country double net_tradevalue_m int instal
2005 "AE"   .596557    3
2007 "AE"    .57997    2
2008 "AE"  1.334496    2
2012 "AE"    .82767    4
2013 "AE"   2.23105    2
2014 "AE"  3.857373    0
2015 "AE"  2.996899    0
2016 "AE"  3.456476    0
2017 "AE" 10.282882    0
2018 "AE"  6.563038    0
2019 "AE"  3.967778    0
1996 "AR"  2.930427    0
1997 "AR"  2.566672    0
1998 "AR"  2.713423    0
1999 "AR"  1.757973   20
2000 "AR"   .820514   50
2001 "AR"  1.731663   40
2002 "AR"   .433283   29
2003 "AR"   .693137   33
2004 "AR"   .537529   17
2005 "AR"  1.334015   65
2006 "AR"  1.607395   36
2007 "AR"  5.205558  141
2008 "AR" 13.348274  150
2009 "AR"  2.953936   45
2010 "AR"  8.285274   96
2011 "AR" 22.273457  407
2012 "AR" 15.632359  180
2013 "AR" 13.275552  181
2014 "AR" 10.507074    0
2015 "AR" 14.350166    0
2016 "AR" 11.726218    0
2017 "AR" 16.001951    0
2018 "AR"  9.526949    0
2019 "AR" 15.542093    0
1996 "AT"  9.284512  277
1997 "AT" 11.324429  232
1998 "AT"  11.18832  145
1999 "AT" 16.971384  273
2000 "AT"  37.53419  320
2001 "AT" 16.574034  330
2002 "AT" 16.691013  670
2003 "AT"  14.51144  365
2004 "AT" 24.916763  545
2005 "AT"  25.72232  485
2006 "AT" 35.112324  498
2007 "AT"  38.49554  621
2008 "AT"  46.18709  638
2009 "AT"  41.32523  508
2010 "AT"  36.71757  496
2011 "AT"  43.99167  628
2012 "AT"  40.30826  835
2013 "AT"   40.7037  720
2014 "AT"  51.64431  898
2015 "AT"  48.09217  987
2016 "AT"   73.0638 1686
2017 "AT"     81.19 1641
2018 "AT"  82.17607 1504
2019 "AT"  74.40486 1475
1996 "AU"  4.763878  250
1997 "AU" 10.905708  526
1998 "AU"  6.894873  347
1999 "AU"  8.601764  180
2000 "AU"  8.956798  440
2001 "AU" 10.291573  270
2002 "AU"  13.03586  421
2003 "AU" 17.182116  569
2004 "AU" 28.043364  652
2005 "AU"  29.79912  890
2006 "AU" 19.709654  719
2007 "AU"  26.95258  734
2008 "AU"  20.39395  781
2009 "AU"  8.937055  399
2010 "AU" 13.016978  624
2011 "AU"  17.47799  690
2012 "AU"   22.5622 1214
2013 "AU"  13.40527  323
2014 "AU" 11.591695    0
2015 "AU" 13.899048    0
2016 "AU"  16.18698    0
2017 "AU" 19.643173    0
2018 "AU"  28.74417    0
2019 "AU" 18.894983    0
2003 "BA"     .0334    0
2004 "BA"   .166554    0
2005 "BA"   .074738    0
2006 "BA"   .260394    0
2007 "BA"   .535262    0
2008 "BA"   .677682    0
2009 "BA"  1.532421    0
2010 "BA"   .381561    2
2011 "BA"  3.805619    2
2012 "BA"   .817377    0
2013 "BA"   .587407    0
2014 "BA"  1.019831    0
2015 "BA"   .604327    0
2016 "BA"   .752781    0
2017 "BA"  1.305355    0
2018 "BA"  1.188319    0
2019 "BA"  2.205264    0
end

It would be great, if someone could help me. Thanks a lot!

Drop if Strmatch with multiple conditions

Hello everyone,

I'm new to Stata and I have the following problem:
I want to drop all observations whose string variable Dealtype does not contain the word "Merger" or "Acquisition".

I used the following Drop if Strmatch formula:

drop if strmatch(Dealtype, "*Merger*")==0

worked just fine, but then Stata deletes everything that isn't named Merger. How can I add the condition to delete all observations where the variable Dealtype contains neither "Merger" nor "Acquisition"?
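One way to express the combined condition is to drop an observation only when both patterns fail (Dealtype is the poster's variable):

```stata
* Drop only observations that match neither pattern
drop if strmatch(Dealtype, "*Merger*") == 0 & strmatch(Dealtype, "*Acquisition*") == 0
```

An equivalent formulation uses strpos(): `keep if strpos(Dealtype, "Merger") | strpos(Dealtype, "Acquisition")`.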

Hope you can help me out.


Merger Simulation Nested Logit Model (Björnerstedt & Verboven, 2013)

Dear forum,

I follow the paper written by Björnerstedt and Verboven (2013) and Berry (1994) to simulate a merger:

In Berry (1994) a nested logit model is used to estimate demand. Björnerstedt & Verboven do a merger simulation based on the nested logit demand model. They work out an example with the commands you need in Stata. I want to base the method of my master's thesis on this paper, but I am stuck on a problem of price endogeneity.

On page 10 Björnerstedt and Verboven run a fixed-effects regression instead of an IV regression (to simplify). This is of course not the correct way, because we have to account for price endogeneity. Following the paper by Berry (1994), as they do, I would like to run an IV regression in Stata using product characteristics as instruments. Berry states that you have to invert the function defining the market shares to uncover the mean utility levels of the various products. These mean utilities can then be related to product characteristics and prices using instruments.
  1. Can it be explained why this would be a good instrument?
  2. Which commands do I need to make this IV regression with the right instruments work in Stata?
Berry 1994: Estimating Discrete-Choice Models of Product Differentiation (jstor.org)
Björnerstedt and Verboven 2013: Bjornerstedt&Verboven_StataJournal.pdf (aguirregabiria.net)

I hope my questions are clear and that someone knows the answer.

Kind regards,
Caro

ITSA Error on Monthly Time Series Data with a Gap Due to Zero Count

Hi, I am currently looking at datasets from several hospitals to assess the impact of the COVID-19 pandemic on surgical volume through an interrupted time series analysis. With each row in my dataset being a unique surgical operation, I created a variable for the surgery month and then aggregated the dataset to the count of surgical volume per month in order to declare it a time series dataset.

Code:
// Set up the data by month
gen surgery_date_month = mofd(surgdate_mdy)
format surgery_date_month %tm

// Collapse data by month
foreach var of varlist record_id{
bysort surgery_date_month: egen count`var'=count(`var')
}

keep surgery_date_month count* 
duplicates drop surgery_date_month, force

tsset surgery_date_month, monthly
The output:

Code:
        time variable:  surgery_dat~h, 2019m1 to 2021m2, but with a gap
                delta:  1 month
I tried running the -itsa- command but got the error "surgery_date_month is not regularly spaced", likely because 2020m7 is missing: there were zero operations that month at this site. I manually fixed this by adding an observation for 2020m7 with a count of 0, but I was wondering if there is a way to avoid this type of manual editing by changing my unsophisticated code as I create the monthly count.
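One way to avoid the manual edit, assuming the count variable is named countrecord_id as generated by the loop above, is to let -tsfill- insert the gap month and then recode the resulting missing count to zero:

```stata
tsset surgery_date_month, monthly
tsfill                               // adds a row for any gap month (e.g. 2020m7)
replace countrecord_id = 0 if missing(countrecord_id)
```

After this, the series is regularly spaced and -itsa- should no longer complain about the gap.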

Thank you in advance.

Tuesday, March 30, 2021

Combined plots

Hi
I recently came across a graph where instead of making bar graphs the authors/researchers had combined three plots (namely scatter, violin and bar graphs) to show the distribution of a variable over different categories. I am wondering if it can be done in Stata.
The graph looks as below
[graph attached]


I am using Stata 15 and my data is a Diversity Index from 1980 to 2015 for different agro-climatic zones.
Thanks





Graphing variables with extreme values

Hi Statalist.

I want to graph a number of financial variables, such as total household assets, and compare the values between a few dichotomous and categorical variables (e.g. race, religion, etc). I graph using a 95% confidence band for each to show the range of values and overlay that with the average of each of the same variables. The financial variable is on the y axis and age is on the x axis so I can observe the change in these values over the lifecycle (by group).

As you can see my dataset contains extreme values and I am looking for options on how best to 'deal' with these for graphing purposes - without excluding any of the observations as I do not want to artificially affect the mean values.

[graph attached]

Code:
tw (lpolyci totasset hgage1 if group == 1 & wave == 2, bwidth(3) lc("230 76 138") lw(medthick) ciplot(rarea) acolor("230 76 138%30") alw(5) level(95)) /// 
    (lpolyci totasset hgage1 if group == 2 & wave == 2, bwidth(3) lc("25 154 222") lw(medthick) ciplot(rarea) acolor("25 154 222%30") alw(none) level(95)) /// 
    (connect totassetave hgage1 if group == 1 & wave == 2, lc("230 76 138%70") lwidth(medthin) lpattern(shortdash) m(oh) mlw(vthin) mc("230 76 138%90"))  /// 
    (connect totassetave hgage1 if group == 2 & wave == 2, lc("25 154 222%10") lwidth(medthin) lpattern(shortdash) m(oh) mlw(vthin) mc("25 154 222%90")), ///
    title("Wave 2", size(medsmall) position(11) justification(right)) ///
    legend(region(lstyle(none)) order(2 "type A" 4 "type B") col(2) pos(0) ring(1) bplace(ne) rowgap(.1) colgap(1) size(small) color(none) region(fcolor(none))) /// angle(h) 
    ytitle("Total assets", size(small)) xtitle("Age", size(small)) ///    
    xla(20(10)100, format(%8.0fc) labsize(vsmall)) xtick(20(10)100) xmtick(15(10)95) ///
    yla(0(400000)2000000, format(%10.0fc) labsize(vsmall)) ytick(0(400000)2000000) ymtick(200000(400000)2000000, grid nogmin gex glc(gs12) glp(dot) glw(medthin))  ytick(0(.1).5) ///
    plotr(margin(zero) lw(medthin)) scheme(burd) name("Fig4", replace) scale(1.2)
Comments on options and code appreciated. (Note my draft code is copied from elsewhere and amended to suit.)

One option is to take the natural log of these values (after applying this change to the first five lines of code) I obtain this graph - still with extreme values. (Note - there are no negative values). I believe using -yscale(log)- and/or -ylabels- will help here but I have not yet worked out how to code such that the values are in $ terms. Any suggestions here appreciated.

[graph attached]

Stata v.15.1. I am using panel data. This post has its roots at #11-#15 here https://www.statalist.org/forums/for...-loop-question - though 'morphed' from the original thread title hence reposting.











Extract substring between nth and (n+1)th commas in a variable

How can I extract a substring between the nth and (n+1)th commas in a variable?

For example, consider ID = 3 and beta = "eight,nine,ten,eleven,twelve". How could I extract the substring between the 3rd and 4th commas? (Answer: "eleven")

Code:
clear
input ID strL beta
1 "one,two,three,four"
2 "five,six,seven"
3 "eight,nine,ten,eleven,twelve"
end
Please note this is a vastly simplified example of an 80,000+ observation dataset where I have as many as 1,000 commas in an observation of the variable beta. I am using Stata 16.1 on Windows 10.
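A minimal sketch of one approach, assuming the comma-separated pieces themselves contain no spaces (as in the example): converting commas to spaces lets word() pick out the (n+1)th piece.

```stata
local n = 3
* the piece between the nth and (n+1)th commas is the (n+1)th token
gen piece = word(subinstr(beta, ",", " ", .), `n' + 1)
```

With up to 1,000 commas per observation, `split beta, parse(,)` would also work, but it creates one new variable per piece, which may be unwieldy at that scale.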

Many thanks!

How to select (handpick) which correlation matrix columns esttab outputs after estpost correlate?

Dear Statalisters, I would like to find out whether it is possible to select the specific column(s) to be produced in the correlation matrix output. Supposedly:

Code:
estpost correlate var1-var20, matrix
In the output I'm looking for, the rows will still consist of all variables listed (var1 to var20), but for the columns I would like to handpick which variables appear (i.e. var1 to var5), so that I can make separate esttab outputs after arranging the variables in a specific order/sequence. The table layout will look something like this:

First Sequence:
Code:
Variables | var1 | var2 | var3 | var4 | var5
var1
var2
var3
var4
var5
var6
var7
var8
var9
var10
var11
var12
var13
var14
var15
var16
var17
var18
var19
var20
Second Sequence:
Code:
Variables | var6 | var7 | var8 | var9 | var10
var6
var7
var8
var9
var10
var11
var12
var13
var14
var15
var16
var17
var18
var19
var20
Third Sequence:
Code:
Variables | var11 | var12 | var13 | var14 | var15
var11
var12
var13
var14
var15
var16
var17
var18
var19
var20
Fourth Sequence:
Code:
Variables | var16 | var17 | var18 | var19 | var20
var16
var17
var18
var19
var20
My apologies for using the code delimiter to outline the table output. Obviously it is not a sequence of code, but I hope you understand I am trying to describe the correlation output table that I would very much like to produce. Any help will be much appreciated. Thank you so much and cheers!

Best,
Nampuna

MATCHIT- Stata for data consolidation and cleaning using fuzzy string comparisons

Hello,

I came across your matchit command in Stata for data consolidation and cleaning using fuzzy string comparisons. I would like to use it for matching EU-ETS installations (ID) and emission details (ED) of such installations. ID contains locations and ED contains emissions from these installations. Both files contain the unique identification code PERMIT_ID. The problem is that there are more than 15,000 entries in ID and more than 12,000 entries in ED. I want to match both files and find the locations for the installations in the ED file. I know the only way is to match by PERMIT_ID, but I cannot figure out the command. I have checked in Excel that both files have more than 10,000 common entries. I also do not want to discard the extra 2,000 entries in the ED file.

The columns look as below. I would really appreciate it if you could suggest a way.


ED file:
INSTALLATION_NAME PERMIT_ID 2019 2018 2017
Baumit Baustoffe Bad Ischl IKA119 46415 42302 48681
Breitenfelder Edelstahl Mitterdorf IES069 26777 23457 21031
Ziegelwerk Danreiter Ried im Innkreis IZI155 5129 3738 3487
Isomax Dekorative Laminate Wiener Neudorf ICH113 33020 30922 30088
Sandoz Werk Kundl ICH106 57869 58011 65993
Ziegelwerk Martin Pichler Aschach IZI150 14202 14040 8852
FHKW Süd StW St. Pölten EFE041 1476 3060 3687
FHKW Nord StW St. Pölten EFE040 31973 32342 31624
Vetropack Pöchlarn IGL173 59982 59513 58286
Vetropack Kremsmünster IGL172 68178 61631 66001
Sinteranl., Hochöfen, Stahlwerk Donawitz IVA065 2846643 2923552 3075201
Voestalpine Stahl Linz IVA062 8812969 7816077 9220971
VOEST-Alpine Stahl Linz (Kalk) Steyrling IKA120 303621 274486 346279

ID file:
Installation Name Permit_ID NUTS_ID
Baumit Baustoffe Bad Ischl IKA119 AT31
Breitenfelder Edelstahl Mitterdorf IES069 AT22
Ziegelwerk Danreiter Ried im Innkreis IZI155 AT31
Wienerberger Blindenmarkt IZI146-1 AT12
Isomax Dekorative Laminate Wiener Neudorf ICH113 AT12
Sandoz Werk Kundl ICH106 AT33
Ziegelwerk Martin Pichler Aschach IZI150 AT31
FHKW Süd StW St. Pölten EFE041 AT12
FHKW Nord StW St. Pölten EFE040 AT12
Vetropack Pöchlarn IGL173 AT12
Vetropack Kremsmünster IGL172 AT31
Energiepark Donawitz IVA066 AT22
Sinteranl., Hochöfen, Stahlwerk Donawitz IVA065 AT22
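Since both files share the exact identifier PERMIT_ID, fuzzy matching with matchit may not be needed at all; a plain -merge- keyed on that variable is one possible sketch (the file names below are placeholders):

```stata
use ED_file, clear
* if the ID file's variable is spelled Permit_ID, rename one side first,
* since Stata variable names are case-sensitive
merge 1:1 PERMIT_ID using ID_file, keep(master match)
* _merge==3 flags the ~10,000 matched installations;
* keep(master match) also retains the ~2,000 ED-only rows.
```

This assumes PERMIT_ID uniquely identifies rows in both files; if not, `merge m:1` (ED to ID) would be the variant to try.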


Thanks

need help: panel data analysis

Hello,

I have a panel data sample consisting of observations over several years. For each year I have some duplicates which I won't drop since they are important data. My problem is that I don't know how to tell Stata that I have panel data, since it refuses my time variable. What can I do? Moreover, I don't know how to do a bivariate correlation with this kind of data. Can anybody help me please?

Kind regards

Problem performing an ado function

Dears,

I am new to this forum and to Stata; thank you for your help.
My question is:
I want to run an ado program according to the following:

Part1: .ado

* program to do a rolling Granger causality test
capture program drop rgranger
program rgranger, rclass
    syntax varlist(min=2 numeric ts) [if] [in] [, Lags(integer 2)]
    var `varlist' `if' `in', lags(1/`lags')
    vargranger
    matrix stats = r(gstats)
    return scalar s2c = stats[3,3]
    return scalar s2i = stats[6,3]
    return scalar s2y = stats[9,3]
end


Part2:


rgranger lconsumption linvestment lincome, lags(4)

Despite the fact that I saved part 1 in an .ado file and put it in the PLUS directory (c:\ado\plus), when I run the command shown in part 2, I get the message: invalid name.

Please, is there someone who can help me overcome this problem?
Thank you in advance.

Can i pick the Model with lower AIC but higher BIC?

Good day Everyone,

I was comparing two models
Model 2 has a lower AIC but a higher BIC.
Can I pick Model 2 as the better model, given that its log likelihood ll(model), used in the LR test, is also better than Model 1's?

Model    N       ll(null)  ll(model)  df  AIC        BIC
Model 1  47,111  .         -5159.380   8  10334.760  10404.650
Model 2  47,111  .         -5153.186  12  10330.370  10435.210

thank you for responding.

Controlling for year when appending two datasets

Hello,

One newbie question, but I'm stuck on this basic task. I'm pooling two datasets from two different years and have successfully added observations to the main dataset with the -append- command. However, when running my regressions I would like to control for the survey year, but I don't know how, since there is no such variable, given the datasets' originally cross-sectional nature. Do I need to create a variable that covers the entire dataset and then control for it? How do I do that? There is a variable for the respondents' ID, but there are thousands of observations in that one.
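One common pattern is to tag each dataset with its survey year before (or right after) appending; the file names and years below are placeholders for illustration:

```stata
use survey_a, clear
generate year = 2010                    // year of the first survey
append using survey_b
replace year = 2018 if missing(year)    // rows added by -append-
```

The new variable can then enter the regression as an indicator, e.g. `regress outcome x1 x2 i.year`.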

Thankful for any ideas!

Quantile regression (QR) for panel data

Dear all,

I hope you are doing well.

I'm trying to model a nonlinear relationship. Is quantile regression an appropriate model?

(Some colleagues tell me that QR does not deal with nonlinearity.)

I would appreciate your help.

Best regards

Reverse Causality and Panel Data

Hi all,
I hope all is well.
Please, I am studying whether some variables affect firm instrument issuance. Assume that I have these variables (X1, X2, X3). In the logit model, the results show that firms with lower X1 are more likely to issue this security. In the second model, I used a fixed-effects regression to see the impact of the instrument issuance on firm X1; the results show that issuing this instrument has no significant impact on reducing X1, nor on increasing it. My professor says that the result for X1 in the logit contradicts the finding for X1 in the fixed-effects model, and that perhaps there is reverse causality. I have my own ideas to justify this issue, but please, what does this mean (that there is reverse causality)? And how can I test for reverse causality in Stata? Thank you so much.

table1_mc Stata 16 issue

Hello,

I am trying to create a table 1 in stata 16.

I am using the following code:
table1_mc, by(race_ethnicity) vars(age contn %4.0f \ gender bin %4.0f \ state cat %4.0f)

However, every time I run the code I get the error message

command table1_mc is unrecognised.

I have tried installing the package:
ssc install table1_mc

Whenever I do this, it says file c:\ado\plus\trek.nxt already exists.
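One thing worth trying, assuming the earlier installation attempt left files behind, is to force ssc to overwrite them:

```stata
ssc install table1_mc, replace
```

The replace option tells Stata to overwrite any existing files from a previous (possibly incomplete) installation of the package.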

Would really appreciate any help!

Hannah

Analysis of Demographic Factors Across Syllabi

Hi all!

I'm very new to Stata, so apologies in advance if my question isn't very well formed. I am performing an analysis of gender, race, and nationality bias in international affairs syllabi. I have collected 60 syllabi and coded each one for the professor's age and race, and for each assigned reading's author's race, gender, and nationality, as well as the date published for each assigned reading. I am trying to determine whether the professor's race and gender have an impact on the demographics of the readings they assign. I am first doing it by individual factors (e.g. do white professors assign more white authors) and then by intersectional factors (e.g. do white, male professors assign more white, male, American authors). I have made all the variables dummy variables (e.g. 0 if male, 1 if non-male), but I'm not sure where to go from here. I would like to be able to calculate things such as how much more likely a white professor is to assign a white author. Thank you so so so much in advance for your help!
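With everything coded as 0/1 dummies, one possible starting point is a logistic regression of an author characteristic on the professor's characteristics; all variable names below are placeholders for illustration:

```stata
* Does a professor's race predict assigning a white author?
* The -or- option reports odds ratios, which make
* "how much more likely" statements easier to read.
logit author_white i.prof_white, or

* Intersectional version with an interaction of professor traits:
logit author_white i.prof_white##i.prof_male, or
```

Since readings are nested within syllabi, clustering the standard errors by syllabus (e.g. `vce(cluster syllabus_id)`, assuming such an identifier exists) would likely be worth considering.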

Best,
Manush

Why do I need to run the keep command twice to delete some observations?

Code:
------------------------------------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /Users/ll/Downloads/log.smcl
  log type:  smcl
 opened on:  30 Mar 2021, 21:28:40

. cd "/Users/ll/Downloads/"
/Users/ll/Downloads

. use significance7.dta, clear

. append using significance8
.
. tempvar tempid tempMediatorTotal

. egen `tempid' = group(IV Moderator Controls),missing

. gen `tempMediatorTotal'=Mediator

. bys `tempid': replace `tempMediatorTotal' = `tempMediatorTotal'[_n-1]+ " " +`tempMediatorTotal' if _n>1
(27,500 real changes made)

. bys `tempid': replace `tempMediatorTotal' = `tempMediatorTotal'[_N]
(27,500 real changes made)

.
. keep if strpos(`tempMediatorTotal',"pos")>0 & strpos(`tempMediatorTotal',"neg")>0
(18,664 observations deleted)

.
.
. tempvar length length1 length2

. gen `length'=length(IV)

. gen `length1'=length(Controls)

. gen `length2' = length(Mediator)

. hashsort -`length' IV Moderator   -`length1' -`length2' Mediator
(note: missing values will be sorted first)

.
. gegen tempid = group(IV Moderator Controls),missing

. su tempid

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      tempid |     12,008    395.5428    212.1199          1        728

. order tempid

.
. keep if pvalueInteraction < .05
(5,647 observations deleted)


. tempvar tempid tempMediatorTotal

. egen `tempid' = group(IV Moderator Controls),missing

. gen `tempMediatorTotal'=Mediator

. bys `tempid': replace `tempMediatorTotal' = `tempMediatorTotal'[_n-1]+ " " +`tempMediatorTotal' if _n>1
(5,753 real changes made)

. bys `tempid': replace `tempMediatorTotal' = `tempMediatorTotal'[_N]
(5,753 real changes made)

.
. keep if strpos(`tempMediatorTotal',"pos")>0 & strpos(`tempMediatorTotal',"neg")>0
(2,781 observations deleted)

.

. log close _all
      name:  <unnamed>
       log:  /Users/ll/Downloads/log.smcl
  log type:  smcl
 closed on:  30 Mar 2021, 21:29:19
------------------------------------------------------------------------------------------------------------------------------------
As you can see, the second time I run the keep command, some observations are deleted. Why were those observations not deleted the first time?
Is it because of the usage of egen group?

Panel regression model / Multiple regression model?

Hi everyone,

I have a dataset which links the Covid-19 situation to the change in port activity (expressed as the difference in port calls = ship arrivals in ports) for any specific date between 01/01/2020 - 31/08/2020 and any given country. An example dataset for the countries China & Belgium is attached so you can see the data formulation yourself. In the end the dataset is a list per country and date, linking the country to indicators summarising the Covid-19 situation in that country on that particular date. More information about the indicators: each policy indicator reports a number which reflects the severity of the corresponding containment policy. The stricter a government's policy is, the higher the corresponding indicator will be. C1, C2, and C6 are all reported on a scale from zero to three (0 - 1 - 2 - 3). On the other hand, C3, C5, and C7 use a scale from zero to two (0 - 1 - 2), while C4 and C8 use a scale from zero to four (0 - 1 - 2 - 3 - 4). The stringency index is reported as a percentage, where a value close to 100% indicates the strictest situation one can imagine.

First, the stringency index is linked to the change in port calls by a single regression model.
Second, the main purpose is to check how much of the change in port calls is explained by the indicators and even more crucial, which indicator has the declarative value.

Now, I had some questions related about the second research question.
- Would it be best to use a panel regression model, or to use a multiple regression model?
- There are three different scales of containment policy used in the dataset; does Stata automatically recognise the difference between these scales?
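As a sketch of the panel route (variable names assumed, not taken from the attached data), the indicators can be entered as categorical regressors with the i. prefix. This also speaks to the scale question: Stata does not infer ordinal scales on its own, but factor-variable notation keeps each level of each indicator distinct:

```stata
encode country, gen(cid)        // numeric panel identifier from the country string
xtset cid date                  // declare country-by-date panel
xtreg d_portcalls i.c1 i.c2 i.c3 i.c4 i.c5 i.c6 i.c7 i.c8, fe vce(cluster cid)
```

Whether fixed effects are appropriate here depends on the research design; treating the indicators as continuous instead (dropping the i. prefixes) would impose a linear effect across policy levels.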

Thank you in advance!
Dan

Regressing a variable recorded in 2018 on a variable of 2011

Hello everybody, I'm a beginner with Stata and I'm already facing my first issues.
I have panel data where my dependent variable (average age by municipality, called "old" in my panel data) was recorded in 2011, while my "treatment" variable (housing prices) belongs to 2018. I simply want to regress the 2011 average age on the 2018 housing prices. Of course there are a lot of missing observations, but I thought Stata could drop them automatically. However, when I send the command

regress housing old
I get the error r(2000) no observations.

So my question is: how do I tell Stata to match the 2011 data with the corresponding 2018 data of the same municipality?
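One possible sketch, assuming a municipality identifier (here called mun) and a year variable, is to spread the two years' values into columns within each municipality and then run a single cross-sectional regression:

```stata
* copy each municipality's 2011 and 2018 values onto every row of that municipality
egen old2011     = max(cond(year == 2011, old, .)), by(mun)
egen housing2018 = max(cond(year == 2018, housing, .)), by(mun)

bysort mun: keep if _n == 1          // one row per municipality
regress housing2018 old2011
```

The cond()/max() trick picks out the value recorded in the named year; municipalities missing either year drop out of the regression automatically.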

Thank you in advance.

Omitted interaction terms in the fixed effect due to collinearity

Hi all,

I have a question when I am running a fixed effect regression which is used to investigate how the dependent variable will be when the level of organizations changes over the year. Here is my command:

Code:
xtset year

xtreg coverage i.high i.year i.performance i.finance i.planning i.planning#i.year i.finance#i.year i.performance#i.year,r
**Performance/Planning/Finance are all assurance elements that measure the quality of the organization. Levels A, B, C, D are the ratings of each element (A is the best rating, D the worst).

The results are:
(regression output attached as an image)


The first part of the result seems reasonable. All the coefficients of level D are omitted as the baseline, so the other coefficients can be compared with it. For example, performance level B is 1.53% lower than performance level D in the dependent variable (immunization coverage rate). Here planning level A is also omitted. I think this is because of a high correlation between level A and level D: since the fixed effects take out all the variance at the group level, there is nothing left for level A to explain. If that is the case, it could mean that level D organizations may have coverage as high as level A organizations.

Now for the second part: interactions of the assurance levels with years. All the year-2005 terms are omitted as the baseline. For example, in 2006 a finance level B organization has a 3.96% higher coverage rate than a finance level B organization in 2005. My question is: why are the coefficients of the level A × year interactions all omitted?
(regression output attached as an image)


Thank you so much!! Looking forward to your reply!

Replace odd commas by spaces in a variable

Hi,
Thanks in advance for your help. This may be an easy question for many of you, but I have been unable to find a solution; sorry if so!

I have a variable (GEOMETRY) whose value I need to change by replacing odd commas with a space. In this example, I am trying to replace this value:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| GEOMETRY |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
1. | MultiPolygon(((232840.245027644, 4143512.94418335, 233025.149040326, 4143497.94213052, 233137.343532, 4143913.93358042, 233082.751561214, 4143979.93295409, 233074.45.. |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

By this value:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| GEOMETRY |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
1. | MultiPolygon(((232840.245027644 4143512.94418335, 233025.149040326 4143497.94213052, 233137.343532 4143913.93358042, 233082.751561214 4143979.93295409, 233074.45.. |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

As you can see, every second comma (starting with the first one) should be replaced by a space or deleted (either is fine for me), but I do not know how to do it.
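One approach I have been wondering about (a sketch; it assumes Stata 14 or newer for the Unicode regular-expression functions): let the regex engine consume "number, number" pairs from left to right, dropping the comma inside each pair, so that only the commas between pairs survive.

Code:
* each match eats one "x, y" pair, so the odd commas disappear and the
* even ones (between pairs) are left untouched
gen GEOMETRY2 = ustrregexra(GEOMETRY, "([0-9.]+), ([0-9.]+)", "$1 $2")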

What I intend is to adapt a format used by one system (Oracle Spatial) to another (WKT, used in QGIS). That is why I need this replacement.
Thanks a lot
Juan Miguel Gómez.

Importing an Imputed File

Hi everyone,

I created an Imputed version of my data on a PC machine using Stata/SE. I've now opened the data set using Stata/IC on a Mac. Does anyone know if Stata/IC will understand that the data is imputed when I try running regression analysis, or will it see each row as a new participant?

Thanks

Dougie

Mark latest observation with unbalanced paneldata

I want to mark the latest observation for each person in an unbalanced panel dataset.

Time variable: t (varies from 0 to 17)

Person id: id


The program code I use now:

bysort id t: generate last_obs = sum(t==17)==1 & sum(t[n-1]==2)==0


But the problem is that it only marks the persons whose time variable reaches 17; it does not mark the persons with fewer than 17 observations.


So how do I generate a variable that marks the latest observation no matter how long the time series is?
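For the record, something along these lines is what I am after (a sketch):

Code:
* sort each person's rows by time and flag the last one observed,
* whatever value of t that happens to be
bysort id (t): generate last_obs = (_n == _N)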


Need help with sampling process

Hello,

I am working with a dataset with merged data from 2 survey rounds (2005 & 2011).

The first column/variable 'id' is the unique identification number for the individuals.

The second column/variable "SURVEY" represents whether the observation is from survey 1 or 2.

I want to only keep the observations that are present in both surveys.

Currently, the data have been sorted by 'id', and you may notice there are two observations for the same individual, one from survey round 1 and the other from round 2. That's exactly what I need. However, is there a way to drop the observations that lack a matching observation from the other round?

Here's a sample of the data:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 id int SURVEY
"101020102010" 1
"101020102010" 2
"101020102011" 2
"101020102014" 2
"10102010205"  1
"10102010206"  1
"10102010206"  2
"10102010207"  1
"10102010207"  2
"10102010304"  1
"10102010305"  1
"10102010305"  2
"10102010306"  1
"10102010306"  2
"10102010307"  2
"10102010307"  1
"10102010403"  1
"10102010404"  1
"10102010404"  2
"10102010405"  2
"101020105010" 2
"101020105010" 1
"10102010507"  1
"10102010508"  1
"10102010509"  1
"10102010509"  2
"10102010708"  2
"10102010708"  1
"10102010709"  2
"10102010709"  1
"10102010804"  2
"10102010805"  2
"10102010806"  2
"10102010904"  1
"10102010906"  1
"10102010907"  2
"10102010907"  1
"10102010908"  1
"10102010908"  2
"10102010909"  2
"10102010909"  1
"101020110010" 2
"10102011203"  2
"10102011204"  2
"10102011304"  2
"10102011304"  1
"10102011305"  1
"10102011306"  1
"10102011306"  2
"10102011307"  2
"10102011307"  1
"10102011403"  2
"10102011403"  1
"10102011404"  2
"10102011404"  1
"10102011405"  1
"10102011405"  2
"10102011406"  2
"10102011407"  2
"10102011605"  2
"10102011605"  1
"10102011606"  2
"10102011606"  1
"10102011607"  1
"10102011607"  2
"10102011702"  1
"10102011703"  1
"10102011704"  1
"10102011704"  2
"10102011705"  1
"10102011705"  2
"101020118010" 1
"10102011806"  1
"10102011806"  2
"10102011807"  2
"10102011807"  1
"10102011808"  2
"10102011903"  1
"10102011903"  2
"10102011904"  2
"10102011904"  1
"10102011905"  2
"10102011905"  1
"10102011906"  2
"10102012007"  1
"10102012007"  2
"10102012008"  1
"10102012008"  2
"10102020103"  1
"10102020104"  1
"10102020105"  2
"10102020105"  1
"10102020303"  2
"10102020304"  2
"10102020305"  2
"10102020403"  1
"10102020405"  1
"10102020405"  2
"10102020505"  1
"101020206010" 2
end
label values SURVEY SURVEY
label def SURVEY 1 "IHDS1 1", modify
label def SURVEY 2 "IHDS2 2", modify
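For reference, I imagine something like this sketch (it assumes each id appears at most once per round):

Code:
* keep only the ids that appear in both survey rounds
bysort id: keep if _N == 2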
Thanks for your help in advance!

Cartesian product of Columns

Dear Statalisters,

I have 2 sets, Set_1 = {HH, HT, TT, TH} and Set_2 = {1, 2, 3, 4}. These are the initial column values and names.
I want to expand this into the Cartesian product, a 16 x 2 dataset, like so:
Set_1 Set_2
HH 1
HH 2
HH 3
HH 4
HT 1
I have been searching for element-wise multiplication and Cartesian products, and I keep finding the wrong kind of command for this job.
I don't have a dataset yet and have to construct one, so I couldn't post one here. I am also aware of the command
Code:
cross
which performs a pairwise merge, but in my case the issue is slightly different.
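For concreteness, here is a sketch of how I understand -cross- would be used if the two sets lived in separate datasets (I may be misreading its purpose):

Code:
* build Set_1 as its own dataset, then cross Set_2 with it
clear
input str2 Set_1
"HH"
"HT"
"TT"
"TH"
end
tempfile s1
save `s1'

clear
input Set_2
1
2
3
4
end
cross using `s1'
sort Set_1 Set_2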

Do help and thank you, in advance!

GMM for time series

Hello Statalist,
1. I am using time series data. My main equation is
pb_t = a0 + a1*pb_(t-1) + a2*pd_(t-1) + a3*og_t + e_t
but there is a problem of endogeneity, so we use instrumental variables; the equation for that is
pd_(t-1) = b0 + b1*pb_t + b2*EX_t + b3*og_(t-1) + v_t   (here EX_t and og_(t-1) are the instrumental variables)
here
pb is primary balance
pd is public debt
og is output gap (using HP filter)
EX is exchange rate.
How do I use GMM for the above data, particularly the IVs?
2. Since my data are time series, do we need a stationarity check before using GMM?
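To make the question concrete, here is a sketch of what I have tried to write down (the variable names pb, pd, og, ex are mine, and I am not sure the instrument list is specified correctly):

Code:
tsset t
* L.pd is treated as endogenous, instrumented by ex and L.og
ivregress gmm pb L.pb og (L.pd = ex L.og), wmatrix(hac nwest)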
Thanks for your valuable input.

Monday, March 29, 2021

Stationarity, Markov switching

Hi everyone, I am trying to model a time series with Markov-switching regimes. As a preliminary question, does my time series need to be stationary? Doesn't the notion of regime switching itself imply that the series should be non-stationary? Can I model a non-stationary time series with Markov-switching regimes?
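For context, this is the kind of model I am trying to fit (a sketch, assuming a time variable t and a series y):

Code:
tsset t
* two-state Markov-switching dynamic regression
mswitch dr y, states(2)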

Out of sample estimation following xtdpdgmm

Stata throws a 301 error (last estimates not found) whenever I try to use estimates from a dynamic model (using xtdpdgmm) to predict out of sample output.
Steps I followed:
1- Run regression: xtdpdgmm L(0/1).a L(0/2).b L(0/2).c L(0/2).d L(0/2).e L(0/2).f, collapse model(difference) gmm(l.a, lag(0 1)) gmm(b d f, lag(1 1)) gmm(c e, lag(. .)) teffects vce(robust)
2- est save model1
3- on a different dataset: est use model1 (loads correctly).
4 - predict yhat, xb → output: last estimates not found, r(301).

Identifying missing Data & Interpolation of Panel Data

Hello,

I've created a dataset of 30 countries over 30 years for 29 different variables. For some countries the 30-year span has several missing data points, and I was wondering how to interpolate these missing observations for each country. Also, is there a command to identify which countries have completely or nearly empty series for a specific variable? For example, given a dataset like this:
country   year  selfemployed%  FDI(inflow %)
Bermuda   2001  3%             25.16%
Bermuda   2002  .              23.40%
Bermuda   2003  .              6%
Bahamas   2001  .              .
Bahamas   2002  .              14%
Bahamas   2003  .              20%
Barbados  2001  14%            16%
Barbados  2002  16.3%          5%
Barbados  2003  16%            .
Is there a command that summarizes which countries have missing data across all variables? I would like to be able to tell which variables I should delete, given that very few observations exist for multiple countries. I would also like to know how to interpolate, for instance, the missing FDI data point for Barbados in 2003.
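A sketch of what I am after (selfemp and fdi are stand-ins for my actual variable names):

Code:
encode country, gen(cid)

* how many missings per country for one variable?
bysort cid: egen miss_fdi = total(missing(fdi))
tabulate country if miss_fdi > 0

* linear interpolation within each country's time series
bysort cid (year): ipolate fdi year, gen(fdi_ip)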

Thank you!

Shifting observations across rows

Hello,
I am fairly new to Stata and am having trouble even starting on this problem. I am working with the National Longitudinal Survey of Youth 1997 from the Bureau of Labor Statistics dataset. I am having some trouble with the household roster section. Individuals who belong to a household are recorded each year and given an identifying number (i.e. 1 for mother, 2 for father, 3 for sister, etc.). I was interested in isolating the siblings. Doing this, though, has left me with many missing values and sibling variables that go up to 40 because the number of household members changes from year to year.

I found some code on another forum that does this.

https://stackoverflow.com/questions/...tions-in-stata

However, the code there changes my numbers from 13-26 to 1-8. Additionally, it increased the number of variables I had: originally I had 30 siblings in 1997, and after running the code I had 37. How do I fix the code so that these things do not happen?

Reading image properties such as GPS and date taken

Hi everyone, I was wondering if there is a way to read photo/image properties such as date taken, date modified, and GPS coordinates. I have about 15,000 images.
I've tried different commands, and the 'ls' command comes closest to what I need, but it only gives the date created on my device, the size, and the filename.

Thanks in advance

Diego

Alternatives to Propensity Score Matching (PSM)

Hello everyone,

I have a large dataset of forest pixels with different control variables such as altitude, slope, distance to nearest river etc. I would like to match the observations from my treatment area with the control area, to later estimate the effect of a policy trying to mitigate deforestation via diff-in-diff.
However, the two areas (treatment and control) show large significant differences in their baseline values; the decision to implement the program does not really depend on those values, but the deforestation rate does. Therefore, I think that PSM is not the right approach here.
When estimating the pscore I obviously get very large values near 1 for the treated and very low ones near 0 for the control area, so the common support is very limited (of the treated, 24,000 are off support and only 6,000 on support, while all controls are on support).
Therefore, I am looking for a matching method where I can match my treatment with control observations based on the covariates to implement a diff-in-diff afterwards.
I was searching for literature on the topic but got quite confused with the variety of matching methods out there.

Do you know useful literature on the topic which gives an overview over different matching methods and their applicability?
Can you suggest a matching method which is more applicable for my situation given I want to conduct a diff in diff analysis?

Thanks!

Best,
David

FE constant vs RE constant

Hi,

I am estimating a model by FE, and the constant I get is not close to the mean of my only explanatory variable; but when I run the same specification with xtreg, re for a Hausman test, the constant is much closer to the mean. Is this of any significance? I am not quite sure what it means when the constant differs so much from the mean, and whether this might be an indicator of a problem.

any advice would be appreciated

Categorical x Categorical interaction

Hi all,

I am hoping someone can clear something up for me.

I want to run a Poisson regression model, with a continuous dependent variable and with sex and age as factor variables.

glm cpssdx ib0.female##ib5.age_factor, fam(poisson) link(log) nolog vce(robust) eform baselevels

I want to report the interaction term but can't determine whether this would be from either of the following codes:

contrast female@age_factor

Contrasts of marginal linear predictions

Margins : asbalanced

-----------------------------------------------------
| df chi2 P>chi2
------------------+----------------------------------
female@age_factor |
13 | 1 1.05 0.3066
14 | 1 0.43 0.5096
15 | 1 0.55 0.4572
16 | 1 1.39 0.2376
17 | 1 10.57 0.0011
Joint | 5 14.00 0.0156
-----------------------------------------------------

(i.e., chi2(5) = 14.00, p = 0.0156) - so a significant interaction.

OR

margins ib0.female#ib5.age_factor

Adjusted predictions Number of obs = 717

Expression : Linear prediction, predict()

-----------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
------------------+----------------------------------------------------------------
female#age_factor |
MALE#13 | .1111111 .1042772 1.07 0.287 -.0936188 .3158411
MALE#14 | .2272727 .0666959 3.41 0.001 .096327 .3582184
MALE#15 | .2714286 .0528782 5.13 0.000 .1676115 .3752456
MALE#16 | .25 .0638565 3.92 0.000 .124629 .375371
MALE#17 | .1190476 .0482709 2.47 0.014 .0242761 .2138192
FEMALE#13 | .2325581 .067467 3.45 0.001 .1000986 .3650177
FEMALE#14 | .2803738 .0427694 6.56 0.000 .1964035 .3643441
FEMALE#15 | .326087 .0461245 7.07 0.000 .2355296 .4166443
FEMALE#16 | .35 .044241 7.91 0.000 .2631404 .4368596
FEMALE#17 | .3423423 .0419917 8.15 0.000 .2598989 .4247858
-----------------------------------------------------------------------------------

I specifically want to look at sex differences at each age - but there appears to be numerous ways to go about this.

Anyone have any thoughts on this?
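For example, one approach I am considering is asking margins directly for the female-male difference at each age (a sketch, following the glm above):

Code:
* average marginal effect of female at each level of age_factor
margins age_factor, dydx(female)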

Thanks!

xttab

Hi,

I am creating a Word document with descriptive statistics for my panel data. I am using the following code, which works:
asdoc xttab Country_code, save(Descriptive.doc) replace dec(1) title(Descriptive statistics)
However, I would like to create a table with only the within Freq and Percent (so without Overall and Between). Does anyone know how to do this?
Thanks in advance!

Shaping graphs using grc1leg

Hi all,

I'm making a set of different scatters using the following code:

Code:
 foreach x in physical_strength conceptualisation serving_attending managing_coordinating latitude control_ext control_int teamwork repetitiveness technical_c bureaucratic_c {
twoway (scatter `x' log_wage if year==2005, msize(small) msymbol(circle_hollow) mcolor(ebblue)) (scatter `x' log_wage if year==2016, msize(small) msymbol(circle_hollow) mcolor(orange)) , legend(label(1 "2005") label(2 "2016") size(medsmall)) xtitle("log wage", size(medsmall)) ytitle("task score", size(medsmall)) xlabel(, labsize(medsmall)) ylabel(, labsize(medsmall)) title("`x'", size(medsmall)) name(`x', replace) nodraw xsize(20) ysize(20) 
}
This way, using xsize(20) and ysize(20), I get square graphs, which is what I want. However, once I try to combine them with the command:
Code:
 grc1leg physical_strength  conceptualisation  serving_attending managing_coordinating, xcommon legendfrom(conceptualisation)
whatever I try, they change shape and are no longer square. Is it possible to maintain the exact shape I specify in the twoway scatter when combining with grc1leg? Is there any way to fix the square shape of the scatters?
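One thing I considered (a sketch; I am not sure it is the intended way) is redisplaying the combined graph with the sizes re-imposed:

Code:
grc1leg physical_strength conceptualisation serving_attending managing_coordinating, xcommon legendfrom(conceptualisation)
* force the overall canvas back to a square after combining
graph display, xsize(20) ysize(20)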

Thanks all!

create dataset based on all possible pairs of identifiers within each group in Stata

Hi,

I have a dataset that looks like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 country1 str11 country2 str4 group
"China"   "Philippines" "68a"
"China"   "Thailand"    "68a"
"Bahamas" "Jamaica"     "176a"
"Bahamas" "Grenada"     "176a"
end
I need to transform the above dataset into this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str11(country1 country2) str4 group
"China"       "Philippines" "68a"
"China"       "Thailand"    "68a"
"Philippines" "China"       "68a"
"Philippines" "Thailand"    "68a"
"Thailand"    "China"       "68a"
"Thailand"    "Philippines" "68a"
"Bahamas"     "Jamaica"     "176a"
"Bahamas "    "Grenada"     "176a"
"Jamaica"     "Bahamas"     "176a"
"Jamaica"     "Grenada"     "176a"
"Grenada"     "Bahamas"     "176a"
"Grenada"     "Jamaica"     "176a"
end
I tried my best to follow the Stata code in this article: https://www.stata.com/support/faqs/d...-to-all-pairs/. However, I ended up with a dataset that looks like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 country1 str11 country2 str4 group
"China"   "Philippines" "68a"
"China"   "Philippines" "68a"
"China"   "Thailand"    "68a"
"China"   "Thailand"    "68a"
"Bahamas" "Jamaica"     "176a"
"Bahamas" "Jamaica"     "176a"
"Bahamas" "Grenada"     "176a"
"Bahamas" "Grenada"     "176a"
end

I'm not sure what I'm doing wrong. Thanks in advance for your help!
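For concreteness, I think the result I want could be produced along these lines (a sketch using joinby; untested):

Code:
* build the unique list of (country, group), then pair every country
* in a group with every other country in the same group
preserve
keep country1 group
rename country1 country
tempfile half
save `half'
restore
keep country2 group
rename country2 country
append using `half'
duplicates drop

rename country country1
tempfile nodes
save `nodes'
rename country1 country2
joinby group using `nodes'
drop if country1 == country2
sort group country1 country2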

This question is also posted here: https://stackoverflow.com/questions/...ach-group-in-s

Best,

Dotty