Hello,
I would like to compare the "adjusted" mortality after surgery between two cohorts with very different sample sizes. Cohort A comprises 100,000+ patients, while cohort B includes only 500 patients. After adjusting for certain characteristics, I would like to see if there's a mortality difference between the two groups. Three methods come to mind: logistic regression, PS matching, and IPTW.
I prefer running a propensity score analysis over a multivariate logistic regression to show that the covariates are appropriately balanced between both groups.
I initially favored performing a propensity score matching. However, only a small fraction of patients from cohort A will be included in the final analysis. I am not sure if it is methodologically sound to do that given the huge difference in sample size.
IPTW may therefore be the best method, but wanted to confirm with people more experienced than me first!
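For what it may be worth, here is a minimal sketch of how the IPTW comparison could look in Stata. Every name below (died, cohortB, and the covariates) is a placeholder, so treat it purely as an illustration of the approach rather than as a worked analysis:
Code:
* Sketch only: IPTW via -teffects ipw-, with hypothetical variable names
* (died = postoperative mortality, cohortB = cohort indicator, age/sex/comorb = covariates)
teffects ipw (died) (cohortB age i.sex i.comorb), ate
* inspect covariate balance after weighting
tebalance summarize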
Friday, July 31, 2020
Why can't we cluster on anything we like?
I often get error messages indicating that Stata doesn't like the variable that I have chosen to cluster on. Consider the following model, which has student random effects and tries to cluster standard errors at the teacher level:
Code:
mixed absent ib(2).classtype ib("k").gradenum i.schid || stdntid:, cluster(tchid)
It produces the following error message:
highest-level groups are not nested within tchid
which is true, but so what? Why does the cluster option care whether students are clustered within teachers? Theoretically, it seems to me I should be able to cluster on teachers whether they nest students or not.
My best guess is that this is a computational issue -- some constraint used by Stata to keep the matrices involved in clustered standard errors manageable in size. But I don't know. Your expertise would be most appreciated.
Writing a Loop
Dear Statalist Users,
I have multiple txt files and I want to convert them into stata files so that I can append the data.
This is the manual way of importing the data and then saving it for stata.
import delimited "C:\Dir\Manu - 00-01.txt", delimiter("|") varnames(2)
rename v1 company
rename v2 product
save "manu_00-01", replace
However, I have 20 text files in the form of Manu - 00-01.txt, Manu - 00-02.txt, Manu - 00-03.txt, and so on till Manu - 18-19.txt.
I would appreciate your suggestions on writing a loop for this so that I can then append the files.
Thank you
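A possible loop, sketched under the assumption that the 20 files are consecutive year pairs running from Manu - 00-01.txt to Manu - 18-19.txt and that every file shares the layout of the manual example above (adjust the forvalues range if the naming pattern differs):
Code:
* convert each text file to a .dta file
forvalues y = 0/18 {
    local a : display %02.0f `y'
    local b : display %02.0f `y'+1
    import delimited "C:\Dir\Manu - `a'-`b'.txt", delimiter("|") varnames(2) clear
    rename v1 company
    rename v2 product
    save "manu_`a'-`b'", replace
}
* append the converted files into one dataset
clear
forvalues y = 0/18 {
    local a : display %02.0f `y'
    local b : display %02.0f `y'+1
    append using "manu_`a'-`b'"
}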
Problem with reaching 1:n ratio with psmatch2 !
Hi everyone,
I am trying to perform a 1:4 propensity score matching using the command psmatch2. After I run the code, the ratio of individuals in the treatment group to the individuals in the control group is approximately equal to 1:4 (23.6% and 76.4% instead of 25% and 75%). Whenever a 1:n matching is reported in the literature, the number of individuals matched is always exactly equal to the pre-specified ratio. Here's my code:
Code:
psmatch2 group, neighbor(4) pscore(score) caliper (0.2) quietly
*Creating matching groups
gen pair1 = _id if _treated==0
replace pair1 = _n1 if _treated==1
gen pair2 = _id if _treated==0
replace pair2 = _n2 if _treated==1
gen pair3 = _id if _treated==0
replace pair3 = _n3 if _treated==1
gen pair4 = _id if _treated==0
replace pair4 = _n4 if _treated==1
bysort pair1: egen paircount1 = count(pair1)
bysort pair2: egen paircount2 = count(pair2)
bysort pair3: egen paircount3 = count(pair3)
bysort pair4: egen paircount4 = count(pair4)
egen byte paircount = anycount(paircount1 paircount2 paircount3 paircount4), values(2)
tab group if paircount!=0
HTML Code:
group Freq. Percent Cum.
0 1,248 76.38 76.38
1 386 23.62 100.00
Total 1,634 100.00
Am I doing something wrong? How can I arrive at a ratio of 25% to 75%?
Thanks
stata ttest significance
Hi everyone,
I have the following dataset
I use the commands shown in the second code block below.
I want to get the level of significance at 1%, 5%, and 10%, but Stata gives me
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
How can I change this to reflect the significance levels I want to show? Thank you in advance for your help.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input int NEWID float(age married dep) byte(hh_mem mem) float mig 1 46 1 1 0 . 0 2 37 1 1 0 . 0 3 30 1 2 0 . 1 4 46 1 .6 0 . 1 5 55 1 1 0 . 0 6 60 1 .3333333 0 . 1 7 31 1 .5714286 0 . 1 8 35 1 1.3333334 0 . 1 9 51 1 .5 0 . 1 10 38 1 .5714286 0 . 1 11 59 0 .6 0 . 1 12 45 1 .75 0 . 1 13 29 1 2.5 0 . 1 14 66 1 1.25 0 . 0 15 24 1 .8333333 0 . 0 16 50 1 .6 0 . 1 17 29 1 1.5 0 . 1 18 50 1 .44444445 0 . 1 19 25 1 1 0 . 1 20 54 0 .7272727 0 . 0 21 59 1 .5 0 . 1 22 65 0 .625 0 . 1 23 56 1 .25 0 . 0 24 36 1 1 0 . 0 25 64 1 .3333333 1 . 0 26 59 1 1 1 . 0 27 64 1 .875 0 . 0 28 64 0 .8888889 0 . 1 29 33 1 1.3333334 0 . 1 30 54 1 .7142857 0 . 1 31 76 1 0 0 . 0 32 44 1 .6666667 0 . 0 33 27 0 .16666667 0 . 0 34 26 1 .8 0 . 1 35 46 1 .2 0 . 1 36 48 1 1 0 . 0 37 50 1 1 0 . 0 38 63 0 .25 0 . 1 39 52 0 1 0 . 0 40 35 0 .4285714 0 . 1 41 31 0 .6 0 . 1 42 42 1 2.5 0 . 0 43 43 1 1 0 . 1 44 51 0 1.5714285 0 . 1 45 47 1 1.2 0 . 1 46 35 1 1.2 0 . 1 47 50 0 1 0 . 1 48 41 1 1.25 0 . 0 49 37 1 1.0555556 0 . 1 50 37 1 .6666667 0 . 1 51 33 1 1.2 0 . 0 52 35 1 2 0 . 1 53 49 1 0 0 . 0 54 54 1 0 0 . 0 55 39 1 .6666667 0 . 0 56 34 1 4 0 . 0 57 26 1 1.2 0 . 0 58 57 1 1.8333334 0 . 1 59 31 1 1.2 0 . 1 60 48 1 1 0 . 0 61 41 1 2 0 . 0 62 67 1 1 0 . 1 63 55 1 .2857143 0 . 1 64 33 1 2 0 . 1 65 45 1 .44444445 0 . 1 66 31 1 .6666667 0 . 0 67 62 1 1.25 0 . 0 68 56 0 .8333333 0 . 1 69 22 1 .3333333 0 0 1 70 50 1 0 0 . 1 71 23 1 .4 0 . 1 72 50 1 .4 0 . 1 73 47 1 1.5 0 . 1 74 67 1 1 0 . 1 75 41 1 1.5 0 . 0 76 37 1 .75 0 . 1 77 61 1 1 0 . 1 78 25 1 1.5 0 . 1 79 32 0 1.3636364 0 . 1 80 53 0 .8571429 0 . 1 81 63 0 .75 0 . 0 82 28 1 1 0 . 1 83 38 1 1 0 . 0 84 36 1 1 0 . 0 85 26 1 .5 0 . 0 86 26 1 1 0 . 1 87 40 1 1.6666666 0 . 1 88 44 1 .25 0 . 1 89 39 1 .5 0 . 1 90 48 1 .4 1 0 1 91 42 1 .4 0 . 1 92 56 1 .6 0 . 0 93 54 1 .3333333 0 . 0 94 45 1 .4 0 . 0 95 41 1 .6666667 0 . 0 96 55 1 .5 0 . 0 97 50 1 .7 0 . 1 98 57 0 .75 0 . 0 99 33 1 2 0 . 0 100 25 1 .5 0 . 0 end
Code:
global varlist age married dep hh_mem mem
estpost ttest $varlist, by (mig)
esttab, wide nonumber mtitle("diff.")
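If the aim is simply to report stars at the 10%, 5%, and 1% levels instead of esttab's defaults, the star() option can be set explicitly. A sketch, assuming the estpost ttest results above are still the active estimates:
Code:
esttab, wide nonumber mtitle("diff.") star(* 0.10 ** 0.05 *** 0.01)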
Best command for cross-classified models
I'd like to fit a model with student and teacher random effects. The ID variables for students and teachers are stdntid and tchid. There are about 1,000 teachers and 10,000 students. Students are not nested within teachers.
The only command I know that will fit such a model is something like this:
Correct me if my syntax isn't quite right.
Code:
mixed absent ib(2).classtype ib("k").gradenum i.schid || _all: R.tchid || studntid:
Anyhow it won't run. I know because this simpler model, with only student random effects,
Code:
mixed absent ib(2).classtype ib("k").gradenum i.schid || _all: R.studntid
takes forever, and this model, with only teacher random effects,
Code:
mixed absent ib(2).classtype ib("k").gradenum i.schid || _all: R.tchid
returns an error: "likelihood evaluates to missing".
Clearly I can't run a model with both teacher and student random effects if I can't run a model with teacher or student random effects alone. At least not in -mixed-.
If the -mixed- command isn't usable for this model, is there something else that is? Many thanks for suggestions.
Best,
Paul
Asdoc "conformability error"
Dear Statalist,
I am getting an error message when running the asdoc command. The first code works fine, but the second code returns:
The only difference between the two codes is i.year.
However, to understand what was going on, I ran the first regression twice changing 'nest replace' to 'nest append' for the second regression and the same error is presented even though nothing changed.
I hope you can help!
Thank you in advance
Ray
Code:
func_nested_reg(): 3200 conformability error
<istmt>: - function returned error
r(3200);
Code:
asdoc reg lrhourlywage c.ltotalexports_china#i.schoolinglevel1 ltotaltrade_china ltotaltrade_eu15 ltotaltrade_usa ltotaltrade_mca ltotaltrade_row i.schoolinglevel1 female age agesq married rural i.region i.occupationgroup i.isic1 i.establishmentsize i.year [fweight = factor], vce(cluster region) nest replace add(Year dummies, YES, Industry Dummies, Yes, Region Dummies, Yes, Occupation Dummies, Yes, Establishment Dummies, Yes) drop(i.region i.occupationgroup i.year i.isic1 i.establishmentsize)
Code:
asdoc reg lrhourlywage c.ltotalexports_china#i.schoolinglevel1 ltotaltrade_china ltotaltrade_eu15 ltotaltrade_usa ltotaltrade_mca ltotaltrade_row i.schoolinglevel1 female age agesq married rural i.region i.occupationgroup i.isic1 i.establishmentsize [fweight = factor], vce(cluster region) nest add(Year Dummies, YES, Industry Dummies,Yes, Region Dummies, Yes, Occupation Dummies, Yes, Establishment Dummies, Yes) drop(i.region i.occupationgroup i.isic1 i.establishmentsize )
mixed vs. xtreg, re
Here are four ways to estimate the same model (I think):
Version 1 runs in 2 seconds. Version 2 runs in 10 seconds. Version 3 takes 50 seconds to return the same result as version 2. Version 4 just spins without returning a result.
What accounts for these differences? I'm especially confused about why version 3 takes so much longer than version 2, and why version 4 doesn't finish when version 3 does.
Code:
xtreg absent ib(2).classtype ib("k").gradenum i.schid, re
xtreg absent ib(2).classtype ib("k").gradenum i.schid, re mle
mixed absent ib(2).classtype ib("k").gradenum i.schid || stdntid:
mixed absent ib(2).classtype ib("k").gradenum i.schid || _all: R.stdntid
Operations over lags
Hello,
I need to do some operations over all the lags up to the current value, but the loop I wrote takes forever, could you please tell me how to speed it up?
levelsof permno, local(levels)
foreach lev of local levels {
gen l1_new=(retex-b_cons-b_mktrf*mktrf)*(mktrf-mktrf_mean)^2
local lag=_n
forval i=1/`lag'-1{
replace l1_new=l1_new+(l`i'.retex-b_cons-b_mktrf*l`i'.mktrf)*(l`i'.mktrf-mktrf_mean)^2
}
}
Thank you in advance,
Alina
Later edit: I am also getting errors in the code, but I don't know why.
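A sketch of one possible speed-up, assuming the quantity needed is, for each observation, the sum of (retex - b_cons - b_mktrf*mktrf)*(mktrf - mktrf_mean)^2 over all observations of the same permno up to and including the current one. The time variable (called date here) is an assumption, and note that sum() treats missing values as zero:
Code:
sort permno date
by permno: gen double l1_new = sum((retex - b_cons - b_mktrf*mktrf)*(mktrf - mktrf_mean)^2)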
Force Merge
Hi, I am trying to merge household-level data for two different rounds but I keep getting an error. I have attached the Stata code and the error.
Command: merge m:1 STATEID DISTID PSUID HHID2005 HHSPLITID2005 using "C:\Users\Hammu\Desktop\Merging by including year variable\2012 data\Round2HH.dta"
Error: variable IDHH is long in master but str10 in using data. You could specify merge's force option to ignore this numeric/string mismatch. The using variable would then be treated as if it contained numeric missing value.
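One possible way around the mismatch, sketched under the assumption that IDHH is meant to be numeric and that its string values in the 2012 file contain only digits; the file name master2005.dta is a placeholder. This avoids the force option, which would turn the using file's IDHH values into missing:
Code:
use "Round2HH.dta", clear
destring IDHH, replace           // convert the str10 IDHH to numeric
tempfile round2
save `round2'
use "master2005.dta", clear      // placeholder name for the 2005 master file
merge m:1 STATEID DISTID PSUID HHID2005 HHSPLITID2005 using `round2'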
Interpreting interaction
Dear Statalist,
I am running the following regression with an interaction between a categorical education variable and continuous variable. The categorical variable takes on the values (1 = no education, 2 = primary education, 3 = secondary education, 4 = university education). When I include imports non-interacted as a control, the categorical variable omits the base group (Output1).
However, when I use total trade as the control instead, it keeps the base group (Output 2). How can I understand this, and is it possible to interpret the coefficients on the interaction in Output 2?
Thank you in advance.
Ray
Code:
reg lrhourlywage schoolinglevel1#c.ltotalexports_china ltotaltrade_china ltotaltrade_eu15 ltotaltrade_usa ltotaltrade_mca ltotaltrade_row i.schoolinglevel1 i.region i.occupationgroup i.isic1 i.establishmentsize _2015 _2016 _2017 _2018 [fweight = factor], vce(cluster region)
Code:
Linear regression Number of obs = 1,517,202
F(4, 5) = .
Prob > F = .
R-squared = 0.4382
Root MSE = .4583
(Std. Err. adjusted for 6 clusters in region)
-------------------------------------------------------------------------------------------------------
| Robust
lrhourlywage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------------------------+----------------------------------------------------------------
schoolinglevel1#c.ltotalexports_china |
2 | .0055916 .0055252 1.01 0.358 -.0086114 .0197947
3 | .0146709 .0162778 0.90 0.409 -.0271724 .0565142
4 | .0720004 .0125011 5.76 0.002 .0398653 .1041356
|
ltotalexports_china | -.027375 .012245 -2.24 0.076 -.0588519 .0041019
ltotalimports_china | .0288434 .024478 1.18 0.292 -.0340792 .091766
ltotaltrade_eu15 | -.0073836 .0710008 -0.10 0.921 -.1898971 .1751298
ltotaltrade_usa | .0914549 .0475511 1.92 0.112 -.0307791 .213689
ltotaltrade_mca | .038949 .0462096 0.84 0.438 -.0798365 .1577346
ltotaltrade_row | .0495398 .0261668 1.89 0.117 -.0177242 .1168038
Output 2
Code:
Linear regression Number of obs = 1,517,202
F(4, 5) = .
Prob > F = .
R-squared = 0.4383
Root MSE = .45823
(Std. Err. adjusted for 6 clusters in region)
-------------------------------------------------------------------------------------------------------
| Robust
lrhourlywage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------------------------------------+----------------------------------------------------------------
schoolinglevel1#c.ltotalexports_china |
1 | -.0315226 .0112729 -2.80 0.038 -.0605005 -.0025447
2 | -.0256822 .0071686 -3.58 0.016 -.0441096 -.0072549
3 | -.0167331 .0064504 -2.59 0.049 -.0333145 -.0001517
4 | .0404786 .0129226 3.13 0.026 .00726 .0736971
|
ltotaltrade_china | .1049092 .0460189 2.28 0.072 -.0133862 .2232047
ltotaltrade_eu15 | .0343422 .0780796 0.44 0.678 -.1663679 .2350523
ltotaltrade_usa | .1433543 .0451913 3.17 0.025 .0271863 .2595224
ltotaltrade_mca | -.0104562 .0550203 -0.19 0.857 -.1518903 .130978
ltotaltrade_row | .0377084 .0232289 1.62 0.165 -.0220033 .0974202
Points of mass in running variable - RD
Dear all,
I am struggling with my RD specification. I am using the rdrobust package from SSC and when I try to compute the optimal bandwidth using rdbwselect with fuzzy option I am getting alerts as
Code:
Not enough variability to compute the preliminary bandwidth. Try checking for mass points with option masspoints(check).
Not enough variability to compute the bias bandwidth (b). Try checking for mass points with option masspoints(check).
Not enough variability to compute the loc. poly. bandwidth (h). Try checking for mass points with option masspoints(check).
But when I add the option as suggested, nothing changes and I keep getting the same alerts.
My running variable is population and its distribution is quite skewed
[attached graph of the running variable's distribution, not shown]
so my feeling is that this is the issue with the bandwidth computation.
Do you have any insight about it? Maybe some trick to group population in a way that will increase variability?
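A sketch of what the mass-points handling might look like, with entirely hypothetical names (y = outcome, pop = running variable, D = treatment take-up, 5000 = cutoff); masspoints(adjust) asks rdbwselect/rdrobust to adjust the bandwidth computation for repeated values of the running variable:
Code:
rdbwselect y pop, c(5000) fuzzy(D) masspoints(adjust)
rdrobust y pop, c(5000) fuzzy(D) masspoints(adjust)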
ppmlhdfe with disaggregated data: exporter,importer,sector fiexed effects
Hello, I am running a gravity model.
I have a panel with dyadic trade by sector (in this case two categories, MANUF and NONMANUF).
My data looks something like this
Code:
use "http://fmwww.bc.edu/RePEc/bocode/e/EXAMPLE_TRADE_FTA_DATA" if category!="TOTAL"
I would like to estimate the model accounting for sectoral differences in bilateral exports.
One way is to estimate the gravity model separately for manufactured goods and non-manufactured goods:
Code:
egen imp = group(isoimp)
egen exp = group(isoexp)
eststo: ppmlhdfe trade fta if category == "MANUF", absorb(imp#year exp#year imp#exp) cluster(imp#exp)
eststo: ppmlhdfe trade fta if category == "NONMANUF", absorb(imp#year exp#year imp#exp) cluster(imp#exp)
esttab, se
However, I would like to estimate the model using all of the data including exporter-sector-year and importer-sector-year fixed effects. What is the correct way to implement this with ppmlhdfe?
I am thinking about something like this:
Code:
egen cat = group(category)
eststo: ppmlhdfe trade fta, absorb(imp#year#cat exp#year#cat imp#exp#cat) cluster(imp#exp)
Is this correct? In particular, is it recommended to also cluster by dyad-category? And should I also cluster the errors on the category dimension?
If anyone has any suggestions, or warnings against doing this type of analysis, they would be most welcome.
Fine and Gray with Censoring
When I produce cumulative incidence curves (CIC) for the Fine and Gray model using stcrreg and stcurve, I noticed that the estimated plateau level of the CIC (for large durations) highly depends on the degree of censoring – see the graph below. E.g. if 60% of the spells are censored, the CIC attains a maximum of 0.18, if instead 40% are censored, the maximum CIC is 0.29. Censoring is random. In theory, however, the estimated plateau level of the CIC should NOT depend on the degree of censoring. Indeed, the estimated plateau level of the CIC is invariant, when I do the same exercise in R using "cmprsk". So, either, I do a basic coding error or there is an issue with the Stata implementation.
[attached graph: cumulative incidence curves by censoring level, not shown]
Has anybody come across this problem before?
Thank you in advance!
Here is the Stata code generating the data plus the graph:
Code:
set seed 1000
foreach level in 40 60 80 100 { // level of censoring
clear
set obs 1000
gen i=_n
gen ra=round(runiform())
gen rb=1-ra
gen udur=-ln(uniform())
gen female=rnormal(0,1)
gen deutsch=rnormal(0,1)
gen VT_HE=rnormal(0,1)
*add censoring to data:
qui gen ran=runiform()
qui gen ran2=runiform()
qui replace ra=0 if ran>0.`level'
qui replace rb=0 if ran>0.`level'
qui replace udur=udur*ran2 if ran>0.`level'
stset udur, failure(ra==1)
stcrreg VT_HE deutsch female, compete(ra==0) iter(100)
stcurve, cif legend(off) xla(, grid) outfile("FG_`level'.dta", replace)
}
foreach level in 40 60 80 100 { // level of censoring
use "FG_`level'.dta", clear
rename ci1 ci`level'
save "FG_`level'.dta", replace
}
use "FG_40.dta", clear
foreach level in 60 80 100 {
joinby _t using FG_`level'.dta, unmatched(both)
drop _merge
}
sort _t
twoway (line ci40 _t)(line ci60 _t)(line ci80 _t)(line ci100 _t), legend(order(1 "60" 2 "40" 3 "20" 4 "0"))
graph export statalist.png, replace
foreach level in 40 60 80 100 {
erase "FG_`level'.dta"
}
Interactions between sex and country of birth : do I also have to include interactions between control variables and sex?
Dear STATA community,
This is my first post and I hope that you can help me with my problem.
In a nutshell, I run a regression to estimate the impact of workers' country of birth (reference: native workers) on wages. Besides, I also control for sex (female = 1) and education.
Now, I want to run a regression with interactions between the country of birth and sex instead of splitting into two sub-samples (female and male). My output seems quite logical if we look at the females' coefficients (in fact, the females' coefficients here are almost the sum of the coefficients in my previous table + sex).
However, I am worried that I may also have to include interactions between education and sex, since I already interact country of birth with sex and the impact of education may differ between men and women. When I do that, unfortunately, my output is quite different: the females' coefficients become extremely positive relative to native men, which is not plausible from a labour economics point of view.
Can you tell me if I finally have to include interactions between the control variables (e.g. education) and sex? In that case, why do the coefficients become positive for women?
Or is it enough to include only the control variables, as in table 2?
Thank you so much for your help!
Code:
. reg log_sal_bonus i.Birth_region_gen_final_a sex i.education [aw=Pond_AB], r
(sum of wgt is 20,200,447.423307)
Linear regression Number of obs = 1,304,858
F(8, 1304849) = 33434.19
Prob > F = 0.0000
R-squared = 0.3156
Root MSE = .30553
-----------------------------------------------------------------------------------------------------
| Robust
log_sal_bonus | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------------------------------+----------------------------------------------------------------
Birth_region_gen_final_a |
First - 2-Developed | .024678 .001561 15.81 0.000 .0216185 .0277375
First - 3-Transition_&_Developing | -.0927497 .0012075 -76.81 0.000 -.0951164 -.0903829
Others | .0698681 .0051349 13.61 0.000 .059804 .0799323
Second - 2-Developed | -.0150958 .0012651 -11.93 0.000 -.0175754 -.0126162
Second - 3-Transition_&_Developing | -.1198456 .0018575 -64.52 0.000 -.1234863 -.1162048
|
sex | -.1625361 .0007542 -215.51 0.000 -.1640143 -.1610579
|
education |
2 | .0767518 .0007308 105.02 0.000 .0753195 .0781842
3 | .476564 .001068 446.23 0.000 .4744708 .4786572
|
_cons | 2.816732 .0006466 4356.09 0.000 2.815464 2.817999
-----------------------------------------------------------------------------------------------------
Code:
reg log_sal_bonus i.Birth_region_gender i.education [aw=Pond_AB], r
(sum of wgt is 20,200,447.423307)
Linear regression Number of obs = 1,304,858
F(13, 1304844) = 20634.97
Prob > F = 0.0000
R-squared = 0.3159
Root MSE = .30546
---------------------------------------------------------------------------------------------------------
| Robust
log_sal_bonus | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------------------------------+----------------------------------------------------------------
Birth_region_gender |
Belgium - Natives women | -.1640748 .0008906 -184.22 0.000 -.1658204 -.1623292
First - Developed men | .0350259 .002058 17.02 0.000 .0309922 .0390595
First - Developed women | -.1566719 .0022561 -69.44 0.000 -.1610938 -.1522501
First - Others men | .0756946 .0064093 11.81 0.000 .0631327 .0882566
First - Others women | -.1058152 .0085449 -12.38 0.000 -.1225629 -.0890675
First - Transition_&_Developing men | -.1048025 .0014601 -71.78 0.000 -.1076643 -.1019407
First - Transition_&_Developing women | -.2272819 .0020083 -113.17 0.000 -.2312182 -.2233457
Second - Developed men | -.0141235 .0015387 -9.18 0.000 -.0171394 -.0111077
Second - Developed women | -.1813582 .0021551 -84.15 0.000 -.1855821 -.1771343
Second - Transition_&_Developing men | -.1293338 .002309 -56.01 0.000 -.1338593 -.1248083
Second - Transition_&_Developing women | -.2649113 .0030794 -86.03 0.000 -.2709469 -.2588758
|
education |
2 | .0768369 .0007309 105.12 0.000 .0754043 .0782695
3 | .4765397 .0010672 446.51 0.000 .474448 .4786315
|
_cons | 2.817192 .000665 4236.48 0.000 2.815889 2.818495
---------------------------------------------------------------------------------------------------------
Code:
. reg log_sal_bonus i.Birth_region_gender i.education#i.sex [aw=Pond_AB], r
(sum of wgt is 20,200,447.423307)
note: 3.education#1.sex omitted because of collinearity
Linear regression Number of obs = 1,304,858
F(15, 1304842) = 18325.00
Prob > F = 0.0000
R-squared = 0.3172
Root MSE = .30518
---------------------------------------------------------------------------------------------------------
| Robust
log_sal_bonus | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------------------------------+----------------------------------------------------------------
Birth_region_gender |
Belgium - Natives women | .2846285 .001693 168.13 0.000 .2813103 .2879466
First - Developed men | .0345303 .0020482 16.86 0.000 .0305158 .0385448
First - Developed women | .2924861 .0027622 105.89 0.000 .2870724 .2978999
First - Others men | .0730843 .0063722 11.47 0.000 .0605951 .0855736
First - Others women | .3455405 .0086388 40.00 0.000 .3286088 .3624722
First - Transition_&_Developing men | -.1030127 .001467 -70.22 0.000 -.1058881 -.1001374
First - Transition_&_Developing women | .2184794 .0026264 83.19 0.000 .2133318 .223627
Second - Developed men | -.013088 .0015412 -8.49 0.000 -.0161087 -.0100673
Second - Developed women | .265587 .0026116 101.70 0.000 .2604684 .2707056
Second - Transition_&_Developing men | -.1282039 .0023131 -55.42 0.000 -.1327376 -.1236702
Second - Transition_&_Developing women | .1830853 .0033804 54.16 0.000 .1764599 .1897108
|
education#sex |
1 1 | -.4461616 .0017727 -251.68 0.000 -.4496361 -.4426872
2 0 | .0692601 .0008954 77.35 0.000 .0675052 .0710149
2 1 | -.3534031 .0017207 -205.38 0.000 -.3567757 -.3500306
3 0 | .4926451 .0013284 370.86 0.000 .4900414 .4952487
3 1 | 0 (omitted)
|
_cons | 2.816362 .0007338 3838.17 0.000 2.814924 2.8178
---------------------------------------------------------------------------------------------------------
Complete Time Series
Hello everyone,
I'm wondering how to 'insert' empty values in my panel data. My data looks as follows:
[screenshot of the panel data attached, not shown]
For each b_id I want a full series from 2008-2018. So for example, for b_id 174 I would want to automatically create 4 entries:
174 - United Kingdom - 2008
174 - United Kingdom - 2009
174 - United Kingdom - 2010
174 - United Kingdom - 2011
All other variables will then be missing.
Kind regards,
Philippe
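A sketch of one way to do this, assuming b_id is numeric and at least one panel already spans 2008-2018 (otherwise -fillin b_id year- is an alternative). The newly created rows hold missing values for everything except b_id and year:
Code:
tsset b_id year
tsfill, full
* optional: copy the (constant) country name into the new rows,
* assuming country is a string variable
bysort b_id (country): replace country = country[_N] if country == ""
sort b_id year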
Creating a new variable differentiated by gender
Hi Statalist.
I want to be able to test if there is a difference in the effect of level of education by gender. Here's my draft code.
I then repeat the same code for females:
Sample data:
I would appreciate help correcting/improving this code.
(The new variable based on edhigh1 is tabulated in an attachment, not shown here.)
N.B. Stata v.15.1. Using panel data. variables are differentiated by respondent and their partner - "p_" represents value for partner.
Code:
gen male_educ = 1 if edhigh1 == 9 // up to year 11 "11 years"
replace male_educ = 2 if (edhigh1 == 8 | p_edhigh1 == 8) & (hgsex == 1 | p_hgsex == 1) // year 12 "12 years"
replace male_educ = 3 if edhigh1 == 5 | p_edhigh1 == 5 & (hgsex == 1 | p_hgsex == 1) // cert 3, cert 4 "13 years"
replace male_educ = 4 if edhigh1 == 4 | p_edhigh1 == 4 & (hgsex == 1 | p_hgsex == 1) // adv dip, diploma "14 years"
replace male_educ = 5 if edhigh1 == 3 | p_edhigh1 == 3 & (hgsex == 1 | p_hgsex == 1) // bachelor, honours "18-19 years"
replace male_educ = 6 if edhigh1 == 2 | p_edhigh1 == 2 & (hgsex == 1 | p_hgsex == 1) // grad diploma, grad cert "19-20 years"
replace male_educ = 7 if edhigh1 == 1 | p_edhigh1 == 1 & (hgsex == 1 | p_hgsex == 1) // masters, doctorate "20-24 years"
Code:
gen fem_educ = 1 if edhigh1 == 9 | p_edhigh1 == 9 & (hgsex == 2 | p_hgsex == 2) // up to year 11 "11 years"
replace fem_educ = 2 if edhigh1 == 8 | p_edhigh1 == 8 & (hgsex == 2 | p_hgsex == 2) // year 12 "12 years"
replace fem_educ = 3 if edhigh1 == 5 | p_edhigh1 == 5 & (hgsex == 2 | p_hgsex == 2) // cert 3, cert 4 "13 years"
replace fem_educ = 4 if edhigh1 == 4 | p_edhigh1 == 4 & (hgsex == 2 | p_hgsex == 2) // adv dip, diploma "14 years"
replace fem_educ = 5 if edhigh1 == 3 | p_edhigh1 == 3 & (hgsex == 2 | p_hgsex == 2) // bachelor, honours "18-19 years"
replace fem_educ = 6 if edhigh1 == 2 | p_edhigh1 == 2 & (hgsex == 2 | p_hgsex == 2) // grad diploma, grad cert "19-20 years"
replace fem_educ = 7 if edhigh1 == 1 | p_edhigh1 == 1 & (hgsex == 2 | p_hgsex == 2) // masters, doctorate "20-24 years"
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input long(id p_id) byte(wave edhigh1 p_edhigh1 hgsex p_hgsex) 101 102 1 5 9 1 2 101 102 2 5 9 1 2 101 102 3 5 9 1 2 101 102 4 5 9 1 2 103 104 1 9 5 2 1 103 104 2 9 5 2 1 103 104 3 9 5 2 1 103 104 4 9 5 2 1 106 142 11 5 5 2 1 106 142 12 5 5 2 1 106 142 13 5 5 2 1 106 142 14 5 5 2 1 106 142 15 5 5 2 1 106 142 16 5 5 2 1 106 142 17 5 5 2 1 106 142 18 5 5 2 1 110 163 12 1 3 1 2 110 163 13 1 3 1 2 110 163 14 1 3 1 2 110 163 15 1 3 1 2 110 163 16 1 3 1 2 110 163 17 1 3 1 2 110 163 18 1 3 1 2 111 231 6 9 4 2 1 111 231 7 9 4 2 1 111 231 8 9 4 2 1 111 231 9 9 4 2 1 end
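For comparison, a sketch of an alternative that uses factor-variable interactions instead of separate male and female education variables; outcome is a placeholder, and plain -regress- is used only for illustration (a panel estimator would accept the interaction the same way):
Code:
regress outcome i.hgsex##i.edhigh1, vce(cluster id)
* joint test: does the effect of education differ by gender?
testparm i.hgsex#i.edhigh1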
Using an interaction between a categorical and a continuous interaction where the base category is highly informative
Dear Statalist,
Sorry if my question is too general for this forum. I hope it is okay.
I am looking at the impact of trade on the returns to various levels of education in a developing country. Education is a categorical variable (where: 1 = no completed education, 2 = primary complete, 3 = secondary complete, 4 = university complete). The data is from household surveys.
My base regression is as follows:
Hourly wage = (trade_country_i) + (trade_worldminus_i) + (education level) + (more controls) + u
Because I am interested in the returns to specific levels of education, I use the following interaction:
Hourly wage = x1(trade_country_i)*(education level) + (trade_country_i) + (trade_worldminus_i) + (education level) + i(more controls) + u
The base category (1 = no completed education) is omitted by Stata, but this masks important information. The coefficients on the other education levels are always positive, which makes sense as they are relative to the lowest level of education. But I would really like to know how each of them reacts independently.
Is there a way to do this on Stata?
Thank you in advance.
Ray
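A sketch of one way to see each education level's own trade slope, including the base category, using placeholder names (wage, trade_i, trade_world, educ, plus whatever controls apply). The ## interaction keeps education in the model, and margins then reports the slope of trade separately for every level:
Code:
regress wage c.trade_i##i.educ c.trade_world, vce(robust)
margins educ, dydx(trade_i)    // level-specific slopes of trade, base category included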
ivprobit and cmp ivprobit
Hi. I am using an individual-based survey and I am trying to estimate the impact of migration and remittances on child education in Egypt. Therefore, I have two main equations: the first one studies the effect of migration (dummy independent variable "migrant" = 1 if the individual reported a migrant in his/her household, =0 otherwise) on child education (dummy dependent variable "school" = 1 if the individual attends school, =0 otherwise). The second equation studies the effect of remittances (dummy independent variable "remit" = 1if the individual reported receiving remittances in his/her household, =0 otherwise) on child education (dummy "school" variable). I am planning to use cmp ivprobit regression, but since this is my first time to ever use this type of regression, I have a few questions and I would really appreciate your help:
1. Examining the literature, I have noticed that some economists only used ivprobit, while others used cmp ivprobit. Based on what I understood, cmp ivprobit would be more convenient in my case since it allows for errors in different equations to be correlated. Is this correct? should I use cmp ivprobit instead of ivprobit?
2. I have already tried to run ivprobit and cmp ivprobit, but I got some errors on stata. I have checked the help commands but I am still not sure what exactly my error is. Can someone tell me what is wrong with the following commands?
3. kindly note that my iv is: "oilpricewhenmigrantis31" which is the oil price when the migrant is 31 years old. Should I consider it as left-censored variable instead of continuous variable in the cmp command?
4. There are some control variables that I would like to add in the IV equation. I am wondering how this can be done. should I just add them after my instrumental variable in the IV equation?
5. If I end up choosing the cmp ivprobit and I would like to run another probit regression assuming my independent variables are exogenous. Does it also have to be cmp probit?
6. When doing my analysis at the end, should I only focus on the coefficients of the marginal effects?
I apologize for my many questions. I decided to gather all my problems in one post and, as a beginner, I would really appreciate your help!
this is the ivprobit command for the school-migrant equation:
Code:
#delimit ;
ivprobit school [migrant age age2 eldest i.fteducst i.mteducst fth_absent urban1]
    (migrant = oilpricewhenmigrantis31)
    [if age >= 6 & age <= 17 & marital != 4 & marital != 5 & yrbirth1 ==.]
    [pweight=expan_indiv], vce (cluster hhid)first;
#delimit cr
margins, dydx(*) predict(pr)
this is the error I get
Code:
migrant unknown weight type
this is the cmp ivprobit command
Code:
#delimit ;
cmp (migrant = oilpricewhenmigrantis31 age age2 eldest i.fteducst i.mteducst fth_absent urban1)
    (school = migrant age age2 eldest i.fteducst i.mteducst fth_absent urban1)
    [if age >= 6 & age <= 17 & marital != 4 & marital != 5 & yrbirth1 ==.]
    [pweight=expan_indiv], vce (cluster hhid) indicators ($cmp cont $cmp probit);
#delimit cr
margins, dydx(*) predict(pr) force
this is the error I get
Code:
weights not allowed
invalid syntax
Merging datasets code issue
Hi all!
I can't figure out a certain line of code. I'd like to merge two data sets. The master dataset looks as follows:
[screenshot of the master dataset attached, not shown]
And I'm trying to merge it with a dataset that looks like this:
[screenshot of the using dataset attached, not shown]
I know the specification "merge m:m country year using X.dta" works, however, it only merges the data if the year is available in the master dataset. The dataset I add has data starting from 2008 instead of 2012. How do I add empty year observations 2008, 2009, 2010, 2011 for each b_id in the master so that no data gets lost? Note that not all entries of b_id in the set start at 2012.
Kind regards,
Philippe
Replacing "NA" with missing
I'm super new to using Stata and could really use some help! I have imported a csv file from R into Stata and am not sure if there is a succinct way to replace all the "NA" values with missing. I have tried: replace `var' = "." if `var' == "NA". However, I have 700 variables (many of which are string), so doing this one by one is taking way too long. Is there a better way to do this? Thank you in advance!
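A sketch of one way to handle this, assuming every variable that can contain "NA" was imported as a string (note that for string variables missing is the empty string "", not "."):
Code:
ds, has(type string)
foreach v of varlist `r(varlist)' {
    replace `v' = "" if `v' == "NA"
}
* optionally convert string variables that now hold only numbers
destring, replace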
Thursday, July 30, 2020
eintreg
Hi,
I am using eintreg for interval regression with sample selection, and I would like to know the number of points used for the numerical integration.
From my understanding, the numerical integration is over two dimensions, one dimension for the error term in the main equation and one dimension for the error term in the selection equation.
However, the manual and the dialog menu of eintreg shows the number of integration points for three and four dimensions with relative customization, but nothing is said about two dimensions.
Thanks in advance.
Simone
p values
Hi,
We are comparing the clinical characteristics of infants who received Octreotide for chylothorax versus those who did not. I used the Wilcoxon rank-sum test to compare the duration of ventilation between the two groups because the data were not normally distributed. The median duration in the Octreotide group was 631 hours (Q1 287, Q3 724); it was 223 hours (Q1 64, Q3 364) in the group that did not receive Octreotide.
I would be grateful for your help in interpreting the Stata output for this. Which p-value should we use: Prob > |z| = 0.1200 or Exact Prob = 0.1246?
ranksum duration_of_ventilation, by(octreotide_used)
Two-sample Wilcoxon rank-sum (Mann-Whitney) test
octreotide~d | obs rank sum expected
-------------+---------------------------------
0 | 25 388 425
1 | 8 173 136
-------------+---------------------------------
combined | 33 561 561
unadjusted variance 566.67
adjustment for ties -0.19
----------
adjusted variance 566.48
Ho: durat~on(octreo~d==0) = durat~on(octreo~d==1)
z = -1.555
Prob > |z| = 0.1200
Exact Prob = 0.1246
how to backup files better
Hello!
When I use Stata to clean data, I always want to keep track of every step I take, so that I can use backup files to recover every .dta produced along the way until I reach the dataset that can be used for analysis. A do-file is a good choice, but many steps do not seem worth a line of code, such as changing a single number or dropping one observation, especially when there are many such steps and no obvious way to organize them. I feel confused when there are so many files to manage and they become chaotic. Do I need to record every step when cleaning data? Is there a better way to back up files?
Centering on Mean - Interaction of 2 continous variables using the first difference estimator
Hi,
My research project is looking at the impact of financial development (bank development - proxied by private credit& stock market development proxied by stock market cap) on labour share(ls-pwt) using a panel dataset of 80 countries, period, 2000-2017. Here, some of my variables have unit root and as such I am using the first difference estimator as opposed to the fixed effects estimator. My main estimation model is as follows:
ΔLS_it = β0 + β1·Δprivatecredit_it + β2·ΔStockMCap_it + ΔX_it·β + θ_t + u_it    (Equation 1)
As part of one of the specifications, I am interested in ascertaining if the impact of financial development on labour share depends on the efficiency of that development process (2 diff efficiency variables are used) as follows:
ΔLS_it = β0 + β1·Δprivatecredit_it×BankNIMargin + β2·ΔStockMCap_it×StockMTurnover + ΔX_it·β + θ_t + u_it    (Equation 2)
As such, I am interested in using an interaction term between the financial development variables and efficiency variables, both of which are continuous variables. From my readings, I gathered that it is advisable to center the relevant variables on meaningful values (mean) - which I have done as follows:
summarize pcredit, meanonly
gen pcredit_c=pcredit-r(mean)
by countryid: gen dpcredit_c=d.pcredit_c
summarize smcap, meanonly
gen smcap_c=smcap-r(mean)
by countryid: gen dsmcap_c=d.smcap_c
summarize bnimargin, meanonly
gen bnimargin_c=bnimargin-r(mean)
by countryid: gen dbnimargin_c=d.bnimargin_c
summarize smtratio, meanonly
gen smtratio_c=smtratio-r(mean)
by countryid: gen dsmtratio_c=d.smtratio_c
After this process I run the first difference estimator as follows:
regress dls_pwt c.dpcredit_c##c.dbnimargin_c c.dsmcap_c##c.dsmtratio_c $dcv4 i.year, robust cluster(countryid)
Basically, my query is whether, this centering process is the correct way to proceed in the context of using the first difference estimator. Here, I am centering the relevant variables on the mean of the variables across all countries (panels).
Dropping observations based on multiple conditions
Hi Everyone,
Thanks for taking the time to read my query.
I currently am cleaning a very big dataset (52 variables, 82284 observations) for longitudinal analysis. The dataset is based on information returned from 6 different surveys. I have converted the dataset to long format so currently there are about 6 different observations (in years) for each ID. There are approximately 13,000 unique ID variables.
In the third survey that was sent, additional information was asked that was not included in any further surveys but is essential to my analysis. I therefore want to exclude all participants who did not complete the third survey (in 2010).
What I am trying to find is some kind of command like this: drop id if Survey3Completed=. in 2001
Thank-you very much.
Sarah
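A sketch of one possibility, assuming the person identifier is id, the survey year is in a variable called year, and Survey3Completed is non-missing only when the 2010 survey was returned (all names other than Survey3Completed are assumptions):
Code:
egen byte completed3 = max(year == 2010 & !missing(Survey3Completed)), by(id)
drop if completed3 == 0    // drops every row for people who did not complete the third survey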
define after a certain range
Hello,
I want to define a variable conditioned after a certain range of another variable.
For instance, each id has a binary variable 'served' which indicates whether the id was served by the system.
Each id also has exit_month when they exited the system.
Here, I want to find the month they returned after they exited.
I tried using -inrange()-, but it did not work. Is there a way I can set the range for exit_month?
gen return_month = "between first exit_month and next exit month"?
Thank you!
Code:
ID month served exit_month return_month
1 2015m9 1 . .
1 2015m10 1 . .
1 2015m11 0 2015m11 .
1 2015m12 0 . .
1 2016m1 1 . 2016m1
1 2016m2 1 . .
1 2016m3 0 2016m3 .
1 2016m4 1 . 2016m4
2 2015m9 1 . .
2 2015m10 0 2015m10 .
...
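A sketch, with the logic inferred from the example above (a return is the first served month after a non-served spell) and assuming month is a Stata monthly date:
Code:
bysort ID (month): gen return_month = month if served == 1 & served[_n-1] == 0
format return_month %tm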
Error in mi impute chained (logit)
Hello,
I am trying to run an mi impute chained using (logit) and I get the following error message when I try to impute medications which is a binary variable indicating yes or no if the patient was on meds:
error occurred during imputation of... on m = 1
r(2000);
This is the code that I am running:
Code:
mi set wide
mi register imputed BMI bp medications
mi impute chained (pmm, knn(5)) BMI bp (logit) medications = age sex i.clinic i.individual intervention, add(50) force noisily savetrace("trace.dta", replace)
I did some checks to try to resolve the issue, but it seems this happens with any binary variable that I try to impute (I have even purposefully introduced missingness into some variables as a test):
1. Coded medications as 0 and 1
2. Simplified the imputation by imputing medications only
3. Tried mlogit but ran into convergence issues
When I run the imputation without medications, both BMI and bp impute just fine however.
Is this a common problem with mi impute (logit)?
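One frequently mentioned cause of an imputation failure for a binary variable with (logit) is perfect prediction by the covariates; whether that is what is happening here cannot be told from the post. A minimal sketch of the augment option, which handles perfect prediction via augmented regression, is below (same variables as in the question).
Code:
mi impute chained (pmm, knn(5)) BMI bp (logit, augment) medications = age sex i.clinic i.individual intervention, add(50)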
Obtaining the Spatially Weighted Regressors Using spregress
Hi Everyone:
I'm looking for a simple way to obtain the regressors W*X when using spregress, where W is the chosen spatial weighting matrix. I am mainly interested in the contiguity case, but I'd prefer to know how to do this generally. In particular, if I run the command
Code:
spregress y x1 x2, ml ivarlag(Wc:x1 x2)
is there a way to save the regressors Wc*X?
Essentially, I would like to have the original regressors and the weighted regressors to use in standard Stata commands -- not just spregress.
Thanks. Jeff
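A minimal sketch of one possible route is below. It assumes the same contiguity matrix Wc has already been created with spmatrix (e.g., spmatrix create contiguity Wc) and uses spgenerate to build the spatial lags as ordinary variables; whether these reproduce exactly what spregress uses internally should be checked against the spregress output.
Code:
spgenerate Wc_x1 = Wc*x1
spgenerate Wc_x2 = Wc*x2
* the lagged regressors are now ordinary variables and can be used anywhere
regress y x1 x2 Wc_x1 Wc_x2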
Simple sum within the same variable
Hello, this may be a very simple question, but is it possible to create a variable that summarizes runs of consecutive numbers within the same variable, where the runs are separated by varying numbers of missings (i.e., '.')? The information I have is in the variable 'seq', and I want to create the variable 'total' -- see the example below:
thanks in advance for your help!
| seq | total |
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
| 4 | 4 |
| . | . |
| . | . |
| 1 | 2 |
| 2 | 2 |
| . | . |
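Judging from the example, 'total' is the largest value of seq within each run of consecutive non-missing values (equivalently the run length, since seq is a counter). A minimal sketch under that reading is below; swap max() for total() if a true sum is wanted instead.
Code:
* number the runs of consecutive non-missing values, then summarize within run
gen run = sum(missing(seq[_n-1]) & !missing(seq))
egen total = max(seq), by(run)
replace total = . if missing(seq)
drop run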
Using GEE for repeated cross sectional, nested data
I'm conducting an impact evaluation for a school level program and interested in evaluating if there is a dose response relationship between levels of implementation and student outcomes. I have binary student level outcome data (time points 1 and 3) and continuous school level implementation data (time points 1, 2, and 3). The student data was collected from a random sample of classrooms in each school at both time points thus a repeated cross section design, however I do not have a classroom identifier. School level implementation data is collected from all schools with no missing data.
I'm wondering if it would be possible to fit a GEE model to assess the dose-response relationship with this data structure? I've been trying to find research on GEE with nested repeated cross sectional data only to find more and more research on longitudinal data. Any guidance would be much appreciated!
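For what such a specification might look like, a minimal sketch is below; it treats students as exchangeable within schools and assumes a school identifier schoolid, a binary student outcome, a school-level implementation score, and a wave indicator (all variable names are assumptions). Whether this is an appropriate design for repeated cross-sections is a separate substantive question.
Code:
xtset schoolid
xtgee outcome c.implementation i.wave, family(binomial) link(logit) corr(exchangeable) vce(robust)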
Change degrees of freedom after estimating an OLS regression with sem
Dear Stata experts,
I recently learned how to change the degrees of freedom in the regress command using the dof() option (thanks, Trent Mize). This can be very handy when estimating a fixed effects model using the regress command with manually demeaned variables (example from Trent at the end of this post).
I would like a way to change the df like this when estimating an OLS regression using sem. The dof() option is not available, and the two other strategies I tried have not worked yet:
Strategy 1
I can change the df after regress like this:
mata: st_numscalar("e(df_r)", 21761)
Unfortunately, this approach does not seem to update the t statistics or confidence intervals. I have not tried that approach yet with sem.
Strategy 2
It also looks like the df can be changed with the ereturn repost command (see Stata Forum post), but I am not sure how to do that without deleting the b and V matrices.
Does anyone know how to replicate the effect of the dof() option in a way that will work with sem?
Thanks,
Jeremy
Code:
webuse union, clear
**************************************************************************
// #0 - Data Management
**************************************************************************
drop if missing(union, age, grade, not_smsa, south, year)
*Calculate person-specific means and demean the variables (subtract the person
*specific mean from each so the new value represents the deviation from
*the subject-specific average)
local ivs "union age grade not_smsa south year"
foreach v in `ivs' {
egen `v'M = mean(`v'), by(idcode)
gen `v'D = `v' - `v'M
}
**************************************************************************
// #1 - Show normal FE model with xtreg
**************************************************************************
xtset idcode
xtreg grade age union i.not_smsa south c.year, fe
est store fixed
**************************************************************************
// #2 - Use all demeaned vars (DV and IVs)
**************************************************************************
*Fixed effects models using all demeaned vars
*Force df to be correct with dof() option (from FE models estimated earlier)
reg gradeD ageD unionD not_smsaD southD yearD, dof(21761)
est store demean
********************************
*Compare the estimates
********************************
esttab fixed demean, nobase noconstant
Help using loop to replace values in several variables with conditions
Hello,
Can anyone suggest a way to use a loop to replace values in several variables with conditions? Here is the situation.
I created seven variables (years in which an outcome is measured); named these as ys_prot_at`year'. Each of these variables will have a value equal to:
the year of the outcome measure (the year in ys_prot_at`year') minus the year in which the observation (my rows) was first treated/protected (variable year_pa), plus 1. I want cases where the year of the outcome equals the year of initial treatment to count as at least 1, to avoid having 0 for treated units. The value in ys_prot_at`year' would be 0 if the observation has not been protected by the given year of the outcome measure. This is a panel dataset.
The code below works to get what I need, but I couldn't do this using a loop; I am not sure whether a loop can produce the result I want. I don't think sharing a sample dataset is necessary, so I am assuming my description will do for now. Please let me know if you know a way to use a loop in a situation like the one I tried to describe.
Thanks in advance,
Carlos
foreach year of numlist 1986 1991 1996 2001 2003 2011 2016 {
gen ys_prot_at`year' = 0
label variable ys_prot_at`year' "years protected by `year'"
}
replace ys_prot_at1986 = cond(year_pa == 1956, -1956+1986+1, 0)
replace ys_prot_at1991 = cond(year_pa == 1956, -1956+1991+1, cond(year_pa == 1987, -1987+1991+1, 0))
replace ys_prot_at1996 = cond(year_pa == 1956, -1956+1996+1, cond(year_pa == 1987, -1987+1996+1, cond(year_pa == 1994, -1994+1996+1, 0)))
replace ys_prot_at2001 = cond(year_pa == 1956, -1956+2001+1, cond(year_pa == 1987, -1987+2001+1, cond(year_pa == 1994, -1994+2001+1, cond(year_pa == 1998, -1998+2001+1, cond(year_pa == 1999, -1999+2001+1, 0)))))
replace ys_prot_at2003 = cond(year_pa == 1956, -1956+2003+1, cond(year_pa == 1987, -1987+2003+1, cond(year_pa == 1994, -1994+2003+1, cond(year_pa == 1998, -1998+2003+1, cond(year_pa == 1999, -1999+2003+1, cond(year_pa == 2003, -2003+2003+1, 0))))))
replace ys_prot_at2011 = cond(year_pa == 1956, -1956+2011+1, cond(year_pa == 1987, -1987+2011+1, cond(year_pa == 1994, -1994+2011+1, cond(year_pa == 1998, -1998+2011+1, cond(year_pa == 1999, -1999+2011+1, cond(year_pa == 2003, -2003+2011+1, cond(year_pa == 2006, -2006+2011+1, cond(year_pa == 2007, -2007+2011+1, cond(year_pa == 2009, -2009+2011+1, cond(year_pa == 2010, -2010+2011+1, 0))))))))))
replace ys_prot_at2016 = cond(year_pa == 1956, -1956+2016+1, cond(year_pa == 1987, -1987+2016+1, cond(year_pa == 1994, -1994+2016+1, cond(year_pa == 1998, -1998+2016+1, cond(year_pa == 1999, -1999+2016+1, cond(year_pa == 2003, -2003+2016+1, cond(year_pa == 2006, -2006+2016+1, cond(year_pa == 2007, -2007+2016+1, cond(year_pa == 2009, -2009+2016+1, cond(year_pa == 2010, -2010+2016+1, 0))))))))))
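The nested cond() calls all reduce to the same rule (outcome year minus year_pa plus 1 whenever year_pa is at or before the outcome year), so a single replace inside the existing loop can cover every case. A minimal sketch is below, assuming year_pa is missing or later than the outcome year for units not yet protected.
Code:
foreach year of numlist 1986 1991 1996 2001 2003 2011 2016 {
    gen ys_prot_at`year' = 0
    label variable ys_prot_at`year' "years protected by `year'"
    replace ys_prot_at`year' = `year' - year_pa + 1 if !missing(year_pa) & year_pa <= `year'
}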
Data in Stata format for Card & Krueger
Does anyone know where i can get the data of the famous Card& Krueger, 1994. AER paper on minimum wages.
"Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, American Economic Association, vol. 84(4), pages 772-793, September.
"Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania," American Economic Review, American Economic Association, vol. 84(4), pages 772-793, September.
Coefplot with note() option
Dear Ladies and Gentlemen,
I would like to have a longer note comment under my graphs, stretching over two lines.
I run the following command:
Code:
coefplot (margin21, msymbol(T)) ... , note(Notes: The unemployment .., span)
I am not sure how to make the comment in note() go over two or three lines. At the moment, it sits on one line with the end of the comment not displaying.
Any help would be very much appreciated.
BW
Nico
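note(), like other graph text options that coefplot passes through, accepts several quoted strings, one per displayed line. A minimal sketch is below; the note text itself is made up for illustration.
Code:
coefplot (margin21, msymbol(T)), ///
    note("Notes: The unemployment rate refers to the year before the survey." ///
         "Further explanatory text continues on this second line.", span)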
Create Dataset in Stata using a Loop
I'm having trouble finding guidance on this. I want to create a dataset that has a variable called "location" containing the values 1, 3, 4, 6, 7 and another variable called "action" containing the values 1, 2, 3, 4, 5. I want my dataset to contain every combination of these, but I do not want to input them manually. Any ideas?
location action
1 1
1 2
1 3
1 4
1 5
3 1
3 2
...
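A minimal sketch using the cross command, which forms every pairwise combination of the data in memory with a using dataset, is below. When both variables already exist in a single dataset, fillin is an alternative.
Code:
clear
input action
1
2
3
4
5
end
tempfile actions
save `actions'
clear
input location
1
3
4
6
7
end
cross using `actions'
sort location action
list, sepby(location)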
meta regression missing data
Hi everybody,
I have to do a meta-analysis. The aim of this meta-analysis (37 studies) is to estimate the prevalence of bipolar disorder in an ASD (autism spectrum disorder) population. I have to test the effect of these moderators: percent female with ASD, percent with intellectual disability, and percent with communicative disorder.
However, I have a lot of missing data.
Is it correct to add 0.5 where the data is missing or 0?
The command that I use for meta regression is: metareg prev_ASD_BIPOLAR per_female_asd_bipolar , wsse( se_prev_ASD_BIPOLAR)
PS: wsse(varname) requires all values of varname to be greater than zero.
Thanks to everybody
Transforming survey results to numerical codes appropiately (Encoding help)
Stata and Stata Forum Beginner here.
Situation: Using Limesurvey data for a health-related QOL study. This has questions where the responses range from things like 'none of the time' to 'all of the time'.
Exporting the responses often results in strings (with non-numerical characters) where I would want numerical codes. So I looked up the encode command. (The alternative would be to use global find-and-replace in Excel, but I want to find a way to do this in Stata, so I can instantly transform any new responses appropriately using a do-file.)
Problem: encode assigns codes to some of the responses in unusable ways. For example, with 'none of the time' and 'all of the time', I would want the first to be 0 and the last to be 5, but it instead codes them in an illogical order, with the first being 3 and the last being 4, which makes regression results nonsensical. I understand why Stata can't encode them exactly how it would make sense without further information. Unfortunately, LimeSurvey can't export the answers in any different way. (LimeSurvey isn't great at exporting to Stata even with a specifically designed Stata XML plugin.)
Question: Is there a way to exactly tell Stata how to encode such responses? Or a better way to assign numerical codes to the string answers from the survey rather than encode?
Any help would be greatly appreciated as I'm up against the clock. Let me know if any further information is needed, again I'm new here so not sure what should be provided.
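One way to control the codes is to define the value label first and then let encode map against it, so the numeric codes follow the intended order rather than alphabetical order. A minimal sketch is below; the variable name q1 and the exact response wordings are assumptions, and the label text must match the exported strings exactly (including case).
Code:
label define freq 0 "None of the time" 1 "A little of the time" 2 "Some of the time" ///
                  3 "A good bit of the time" 4 "Most of the time" 5 "All of the time"
encode q1, generate(q1_num) label(freq)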
ologit in favour of parsimoniousness despite violated parallel lines?
Dear Statalists,
I could use your input on the following 😊
My case: I am testing the influence of a factor variable (4 different countries) on an ordinal outcome variable (text complexity, scale 1-6). Since the parallel lines assumption is not met for all four categories of the factor variable, I use gologit2 for a partial proportional odds model. This gives me the odds ratios for the predictor's influence at each cutoff point, which is fine. However, it is rather detailed for the hypothesis I am testing, which suggests that text complexity increases depending on the country (category 1 in the factor variable should be lowest, 4 highest). What I can say based on the PPO model is that this varies at each cutoff point (which makes sense). Yet computing a cumulative PO model with ologit also shows overarching block patterns (two low countries vs two higher countries) but not the increasing trend as hypothesized.
My question: I have read that I could still use ologit in favour of parsimoniousness (justifying with BIC, which indeed is lower for ologit than gologit2), but I'm not sure if ignoring the violated parallel lines assumption is a good way to go. Do you have experience with whether it is "okay" or common to do this? Or maybe other ideas to make interpretation less detailed? I was thinking of clustering the scale values again, so I have fewer cutoff points ...
I'm looking forward to your opinions on this and am trying to copy the gologit2 and ologit models below (first time, so I hope this works).
Thanks,
Julia
Code:
ologit icomplexity i.csystem, or
Iteration 0: log likelihood = -5620.6987
Iteration 1: log likelihood = -5537.1818
Iteration 2: log likelihood = -5537.0217
Iteration 3: log likelihood = -5537.0217
Ordered logistic regression Number of obs = 4,563
LR chi2(3) = 167.35
Prob > chi2 = 0.0000
Log likelihood = -5537.0217 Pseudo R2 = 0.0149
------------------------------------------------------------------------------
icomplexity | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
csystem |
2 | .9913936 .080441 -0.11 0.915 .8456296 1.162283
3 | 1.925266 .1530373 8.24 0.000 1.647517 2.249841
4 | 2.157683 .1677328 9.89 0.000 1.852752 2.512799
-------------+----------------------------------------------------------------
/cut1 | .3362666 .0567362 .2250656 .4474675
/cut2 | 1.378303 .0602158 1.260282 1.496323
/cut3 | 3.011779 .0782728 2.858367 3.16519
/cut4 | 4.851709 .1477801 4.562065 5.141352
/cut5 | 5.997357 .248598 5.510114 6.4846
------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
Code:
gologit2 icomplexity i.csystem, autofit lrforce or
------------------------------------------------------------------------------
Testing parallel lines assumption using the .05 level of significance...
Step 1: Constraints for parallel lines imposed for 4.csystem (P Value = 0.8842)
Step 2: Constraints for parallel lines are not imposed for
2.csystem (P Value = 0.00000)
3.csystem (P Value = 0.00000)
Wald test of parallel lines assumption for the final model:
( 1) [1]4.csystem - [2]4.csystem = 0
( 2) [1]4.csystem - [3]4.csystem = 0
( 3) [1]4.csystem - [4]4.csystem = 0
( 4) [1]4.csystem - [5]4.csystem = 0
chi2( 4) = 1.16
Prob > chi2 = 0.8842
An insignificant test statistic indicates that the final model
does not violate the proportional odds/ parallel lines assumption
If you re-estimate this exact same model with gologit2, instead
of autofit you can save time by using the parameter
pl(1b.csystem 4.csystem)
------------------------------------------------------------------------------
Generalized Ordered Logit Estimates Number of obs = 4,563
LR chi2(11) = 232.51
Prob > chi2 = 0.0000
Log likelihood = -5504.443 Pseudo R2 = 0.0207
( 1) [1]4.csystem - [2]4.csystem = 0
( 2) [2]4.csystem - [3]4.csystem = 0
( 3) [3]4.csystem - [4]4.csystem = 0
( 4) [4]4.csystem - [5]4.csystem = 0
------------------------------------------------------------------------------
icomplexity | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1 |
csystem |
2 | .8776573 .072918 -1.57 0.116 .7457701 1.032868
3 | 1.78562 .15009 6.90 0.000 1.514403 2.10541
4 | 2.270685 .1801171 10.34 0.000 1.943736 2.652629
|
_cons | .745114 .0430006 -5.10 0.000 .6654261 .834345
-------------+----------------------------------------------------------------
2 |
csystem |
1 | 1 4.40e-17 -4.16 0.000 1 1
2 | 1.289039 .1231981 2.66 0.008 1.068842 1.554599
3 | 2.089253 .1922505 8.01 0.000 1.744474 2.502174
4 | 2.270685 .1801171 10.34 0.000 1.943736 2.652629
|
_cons | .2302401 .01509 -22.41 0.000 .202485 .2617996
-------------+----------------------------------------------------------------
3 |
csystem |
1 | 1 1.20e-17 -8.01 0.000 1 1
2 | 2.26293 .3548734 5.21 0.000 1.664123 3.077207
3 | 3.49951 .5104081 8.59 0.000 2.62941 4.657534
4 | 2.270685 .1801171 10.34 0.000 1.943736 2.652629
|
_cons | .0332747 .0035651 -31.76 0.000 .0269722 .04105
-------------+----------------------------------------------------------------
4 |
csystem |
2 | 3.739563 1.489331 3.31 0.001 1.713241 8.162503
3 | 7.827675 2.759696 5.84 0.000 3.92226 15.62173
4 | 2.270685 .1801171 10.34 0.000 1.943736 2.652629
|
_cons | .0032357 .0009507 -19.51 0.000 .0018192 .0057552
-------------+----------------------------------------------------------------
5 |
csystem |
2 | 5.32339 3.901655 2.28 0.023 1.265668 22.39014
3 | 10.30695 6.901483 3.48 0.000 2.774406 38.29045
4 | 2.270685 .1801171 10.34 0.000 1.943736 2.652629
|
_cons | .0008055 .0004672 -12.28 0.000 .0002585 .0025103
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Unexpected coefficients in GMM sys
Hello everyone,
I'm new to GMM; I'm using this estimator because I found it the most suitable for my case. I have a panel dataset on the 20 Italian regions (n=20) for the period 2013-2017 (t=5). I want to use a dynamic model and regress GDP per capita on lagged GDP, other control variables (Investments, R&D, PublicExpenditure, Demography), and two variables of interest, Corruption and an index of CriminalOrganization. I want to use system GMM since my data show some persistence. I use the collapse suboption to avoid the problem of "too many instruments" (I don't know if it is better to also limit the lags). I want to treat PublicExpenditure and R&D as exogenous. Data are in logs. I copied my command to ask whether any of you experts see a problem with the idea I presented:
xtabond2 logGDPpercapita L.logGDPpercapita logInvestments logPublicExpenditure logR&D logCorruption logCrimOrg logDemography, gmmstyle(L2.logGDPpercapita L.logInvestments L.logCorruption L.logCrimOrg L.logDemography, collapse) ivstyle(logR&D logPublicExpenditure) robust
The first question is: what are the main robustness checks to do? Are Sargan-Hansen, AR, Wald enough?
The problems are: 1) I obtain non-significant estimates (for some coefficients), and mainly 2) the signs of some coefficients are unexpected (for example, the coefficient on the criminal organization index is positive). How is this possible, considering that the correlation between GDP and CrimOrg is strongly negative? Maybe collinearity among regressors? Singularity? Too many instruments? Please give me suggestions; it is really important for my research. PS: Results change dramatically if I exclude, add, or "change" some variables or modify "something" in the command.
Thanks in advance
Regards
Need some help on a loop
Hello dears all.
I merged two datasets.
The first dataset contained the listing of household members. In this dataset, I had age (age_m), sex (sexe_m), and ID.
The second contained the members who were sought in each household. The main variables concerned here are age (age_malade), sex (sexe_malade), and submission__uuid.
Now, I want to know whether any member who was sought is in the listing of members, using a loop over ID. In other words, I want to know if a sought member with age_malade and sexe_malade is in the listing.
I added a file for the test.
Thanks all!
Unable to read .dta file through do file - works fine otherwise
Hello!
I am working on a do file alongside a colleague but its commands will not load successfully and it generates an r(601) error. The code stops at the point where I instruct which file to load and it says
'file /some_data.dta' not found
However, when I type
use 'file/some_data.dta'
into the command window, it loads fine. This tells me it must be something wrong with the code in the do-file itself and not the file path I am using, because the file loads into Stata fine when called directly. I am a macOS user working with Stata 15. Because my colleagues work on Windows, I needed to change the file paths (done to my understanding below).
Here is the code I am working with:
----------------------------------------
gl filepath ="/Users/josephkalarickal/Desktop/Google Drive/Research Assistant/Work for the Professor/Work Projects/Ongoing Projects/SkillsCountry/"
gl data ="$filepath/data"
gl dofiles ="$filepath/dofiles"
gl tempfiles ="$filepath/tempfiles"
gl results ="$filepath/results"
gl logfiles ="$filepath/logfiles"
use "$filepath/some_data.dta", clear *this is where the code breaks for some reason.
----------------------------------------
Thank you in advance for the help.
Joseph
Add a vertical line?
Dear all, I generate a data set and estimate an interaction model as follows.
The graph is attached.
I wonder if we can add vertical lines, say at x=0 and x=0.5? Thanks.
Code:
clear
set seed 123
set obs 1000
gen x = runiform()
gen z = runiform()
gen xz = x*z
// interaction model
gen y1 = 1+3*x-2*z+1*xz+rnormal()
reg y1 c.x##c.z, robust
quietly margins, dydx(x) at(z=(0(0.1)1))
quietly marginsplot, recast(line) recastci(rline) ciopts(lpattern(dash) lcolor(red)) yline(0)
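Vertical reference lines can be added with xline(), which marginsplot accepts like any other twoway option. A minimal sketch continuing the example above is below; the lines are placed at 0 and 0.5 on the horizontal axis, i.e., at values of z where the margins are evaluated.
Code:
quietly margins, dydx(x) at(z=(0(0.1)1))
marginsplot, recast(line) recastci(rline) ciopts(lpattern(dash) lcolor(red)) ///
    yline(0) xline(0 .5, lpattern(dot))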
Wednesday, July 29, 2020
Import mulitple sas7bdat files into stata with a loop?
Hello
I have 524 sas files I would like to import as dta files, ex.
bef201012.sas7bdat
bef201112.sas7bdat
bef201212.sas7bdat
faik2010.sas7bdat
faik2011.sas7bdat
faik2012.sas7bdat
ind2010.sas7bdat
ind2011.sas7bdat
ind2012.sas7dat
etc,
My code looks like this:
local filenames: dir . files "*sas7bdat"
foreach file of local filenames {
import sas using "`file'"
save `file'.dta, replace
}
but it doesn't seem to work very well. Is there anybody who can help me correct my code?
Kind regards Frank
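Two likely issues with the loop as posted are that import sas needs the clear option on each pass and that the saved name keeps the .sas7bdat extension. A minimal sketch of one possible fix is below.
Code:
local filenames : dir . files "*.sas7bdat"
foreach file of local filenames {
    import sas using "`file'", clear
    * strip the .sas7bdat extension before saving as .dta
    local stem : subinstr local file ".sas7bdat" "", all
    save "`stem'.dta", replace
}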
test for significant difference in number of children between 2 groups
Hi there,
I've been searching for a few hours on the internet but right now I can't see the wood for the trees.
I want to test if there is a significant difference in the number of children a parent has between men and women. In my sample number of children only takes on the values 1,2, and 3 by coincidence.
I've read some things about just using a t-test, but also about a two-sample Poisson rate test because it is count data. I'm confused as to how I should interpret the variable and which test I should use.
Thank you in advance.
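For reference, minimal sketches of the two options mentioned are below (the variable names nchildren and female are assumptions): a two-sample t-test of the mean number of children by sex, and a Poisson model that treats the count as the outcome.
Code:
ttest nchildren, by(female)
poisson nchildren i.female, irr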
95% Confidence Interval for relative concentration index
Hello
I would like to calculate 95% CI of relative concentration index. However, I could not find a stata command for this. Could anyone kindly share this stata command?
To briefly explain my dataset, it is a complex survey design. After I opened up my dataset, I have used the following stata command.
svyset [pweight=wt_itvex], strata(kstrata) psu(psu)
conindex overw_obese_bi_valid [aweight=wt_itvex], rankvar(mincome_per_capita) bounded limits (0 1) truezero wagstaff cluster (psu) compare(town_t)
** variable information
(overw_obese_bi_valid: binary variable for overweight/obese adults)
(mincome_per_capita: continuous variable for monthly income per capita)
(town_t: categorical variable for urban and rural areas)
After I run this command, I could get index value and standard error, but not 95% CI (You can also see the results in the attached file).
Many thanks in advance.
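If conindex reports the index estimate and its standard error, a normal-approximation 95% CI can be computed by hand as estimate ± 1.96 × SE. A minimal sketch is below; the two numbers are placeholders to be replaced with the values from the output.
Code:
scalar ci_hat = .123
scalar ci_se  = .045
display (ci_hat - invnormal(0.975)*ci_se) ", " (ci_hat + invnormal(0.975)*ci_se)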
Loop of different files using capture
Hello Statalisters
The problem is that when one particular file breaks, the files that follow it break too, even though when I test the do-file on each file individually (outside the loop) it works just fine.
I am running a loop over several files, like:
Code:
local originals "/Users/onedrive/stata/files" // Defining the working directory
local files: dir "`originals'" files "group*" // Define local files
local dir1"/Users/onedrive/stata/newfiles"// Working directory for new files
foreach f of local files {
capture {...
save `"`dir1'/`f'"', replace
}
if _rc!=0 {
save `"`dir1'/break`f'"', replace
}
}
My question is: how can I run the do-file over several files so that one particular file is allowed to fail if there is an exception, while the remaining files keep running through the loop?
Thank you and regards,
LR
Stata/MP 15.1 for Mac (64-bit Intel)
Revision 03 Feb 2020
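A minimal sketch of one way to let a failing file be skipped and reported while the loop keeps going is below; the directory names follow the question, and the per-file processing is only indicated by a comment.
Code:
local originals "/Users/onedrive/stata/files"
local dir1 "/Users/onedrive/stata/newfiles"
local files : dir "`originals'" files "group*"
foreach f of local files {
    capture {
        use "`originals'/`f'", clear
        * ... per-file processing steps go here ...
        save "`dir1'/`f'", replace
    }
    if _rc {
        display as error "`f' failed with return code " _rc " -- skipping"
        continue
    }
}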
Leave out 90th percentile
I am using a census dataset which has the following variables: county, industry, and income.
I know how to calculate the normal one: collapse (p90) income, by(county industry).
How do I calculate a leave-out 90th percentile income by industry? Suppose there are 3 counties (i=1,2,3) and 3 industries (j=1,2,3). For county i=1 and industry j=1, I would like to get the 90th percentile income of people in industry j=1 and in counties i=2 and 3.
Thanks a lot in advance!
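A minimal sketch of a leave-out 90th percentile is below: for each county-industry cell it takes the 90th percentile of income in the same industry across all other counties. It assumes county and industry are numeric, and it loops over cells, which is slow on a large census file but transparent.
Code:
gen p90_leaveout = .
levelsof county, local(counties)
foreach c of local counties {
    levelsof industry if county == `c', local(inds)
    foreach j of local inds {
        quietly summarize income if industry == `j' & county != `c', detail
        quietly replace p90_leaveout = r(p90) if county == `c' & industry == `j'
    }
}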
Variable label in generate?
Dear All,
I'd like to confirm whether there is any syntax that would allow me to prescribe variable labels in generate's syntax, something like the hypothetical:
Code:
generate balance=income-spent, varlabel("Balance at the end of the month")
The documentation for generate suggests that one can prescribe the value labels immediately in the same syntax, but is silent about the variable labels.
If it doesn't exist yet, it would be good to have it some time in the future. This would help quite a bit in making programs shorter and better documented.
Currently one can do that in 2 statements, but that requires retyping the name of the variable and may be spaced out in the code:
Code:
generate balance=income-spent
label variable balance "Balance at the end of the month"
Thank you, Sergiy Radyakin
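No such option seems to exist in generate itself, but a small wrapper can get close to the hypothetical syntax. A minimal sketch is below; the command name genlab and its varlabel() option are made up, income and spent are hypothetical variables, and the sketch ignores complications such as a storage type before the variable name.
Code:
capture program drop genlab
program define genlab
    version 15
    * generate a variable and attach a variable label in one call
    syntax anything(equalok) , VARLabel(string)
    generate `anything'
    gettoken newvar : anything, parse(" =")
    label variable `newvar' `"`varlabel'"'
end

genlab balance = income - spent, varlabel(Balance at the end of the month)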
Can I do an OLS Regression if the distribution of my DV is like this?
The dependent variable is the proportion of the ** time / *** time.
Because this is a proportion, the values range from 0 to 1.
As the histogram shows, 319 obs (45.83%) have the value of "0", and 168 obs (24.14%) have the value of "1".
I don't think I can do a multivariate regression analysis in this case, but then which one should I do?
Should I group them as two (0~50%)(51~100%) and do logit?
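With a 0-1 proportion that piles up at both endpoints, one commonly suggested option is a fractional response model rather than OLS or a dichotomized logit. A minimal sketch is below; the outcome name share_time and the covariates are assumptions.
Code:
fracreg logit share_time x1 x2 x3, vce(robust)
margins, dydx(*)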
xtwest command
I am trying to apply the xtwest command and I get an error.
I have one dependent variable and 5 independent variables.
t=28 n=5
xtwest net fdi gdp capita un fifto64 , constant lags(1) leads(1) lrwindow(3) bootstrap(100)
I get this result:
unknown egen function rowmiss()
How to save a Kaplan-Meier survival point estimate to a local macro?
Hello Stata Community,
I am currently using Stata version 15.1.
I have some straightforward survival time data, (time in years, censor (1=failed, 0=censored) , group of interest). example as follows:
I can easily display the 5 and 10 year survival point estimates for the groups using the "sts list" command:
However, what I would like to do is actually just save the value of the 5-year failure function for any particular group (in this case, for the "RT" group it is equal to 0.0686) as a local scalar so that I can use this discrete value in a program. Does anyone know how to save a failure function value from a discrete specified time (like 5 years) for a KM estimate in a local scalar?
Many Thanks,
Jonathan Tward
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double pfsyrs byte pfscensor str17 first_deftx_type
4.802739726027397 1 "RT"
3.147945205479452 0 "Surgery"
0 0 "RT"
1.704109589041096 0 "RT"
4.5698630136986305 0 "RT"
3.734246575342466 0 "Surgery"
3.2246575342465755 0 "Surgery"
7.043835616438356 0 "Surgery"
4.816438356164384 0 "Surgery"
1.189041095890411 0 "Surgery"
.6958904109589041 0 "Surgery"
-.273972602739726 0 "RT"
.15616438356164383 0 "RT"
2.1890410958904107 0 "RT"
.11232876712328767 1 "Surgery"
.4986301369863014 1 "RT"
-.052054794520547946 0 "RT"
11.734246575342466 0 "RT"
0 1 "Surgery"
11.542465753424658 0 "RT"
12.126027397260273 0 "RT"
4.594520547945206 0 "RT"
1.6383561643835616 0 "RT"
.3863013698630137 1 "Surgery"
0 1 "RT"
.8493150684931506 1 "RT"
1.6821917808219178 1 "RT"
7.682191780821918 0 "RT"
.8712328767123287 0 "RT"
1.4712328767123288 0 "Surgery"
1.515068493150685 0 "Surgery"
1.1863013698630136 0 "Surgery"
4.183561643835616 1 "RT"
10.35068493150685 0 "RT"
2.6082191780821917 1 "RT"
.6520547945205479 0 "Surgery"
3.4547945205479453 0 "RT"
.1178082191780822 1 "Surgery"
.8246575342465754 1 "Surgery"
3.0027397260273974 0 "Surgery"
.7068493150684931 0 "Surgery"
2.802739726027397 0 "Surgery"
4.561643835616438 1 "RT"
8.24931506849315 0 "RT"
-.4 0 "RT"
.23835616438356164 0 "RT"
-.03287671232876712 0 "RT"
2.8054794520547945 0 "RT"
1.5068493150684932 0 "RT"
5.175342465753425 0 "Surgery"
1.9205479452054794 0 "Surgery"
end
Code:
. stset pfsyrs, failure(pfscensor) scale(1)
failure event: pfscensor != 0 & pfscensor < .
obs. time interval: (0, pfsyrs]
exit on or before: failure
------------------------------------------------------------------------------
4,929 total observations
568 observations end on or before enter()
------------------------------------------------------------------------------
4,361 observations remaining, representing
1,012 failures in single-record/single-failure data
15,887.025 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 89.83836
. sts list, by(first_deftx_type) failure at(5 10)
failure _d: pfscensor
analysis time _t: pfsyrs
Beg. Failure Std.
Time Total Fail Function Error [95% Conf. Int.]
-------------------------------------------------------------------------------
RT
5 442 51 0.0686 0.0097 0.0520 0.0903
10 147 28 0.1511 0.0180 0.1194 0.1902
Surgery
5 800 95 0.0577 0.0061 0.0468 0.0711
10 235 45 0.1550 0.0155 0.1272 0.1883
-------------------------------------------------------------------------------
Note: Failure function is calculated over full data and evaluated at indicated
times; it is not calculated from aggregates shown at left.
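A minimal sketch of one possible approach is below: it stores the Kaplan-Meier survivor function for the RT group as a variable with sts generate, reads off its value at the last observed analysis time at or before 5 years, and converts that to the failure probability held in a local.
Code:
quietly sts generate km_surv = s if first_deftx_type == "RT"
quietly summarize _t if first_deftx_type == "RT" & _t <= 5 & !missing(km_surv)
local t5 = r(max)
quietly summarize km_surv if first_deftx_type == "RT" & float(_t) == float(`t5')
local fail5_RT = 1 - r(min)
display "5-year failure, RT group: `fail5_RT'"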
Importing oddly formatted txt data into stata
Hi Statalist, I am dealing with precinct-by-precinct voting results from counties in Texas. My goal is to record election results (how many votes each person received, who won, etc.). So far I have just been Ctrl-F'ing to each precinct and aggregating the results at the end, but the data are formatted consistently enough that, if I knew my Stata code well enough, I could easily transfer them into a .csv or .dta format.
My end goal is to have all the names of the candidates and their respective vote counts. Any advice to speed up my process and drop the manual step of looking at each outcome?
Here is an example of the data:
Code:
0001 001 CALDWELL HTS
VOTES PERCENT
REGISTERED VOTERS - TOTAL . . . . . . 0
BALLOTS CAST - TOTAL. . . . . . . . 282
CITY OF ROUND ROCK MAYOR
VOTE FOR 1
PATRICK BOSE . . . . . . . . . . 29 11.15
SUEANN CAMPBELL . . . . . . . . . 46 17.69
NYLE MAXWELL . . . . . . . . . . 185 71.15
CITY OF ROUND ROCK COUNCIL, PLACE 1
VOTE FOR 1
RUFUS HONEYCUTT . . . . . . . . . 84 33.07
TED WILLIAMSON. . . . . . . . . . 98 38.58
SHARON IZZO. . . . . . . . . . . 72 28.35
CITY OF ROUND ROCK COUNCIL, PLACE 4
VOTE FOR 1
CARLOS T. SALINAS. . . . . . . . . 189 100.00
ROUND ROCK ISD TRUSTEE, PLACE 1
VOTE FOR 1
YVETTE SANCHEZ. . . . . . . . . . 136 50.18
KARLA SARTIN . . . . . . . . . . 43 15.87
VIVIAN SULLIVAN . . . . . . . . . 92 33.95
ROUND ROCK ISD TRUSTEE, PLACE 3
VOTE FOR 1
PHIL DENNEY. . . . . . . . . . . 41 15.77
DIANE COX . . . . . . . . . . . 104 40.00
DEBBIE BRUCE-JUHLKE . . . . . . . . 115 44.23
ROUND ROCK ISD TRUSTEE, PLACE 6
VOTE FOR 1
DANIEL MCFAULL. . . . . . . . . . 76 28.90
RAYMOND HARTFIELD. . . . . . . . . 108 41.06
MARK MAUND . . . . . . . . . . . 79 30.04
PRECINCT REPORT WILLIAMSON COUNTY, TEXAS
JOINT GENERAL ELECTION
MAY 7, 2005
RUN DATE:06/01/05 05:02 PM
0002 002 STONY POINT
VOTES PERCENT
REGISTERED VOTERS - TOTAL . . . . . . 0
BALLOTS CAST - TOTAL. . . . . . . . 338
CITY OF ROUND ROCK MAYOR
VOTE FOR 1
PATRICK BOSE . . . . . . . . . . 26 9.52
SUEANN CAMPBELL . . . . . . . . . 62 22.71
NYLE MAXWELL . . . . . . . . . . 185 67.77
CITY OF ROUND ROCK COUNCIL, PLACE 1
VOTE FOR 1
RUFUS HONEYCUTT . . . . . . . . . 96 36.64
TED WILLIAMSON. . . . . . . . . . 104 39.69
SHARON IZZO. . . . . . . . . . . 62 23.66
Iterations in logistic regression
Is there any work on what affects the number of iterations required to achieve convergence in logistic regression -- number of X variables, distribution of X variables, correlation among X variables, strength of relationship between X and Y?
I apologize for posting this non-Stata question here, but I've turned up nothing in Google Scholar and this tends to be a pretty knowledgeable crowd.
Finding Most Common String Values Across Variables
Hi all:
I am trying to find the most common string values across variables. While I work in criminal justice data, I can't share that data. So I made a test set with color. Assume that each respondent might have multiple colors occur. Each time a color occurs, they get the color in a new variable. (In real life, these are charges.) There is no rhyme or reason why something is entered as the first color or second. I need to know the five most common colors that occur across the data set (the five most common charges). I know how to do this for one variable with the group command, but can't figure out how to do so across variables.
I searched the forums, but did not find a solution.
This is my first real post here, so please let me know if I did not enter the needed information.
Thank you!
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 id str6 Color1 str5(Color2 Color3) str6 Color4
"1" "Blue" "" "" ""
"2" "Red" "Black" "White" ""
"3" "Orange" "Blue" "" ""
"4" "Black" "Red" "Blue" "Orange"
"5" "Blue" "" "" ""
"6" "Blue" "Green" "Tan" ""
"7" "Green" "Blue" "" ""
"8" "Red" "Blue" "Green" ""
"9" "Purple" "" "" ""
"10" "Black" "Red" "" ""
end
Thank you!
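A minimal sketch of one approach is below: stack the Color* variables into a single long variable, count how often each value occurs, and list the most frequent ones (adjust the 1/5 range if there are fewer than five distinct values).
Code:
preserve
keep id Color*
reshape long Color, i(id) j(slot)
drop if Color == ""
contract Color, freq(n)
gsort -n
list Color n in 1/5
restore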
Interpretation of sdtest
Hi everyone,
I am using the sdtest command to compare variances in two groups, but I am having trouble interpreting Stata's output (attached).
I'm interpreting like: At a significance level of 5%, the null hypothesis cannot be rejected. But, at a significance level of 1%, the null hypothesis can be rejected.
So, my doubt is: Am I right?
I would appreciate if you can help me with this question. If you could recommend me books on the subject, I would really appreciate it.
Help producing line graph
Hi all,
I have a somewhat embarrassing question. I am trying to make a line graph that displays what percentage of the observations (or individuals) fall into different categories over time -- as seen in the image attached. Does anyone have any idea of how to do so?
Thanks,
Claire
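A minimal sketch of one way to build such a graph is below, assuming a categorical variable category and a time variable year (both names are assumptions): compute each category's share of the observations per year, then plot the shares as lines.
Code:
preserve
contract year category
bysort year: egen yeartotal = total(_freq)
gen pct = 100 * _freq / yeartotal
twoway (line pct year if category == 1, sort) ///
       (line pct year if category == 2, sort) ///
       (line pct year if category == 3, sort), ///
       ytitle("Percent of observations") ///
       legend(order(1 "Category 1" 2 "Category 2" 3 "Category 3"))
restore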
The Hausman test for endogeneity
I have read in a paper that we can use the Hausman test for endogeneity, and the authors mention that we can use the error term as described in the quote below.
Could someone kindly explain how to extract the error term and how to apply this procedure?
HTML Code:
we perform the Hausman test (Gujarati, 2003) as follows. First, we obtain the error term (ύ) from an estimate of audit committee cash compensation regression (ACCASH) that includes the following determinants: firm size (LNTA), leverage (LEV), return on assets (ROA), market‐to‐book ratio (MKTBOOK), litigation risk (LITRISK), sales growth in industry (INDSAL), inside ownership (INSIDER), CEO power (CEOPOWER), accounting expertise on the audit committee (ACEXPERT), audit committee meetings (ACMEET), audit committee multiple‐directorships (ACBUSY), and industry fixed effects. Next, we include the obtained error term (ύ) in all our main regressions to determine if it is significant. A significant ύ will indicate that the propensity to beat earnings by a large margin and audit committee cash compensation is endogenous. In all of our primary and additional tests, the error term (ύ) is not significant (p > .10). As there is no evidence of endogeneity between our test and dependent variables, we can proceed to estimate and present single multivariate regression results
thanks in advance.
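A minimal sketch of the two-step procedure the quote describes is below. The first-stage regressors are those listed in the quote; beat_margin (the main dependent variable) and the industry identifier are placeholders for the authors' actual variables, and the second-stage covariates are purely illustrative.
Code:
* first stage: regress audit committee cash compensation on its determinants
regress ACCASH LNTA LEV ROA MKTBOOK LITRISK INDSAL INSIDER CEOPOWER ACEXPERT ACMEET ACBUSY i.industry
predict uhat, residuals
* second stage: include the first-stage residual and test its significance
regress beat_margin ACCASH uhat LNTA LEV ROA MKTBOOK, vce(robust)
test uhat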
Using analytical weight in STATA's mixed effects model
I have a learning assessment dataset of over 60 countries and 2-5 years/waves. While countries are the same in all years, individuals (students) are different in each year. In other words, the data is cross-sectional at the student level. I use a two-step procedure to conduct country-level mixed-effects panel regression. At first, I regress student economic background on their math achievement for each country in each year using a simple OLS regression: achievement = a + economic_background + e.
Or in Stata:
Code:
reg achievement economic_background
The data structure is somewhat like the following -- different students are surveyed in different years from the same country:
Code:
student  country  year  achievement  economic_background
101      1        2000  500          78
201      1        2000  488          98
106      1        2003  589          66
407      1        2003  400          76
Then, I use the coefficient of economic_background (named inequality_gradient) as the dependent variable at the second stage, regressed on some country-level variables. I use a mixed-effects model via Stata's mixed command. The model looks like the following:
Code:
mixed inequality_gradient var2 var3 || country:
However, to get an unbiased standard error of the mixed-effects model at the second stage, I would like to weight the model by the inverse square of the standard error of the economic_background coefficient found in the first OLS regression. To employ this weight, named gradient_se, I am trying to use Stata's analytical weight (aweight) option. But it seems the mixed command does not accept the aweight option. Does anybody have any suggestions about how to incorporate these analytical weights in mixed in any other way?
I have tried the following code but get an error:
Code:
mixed inequality_gradient var2 var3 [aw=gradient_se] || country:
aweights not allowed
r(101);
I have also tried pweight, but since I only have weights at level 1 I get a warning saying that the results may be biased. I do not have weights for countries at level 2. Can I incorporate the weights only at level 1 in a mixed model some other way?
The data structure at the second stage looks like the following:
Code:
country  inequality_gradient  gradient_se  year  var2  var3
1        300                  44           2000  1     3
1        200                  34           2000  1     3
2        498                  55           2003  2     2
2        388                  67           2003  4     1
Please let me know if I need to make my problem clearer. I would be happy to do so.
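One workaround sometimes suggested is that, while mixed does not accept aweights, it does accept pweights, and level-1 weights can be rescaled with pwscale(). A minimal sketch is below; whether this matches the intended inverse-variance weighting of the second-stage estimates should be checked.
Code:
gen double invvar_wt = 1/(gradient_se^2)
mixed inequality_gradient var2 var3 [pweight=invvar_wt] || country:, pwscale(size)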
Fixed effects and cluster error in gravity model
Dear all.
I need your help.
I am working on a gravity model that aims to estimate Colombian exports to 136 partners from 2005 to 2018 through the PPML approach.
Therefore, the dataset includes only exports from Colombia to its partners (one exporter; 136 importers).
My questions are:
1. Is it necessary to include time fixed effects?
2. Is it compulsory to estimate the specification with clustering by country or distance? Or is it fine to estimate it with robust standard errors as usual?
Thanks in advance.
Regards.
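For reference, a minimal sketch of what year fixed effects and clustering by importing partner could look like with PPML is below. ppmlhdfe is a community-contributed command (ssc install ppmlhdfe), and all variable names are assumptions; the choice of fixed effects and clustering level remains the substantive question asked above.
Code:
ppmlhdfe exports ln_distance contiguity common_language rta, absorb(year importer) vce(cluster importer)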