Sunday, October 31, 2021

gen dummy with missing data

Dear All, Suppose that I run the code
Code:
sysuse auto, clear
gen d = (rep78 > 2) & !missing(rep78)
i.e., d=1 if rep78 > 2 (3, 4, 5) and d=0 if rep78 <= 2 (1, 2). My problem is that I want d = . if rep78 is missing. Any suggestions (hopefully, in one line)? Thanks.
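A minimal one-line sketch using cond(), which passes missing through explicitly:
Code:
gen d = cond(missing(rep78), ., rep78 > 2)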

Tabulate two way by summarizing the sum

Hi Everyone,

I have the following data on year (first dimension of the panel), market (second dimension), and product code/type (third dimension), along with market share, where a market is defined by the interaction of the year and market dimensions. The market share was generated by:

bysort ye ma: egen tsales=total(qu)
gen share=qu/tsales

where qu is the quantity sold of a product in a given market-year cell.


Code:
clear
input byte(ye ma) int co byte org float(qu share)
70 1 411 2  1100   .004219555
70 1 241 6   544  .0020867616
70 1 499 7  1880   .007211603
70 1 196 1  8600    .03298925
70 1 413 2  1000   .003835959
70 1 488 2  5400    .02071418
70 1 212 1  7150   .027427107
70 1 134 2  6660    .02554749
70 1  64 3  7800    .02992048
70 1 410 2  2100   .008055514
70 1 500 7   690   .002646812
70 1 400 3   800   .003068767
70 1 439 2  2500   .009589897
70 1 435 3  7200   .027618906
70 1 429 3  4300   .016494624
70 1  36 1  6700   .025700925
70 1 458 3   800   .003068767
70 1 481 2   380  .0014576644
70 1 434 3  1100   .004219555
70 1 217 7  7350     .0281943
70 1 530 7   350  .0013425857
70 1 437 2  7600    .02915329
70 1 269 4  9350   .035866216
70 1 408 2  1350   .005178545
70 1  15 2  2700    .01035709
70 1 412 2  2200    .00843911
70 1 172 2 11000    .04219555
70 1 447 4   560   .002148137
70 1 478 2   150 .00057539385
70 1 455 3   220   .000843911
70 1 402 3  1200  .0046031508
70 1 431 3  2300   .008822706
70 1 430 3  3300   .012658665
70 1 417 1  3700    .01419305
70 1 419 1  1300   .004986747
70 1  26 1  3500   .013425857
70 1 497 1  6350    .02435834
70 1 491 1  6000   .023015754
70 1 174 2  1500   .005753939
70 1 418 1  4650    .01783721
70 1 422 8  7475   .028673794
70 1 521 1 14100    .05408702
70 1 503 7   250  .0009589898
70 1 407 2  2200    .00843911
70 1 406 2  3000   .011507877
70 1 214 1  4450   .017070018
70 1 535 6  2000   .007671918
70 1 524 1  3800   .014576645
70 1 213 1  8000   .030687673
70 1 544 2  5200   .019946987
end
Now, org is the origin country of the product and ma is the market (second dimension of the panel) in which the good is sold. What I want is a cross table with destination (ma) as columns and origin country as rows, where each cell shows what percentage of sales in a market comes from a certain country. Each column should sum to 100. I tried

Code:
tab org ma, summarize(share)

which did not give me the desired result. To confirm that the shares are correct, I ran the following command, and it showed that the calculation of share was right.
Code:
sum share if ma==1 & ye==71
I think the tab command is giving me the number of observations and their average across the years, rather than summing up the market shares of all the goods from the same origin in a given year. Any help is much appreciated.
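A sketch of one possible route: -table- can sum (rather than average) the shares within each origin-by-market cell. The first line uses Stata 17 syntax; the comment shows the older equivalent.
Code:
table org ma, statistic(sum share) nototals
* Stata 16 and earlier: table org ma, contents(sum share)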

Assigning colors to stacked bar plot with ordinal y-variable.

Dear Forum,

I would like to plot a stacked bar graph of an "outcome" over hiv status (positive vs. negative) and stratified by sex (male vs. female).

The "outcome" is in fact 3 separate variables ("low", "mid" and "hi") generated from an ordinal yvar with 3 levels ("1" "2" and "3"):
  • low = 1 if yvar==1
  • mid = 1 if yvar==2
  • hi = 1 if yvar==3
My specific goal - for which I seek guidance - is that I would like to (i) color-code the bars according to HIV status (negative = navy; positive = maroon), while (ii) indicating the stacked bars "low" (100%), "mid" (20%), and "hi" (70%) by varying the opacity of their color, and lastly (iii) generate a commensurate legend.

I have written the stacked bar command below:

Code:
graph bar (sum) cvd_low cvd_mid cvd_hi, over(hiv) over(sex) percent legend(label(1 "Low (<10%)") label(2 "Intermediate (10-30%)") label(3 "High (≥30%)") order(1 - "" 2 - "" 3)) stack

The output is also copied below.

[attachment: graph output not shown]


The goal is to have, for example, a navy "HIV negative" bar with different levels of outcome (cvd risk) indicated by different opacities (or shades) of the navy. I will very much appreciate any suggestions you may have on how to go about this.
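One possible approach (a sketch, assuming hiv is coded 0/1 and Stata 15 or later for %-opacity colors): draw one graph per HIV group so each group's bars share a base color that varies only in opacity across the stacked outcome levels, then combine:
Code:
graph bar (sum) cvd_low cvd_mid cvd_hi if hiv == 0, over(sex) percent stack ///
    bar(1, color(navy%35)) bar(2, color(navy%65)) bar(3, color(navy)) ///
    title("HIV negative") name(neg, replace)
graph bar (sum) cvd_low cvd_mid cvd_hi if hiv == 1, over(sex) percent stack ///
    bar(1, color(maroon%35)) bar(2, color(maroon%65)) bar(3, color(maroon)) ///
    title("HIV positive") name(pos, replace)
graph combine neg pos, ycommon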

Thanks in advance.

Itai

egen problem

Hi list,

I want to generate a variable that equals the mean of the first two nonmissing values in a three-variable list (var1, var2, var3). At the same time, I also want to record which two of the three variables were used to generate the new variable (I do not need the variable names; their positions in the list are enough). How can I do that efficiently? Thank you very much!

Code:
webuse auto, clear

replace headroom=. if mpg==17
replace weight=. if mpg==22|mpg==17
//I did the above manipulation so that some observations have only one nonmissing value in the varlist to be considered below
//so that you cannot simply apply egen rowmean to fulfil my task

keep rep78 headroom weight
rename rep78 var1
rename headroom var2
rename weight var3
//how to generate a variable that equals the mean of the first two nonmissing variables in the varlist (var1, var2, var3)
//and record which two of the three variables in my list have been used to generate the new variable?
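A minimal sketch (the new variable names are my own): walk the varlist, accumulating values until two nonmissing ones have been used, and record the positions used in a string.
Code:
gen double sum2 = 0
gen byte n2 = 0
gen str2 used = ""
local i = 0
foreach v of varlist var1 var2 var3 {
    local ++i
    replace used = used + "`i'" if !missing(`v') & n2 < 2
    replace sum2 = sum2 + `v'   if !missing(`v') & n2 < 2
    replace n2 = n2 + 1         if !missing(`v') & n2 < 2
}
gen mean2 = sum2/n2 if n2 > 0
* used is "12", "13", or "23" (or a single digit if only one value exists)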

"Encode" a String with another String?

Hello:
I am trying to create a long code list where var1 is a string and what it represents is shown in var2, also a string.
I would then like to store this coding so that I can refer to it later: when var1 shows up, var2 will be displayed in the dataset.
This is similar to how -encode- works for turning strings into numbers, except that I want to get another string.

Code:
*
clear
input str5 var1 str27 var2
"one"   "the diameter of the earth"  
"two"   "the diameter of venus"      
"three" "the diameter of jupiter"    
"four"  "the population of paris"    
"five"  "the surface area of england"
"six"   "the population of phoenix"  
end
Is there any sort of command available for this?
Or, if not, I thought of making a loop using globals, but couldn't get the syntax right. I have a few thousand unique strings I'd like to "encode"; is there a limit to the number of globals you can have?
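One common route (a sketch; the filename codelist is my own) is to save the pairs once as a lookup dataset and merge, rather than using globals:
Code:
* save the code list once
preserve
keep var1 var2
duplicates drop
save codelist, replace
restore

* later, attach var2 to any dataset that contains var1
merge m:1 var1 using codelist, keep(master match) nogenerate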

Thanks for advice...

Diff in diff: Fix data for one-to-one matching propensity score

I have a data set of companies with the date they were acquired by a business group, plus a control group of companies that have been in their business group since their creation. Each company has as many observations as the years between its creation date and 2018, and a dummy variable that is 1 between the acquisition date and 2018. For the control group the dummy variable is 0 in all observations.

I also used one-to-one propensity score matching to pair treatment and control companies on industry and company creation date. Given the propensity score matching, how can I change the dummy variable for control observations so that it becomes 1 from the date their pair in the treatment group was acquired?
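A sketch, assuming the matching left a shared pair identifier (pair_id), that acq_year holds the treated firm's acquisition year, and that post is the dummy (all three names are my own):
Code:
bysort pair_id: egen pair_acq_year = max(acq_year)
replace post = (year >= pair_acq_year) if treated == 0 & !missing(pair_acq_year)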

Thanks!

New to stata, Impossible problem for a summary table?

Hello all- I'm so glad I found this forum! I've just started working with Stata for a biostats class. I've been given a problem set to work on and am already stuck on the first question. The professor has asked us to create a descriptive summary table for continuous data that includes both the variable name and the variable description (as well as sd, mean, min, max, and # of obs). So far I've tried su, fsum, and univar. Adding label as an option for fsum gets me the "closest", except there's no "variable label" header in the table. Any ideas?
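If no canned command fits, a bare-bones sketch with a loop prints name, label, and the statistics on one line per variable (the display formats are arbitrary):
Code:
foreach v of varlist _all {
    quietly summarize `v'
    display %-12s "`v'" %-30s `"`: variable label `v''"' ///
        %9.2f r(mean) %9.2f r(sd) %9.2f r(min) %9.2f r(max) %8.0f r(N)
}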

Maximum number of iterations exceeded

Hello,

I hope you guys are doing well.
I am estimating a model using panel ARDL techniques, namely the dynamic fixed effects estimator, and Stata returned the message "Maximum number of iterations exceeded".
I would like to know what it means and how I can overcome this situation.

Thank you so much

Reshape Wide Issue

Hi, I have been trying to reshape my data from long to wide format using reshape wide. The data looks like this:

OrganisationName   AssetClass       CapitalCommitted   TotalCapitalCommitted
Maroni             Private Equity   50                 120
Susani             Private Equity   70                 120
Mazia              Debt             30                 75
Mungo              Debt             45                 75

TotalCapitalCommitted denotes the total amount of capital committed per asset class by all organisations (I obtained this number using the command: egen TotalCapitalCommitted = total(CapitalCommitted), by(AssetClass)).

I now would like to reshape my data from long to wide format, and I have been trying to do so through: reshape wide TotalCapitalCommitted, i(OrganisationName) j(AssetClass) string. However, whenever I run this code I get the error message: TotalCapitalCommittedInfrastructure invalid variable name. Would anyone be able to help me? The most important piece of information I have to showcase is the total capital committed per asset class.
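A sketch of one likely fix: Stata variable names are capped at 32 characters, and TotalCapitalCommittedInfrastructure is 35, so shortening the stub (and stripping spaces from values such as "Private Equity") may be enough:
Code:
rename TotalCapitalCommitted TCC
replace AssetClass = subinstr(AssetClass, " ", "", .)
reshape wide TCC, i(OrganisationName) j(AssetClass) string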
Many thanks

Calculating F-statistic after constrained regression

I have run this unconstrained regression
*Model: lcost = β1 + β2*loutput + β3*lplabor + β4*lcapital + β5*lpfuel

and would like to set the restriction R = (0, 0, 1, 1, 1) to test the hypothesis β_lplabor + β_lcapital + β_lpfuel = 1.
I defined matrices for R and r
Code:
matrix R_c = (0,0,1,1,1)
gen r=1
I need to test the null hypothesis of homogeneity based on the F-ratio formula using the SSR of the restricted and unrestricted regressions, F = ((SSR_restricted - SSR_unrestricted)/q) / (SSR_unrestricted/(n - k)), where q is the number of restrictions; or, based on multiplying the restriction matrix by the coefficient vector (Rβ), use the other formula to calculate the F-stat manually.

How to best do this?
I used this code below but apparently there is an error
Code:
reg lcost loutput lplabor lcapital lpfuel
constraint 3 _b[lpfuel]=_b[lplabor]=_b[lcapital]=1

 cnsreg lcost loutput lplabor lcapital lpfuel, constraints(3)
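A sketch of two simpler routes: the hypothesis is a single linear restriction, so -test- after the unconstrained regression reports the F statistic directly, and -constraint- accepts the restriction written as one equation:
Code:
reg lcost loutput lplabor lcapital lpfuel
test lplabor + lcapital + lpfuel = 1

constraint 3 lplabor + lcapital + lpfuel = 1
cnsreg lcost loutput lplabor lcapital lpfuel, constraints(3)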

How to convert SIC codes into Fama French 12

Hello, I am a uni student and I'm terrible with computers. For a project that is due in a week, we are supposed to convert industry SIC codes (which number in the thousands) into Fama-French 12 classifications. I've read a few posts on this forum detailing how to do this, but I'm lost. I'm unfortunately not too proficient at Stata. Can anyone give me a step-by-step explanation? Any help is appreciated. Thank you in advance.
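The usual approach is a set of range-based replaces, one per Fama-French industry. A sketch for the first industry only (consumer nondurables), assuming sic is numeric; the SIC ranges shown must be verified against the definition file on Ken French's website:
Code:
gen byte ff12 = .
replace ff12 = 1 if inrange(sic, 100, 999) | inrange(sic, 2000, 2399) | ///
    inrange(sic, 2700, 2749) | inrange(sic, 2770, 2799) | ///
    inrange(sic, 3100, 3199) | inrange(sic, 3940, 3989)
* ...repeat with the published ranges for industries 2-11, then:
replace ff12 = 12 if missing(ff12)   // 12 = Other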

putdocx basic summary statistics

I am running Stata 17 SE.

I would like to send to a .docx file the output of basic summary statistics commands, namely
Code:
summarize
tabulate
codebook
describe
However, I can't do the usual
Code:
putdocx table table1=etable
presumably because the summary statistics commands do not generate tables, even though their output does look a lot like tables.

For individual elements, the manual suggests
Code:
putdocx text (" `r(mean)' ")
etc

But I am not interested in one element here or there to include in a largely text report... I would like to send the whole output to the docx file.
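One Stata 17 route (a sketch, using the auto data as a stand-in): rebuild the summary with -table-, whose results live in a collection that can be exported straight to .docx:
Code:
sysuse auto, clear
table (var) (result), statistic(count price mpg weight) ///
    statistic(mean price mpg weight) statistic(sd price mpg weight) ///
    statistic(min price mpg weight) statistic(max price mpg weight)
collect export summary.docx, replace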

Reshape dataset including variables names as an additional variable

Hello Statalist community,

I don't have yet much experience using Stata and hope you can help me with a problem I am encountering.

I have a dataset with a time variable (years) and industry variables in which the observations represent the respective industry's beta for each year.
I need to reshape the dataset so that all the betas are under a single variable (Ind_Beta), with an additional variable that identifies the industry, repeating the years as needed.

Please find below a simplified example of the current dataset and the desired structure, followed by the dataset example generated by stata dataex.

Any help or hint in the right direction would be greatly appreciated.

Thanks a lot in advance!


Current Dataset Structure:

year   Agric   Food   Drug   Books
2000   1.12    1.45   0.97
2001   1.05    1.6    0.88
2002   1.18    1.34   0.92

Desired Structure:

year   Industry   Ind_Beta
2000   Agric      1.12
2000   Food       1.45
2000   Drug       0.97
2000   Books
2001   Agric      1.05
2001   Food       1.6
2001   Drug       0.88
2001   Books
2002   Agric      1.18
2002   Food       1.34
2002   Drug       0.92
2002   Books
Dataset example generated by stata dataex:

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int year float(agric food beer smoke toys fun books)
2000  .3452645 .13540165 .004693561 .10387945  .5594783  .7765604   .672514
2001  .4666844 .19018325  .22386266 .13462715  .7232649 1.2980112  .6057138
2002  .4180116  .4688401   .4606416  .3165837  .8264946 1.0237317  .7255396
2003  .6058208  .5994597  .54949224 .58115196  .8094561  1.174974  .8105838
2004   .645611   .618094   .5450272  .6523659 1.0529572 1.0226463  .6682123
2005  .7191871  .6432664   .6079998  .8309416 1.2295903 1.1020511  .7842661
2006  .7900746  .5786355   .4802569  .4810273  .9801628  .9165996  .7079894
2007 1.3577225  .7155325   .5886528   .634958  .7687722 1.0139668  .8080173
2008 1.1235923  .6186133   .5526155  .6556639  .8520126  1.213102  1.132914
2009  .6970165 .52631867   .4582001  .3303039 1.0007415 1.5126996  1.307686
2010  .7568682  .5700232   .5300202  .6517677 1.0587844 1.4952796 1.1704426
2011 1.0809941  .6031252   .5268776  .5393996   .980441 1.2700168 1.1069671
2012 1.0136213   .572837  .52827054  .5611398 1.0247171 1.2806007  .9967803
2013  .9104237  .8755165   .7196253  .7702268  .9460964  1.228237  1.176336
2014  .7878072  .7916082   .5841157  .6283697 1.0399202  1.299612 1.1335369
2015  .9148453  .7809891   .7545701  .7404549 1.0749298  1.199249 1.0286957
end
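A minimal sketch: give every industry variable a common stub, reshape long, and the j() index becomes the industry identifier:
Code:
foreach v of varlist agric-books {
    rename `v' beta`v'
}
reshape long beta, i(year) j(Industry) string
rename beta Ind_Beta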





xtunitroot error 'too many variables specified'

Good afternoon.

The dataset looks like this:

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input long pais float(year ln_co2pc_gr ln_gdppc_gr co2pc_gr gdppc_gr res_share_ch ei_ch)
1 1970            .           .            .            .          .             .
1 1971    .04318523           .    .04413044            .  -.6838717             .
1 1972  -.003316879           .  -.003311346            .  -.5843963             .
1 1973   .035467148           .   .036103528            . -.19065896             .
1 1974   -.01744461           .  -.017293148            .  1.1816183             .
1 1975   -.04150105           .   -.04065147            . -1.1635613             .
1 1976    .02756405           .   .027947137            .  -.4502255             .
1 1977      .027318           .    .02769508            .   .4519265             .
1 1978   -.01889515           .   -.01871785            .   .7232267             .
1 1979    .03856087           .    .03931371            . -.03159031             .
1 1980  -.033974648           .   -.03340414            .  1.6892276             .
1 1981   -.03340912           .  -.032857135            .   .3562208             .
1 1982  -.029852867           .   -.02941179            .     .22335             .
1 1983   .010718346           .   .010776284            .  .55891395             .
1 1984  -.025982857           .   -.02564822            .   .3025955             .
1 1985   -.09177828           .   -.08769248            .   .2845615             .
1 1986    .06215334           .    .06412572            . -.57471985             .
1 1987   .035816193           .     .0364652            . -.04727695             .
1 1988    .01995754           .   .020158285            . -2.0891473             .
1 1989     -.051548           .   -.05024241            .  -.8556552             .
1 1990   -.08002138           .   -.07690334            .  2.5997005             .
1 1991    .04825544   .07338333    .04943905    .07614343  -.7621235   -.003763424
1 1992   .014944077   .06285858   .015055898   .064876065   .8977495   -.004395153
1 1993   .020612717   .06581497   .020826785    .06802921   .9866562   -.005773365
1 1994    .02478409   .04406929     .0250941    .04505472  .05130513   .0004600968
1 1995  -.012306213  -.04115677  -.012230685   -.04032156   .1040608    .003823304
1 1996    .04091835   .04185772    .04176683    .04274599   -.889446  -.0021086514
1 1997  .0079956055   .06639385   .008027056    .06864754   .5504242  -.0032310374
1 1998   .014033318   .02645588   .014132484    .02680954  .14474016  -.0015993596
1 1999   .029364586  -.04557419    .02980039   -.04455157 -1.4123623    .004682073
1 2000  -.013619423 -.018927574  -.013527458  -.018749123  -.7077209    .006655703
1 2001   -.04242325  -.05601025   -.04153631   -.05447092  1.7975198  -.0001920025
1 2002  -.064777374  -.12618446   -.06272339   -.11854777   .4461319      .0078657
1 2003    .05957603   .07396126     .0613862    .07676546  -.9268353    -.00195185
1 2004    .07559109    .0758953    .07852178    .07884938 -1.3884256  .00006687366
1 2005    .02943516  .074453354   .029872544    .07729475  .53251714   -.005813921
1 2006    .04256248   .06724262    .04348141    .06955548  .10701164  -.0033557636
1 2007    .08056164   .07625961    .08389506    .07924238 -1.0392638  -.0004027938
1 2008    .02333355   .02984524    .02360851    .03029477   .4650531  -.0015919253
1 2009   -.07792091  -.07100487   -.07496273   -.06854227  1.7649453  -.0008191558
1 2010    .07104397    .0862999    .07362857    .09013356   .3953756   -.003054402
1 2011   .036629677   .04797363    .03730834      .049143  .19075124   -.001405331
1 2012   .005539894  -.02078247     .0055553  -.020567933  -.5361876   .0016440516
1 2013   .017653465  .013266563   .017810239    .01335466 .006848871   .0002165544
1 2014   -.02623844 -.035855293  -.025896933  -.035220277   .9186995   .0021337224
1 2015    .01380539  .016727448   .013901046   .016867941 -1.0696038   -.001246282
1 2016  -.014710426 -.031025887  -.014602168   -.03054932   .5231693    .001930914
1 2017  -.024267197   .01653099   -.02397574   .016668873  1.1013072  -.0032586295
1 2018   -.04940796 -.034734726   -.04820735   -.03413885  -.1671047   .0007113228
1 2019   -.03939629 -.031279564      -.03863   -.03079537  -.0826373  .00057273713
1 2020            .           .            .            .          .             .
2 1970            .           .            .            .          .             .
2 1971    .03735685           .    .03806343            .  -3.583941             .
2 1972   .006715775           .   .006738317            .  -3.595976             .
2 1973     .0723815           .    .07506522            .  1.0047908             .
2 1974  -.013946533           .  -.013849764            .  -2.443425             .
2 1975    .06486988           .   .067020305            .  -5.076143             .
2 1976    .06060696           .    .06248108            .   1.480327             .
2 1977    .15053034           .     .1624507            .  -.9558575             .
2 1978    .05724525           .    .05891533            .   -5.38981             .
2 1979     .2042713           .      .226631            .  1.2600955             .
2 1980   .064723015           .   .066863455            .   2.294343             .
2 1981  .0004091263           .  .0004090958            . -4.1512866             .
2 1982  -.010876656           .  -.010817155            .  -1.286386             .
2 1983   -.08180618           .   -.07854988            .  -.2205604             .
2 1984    .02359295           .    .02387344            .  .52316314             .
2 1985   -.05776501           .   -.05612823            .  -.6478212             .
2 1986      .281785           .     .3254935            .  -.4750064             .
2 1987   .062625885           .      .064629            .  -5.436717             .
2 1988   .008115768           .   .008148801            .   .7065676             .
2 1989    .05866051           .    .06041504            . -1.0042635             .
2 1990   -.02052593           .   -.02031715            .   .9601043             .
2 1991 -.0022001266   -.0434866  -.002197118    -.0425546   -.832518     .01618729
2 1992   -.12391949  -.06234837   -.11654889   -.06044444  -1.431724   -.023318866
2 1993  -.013767242 .0041484833   -.01367371  .0041567744  -1.408625   .0024125425
2 1994     .1101904  .016054153      .116491   .016183637   1.988639    .008190635
2 1995   .034041405  .015937805    .03462734     .0160664 -1.8642867    .006239843
2 1996     .0448246  .034890175    .04584408   .035505243   .2397784   .0004650287
2 1997    .12469387   .04205894     .1328022    .04295658  -.6353271      .0079945
2 1998   -.02123642  .032331467  -.021012744   .032859545  .59386986    -.00912828
2 1999    .05735397 -.000790596    .05903048 -.0007899858   -2.49613    .004564094
2 2000    .04701233   .03958511    .04813543    .04037887   2.706013    .004879483
2 2001   -.04536247 -.027560234   -.04434931   -.02718382 -1.0512298    -.00504469
2 2002  -.015072823 .0044546127  -.014959444   .004464366  .05328191   -.002887168
2 2003  .0005598068  .018220901 .00055978925   .018387806 -.10914221   -.002539354
2 2004   .012296677  .010541916   .012372726   .010597872  -.2341628 -.00010549345
2 2005   .017718315  .035025597   .017876115   .035645753   .7751106   -.003184567
2 2006   .008773804    .0531826   .008812081    .05462261 -1.2377497    -.00716155
2 2007   .010412216   .01738453   .010466738   .017535998   .3962357  -.0003483627
2 2008    .15820885  .002524376     .1714113   .002528249   -.927561     .01863219
2 2009  -.035902023  -.05624008   -.03526547   -.05468835   .4229553    .003304608
2 2010   -.07738018  -.02677727   -.07446237   -.02642146  -.8142483    -.00818127
2 2011     .0507679 -.009784698    .05207909    -.0097374  -.3567042    .007837007
2 2012   -.06547451 -.007086754   -.06337695   -.00706133 -.08397305  -.0083234655
2 2013   .010436058 -.016260147   .010490787  -.016129091  -1.083137   .0025542805
2 2014   -.09462738 -.003108978   -.09028804  -.003103939  .20679207    -.01407241
2 2015  -.010979652  .022444725  -.010919874   .022698896 -1.1899234   -.004809058
2 2016   -.06432438  .022831917   -.06229942   .023094524   .5234437   -.009212221
2 2017   .026329994   .00321579   .026679393    .00322063   .5948993    .004054799
2 2018    .01234913 -.007247925   .012426493  -.007221923   .7516962    .000632762
end
label values pais pais
label def pais 1 "Argentina", modify
label def pais 2 "Barbados", modify
The example shows two countries, but I have many more: 25 in total.

My problem is the following.

When I execute:

Code:
xtunitroot ips ln_co2pc_gr ln_gdppc_gr if year > 1990 & year <2020
I receive a message:

too many variables specified
r(103);

It only works if I use one variable:

Code:
xtunitroot ips ln_co2pc_gr if year > 1990 & year <2020
Is running one test per variable equivalent to running a single test with all the variables?

What's the problem?
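For what it's worth, xtunitroot's syntax takes a single variable, so a loop is the natural workaround (a sketch):
Code:
foreach v in ln_co2pc_gr ln_gdppc_gr {
    xtunitroot ips `v' if year > 1990 & year < 2020
}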

Thank you for any answers and help!


Rename Variables using a loop

Good morning
My problem is the following.
I would like to rename my variables using a loop.
My initial situation is this:
I have a set of variables with this name v1 v2 v3 .... v235

I would like to rename these variables in groups, assigning each group of variables the same run of years, from 2013 to 2019.

So as a final result I would like this:
REVENUES2013
REVENUES2014
REVENUES2015
REVENUES2016
REVENUES2017
REVENUES2018
REVENUES2019
EBITDA2013
EBITDA2014
EBITDA2015
EBITDA2016
EBITDA2017
EBITDA2018
EBITDA2019
........

as code I had thought of using this:
local y = 2013
foreach v of var * {
rename (v1-v7) REVENUES`y '
rename (v8-v14) EBITDA`y '
.....
......
local ++ y
}

But unfortunately it doesn't work (I'm a newbie in Stata).
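A sketch (assuming v1-v7 are the REVENUES years, v8-v14 the EBITDA years, and so on; note that Stata variable names cannot contain spaces):
Code:
local stubs REVENUES EBITDA   // extend with the remaining group names
local g = 0
foreach s of local stubs {
    forvalues k = 2013/2019 {
        local ++g
        rename v`g' `s'`k'
    }
}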
Thank you for any answers and help!

Losing a value after dropping duplicates

Hi everyone

I need a general code to fill the remaining cells of "cvcie0" with the same value, for the same "gvkey" and the same "fyear".
E.g.: every row with gvkey 1001 and fyear 2011 should have the value 73 in the column "cvcie0".
I calculated the value 73, but it gets dropped when dropping duplicates; I believe it will not be lost if every row carries the value.

In my data example, I lose the value in the column "cvcie0" when dropping duplicates using this code:
__
quietly by gvkey fyear: gen dup = cond(_N==1,0,_n)
drop if dup>1

__

I hope this data example is fine. As I work with confidential data, I needed to build a little example dataset.
And I hope I have used dataex the right way.

Thank you so much!
Best,
Jana

----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int(gvkey fyear) byte(cvcie1 cvcie0)
1001 2011 55  .
1001 2011  . 73
1001 2011  . 73
1001 2011  . 73
1001 2011  . 73
1001 2011  . 73
1001 2011 55  .
1001 2011 55  .
1001 2011  . 73
1001 2012 64  .
1001 2012 64  .
1001 2012  . 23
1001 2012  . 23
1002 2011 12  .
1002 2011 12  .
1002 2011  . 15
1002 2011  . 15
end
------------------ copy up to and including the previous line ------------------

Listed 17 out of 17 observations
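A sketch of one way to spread the value before de-duplicating: within each gvkey-fyear group, missing sorts last, so after sorting the first observation holds the nonmissing value:
Code:
bysort gvkey fyear (cvcie0): replace cvcie0 = cvcie0[1] if missing(cvcie0)
bysort gvkey fyear (cvcie1): replace cvcie1 = cvcie1[1] if missing(cvcie1)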

How to select optimal GMM model

Greetings,

I am estimating several specifications for the macroeconomic determinants of non-performing loans using system GMM with xtabond2. The literature doesn't have a theoretical model for this, so pretty much all specifications are empirical.

Usually with time series models, people select optimal specifications with information criteria like AIC or BIC, but from what I read those methods apply only to maximum likelihood estimation (and I couldn't find a way to apply them with xtabond2). I would like to know if there is a standard way to select a specification among several GMM models.

best wishes,
Tarek Tuma

Poisson regression

Hi everyone!
this platform has been a great help for me. I would like to ask some questions on Poisson regression.

Background: I want to estimate the income elasticity of the number of children in the household and the income elasticity of years of education. A log-log model could give me the required answers, but there are 0's in the income variable and in the two dependent variables (years of schooling and number of children).
So I thought of using a Poisson GLM.

Questions:
1) Does Poisson imply a log-linear form of the regression? And if I take the log of a dependent variable (number of children/years of education) that has zeros in it, would that be fine?

2) As the independent variable I am using income, which has huge values. To lower the magnitude I use the log of income. Can I take the log of income (the independent variable) in Poisson?
3) Can I log-transform income knowing that it has 0 values in a few observations?
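For what it's worth, a sketch (variable names are my own): with -poisson- the count outcome stays in levels, so its zeros are not a problem, and a logged regressor makes the coefficient an elasticity; the zeros in income itself still drop out of the log:
Code:
gen ln_inc = ln(hh_income)          // observations with hh_income == 0 become missing
poisson n_children ln_inc, vce(robust)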

I really look forward to your help.
Thank you in advance

Renaming date variables

Hello gang

I'm trying to fix my date variable, specifically the one showing the month, so that it looks a little "cleaner". Initially I had a date variable showing the date on which an insurance customer set up an insurance offer: "date_insurance_offer". I then used this to make two new variables showing the dates in months and quarters, by writing the code:

generate insurance_offer_month = mofd(date_insurance_offer)
format insurance_offer_month %tm
generate insurance_offer_quarter = qofd(date_insurance_offer)
format insurance_offer_quarter %tq

Now the variable showing the quarters is fine, but the variable showing the months shows 2020m1 for January 2020, 2020m2 for February 2020, etc. What I want is to make the variable "insurance_offer_month" show the months by name. For example, 2020m1 would preferably be 2020jan, 2020m2 would be 2020feb, etc., or something similar.

Is this possible somehow?
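This may be only a display-format question; a sketch (in datetime display formats, CCYY is the four-digit year and mon the lowercase month abbreviation):
Code:
format insurance_offer_month %tmCCYYmon   // shows e.g. 2020jan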

Appreciate any help!

Help - estout export to csv - a whole row is concentrated in one cell

Hi friends,

I am trying to export a table to Excel as CSV using estout. Instead of each value being put in its own cell (as ="..."), for some of the lines I get one cell containing all the values, some of them distorted.

Your help will be appreciated, Thanks!

[attachments: screenshots of the distorted output not shown]
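A sketch of one common culprit: Excel locales that use ";" as the list separator will cram a comma-separated row into a single cell; estout's delimiter() option can force a different separator (adjust to your locale):
Code:
esttab using results.csv, delimiter(";") replace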

Collapse and generate average values

Dear,

I want to collapse the years into periods of, say, two years each. For example, if I have a panel from 1990 to 1995, I want period I = 1990-1991, period II = 1992-1993, and period III = 1994-1995. Within each period, I will then calculate the average (mean) or identify the mode for each variable.

How can I do that in Stata?

Let me give an example with the following made-up dataset:

Code:
input str10 country year x y
"Indonesia" 1990 24 36
"Indonesia" 1991 28 22
"Korea" 1990 38 27
"Korea" 1991 42 73
"China" 1990 124 458
"China" 1991 12 24
end

I want to collapse 1990 and 1991 into just one period, so the x-value for Indonesia would be (24 + 28)/2 and the corresponding y-value (36 + 22)/2.

Now, imagine I also have rows for the years 1992-2021. That means I can't just do "collapse varlist, by(country)". I will first need to create a variable called "period" that takes the value 1 for 1990 & 1991, 2 for 1992 & 1993, and so on.

I could do that manually one by one, but there must be a smart way to do this in Stata.
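A sketch of the usual trick: derive the period from the year arithmetically, then collapse:
Code:
gen period = ceil((year - 1989)/2)   // 1990-91 -> 1, 1992-93 -> 2, ...
collapse (mean) x y, by(country period)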

Is it possible to get help?

Best,

Robust Standard Errors how to get value for Wald Chi^2?

Hello everyone!

I am running panel data regressions and am using the 'robust' option, as I think my model has heteroskedasticity (I am not sure, but when I use the robust option most of my p-values decrease considerably, so I think I should be using 'robust', as it makes my coefficients significant).

Unfortunately, by using 'robust' I do not get any values for the Wald chi2 and Prob > chi2, which I need.

One last question: I am running a fixed-effects model (where governance captures country fixed effects), but my table's header says "Random-effects GLS regression"; does this matter?


Here is my model without the robust command, followed with the one using the robust command. Just to put into context, I am using dummy variables for my three countries, and time lags.

I really appreciate your help as I am new with Stata and am a bit overwhelmed with econometric textbooks,

thank you very much.


Fred Nolan.

[attachments: regression output screenshots not shown]

egen, group

Dear All, Suppose that I have this data set (the original question is here),
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int(A B) str9 C float(ID1 ID2)
1 2010 "艾一" 2 1
1 2011 "张三" 1 2
1 2012 "张三" 1 2
2 2010 "李四" 3 3
2 2011 "李四" 3 3
3 2012 "车八" 6 4
3 2013 "王五" 5 5
3 2014 "李白" 4 6
end
The raw data have three variables, A, B, and C (names). If I use
Code:
egen ID1=group(A C)
I obtain the ID1 variable. However, the desired outcome is ID2 (keep the order/names of C unchanged, with group numbers increasing in order of first appearance). Any suggestions? Thanks.
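A sketch of a standard trick: tag each group with the observation number of its first appearance, then group on that:
Code:
gen long obsno = _n
egen long first = min(obsno), by(A C)
egen ID2 = group(first)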

How to create an aggregate observational unit by adding values by country and year.

Hello,
I have the following dataset for a number of countries.
I would like to generate an aggregate entry (call it africa under variable "country") which is a sum of all the values of the variable "ca" for all the countries by year.
I cannot figure out how to do it except doing it manually in Excel which is obviously not handy.
Can some help me:
Should I need to clarify my question, please let me know.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int year float index str87 country float ca
1980 15 "Algeria"                   61248.55
1981 15 "Algeria"                      63086
1982 15 "Algeria"                   67123.51
1983 15 "Algeria"                   70748.18
1984 15 "Algeria"                   74710.08
1985 15 "Algeria"                   77474.35
1986 15 "Algeria"                   77784.24
1987 15 "Algeria"                   77239.76
1988 15 "Algeria"                   76467.36
1989 15 "Algeria"                   79831.92
1980 39 "Benin"                      1860.74
1981 39 "Benin"                     1965.387
1982 39 "Benin"                     2177.307
1983 39 "Benin"                     2082.683
1984 39 "Benin"                     2247.842
1985 39 "Benin"                     2417.076
1986 39 "Benin"                     2469.006
1987 39 "Benin"                     2431.971
1988 39 "Benin"                     2505.768
1989 39 "Benin"                     2488.038
1980  1 "Botswana"                  1516.087
1981  1 "Botswana"                  1694.112
1982  1 "Botswana"                  1837.036
1983  1 "Botswana"                  2055.612
1984  1 "Botswana"                  2335.366
1985  1 "Botswana"                  2550.895
1986  1 "Botswana"                  2738.062
1987  1 "Botswana"                   2963.26
1988  1 "Botswana"                  3325.464
1989  1 "Botswana"                   4431.94
1980 14 "Burkina Faso"              2192.041
1981 14 "Burkina Faso"              2288.216
1982 14 "Burkina Faso"              2339.263
1983 14 "Burkina Faso"              2311.912
1984 14 "Burkina Faso"              2349.373
1985 14 "Burkina Faso"              2656.291
1986 14 "Burkina Faso"              2867.462
1987 14 "Burkina Faso"              2860.642
1988 14 "Burkina Faso"              3026.523
1989 14 "Burkina Faso"               3091.73
1980 30 "Burundi"                    1344.64
1981 30 "Burundi"                   1491.376
1982 30 "Burundi"                   1484.951
1983 30 "Burundi"                   1531.115
1984 30 "Burundi"                   1529.591
1985 30 "Burundi"                   1708.338
1986 30 "Burundi"                   1773.586
1987 30 "Burundi"                     1846.2
1988 30 "Burundi"                   1936.524
1989 30 "Burundi"                    1965.25
1980  2 "Cabo Verde"                 278.052
1981  2 "Cabo Verde"                 301.547
1982  2 "Cabo Verde"                 310.067
1983  2 "Cabo Verde"                  339.59
1984  2 "Cabo Verde"                 352.435
1985  2 "Cabo Verde"                 382.893
1986  2 "Cabo Verde"                 393.889
1987  2 "Cabo Verde"                  410.86
1988  2 "Cabo Verde"                 435.498
1989  2 "Cabo Verde"                 460.317
1980  3 "Cameroon"                 12794.186
1981  3 "Cameroon"                 13726.994
1982  3 "Cameroon"                 14749.424
1983  3 "Cameroon"                 15988.812
1984  3 "Cameroon"                 17275.252
1985  3 "Cameroon"                 17681.533
1986  3 "Cameroon"                 16752.086
1987  3 "Cameroon"                 15813.303
1988  3 "Cameroon"                 15216.575
1989  3 "Cameroon"                 14737.848
1980 35 "Central African Republic"   1490.16
1981 35 "Central African Republic"  1513.225
1982 35 "Central African Republic"  1540.405
1983 35 "Central African Republic"  1446.484
1984 35 "Central African Republic"  1550.996
1985 35 "Central African Republic"  1602.746
1986 35 "Central African Republic"  1660.092
1987 35 "Central African Republic"  1578.233
1988 35 "Central African Republic"  1611.128
1989 35 "Central African Republic"   1651.01
1980 31 "Comoros"                    421.845
1981 31 "Comoros"                    438.149
1982 31 "Comoros"                    466.119
1983 31 "Comoros"                    488.591
1984 31 "Comoros"                    508.673
1985 31 "Comoros"                    518.847
1986 31 "Comoros"                    529.224
1987 31 "Comoros"                    537.892
1988 31 "Comoros"                    552.348
1989 31 "Comoros"                    543.308
1980 16 "Congo"                     3243.096
1981 16 "Congo"                     3931.224
1982 16 "Congo"                     4857.743
1983 16 "Congo"                     5130.247
1984 16 "Congo"                     5500.578
1985 16 "Congo"                     5437.178
1986 16 "Congo"                     5062.998
1987 16 "Congo"                     5072.587
1988 16 "Congo"                     5162.182
1989 16 "Congo"                     5253.278
end
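A sketch: build the yearly totals in a temporary file and append them:
Code:
preserve
collapse (sum) ca, by(year)
gen country = "Africa"
tempfile africa
save `africa'
restore
append using `africa'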

Finding Market Shares

Hi all,

I have data in the following format. I need to calculate the market share, which is simply the percentage of sales of the brand in the given year and market. For example, A in the table should be sales of brand 1 in year 2000 and market 1 over total sales in year 2000 and market 1, i.e. 56 / (56 + 25). Similarly, B should be 34 / (34 + 54 + 43). Finally, C should be 47 / (47 + 43). Any coding help for such a problem is much appreciated.
Year   Market   Brand   Sales   Market Share
2000   1        1       56      A
2001   1        1       34      B
2002   1        1       45
2001   1        2       54
2002   1        2       46
2000   1        3       25
2001   1        3       43
2002   1        3       41
2000   2        1       47      C
2001   2        1       65
2002   2        1       62
2001   2        2       21
2002   2        2       43
2000   2        3       43
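A minimal sketch (assuming the columns are variables named Year, Market, and Sales): total sales within each year-market cell, then divide:
Code:
bysort Year Market: egen total_sales = total(Sales)
gen MarketShare = Sales/total_sales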

Saturday, October 30, 2021

group string var with random names

Hi,

I have a list of investor names that are not uniform, unlike the "Investor_group" variable. I want to create a variable like Investor_group from the variable Investor_name. Please suggest how to do it.
Investor_name -> Investor_group
Blue Ocean -> Blue Ocean Partners
Blue Ocean Partners -> Blue Ocean Partners
Blue Ocean Partners LLC -> Blue Ocean Partners
Breakthrough Energy -> Breakthrough Energy
Deutsche Bank -> Deutsche Bank
Goldman -> Goldman Sachs
Goldman Sachs -> Goldman Sachs
Goldman Sachs, Inc -> Goldman Sachs
Google -> Google
Google Ventures -> Google
J.P. Morgan -> JP Morgan
JP Morgan -> JP Morgan
JP Morgan Chase -> JP Morgan
Kleiner Perkins -> Kleiner Perkins
Kleiner Perkins Caufield & Byers -> Kleiner Perkins
Biomet Orthopedics, LLC -> Biomet
Biomet Spine, LLC -> Biomet
Biomet Trauma, LLC -> Biomet
Biomet Sports Medicine, LLC -> Biomet
BIomet 3i, LLC -> Biomet
Biomet Microfixation, LLC -> Biomet
Biomet Biologics, LLC -> Biomet
Davol Inc. -> C. R. Bard
Bard Peripheral Vascular, Inc. -> C. R. Bard
C. R. Bard, Inc. & Subsidiaries -> C. R. Bard
Bard Access Systems, Inc. -> C. R. Bard
DePuy Synthes Products LLC -> DePuy
DePuy Mitek LLC -> DePuy
DePuy Orthopaedics Inc. -> DePuy
Synthes USA Products LLC -> DePuy
DePuy Spine, LLC -> DePuy
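A sketch of one route: fuzzy-match the raw names against a clean master list with -matchit- (SSC); the file and variable names here are my own:
Code:
* ssc install matchit
matchit id investor_name using master.dta, idusing(gid) txtusing(investor_group)
gsort id -similscore
* then keep the best match per id and review borderline scores by hand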

ttesti equivalent of oneway ANOVA

Hi all,

I need to compare the means of a parameter across 8 independent groups. I have the means and SDs, but I do not have the individual data.

I was wondering: is there any Stata syntax for a oneway ANOVA from summary statistics, the way ttesti works for t tests?

I very much appreciate help.

With warm regards,
Sateesh

repeated-measure vs. nested anova problems

I think I want to use the repeated-measures ANOVA commands, "anova, repeated()" or "wsanova", based on the instructions from this site (https://www.stata.com/support/faqs/s...easures-anova/), but I am unsure whether I can use these commands with the type of data I have. I am using Stata/MP 14.0 for Mac.

To describe the data, participants from two countries responded to items from three domains (i.e., A, B, and C). So the repeated component here is the domain. We created two conditions (high vs. low) nested within each domain so there were a total of six sub-domains (i.e., A high, A low, B high, B low, C high, and C low). Participants were randomized into either high or low condition within each domain. In other words, participants were not divided into a single condition, and their conditions differed depending on each domain. For instance, some participants answered questions in A-high subdomain, B-low subdomain, and C-high subdomain. I am attaching a portion of the dataset I have. ID is the participant number below.
ID   Country   Domain   Condition   Total
1    1         1        2           2
1    1         2        2           2
1    1         3        2           1.833333254
2    1         1        1           2.416666746
2    1         2        1           1.916666627
2    1         3        1           3.416666746
3    1         1        2           2.916666508
3    1         2        1           2.833333254
3    1         3        2           2.75
4    2         1        2           1.5
4    2         2        2           1.666666746
4    2         3        2           1.166666627
5    2         1        1           2.416666746
5    2         2        2           2.75
5    2         3        2           2.583333254
6    2         1        1           2.333333492
6    2         2        2           3.5
6    2         3        1           2
Could I use the following command to run Repeated-Measure Anova to examine the effect of two between-subject variables (country and condition) and one within-subject variable (domain)? The issue that I am currently having is that each participant could have been assigned to multiple conditions depending on the domain (e.g., participant 3, 5, or 6). If I can't use this command, could you give me some suggestions on what command/analysis I could use?

anova Total Country Condition Country#Condition/ ID | Country#Condition Domain Domain#Country Domain#Condition Domain#Condition#Country, rep (Domain)

**Another option I have been thinking of is nested ANOVA (two conditions nested within each domain), but I am not sure whether I could run interaction terms (Domain#Condition#Country) with nested ANOVA. To sum up, what I want essentially is to run an ANOVA with all three variables (Country, Domain, and Condition) included, and I am not sure what command to use.

I have been spending every waking hour from the past two weeks trying to resolve this so any advice would be truly appreciated. Thank you in advance for any advice!

So

Help with Recoding with Multiple Loops

Hello, I have a series of variables that look like job[i]_sched[y]. [i] varies from 1 to 6. [y] varies from 1997 to 2017, but the last waves of data were collected biennially, so the final years are 2010, 2011, 2013, 2015, 2017.

I wanted to recode the data using:

Code:
local yvars 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 ///
2011 2013 2015 2017
foreach i in 1 2 3 4 5 6 {
local y: word `i' of `yvars' //job number
recode job`i'_sched`y' (-6/-1=.) (2/6 8 = 0)(1 7 9 = 1),gen(job`i'_schedx`y')
}

However, when I run this code, it recodes only 6 variables:

Code:
(8734 differences between job1_sched1997 and job1_schedx1997)
(8742 differences between job2_sched1998 and job2_schedx1998)
(8865 differences between job3_sched1999 and job3_schedx1999)
(8913 differences between job4_sched2000 and job4_schedx2000)
(8960 differences between job5_sched2001 and job5_schedx2001)
(8979 differences between job6_sched2002 and job6_schedx2002)
How do I get the loop to run through every combination of [i] and [y]?

Please let me know if there is anything I should clarify in this post. I looked to see if I could find a similar post, but I did not.
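For what it's worth, a sketch with nested loops so that every job-year combination is visited (the capture guards any combinations that don't exist):
Code:
local yvars 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ///
    2010 2011 2013 2015 2017
forvalues i = 1/6 {
    foreach y of local yvars {
        capture confirm variable job`i'_sched`y'
        if !_rc {
            recode job`i'_sched`y' (-6/-1=.) (2/6 8=0) (1 7 9=1), gen(job`i'_schedx`y')
        }
    }
}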

qq Plots

Hello everybody,

I have estimated a parameter alpha for each participant with six different methods. I wanted to plot each method's alphas against another method's alphas. When I get the graphs, they look weird. Has anyone had qq plots like mine? Or what else could be a good solution for visualizing the data?
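A sketch of an alternative: for comparing two estimates of the same parameter, a simple scatter with a 45-degree reference line is often easier to read than a qq plot (the variable names are my own):
Code:
scatter alpha_m2 alpha_m1 || function y = x, range(alpha_m1) legend(off)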

Thanks.

svy:logit has more observations than simple logit command

I am working with survey data and use the svy prefix before running my logit regression, as follows: svy: logit Y X1 X2. Running this command, my number of observations is 2,774. Running the same command without "svy:" gives me 2,152 observations (these are all the observations that have no missings for the variables I am using). How is that possible?

addplot error after marginsplot

I ran a model with dependent variable gender and 3 independent variables: year, position_department_n, and the interaction of year and position_department_n. I'd like to show a plot of predicted probabilities with panels for each position_department_n, and add the observed data to the same graph. I can get the first part (predicted probabilities by position_department_n) by adding bydimension(position_department_n) to marginsplot (code below).

However, when I use addplot to add in the observed data, I get an error: "Graph family bygraph_g does not support addplot. r(198);". Is there a way around this to achieve what I am trying to do?

Code:
. quietly mlogit gender_n c.year##i.position_department_n, rrr vce(cluster person)

. quietly margins i.position_department_n, at(year=(1(1)7))

.
. **WITHOUT ADD PLOT
.
. marginsplot, bydimension(position_department_n) legend(pos(3)) recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) ytitle("Propor
> tion") plotd(,label( "F" "M" "U"))

  Variables that uniquely identify margins: year position_department_n _outcome


[attachment: marginsplot output not shown]

.
. **WITH ADDPLOT
.
. marginsplot, bydimension(position_department_n) legend(pos(3)) recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) ytitle("Propor
> tion") plotd(,label( "F" "M" "U")) addplot((scatter prop year if gender_n==4, msymbol(circle) mcolor(black) msize(vsmall)) (scatter prop year if gender_
> n==5, msymbol(circle) mcolor(black) msize(vsmall)) (scatter prop year if gender_n==6, msymbol(circle) mcolor(black) msize(vsmall)))

  Variables that uniquely identify margins: year position_department_n _outcome
Graph family bygraph_g does not support addplot.
r(198);
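A sketch of the usual workaround: save the margins to a dataset and rebuild the whole figure with -twoway, by()-, which happily layers the observed scatter on top; the variable names inside the saved file (_margin, _ci_lb, _ci_ub, _at1, _m1 below) should be checked with -describe- first:
Code:
margins i.position_department_n, at(year=(1(1)7)) saving(preds, replace)
preserve
use preds, clear
describe            // confirm the _at*/_m* names before renaming
rename _at1 year
twoway (rarea _ci_lb _ci_ub year, color(black%20)) ///
       (line _margin year), by(_m1)
restore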

svy: mean and sorting estimated means

Greetings, Statalisters.

I'm hoping someone is willing and able to explain this to me, since I've failed at figuring it out on my own. I need to sort estimated means into descending order, per year. I'm using the Gallup World Poll; the estimated means are of an index, by country per year, and respondent-level sampling weights need to be part of the computation to account for the survey design. Since I suspect sharing a dataex of that would make my question more complicated, the following mockup shows the situation and the desired outcome. Please see here:

I start with this dataset:
Code:
 clear all
 webuse highschool, clear
 svyset [pweight=sampwgt]

* making a fake year variable      
set seed 12345
generate rannum  = uniform()
sort rannum
generate year = .
lab var year "grad year"
drop rannum
replace year = 2009 in 1/999
replace year = 2010 in 1000/1999
replace year = 2011 in 2000/2999
replace year = 2012 in 3000/4071

* making a fake outcome variable 
set seed 54321
generate rannum  = uniform()
sort rannum
generate happy = .
lab var happy "happiness index"
drop rannum
replace happy = 1 in 1/700
replace happy = 2 in 701/2200
replace happy = 3 in 2201/4071
label define happy 1 "unhappy" 2 "neutral" 3 "happy"
label values happy happy
codebook, compact
I try this first:
Code:
*attempt 1 (using svy:mean with subpop for the if statement restricting year)
svy, subpop(if year==2009): mean happy, over(race) coeflegend
Which produces this:
Code:
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1        Number of obs   =      4,071
Number of PSUs   =   4,071        Population size =  8,000,000
                                  Subpop. no. obs =        999
                                  Subpop. size    =  2,016,463
                                  Design df       =      4,070

------------------------------------------------------------------------------
             |       Mean  Legend
-------------+----------------------------------------------------------------
c.happy@race |
      White  |   2.325196  _b[c.happy@1bn.race]
      Black  |   2.207406  _b[c.happy@2.race]
      Other  |    2.30351  _b[c.happy@3.race]
------------------------------------------------------------------------------
I try this next:
Code:
*attempt 2 (using weights with arithmetic and stock Stata)
sort race year
by race year: gen meanHappy = sum(happy* sampwgt) / sum(sampwgt)
by race year: replace meanHappy=meanHappy[_N]
tabstat meanHappy if year==2009, statistics(mean) by(race) columns(statistics)
Which produces a very similar, although slimmer, output:
Code:
Summary for variables: meanHappy
     by categories of: race (1=white, 2=black, 3=other)

  race |      mean
-------+----------
 White |  2.325196
 Black |  2.207406
 Other |  2.303509
-------+----------
 Total |  2.312006
------------------

What I need to somehow produce (using this silly demo example):


Code:
* a tabulation that sorts these means in descending order
 White |  2.325196
 Other |  2.303509
 Black |  2.207406
Anyone? I realize it looks ridiculous with this demo, but what I have is eleven years of 200 national means that I need to sort from highest to lowest each year. If I could make this demo work, I could make that work too. Thanks in advance for your time.
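A sketch of one route: compute the weighted means into a small dataset and sort it (collapse accepts pweights for means); for the standard errors mentioned below, -svy: mean- per year, with its stored results, is the safer tool:
Code:
preserve
keep if year == 2009
collapse (mean) happy [pweight=sampwgt], by(race)
gsort -happy
list race happy
restore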

Cheers,
Erika

Editing to add: I also will need to include a measure of precision of the estimates of these means - i.e. standard error of the mean. Haven't looked at that yet, since I'm stuck on this sorting task, but that's next and probably related.

Graph with Y axis as string

Hi everyone,

I have data like below: 25 different methods generate 25 different estimates of an OR and its 95% CI. I want a plot with the method names on the axis, so that readers can link each result to its method (this is easy in Excel). How can I do it in Stata?

Code:
__method __or_ __lci_or_ __uci_or_
A .6492 .258189 1.63237
B .682846 .133547 3.49148
C .6492 .10206 4.12953
D .586633 .299436 1.14929
E .594977 .303478 1.16647
F .342782 .134567 .873165
G .119268 .012092 1.17643
H .342782 .016632 7.06464
I .660587 .209265 2.08528
J .660587 .209265 2.08528
K .660587 .209265 2.08528
L .778194 .325669 1.85951
M .711595 .193472 2.61727
N .778194 .185898 3.25763
O .790423 .437463 1.42816
P .018483 .001191 .286725
Q .587435 .298901 1.15449
R .783607 .432965 1.41822
S .601916 .306494 1.18209
T .165793 .012218 2.24971
U .280154 .017255 4.54869
V .790423 .437463 1.42816
W .751405 .499381 1.13062
X .757768 .477562 1.20238
Y .880356 .200964 3.85654
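A sketch: encode the method names, draw horizontal CI spikes plus point estimates, and label the y axis with the names:
Code:
encode __method, gen(m)
twoway (rcap __lci_or_ __uci_or_ m, horizontal) ///
       (scatter m __or_), ///
    ylabel(1/25, valuelabel angle(0)) xline(1) xscale(log) legend(off)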
Thanks
Chang

Changing the value of many observations

Hello

So, I'm working on a pretty big dataset for a Norwegian insurance company, and I wanted to change one variable in regards to its observations, or rather what value Stata "reads" from it. Initially the observations were strings: essentially "Expired" for an insurance offer that was no longer valid, "Rejected" for an insurance offer rejected by the company, and "accepted" was just missing values.
I started off using "encode", which made "expired" = 1, "rejected" = 2, and "accepted" = "." (just a period). I do not really need the "rejected" observations, so I essentially want to make "expired" = 0 and "accepted" = 1. The variable name is "avsluttet_aarsak" (which loosely translates to "reason for termination/closure").
I used the code:

sort avsluttet_aarsak
replace avsluttet_aarsak = 0 in 1/762621 --> which worked for the "expired" observations, but when I wrote:
replace avsluttet_aarsak = 1 in 76262/1500403 --> then all the "accepted" observations became "rejected".

I assume this is because the underlying value that the "rejected" observations have is 1, due to the initial encoding, and this is why that happens.

Does anyone know of an easier and better way of doing this? I'm quite new to Stata so any help/input would be much appreciated
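A sketch that avoids positional replaces by working from the original string variable (I'm assuming the raw string values; adjust capitalization to your data):
Code:
drop if avsluttet_aarsak == "Rejected"
gen byte accepted = missing(avsluttet_aarsak)   // 1 = accepted (blank), 0 = expired
label define accepted 0 "expired" 1 "accepted"
label values accepted accepted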

Get the standard error from --exlogistic-- command?

Hi all,

I'm running the exlogistic command for meta-analysis, with id as study id:

Code:
exlogistic r id group, binomial(n) group(id)
Code:
Exact logistic regression                    Number of obs     =        576
Group variable: __id                         Number of groups  =          8

Binomial variable: __n                       Obs per group:
                                                           min =         26
                                                           avg =       72.0
                                                           max =        197

                                             Model score       =    2.41637
                                             Pr >= score       =     0.1258
---------------------------------------------------------------------------
         __r | Odds Ratio       Suff.  2*Pr(Suff.)     [95% Conf. Interval]
-------------+-------------------------------------------------------------
     __group |   .5874346          23      0.1664       .279409    1.216778
---------------------------------------------------------------------------
How can I get the standard error of the log OR? The following attempt does not work:
Code:
gen se = _se[__group]
Code:
no variables defined
r(111);
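For what it's worth, exact logistic regression does not produce a Wald standard error, so _se[] is empty; a rough device sometimes used in meta-analysis is to back an approximate SE of the log OR out of the exact CI (a sketch, with the caveat that exact CIs are not symmetric on the log scale):
Code:
display (ln(1.216778) - ln(.279409))/(2*invnormal(.975))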
Thanks
Chang

Friday, October 29, 2021

multilevel multinomial modelling using the gllamm command in stata

Good day everyone,
I am using the gllamm command in Stata 15.0 for a multilevel (2-level) multinomial logistic regression. The outcome/dependent variable is modern contraceptive use among women, and it has 4 sub-categories showing the users of the different types of modern contraceptives [(1) permanent-method and long-acting reversible contraceptive users; (2) medium-acting reversible contraceptive users; (3) short-acting reversible contraceptive users], each in contrast to traditional users/those not using any method at all (the reference sub-category). In my study, individual women (level 1) are nested within communities (level 2), and my main independent variable, the economic empowerment of women, is also introduced as a random slope. In relation to this, I typed the following command:
gllamm PLMSorNotnew CWEIcentre, i(ClusterNumber) link(mlogit) family(binomial) adapt nrf(2) eqs(inter slope) pweight(pw)

However, only one random intercept variance, one random slope variance, and thus one covariance were estimated instead of three each.

My understanding is that the single random-intercept variance shows the between-community difference in the odds of using any modern contraceptive, irrespective of the particular type, as opposed to being a traditional user/not using any method at all. The single random-slope variance, on the other hand, shows the between-community difference in the regression effect of women's economic empowerment on the use of any modern contraceptive, irrespective of type. Please, what is the correct command to produce the three different random-intercept variances, the three different random-slope variances, and the three different covariances?
Thank you.

Manually producing probabilites after logit

Dear All,

I estimate a logit model and then I need to calculate Prob[Y=1|X], where X is my set of regressors. Obviously, I can use:

Code:
predict probabilities
after the logit estimation. However, for some specific reasons I need to compute those probabilities "manually". Hence, I have:

Code:
webuse auto, clear
gen dummy = 0
replace dummy = 1 if price > 9500
logit dummy mpg trunk length
gen probabilities2 = 1/(1 + exp(-(_b[mpg]*mpg + _b[trunk]*trunk + _b[length]*length + _b[_cons])))
The above gives me the same results as the postestimation command predict. However, rather than writing out the full list of estimated parameters and regressors, I would like to find a shorter way. My attempt was:

Code:
logit dummy mpg trunk length
matrix bmatrix=e(b)
mat accum m = mpg  trunk length
matrix xb=bmatrix*m
gen probabilities=1/(1+exp(-(xb)))
My doubts: 1) Is the way I tried to generate xb correct? I am not sure about it. If not, how can I modify the calculation? 2) When I enter xb in the last line, I get the error message:

Code:
matrix operators that return matrices not allowed in this context

How can I include the generated values xb?
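A sketch of a short manual route: -matrix score- multiplies e(b) into the matching variables row by row, producing the linear index without typing out the coefficients (note that mat accum builds X'X, which is not what is needed here):
Code:
logit dummy mpg trunk length
matrix bmatrix = e(b)
matrix score double xb = bmatrix
gen probabilities2 = invlogit(xb)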

Thanks in advance for your help.

Dario


how many observations per year in panel data

Hi there,

I have a panel dataset. How can I see how many observations there are per year? I see in the Data Editor that some years have lots of missing observations, but I want to know exactly.
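A sketch (year is assumed to be the time variable; x stands in for the variable whose missings matter):
Code:
tab year                              // observations per year
bysort year: count if !missing(x)     // nonmissing values of x per year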

Thank you !

Eliminate a part of a string

Hello there,
I would like to eliminate a part of a string, starting from a position recorded in another variable and extending 2000 characters backward. I can't use dataex as the strings are very long.

To give you an example, let's say I want to eliminate the word BB (appearing at position 6) from the string text, plus the 3 characters before it:

text: AAAA BBC D
Bpos: 6
Wanted: AAC D
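A sketch for the toy example: keep what comes before the cut and what comes after the word (the word here is 2 characters long; for the real task, swap 3 for 2000 and 2 for the word length):
Code:
gen wanted = substr(text, 1, max(Bpos - 3 - 1, 0)) + substr(text, Bpos + 2, .)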

Thank you,

Ylenia

How do you de-trend an event study plot?

I have an event study graph that I plot using coefficients from my dependent variable regressed on the leads and lags of treatment. There appears to be slight seasonality in the plot. The data I use are micro-data, and I suspect that the source of the seasonality could be year-of-birth or quarter-of-birth effects. But my event study regression already includes all these fixed effects: year of birth as well as quarter of birth.

Someone saw the graph and suggested I could detrend the plot (for example, subtract the yearly mean from the betas). Is there a formal way I could do this? Do I transform the data before I run the regression (it's micro-data, so that might be a challenge), or do I transform the betas post-estimation?

Setting the outcome variable in Cox regression.

Hi,

I am trying to estimate hazard ratios for the age of first drinking; the goal is to measure the risk of underage drinking (<18 years). The values for age at first drinking are: 0 (never), 12, 13, 14, ..., 38.

I can run the analysis as it is, but I am certainly not doing it right, because I didn't specify the cut-off age anywhere. I am following a similar study that mentioned: "The outcome variable is the age at first drinking. For teens with no drinking experience at survey time, the age at interview was used to include them in the analysis (i.e., censored, under the terminology of survival analysis)."

Please feel free to share your insights on this, or link to any material that can teach the method.
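A sketch of the stset step implied by that quote (variable names are my own): time is age, drinkers fail at their age at first drink, and never-drinkers are censored at their interview age:
Code:
gen age_exit = cond(age_first_drink == 0, age_at_interview, age_first_drink)
gen byte drank = age_first_drink != 0
stset age_exit, failure(drank)
stcox x1 x2    // your covariates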

Thank you.


labsize in points generating strange result

Per the request of my publisher, I am going through a set of graphs and specifying text in point sizes. This has worked fine except in one case, in which the point size specified for the x labels produces a huge result. Here is the specification I'm using:
Code:
xlabel(1 `""1945-" "1950""' 2 `""1951-" "1960""' 3 `""1961-" "1970""' 4 `""1971-" "1980""' ///
       5 `""1981-" "1990""' 6 `""1991-" "2000""' 7 `""2001-" "2010""' 8 `""2011-" "2020""', labsize(11pt))
This is the only figure in which I have specified 2-line labels. It works fine if I omit the labsize specification and let it default to the standard size, or if I specify something like *1.5 to increase the size.

Any thoughts regarding what's happening and how to fix it would be appreciated.


Where is the dta file for math scores of pupils in the third and fifth years from different schools in Inner London (Mortimore et al. 1988)

I apologize for asking this question, because I assume I am missing something obvious. Many Stata examples use data from Mortimore et al. (1988). But I cannot find the .dta file on the Stata website. I located the mjsps5.dta file for these data, but it has missing values. Can someone point me to the file without missing values?

Error in dictionary file

I have a dataset in .txt format, which I have to read with a dictionary file. I have written the dictionary file according to instructions from the owners of the dataset. However, there seems to be a problem with one variable (that I know of; there might be other problems too). The variable STDIND comes out missing for about 80% of observations, even though it should not. If I run equivalent code in R, it works without giving me missing values, so there must be a problem with my Stata code (and I have to use Stata). This is how I wrote the dictionary file:
Code:
dictionary {
_column(1) int    ANNO    %4f "ANNO"
_column(5) int    TRIM    %1f "TRIM"
_column(6) int    REG    %2f "REG"
_column(8) int    numcff    %2f "SG4"
(...)
_column(587) int    STDFAM    %6f "STDFAM"
_column(593) int    STDIND    %6f "STDIND"
_column(599) int    NN2    %1f "NN2"
_column(600) int    RPN2    %1f "RPN2"
_column(601) int    TF    %2f "TF"
_column(603) int    TN2    %1f "TN2"
_column(604) int    F0_14    %1f "F0_14"
_column(605) int    CP0_7    %1f "CP0_7"
_column(606) int    CITTAD    %1f "CITTAD"
_column(607) int    WAVQUA    %1f "WAVQUA"
_column(608) int    nasita    %1f "SG13"
_column(609) int    citita    %1f "SG16"
_column(610) int    annres    %3f "SG18"
_column(613) int    NASSES    %3f "NASSES"
_column(616) int    CITSES    %3f "CITSES"
_column(619) int    RAPSES    %3f "RAPSES"
}
And this is the code I used to apply the dictionary file (2005_Q2_dict.dct is the dictionary file, sta_2005_2.txt is the dataset in txt):

Code:
clear
infile using "$PathDict/2005_Q2_dict.dct", using("$Path05Q2/sta_2005_2.txt")
Is there a problem with the code? Am I doing something wrong, or missing something?
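One hedged guess on my part: a Stata int holds values only up to 32,740, so a 6-digit field read as int would be turned into missing whenever it overflows. If that is the cause, the wide fields would need a larger storage type, along these lines:

Code:
* suspected fix: declare the 6-digit fields as long rather than int so
* values above 32,740 are not set to missing on input
_column(587) long    STDFAM    %6f "STDFAM"
_column(593) long    STDIND    %6f "STDIND"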

Elasticity analysis

Hi all,
I am in urgent need of a solution to a problem. It would be great if you could help.


Background of the problem:
I have a dependent variable, number of children, and an independent variable, log of household income. I need to find the income elasticity of the number of children with cross-sectional data.
Problem:
1. Can we take the log of "number of children" to estimate the income elasticity?
2. If I fit a lin-log regression model, can I derive the income elasticity by dividing the estimated slope by the average number of children?
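To make question 2 concrete, this is the calculation I have in mind (a sketch with placeholder names, for the model children = a + b*ln(income), where the elasticity at the mean is b divided by the mean number of children):

Code:
* lin-log fit, then elasticity at the mean = slope / mean of y
regress n_children ln_income
summarize n_children if e(sample), meanonly
display "income elasticity at the mean = " _b[ln_income]/r(mean)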


Any help is much appreciated.

Assigning random numbers using pre-defined observation-specific probabilities (result of an LCA) with rdiscrete

Hello Statalist,

As a result of an LCA, I have three variables containing the probability that each observation belongs to a specific class (prob_1, prob_2, prob_3). The values of all three variables are by definition between 0 and 1.
To account for uncertainty in class assignment (when using the LCA result as an independent variable), I want to run a simulation with a specific number of iterations. In each iteration, the assignment to a class should be random while taking the probabilities (prob_1, prob_2, prob_3) into account.
The following code should demonstrate what I want to do (regarding the assignment of the class):
Code:
generate wanted = . 
mata: st_store(., "wanted", rdiscrete(2100, 1, ("0.5", "0.2", "0.3")))
In my case, probabilities vary across observations. Therefore I would like to substitute the general probabilities with the individual probabilities saved in the variables prob_1, prob_2, and prob_3:
Code:
generate wanted = . 
mata: st_store(., "wanted", rdiscrete(2100, 1, ("prob_1", "prob_2", "prob_3")))
This, however, gives me the following error:
Code:
rdiscrete(): 3253 nonreal found where real required
<istmt>: - function returned error
I guess the error message means that only numbers can be put where I wrote "prob_1" etc.

Therefore I thought of saving the individual values in locals and then looping across all observations.
Something like this:
Code:
forvalues i = 1(1)2100 {
    local prob_1_local = prob_1[`i']
    local prob_2_local = prob_2[`i']
    local prob_3_local = prob_3[`i']
    display `prob_1_local'
    display `prob_2_local'
    display `prob_3_local'

    mata: st_store(`i', "wanted", rdiscrete(1, 1, (`prob_1_local', `prob_2_local', `prob_3_local')))
}
This code gives me the error:
Code:
sum of the probabilities must be 1
rdiscrete(): 3300 argument out of range
<istmt>: - function returned error
The error message is the reason why I added the display commands to the code. The displayed values in my case are:
.3423453
.56572497
.09192975
which add up to 1 almost perfectly, but apparently not exactly.

Can anyone help me with my approach? I'm also open to totally different approaches in case my whole approach was wrong.
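For what it is worth, here is the shape of an alternative I am considering (an untested sketch, assuming 2,100 observations): read the three variables into Mata, rescale each row so it sums to exactly 1 (guarding against the rounding issue above), and draw one class per observation.

Code:
generate wanted = .
mata:
P = st_data(., ("prob_1", "prob_2", "prob_3"))  // n x 3 probability matrix
P = P :/ rowsum(P)                              // force each row to sum to 1
for (i = 1; i <= rows(P); i++) {
    st_store(i, "wanted", rdiscrete(1, 1, P[i, .]))
}
end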

Thanks

Jay

Sum variables taking into account missing data

Hello,

I have the following dataset:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long gid int(country_code year) long actor_id
62356   . 1997    .
62357   . 1997    .
79600 710 2012 2664
79601 710 2015 2664
79601 710 2015 2664
79601 710 1999 2794
79601 710 2013 2664
79601 710 2013 2664
79601 710 1999 2794
79601 710 2013 2664
80317 710 2015 2794
80317 710 2012 2664
80317 710 2017 2664
80317 710 2015 2664
80317 710 2015 2664
80317 710 2012 2535
80317 710 2002 2794
80317 710 2008 2664
80317 710 2015 2664
80317 710 2013 2794
80317 710 2009 2794
80317 710 2014 2794
80317 710 2009 2794
80317 710 2008 2664
80317 710 2017 2794
80317 710 2015 2794
80317 710 2015 2794
80317 710 2015 2794
80317 710 2008 2794
80317 710 2012 2664
80317 710 2015 2774
80317 710 2010 2794
80317 710 2017 2664
80317 710 2015 2664
80317 710 2014 2794
80317 710 2016 2664
80317 710 2016 2664
80317 710 2015 2794
80317 710 2014 2664
80317 710 2015 2794
80317 710 2016 2664
80317 710 2009 2794
80317 710 2012 2664
80317 710 2017 2794
80317 710 1999 2664
80317 710 2013 2794
80317 710 2009 2664
80317 710 2015 2794
80317 710 2012 2664
80317 710 2017 2664
80317 710 2017 2794
80317 710 2016 3181
80317 710 2016 2664
80317 710 2009 2794
80317 710 2012 2535
80317 710 2012 2794
80317 710 2010 2664
80317 710 2012 2664
80317 710 2015 2794
80317 710 2010 2664
80317 710 2015 2664
80317 710 2012 2794
80317 710 2017 2794
80317 710 2015 2664
80317 710 2012 2535
80317 710 2017 2794
80317 710 2012 2794
80317 710 2014 2794
80317 710 2015 2794
80317 710 2017 2794
80317 710 2014 2794
80317 710 2016 2664
80317 710 2015 2794
80317 710 2008 2794
80317 710 2017 2664
80317 710 2015 2794
80317 710 2015 2794
80317 710 2006 2664
80317 710 2012 2664
80317 710 2017 2664
80317 710 2012 2664
80317 710 2017 2664
80317 710 2009 2794
80317 710 2017 2794
80317 710 2001 2794
80317 710 2015 2794
80317 710 2017 2794
80317 710 2012 2664
80317 710 2017 2664
80317 710 2017 2794
80317 710 2017 2664
80317 710 2009 2794
80317 710 2013 2188
80317 710 2014 2664
80317 710 2009 2794
80317 710 2015 2794
80317 710 2015 2664
80317 710 2017 2664
80317 710 2013 2664
80317 710 2012 2664
end
I would like to count the number of actors (variable actor_id) by year and gid. My final outcome should be a dataset that includes a column with the total number of actors by gid and year. In total, my final dataset should have four columns: gid, year, the total number of actors, and country_code (which does not change by gid and year; it is time invariant).

I am using the following code:

Code:
bysort gid year actor_id: gen nactors = _n
keep if nactors == 1
egen nactors2 = count(nactors), by(year gid)
It works for me if I do not have missing values. Unfortunately, I do. For example, in the sample I attach, for gid 62356 and year 1997, actor_id is missing. I would like the total number of actors for gid 62356 and year 1997 to be 0, but I am not able to achieve that.

In my dataset, if a gid-year pair has an actor, then there are no missing values of actor_id for that pair.
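For what it is worth, the direction I am currently exploring looks like this (an untested sketch):

Code:
* tag one row per distinct non-missing actor within gid-year, total the
* tags (missing-only pairs then total to 0), and keep one row per pair
egen byte tagged = tag(gid year actor_id) if !missing(actor_id)
egen nactors = total(tagged), by(gid year)
egen byte pair = tag(gid year)
keep if pair
keep gid year country_code nactors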

I hope you understand my problem; if not, please let me know. Any suggestion, help, or comment is more than welcome.

Error: <istmt>: 3499 ASREGFMB() not found

Good morning,
I am trying to run the asreg command and I receive this error. I checked that the installation is correct and that everything in Stata is updated (I am using Stata 17), but it does not seem to work.
A colleague of mine, with the same settings and everything, runs it and it seems to work.
Any suggestion on what I may be doing wrong?
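In case it is relevant: one hedged first step I have seen suggested for "function not found" errors is a clean reinstall, so that the Mata library defining ASREGFMB() is copied alongside the ado-file:

Code:
* reinstall asreg from SSC, overwriting any stale copy on the ado-path
ssc install asreg, replace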
Thank you.
Best,
Tania

Cmxtmixlogit discrete choice experiment with choice card blocks

Dear all,

I am encountering a convergence problem with the cmxtmixlogit command, using Stata 16.1 for Windows.

I performed a discrete choice experiment with 5 attributes (specified as random, hereafter regrouped under $att). I also included an ASC (specified as random) to represent the opt-out alternative. My dataset contains 1,476 observations: 82 decision makers (Id_farmer), each evaluating 6 choice cards (card) with 3 alternatives, including the opt-out (alternative). The cards are grouped into 2 blocks, so 12 cards exist in total. To analyse the data I wrote:

Code:
cmset Id_farmer card alternative
cmxtmixlogit chosen, random(ASC $att)

The model doesn't converge. Could it be related to the fact that each respondent only evaluates 6 out of the 12 cards?
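For completeness, the kind of generic tweaks I have been trying (without success so far) are more integration points and harder maximization settings, for example:

Code:
* more integration points for the random coefficients plus -difficult-;
* generic convergence aids, not a diagnosis of the underlying problem
cmxtmixlogit chosen, random(ASC $att) intpoints(1000) difficult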

Thanks in advance!

zero-inflated and right-censored count data

Dear all statalists,
Thank you for clicking on my post.

What is the proper way to deal with zero-inflated and right-censored count data?

I drew a histogram of the count data, which take integer values from 0 to 30, and found excess zeros and bunching at 30.
I have used the zinb command to deal with the zero inflation, but there does not seem to be a ul(30) option like the one implemented in cpoisson.
Please let me know how to deal with this in a practical way.

Thank you for taking the time to read my post.

Number of splitvoters from two variables

I need to know the number of split voters in Denmark - those who did not vote for the same party at the general and the local election.

An example:

A person votes for Socialdemokratiet at the general and the local election = not a split voter
A person votes for Socialdemokratiet at the general election and for Venstre at the local election = split voter


Variable 1: Which party did you vote for at the general election

(Danish parties below)

1. Socialdemokratiet
2. Det Radikale Venstre
3. Det Konservative Folkeparti
4. SF - Socialistisk Folkeparti
5. Kristendemokraterne
6. Liberal Alliance
7. Dansk Folkeparti
8. Venstre
9. Enhedslisten

Variable 2: Which party did you vote for at the local election

(Danish parties below)

1. Socialdemokratiet
2. Det Radikale Venstre
3. Det Konservative Folkeparti
4. SF - Socialistisk Folkeparti
5. Kristendemokraterne
6. Liberal Alliance
7. Dansk Folkeparti
8. Venstre
9. Enhedslisten


How can this be done in Stata? Should I generate a new variable to make this possible?
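My first instinct is a sketch like the one below, where party_gen and party_loc are placeholder names for the two vote variables:

Code:
* 1 = split voter, 0 = same party twice, missing if either vote is missing
gen byte splitvoter = party_gen != party_loc if !missing(party_gen, party_loc)
tab splitvoter
count if splitvoter == 1    // the number of split voters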


I hope someone can help me out here.. :-)


Best regards

Mikkel


Thursday, October 28, 2021

Panel data estimation

Hello. I need help with panel data estimation.
I am working on panel data with T=96 and N=260. I used the Hausman test, and it indicates FE as the appropriate model, as shown:
Code:
Table: Hausman Test
H0: difference in coefficients not systematic
chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 251.86
Prob > chi2 = 0.0000
(V_b-V_B is not positive definite)
But the heteroskedasticity and autocorrelation tests show that both problems are present:
Code:
Table: Wald test for heteroskedasticity
H0: sigma(i)^2 = sigma^2 for all i
chi2(252) = 2.8e+38
Prob > chi2 = 0.0000
Code:
Table: Wooldridge test for autocorrelation
H0: no first-order autocorrelation
F(1, 244) = 873.172
Prob > F = 0.0000
Given the above data and results, I need suggestions. Should I go for FGLS
Code:
xtgls
or use
Code:
xtreg
with the
Code:
cluster
option? And if I cluster, should I cluster on panel ID only, or on both panel ID and time?
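To fix ideas, the cluster-robust FE variant I have in mind is sketched below (y, x1-x3, and panelid are placeholders):

Code:
* FE with standard errors clustered on the panel identifier; clustered SEs
* are robust to heteroskedasticity and within-panel serial correlation
xtreg y x1 x2 x3, fe vce(cluster panelid)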

converting dates (year and month)

Dear All, Is there a more concise way to go from date to newdate below? Thanks.
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str6 date float newdate
"202101" 732
"202102" 733
end
format %tm newdate
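One candidate one-liner, if I am reading the monthly() help right, parses the string directly:

Code:
* "YM" tells monthly() the string runs year then month with no separator
gen newdate2 = monthly(date, "YM")
format %tm newdate2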

alternative to keep if for a large list of firm id s

Hi all,

I have a list of 17,000 firm ids. I also have a big dataset (dataset A) with over 5 million observations covering an even bigger set of firm ids. In dataset A, I want to keep only the firm ids that appear in my list (or, equivalently, drop the firm ids that aren't in the list). Can you please help me with this?
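In case it clarifies the goal, the solution shape I imagine (with placeholder names: the 17,000 ids saved in keep_ids.dta, keyed on firm_id, the same id variable as in dataset A) is a merge:

Code:
* keep only rows of dataset A whose firm_id appears in the id list
use datasetA, clear
merge m:1 firm_id using keep_ids.dta, keep(match) nogenerate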

Thank you.

Why are Frames Useful

Perhaps this is a question for another forum, but I had a more general question about data frames.

I know the data frame feature was added in Stata 16. Me personally, I'm old school: I started using Stata (semi-seriously) in my (senior?) year of college, with Stata 15, and now I'm a first-semester Ph.D. student with Stata 17. Thus, when data frames were added to Stata, I didn't know what they were or why I'd need them.

I know Python and R use data frames, and I get the basic concept behind them... but all the same, why are they useful? Why would I (or anyone) want to use them, ever? I never got the excitement behind storing multiple datasets in n frames at once. Could anyone a little more enlightened than me talk about why people would want to use data frames for serious applied work, such that they make life remarkably easier than pre-frames Stata?

It isn't that I dislike them or think they're bad; I've just never been in a situation where I thought to myself, "Gee, another dataset sure would come in handy right about now." Does anyone have any thoughts about this?

Reshaping Long Data Issue

I am trying to collapse these data, but I keep getting errors about i and j not being unique. Here is a snippet of the dataset:

country variable total
AUS CAP 20,525
AUS CAP 1,206
AUS CAP 1,475
AUS CAP 79,601
AUS CAP 9,928
AUS CAP 553
AUS CAP 1,180
AUS CAP 1,080

...

AUS EMPE 57
AUS EMPE 58
AUS EMPE 31
AUS EMPE 17
AUS EMPE 61
AUS EMPE 50


I am trying to get it to read like this, with these variables:

Country CAP EMPE

with the totals for each. If anyone can help, I'd appreciate it. I don't know where to start!
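A sketch of what I imagine is needed (assuming total is already numeric; if the commas mean it is stored as a string, it would need destring first):

Code:
* one total per country-variable cell, then spread CAP/EMPE into columns
collapse (sum) total, by(country variable)
reshape wide total, i(country) j(variable) string
rename total* *          // totalCAP, totalEMPE -> CAP, EMPE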

Using svy command with graph command

Dear all,

I am trying to use graph in Stata 16 to plot bars of the means of categorical variables (5-point scale) while using the svy command so as to have accurate point estimates.

Here's an example of what I tried to do using a variable "nuclear" that excludes the 6th response category ("don't know").
Code:
. svy: graph bar nuclear if nuclear!=6
graph is not supported by svy with vce(linearized); see help svy estimation for a list of Stata
estimation commands that are supported by svy
r(322);
I had successfully used svyset to calibrate the weights with the rake option:
Code:
#delimit ;
svyset [pweight=_one], rake(i.gender2 i.age1 i.region1 i.race3 i.latino i.educ1, 
    totals(_cons=25127
    1.gender2=12236 2.gender2=12891
    1.age1=7568 2.age1=8305 3.age1=9254 
    1.region1=5260 2.region1=4435 3.region1=9499 4.region1=5933
    1.race3=18242 2.race3=3191 3.race3=3694
    1.latino=20604 2.latino=4523
    1.educ1=9944 2.educ1=7748 3.educ1=7435
    ));
#delimit cr
I also saved the raked weights as "rake_wt":
Code:
#delimit ;
svycal rake i.gender2 i.age1 i.region1 i.race3 i.latino i.educ1 [pw=_one],
gen(rake_wt) totals(_cons=25127
    1.gender2=12236 2.gender2=12891
    1.age1=7568 2.age1=8305 3.age1=9254 
    1.region1=5260 2.region1=4435 3.region1=9499 4.region1=5933
    1.race3=18242 2.race3=3191 3.race3=3694
    1.latino=20604 2.latino=4523
    1.educ1=9944 2.educ1=7748 3.educ1=7435
    ) ;
#delimit cr
I know graph has a pweight option; however, I get different point estimates when I set pweight equal to my calibrated weights. For example, here are the results from svy: mean:
Code:
. svy: mean nuclear if nuclear!=6
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1        Number of obs   =      2,836
Number of PSUs   =   2,836        Population size =     25,127
Calibration      :    rake        Design df       =      2,835

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
     nuclear |   3.087027   .0283936      3.031353    3.142701
--------------------------------------------------------------
Versus the graph I get when I set pweight equal to my calibrated weights:
Code:
graph bar nuclear if nuclear!=6 [pw = rake_wt]
(attachment: the bar graph produced by the command above)

Without having to individually save each svy: mean result, is there a way I can graph the accurate point estimates?
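One workaround I have been toying with (a hedged sketch; I am not sure it scales to many variables) is to fit a constant-only model under svy, whose intercept equals the calibrated mean, and let margins/marginsplot draw the bar:

Code:
* the constant-only svy regression reproduces the svy: mean point estimate;
* margins then feeds marginsplot, which can render it as a bar
svy: regress nuclear if nuclear != 6
margins
marginsplot, recast(bar) ytitle("Mean of nuclear")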

Best,
Jason