Monday, August 31, 2020

foreach loop and local

Dear all,
The code below is too long for me. I know it could be replaced by a foreach loop, but I don't know how to write the loop. Could anyone tell me the simplified version, or recommend some books where I can learn more?
Code:
local  class1  "var1  var2"
local  class2  "var3  var4"
local  class3  "var5  var6"
local  class4  "var7  var8"
local  class5  "var9  var10"
local  class6  "var11  var12"
egen rnmofclass1=rownonmiss(`class1')
lab var rnmofclass1 "the number of nonmissing values in class1"
egen rnmofclass2=rownonmiss(`class2')
lab var rnmofclass2 "the number of nonmissing values in class2"
egen rnmofclass3=rownonmiss(`class3')
lab var rnmofclass3 "the number of nonmissing values in class3"
egen rnmofclass4=rownonmiss(`class4')
lab var rnmofclass4 "the number of nonmissing values in class4"
egen rnmofclass5=rownonmiss(`class5')
lab var rnmofclass5 "the number of nonmissing values in class5"
egen rnmofclass6=rownonmiss(`class6')
lab var rnmofclass6 "the number of nonmissing values in class6"
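A minimal sketch of the loop being asked about, assuming the locals class1-class6 are defined as above in the same do-file:

Code:
forvalues k = 1/6 {
    egen rnmofclass`k' = rownonmiss(`class`k'')
    lab var rnmofclass`k' "the number of nonmissing values in class`k'"
}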

Subhazard estimates from stcrreg with multiple (>2) competing hazards.

Hi All,

I am trying to estimate a competing-risks model à la Fine and Gray (1999).

I have three different transition states, all of which are mutually exclusive and of interest (i.e., State 1, State 2, and State 3).

Previous research in my field using Stata has reported subhazard ratios for each state for each variable (e.g., the subhazard for x1 for State 1, for State 2, and for State 3).

My question is: what syntax is required to produce this?

Currently my syntax looks like this,

Code:
stset date_end, id(id) failure(state==1) scale(365.25) origin(start)
stcrreg x1 x2 x3, compete(state = 2 3)
So currently this produces one set of subhazard ratios (one per x variable), which refer to the state of interest (1).

Do I then re-declare the survival data and change the competing states to get all three sets of subhazards?

Thank you in advance for any advice
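For reference, a hedged sketch of that re-declaration approach (each state as the failure event in turn, with the other two treated as competing risks; variable names are those from the post):

Code:
stset date_end, id(id) failure(state==2) scale(365.25) origin(start)
stcrreg x1 x2 x3, compete(state = 1 3)

stset date_end, id(id) failure(state==3) scale(365.25) origin(start)
stcrreg x1 x2 x3, compete(state = 1 2)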


Removing duplicate permutations

Hi,

I am trying to remove "duplicate permutations". For instance:

ID source target
1 A B
2 A C
3 B C
4 B A
5 C B

Here, I would like to remove lines 4 and 5, because they are duplicates of lines 1 and 3. Any suggestion on how to do that?
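For what it's worth, a hedged sketch using an order-independent key (it assumes source and target are strings):

Code:
* put each pair in a canonical order, then drop repeated pairs
gen first  = cond(source <= target, source, target)
gen second = cond(source <= target, target, source)
duplicates drop first second, force
drop first second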



How to move Stata Application

I have Stata 13 MP on my current laptop (MacBook Air). I have just bought a new laptop (also a MacBook Air) and want to transfer Stata to it. What is the best way to do this? I no longer need Stata on my old laptop. Thanks

Avoiding merge m:m problem...

Dear Statalisters:

I have the following file #1:

clear
input byte(firmid ownerid) float own byte level
1 4 10 1
1 3 53.8 1
2 5 100 2
3 6 50 2
3 7 50 2
6 8 20 3
8 9 20 4
end

that I need to merge with file #2:

clear
input byte(ownerid ownerid2) float own2 byte level2
1 4 10 1
1 3 53.8 1
2 5 100 2
3 6 50 2
3 7 50 2
6 8 20 3
8 9 20 4
end

using the variable "ownerid".

The "ownerid" variable can appear more than once in both the master and using data files. Considering prior posts saying that it is not a good idea to use the merge m:m command, I am wondering how to merge these two files that have an "m:m" structure without using merge m:m.

thank you all for your help and stay safe,

LR



How to set width Stata MP 16.1 for Unix

Hello,

I am using Stata 16.1 (64-bit, total usable memory 1510.3 GB), and trying to change an extremely large data set to "long" format (from 4.4 x 10^7 obs to 4.4 x 10^8 obs).

I am getting the error message "no room to add more variables due to width". I have already set max_memory well above what is required (500 GB), and have set maxvar to 120,000 (the maximum for Stata MP). I have seen in other help forums the option "set maxvar xxx width yyy", but when I add the "width" option to the command Stata reports: "-set width- not allowed; 'width' not recognized".

It also does not let me "set memory" (returning: "set memory ignored. Memory no longer needs to be set in modern Statas; memory adjustments are performed on the fly automatically.").

Do I have any other options to expand the width specifically for this data set (given that it's not a memory issue)?

Thank you!

Panel data with messy time, very large N

Dear users,

My data consist of N=1033 and T=1-15, with a total number of observations of 2300. Time is very messy in the sense that the number of observations per time period is barely 2, and there are tons of irregular time gaps. I have 2 questions:

1-I think I should ignore temporal variation and go with an individual effect model. Do you agree?

2-Also, I want to do a 2SLS estimation. Since my panel is messy, I want to do 2SLS without FE. In Stata terms, can't I just use ivreg instead of xtivreg? (I am not myself a Stata user, but wanted your opinion, since there are very brilliant and helpful users out here.) My justification is that N and the total number of observations are close to each other. Under what conditions is ivreg preferable to xtivreg? Note that I ran ivreg with different specifications and it passes both the Wu-Hausman and weak-instrument tests.

Thanks!

Reshaping from "strange" to long format

Dear StataListers,

I use Stata 14.2.
Here is a simplified example of my data (three cities, three years, three variables: co, gas, waste):

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(city year) byte variable_id str5 variable_id_desc float value
3101 2011 1 "co"    8919
3101 2015 2 "gas"   68.8
3101 2015 3 "waste" 54.6
3101 2015 1 "co"    7273
4416 2011 1 "co"     157
4416 2015 2 "gas"    2.2
4416 2015 3 "waste"  2.1
4416 2016 2 "gas"    3.9
4416 2016 3 "waste"  2.3
4416 2017 2 "gas"    2.3
end
I need to make this dataset "long", in a panel-data format.
I tried different variants of the reshape command, but I couldn't manage it.
Could you advise how to solve this problem?
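For what it's worth, a hedged sketch (the numeric variable_id duplicates variable_id_desc, so it can be dropped before reshaping):

Code:
drop variable_id
reshape wide value, i(city year) j(variable_id_desc) string
rename value* *

This leaves one row per city-year with columns co, gas, and waste.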

Regressing Firm-specific variables against Macroeconomic variables

Hello,

I am using panel data with the following observations:
  • The firm-specific dependent variables (3 variables) have 41944 observations.
  • The macroeconomic independent variables (7 variables) have 56 observations.
  • The firm-specific control variables (4 variables) have 41944 observations.
  • The macroeconomic control variables (3 variables) have 56 observations.
How can I run a regression analysis with the observations I have without getting errors?

Looking forward to any response!

Blinder-Oaxaca Decomposition with group specific variables

I would like to run a Blinder-Oaxaca Decomposition on black-white differences. I have panel data where all variables are race-specific group averages.

i.e. I am currently running two regressions.

xtreg white_dependentvar white_independentvars i.timevar, fe

xtreg black_dependentvar black_independentvars i.timevar, fe


I want to do a Blinder-Oaxaca decomposition of the differences. Is this possible, and if so, is there an existing Stata command for it?

It does not seem that the oaxaca command will work, since I have different variables for the two groups.

(Note that I do not have individual level data, so I cannot decompose the group averages to get data in a format that works with the oaxaca command).




how to split date and time

I have a variable dateofvisit (type double) with values such as 1/25/2020 1:02 and 1/12/2020 11:55. How can I split the date and the time into two variables?
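A hedged sketch, assuming dateofvisit holds a %tc (millisecond) datetime value:

Code:
gen visitdate = dofc(dateofvisit)                      // calendar date part
format visitdate %td
gen double visittime = mod(dateofvisit, msofhours(24)) // ms since midnight
format visittime %tcHH:MM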

Converting DMS coordinates to decimal in Stata

Hi! Can someone please tell me if it's possible to convert DMS coordinates to decimal in Stata? I am using Stata 14.2. Thank you!
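It is possible with ordinary arithmetic once the degrees, minutes, and seconds sit in numeric variables; a hedged sketch (deg, min, and sec are hypothetical variable names):

Code:
* decimal degrees, preserving the sign of the degrees component
gen double decimal = sign(deg)*(abs(deg) + min/60 + sec/3600)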

stcox: Continuous Time Varying Covariates

When evaluating the effect of a continuous covariate that changes from one wave to the next (in this instance, the relationship between a continuous health index score and mortality risk), do I need to indicate that the covariate varies with time via stcox's tvc() option, or does Stata do this automatically? While previous posts touch on this, I have seen contradictory answers. Example below for clarification.

Sample Cox regressions:

#1: stcox female age index_score

#2: stcox female age, tvc(index_score)

The first regression shows a much stronger effect of a one unit change in the index score on mortality risk than the second.

Thank you

Two ID variables for each row

Hey all,
In advance, I would like to apologise to experienced users for asking this (I just cannot find the answer). I have a dataset from the Millennium Cohort Study (MCS6). Physical activity (PA) is recorded for 2 days (a weekday and a weekend day), hence two rows per ID variable. My question is: what is the best way to handle such data? I have reshaped it to wide (it was originally in long format) and made 2 variables instead of 1. So now I have a set of variables for day 1 and a set for day 2; however, there are a lot of missing values, and I cannot simply drop them, as the whole row drops. (Picture attached.) This is accelerometer data: MCSID is the ID variable; FCACCAD is the day the accelerometer was assigned. Apologies if this is not really clear; any help would be highly appreciated.
Code:
gen day2 = 2 if (FCACCAD==2)
replace day2 = 1 if (FCACCAD ==1)
gen day1 =1 if (FCACCAD ==1)
replace day1=2 if (FCACCAD ==2)

gen MCSPID = _n

reshape wide MCSID FCACCWEEKDAY FCACC_N_VALID_HRS FCACC_MEAN_ACC_24H FCACC_MVPA_MEAN_ACC_E1MIN_100MG FCACC_MVPA_E5S_B1M80_T100_ENMO, i(MCSPID) j(FCACCAD)

clonevar fca1 = FCACCWEEKDAY1
clonevar fca2 = FCACCWEEKDAY2
clonevar valid_hrs1 = FCACC_N_VALID_HRS1
clonevar valid_hrs2 = FCACC_N_VALID_HRS2
clonevar mean1 = FCACC_MEAN_ACC_24H1
clonevar mean2 = FCACC_MEAN_ACC_24H2
clonevar mvpa_mean1 = FCACC_MVPA_MEAN_ACC_E1MIN_100MG1
clonevar mvpa_mean2 = FCACC_MVPA_MEAN_ACC_E1MIN_100MG2
clonevar mvpa801 = FCACC_MVPA_E5S_B1M80_T100_ENMO1
clonevar mvpa802 = FCACC_MVPA_E5S_B1M80_T100_ENMO2



.dta files do not open from project manager

Dear all,

I cannot open .dta files by double-clicking on them in the project manager window.
To be precise, I cannot do that in my Stata MP 16.0, but can do that in Stata SE 14.2.
On the other hand, in SE 14.2 I cannot open .xlsx files by double-clicking, but can do that in MP 16.0.
Does anybody know what might be the problem?

Best regards,
Ivica Rubil

Random match between two large surveys with weights

I have two different large-ish surveys of the US adult population. The two in principle are measuring similar labor market and demographic concepts, but survey #2 has a variable of interest, treat, that survey #1 lacks.

Naturally, I have the two surveys in separate .dta files.

What I'd like to do is statistically match individuals from survey #2 to individuals in survey #1, conditional on a variety of demographic variables. For the purposes of this question, let's assume I just have four: race, age, education, and sex.

For every individual in survey #1 of a given combination of those four variables, I'd like to randomly link to an individual in survey #2 who matches on the same four variables.

One complication is that although the weighted population in both surveys is (in principle) the same, the two surveys have a different number of raw cells. So ideally the solution would incorporate the weights (let's call that variable weight1 in survey #1 and weight2 in survey #2).

An inefficient method I've tried for doing this is calling
Code:
expand weight2
in survey #2 and identifying row ranges for the different groups, from which I can then draw randomly in survey #1. But as the populations involved are 200+ million people, that gets unwieldy fast.
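For what it's worth, a hedged sketch that sidesteps expand: randomly order both surveys within each demographic cell and match on the within-cell rank (survey1.dta and survey2.dta are hypothetical filenames; the weights are ignored here, and survey-#1 records beyond the size of the matching survey-#2 cell stay unmatched):

Code:
use survey2, clear
set seed 12345
gen double u = runiform()
sort race age education sex u
by race age education sex: gen donor = _n   // random within-cell rank
tempfile donors
save `donors'

use survey1, clear
gen double u = runiform()
sort race age education sex u
by race age education sex: gen donor = _n
merge 1:1 race age education sex donor using `donors', keep(master match)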


fractional probit regression with interaction term. Margins command dyex() does not work for the interaction term.

Hello,

I used a fractional probit regression and added an interaction term. The latter has a significant positive effect. When I try to estimate average marginal effects in order to interpret the results of my model, I receive a message that the margins option dyex() does not allow levels of interaction.

How can I interpret the results of this statistically significant interaction effect?

Greetings,

Gunther
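For what it's worth, the message usually means dyex() was handed the interaction term itself; a hedged sketch in which the underlying continuous variables are requested instead, letting margins account for the interaction (y, x1, x2, and controls are hypothetical names):

Code:
fracreg probit y c.x1##c.x2 controls
margins, dyex(x1 x2)   // not dyex(c.x1#c.x2)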

How to properly read a longitudinal dataset in Stata

Dear Statalist
I am interested in analyzing a longitudinal dataset with several observations per individual. However, each individual may start on a different date and have a different number of observations (similar to an unbalanced panel). My problem is that I do not know how to read the data into Stata for this analysis.

Let’s say that I would like to do a regression like: y x1 x2 x3 x4 through a Fixed effect estimation (in a panel would be: xtreg y x1 x2 x3 x4, fe).
If the time dimension were one yearly observation per individual, I would do:
Code:

xtset id1 timein
However, I am not sure whether this can also be done when the time variable is daily. I mean, in this case would the "fe" option control for individual time-invariant characteristics (as in a panel dataset)? If I would also like to control for time dummy variables, as is usually done with panel data, should I put in a dummy for each day? Could this over-parameterize the regression, losing a lot of degrees of freedom?

A different complication arises if two or more observations start on the same date. In this case I would have to erase those repeated observations in order to use (xtset id1 timein). However, I might be losing possibly relevant observations. My second question is whether, instead of using the daily time variable, I could use an occasion variable. I mean, I first sort the data by id and time, and then build an occasion variable equal to 1 for the first observation within an individual, 2 for the second one, and so on.
My concern is that in this case, an individual whose first observation is at 2/2/2005 would be compared with an individual whose first observation is at 5/5/2015 (ten years later). So, if I put occasion dummy variables in the regression, they won't capture the same effect as yearly dummy variables in a panel dataset (please correct me if I am wrong).

Is this (occasion variable) a possible approach? Or should I drop the repeated observations and use the daily time variable (xtset id1 timein)?
Thanks a lot in advance for your help. (Here is an example of the dataset.)

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte y double x1 byte(x2 x3) float(timein occ id1)
0                  . 23 46 15036  1  4
0                  . 23 93 15051  2  4
0                  . 23 78 15901  3  4
0                  . 23 78 15932  4  4
0                 21 23 93 16302  5  4
0                 27 23 73 16406  6  4
0                  . 23 55 16580  7  4
0                  3 33 85 16712  8  4
0                  . 23 93 16953  9  4
0  4.041666666666667 23 93 17080 10  4
0                  . 23 84 17120 11  4
0                  . 55 84 17532 12  4
0                 20 22 81 15168  1  6
0                 14 22 87 15427  2  6
0           19.03125 22 47 15538  3  6
0                  . 22 47 15545  4  6
0 19.038194444444446 22 47 15553  5  6
0                 14 22 81 15812  6  6
0                 30 23 78 15866  7  6
0                 17 22 87 15968  8  6
0                  . 22 87 16071  9  6
0               25.5 22 87 16619 10  6
0                  . 22 87 16628 11  6
0                  . 22 87 16983 12  6
0                  . 22 87 17018 13  6
0                  . 22 87 17029 14  6
0                  . 54 86 17594 15  6
0                  . 54 86 17720 16  6
0                  . 54 86 17994 17  6
0                  . 54 86 18051 18  6
0                  . 54 86 18234 19  6
0                 20 51 47 16250  1  7
0                  . 51 56 16628  2  7
0                 20 51 47 16740  3  7
0                  . 51 86 17001  4  7
0                  6 54 86 17438  5  7
0                 25 54 96 17440  6  7
0                  2 54 86 17475  7  7
1                  6 54  . 17622  8  7
0                  . 54 86 17658  9  7
0                  . 54 86 17843 10  7
0                  . 54 86 17947 11  7
0                  . 54 86 18057 12  7
0                  . 54 86 18176 13  7
0 23.333333333333336 54 87 18480 14  7
0 13.583333333333334 54 87 18513 15  7
0                  . 55 86 18546 16  7
0                  . 54 86 18597 17  7
0                  . 23 96 20636 18  7
0                  . 54 84 15536  1 18
0                  . 54 84 15567  2 18
0                  . 54 86 15585  3 18
1                  . 54  . 16315  4 18
0                  6 23 85 16530  5 18
0                 10 23 85 16559  6 18
0                 10 23 85 16561  7 18
0                 10 23 85 16580  8 18
0                 10 23 85 16589  9 18
0                  6 23 85 16600 10 18
0                 10 23 85 16601 11 18
0                 10 23 85 16617 12 18
0                 10 23 85 16699 13 18
0                 10 23 85 16713 14 18
0                 10 23 85 16727 15 18
0                  2 23 85 16748 16 18
0                 10 23 85 16783 17 18
0                  . 55 47 17841 18 18
0                  . 55 41 18163 19 18
0                  . 55 41 19178 20 18
0                 30 55 85 19267 21 18
0                 28 55 85 19617 22 18
0                  . 23 85 20509 23 18
0                  . 23 85 20515 24 18
0                 20 54 78 20698 25 18
0  32.63993055555556 55 87 20752 26 18
0                  . 55 87 20755 27 18
0 37.333333333333336 55 96 21118 28 18
0                  3 33 85 21284 29 18
0               4.25 33 85 21339 30 18
0                 20 80 47 18079  1 21
0                 30 51 47 18198  2 21
1                  . 51  . 18320  3 21
0                 16 55 47 19650  4 21
0                  . 55 47 19932  5 21
0                 16 51 78 21067  6 21
0                 28 51 78 21117  7 21
0                 20 51 46 21148  8 21
0 18.889930555555555 59 47 21430  9 21
0                  . 32 78 21458 10 21
0                 10 51 78 21535 11 21
0                 20 51 78 21598 12 21
0                 10 51 78 21609 13 21
0                 10 51 78 21626 14 21
0                  . 12 41 21668 15 21
0                 12 51 78 21703 16 21
0                  . 32 78 21705 17 21
0                 18 51 78 21724 18 21
0                 20 51 78 21826 19 21
0                 20 51 78 21878 20 21
0                 30 23 56 19909  1 24
end
format %td timein

Why weak association and wide confidence intervals?

Dear All,
I am interested in examining the effect of IPV (intimate partner violence) on my proxy outcome measures for skilled maternal health care utilisation:

Dependent variables:
  • Adequate ANC visits (1) vs. inadequate ANC visits (0)
  • Health facility delivery (1) vs. home delivery (0)
I also want to know whether the effect differs by mother's education and household wealth status.

Main exposures: spousal emotional IPV (Yes/No or 1/0) and spousal physical IPV (Yes/No or 1/0).

Stratified by education status: lower education (1) and higher education (2).

All my other covariates are also categorical variables. I have fitted a multilevel logistic regression because the data are clustered at the survey level. My setup is Windows 10, Stata 16.1.

Model I. Education stratified adjusted logistic regression
Output:

. melogit anc_adequacy i.emo_Ipv i.age_catgorey i.husband_educ i.wealth_hh i.mediae_expo i.birth_order i.dma i.V102 i.contextual_regions if educ_mom==1 || psu:, or nolog

Mixed-effects logistic regression Number of obs = 2,548
Group variable: psu Number of groups = 580

Obs per group:
min = 1
avg = 4.4
max = 12

Integration method: mvaghermite Integration pts. = 7

Wald chi2(15) = 160.53
Log likelihood = -1384.4603 Prob > chi2 = 0.0000
------------------------------------------------------------------------------------------------
anc_adequacy | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------------------+----------------------------------------------------------------
emo_Ipv |
Yes | 1.353753 .1777473 2.31 0.021 1.046591 1.751064
|

In the above model, the main effect of emotional violence (emo_Ipv) on adequate ANC is itself quite weak (AOR = 1.35): the odds ratio is greater than 1, but its lower confidence limit is close to 1. What statistical considerations apply to such a weak relationship?


Model II.

Output:

melogit del_place i.phy_ipv i.age_catgorey i.husband_educ i.wealth_hh i.mediae_expo i.birth_order i.dma i.V102 i.contextual_regions if educ_mom==2 || psu:, or nolog

Mixed-effects logistic regression Number of obs = 315
Group variable: psu Number of groups = 216

Obs per group:
min = 1
avg = 1.5
max = 4

Integration method: mvaghermite Integration pts. = 7

Wald chi2(15) = 21.39
Log likelihood = -90.381795 Prob > chi2 = 0.1249
------------------------------------------------------------------------------------------------
del_place | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------------------+----------------------------------------------------------------
phy_ipv |
Yes | 4.013633 2.682928 2.08 0.038 1.082801 14.87739
|

The problem with this model is that the confidence interval for the association between physical IPV (phy_ipv) and health facility delivery (del_place) is very broad (AOR = 4.01; CI: 1.08 to 14.88). I am not sure why the interval is so broad.

Thank you so much.





How to deal with I(1) and I(0) variables in econometric analysis

Dear Users,

I am working on a balanced panel of 50 countries observed over 19 years. My main dependent variable is I(1), while my main independent variable is I(0).
Both are indicated in the following output as FDI and I. All control variables are a mixture of I(0) and I(1), and I found evidence of heteroskedasticity, first-order autocorrelation, and cross-sectional dependence.

I had initially planned to use a system GMM estimator, but it assumes that all variables are stationary in levels. To deal with the problem of mixed stationarity among the variables, I saw this post where Jeff Wooldridge recommended the Driscoll-Kraay approach, transforming all non-stationary variables by taking their first differences.

I have followed the recommendation, but I am not confident about my output since I am neither an advanced user nor do I have advanced knowledge of econometrics. Here is the advice I need:

1) Can I proceed to present these results? I am a bit hesitant because three of the year dummies are dropped from the estimation (I can't explain why), and the coefficients of some year dummies are strongly significant.
2) Would it be acceptable to proceed with a GMM estimation but difference the non-stationary variables to meet its requirements?

Code:
 xtscc D.FDI D.L.FDI D.GDP D.TRADE D.AID D.EXR NATR I yr*, fe lag(4)

Code:
Regression with Driscoll-Kraay standard errors   Number of obs     =       850
Method: Fixed-effects regression                 Number of groups  =        50
Group variable (i): ID                           F( 23,    16)     =     26.24
maximum lag: 4                                   Prob > F          =    0.0000
                                                 within R-squared  =    0.0983

------------------------------------------------------------------------------
             |             Drisc/Kraay
    __00000K |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         FDI |
         LD. |   .1090312   .0662507     1.65   0.119    -.0314141    .2494765
             |
         GDP |
         D1. |    .138741    .048443     2.86   0.011     .0360463    .2414356
             |
       TRADE |
         D1. |   .1310995   .0549137     2.39   0.030     .0146876    .2475114
             |
         AID |
         D1. |   .0026035   .0089651     0.29   0.775    -.0164017    .0216087
             |
         EXR |
         D1. |  -.0031345   .0046832    -0.67   0.513    -.0130624    .0067935
             |
        NATR |   .0106743   .0047457     2.25   0.039     .0006138    .0207348
           I |   .0403172   .0088159     4.57   0.000     .0216284     .059006
        yr_1 |          0  (omitted)
        yr_2 |          0  (omitted)
        yr_3 |   .1163223   .0201306     5.78   0.000     .0736473    .1589972
        yr_4 |   .0697446   .0124469     5.60   0.000     .0433584    .0961308
        yr_5 |   .0908792   .0127773     7.11   0.000     .0637926    .1179658
        yr_6 |  -.0163546   .0109941    -1.49   0.156    -.0396611    .0069519
        yr_7 |   .0911496    .019723     4.62   0.000     .0493386    .1329605
        yr_8 |   .2674504   .0130407    20.51   0.000     .2398053    .2950956
        yr_9 |          0  (omitted)
       yr_10 |   .1628334   .0207781     7.84   0.000     .1187858     .206881
       yr_11 |   .0933184   .0102961     9.06   0.000     .0714916    .1151453
       yr_12 |   .1116583   .0118736     9.40   0.000     .0864873    .1368293
       yr_13 |   .0524463    .011628     4.51   0.000      .027796    .0770966
       yr_14 |   .0866192    .016139     5.37   0.000      .052406    .1208325
       yr_15 |   .0087971    .015348     0.57   0.574    -.0237391    .0413334
       yr_16 |   .0385855   .0230498     1.67   0.114    -.0102777    .0874488
       yr_17 |   .0549583   .0204628     2.69   0.016      .011579    .0983375
       yr_18 |   .0504483   .0166365     3.03   0.008     .0151805    .0857162
       yr_19 |   .0005665   .0153748     0.04   0.971    -.0320266    .0331596
       _cons |   .0093688    .027222     0.34   0.735    -.0483392    .0670768
------------------------------------------------------------------------------


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(FDI GDP TRADE AID EXR NATR I)
8.1252975 24.726702 4.1409704 19.113628  4.320946 3.1907837 -1.395516
8.4100485  24.72592 4.0725812 19.108112  4.346594 3.0786939 -1.010776
8.6228029 24.762064 4.1131238 19.041902 4.3780425 3.0803736 -.6260364
8.7314687 24.940803  4.129113 19.292978 4.3489219 3.1500927 -.4944898
8.8645602 25.169731 4.1851091 19.574318 4.2775081 3.2198058 -.1928248
end

How to tell Stata not to ignore missing values in bysort: egen mean command

Dear Stata-Listers,

I am posting a question for the first time, so I hope I'm doing it correctly.

I have a panel data set and I want Stata to calculate the mean of a variable within a certain group. The group is identified by "country" and "period". "period" is a three-year period, i.e. there are always three year-observations in one "country period" group. I use the following code:

Code:
bysort country period: egen mean_ideology = mean(ideology)
This works perfectly, except that sometimes "ideology" is missing for one or two years, and with the code above Stata ignores these: mean_ideology is then calculated on the basis of the one or two non-missing observations. However, I want mean_ideology to be missing as soon as ideology is missing in any of the three years.

What happens is:
year country period ideology mean_ideology
2007 Belgium 1 3 3
2008 Belgium 1 3 3
2009 Belgium 1 3 3
2010 Belgium 2 2 2
2011 Belgium 2 1 2
2012 Belgium 2 3 2
2013 Belgium 3 3 3
2014 Belgium 3 . 3
2015 Belgium 3 . 3
What I want is:
year country period ideology mean_ideology
2007 Belgium 1 3 3
2008 Belgium 1 3 3
2009 Belgium 1 3 3
2010 Belgium 2 2 2
2011 Belgium 2 1 2
2012 Belgium 2 3 2
2013 Belgium 3 3 .
2014 Belgium 3 . .
2015 Belgium 3 . .
I thought it might make sense to first replace "ideology" with "." wherever another value of ideology is missing within a "country period" group, and then calculate the mean. Based on other Statalist questions and responses, I have played around with:

Code:
bysort country period: replace ideology = . if
...?

but I have been unsuccessful.
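For what it's worth, a hedged sketch that blanks out the group mean for any group containing a missing value:

Code:
bysort country period: egen nmiss = total(missing(ideology))
bysort country period: egen mean_ideology = mean(ideology)
replace mean_ideology = . if nmiss > 0
drop nmiss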

I'd be very grateful for some help.
Thank you in advance!
Rike

I'm converting a date but the new date variable is all missing values

I'm using Stata 14.2. I have a dataset with dates in a string variable, e.g., 01-Apr-2018. I have used the code below to convert them to date variables; however, the new date variable that I create is all missing values.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 dateperformed
"2-Apr-20" 
"2-Aug-18" 
"16-Oct-18"
"8-Jan-19" 
"19-Sep-18"
"11-Oct-18"
"2-Jul-20" 
"29-Oct-18"
"24-Apr-19"
"25-Nov-19"
"10-Sep-19"
"31-Jul-18"
"29-May-18"
"3-Jun-19" 
"18-Dec-18"
"18-Dec-18"
"18-Mar-20"
"21-May-18"
"26-Jun-20"
"9-May-18" 
"21-Aug-19"
"29-Aug-18"
"2-Nov-18" 
"16-Jul-20"
"6-Apr-20" 
"6-Apr-20" 
"23-Jul-19"
"18-Oct-18"
"10-Jul-20"
"18-Sep-19"
"21-Mar-19"
"16-Mar-20"
"21-Jan-20"
"10-Mar-20"
"30-Jul-18"
"29-Aug-18"
end

Code:
. gen date_performed = date(dateperformed, "DMY")
(1,483 missing values generated)

. format %tdDD/NN/CCYY date_performed


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float date_performed
.
.
.
.
.
.
.
.
.
.
.
end
format %tdDD/NN/CCYY date_performed
I have seen similar threads and tried all the fixes I've found, but nothing seems to work for my dataset. Can someone advise how to fix this date variable?
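For what it's worth, the missing values are consistent with date()'s handling of two-digit years: without a century rule they translate to missing. A hedged sketch with a pivot:

Code:
gen date_performed = date(dateperformed, "DM20Y")  // "20" prefixes the 2-digit years
format %tdDD/NN/CCYY date_performed
* alternatively: gen date_performed = date(dateperformed, "DMY", 2030)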

Thanks

Difference between areg and xtset for firmxyear fixed effect

Dear all,


I would like to find the effect of bank liquidity risk (my independent variable) on loan pricing (y). My professor suggested using firm(borrower)-by-year fixed effects to control for borrower demand, and clustering the standard errors by firm-year. I am not sure whether what I did is correct.


egen B_year = group(BID year) (where BID is borrowerID)
then, " areg logy liqtoA2 logA logAmount logMat, absorb(B_year) vce(cluster B_year)"

I also ran xtreg using

xtset B_year
xtreg logy liqtoA2 logA logAmount logMat,fe vce(cluster B_year)


For both commands, I got the same coefficient estimates but slightly different standard errors. Can the two commands be used interchangeably?

Additionally, if I use firm fixed effects and year fixed effects separately, does that mean the same thing? (The coefficients I got were different.)
For example, I use

xtset BID (where BID is borrowerID)
xtreg logy liqtoA2 logA logAmount logMat i.year, fe vce(cluster B_year) / or vce(cluster BID)





Please advise, and thank you in advance.

Estimating Modified Jones Model, No Final Values Generated for Jones Model

Hi,

I am trying to estimate discretionary accruals via the modified Jones model in Stata, and this is the code I am running. I am using data on 75 companies from 2007 to 2019. I apologise in advance for any errors in writing my post.


Here is the data format I am using.


Code:
FirmID    companies    years    Total_assets    laggedassets    SIZE
1    PEPCO HOLDINGS, INC    2007    15111000        16.530934
1        2008    16475000    15111000    16.617355
1        2009    15779000    16475000    16.574191
1        2010    14341000    15779000    16.478633
1        2011    14765000    14341000    16.50777
1        2012    15776000    14765000    16.574
1        2013    14848000    15776000    16.513376
1        2014    15667000    14848000    16.567067
1        2015    16311000    15667000    16.60735
1        2016    21019000    16311000    16.860937
1        2017    21243000    21019000    16.871538
1        2018    21972000    21243000    16.905279
1        2019    22706000    21972000    16.93814
2    NISOURCE INC    2007    18004800    22706000    16.706149
2        2008    20032200    18004800    16.812852
2        2009    19271700    20032200    16.774148
2        2010    19938800    19271700    16.808178
2        2011    20708300    19938800    16.846045
2        2012    21844700    20708300    16.899469
2        2013    22653900    21844700    16.935843
2        2014    24866300    22653900    17.029024



Here is the code I used, which I got from this same platform. I have also read all the FAQs, but I am unable to find any code that works for me.

Code:
gen Jones_3 = .
forval y = 2007(1) 2019 {
forval i = 1(1) 75 {
display `i'
display `y'
reg TACC2 DAPterm1 DAPterm2 DAPterm3 if `i' == FirmID & `y' == years
predict r2 if `i' == FirmID & `y' == years
replace Jones_3 = r2 if `i' == FirmID & `y' == years   // r2, not the undefined r1
drop r2

}
}


This is the output that I get.

Code:
. forval y = 2007(1) 2019 {
  2.     forval i = 1(1) 75 {
  3.         display `i'
  4.         display `y'
  5.         reg TACC2 DAPterm1 DAPterm2 DAPterm3 if `i' == FirmID & `y' == years
  6.             predict r2 if `i' == FirmID & `y' == years
  7.             replace Jones_3 = r2 if `i' == FirmID & `y' == years
  8.             drop r2
  9.        
.     }
 10. }
1
2007
no observations
r(2000);

end of do-file

r(2000);

.
Just to let you know, I am using Thomson Reuters data. I might be committing basic errors while writing the code, but if I execute the code without the 'if' conditions I get the required result. So I think I am making some error in writing the 'if' conditions, or in classifying FirmID and years as below. I do not know how to use industry and companies, because I ran this code with FirmID and companies, as I do not have any industry code. But these are all my guesses; I still need your guidance and help in this regard.

This is the code I used after swapping in FirmID, industry, and companies. I got the same error with each of these.

Code:
forval y = 2007(1) 2019 {
forval i = 1(1) 75 {
display `i'
display `y'
reg TACC2 DAPterm1 DAPterm2 DAPterm3 if `i' == FirmID & `y' == years
predict r2 if `i' == FirmID & `y' == years


This is the result I get when I run the code without the 'if' conditions. So can I run it without the if conditions?
Code:

      Source |       SS           df       MS      Number of obs   =       881
-------------+----------------------------------   F(3, 877)       =   3469.79
       Model |  356.595872         3  118.865291   Prob > F        =    0.0000
    Residual |  30.0435815       877   .03425722   R-squared       =    0.9223
-------------+----------------------------------   Adj R-squared   =    0.9220
       Total |  386.639453       880  .439363015   Root MSE        =    .18509

------------------------------------------------------------------------------
       TACC2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    DAPterm1 |  -170208.8   45996.52    -3.70   0.000    -260484.9    -79932.7
    DAPterm2 |   -.199707   .0107229   -18.62   0.000    -.2207525   -.1786616
    DAPterm3 |   .5226463   .0120609    43.33   0.000     .4989748    .5463178
       _cons |  -.0007244   .0112775    -0.06   0.949    -.0228583    .0214096
------------------------------------------------------------------------------
I am sorry for the long post, but I am trying to state the problem I am facing clearly. This is my thesis data.


I am using Stata 14.
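For what it's worth, the loop appears to run one regression per firm-year, which leaves at most one observation per regression and is where "no observations" comes from. A hedged sketch of the more usual cross-sectional setup (one regression per year across firms; many designs use industry-years instead, which would need an industry code):

Code:
gen Jones_3 = .
forvalues y = 2007/2019 {
    reg TACC2 DAPterm1 DAPterm2 DAPterm3 if years == `y'
    predict double res if e(sample), residuals   // discretionary accruals
    replace Jones_3 = res if e(sample)
    drop res
}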

Dashed line graph for parts of the line

Hello,

I would like to use two line styles for the same graph to denote value above and below a certain threshold: twoway (tsline var if var<=1) (tsline var if var>1, lpattern(dash))

I also tried creating two variables, one for values <=1 and another for values >1. But in both cases I run into the problem that, because the values for each variable do not occur continuously, I get different shapes than if I were to just draw one line for all values. What I would like is the same shape as when I don't split the series, but with a dashed line for values above 1.

Could anyone suggest a solution?
Many thanks!
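For what it's worth, a hedged sketch of a common workaround: clone the series, blank out the opposite range, and copy the threshold-crossing points into both clones so the segments join up (var is the variable from the post; a tsset is assumed, since tsline requires one):

Code:
gen var_lo = var if var <= 1
gen var_hi = var if var >  1
* keep each crossing point in both series so the line stays continuous
replace var_lo = var if var > 1  & (L.var <= 1 | F.var <= 1)
replace var_hi = var if var <= 1 & (L.var > 1  | F.var > 1)
twoway (tsline var_lo) (tsline var_hi, lpattern(dash))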

Fixed-effects on sibling data: which N to report?

Dear colleagues, good day! I would be very grateful if you could help me with an issue.

Setup: I have data on siblings, 2-3 children per family. In total, there are about 4,000 families. The dependent variable is educational attainment (continuous); the independent variable is cognitive ability (continuous). I am running family fixed effects (xtreg) to take into account any confounding effects shared by siblings.

Questions:
1. Which N should I report in my paper? The nominal N that Stata reports? Or the N covering only discordant sibling families (where, for example, the children differ in educational attainment and cognitive ability)? For example, xtlogit drops those groups (families) where individuals do not differ on the dependent variable (they are non-discordant) and reports the N for discordant families only. But xtreg does not work that way: it always reports the nominal N.

2. Which families are actually involved in calculating the coefficients and standard errors? All of them, regardless of whether the children are discordant? Only those that differ within-family on the dependent variable? Only those that differ on the independent variable? Or only those discordant on both?

With best regards,
V


Sunday, August 30, 2020

Issues with Creating a New Variable for Observations with the Same ID

Hi,

I'm putting together an NBA Dataset for each season that includes data on the outcome of each game of the season.
Each game has 2 observations (from the perspective of each team) and both observations share the same Unique ID.

I have recorded the Win Percentage (# of Wins/Games Played) for each team at the time the game was played.
However, in order to complete my dataset I need to create a new variable for the Opponent's Win Percentage.

The following is an small example of what the dataset looks like for two different games i.e. Boston vs Philadelphia and Miami vs Philadelphia (with only the relevant variables):
Team TeamID UniqueID WinPct OWinPct
Boston Celtics 1610612738 100 0.5
Philadelphia 76ers 1610612755 100 0.75
Miami Heat 1610612791 101 0.33
Philadelphia 76ers 1610612755 101 0.77
For example, I want to code the observations so the OWinPct Column corresponding to the Boston Celtics has the 76ers WinPct of 0.75, and the 76ers OWinPct column has the Celtics WinPct of 0.5 (since they share the same UniqueID - 100), and so forth for the rest of the observations.

To do this, I tried something like the following code:

gen OWinPct==WinPct if UniqueID==UniqueID & TeamID!=TeamID

Obviously to no avail.

I was wondering if there is a way to actually do this?

Apologies for the lack of technical knowledge and clarity; this is my first time using Stata, as I am an undergraduate student. Thanks!
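For what it's worth, a hedged sketch that grabs the other row's WinPct within each UniqueID pair (it assumes exactly two rows per UniqueID):

Code:
bysort UniqueID (TeamID): gen OWinPct = cond(_n == 1, WinPct[2], WinPct[1])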

help

Hi, everyone,


I would greatly appreciate any help you can give me. I have a panel data structure in long format, where each id can have more than one record, as I show below:

id month year income
1 03 2016 20,000
1 02 2016 15,000
1 01 2016 35,000
2 02 2016 40,000
2 03 2016 45,000

What I need is, for each id (subject), to keep only the entry for the last declared date; that is, I need the final dataset to look like this:

id month year income
1 03 2016 20000
2 03 2016 45000


I would greatly appreciate your help to solve this problem, thank you very much in advance,

Regards,

Ariel.
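For what it's worth, a hedged sketch of the keep-the-last-record step (it assumes month and year are numeric; if they are strings, convert them with real() first):

Code:
gen ym = ym(year, month)           // monthly date per row
bysort id (ym): keep if _n == _N   // keep the latest month within each id
drop ym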

forvalues command showing error messages

Dear Professors, hope everyone is blessed with good health.

I use the following loop to run a number of regressions. The command executes successfully, but at a certain point it stops with the error "insufficient observations".

I don't know how to drop those ranges of observations that are insufficient for the regression to run.

Code:
clear
use ~\returns_stock_market_industry_calendar
drop if num_calendar<30
* drop if num_calendar<26


egen id = group(Stkcd year)
egen max_id = max(id)
local group_number = max_id


gen w=.
gen R2_SYNCH1 =.
gen R2_SYNCH2 =.
sort Stkcd Trdwnt
forvalue item =1(1)`group_number'{
    reg ret_stock ret_market ret_market_l1 ret_market_l2 ret_market_f1 ret_market_f2 if id==`item'
    predict e if id==`item',residual
    replace w = ln(1+e) if id==`item'    
    drop e
    reg ret_stock ret_market ret_market_l1 ret_ind ret_ind_l1 if id==`item'
    replace R2_SYNCH1 = e(r2) if id==`item'
    reg ret_stock ret_market ret_ind if id==`item'
    replace R2_SYNCH2 = e(r2) if id==`item'
    local progress `item'/`group_number'
    disp `progress'
}

I get the following error message:


insufficient observations
r(2001);

end of do-file

r(2001);

Thanks Professor
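For what it's worth, a hedged sketch of the usual guard: -capture- the regression and skip the group when it fails, rather than dropping observations up front (the same capture/continue pattern applies to the other two regressions in the loop):

Code:
forvalues item = 1/`group_number' {
    capture reg ret_stock ret_market ret_market_l1 ret_market_l2 ///
        ret_market_f1 ret_market_f2 if id==`item'
    if _rc continue              // e.g., r(2001): too few observations
    predict e if id==`item', residual
    replace w = ln(1+e) if id==`item'
    drop e
}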

Rename using regex?

Is it possible to rename a variable using regex?


Let's say I have a variable:
g wth_t1_xxx = .


Is it possible to rename this variable using a regular expression so that it becomes wth_xxx_t1? I'm looking for a pattern that captures the content between the underscores and moves it to the end, while the content at the end moves to the middle.
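It is, with a small loop; a hedged sketch using regexm()/regexs() to capture the two segments (the wth_*_* pattern is taken from the example):

Code:
foreach v of varlist wth_*_* {
    if regexm("`v'", "^wth_([^_]+)_([^_]+)$") {
        rename `v' wth_`=regexs(2)'_`=regexs(1)'   // swap middle and end
    }
}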

Group by xtabond2

What's wrong with this command?

Code:
by COUNTR, sort xtabond2 ROA L.ROA GRTH MIX EFFIC SIZE SOLV RISK LIQUID LEVR INFL ECOGRTH, gmm(L.ROA GRTH MIX EFFIC SIZE SOLV RISK LIQUID LEVR INFL ECOGRTH,collapse) iv(GRTH MIX EFFIC SIZE SOLV RISK LIQUID LEVR INFL ECOGRTH) noleveleq nodiffsargan robust small

I want to sort my results by country, and the error message I get is the following:

: required
r(100);
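For what it's worth, the r(100) error points at the missing colon after the by prefix; a hedged sketch (note that xtabond2 may still reject the by prefix, in which case looping over countries with an if condition is the fallback):

Code:
by COUNTR, sort : xtabond2 ROA L.ROA GRTH MIX EFFIC SIZE SOLV RISK LIQUID LEVR INFL ECOGRTH, ///
    gmm(L.ROA GRTH MIX EFFIC SIZE SOLV RISK LIQUID LEVR INFL ECOGRTH, collapse) ///
    iv(GRTH MIX EFFIC SIZE SOLV RISK LIQUID LEVR INFL ECOGRTH) ///
    noleveleq nodiffsargan robust small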

Scatter plot with bar graph

Hi,

I have an issue with plotting a scatter plot combine with bar graph.

x-axis is exchange rate (usd)
y-axis is stock market return (mkt)
bar graph is net supply and net demand

I would like to plot the exchange rate and the market return as a scatter plot, with the corresponding net supply and net demand as bars, only for periods where net demand or net supply is nonzero, but I don't know how to plot this.

Please refer to the picture attached to the original post.

This is my data

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(monthly usd mkt netsupply netdemand)
264  -2.0478468  2.6026726 -.36339945  -4.4860897
265   -3.548061   4.905219 -1.4746455   -6.191761
266   -2.268332  -.3730015 -1.5177087   -4.958127
267   2.1891139  -2.417745  1.7183888 -.016695466
268   -.7965394    2.75148   1.034773    1.235328
269   -4.806432  -3.081741  -.1064344    .9887276
270  -.14926174   2.023372  -.9868883   -6.543846
271  -1.6432375   -8.63042  -.1670051   -2.919932
272  -1.6013564  -.8948722   .8827286    3.333377
273   -3.158446  -7.349725  .51623994    .6833113
274   1.7308536  -2.869434  -.3589642   4.0273294
275   2.9869945  1.9039277 -1.4439176   -4.123436
276  -2.0306973   .6909526          0           0
277   .18168476  2.2888284          0           0
278   -2.492754   3.283591          0           0
279  -.02510364  -1.318898          0           0
280  -1.2056882  3.2190094          0    7.413932
281  -1.3727076 -.58127767          0           0
282  -2.2790751   6.712985          0           0
283  -2.0085866  .28069946          0           0
284   1.7866004  .14747038          0           0
285    .2549829  .19207136          0           0
286  -1.8254682  1.5198965          0           0
287   -1.056921   3.767254          0           0
288  -2.1628854   7.489547          0           0
289    5.019365  1.6776668          0           0
290    .3332144 -.14918093          0           0
291   -3.148557  1.3693012          0           0
292   -.9738461   2.628027          0           0
293  -2.0366795  -3.996268          0   -3.879605
294   -3.313439  .28118122          0   -3.170889
295    .6506623  -4.912808          0   -5.747722
296  -3.7560015   1.222479          0    8.387284
297   -.2991006  -.4632317          0           0
298  -1.3677107  1.4437127          0           0
299   -1.517194 -1.5099367 -1.3193282           0
300  -1.2432095 -1.5515237          0           0
301   -4.976399   .9448336   .7583591           0
302     6.16092   .8358523          0           0
303   -.6085969   2.977546          0           0
304      .33738 -2.0507245          0           0
305   1.0175719 -1.5831537          0           0
306    7.399529   1.823915          0           0
307  -.56519353  3.5563564          0           0
308    3.045999   5.390651          0           0
309   1.7724197  .58163613          0           0
310    2.919131  -2.761727  .57214993           0
311    .6424765  -2.209427 -1.8585502           0
312    1.881372   .0390115  -7.228486           0
313    5.454595 -4.4356527 -11.452032           0
314   -2.949179   3.479473  -6.914683           0
315    4.762784   5.959227   5.081755           0
316    -4.98297 -4.2681546  1.8545836           0
317    3.490815 -1.7711565 -1.9068682           0
318    2.231739  4.5845294  -5.160953           0
319    .6852428  -.6882348   7.192981           0
320    .8559012    6.62383 -2.0540004    8.609133
321  -2.0567417  -3.349355          0           0
322   2.7369986  -.8146685          0           0
323    1.433118   4.171295   3.849671           0
324    5.127577   -9.99656          0           0
325  -.33373895  -1.547396          0           0
326   2.0369604   .6943071          0           0
327   1.2747744  3.4344125          0           0
328   -1.570745  .56554157          0           0
329   -.5054746 -2.1775107          0           0
330  -1.1351837  2.2197416          0           0
331   2.0284727  -2.059003          0           0
332   -.7645482   3.374284          0    -2.89184
333    3.448859  2.8669145          0  -2.4535186
334    5.003475 -1.0862093          0   -13.08276
335       2.878  -8.394817          0   -6.471038
336  -4.0246353  -3.604314          0           0
337  -1.0641106   1.864765          0           0
338   1.0285614   8.815811          0           0
339   -.7560795   1.176813          0           0
340   -2.339489  2.5349166          0           0
341   -5.294498   2.631751          0           0
342   -2.862269   3.649575          0           0
343  -1.2069873   2.827087          0           0
344  -.12935299  -.8367257          0           0
345   4.4109654  -.4823661          0           0
346   2.0764723  1.1829519          0           0
347  -2.0971348   1.415238          0           0
348   -3.010541  -3.461796          0           0
349 -.030732745   5.039765          0           0
350   -2.588777  4.1222777  4.5398674           0
351    -7.12883  1.1752771          0           0
352   -8.811241   3.891067          0           0
353   -.9347946   9.330248          0           0
354   -.7597386   -6.05091  -3.531442           0
355   -3.523827   5.498465          0           0
356    3.002756   4.945375          0           0
end
format %tm monthly
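For what it's worth, a hedged sketch that plots everything against the monthly time axis, with the bars restricted to nonzero months (whether this matches the intended picture is a guess):

Code:
twoway (bar netsupply monthly if netsupply != 0) ///
       (bar netdemand monthly if netdemand != 0) ///
       (scatter usd monthly) (scatter mkt monthly), ///
       legend(order(1 "Net supply" 2 "Net demand" 3 "USD" 4 "Market return"))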





Survival analysis - creating "failure" variable

Dear all,

I'm hoping for some advice re. creating a 'failure" variable for a survival (or time till event) analysis I'm working on.

We're looking at intraocular pressure (IOP) after surgery.
"Failure" is either (i) IOP >21, (ii) IOP <20% decrease from first IOP, or (iii) re-operation.

I have my data in long format, where the IOPs are documented (_iop), time from the first operation is calculated (iop_days), and observations since a unique surgery are numbered (iop_no), using
Code:
bysort patient_guid _date (iop_date): gen iop_no = _n
I have merged them to unique patient IDs, using the merge m:m function.

I hope the screenshot (attached) is of help.

I would really appreciate any advice anyone has for me, regarding creating these failure variables.
Thanks very much for your time, and consideration.
Will
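For what it's worth, a hedged sketch of the failure flag (variable names as described in the post; reop is a hypothetical reoperation indicator, and criterion (ii) is read as "less than a 20% drop from the first IOP"):

Code:
bysort patient_guid _date (iop_no): gen first_iop = _iop[1]
gen byte failure = (_iop > 21 | _iop > 0.8*first_iop | reop == 1) ///
    if !missing(_iop)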



Calculating advertising returns after estimation

Hello everyone, the following question may sound a bit silly but I have been a bit puzzled about this and wanted to share it with you - also for pedagogical reasons to whoever reads any potential answers.

I am using firm-level data and want to estimate the returns of advertising on revenue. Every variable in my model is in logarithmic form, and I estimate the following:

Code:
xtreg lnrevenue lnage lnemployment lnadvertising, fe vce(robust)
The estimated coefficient on lnadvertising is 0.014 (p=0.000). This means that a 1% increase in advertising expenditure increases revenue by 0.014%. To calculate the return, I multiply the coefficient on lnadvertising (0.014) by the ratio of the median revenue to the median advertising expenditure of the firms in my sample (7,300,000/680,000 ≈ 10.74) to yield:

advertising return = 0.014*7,300,000/680,000 ≈ 0.15

This means that for every $1 spent on advertisement, there is a $0.15 increase in revenue (or a return of 15%).
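For reference, the step from elasticity to return follows directly from the log-log specification:

$$\frac{\partial \ln R}{\partial \ln A}=\beta \quad\Longrightarrow\quad \frac{\partial R}{\partial A}=\beta\,\frac{R}{A}\approx 0.014\times\frac{7{,}300{,}000}{680{,}000}\approx 0.15,$$

with $R$ revenue and $A$ advertising expenditure, evaluated here at the sample medians.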

My question is: can I use the estimated coefficient on lnadvertising (i.e. 0.014) and multiply it with the ratio of the revenue over advertising expenditure for each of the firms in my sample and get individual firm advertising returns on revenue? In other words, does the advertising_returns variable created as below capture individual firm advertising returns?

Code:
generate advertising_returns = 0.014*revenue/advertising
I understand that the value of an estimated coefficient in a regression model is the mean change in the dependent variable due to a 1-unit change in the independent variable (ceteris paribus), and therefore it refers to ALL firms in the sample rather than to each firm individually. Of course, my assumption here is that the effect of advertising on revenue is the same for each firm (which is indeed not plausible from a theoretical perspective; but if we assume it holds, is my rationale for calculating advertising_returns correct?).

Export unit root test results to excel?

Hello,
We have estout/esttab, outreg2 commands to exports regression output.
Is it possible to export unit root test (Harris–Tzavalis test) results using these packages?
If yes, what is the code? I have a panel dataset and I'm testing the Harris–Tzavalis test.


Code:



xtset id Daily2
tsset id Daily2
xtunitroot ht Return if event_window==1
esttab using example1.csv, replace

which gives me the following result in Stata:

Harris-Tzavalis unit-root test for Return
-----------------------------------------
Ho: Panels contain unit roots        Number of panels  = 214
Ha: Panels are stationary            Number of periods =  85

AR parameter: Common                 Asymptotics: N -> Infinity
Panel means:  Included                            T Fixed
Time trend:   Not included
------------------------------------------------------------------------------
                    Statistic        z          p-value
------------------------------------------------------------------------------
 rho                  -0.0010     -3.8e+02      0.0000
------------------------------------------------------------------------------

Thanking you in advance.
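For what it's worth, xtunitroot is not an estimation command, so esttab/outreg2 cannot pick it up; its results live in r() and can be written out directly. A hedged sketch (the exact r() names are unverified here; -return list- shows them):

Code:
xtunitroot ht Return if event_window==1
return list                      // inspect the stored r() results
local rho = r(rho)               // assumed name; confirm with return list
putexcel set example1.xlsx, replace
putexcel A1 = "rho" B1 = `rho'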

Can Stata show the p-value for each variable used in the discriminant functions?

I get my results and I can interpret the coefficient of each independent variable in each function.

However, can we also get a Prob>F column to see whether each variable is significant in each discriminant function, just like we have in regression models? Thanks.


Recode Var

Hello. How can we find duplicates in a dataset? And which command can I use to change the observations of a variable? For example, for an occupation variable with 1 = driver, 2 = doctor, 3 = teacher, 4 = architect, 5 = seller, I would like 1 to become 2 and 3 to become 4; that is, change the values and not the labels.
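For what it's worth, a hedged sketch of both pieces (occupation is a hypothetical variable name):

Code:
duplicates report                    // tally duplicate observations
duplicates list                      // show them
recode occupation (1 = 2) (3 = 4)    // changes values, not labels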

Triple conditions such as relimp==relat==1 do not do what I expect them to do. Why? Where is this behaviour explained?

Working on a problem posed in this thread
https://www.statalist.org/forums/for...other-variable
I discovered to my shock that triple conditions such as relimp==relat==1 and relimp==relat==3 do not do what I expect them to do.

In my mind, (relimp==relat==1) should be equivalent to (relimp==relat & relat==1). But it is not so, as the example below demonstrates.

To have some data to work with

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte wave float(relimp relat)
110 1 1 1
116 1 1 1
116 2 1 1
116 3 1 1
123 1 3 3
123 2 3 3
123 3 3 2
123 4 3 3
123 5 3 2
126 2 3 3
126 3 3 3
126 4 3 3
138 3 3 2
end
I want to generate two dummies, dummy1 equal to one when relimp==relat==1, and dummy3 equal to one when relimp==relat==3. And the triple condition fails me in both cases:

Code:
. gen dummy1 = relimp==relat==1

. gen dummy11 = relimp==relat & relat==1

. gen dummy3 = relimp==relat==3

. gen dummy33 = relimp==relat & relat==3

. compare dummy1 dummy11

                                        ---------- difference ----------
                            count       minimum      average     maximum
------------------------------------------------------------------------
dummy1=dummy11                  7
dummy1>dummy11                  6             1            1           1
                       ----------
jointly defined                13             0     .4615385           1
                       ----------
total                          13

. compare dummy3 dummy33

                                        ---------- difference ----------
                            count       minimum      average     maximum
------------------------------------------------------------------------
dummy3<dummy33                  6            -1           -1          -1
dummy3=dummy33                  7
                       ----------
jointly defined                13            -1    -.4615385           0
                       ----------
total                          13

. list, sep(0)

     +-------------------------------------------------------------------+
     |  id   wave   relimp   relat   dummy1   dummy11   dummy3   dummy33 |
     |-------------------------------------------------------------------|
  1. | 110      1        1       1        1         1        0         0 |
  2. | 116      1        1       1        1         1        0         0 |
  3. | 116      2        1       1        1         1        0         0 |
  4. | 116      3        1       1        1         1        0         0 |
  5. | 123      1        3       3        1         0        0         1 |
  6. | 123      2        3       3        1         0        0         1 |
  7. | 123      3        3       2        0         0        0         0 |
  8. | 123      4        3       3        1         0        0         1 |
  9. | 123      5        3       2        0         0        0         0 |
 10. | 126      2        3       3        1         0        0         1 |
 11. | 126      3        3       3        1         0        0         1 |
 12. | 126      4        3       3        1         0        0         1 |
 13. | 138      3        3       2        0         0        0         0 |
     +-------------------------------------------------------------------+
Does anybody know why Stata does not interpret (relimp==relat==1) as equivalent to ( relimp==relat & relat==1), and where this behaviour is explained?
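For what it's worth, the behaviour seems to follow from relational operators evaluating left to right and returning 0 or 1: relimp==relat==1 parses as (relimp==relat)==1, which is true whenever relimp==relat, while relimp==relat==3 parses as (relimp==relat)==3, which is never true since 0 or 1 never equals 3. That matches the listing above:

Code:
display (3==3)==1    // 1: the first comparison returns 1, and 1==1
display (3==3)==3    // 0: the first comparison returns 1, and 1!=3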


svydescribe for mi set data

I have been hunting around and trying everything I can think of but cannot figure out how to get the information I need. I have some mi svy data that I did not personally svyset. I want to check the svyset settings to see if they are what I think they should be, but svydescribe does not work on mi data. I have tried every variation of syntax I can think of and done a ton of internet searching and reading of the online documentation for both commands, and I see absolutely no way to find out how the data has been svyset. This is really worrying, as it means I can never check the survey structure of data that has been mi set. There must be some way to retrieve this info? I just need to check which weight was used, and what the strata/psu were set to be. Is there any simple way to do this that I am missing?
Thanks in advance for your time!
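For what it's worth, a hedged guess at a workaround: the declaration commands themselves, typed without arguments, report the current settings:

Code:
svyset      // with no arguments, reports the survey design settings
mi query    // reports the mi style and the number of imputations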

Panel data preparation and xtreg function

Hi

I am new to econometrics and Stata, and would appreciate some help with the following questions related to my dissertation. I am investigating the impact on property prices of a rail station opening in 2010.

The intention is to develop a fixed-effects model on a panel dataset, using a difference-in-differences design with treated and control transactions.
(a) find price growth within 0-1km, 1-2km (treated) against 2-3km (control) of rail stations
(b) investigate any anticipation effect, i.e., in which year prices started increasing before 2010 and by how much each year (in percentage terms), as well as the trend after 2010.

I have property transactions from 1995 to 2019 and, using GIS software, I have filtered the transactions within 0-1km, 1-2km and 2-3km of stations. I have created a column named buffer_km with values 1/2/3 denoting the corresponding distance band.

However, I am rather confused about the category variable and where to put 0 and 1. I have tried two methods, and they gave very different coefficients for buffer_km1/2 and r1km/r2km.

Code:
gen logprice = log(price)
encode type, gen(typehouse)
gen YearsStr = substr(date,1,4)
encode YearsStr, gen(YearsS)
encode lsoa11, gen(LSOA_num)
encode tenure, gen(Tenure)
gen YearsN = real(YearsStr)
gen r1km = (YearsN>=2010 & buffer_km==1)
gen r2km = (YearsN>=2010 & buffer_km==2)
gen r3km = (YearsN>=2010 & buffer_km==3)
xtset LSOA_num

xtreg logprice ib4.typehouse ib2.Tenure i.YearsS ib3.buffer_km, fe
xtreg logprice ib4.typehouse ib2.Tenure i.YearsS r1km r2km, fe
Which method should I use to answer question (a) above, i.e. to measure price growth within 0-1km and 1-2km comparing transactions before and after 2010?

For question (b) on the anticipation effect, I was told I have to find a way to regress relative to 2010, but I don't really know how to do it. What do the coefficients in the output under YearsS 1996 to 2019 mean in statistical terms?
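One approach I have seen mentioned for (b) is an event study: interact the year dummies with the treated rings. A sketch, and I am unsure it is right (it creates ring dummies from my buffer_km variable and assumes 2009, the year before opening, as the omitted reference year):
Code:
* event-study sketch: year dummies interacted with the treated rings,
* with 2009 (the year before opening) as the omitted reference year;
* the year#ring coefficients would then trace out anticipation before
* 2010 and the trend after 2010, relative to the 2-3km control ring
gen byte ring1 = (buffer_km==1)
gen byte ring2 = (buffer_km==2)
xtreg logprice ib4.typehouse ib2.Tenure ib2009.YearsN ///
    ib2009.YearsN#i.ring1 ib2009.YearsN#i.ring2, fe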

Grateful for any advice to the queries above. Thank you!


Regards
Cleo


the conclusion for the correlation of variables

Hi
I am working with panel data (N=760, T=8). When I check the correlations among my variables, I get the results below.
Code:
. cor lgdp lpgdp llpitotal ldistance dummyrta dummyland
(obs=760)

             |     lgdp    lpgdp llpito~l ldista~e dummyrta dummyl~d
-------------+------------------------------------------------------
        lgdp |   1.0000
       lpgdp |   0.5828   1.0000
   llpitotal |   0.6877   0.8127   1.0000
   ldistance |  -0.1035   0.0362  -0.0964   1.0000
    dummyrta |   0.1802  -0.0449   0.0936  -0.6347   1.0000
   dummyland |  -0.1100   0.0345  -0.0166  -0.0981  -0.0314   1.0000


. vif

    Variable |       VIF       1/VIF  
-------------+----------------------
   llpitotal |      3.81    0.262300
       lpgdp |      3.16    0.316329
        lgdp |      2.01    0.498312
    dummyrta |      1.76    0.566745
   ldistance |      1.74    0.575154
   dummyland |      1.05    0.949584
-------------+----------------------
    Mean VIF |      2.26
As the first table shows, the correlation between llpitotal and lpgdp is quite high (0.81), but in the VIF test the values are quite small.
So what should I conclude about the correlation between llpitotal and lpgdp, and is there any problem if I use these two variables together in estimation?
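For what it is worth, I know that the VIF of a variable is 1/(1-R²) from regressing it on all the other regressors, so a pairwise correlation of 0.81 and a VIF near 3 are not inconsistent. A quick check for lpgdp:
Code:
* VIF_j = 1/(1 - R2_j), where R2_j comes from regressing variable j
* on the other regressors; for lpgdp this should reproduce VIF = 3.16
regress lpgdp lgdp llpitotal ldistance dummyrta dummyland
display "implied VIF = " 1/(1 - e(r2))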
Thanks

Saturday, August 29, 2020

How to fix heteroskedasticity and autocorrelation on OLS

Hi
I ran an OLS regression on my panel data (N=760 and T=8) and checked for heteroskedasticity and autocorrelation as below (the results show that both are present):

Code:
White's test for Ho: homoskedasticity
         against Ha: unrestricted heteroskedasticity

         chi2(25)     =    268.88
         Prob > chi2  =    0.0000

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |     268.88     25    0.0000
            Skewness |      40.02      6    0.0000
            Kurtosis |       1.57      1    0.2105
---------------------+-----------------------------
               Total |     310.46     32    0.0000

. xtserial lexport lgdp lpgdp llpitotal ldistance dummyrta dummyland

Wooldridge test for autocorrelation in panel data
H0: no first-order autocorrelation
    F(  1,      94) =    110.533
           Prob > F =      0.0000
How can I fix this? Is it OK to add vce(cluster country1) to the OLS regression to deal with it?
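To clarify what I mean, a sketch of the clustered version (country1 is my panel identifier, and I assume lexport is the dependent variable, as in the xtserial call above):
Code:
* cluster-robust standard errors are valid under arbitrary
* heteroskedasticity and within-panel serial correlation
regress lexport lgdp lpgdp llpitotal ldistance dummyrta dummyland, ///
    vce(cluster country1)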
Thank you

AR(1) and AR(2) test in dynamic panel GMM estimation?

Hi,

I am using diff-GMM and sys-GMM for an unbalanced panel with T=5 and N=84 countries. I am trying to obtain a Maintained Statistical Model (MSM) following the guidelines given by Kiviet (2020, Econometrics and Statistics). Across different model specifications, the p-values of both the AR(1) and AR(2) tests exceed 0.1.

But Kiviet (2020) suggests that the p-value of the AR(1) test should be less than 0.05.

Will the MSM be invalid if the AR(1) p-value exceeds 0.05 (or 0.1)?
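For reference, a sketch of where I read the p-values from. xtabond2 is from SSC, y and x here are placeholders for my actual variables, and I believe the test p-values are saved in e(ar1p) and e(ar2p):
Code:
* sketch with placeholder variables y and x: xtabond2 (SSC) reports
* the Arellano-Bond AR(1)/AR(2) tests after estimation
xtabond2 y L.y x, gmm(y, lag(2 .)) iv(x) twostep robust
display "AR(1) p-value = " e(ar1p) "   AR(2) p-value = " e(ar2p)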

Correcting the problem of Multicollinearity in categorical variables

Dear All,

Among my explanatory variables I have a number of categorical variables, and after testing for multicollinearity I found that some of them suffer from it. I would appreciate assistance on what to do to correct this problem.
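To show the kind of check I ran, a sketch with hypothetical names catvar1/catvar2 and outcome y. As far as I know estat vif is not allowed after regress with factor variables, so I created explicit dummies first:
Code:
* sketch (catvar1, catvar2, y are hypothetical names): build explicit
* dummies so estat vif can be computed, then inspect the large VIFs
tabulate catvar1, generate(c1_)
tabulate catvar2, generate(c2_)
regress y c1_2-c1_4 c2_2-c2_3   // omit one dummy per variable;
estat vif                       // adjust ranges to your category counts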

Thanks.

Regards,
Stephen.

Finding duplicate variables

Dear Statalisters,

I am writing because I would like to find out whether I have duplicate variables in my dataset.
I have been looking through previous posts, but I only found a recommendation to run pwcorr, which I think is an interesting possibility; however, with almost 400 variables the output would be very hard to inspect.
If you could advise me, I would really appreciate it.
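One alternative to eyeballing a 400-variable correlation matrix might be an exact pairwise comparison. A sketch (numeric variables only; it treats identical missing values as equal):
Code:
* sketch: flag variable pairs that are identical in every observation
ds, has(type numeric)
local vars `r(varlist)'
local k : word count `vars'
forvalues i = 1/`=`k'-1' {
    local vi : word `i' of `vars'
    forvalues j = `=`i'+1'/`k' {
        local vj : word `j' of `vars'
        capture assert `vi' == `vj'
        if _rc == 0 display "`vi' duplicates `vj'"
    }
}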

Thank you so much !!
Alejandro

Coding a new variable: where a category selects 'high' responses for one variable only if there are low responses for another variable

Hi Statalist.

I want to code 'impat' with three categories (1 = both low, 2 = one high, 3 = both high). These three levels reflect the levels/categories of each of the categorical variables in my dataset. Note, each variable has a value for the respondent (e.g. relimp, relat) and for their partner (e.g. p_relimp, p_relat). In words, I want "impat == 2" only if relimp or relat == 3, but not both, as the both-high case is captured by "impat == 3".
Code:
gen impat = 1 if (relimp2 == p_relimp2 & relimp2 == 1) & (relat2 == p_relat2 & relat2 == 1)     // both low
replace impat = 2                                                                     // one high (import/attend) -- placeholder: condition still needed
replace impat = 3 if (relimp2 == p_relimp2 & relimp2 == 3) & (relat2 == p_relat2 & relat2 == 3) // both high
I believe I have coded 'impat = 1' and 'impat = 3' correctly, so I would appreciate help coding 'impat == 2'.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) byte wave float(relimp2 p_relimp2 relat2 p_relat2)
100010 1200063 1 1 1 1 1
100010 1200063 2 2 1 1 1
100014  100015 1 3 1 1 1
100014  100015 2 3 1 1 1
100014  100015 3 3 1 1 1
100016  200179 1 1 1 1 1
100016  200179 2 1 1 1 1
100016  200179 3 1 1 1 1
100018  100019 1 2 1 1 1
100018  100019 2 3 1 1 1
100018  100019 3 2 1 1 1
100018  100019 4 3 1 1 1
100023  100024 1 3 3 3 3
100023  100024 2 3 3 3 3
100023  100024 3 3 3 2 2
100023  100024 4 3 3 3 3
100023  100024 5 3 3 2 2
100025 1200332 1 3 1 3 3
100026 1000535 2 3 3 3 3
100026 1000535 3 3 3 3 3
100026 1000535 4 3 3 3 3
100029  100030 1 1 2 1 1
100029  100030 2 1 2 1 1
100029  100030 3 2 1 2 1
100038  100039 1 3 2 1 1
100038  100039 2 3 2 1 1
100038  100039 3 3 3 2 2
100038  100039 4 2 1 1 1
end
Note: I have included multiple responses over different waves for a number of couples because I also want to ask how to deal with changes in responses over time in my code. Should I take the average or the last level recorded?
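To make the main gap concrete, one possible reading of "one high" (a sketch only; I am not sure it matches the intended definition) is that exactly one of the two measures is high for both partners:
Code:
* sketch: impat = 2 when exactly one of the two measures is high (3)
* for both partners; leaves already-coded 1s and 3s untouched
replace impat = 2 if missing(impat) & ///
    ((relimp2==3 & p_relimp2==3) + (relat2==3 & p_relat2==3)) == 1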

Count and record

Hi all,

My data has four variables: acty, deady, nation, and id.
"acty" is the year when company is listed in stock market.
"deady" is the year when company is delisted from stock market.
"nation" is the country code.
"id" is the company id.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(acty deady) str3(nation id)
2000 2000 "AUS" "1" 
2012 2017 "AUS" "2" 
2002 2004 "AUS" "3" 
2005 2008 "AUS" "4" 
2000 2006 "AUS" "5" 
2000 2010 "AUS" "6" 
2000 2002 "AUS" "7" 
2000 2006 "AUS" "8" 
2008 2019 "AUS" "9" 
2000 2000 "AUS" "10"
2016 2019 "AUS" "11"
2001 2019 "AUS" "12"
2000 2001 "AUS" "13"
2000 2005 "AUS" "14"
2007 2019 "AUS" "15"
2000 2000 "AUS" "16"
2000 2000 "AUS" "17"
2017 2019 "AUS" "18"
2000 2019 "AUS" "19"
2000 2002 "AUS" "20"
2000 2019 "AUS" "21"
2002 2019 "AUS" "22"
2006 2009 "AUS" "23"
2001 2019 "AUS" "24"
2007 2019 "AUS" "25"
2015 2019 "AUS" "26"
2000 2004 "AUS" "27"
2000 2004 "AUS" "28"
2000 2003 "AUS" "29"
2000 2005 "AUS" "30"
2000 2003 "AUS" "31"
2000 2002 "AUS" "32"
2000 2017 "AUS" "33"
2000 2001 "AUS" "34"
2005 2019 "AUS" "35"
2003 2007 "AUS" "36"
2015 2017 "AUT" "1" 
2000 2002 "BEL" "1" 
2015 2019 "BEL" "2" 
2016 2018 "BEL" "3" 
2015 2019 "BEL" "4" 
2001 2016 "BEL" "5" 
2009 2019 "BEL" "6" 
2014 2019 "BEL" "7" 
2017 2018 "BEL" "8" 
2002 2020 "CAN" "1" 
2000 2020 "CAN" "2" 
2005 2009 "CAN" "3" 
2002 2014 "CAN" "4" 
2011 2015 "CAN" "5" 
end
format %ty acty
format %ty deady
I need to count the number of companies in each nation that were active in a given year and record the number as a new variable.
For example, in AUS, 21 companies were active in 2000 (acty<=2000<=deady), 19 were active in 2001 (acty<=2001<=deady), and so on.
The output would look like:
nation year nbr
AUS 2000 21
AUS 2001 19
AUS 2002 19
AUS 2003
AUS 2004
AUS 2005
AUS 2006
AUS 2007
AUS 2008
AUS 2009
AUS 2010
AUS 2011
AUS 2012
AUS 2013
AUS 2014
AUS 2015
AUS 2016
AUS 2017
BEL 2000
BEL 2001
BEL 2002
BEL 2003
BEL 2004
BEL 2005
BEL 2006
BEL 2007
BEL 2008
BEL 2009
BEL 2010
BEL 2011
BEL 2012
BEL 2013
BEL 2014
BEL 2015
BEL 2016
BEL 2017
CAN 2000
CAN 2001
CAN 2002
CAN 2003
CAN 2004
CAN 2005
CAN 2006
CAN 2007
CAN 2008
CAN 2009
CAN 2010
CAN 2011
CAN 2012
CAN 2013
CAN 2014
CAN 2015
CAN 2016
CAN 2017
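In case it clarifies the goal, here is a sketch of one approach I was considering (expand each company to one row per active year, then count rows per nation-year), though I am not sure it is correct:
Code:
* sketch: one row per company-year of activity, then count by cell
preserve
gen nyears = deady - acty + 1
expand nyears
bysort nation id (acty): gen year = acty + _n - 1
contract nation year, freq(nbr)
list nation year nbr, sepby(nation)
restore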
Please help me with this task.
Thank you for any help you can offer.


Creating a variable equal to 1 if a name (i.e., string) at time t-4 appears at time t

I have Brazilian electoral data with the names of candidates who ran for mayor from 2000 to 2016. The dataset contains information about the candidates' names (NOME_CANDIDATO), their party codes (party_code), the codes of the states (ibge_uf_code) and municipalities (ibge_mun_code) where they ran, and the number of votes they got (QTDE_VOTOS).

What I need: to create a variable called "incumbent_run_reelection" equal to 1 if the candidate who won an election at t-4 ran for reelection at election t, and zero otherwise (e.g., if John Doe won the election in 2000 and ran again in 2004, incumbency for him would be equal to 1). My goal is to know which municipalities have incumbent mayors running for reelection every election year.

The problem: I only have a string variable to work with. One possibility would be to encode "NOME_CANDIDATO", but the issue is that there are hundreds of repeated names in the dataset across municipalities.

Question: how can I create the "incumbent_run_reelection" variable with the "NOME_CANDIDATO" variable?

Thanks


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int year byte ibge_uf_code str2 uf long ibge_mun_code str61 NOME_CANDIDATO byte party_code str13 party_label long QTDE_VOTOS
2000 11 "RO" 1100015 "NEREU JOSE KLOSINSKI"             13 "PT"    1655
2008 11 "RO" 1100015 "DANIEL DEINA"                     19 "PTN"   7930
2012 11 "RO" 1100015 "GIOVAN DAMO"                      45 "PSDB"  6552
2016 11 "RO" 1100015 "VALDOIR GOMES FERREIRA"           15 "PMDB"  3638
2016 11 "RO" 1100015 "CARLOS BORGES DA SILVA"           11 "PP"    9553
2004 11 "RO" 1100015 "VALDOIR GOMES FERREIRA"           12 "PDT"   7054
2000 11 "RO" 1100015 "MOISES JOSE RIBEIRO DE OLIVEIRA"  14 "PTB"   5133
2008 11 "RO" 1100015 "VALDOIR GOMES FERREIRA"           15 "PMDB"  6394
2012 11 "RO" 1100015 "VALDOIR GOMES FERREIRA"           15 "PMDB"  7465
2000 11 "RO" 1100015 "JOSE PEREIRA DE ASSIS"            15 "PMDB"  1075
2000 11 "RO" 1100015 "DARCILA TERESINHA CASSOL"         11 "PPB"   5452
2004 11 "RO" 1100015 "DARCILA TERESINHA CASSOL"         45 "PSDB"  6431
2004 11 "RO" 1100023 "ANTENOR KLOCH"                    43 "PV"    2601
2008 11 "RO" 1100023 "CONFUCIO AIRES MOURA"             15 "PMDB" 31329
2012 11 "RO" 1100023 "ADELINO ANGELO FOLLADOR"          25 "DEM"   8290
2000 11 "RO" 1100023 "ELISEU MULLER DE SIQUEIRA"        13 "PT"    3581
2004 11 "RO" 1100023 "CONFUCIO AIRES MOURA"             15 "PMDB" 20097
2000 11 "RO" 1100023 "ERNANDES SANTOS AMORIM"           11 "PPB"  16898
2000 11 "RO" 1100023 "MARCOS JUNIOR DOS SANTOS"         25 "PFL"   2419
2012 11 "RO" 1100023 "SAULO PIGNATON"                   14 "PTB"  12806
2008 11 "RO" 1100023 "LORIVAL RIBEIRO DE AMORIM"        27 "PSDC" 11122
2012 11 "RO" 1100023 "LORIVAL RIBEIRO DE AMORIM"        33 "PMN"  20371
2000 11 "RO" 1100023 "DONIZETTI JOSE"                   14 "PTB"   7209
2012 11 "RO" 1100023 "VALMIR FRANCISCO DOS SANTOS"      13 "PT"    4351
2008 11 "RO" 1100023 "ANTONIO EVERALDO JOCA"            50 "PSOL"   590
2004 11 "RO" 1100023 "DANIELA SANTANA AMORIM"           14 "PTB"  17736
2000 11 "RO" 1100023 "JOAO MARIA DE LIZ"                23 "PPS"   3392
2016 11 "RO" 1100023 "THIAGO LEITE FLORES PEREIRA"      15 "PMDB" 26808
2016 11 "RO" 1100023 "LORIVAL RIBEIRO DE AMORIM"        12 "PDT"  19333
2016 11 "RO" 1100031 "SILVENIO ANTONIO DE ALMEIDA"      15 "PMDB"  1868
2008 11 "RO" 1100031 "JOSE ROZARIO BARROSO"             22 "PR"    1458
2004 11 "RO" 1100031 "MOZAIR DIVINO DOS SANTOS"         13 "PT"    1723
2004 11 "RO" 1100031 "JOSE ROZARIO BARROSO"             12 "PDT"   1941
2000 11 "RO" 1100031 "JOSE ROZARIO BARROSO"             12 "PDT"   1275
2008 11 "RO" 1100031 "GILMAR DE CARLI"                  25 "DEM"   1814
2008 11 "RO" 1100031 "IZAEL DIAS MOREIRA"               14 "PTB"   2276
2016 11 "RO" 1100031 "GILMAR DE CARLI"                  43 "PV"    1858
2000 11 "RO" 1100031 "MOZAIR DIVINO DOS SANTOS"         13 "PT"     512
2000 11 "RO" 1100031 "MILTON MITSUO SAIKI"              25 "PFL"   2176
2012 11 "RO" 1100031 "IZAEL DIAS MOREIRA"               14 "PTB"   2889
2008 11 "RO" 1100031 "JOSE ROZARIO BARROSO"             22 "PR"    2229
2016 11 "RO" 1100049 "GLAUCIONE MARIA RODRIGUES NERI"   15 "PMDB" 19715
2000 11 "RO" 1100049 "NERI FIRIGOLO"                    13 "PT"    8151
2008 11 "RO" 1100049 "FRANCESCO VIALETTO"               13 "PT"   24601
2016 11 "RO" 1100049 "MARCO AURELIO BLAZ VASQUES"       25 "DEM"   7909
2016 11 "RO" 1100049 "ADAILTON ANTUNES FERREIRA"        10 "PRB"  12870
2012 11 "RO" 1100049 "FRANCESCO VIALETTO"               13 "PT"   21700
2016 11 "RO" 1100049 "JOSE COSTA"                       50 "PSOL"   217
2012 11 "RO" 1100049 "GLAUCIONE MARIA RODRIGUES"        27 "PSDC" 20133
2016 11 "RO" 1100049 "ACELINO LUIZ MARCON"              12 "PDT"   1604
2016 11 "RO" 1100049 "DIMAS GIACOMIN SELVATICI"         13 "PT"    1592
2004 11 "RO" 1100049 "SUELI ALVES ARAGAO"               15 "PMDB" 23035
2008 11 "RO" 1100049 "GLAUCIONE MARIA RODRIGUES NERI"   27 "PSDC" 14653
2000 11 "RO" 1100049 "SUELI ALVES ARAGAO"               15 "PMDB" 16686
2000 11 "RO" 1100049 "VILSON STECCA"                    23 "PPS"   7231
2004 11 "RO" 1100049 "DIVINO CARDOSO CAMPOS"            14 "PTB"  14900
2008 11 "RO" 1100049 "SILVERIO DOS SANTOS OLIVEIRA"     40 "PSB"   1243
2000 11 "RO" 1100056 "AMALIA CAMPOS MILANI E SILVA"     22 "PL"     499
2012 11 "RO" 1100056 "PEDRO JOSE ALVES SANCHES"         15 "PMDB"  4154
2016 11 "RO" 1100056 "KLEBER CALISTO DE SOUZA"          15 "PMDB"  3500
2008 11 "RO" 1100056 "KLEBER CALISTO DE SOUZA"          15 "PMDB"  7313
2000 11 "RO" 1100056 "MANOEL FRANCISCO DE ALMEIDA"      12 "PDT"   2124
2000 11 "RO" 1100056 "AIRTON GOMES"                     25 "PFL"   1440
2004 11 "RO" 1100056 "MANOEL FRANCISCO DE ALMEIDA"      45 "PSDB"  6924
2000 11 "RO" 1100056 "ISRAEL NEIVA DE CARVALHO"         15 "PMDB"  1764
2012 11 "RO" 1100056 "AIRTON GOMES"                     11 "PP"    6159
2000 11 "RO" 1100056 "MARIA FERREIRA"                   13 "PT"    1948
2008 11 "RO" 1100056 "VANESSA SIMOES D EFREITAS"        13 "PT"    2216
2004 11 "RO" 1100056 "VALDIR BENEDITO NAVARRO"          25 "PFL"   2895
2000 11 "RO" 1100056 "JOSE EUGENIO DE SOUZA"            14 "PTB"   2440
2016 11 "RO" 1100056 "AIRTON GOMES"                     11 "PP"    6421
2004 11 "RO" 1100064 "MARIO RODRIGUES LEITE"            15 "PMDB"  3859
2000 11 "RO" 1100064 "CERENEU JOAO NAUE"                12 "PDT"   4179
2000 11 "RO" 1100064 "EDSON LOPES DA SILVA"             45 "PSDB"  1430
2016 11 "RO" 1100064 "EDMILSON RODRIGUES DE ALMEIDA"    43 "PV"    3126
2016 11 "RO" 1100064 "JOSEMAR BEATTO"                   27 "PSDC"  2085
2012 11 "RO" 1100064 "APARECIDO DIAS DE OLIVEIRA"       13 "PT"    4518
2008 11 "RO" 1100064 "ANEDINO CARLOS PEREIRA JUNIOR"    11 "PP"    5699
2000 11 "RO" 1100064 "JOSE RODRIGUES DE SOUZA"          11 "PPB"   2203
2012 11 "RO" 1100064 "ANEDINO CARLOS PEREIRA JUNIOR"    11 "PP"    5976
2000 11 "RO" 1100064 "JUSSARA DIAS LEOPOLDO FERREIRA"   13 "PT"    2998
2000 11 "RO" 1100064 "JOUBERT ANTONIO MURACAMI"         40 "PSB"     68
2008 11 "RO" 1100064 "APARECIDO DIAS DE OLIVEIRA"       13 "PT"    4698
2016 11 "RO" 1100064 "JOSE RIBAMAR DE OLIVEIRA"         40 "PSB"   4519
2004 11 "RO" 1100064 "MIRIAN DONADON CAMPOS"            14 "PTB"   6741
2000 11 "RO" 1100072 "LEIDSON FERREIRA DE SOUSA"        15 "PMDB"  2429
2000 11 "RO" 1100072 "JOSE NUNES NETO"                  13 "PT"    1713
2012 11 "RO" 1100072 "DEOCLECIANO FERREIRA FILHO"       14 "PTB"   2100
2004 11 "RO" 1100072 "JOSUE DA SILVA LOPES"             45 "PSDB"   652
2016 11 "RO" 1100072 "LAERCIO MARCHINI"                 12 "PDT"   2206
2016 11 "RO" 1100072 "MARCELO CRISOSTOMO DO NASCIMENTO" 25 "DEM"    958
2016 11 "RO" 1100072 "LEANDRO TEIXEIRA VIEIRA"          40 "PSB"   1604
2004 11 "RO" 1100072 "SILVINO ALVES BOAVENTURA"         14 "PTB"   2269
2008 11 "RO" 1100072 "JOAO PEREIRA DE AGUIAR"           40 "PSB"    115
2004 11 "RO" 1100072 "JOUBERT ANTONIO MURACAMI"         40 "PSB"     24
2004 11 "RO" 1100072 "ADENIVAL MARCON"                  44 "PRP"    437
2012 11 "RO" 1100072 "RONELSON TERRES PORTELA"          23 "PPS"   1452
2012 11 "RO" 1100072 "VAGNER MEIRA TEIXEIRA"            13 "PT"    1343
2004 11 "RO" 1100072 "TEREZINHA APARECIDA ROSA"         13 "PT"    1285
2008 11 "RO" 1100072 "SELVINO ALVES BOAVENTURA"         14 "PTB"   2661
end
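A sketch of the direction I was considering, in case it helps: match on name within municipality, which sidesteps the cross-municipality namesakes. It ignores ties in QTDE_VOTOS and assumes one record per candidate per election:
Code:
* sketch: flag each municipality-year's winner (most votes), then ask
* whether that same name ran in the same municipality 4 years later
bysort ibge_mun_code year (QTDE_VOTOS): gen byte winner = (_n == _N)
bysort ibge_mun_code NOME_CANDIDATO (year): gen byte incumbent_back = ///
    (winner[_n-1]==1 & year == year[_n-1] + 4)
* municipality-year flag: did the t-4 winner run for reelection at t?
bysort ibge_mun_code year: egen incumbent_run_reelection = max(incumbent_back)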

Threshold for small/large T in quarterly panel data

Dear all,

Sorry for the basic question. I am estimating a dynamic panel data model for 40 countries over 20-30 years at quarterly frequency. I am trying to decide whether I should use an xtreg FE model or xtabond. From what I have read, xtabond is for large N and small T, while xtreg FE can work (with less dynamic panel bias) for large T. Since I have 20-30 years of data (depending on the country) at quarterly frequency, is that a small or a large T? It is fewer than 25 in terms of years, but of course more in terms of quarters. Thank you.


Best regards,

Abdan

Creating variables for husband and wife using data for respondent and their partner

Dear Statalist.

I would like help to generate a variable, say level of education "educ", for the male partner and the female partner in a union (either married or de facto). The sample from my panel dataset includes data for the respondent (hgsex mrcurr edhigh1) and their partner (p_hgsex p_mrcurr p_edhigh1), where hgsex == 1 (male), hgsex == 2 (female), mrcurr == 1 (married), and mrcurr == 2 (de facto).

Help appreciated.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(hgsex p_hgsex mrcurr p_mrcurr edhigh1 p_edhigh1) 
1 2 1 1 5 9
2 1 2 2 8 9
2 1 2 2 8 9
2 1 1 1 5 5
2 1 2 2 5 5
2 1 2 2 9 9
1 2 1 1 9 9
1 2 1 1 9 9
2 1 1 1 1 3
2 1 1 1 1 3
2 1 2 2 9 4
2 1 2 2 9 4
1 2 1 1 5 5
1 2 1 1 5 5
1 2 1 1 5 8
1 2 1 1 5 5
1 2 1 1 5 8
1 2 1 1 5 5
1 2 1 1 5 8
1 2 1 1 3 3
1 2 1 1 3 3
1 2 2 2 3 4
1 2 1 1 3 3
1 2 1 1 3 3
1 2 2 2 3 8
1 2 1 1 3 3
1 2 1 1 3 3
1 2 2 2 8 8
2 1 1 1 4 3
2 1 1 1 4 1
2 1 1 1 4 1
2 1 1 1 4 3
2 1 1 1 4 1
end
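For concreteness, the kind of code I am trying to write (a sketch; it assumes partners in this sample are always of opposite sex, so cond() can pick the right source):
Code:
* sketch: pick each partner's education by sex, for couples only
* (mrcurr 1 = married, 2 = de facto)
gen byte educ_male   = cond(hgsex==1, edhigh1, p_edhigh1) if inlist(mrcurr, 1, 2)
gen byte educ_female = cond(hgsex==2, edhigh1, p_edhigh1) if inlist(mrcurr, 1, 2)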

Choose Which Fixed Effect Has Coefficient 0

Hi Statalisters,

I'm running a regression with fixed effects by region (e.g. Americas, Africa, Oceania), using standard OLS and an encoded region-index variable. Stata automatically sets the coefficient on the Africa indicator to 0, as Africa is the first region in alphabetical order. Is there a way to change which region has its FE coefficient set to 0?

Code:
encode region, gen(region_idx)
eststo: reg y x i.region_idx
esttab * using filename.tex, se replace label
In the resulting table, Africa has coefficient 0, and all other regions have non-negative FE coefficients.
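What I have found so far, though I am not sure it is the recommended way: the ib# factor-variable operator picks the base level for a single regression, and fvset base changes the default (label list region_idx shows which numeric level corresponds to which region):
Code:
* option 1: choose the base level on the fly with the ib# operator
* (e.g. level 2 of region_idx; check codes with -label list region_idx-)
eststo: reg y x ib2.region_idx
* option 2: set the default base level once, then i. uses it
fvset base 2 region_idx
eststo: reg y x i.region_idx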

Many thanks,
Eric

help exporting lstat results after logistic regression into word/excel documents

hi all,
does anyone know of a way to export the results of lstat into a Word document or a table/Excel file?
I have tried using outreg2 and asdoc, but all this does is export the logistic regression results I've run; it ignores the lstat results.
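Partial progress, in case it helps someone answer: my understanding is that estat classification (which is what lstat now is) saves its figures in r(), so putexcel can write them out. The r() names below are what I believe they are, but please check return list; the model is a hypothetical placeholder:
Code:
* sketch: lstat (estat classification) leaves results in r();
* run -return list- to confirm the exact names in your Stata version
logistic outcome x1 x2          // hypothetical model
estat classification
local pcorr = r(P_corr)
local sens  = r(P_p1)
local spec  = r(P_n0)
putexcel set lstat_results.xlsx, replace
putexcel A1 = "Correctly classified (%)" B1 = `pcorr'
putexcel A2 = "Sensitivity (%)" B2 = `sens'
putexcel A3 = "Specificity (%)" B3 = `spec'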
grateful for any help,
thank you

Heckman Procedure for Gender Wage Gap (Oaxaca Decomposition)

I am trying to decompose the gender wage gap into explained and unexplained factors with the Blinder-Oaxaca decomposition technique, using labour survey data. To do so, I ran the following command:
Code:
oaxaca log_monthlywages_salary_casual num_frml_edu tech_deg exp_n expsq_n mar_18 ss, by(gender) svy pooled
For this I had restricted my sample to wage earners.
After reading more of the literature, I realised there may be selectivity bias and want to correct for it using the Heckman two-step procedure. Since my sample is restricted to wage earners (I dropped all other categories, such as unemployed, not willing to work, and self-employed), my LFPR (the dependent variable) for the Heckman probit regression would always be 1. Should I expand the sample for this step and then restrict it back for the Oaxaca equation?
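To make the question concrete, the two-step idea as I understand it (a sketch: wage_earner is a hypothetical full-sample indicator, z1 z2 are hypothetical exclusion restrictions, and a proper Oaxaca-Heckman would arguably run the probit separately by gender):
Code:
* sketch: selection probit on the FULL sample, inverse Mills ratio,
* then oaxaca on wage earners with the correction term included
probit wage_earner num_frml_edu exp_n mar_18 z1 z2
predict double xb_sel, xb
gen double imr = normalden(xb_sel)/normal(xb_sel)
oaxaca log_monthlywages_salary_casual num_frml_edu tech_deg exp_n ///
    expsq_n mar_18 ss imr if wage_earner==1, by(gender) pooled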

ppmlhdfe in panels

Dear everyone,

I am posting because I am confused about bilateral fixed-effects estimates:

I estimate the effects of restrictions in services (logistics, banking and transport) on bilateral trade in food goods between 36 OECD countries from 2014-2018 (panel data).
My restrictiveness variables are bilateral and vary by sector and country, but show no variation over time for some countries. My dependent variable is sectoral.

I estimate a fixed-effects model (exporter-sector-time, importer-sector-time, and exporter-importer-sector). As independent variables I have my bilateral restrictiveness measures by sector and FTAij (a free trade agreement dummy). With ppml_panel_sg and ppmlhdfe, my results are not at all significant.

On the other hand, when I use the ppml_panel_sg command without pair fixed effects but with my time-invariant bilateral cost variables (common language, common border, bilateral distance), my variables become significant.

In your view, should I use the model with bilateral fixed effects, or a two-step method to construct bilateral trade costs?
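For concreteness, the specification I mean, with hypothetical variable names. Note that the exporter-importer-sector fixed effect absorbs anything bilateral that is constant over time, which I suspect is exactly why the time-invariant cost variables cannot enter that model:
Code:
* sketch: three-way fixed effects with ppmlhdfe (hypothetical names);
* the pair-sector FE absorbs time-invariant bilateral variables
egen pair_sector = group(exp imp sector)
ppmlhdfe trade restr fta, ///
    absorb(i.exp#i.sector#i.year i.imp#i.sector#i.year pair_sector) ///
    vce(cluster pair_sector)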

Best regards.

vector of control variables for panel data

i want to know the command to get vector of control variable. I want to get vector of country level control variables - democratic accountability, log gdp, investment freedom and uncertainty index. These are single values for each country. I want to regress stock returns in this and a fixed effect dummy variable.