Dear Statalist users,
I am wondering whether I can estimate a regression like this:
Y = X + X * Y_Q2 + X * Y_Q3 + X * Y_Q4 + X * Y_Q5 + Y_Q2 + Y_Q3 +Y_Q4 + Y_Q5
Y_Q2 is an indicator variable for whether Y is in the 2nd quintile of the sample, and Y_Q3 through Y_Q5 are defined similarly.
In other words, I want to include interaction terms between X and indicators of the Y quintile.
The reason I want these interaction terms is that I suspect the relationship between Y and X is non-linear, and I want to see how the coefficient on X changes as Y changes.
The reason I don’t flip Y and X (that is, making Y the independent variable and X the dependent variable) is that it is the convention in the literature that Y is the dependent variable.
Technically I can run the regression. But I am wondering if there is any problem econometrics-wise.
I would appreciate any comments/advice. Thank you so much!
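A minimal sketch of the mechanics with factor variables, assuming the quintile indicators are built from Y itself; this only shows how to code the specification, not whether conditioning on quintiles of the dependent variable is econometrically sound, which is the question being asked:

Code:
xtile yq = Y, nq(5)                      // quintile of Y (1-5)
regress Y c.X i.yq c.X#i.yq, vce(robust)
margins yq, dydx(X)                      // slope of X within each Y quintile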
Thursday, June 30, 2022
how to replace/recode variables that are a multiple of some number
Hi, I have a variable "level" that ranges from 1 to 125.
I would like to recode it into a variable "pitch", such that values 1, 6, 11, 16, 21, ... are coded as "1" for pitch, values 2, 7, 12, 17, 22, ... are coded as "2" for pitch, and so on.
I could do that manually with recode, e.g.

Code:
recode level 1=1 6=1 11=1 etc, gen(obj)

but it would take a long time. Is there a quicker way to do this?
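A minimal sketch using mod(), assuming the pattern really does cycle in steps of 5 (1, 6, 11, ... get 1; 2, 7, 12, ... get 2; and so on up to 5):

Code:
gen pitch = mod(level - 1, 5) + 1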
Shortcut for generating new variables based on many existing
I have groups of variables contactoutcome_con_<y>_<x> and contactmethod_con_<y>_<x>, where x is a program from 1-8 and y indexes the contact attempt (1-30) associated with that program. For example, contactoutcome_con_30_8 is the outcome of the 30th contact attempt made by the 8th program, and contactmethod_con_30_8 is the contact method that was used for that attempt. I need to generate a new variable "contact_success" based on the values of these two variables. For example, contact_success should = 0 if contactoutcome_con_<y>_<x>==9 AND contactmethod_con_<y>_<x>==11. What is the shortcut for doing this for all 240 matching contactoutcome_con_<y>_<x> and contactmethod_con_<y>_<x> pairs? Thanks in advance!
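A minimal sketch with nested forvalues loops, assuming contact_success is meant to be a single variable updated across all 240 pairs (extend the replace logic to whatever other outcome/method combinations define success or failure):

Code:
gen byte contact_success = .
forvalues x = 1/8 {
    forvalues y = 1/30 {
        replace contact_success = 0 if contactoutcome_con_`y'_`x' == 9 ///
            & contactmethod_con_`y'_`x' == 11
    }
}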
create a loop to convert string variables in numeric variables
Hello,
I am Salvatore, happy to join the Stata Forum community.
I am a new user who recently started using Stata. For my thesis research, I am using panel data. When I imported the data into Stata, all my variables were converted to strings. Since my dataset consists of over 60 columns (six variables over the 12-year period 2010-2021), I would like to set up a loop that converts all the strings to numbers without typing a separate command for each one. Please help me and I will give any information that would be helpful. The command I am trying to use is foreach, but I don't know how to use it.
[foreach var...]
Thank you very much. Sorry for my mistakes in following the posting rules; this is my first post and I am still learning. I apologize for not being able to use dataex, but Stata says "input statement exceeds linesize limit. Try specifying fewer variables". I hope you will understand.
Salvatore
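A minimal sketch, assuming every string variable should become numeric and contains only numeric text (non-numeric entries would need cleaning, or destring's options, first):

Code:
destring, replace                       // with no varlist, destring tries every string variable
* or, with an explicit foreach loop:
foreach var of varlist _all {
    capture confirm string variable `var'
    if !_rc destring `var', replace
}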
Matching without replacement from a file of pairs for case-control and other applications
Short version: I’m seeking a solution to how to do 1:m matching of cases and controls without replacement, *given a file in long (“edge”) format of all possible matching pairs.* Code to create sample data occurs at the end of this post.
Longer version:
In corresponding off-list with Rima Saliba about this problem, I discovered that the code I had posted here several years ago in response to a question about this problem is wrong.
While there have been a number of threads over the years about case-control matching (see e.g. here), I have now encountered and refined the problem in a way that seems different enough from previous work to be worth posting in a more generalized way. What follows is my refined version of the problem, to which I'm interested in solutions.
Via the use of -joinby- or perhaps -rangejoin- or -cross-, one can have a file that pairs up cases or treatment subjects with matching potential controls. In such situations, analysts may want to have 1:m matching *without replacement.* Complications include that:
1) The controls that match one case/treatment subject may match many others.
2) There are varying numbers of controls available for each case, in general more (or perhaps fewer) than m.
3) If possible, one wants to avoid too “greedy” an algorithm, which can result in the extreme in one case getting assigned all m controls and some other similar case getting 0.
I have the idea that some solution involving -merge- should be possible, per some earlier threads, but I have not successfully figured out how to do that. I also have the thought that one of the many built-in or community-contributed matching commands might be used, but I have not worked that out either. I *have* discovered that some of the "greediness" problem can be avoided by having an algorithm that picks only *1* control without replacement for every case and then applying this iteratively, so a solution that only picks one control per case would solve the problem.
In that context, here is a code snippet to create what I’d consider a representative kind of data set with which to work:
Code:
// Create an "edge" file of matched pairs.
set seed 82743
local ncases = 100
local maxmatch = 100
local maxcontrolid = 2000
clear
set obs `ncases'
gen int caseid = _n
gen navail = ceil(runiform() * `maxmatch')
label var navail "# controls matched to this case"
expand navail
gen int controlid = ceil(runiform() * `maxcontrolid')
summ navail
order caseid controlid
I realize that “without replacement“ is not necessarily analytically preferable, but that’s another issue.
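For what it's worth, here is a minimal greedy sketch of the "one control per case, without replacement" step, assuming the pairs file created by the snippet above is in memory; it loops over observations, so it only illustrates the logic and will be slow on large files:

Code:
set seed 12345
gen double u = runiform()
sort u                                  // process candidate pairs in random order
gen byte matched_case = 0               // case already has its control
gen byte used_control = 0               // control already assigned to some case
gen byte keep_pair    = 0
local N = _N
forvalues i = 1/`N' {
    if matched_case[`i'] == 0 & used_control[`i'] == 0 {
        quietly replace keep_pair = 1 in `i'
        quietly replace matched_case = 1 if caseid == caseid[`i']
        quietly replace used_control = 1 if controlid == controlid[`i']
    }
}
keep if keep_pair
* repeating this on the remaining pairs (after dropping used controls and
* already-matched pairs) would build up to 1:m matching iteratively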
bysort query issues
Hello all:
I am trying to replace the first row of path with "CLL" and the second row of path with "RT" for _merge == 2 cases. The code below replaces both rows with either CLL or RT. What am I doing wrong? studyid 4 and 5 would be examples.
bys studyid: replace path[`1'] = "CLL" if _merge == 2
bys studyid: replace path[`_N'] = "RT" if _merge == 2 // not working
Code:
* Example generated by -dataex-. For more info, type help dataex clear input int studyid str55 path byte _merge 1 "Normal" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL/PL" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL" 3 2 "CLL" 3 3 "CLL" 3 3 "Mixed CLL-RT" 3 3 "Mixed CLL-RT" 3 3 "Mixed CLL-RT" 3 4 "" 2 4 "" 2 5 "" 2 5 "" 2 6 "CLL" 3 6 "Mixed CLL-RT" 3 6 "CLL" 3 6 "CLL" 3 6 "CLL" 3 6 "CLL" 3 6 "RT" 3 6 "CLL" 3 7 "Mixed CLL-RT" 3 8 "" 2 8 "" 2 9 "CLL" 3 9 "CLL" 3 9 "CLL" 3 9 "CLL" 3 10 "" 2 10 "" 2 11 "" 2 11 "" 2 12 "Normal" 3 12 "CLL" 3 12 "Normal" 3 12 "RT" 3 12 "Normal" 3 12 "Normal" 3 12 "Normal" 3 13 "CLL" 3 13 "CLL" 3 14 "Other" 3 14 "Other" 3 14 "Other" 3 14 "Other" 3 15 "CLL" 3 15 "Mixed CLL-RT" 3 15 "CLL" 3 15 "CLL" 3 15 "CLL" 3 16 "Mixed CLL-RT" 3 16 "Mixed CLL-RT" 3 16 "CLL" 3 16 "CLL" 3 17 "CLL" 3 17 "Mixed CLL-RT" 3 17 "CLL" 3 18 "CLL" 3 18 "RT" 3 18 "Normal" 3 18 "Normal" 3 18 "." 3 18 "Normal" 3 18 "Normal" 3 19 "CLL" 3 19 "CLL" 3 19 "CLL" 3 20 "" 2 20 "" 2 21 "Normal" 3 22 "" 2 22 "" 2 23 "CLL" 3 23 "Normal" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 24 "." 3 24 "CLL" 3 24 "CLL" 3 25 "CLL" 3 25 "." 3 25 "Mixed CLL-RT" 3 25 "Mixed CLL-RT" 3 26 "RT" 3 26 "CLL" 3 26 "CLL" 3 26 "CLL" 3 26 "Mixed CLL-RT" 3 27 "Normal" 3 27 "CLL" 3 end label values _merge _merge label def _merge 2 "Using only (2)", modify label def _merge 3 "Matched (3)", modify
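A minimal sketch of one alternative, assuming (as in the example) that every row of a _merge==2 studyid is itself _merge==2, so _n indexes the rows of those groups directly; note that replace does not accept a subscript on the left-hand side, which is why the posted lines do not behave as intended:

Code:
bysort studyid: replace path = "CLL" if _merge == 2 & _n == 1
bysort studyid: replace path = "RT"  if _merge == 2 & _n == 2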
Replace command with non-mutually exclusive categorical data
Hello,
I am working with a dataset from a Twitter content analysis project and am stuck trying to figure out how to take 8 categorical tweet characteristic variables (resource, news, personal experience, personal opinion, marketing, spam, question, jokes/parody) and create one "tweet characteristic" variable (code below).
The problem I am having is that the categories are not mutually exclusive. The n for JokesParody is 22, but when I run this code it reduces it to 5 since a tweet can have several of these characteristics. Any help you can provide would be very much appreciated.
gen Characteristics=.
replace Characteristics = 0 if JokesParody==1
replace Characteristics = 1 if Resource==1
replace Characteristics = 2 if News==1
replace Characteristics = 3 if PersonalExperience==1
replace Characteristics = 4 if PersonalOpinion==1
replace Characteristics = 5 if Marketing==1
replace Characteristics = 6 if Spam==1
replace Characteristics = 7 if Question==1
label var Characteristics "Tweet Characteristics"
label define Characteristics 0 "Jokes/Parody" 1 "Resource" 2 "News" 3 "Personal Experience" 4 "Personal Opinion" 5 "Marketing" 6 "Spam" 7 "Question"
label val Characteristics Characteristics
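A minimal sketch of a different angle, assuming the aim is to describe the overlap rather than to force each tweet into one category (variable names as posted); note that the cascade of replace commands above implicitly gives later categories priority over earlier ones:

Code:
egen byte n_characteristics = rowtotal(Resource News PersonalExperience PersonalOpinion Marketing Spam Question JokesParody)
tab n_characteristics    // how many tweets carry 1, 2, 3, ... characteristics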
Wednesday, June 29, 2022
Number of lags too high
Hi everyone,
I am writing as I am currently working on my bachelor's thesis with Stata.
I am doing a time series analysis.
I have found that 4 is the optimal number of lags, I have one cointegrating equation, and the variables are stationary at first difference.
However, my professor pointed out that, because I am estimating a VAR of dimension 5 with 4 lags, there are far more estimated parameters than observations.
The problem is that I do not really know what to do. Because of my research topic, I cannot really change my data.
Therefore, what would you recommend I do? Perform an additional test? Simply state this issue in the limitations of the thesis?
Thank you in advance for the help (I'm still a beginner in Stata)!
pweights in reghdfe allow a collinear variable to generate a coefficient?
Hi,
This is my first post here. The question pertains to the use of pweights in reghdfe. I have a variable that is collinear with the fixed effects (see the output of the first regression). But when I add the probability weights, reghdfe does manage to output a coefficient (see the second regression output below). I was hoping someone could point me to some resources or an explanation to help me understand why this is happening.
Thank you,
Claire
[The two regression outputs were attached as an image in the original post.]
VAR or PVAR different lags for endgenous variables
Dear community,
In a VAR or PVAR model, would it be possible for the endogenous variables to have different lags?
The PVAR seems to always use an equal number of lags, but would it be possible in a PVAR model to have, for example, 5 lags for Endegenous_var_A and 3 lags for Endegenous_var_B?
Best regards,
Farid
comparing the parametric survival regression models
What is the command for comparing parametric survival regression models? For example, I estimate the model with both exponential and Weibull distributions, but I would also like to assess which distribution is the better one to use.
Thank you!
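A minimal sketch using information criteria, assuming the data are already stset and x1 x2 stand in for your covariates; since the exponential model is a Weibull with p = 1, the test of ln(p) = 0 reported in the Weibull output is also informative:

Code:
streg x1 x2, distribution(exponential)
estimates store expo
streg x1 x2, distribution(weibull)
estimates store weib
estimates stats expo weib    // compare AIC and BIC across the two fits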
Tuesday, June 28, 2022
Dummy for export-starters and non-exporters in the period just before the export-starter enters the export market
I want to create a dummy that distinguishes export starters from non-exporters in the year before the starter enters the export market. How can I code this?
When Export[_n]==1 & Export[_n-1]==0, generate Starter=1 and Non-exporter=0 in the previous year.
Year | Firm | Export | Export_Starter(Desired dummy) |
2010 | 1 | 0 | 0 |
2011 | 1 | 0 | . |
2012 | 1 | 0 | . |
2010 | 2 | 0 | 1 |
2011 | 2 | 1 | . |
2012 | 2 | 0 | . |
2010 | 3 | 1 | . |
2011 | 3 | 1 | . |
2012 | 3 | 1 | . |
2010 | 4 | 0 | 0 |
2011 | 4 | 1 | . |
2012 | 4 | 1 | . |
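A minimal sketch following the stated rule, assuming the panel is xtset by firm and year and using the variable names from the table (the desired-output column for firm 4 in 2010 does not quite match that rule, so the exact definition may still need adjusting):

Code:
xtset Firm Year
gen byte Export_Starter = 1 if Export == 0 & F.Export == 1   // non-exporter that starts next year
replace Export_Starter  = 0 if Export == 0 & F.Export == 0   // non-exporter that stays out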
How to inform readers the context of hierarchical logistic regression that removes significant main effects
Hello all,
I'm examining a hypothetical scenario to determine how living alone and the mechanism of feedback affect a person's willingness to turn themselves in to the police for a crime. Mechanism of feedback refers to the person being told the positive or negative consequences of turning themselves in (e.g. a positive mechanism would tell the person all of the good things that come with turning himself or herself in, while a negative mechanism would tell the person all the bad things; a neutral mechanism doesn't mention any good or bad things).
I'm running a hierarchical logistic regression model using two steps: the main effects in the first step and the interaction in the second step, where v1 is the dichotomous variable "Lives alone" (yes/no) and v2 is the categorical variable "Feedback mechanism" with three categories (positive/negative/neutral).
Code:
nestreg, lr: logistic turnselfin (i.livesalone ib0.feedbackmech) (i.livesalone#ib0.feedbackmech)
Which spits out:

Code:
note: 0.livesalone omitted because of estimability.
note: 0.feedbackmech omitted because of estimability.
note: 0.livesalone#0.feedbackmech omitted because of estimability.
note: 0.livesalone#1.feedbackmech omitted because of estimability.
note: 0.livesalone#2.feedbackmech omitted because of estimability.
note: 1.livesalone#0.feedbackmech omitted because of estimability.

Block 1: 1.livesalone 1.feedbackmech 2.feedbackmech

Logistic regression                                     Number of obs =    308
                                                        LR chi2(3)    =  31.75
                                                        Prob > chi2   = 0.0000
Log likelihood = -186.56561                             Pseudo R2     = 0.0784

------------------------------------------------------------------------------
  turnselfin | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  livesalone |
 Lives alone |   1.935901    .499876     2.56   0.011     1.167055    3.211256
             |
feedbackmech |
    Positive |   4.574841   1.510175     4.61   0.000      2.39547    8.736979
     Neutral |   3.450492    1.14231     3.74   0.000     1.803369     6.60203
             |
       _cons |    .142871   .0454746    -6.11   0.000     .0765622    .2666085
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Block 2: 1.livesalone#1.feedbackmech 1.livesalone#2.feedbackmech

Logistic regression                                     Number of obs =    308
                                                        LR chi2(5)    =  33.36
                                                        Prob > chi2   = 0.0000
Log likelihood = -185.76036                             Pseudo R2     = 0.0824

-----------------------------------------------------------------------------------------
             turnselfin | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
------------------------+----------------------------------------------------------------
             livesalone |
            Lives alone |    1.93617    1.10679     1.16   0.248     .6314858    5.936405
                        |
           feedbackmech |
               Positive |   5.727273   3.287981     3.04   0.002     1.859002    17.64476
                Neutral |   2.757576   1.597961     1.75   0.080     .8856718    8.585826
                        |
livesalone#feedbackmech |
   Lives alone#Positive |   .6943834    .486771    -0.52   0.603     .1757507    2.743479
    Lives alone#Neutral |   1.451546   1.028417     0.53   0.599     .3620396    5.819764
                        |
                  _cons |   .1428571   .0682988    -4.07   0.000     .0559693    .3646315
-----------------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

  +----------------------------------------------------------------+
  | Block |        LL       LR     df  Pr > LR       AIC       BIC |
  |-------+--------------------------------------------------------|
  |     1 | -186.5656    31.75      3   0.0000  381.1312  396.0516 |
  |     2 | -185.7604     1.61      2   0.4470  383.5207  405.9013 |
  +----------------------------------------------------------------+
My question is how to interpret and explain this in lay terms. My two real struggles are:
1. Living alone loses its significance when controlling for the interaction between living alone and the mechanism of feedback.
2. Positive mechanism of feedback retains significance after the interaction, but the interactions themselves are not significant.
And I still struggle to understand exactly how to interpret interaction effects in this way. Is it correct to say that living alone does not significantly impact someone turning themselves in when they live alone and are presented with feedback (Block 2)? But, living alone without considering feedback does significantly increase turning themselves in (Block 1)?
Asking how to interpret and discuss these results for the lay person might be elementary, but I've looked everywhere and virtually all threads, videos, etc. interpret hierarchical regression with interactions in a statistical way. I suppose I'm asking how I would explain this result to my grandmother, as the saying goes.
Thanks for all the help.
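Not an answer to the interpretation question itself, but a minimal sketch of one way to put the interaction on a scale that is easier to explain to a lay audience (predicted probabilities rather than odds ratios), assuming the same model as above:

Code:
quietly logistic turnselfin i.livesalone##ib0.feedbackmech
margins livesalone#feedbackmech          // predicted probability of turning oneself in, by cell
margins feedbackmech, dydx(livesalone)   // effect of living alone within each feedback condition
marginsplot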
PPML - generating a time dependent threshold
Dear Stata community,
I am working with a gravity model and want to cluster my observations depending on whether they are in the highest, middle or lowest third of observations in the respective year, but I am open to different suggestions. More on that after I explain my data and method. I use the ppmlhdfe command written by Sergio Correia, Paulo Guimarães and Thomas Zylkin. The tool can be installed with the following commands:
Code:
ssc install ftools
ssc install reghdfe
ssc install ppmlhdfe
I am looking only at data where China is the import or export partner:

The variable ln_gdp_2_pop controls for the market size the two trading partners form together; it is the product of both partners' GDPs over the product of both populations (log-linearized). ln_preimp is the average value of imports over the previous three years (my data go further back than 1990, which is why this value is always available if the country existed in 1987; it is log-linearized too). rta is a dummy for regional trade agreements with China, and BRI_mem_o is the variable of interest. country_pair is a unique identifier for each country pair; using it as fixed effects controls for all time-invariant country-pair-specific variables like distance, contiguity, etc.

Code:
* Example generated by -dataex-. For more info, type help dataex clear input int year str3 iso3_o byte rta float(BRI_mem_o ln_exports_to_china ln_imports_from_china ln_total_china_trade) str6 country_pair float(ln_gdp_2_pop ln_preimp ln_preexpo ln_pretotal) 1990 "ABW" 0 0 . . . "ABWABW" . . . . 1991 "ABW" 0 0 . . . "ABWABW" . . . . 1992 "ABW" 0 0 . . . "ABWABW" . . . . 1993 "ABW" 0 0 . . . "ABWABW" . . . . 1994 "ABW" 0 0 . . . "ABWABW" 19.52183 . . . 1995 "ABW" 0 0 . . . "ABWABW" 19.415113 . . . 1996 "ABW" 0 0 . . . "ABWABW" 19.43265 . . . 1997 "ABW" 0 0 . . . "ABWABW" 19.588173 . . . 1998 "ABW" 0 0 . . . "ABWABW" 19.712955 . . . 1999 "ABW" 0 0 . . . "ABWABW" 19.74156 . . . 2000 "ABW" 0 0 . . . "ABWABW" 19.86799 . . . 2001 "ABW" 0 0 . . . "ABWABW" 19.87303 . . . 2002 "ABW" 0 0 . . . "ABWABW" 19.849876 . . . 2003 "ABW" 0 0 . . . "ABWABW" 19.888773 . . . 2004 "ABW" 0 0 . . . "ABWABW" 20.04846 . . . 2005 "ABW" 0 0 . . . "ABWABW" 20.11266 . . . 2006 "ABW" 0 0 . . . "ABWABW" 20.172903 . . . 2007 "ABW" 0 0 . . . "ABWABW" 20.32564 . . . 2008 "ABW" 0 0 . . . "ABWABW" 20.44747 . . . 2009 "ABW" 0 0 . . . "ABWABW" 20.224247 . . . 2010 "ABW" 0 0 . . . "ABWABW" 20.19557 . . . 2011 "ABW" 0 0 . . . "ABWABW" 20.281445 . . . 2012 "ABW" 0 0 . . . "ABWABW" . . . . 2013 "ABW" 0 0 . . . "ABWABW" . . . . 2014 "ABW" 0 0 . . . "ABWABW" . . . . 2015 "ABW" 0 0 . . . "ABWABW" . . . . 2016 "ABW" 0 0 . . . "ABWABW" . . . . 2017 "ABW" 0 0 . . . "ABWABW" 20.55063 . . . 2018 "ABW" 0 0 . . . "ABWABW" . . . . 2019 "ABW" 0 0 . . . "ABWABW" . . . . 1990 "ABW" 0 0 . . . "ABWAFG" . . . . 1991 "ABW" 0 0 . . . "ABWAFG" . . . . 1992 "ABW" 0 0 . . . "ABWAFG" . . . . 1993 "ABW" 0 0 . . . "ABWAFG" . . . . 1994 "ABW" 0 0 . . . "ABWAFG" . . . . 1995 "ABW" 0 0 . . . "ABWAFG" . . . . 1996 "ABW" 0 0 . . . "ABWAFG" . . . . 1997 "ABW" 0 0 . . . "ABWAFG" . . . . 1998 "ABW" 0 0 . . . "ABWAFG" . . . . 1999 "ABW" 0 0 . . . "ABWAFG" . . . . 2000 "ABW" 0 0 . . . "ABWAFG" . . . . 2001 "ABW" 0 0 . . . "ABWAFG" 14.68416 . . . 2002 "ABW" 0 0 . . . "ABWAFG" 15.150466 . . . 2003 "ABW" 0 0 . . . "ABWAFG" 15.234106 . . . 2004 "ABW" 0 0 . . . "ABWAFG" 15.418114 . . . 2005 "ABW" 0 0 . . . "ABWAFG" 15.587377 . . . 2006 "ABW" 0 0 . . . "ABWAFG" 15.704497 . . . 2007 "ABW" 0 0 . . . "ABWAFG" 16.085983 . . . 2008 "ABW" 0 0 . . . "ABWAFG" 16.15592 . . . 2009 "ABW" 0 0 . . . "ABWAFG" 16.222836 . . . 2010 "ABW" 0 0 . . . "ABWAFG" 16.427858 . . . 2011 "ABW" 0 0 . . . "ABWAFG" 16.560684 . . . 2012 "ABW" 0 0 . . . "ABWAFG" . . . . 2013 "ABW" 0 0 . . . "ABWAFG" . . . . 2014 "ABW" 0 0 . . . "ABWAFG" . . . . 2015 "ABW" 0 0 . . . "ABWAFG" . . . . 2016 "ABW" 0 0 . . . "ABWAFG" . . . . 2017 "ABW" 0 0 . . . "ABWAFG" 16.528923 . . . 2018 "ABW" 0 0 . . . "ABWAFG" . . . . 2019 "ABW" 0 0 . . . "ABWAFG" . . . . 1990 "ABW" 0 0 . . . "ABWAGO" . . . . 1991 "ABW" 0 0 . . . "ABWAGO" . . . . 1992 "ABW" 0 0 . . . "ABWAGO" . . . . 1993 "ABW" 0 0 . . . "ABWAGO" . . . . 1994 "ABW" 0 0 . . . "ABWAGO" 15.6064 . . . 1995 "ABW" 0 0 . . . "ABWAGO" 15.739015 . . . 1996 "ABW" 0 0 . . . "ABWAGO" 16.120626 . . . 1997 "ABW" 0 0 . . . "ABWAGO" 16.187563 . . . 1998 "ABW" 0 0 . . . "ABWAGO" 16.05207 . . . 1999 "ABW" 0 0 . . . "ABWAGO" 15.991986 . . . 2000 "ABW" 0 0 . . . "ABWAGO" 16.419592 . . . 2001 "ABW" 0 0 . . . "ABWAGO" 16.36816 . . . 2002 "ABW" 0 0 . . . "ABWAGO" 16.657751 . . . 2003 "ABW" 0 0 . . . "ABWAGO" 16.76887 . . . 2004 "ABW" 0 0 . . . "ABWAGO" 17.138466 . . . 2005 "ABW" 0 0 . . . "ABWAGO" 17.498556 . . . 2006 "ABW" 0 0 . . . "ABWAGO" 17.886463 . . . 2007 "ABW" 0 0 . . . 
"ABWAGO" 18.298084 . . . 2008 "ABW" 0 0 . . . "ABWAGO" 18.656734 . . . 2009 "ABW" 0 0 . . . "ABWAGO" 18.403341 . . . 2010 "ABW" 0 0 . . . "ABWAGO" 18.445055 . . . 2011 "ABW" 0 0 . . . "ABWAGO" 18.689266 . . . 2012 "ABW" 0 0 . . . "ABWAGO" . . . . 2013 "ABW" 0 0 . . . "ABWAGO" . . . . 2014 "ABW" 0 0 . . . "ABWAGO" . . . . 2015 "ABW" 0 0 . . . "ABWAGO" . . . . 2016 "ABW" 0 0 . . . "ABWAGO" . . . . 2017 "ABW" 0 0 . . . "ABWAGO" 18.593037 . . . 2018 "ABW" 0 0 . . . "ABWAGO" . . . . 2019 "ABW" 0 0 . . . "ABWAGO" . . . . 1990 "ABW" 1 0 . . . "ABWAIA" . . . . 1991 "ABW" 1 0 . . . "ABWAIA" . . . . 1992 "ABW" 1 0 . . . "ABWAIA" . . . . 1993 "ABW" 1 0 . . . "ABWAIA" . . . . 1994 "ABW" 1 0 . . . "ABWAIA" . . . . 1995 "ABW" 1 0 . . . "ABWAIA" . . . . 1996 "ABW" 1 0 . . . "ABWAIA" . . . . 1997 "ABW" 1 0 . . . "ABWAIA" . . . . 1998 "ABW" 1 0 . . . "ABWAIA" . . . . 1999 "ABW" 1 0 . . . "ABWAIA" . . . . end
I then estimate the following PPML regressions (one each for imports, exports and total trade) with high-dimensional fixed effects:
Code:
ppmlhdfe imports_from_china ln_gdp_2_pop ln_preimp BRI_mem_o rta if year > 1990, absorb(year country_pair) vce(robust)
ppmlhdfe exports_to_china ln_gdp_2_pop ln_preexpo BRI_mem_o rta if year > 1990, absorb(year country_pair) vce(robust)
ppmlhdfe total_china_trade ln_gdp_2_pop ln_pretotal BRI_mem_o rta if year > 1990, absorb(year country_pair) vce(robust)
I obtain the following results, for total trade for example:

Code:
(dropped 3 observations that are either singletons or separated by a fixed effect)
warning: dependent variable takes very low values after standardizing (2.1798e-07)
Iteration 1:  deviance = 2.1018e+10  eps = .         iters = 4  tol = 1.0e-04  min(eta) =  -4.02  P
Iteration 2:  deviance = 5.7823e+09  eps = 2.63e+00  iters = 3  tol = 1.0e-04  min(eta) =  -6.11
Iteration 3:  deviance = 1.6919e+09  eps = 2.42e+00  iters = 3  tol = 1.0e-04  min(eta) =  -8.86
Iteration 4:  deviance = 8.5597e+08  eps = 9.77e-01  iters = 3  tol = 1.0e-04  min(eta) = -10.70
Iteration 5:  deviance = 7.2615e+08  eps = 1.79e-01  iters = 3  tol = 1.0e-04  min(eta) = -11.49
Iteration 6:  deviance = 6.9979e+08  eps = 3.77e-02  iters = 3  tol = 1.0e-04  min(eta) = -12.25
Iteration 7:  deviance = 6.9484e+08  eps = 7.13e-03  iters = 2  tol = 1.0e-04  min(eta) = -12.74
Iteration 8:  deviance = 6.9421e+08  eps = 9.11e-04  iters = 2  tol = 1.0e-04  min(eta) = -13.17
Iteration 9:  deviance = 6.9417e+08  eps = 5.78e-05  iters = 2  tol = 1.0e-04  min(eta) = -13.31
Iteration 10: deviance = 6.9417e+08  eps = 9.69e-07  iters = 2  tol = 1.0e-05  min(eta) = -13.32
Iteration 11: deviance = 6.9417e+08  eps = 1.74e-09  iters = 2  tol = 1.0e-06  min(eta) = -13.32  S O
------------------------------------------------------------------------------------------------------------
(legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below tolerance)
Converged in 11 iterations and 29 HDFE sub-iterations (tol = 1.0e-08)

HDFE PPML regression                              No. of obs      =      5,145
Absorbing 2 HDFE groups                           Residual df     =      4,909
                                                  Wald chi2(4)    =    3346.53
Deviance             =  694167322.6               Prob > chi2     =     0.0000
Log pseudolikelihood = -347121413.4               Pseudo R2       =     0.9971
------------------------------------------------------------------------------
             |               Robust
total_chin~e | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp_2_pop |   .2087208   .0355928     5.86   0.000     .1389601    .2784815
 ln_pretotal |   .7336988   .0189645    38.69   0.000      .696529    .7708685
   BRI_mem_o |  -.0624149   .0221315    -2.82   0.005    -.1057918     -.019038
         rta |    -.03272   .0221649    -1.48   0.140    -.0761623    .0107223
       _cons |   1.186367    .494268     2.40   0.016     .2176198    2.155115
------------------------------------------------------------------------------

Absorbed degrees of freedom:
------------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs  |
-------------+----------------------------------------|
        year |         29           0           29    |
country_pair |        204           1          203    |
------------------------------------------------------+
I suspect that I need to cluster my error terms to reduce the effect of heteroskedasticity (I know PPML already helps with that). So I want to assign each import, export and total trade value to a group depending on whether it is in the lowest, middle or highest third of observations in the respective year. Then I could cluster on these three groups, because I have read that in gravity models the error term is affected by how large the trade flow is.

The reason I suspect heteroskedasticity is that the estimation does not work consistently across different subsamples of countries, and the rta variable changes sign and has very different levels of significance. But as mentioned before, I am open to any other suggestion about what else I could change.
Kind regards
Michael
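A minimal sketch of the mechanics of building a within-year tercile to cluster on, assuming total_china_trade is the levels variable used in the regression; note that three clusters is very few for cluster-robust inference, so this only illustrates the construction described above:

Code:
gen byte trade_tercile = .
levelsof year, local(years)
foreach y of local years {
    xtile t_`y' = total_china_trade if year == `y', nq(3)
    replace trade_tercile = t_`y' if year == `y'
    drop t_`y'
}
ppmlhdfe total_china_trade ln_gdp_2_pop ln_pretotal BRI_mem_o rta if year > 1990, ///
    absorb(year country_pair) vce(cluster trade_tercile)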
Generating dummy observations to balance a panel
I hope this request makes sense, as it is just to aid in my estimation. Below is the dataex of a dummy dataset resembling my original, and below that I will describe my problem.
In my previous post, I had requested a way to track an individual's membership changes between phases. The advice given in that post was very good. I was able to generate variables which described whether an individual gained, lost, retained, or retained lack of a membership between any two consecutive phases.
The problem with my actual full-fledged dataset is that there are individuals who don't always have consecutive phases. For example, in the given dataex, individual A has observations only in phase 1 and phase 3; we don't know anything about him in phase 2. Therefore, with the solution code given in my previous post, the generated variables could not capture anything for individual A. It is my mistake that when I provided a dummy representative dataset, I made it balanced instead of unbalanced.
To counter this problem, is there any code or solution in Stata by which I can generate dummy observations for individuals whose observations are not in every phase? And of course the value of HasMembership for those dummy observations would be missing. This is only to counter the problem that the solutions won't work for non-consecutive periods. Since individual A has no observations in phase 2, his p2_p3 variable is missing, but I still want to capture that some time between phase 1 and phase 3 he did gain membership.
Otherwise if there is any other viable solution, I would be grateful to know.
EDIT: Thanks to Mr. Schechter for pointing out the mistake in the dataex; I have updated it.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 ID float(phase HasMembership)
"A" 1 0
"A" 3 1
"B" 1 1
"B" 2 0
"B" 3 1
"C" 1 0
"C" 2 1
"C" 3 1
"D" 2 1
"D" 3 0
"E" 1 1
"E" 3 0
"F" 1 1
"F" 3 0
end
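A minimal sketch using fillin, assuming every ID should get a row for every phase that appears anywhere in the data, with HasMembership left missing on the added rows:

Code:
fillin ID phase
list ID phase HasMembership _fillin, sepby(ID)   // _fillin == 1 marks the added observations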
Determining the explanatory power of an interaction term
Hi everyone,
I'm trying to fill a table with each line representing the explanatory power of a particular part of my model (such as fixed effects, independent variables, the residuals...), that is, the variance of this specific part divided by the variance of the model.
Independent variables | Var(xb)/Var(model) |
Fixed effect 1 | Var(fe1)/Var(model) |
Residuals | Var(residuals)/Var(model) |
Interaction term | ? |
I am trying to find the last cell of column 2, that is, the explanatory power of an interaction term that I'd like to isolate from the predicted xb. So what I do is use the command reghdfe (from SSC) to store my fixed effects in a variable, as well as the command predict to save the xb and the residuals in variables.
Code:
reghdfe y var1 var2 var3 i.var4##i.var5, absorb(fe1 fe2, savefe) resid
predict xb, xb
predict residuals, r
Then I summarize the different variables I obtained to fill the table with my data:

Code:
sum xb
display r(Var)/`variance'
sum __hdfe1__
display r(Var)/`variance'
* __hdfe1__ is obtained with the savefe option
with `variance' being the model's variance, defined in a local previously.

Now my problem is that xb covers all the independent variables, including the interaction term. How can I isolate the variance of the different levels of the interaction term to fill the last cell of the table? My lead so far has been to generate manually a variable representing the interaction between var4 and var5 and to put it in the fixed-effects option of the regression, but the command xi generates a large number of variables, one for each combination of var4 and var5, and I'm not sure this is what I want. Apologies if my post isn't clear (and it probably isn't!). I can explain further if needed.
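A minimal sketch of one way to isolate the interaction's part of xb, assuming it is run immediately after the reghdfe call above (so that _b[] is still available) and reusing your `variance' local; it simply rebuilds the var4#var5 portion of the linear predictor from the estimated coefficients:

Code:
gen double xb_inter = 0
levelsof var4, local(L4)
levelsof var5, local(L5)
foreach a of local L4 {
    foreach b of local L5 {
        capture scalar bi = _b[`a'.var4#`b'.var5]
        if !_rc quietly replace xb_inter = xb_inter + bi*(var4 == `a')*(var5 == `b')
    }
}
sum xb_inter
display r(Var)/`variance'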
Conditional loop analysis with sums in panel data
I have an unbalanced panel dataset of the following form:
company id (i) | country code (c) | year (y) | ratio-1 (r1) | ratio-2 (r2) | ratio-3 (r3) | ratio-4 (r4) | ratio-5 (r5) |
1 ≤ i ≤ 1587 | 1 ≤ c ≤ 18 | 2005 ≤ y ≤ 2021 | 0 ≤ r1 ≤ 1 | 0 ≤ r2 ≤ 1 | 0 ≤ r3 ≤ 1 | 0 ≤ r4 ≤ 1 | 0 ≤ r5 ≤ 1 |
where 1587 companies from 18 different countries report 5 different financial ratios annually from 2005 to 2021.
As such, there are 26979 rows (1587 * 17) in the data (1587 companies and 17 years) in which each company is assigned to a single country in all years.
I want to generate a new variable (Xa,y) for each and every row. Since I couldn't properly post the formula for Xa,y, I have attached it as a PDF file.
Eventually, all companies should have a unique "Xa,y" for each year, and 17 (the number of years) different "Xa,y"s in total.
I appreciate your help with the code for Xa,y.
Best,
LĂ¼tfi
Monday, June 27, 2022
Using a loop to calculate new variable
Hello all,
I want to find the new sale price for each quarter, represented through new_sales. I will use the sale price in 2000q4 as the baseline (100). For example, new_sales in 2001q1 should be calculated as new_sales = 100 + 0.08*100 = 108. Then I want to use this calculated new_sales value (108) to calculate new_sales for 2001q2, then find the value for 2001q3 from the value in 2001q2, and so on.
I have many different new_sales columns that need to be calculated given different scenarios, so I think some sort of loop would work best.
Here is an example of my data:
I would appreciate any assistance with this!
Code:
input str7 year float(sales_percent new_sales)
"2000q4" 0.01  100
"2001q1" 0.08  .
"2001q2" 0.06  .
"2001q3" 0.07  .
"2001q4" 0.001 .
"2002q1" 0.02  .
"2002q2" 0.02  .
"2002q3" 0.03  .
"2002q4" 0.05  .
"2003q1" 0.02  .
"2003q2" 0.03  .
"2003q3" 0.001 .
"2003q4" 0.01  .
end
Thanks,
Anoush K.
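A minimal sketch, assuming new_sales should grow by sales_percent each quarter from the 2000q4 base of 100; the recursion can be written directly with replace, which processes observations in order, and wrapped in a loop for the other scenario variables:

Code:
sort year                                        // the string quarters sort correctly here
gen double new_sales2 = 100 in 1
replace new_sales2 = new_sales2[_n-1]*(1 + sales_percent) in 2/l
* for several scenarios, loop over the percent variables (add the others to the varlist):
foreach p of varlist sales_percent {
    gen double ns_`p' = 100 in 1
    replace ns_`p' = ns_`p'[_n-1]*(1 + `p') in 2/l
}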
How to define treatment & control groups properly?
I’m working on a project examining the effect of a 2016 cash transfer on fertility.
Who is eligible for the cash?
All families with: 1.) 2+ children, or 2.) 1 low-income or disabled child.
The data doesn’t have a variable indicating who got the cash transfer, so as I understand it, I would be doing an “intent to treat” analysis by defining the treatment & control groups based on eligibility.
However, I keep getting stuck on how to define the treatment & control groups. I guess my question is since the cash transfer is universal for ALL families with 2+ kids, what would be the control group then? Theoretically, there should be two similar groups of families with 2+ kids (one who get the cash transfer and the other who don’t), but that’s not possible in this case?
Comparing eligible families (2+ kids or 1 poor/disabled kid) to ineligible families (1 kid that is not poor/disabled or zero kids) would violate one of the core assumptions of causal inference (that the treatment and control groups be similar and only differ in the “treatment”).
I think I’m getting tripped up by how the cash transfer is both universal and birth-dependent.
I’m exploring using a linear probability model with FE or a DID model, but not sure if a DID makes sense? Is synthetic control more appropriate? Any thoughts on modeling strategies?
More context: The data comes from a household survey, which I’ve organized into a panel with fertility histories for each childbearing-aged woman (e.g. each woman has 17 observations, or 18 years containing her time-variant birth information). I have data from 2010-2018 and the program started in 2016. The program grandfathers in anyone who falls in either one of two eligibility categories. The cash transfer is not means or work tested.
What does the miss option mean in gunique?
Dear Stata users,
I found the gunique command, and using the same variable list I got different total observations and unique observations depending on whether or not I add the miss option.

Does anyone know why N and the number of unbalanced groups are different in these two cases? And what does "sizes 1 to 6,026" mean?
Thanks a lot!
Code:
gunique id name_first name_last
N = 19,632,203; 4,678,062 unbalanced groups of sizes 1 to 6,026
Code:
gunique id name_first name_last, miss
N = 19,632,557; 4,678,416 unbalanced groups of sizes 1 to 6,026
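For reference, "sizes 1 to 6,026" describes the range of group sizes (the smallest group has 1 observation, the largest 6,026). A minimal check of the miss option's effect, assuming gunique (from gtools on SSC) excludes observations with a missing value in any of the listed variables unless miss is specified:

Code:
count if missing(id) | missing(name_first) | missing(name_last)
* if the assumption holds, this count should equal the difference in N
* (19,632,557 - 19,632,203 = 354), which also matches the difference in
* the number of groups (4,678,416 - 4,678,062 = 354)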
CI Decomposition Results not Showing Percentage Contribution
Hi All,
I am using Stata 16 and trying to decompose the concentration index, but the results show nothing for the percentage contribution. My dependent variable is binary, obese or not obese. My independent variables are quintile, feedingscheme (whether a student participates in the school feeding program or not), adorace (ado for adolescent; African/White/Coloured/Asian), adogender (male/female), empl_m (whether the mother is employed), educ_m and wgt_m.
The commands are as follows, using Erreygers:
Code:
sca CI=r(CI)
global X quintile feedingscheme adorace adogender empl_m educ_m wgt_m
qui sum obese [aw=wt]
sca m_obese=r(mean)
qui glm obese $X [aw=wt], family(binomial) link(logit)
qui margins , dydx(*) post
foreach x of varlist $X {
    sca b_`x'=_b[`x']
}
foreach x of varlist $X {
    qui{
        conindex `x' [aw=wt], rankvar(quintile) truezero
        sca CI_`x' = r(CI)
        sum `x' [aw=wt]
        sca m_`x'=r(mean)
        sca elas_`x' = b_`x'*m_`x'
        sca con_`x' = 4*elas_`x'*CI_`x'
        sca prcnt_`x' = (con_`x'/CI)*100
    }
    di "`x' elasticity:", elas_`x'
    di "`x' concentration index:", CI_`x'
    di "`x' contribution:", con_`x'
    di "`x' percentage contribution:", prcnt_`x'
}
matrix Aaa = nullmat(Aaa) \ (elas_`x', CI_`x', con_`x', prcnt_`x')
}
matrix rownames Caa= $X
matrix colnames Caa = "Elasticity""CI""Absolute""%"
matrix list Caa, format(%8.4f)
clear matrix
The results show just a full stop next to each variable's percentage contribution, and it also gives the error message "elas_ not found", even though I have defined the elasticity.
quintile elasticity: .02855949
quintile concentration index: .26154029
quintile contribution: .02987783
quintile percentage contribution: .
feedingscheme elasticity: .03727037
feedingscheme concentration index: .05987572
feedingscheme contribution: .00892636
feedingscheme percentage contribution: .
adorace elasticity: -.01348892
adorace concentration index: .01036392
adorace contribution: -.00055919
adorace percentage contribution: .
adogender elasticity: .02694672
adogender concentration index: -.00403962
adogender contribution: -.00043542
adogender percentage contribution: .
empl_m elasticity: .01834728
empl_m concentration index: .06688754
empl_m contribution: .00490882
empl_m percentage contribution: .
educ_m elasticity: .02113921
educ_m concentration index: .08112557
educ_m contribution: .00685972
educ_m percentage contribution: .
wgt_m elasticity: .15021019
wgt_m concentration index: .01183257
wgt_m contribution: .00710949
wgt_m percentage contribution: .
. matrix Aaa = nullmat(Aaa) \ (elas_`x', CI_`x', con_`x', prcnt_`x')
elas_ not found
r(111);
My dataex results are as follows:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte quintile float(feedingscheme adorace adogender empl_m educ_m wgt_m) . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 1 1 2 2 1 1 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1 3 . . . . 2 1 3 . . . . 2 1 3 . . . . . . . 2 2 2 2 2 1 3 . . . . 2 1 3 . . . . . . . . . . . . . . . . . . . . . 3 1 2 2 . . . 3 1 2 1 . . . . . . . . . . . . . . . . . 3 1 2 2 . . . . . . . 1 1 3 . . . . . . . . . . . . . . 3 1 2 1 2 1 2 3 1 2 2 2 1 2 . . . . 2 1 2 . . . . 2 1 2 1 1 2 1 . . . 1 1 2 1 . . . 1 . 2 1 . . . 1 1 2 1 . . . . . . . . . . . . . . . . . 1 . 2 2 . . . 1 1 2 2 1 1 3 . . . . . . . . . . . . . . . . . . . . . 1 1 2 1 1 1 3 1 1 2 2 1 1 2 1 1 2 2 1 1 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 2 . . . . . . . . . . . . . . . . . 1 1 2 2 . . . 1 1 2 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 . . . . . . . . . . . . . . . . . . 1 1 2 . . . . . . . . . . . 1 2 2 2 1 2 2 1 2 2 . . . . 1 2 2 1 . 2 1 1 1 1 . . . . . . . . . . . . . . 2 1 2 1 2 1 3 2 2 2 2 2 1 3 2 . 2 2 2 1 3 . . . . . . . 2 1 2 2 2 1 3 . . . . 1 3 3 2 1 2 2 2 1 3 . . . . 1 1 3 . . . . 1 1 3 . . . . . . . . . . . 1 2 3 . . . . 1 2 3 . . . . 1 2 3 . . . . 1 2 3 . . . . 1 1 3 . . . . . . . 1 2 2 2 1 1 3 . . . . . . . . . . . 1 2 3 . . . . 1 2 3 1 1 2 1 1 1 3 1 1 2 2 1 1 3 . . . . 1 1 3 . . . . 1 1 3 . . . . . . . 1 2 2 1 1 2 3 end label values feedingscheme feedingscheme5 label def feedingscheme5 1 "fscheme", modify label def feedingscheme5 2 "nfscheme", modify
Please assist.

Regards
Nthato
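A possible repair, sketched under two assumptions: the matrix accumulation line sits after the loop's closing brace, so `x' is empty there (hence "elas_ not found"), and the overall CI scalar is missing because no conindex run on the outcome precedes sca CI=r(CI) (hence the "." percentage contributions). The matrix is also built as Aaa but labelled and listed as Caa. The tail of the code could then look like this:

Code:
* define the overall CI from a conindex run on the outcome first (illustrative call)
conindex obese [aw=wt], rankvar(quintile) truezero
sca CI = r(CI)

foreach x of varlist $X {
    qui {
        conindex `x' [aw=wt], rankvar(quintile) truezero
        sca CI_`x' = r(CI)
        sum `x' [aw=wt]
        sca m_`x' = r(mean)
        sca elas_`x' = b_`x'*m_`x'
        sca con_`x' = 4*elas_`x'*CI_`x'
        sca prcnt_`x' = (con_`x'/CI)*100
    }
    matrix Aaa = nullmat(Aaa) \ (elas_`x', CI_`x', con_`x', prcnt_`x')   // inside the loop
}
matrix rownames Aaa = $X
matrix colnames Aaa = Elasticity CI Absolute Pct
matrix list Aaa, format(%8.4f)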
synth_runner automatically generated predictorvars
I am using synth_runner in Stata 17. I get exactly the same results when I run two specifications; depvar remains the same. In the second specification, I added one year of the outcome variable to the predictorvars, and it doesn't make any difference. I suspect that this is because training_propr(real>0) automatically generates predictors from both depvar and predictorvars. From the help file, it is not clear whether depvar is considered a potential predictor. Is there any other reason that could lead to the same results from these two specifications? Is there any way that I can use the training_propr(real>0) option but also use only one year's outcome variable as a predictor (I have multiple treatment units)?
I will use the data and code from synth_runner help file to illustrate the problem.
Code:
clear all
use smoking, clear
tsset state year
capture drop D

program my_pred, rclass
    args tyear
    return local predictors "beer(`=`tyear'-4'(1)`=`tyear'-1') lnincome(`=`tyear'-4'(1)`=`tyear'-1')"
end

program my_drop_units
    args tunit
    if `tunit'==39 qui drop if inlist(state,21,38)
    if `tunit'==3 qui drop if state==21
end

program my_xperiod, rclass
    args tyear
    return local xperiod "`=`tyear'-12'(1)`=`tyear'-1'"
end

program my_mspeperiod, rclass
    args tyear
    return local mspeperiod "`=`tyear'-12'(1)`=`tyear'-1'"
end

generate byte D = (state==3 & year>=1989) | (state==7 & year>=1988)

* Specification 1
synth_runner cigsale retprice age15to24, d(D) pred_prog(my_pred) trends training_propr(`=13/18') drop_units_prog(my_drop_units) xperiod_prog(my_xperiod) mspeperiod_prog(my_mspeperiod)

* Specification 2: add cigsale(1980) as a potential predictor
synth_runner cigsale cigsale(1980) retprice age15to24, d(D) pred_prog(my_pred) trends training_propr(`=13/18') drop_units_prog(my_drop_units) xperiod_prog(my_xperiod) mspeperiod_prog(my_mspeperiod)
Specification 1 | |||
Post-treatment results: Effects, p-values, standardized p-values | |||
estimates | pvals | pvals_std | |
c1 | -0.027493 | 0.3002191 | 0.0021914 |
c2 | -0.0485773 | 0.1775018 | 0.0043828 |
c3 | -0.0921521 | 0.0394449 | 0 |
c4 | -0.1017043 | 0.0409058 | 0 |
c5 | -0.1270111 | 0.0241052 | 0 |
c6 | -0.1352273 | 0.0219138 | 0 |
c7 | -0.141674 | 0.0262966 | 0 |
c8 | -0.196867 | 0.0051132 | 0 |
c9 | -0.1754307 | 0.0124178 | 0 |
c10 | -0.1833944 | 0.0197224 | 0 |
c11 | -0.1910038 | 0.0233747 | 0 |
c12 | -0.1889059 | 0.0219138 | 0 |
Specification 2
Post-treatment results: Effects, p-values, standardized p-values
| estimates | pvals | pvals_std |
c1 | -0.027493 | 0.3002191 | 0.0021914 |
c2 | -0.0485773 | 0.1775018 | 0.0043828 |
c3 | -0.0921521 | 0.0394449 | 0 |
c4 | -0.1017043 | 0.0409058 | 0 |
c5 | -0.1270111 | 0.0241052 | 0 |
c6 | -0.1352273 | 0.0219138 | 0 |
c7 | -0.141674 | 0.0262966 | 0 |
c8 | -0.196867 | 0.0051132 | 0 |
c9 | -0.1754307 | 0.0124178 | 0 |
c10 | -0.1833944 | 0.0197224 | 0 |
c11 | -0.1910038 | 0.0233747 | 0 |
c12 | -0.1889059 | 0.0219138 | 0 |
Extract country names from affiliations
Hi,
I have a dataset of about 1000 articles with variables such as id, title, abstract and affiliation. I was unable to get a dataex due to an error:
I have to separate the affiliation details (e.g. school, department, email), street address and country names into different columns, but my main interest is in the country names for each row. Is it possible to do so using Stata 16? Here are first 2 rows.
Code:
input strL affiliation
data width (7267 chars) exceeds max linesize. Try specifying fewer variables
r(1000);
Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and University Department of Infectious and Tropical Diseases, University of Brescia and and ASST Spedali Civili, Brescia, Italy. and Department of Molecular and Translational Medicine and Clinical Chemistry and Laboratory ASST Spedali Civili, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy; a.e.guerini@gmail.com. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy.
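No answer appears in the post, but here is a minimal, untested sketch of one heuristic that could be tried: in affiliation strings like these, the country is usually the last comma-separated piece, so keep the text after the last comma and strip the trailing period. The variable name country is hypothetical, and messy entries (emails, several affiliations mashed together) would still need manual cleaning.
Code:
* hypothetical sketch, not from the original thread
gen country = strtrim(substr(affiliation, strrpos(affiliation, ",") + 1, strlen(affiliation)))
replace country = subinstr(country, ".", "", .)   // drop trailing periods
tab country, sort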
Sunday, June 26, 2022
Flexible case-control matching command
Hello,
First, thanks in advance for anyone who can help me with this. I haven't had much luck recently with other avenues so I hoped I might find some advice here.
I'm performing a case-control analysis on a dataset of mine in which I want to match 2 controls to 1 case based on a date value (matched on day). To increase the power I'm trying to add some flexibility by allowing controls to be matched +/- 1 day, rather than only on the specific day. For example, if I have 1 case whose date is 06/29/2018, and there's only 1 control who shares that specific date, I want to be able to match a second control whose date is either 06/28/2018 or 06/30/2018.
Variables:
case (binary, 0 or 1)
date (str11, format is mm/dd/yyyy)
I'm using the ccmatch command, which does match entries on one variable based on a given set of other variables. However it does not match multiple controls to a case and doesn't account for flexible matching criteria like mine.
The only way I can do this now is by painstakingly poring through the (rather large) dataset and individually matching a second control to each identified case, and since ccmatch matches 1:1 at random, such additional code is difficult to replicate.
Let me know if any more information is needed, I'm happy to explain myself better. This is my first post so apologies if I've made an error with this post.
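No answer is shown in the post; the following is only a rough, untested sketch of one way to build the +/- 1 day candidate pairs with official commands, before any sampling step. The variables case and date come from the post; ddate, caseid, ctrl_id and match_date are hypothetical names added for illustration.
Code:
* hypothetical sketch: pair each case with every control dated within 1 day
gen ddate = daily(date, "MDY")                  // str11 "mm/dd/yyyy" -> daily date
format ddate %td

preserve
keep if case == 0
gen long ctrl_id    = _n
gen      match_date = ddate
keep ctrl_id match_date
tempfile controls
save `controls'
restore

keep if case == 1
gen long caseid = _n
expand 3                                        // three candidate dates per case
bysort caseid: gen match_date = ddate + _n - 2  // -1, 0, +1 days
joinby match_date using `controls', unmatched(none)
* each row now pairs a case with a control dated within one day of it; choosing
* two controls per case (and not reusing controls) still needs an assignment step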
Working with Time and DateTime variables from Excel
Hi,
I noticed that whenever I import an excel file that contains a time or datetime variable onto Stata (time up to seconds), the values for those variables in Stata appear to be slightly off sometimes (generally by a second). I understand that there is a difference in the way Excel and Stata read dates and times and there might be some rounding issue at play here. I am looking for a way for Stata to read the excel time/datetime variables accurately. I see some similar questions and documentation online that address my question but I am still a little confused. Any help with this would be appreciated. Thank you!
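One workaround that is sometimes suggested (hedged here, since the exact cause in this file is unknown): Excel stores times as fractional days in a double, so the converted %tc value can be a few milliseconds off, and rounding the imported datetime to the nearest second usually removes the one-second discrepancies. The variable name dt is hypothetical.
Code:
* hypothetical sketch: after -import excel-, round a %tc datetime to whole seconds
gen double dt_fixed = round(dt, 1000)   // %tc values are stored in milliseconds
format dt_fixed %tc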
scatter graph with different styles and specific axis
Hi statalist,
I have the following data:
I want to make a graph out of it so that I get points for all of the values of the variables return, predicted_return and abnormal_return.
The x-axis should be the time t-1 to t+3.
The y-axis should be the values.
It would also be great if the three variables got different shapes, with period 1 in blue and period 2 in red.
Do you have any recommendations on how to do that?
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte period float(time return predicted_return abnormal_return)
1 1 -.0015407993 -.0008083973 -.000732402
1 2 .0019429044 -.0010541489 .002997053
1 3 -.023466034 .0014670364 -.02493307
1 4 -.0030919516 -.0022815885 -.0008103631
1 5 -.0015958005 .0001712715 -.001767072
2 1 .003430538 .001985073 .001445465
2 2 -.001894033 .0004734376 -.0023674704
2 3 -.020554027 .0014578053 -.02201183
2 4 .002066248 .002363168 -.0002969196
2 5 -.00005386278 .0003358708 -.0003897336
end
label values time t
label def t 1 "t-1", modify
label def t 2 "t", modify
label def t 3 "t+1", modify
label def t 4 "t+2", modify
label def t 5 "t+3", modify
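No solution is given in the post, but a minimal sketch along these lines might be a starting point (marker symbols and colors are arbitrary choices, not from the thread):
Code:
* hypothetical sketch: one scatter per variable and period, colored by period
twoway (scatter return           time if period==1, msymbol(O) mcolor(blue)) ///
       (scatter predicted_return time if period==1, msymbol(D) mcolor(blue)) ///
       (scatter abnormal_return  time if period==1, msymbol(T) mcolor(blue)) ///
       (scatter return           time if period==2, msymbol(O) mcolor(red))  ///
       (scatter predicted_return time if period==2, msymbol(D) mcolor(red))  ///
       (scatter abnormal_return  time if period==2, msymbol(T) mcolor(red)), ///
       xlabel(1(1)5, valuelabel) xtitle("") ytitle("Value")                   ///
       legend(order(1 "return" 2 "predicted_return" 3 "abnormal_return"))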
Twoway Line: Deleting a straight line
Hello, I am trying to create a figure using the code: twoway line sum mdate. The figure shows both an actual line and a straight fitted line (please find attached). Is there any way I can delete the straight fitted line? Any advice would be appreciated.
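Without the attachment this is only a guess, but one common cause of a spurious straight segment is that the data are not sorted on the x variable, so -twoway line- connects the points in data order. The -sort- option (or sorting beforehand) may be all that is needed:
Code:
* hypothetical sketch: draw the line in mdate order
twoway line sum mdate, sort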
Panel Data fixed effects and time effects
Hello there,
For my master thesis I am conducting research on the effects of the digital divide on educational attainment in Europe. For this research I gathered data on 29 countries over a period of 14 years.
My dependent variable is the % of the population that completed tertiary education (age group 24-34).
The independent variables are: the population that has access to broadband internet (in %) and the Gini score (from 0 to 100, lower means better).
Then I looked for some control variables: population (total) and mean income (still thinking about adding the unemployment rate as another control variable).
Upon using fixed and random effects:
Fixed:
Code:
. xtreg educ population gini broadband incomeMean, fe Fixed-effects (within) regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7214 min = 11 Between = 0.3053 avg = 13.7 Overall = 0.3832 max = 14 F(4,365) = 236.33 corr(u_i, Xb) = -0.3124 Prob > F = 0.0000 ------------------------------------------------------------------------------ educ | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- population | -9.70e-08 2.92e-07 -0.33 0.740 -6.71e-07 4.77e-07 gini | -.150809 .1070503 -1.41 0.160 -.3613219 .0597038 broadband | .2142734 .0098261 21.81 0.000 .1949504 .2335963 incomeMean | .0004968 .0000762 6.52 0.000 .000347 .0006467 _cons | 20.37698 5.580489 3.65 0.000 9.403037 31.35093 -------------+---------------------------------------------------------------- sigma_u | 7.6449458 sigma_e | 2.5569006 rho | .89939297 (fraction of variance due to u_i) ------------------------------------------------------------------------------ F test that all u_i=0: F(28, 365) = 91.84 Prob > F = 0.0000
random:
Code:
xtreg educ population gini broadband incomeMean, re Random-effects GLS regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7206 min = 11 Between = 0.3157 avg = 13.7 Overall = 0.3976 max = 14 Wald chi2(4) = 945.94 corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ educ | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- population | -8.62e-08 5.70e-08 -1.51 0.130 -1.98e-07 2.54e-08 gini | -.0609734 .1026333 -0.59 0.552 -.262131 .1401841 broadband | .2166468 .0096297 22.50 0.000 .1977729 .2355207 incomeMean | .0004461 .0000651 6.85 0.000 .0003184 .0005737 _cons | 18.24877 3.490679 5.23 0.000 11.40716 25.09037 -------------+---------------------------------------------------------------- sigma_u | 6.9655036 sigma_e | 2.5569006 rho | .88125285 (fraction of variance due to u_i) ------------------------------------------------------------------------------ .
I used the Hausman test to confirm that fixed effects would be the better method to use:
Code:
hausman fixed random Note: the rank of the differenced variance matrix (3) does not equal the number of coefficients being tested (4); be sure this is what you expect, or there may be problems computing the test. Examine the output of your estimators for anything unexpected and possibly consider scaling your variables so that the coefficients are on a similar scale. ---- Coefficients ---- | (b) (B) (b-B) sqrt(diag(V_b-V_B)) | fixed random Difference Std. err. -------------+---------------------------------------------------------------- population | -9.70e-08 -8.62e-08 -1.08e-08 2.86e-07 gini | -.150809 -.0609734 -.0898356 .0304333 broadband | .2142734 .2166468 -.0023734 .0019549 incomeMean | .0004968 .0004461 .0000508 .0000395 ------------------------------------------------------------------------------ b = Consistent under H0 and Ha; obtained from xtreg. B = Inconsistent under Ha, efficient under H0; obtained from xtreg. Test of H0: Difference in coefficients not systematic chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 8.95 Prob > chi2 = 0.0299 (V_b-V_B is not positive definite)
I added robust to cluster my standard errors and got this as a result:
Code:
. xtreg educ population gini broadband incomeMean, fe robust Fixed-effects (within) regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7214 min = 11 Between = 0.3053 avg = 13.7 Overall = 0.3832 max = 14 F(4,28) = 35.53 corr(u_i, Xb) = -0.3124 Prob > F = 0.0000 (Std. err. adjusted for 29 clusters in country) ------------------------------------------------------------------------------ | Robust educ | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- population | -9.70e-08 4.97e-07 -0.20 0.847 -1.12e-06 9.21e-07 gini | -.150809 .1814625 -0.83 0.413 -.5225182 .2209001 broadband | .2142734 .0252991 8.47 0.000 .1624506 .2660962 incomeMean | .0004968 .0001977 2.51 0.018 .000092 .0009017 _cons | 20.37698 9.725685 2.10 0.045 .4548208 40.29915 -------------+---------------------------------------------------------------- sigma_u | 7.6449458 sigma_e | 2.5569006 rho | .89939297 (fraction of variance due to u_i) ------------------------------------------------------------------------------
Now two of my independent variables are significant, and overall the model also seems significant if I read the F statistic.
Upon adding i.year to the xtreg command like this:
Code:
. xtreg educ population gini broadband incomeMean i.year, fe robust Fixed-effects (within) regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7746 min = 11 Between = 0.0199 avg = 13.7 Overall = 0.0545 max = 14 F(17,28) = 24.52 corr(u_i, Xb) = -0.8738 Prob > F = 0.0000 (Std. err. adjusted for 29 clusters in country) ------------------------------------------------------------------------------ | Robust educ | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- population | -8.02e-07 4.62e-07 -1.73 0.094 -1.75e-06 1.45e-07 gini | -.1726822 .165679 -1.04 0.306 -.5120602 .1666957 broadband | .0004282 .0541787 0.01 0.994 -.1105518 .1114082 incomeMean | -.0000562 .0001719 -0.33 0.746 -.0004084 .000296 | year | 2008 | 1.393604 .514999 2.71 0.011 .3386763 2.448531 2009 | 2.785943 .9833509 2.83 0.008 .7716395 4.800246 2010 | 3.947826 1.240529 3.18 0.004 1.406718 6.488935 2011 | 4.918202 1.571527 3.13 0.004 1.699074 8.137329 2012 | 6.289633 1.930353 3.26 0.003 2.335484 10.24378 2013 | 7.574748 2.091623 3.62 0.001 3.290252 11.85924 2014 | 9.325942 2.29589 4.06 0.000 4.623026 14.02886 2015 | 9.79276 2.449228 4.00 0.000 4.775744 14.80978 2016 | 10.65857 2.584618 4.12 0.000 5.364219 15.95292 2017 | 11.28827 2.743002 4.12 0.000 5.669486 16.90705 2018 | 12.10324 2.847884 4.25 0.000 6.269612 17.93686 2019 | 12.90674 3.009904 4.29 0.000 6.741226 19.07224 2020 | 13.88196 3.176958 4.37 0.000 7.374253 20.38966 | _cons | 50.14916 8.640671 5.80 0.000 32.44955 67.84878 -------------+---------------------------------------------------------------- sigma_u | 19.538604 sigma_e | 2.3420248 rho | .98583553 (fraction of variance due to u_i) ----
with testparm for year:
Code:
. testparm i.year ( 1) 2008.year = 0 ( 2) 2009.year = 0 ( 3) 2010.year = 0 ( 4) 2011.year = 0 ( 5) 2012.year = 0 ( 6) 2013.year = 0 ( 7) 2014.year = 0 ( 8) 2015.year = 0 ( 9) 2016.year = 0 (10) 2017.year = 0 (11) 2018.year = 0 (12) 2019.year = 0 (13) 2020.year = 0 F( 13, 28) = 3.71 Prob > F = 0.0018
Now my question is: am I doing this right by adding i.year to the regression? It seems that the independent variables that were significant before are no longer significant. The R-squared also changed drastically here, but the F statistic still says the model is significant.
How can I fix this? Help or hints would greatly help me and is enormously appreciated.
Thank you and sorry for this very long message, but I tried to be as clear as possible by adding every step I took.
Kind regards,
Karim
Question on metacumbounds
Hi everyone,
I would like to seek advice on the metacumbounds package used for trial sequential analysis.
I have 2 questions:
1. Error message on finding the R pathname:
For the metacumbounds package, R is required. I have downloaded both foreign and ldbounds in R (v 3.3.3).
I have tried this on both a Mac and a Windows PC, but I receive this error message on my Mac (with the default settings in the metacumbounds dialog box):
R executable Rterm.exe not found in C:\Program Files\R\R-2.12.2\bin\i386\, cannot access R software
I have modified the pathname to /Library/Frameworks/R.framework (the location on a Mac), but I received the same error.
Is there anyone that can help with this?
2. "observation numbers out of range" error
The error in 1 (the R pathname error) was obtained when I used the sample dataset provided alongside the metacumbounds package. This is the setting used:
[screenshot of the metacumbounds dialog settings]
When I used my own data set (shown in the screenshot), I obtained this error:
observation numbers out of range
r(198);
How should I deal with this error?
Thanks for your help.
Saturday, June 25, 2022
Populating a column with values
I need to populate the empty GDP cells with the corresponding value within that year; that is, I have to repeat the entry of the value (for instance, fill 2004 with 128, 2005 with 131, 2006 with 130, and 2007 with 127). Any suggestions for the right code to use? This is just an example; I have over 6 million observations covering 24 countries.
year | country | GDP
2004 | SE      | 128
2004 | SE      |
2004 | SE      |
2005 | SE      | 131
2005 | SE      |
2005 | SE      |
2006 | SE      | 130
2006 | SE      |
2006 | SE      |
2007 | SE      | 127
2007 | SE      |
2007 | SE      |
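A minimal sketch of one common idiom (assuming GDP is numeric and there is at most one distinct non-missing GDP value per country-year): sort each group so the non-missing value comes first and copy it to the empty rows.
Code:
* hypothetical sketch: spread the recorded GDP value across its country-year group
* (for numeric GDP, missing values sort last, so GDP[1] is the recorded value)
bysort country year (GDP): replace GDP = GDP[1] if missing(GDP)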
Failing to convert .shp and .dbf dile to .dta format
Hello,
I'm trying to convert a gis file to .dta format. To answer a few questions: .shp and .dbf are in the same directory and there is no mistake in selecting the right directory. I used the same coding to convert other gis files before and it worked. But, for some reason it's not working. Any idea why ?
Code:
spshape2dta "Georgia.shp", replace saving(Georgia2014) (importing .shp file) st_sstore(): 3300 argument out of range import_shp_polym(): - function returned error import_shp_read_shapes(): - function returned error import_shp_read_write_data(): - function returned error import_shp(): - function returned error <istmt>: - function returned error could not import .shp file r(3300);
Changing values of a variable
Dear Listers,
I am using Stata 15.1.
I am working with a dataset with more than a million observations.
I would like to change the values of a variable in order to merge two datasets: the "ID" in the using data is one digit longer than in the master data, so I would like to drop the last digit of the ID in the using data before merging.
What I want: in the following synthetic data, I would like to get rid of the last digit of the values of the variable length, so that each value shortens to a two-digit value.
I tried to use levelsof with usubstr. It did not work. The closest I found is this: https://www.statalist.org/forums/for...velsof-command.
Thank you in advance for any tip.
Code:
sysuse auto, clear
levelsof length, local(levels)
foreach l of local levels {
    gen length2 = usubstr("`levels'", 1, 2)
}
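Since the post's own attempt is shown as not working, here is a hedged sketch of one alternative (not from the thread): for a numeric variable, dropping the last digit is just integer division by 10; for a string ID, keep all but the last character. The names length2, idstr and id2 are illustrative.
Code:
* hypothetical sketch
sysuse auto, clear
gen length2 = floor(length/10)                    // numeric: drop the last digit
* for a string ID variable, something like:
* gen id2 = usubstr(idstr, 1, strlen(idstr) - 1)  // keep all but the last character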
log differenced model and GMM Estimator
Hi all,
I have panel data (T=30), and due to the potential threat of non-stationarity I have transformed my data into log differences.
There is also endogeneity bias in my model, so I want to apply a GMM estimator. Since GMM applies a differencing transformation to the data, this would amount to double differencing in my case.
In this case, can I use difference or system GMM on the model? If not, which estimator would be appropriate?
I would appreciate an early response.
Thanks
heterofactor and ML maximization
I am using the heterofactor command (https://www.stata-journal.com/articl...article=st0431), but it is not interacting properly with the ml maximize options. For example, I want to stop after one iteration, so I write:
local dep_vars "ln_mean_wage high_shool college black latino hgc_mother_1979"
local controls "black latino hgc_mother_1979 highest_grade_complete"
local testslist " asvab_8_1981 asvab_6_1981 asvab_5_1981 asvab_4_1981 asvab_10_1981 rotter_score_1979 rosenberg_esteem_score_1980"
heterofactor unwanted `dep_vars' , indvarsc(`controls') scores(`testslist') factors(2) numf1tests(5) numf2tests(2) triangular difficult fdistonly initialreg nohats nochoice nodes(4)
but the output that I get is:
Estimating Initial Values Vector
Running Factor Model
Twostep option specified
Step: 1
Factor: 1
Iteration 0: log likelihood = -41281.163 (not concave)
Iteration 1: log likelihood = -40578.565 (not concave)
Iteration 2: log likelihood = -39899.142 (not concave)
Iteration 3: log likelihood = -39490.265 (not concave)
Iteration 4: log likelihood = -39334.409 (not concave)
Iteration 5: log likelihood = -39276.052 (not concave)
Iteration 6: log likelihood = -39235.622 (not concave)
Iteration 7: log likelihood = -39198.96
Iteration 8: log likelihood = -39130.351 (not concave)
Iteration 9: log likelihood = -39113.587
Iteration 10: log likelihood = -39095.938
Iteration 11: log likelihood = -39085.982
Iteration 12: log likelihood = -39083.795
Iteration 13: log likelihood = -39072.259
I have the same problem with ltolerance(#) or any maximize option.
Does someone know how to solve it?
Significant in one-way, but not in two-way ANOVA
Hey, I'm currently doing my data analysis for my thesis but I encountered a problem. The main effect is significant in the one-way ANOVA, but insignificant in the two-way ANOVA. Can somebody help me?
Thanks in advance!
Stset with durations
Hello, I have wide-format data with different dates, which I used to create different durations since the beginning of observation.
I am hesitating between two setup methods.
To set up Stata, I use the duration until last news, DURATION1, and as failure I use the duration until death, DURATION2, for those who died.
So I do:
stset DURATION1, fail(DURATION2)
I had doubts, so I tried another way: I use DURATION2, but I replace the missing values for those who did not die with the value of DURATION1, and I create a dichotomous failure variable.
This leads to:
stset DURATION2modified, fail(dead)
The results are different, and I am inclined to use the second method. Could anyone give me feedback?
Thank you
Mathieu
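For what it is worth, a hedged note on why the two setups differ: with -stset-, failure(varname) without == marks failure whenever varname is nonzero and nonmissing, so fail(DURATION2) does not treat DURATION2 as a time at all. The second setup matches the usual pattern, roughly as follows (a sketch built from the variables named in the post):
Code:
* hypothetical sketch of the second setup described in the post
gen DURATION2modified = DURATION2
replace DURATION2modified = DURATION1 if missing(DURATION2)
gen byte dead = !missing(DURATION2)          // 1 = died, 0 = censored at last news
stset DURATION2modified, failure(dead)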
Question: Bayesian Vector Autoregression (BVAR)
Hello everyone, is there any code to summarise and/or combine results from a BVAR model?
For instance, quietly, esttab, and eststo do not work for the BVAR model. Kindly advise, please.
Split population duration
Hello, I'm trying to fit a split-population duration model with the spsurv command, but I can't understand how to get the survival determinants on one side and the lifetime determinants on the other.
stset duree, failure(survie)
spsurv survie duree region secteur regime_jur FCEFFE lcapsocial cap_class, id(rge_id) seq(duree)
I need your help
Merge accuracy using str format when most contain only numbers
Dear stata user,
I have a question regarding the accuracy of merging on string keys. I have dataset A, whose firm_id is in string format; most of the values actually contain only numbers, but those that are not just numbers look like the following:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str8 patent_id "RE43814" "RE43864" "RE43868" "RE43956" "RE43986" "RE43997" "RE44164" "RE44215" "RE44861" "RE44874" "RE44924" "RE44930" "RE44958" "RE44993" "RE45248" "RE45348" "RE45418" "RE45473" "RE45539" "RE45733" "RE45782" "RE45804" "RE45956" "RE45962" "RE45990" "RE46020" "RE46089" "RE46096" "RE46176" "RE46193" "RE46351" "RE46409" "RE46436" "RE46436" "RE46436" "RE46473" "RE46488" "RE46518" "RE46558" "RE46564" "RE46630" "RE46686" "RE46703" "RE46746" "RE46850" "RE46891" "RE47055" "RE47257" "RE47341" "RE47342" "RE47351" "RE47425" "RE47487" "RE47553" "RE47663" "RE47698" "RE47715" "RE47736" "RE47737" "RE47761" "RE47763" "RE47813" "RE47857" "RE47949" "RE48267" "RE48274" "RE48308" "RE48359" "RE48378" "RE48446" "RE48524" "RE48532" "RE48599" "RE48641" "RE48695" "RE48702" "T100501" "T958006" "T962010" "T964006" "T965001" "T988005" end
And I have dataset B, whose firm_id contains only numbers and is stored as a numeric (long) variable.
Now I want to merge them using firm_id as a key. I have two options:
1. turn str to long
2. turn long to str
For the first option, I think merging on the numeric (long) format will be more accurate, but I have to drop the firms whose IDs contain characters, and I don't know how to test which IDs contain characters and drop them.
For the second option, I don't need to give up any observations, but I wonder about the accuracy of merging on the string format: will it be accurate when most IDs contain only numbers?
Thanks!
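No answer is shown in the post, but here is a hedged sketch of how one might inspect and handle the non-numeric IDs before choosing a key type (has_chars and num_id are hypothetical names, not from the thread):
Code:
* hypothetical sketch
gen byte has_chars = !regexm(firm_id, "^[0-9]+$")   // 1 if the ID has any non-digit
list firm_id if has_chars, noobs

* option 1: numeric key (loses the IDs with characters)
destring firm_id, gen(num_id) force                 // non-numeric IDs become missing

* option 2: a string key keeps everything; an exact string merge is reliable as long
* as both files store the same text (watch for leading zeros or blanks), e.g.
* tostring firm_id, replace format(%12.0f)          // in the all-numeric dataset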
Friday, June 24, 2022
Summing over a value of variable for repeating county
Hello respected stata community,
I have a dummy variable which takes the values 0 and 1. The observations for different counties in a single year, 2017, are given below; the counties are repeated several times within the year. I need to find the combined value of the variable length for each unique county of a unique state in a single year when the dummy variable takes the value 1. And if a unique county never takes the value 1, the desired combined-length variable should be zero.
Can anyone kindly tell me how I can do this? I've thought about it for a couple of days but haven't come up with any efficient code yet.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input int Year_Recor byte State_Code int County_Cod byte dummy float length 2017 1 89 0 .1 2017 1 1 1 .043 2017 1 123 0 .1 2017 1 45 0 .097 2017 1 97 0 .438 2017 1 123 0 .1 2017 1 117 1 .1 2017 1 25 0 .1 2017 1 123 1 .1 2017 1 51 0 .1 2017 1 123 0 .1 2017 1 3 0 .1 2017 1 115 0 .1 2017 1 71 0 .1 2017 1 77 0 .034 2017 1 61 0 .1 2017 1 39 0 .1 2017 1 125 0 .1 2017 1 59 0 .1 2017 1 15 0 .1 2017 1 89 1 .1 2017 1 81 1 .1 2017 1 77 0 .1 2017 1 127 0 .036 2017 1 103 0 .1 2017 1 49 0 .018 2017 1 81 0 .1 2017 1 101 0 .1 2017 1 91 0 .1 2017 1 91 0 .087 2017 1 49 1 .1 2017 1 115 0 .1 2017 1 107 0 .1 2017 1 89 0 .004 2017 1 43 0 2.22 2017 1 31 0 .1 2017 1 79 0 1.013 2017 1 125 1 .056 2017 1 117 0 .1 2017 1 3 1 .04 2017 1 49 1 .1 2017 1 125 1 .052 end
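A minimal sketch of one common idiom for this (not from the thread): total length over the rows where the dummy is 1, within each state-county-year; groups with no such rows get 0 automatically because no observation contributes to the sum.
Code:
* hypothetical sketch, using the variable names from the dataex
egen combined_length = total(length * (dummy == 1)), ///
     by(Year_Recor State_Code County_Cod)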
Box plot help
Hello everyone,
I am trying to make this visual from a book by Edward Tufte where he talks about using a stripped down version of the box plot as practice.
I wrote some code, but I am unable to figure out why the length of the whiskers keeps coming out incorrect. Can someone help me out here?
The plot type 1 whiskers are different from those in plot type 2. Can someone kindly check what I am doing wrong in this code? I would be grateful.
Code:
sysuse auto, clear * Name of variable to use for box plot: local variable price * Display boxplot by which group? local group foreign * Plot type 1 capture separate `variable', by(`group') levelsof `group', local(lvl) foreach level of local lvl { sort `variable' quietly summ `group' local max = `r(max)' local min = `r(min)' local scale = `r(max)' - `r(min)' local offset : display abs(`scale'*0.02) quietly summ `variable' if `group' == `level', detail local level = `level' + 1 local xlab "`xlab' `level' `" "`:lab (`group') `=`level'-1''" "'" local med_p_`level' = `r(p50)' local p75_`level' = `r(p75)' local p25_`level' = `r(p25)' local iqr_`level' = `p75_`level'' - `p25_`level'' display "Median = `med_p_`level''" display "P75 = `p75_`level''" display "P25 = `p25_`level''" display "IQR = `iqr_`level''" display "Low = `=`p25_`level''-(1.5*`iqr_`level'')'" display "Max = `=`p75_`level''+(1.5*`iqr_`level'')'"" display "Varname = `variable'`=`level'-1'" egen llw_`level' = min(max(`variable'`=`level'-1', `=`p25_`level''-(1.5*`iqr_`level'')')) egen uuw_`level' = max(min(`variable'`=`level'-1', `=`p75_`level''+(1.5*`iqr_`level'')')) quietly summ uuw_`level' local max_`level' = `r(mean)' quietly summ llw_`level' local min_`level' = `r(mean)' local lines `lines' /// (scatteri `p75_`level'' `level' `max_`level'' `level', recast(line) lpattern(solid) lcolor(black) lwidth(1)) || /// (scatteri `p25_`level'' `level' `min_`level'' `level', recast(line) lpattern(solid) lcolor(black) lwidth(1)) || /// (scatteri `p75_`level'' `=`level' + `offset'' `p25_`level'' `=`level' + `offset'', recast(line) lpattern(solid) lcolor(black) lwidth(1)) || /// (scatteri `med_p_`level'' `=`level' + `offset'', ms(square) mcolor(background)) || } *drop llw* uuw* twoway `lines', /// ytitle("`: variable label `variable''") /// ylabel(2000(2000)10000) xtitle("") /// xlabel(`xlab', nogrid) /// xscale(range(`=`min' + 0.5' `=`max' + 1.5')) /// scheme(white_tableau) /// title("{bf}Tufte Styled Box Plot", pos(11) margin(b+3) size(*.7)) /// subtitle("`: variable label `variable'' grouped by `: variable label `group''", pos(11) margin(b+6 t=-3) size(*.6)) /// legend(off) * Tufte style box plot version 2 graph box mpg, box(1, color(white%0)) medtype(marker) medmarker(mcolor(black) mlwidth(0)) cwhiskers alsize(0) intensity(0) over(foreign) lintensity(1) lines(lpattern(solid) lwidth(medium) lcolor(black)) nooutside ylabel(, nogrid) scheme(white_tableau)
Thursday, June 23, 2022
How to do a Box Plot with mean instead of median and SD instead of quartiles?
Dear Statalisters,
Please have a look at my data:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float event byte countrycode 0 2 0 4 0 7 0 7 0 5 1 3 0 9 1 4 0 8 0 5 0 8 0 5 0 5 0 5 0 5 0 3 0 5 0 3 0 5 0 7 end
event is a binary variable. I'd like to generate some visual descriptive statistics for event. Given that it is a binary variable, a box plot is inappropriate, as there would be no boundaries. But I would like something that looks like a box plot, showing the mean where a box plot traditionally shows the median, with the borders at the mean +/- the standard deviation. To make things clearer, here's an image of how it would look:
[mock-up image attachment]
For each country code, the center of the box would be the mean of event, the left boundary of the box would be the mean - sd of event, and the left boundary of the box would be the mean + the sd of event, so the mean would be at the center of each box.
Is there a way to generate such plots? I tried the command -graph box- but my attempts remained unsuccessful. Any help would be appreciated. Thanks a lot!
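-graph box- cannot do this directly, but a hedged sketch of one workaround (not from the thread) is to collapse to the mean and SD by country and draw the "boxes" with -twoway rbar-; m, s, lo and hi are hypothetical names.
Code:
* hypothetical sketch
preserve
collapse (mean) m = event (sd) s = event, by(countrycode)
gen lo = m - s
gen hi = m + s
twoway (rbar lo hi countrycode, horizontal barwidth(0.6) fcolor(none) lcolor(black)) ///
       (scatter countrycode m, msymbol(O) mcolor(black)),                            ///
       ytitle("Country code") xtitle("event: mean +/- SD") legend(off)
restore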
Help with bootstrap in obtaining a standard error.
Hi Everyone: I think my problem has nothing to do with the data set and so I'm not showing a data example. In the bootstrap, I want a standard error for the average of six estimated marginal effects but I get a missing value. I get the standard errors for the marginal effects themselves, but not the average. Clearly I don't know how to compute a function of coefficients within a bootstrap program. See my calculation of tauavg. Thanks for any hints.
Code:
. capture program drop aggregate_boot . . program aggregate_boot, rclass 1. . poisson y i.w#c.d4#c.f04 i.w#c.d4#c.f05 i.w#c.d4#c.f06 /// > i.w#c.d5#c.f05 i.w#c.d5#c.f06 /// > i.w#c.d6#c.f06 /// > i.w#c.d4#c.f04#c.x i.w#c.d4#c.f05#c.x i.w#c.d4#c.f06#c.x /// > i.w#c.d5#c.f05#c.x i.w#c.d5#c.f06#c.x /// > i.w#c.d6#c.f06#c.x /// > f02 f03 f04 f05 f06 /// > c.f02#c.x c.f03#c.x c.f04#c.x c.f05#c.x c.f06#c.x /// > d4 d5 d6 x c.d4#c.x c.d5#c.x c.d6#c.x, noomitted 2. estimates store beta 3. . margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 1 f05 = 0 f06 = 0) /// > subpop(if d4 == 1) noestimcheck post 4. return scalar tau44 = _b[1.w] 5. estimates restore beta 6. margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 1 f06 = 0) /// > subpop(if d4 == 1) noestimcheck post 7. return scalar tau45 = _b[1.w] 8. estimates restore beta 9. margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 0 f06 = 1) /// > subpop(if d4 == 1) noestimcheck post 10. return scalar tau46 = _b[1.w] 11. estimates restore beta 12. margins, dydx(w) at(d4 = 0 d5 = 1 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 1 f06 = 0) /// > subpop(if d5 == 1) noestimcheck post 13. return scalar tau55 = _b[1.w] 14. estimates restore beta 15. margins, dydx(w) at(d4 = 0 d5 = 1 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 0 f06 = 1) /// > subpop(if d5 == 1) noestimcheck post 16. return scalar tau56 = _b[1.w] 17. estimates restore beta 18. margins, dydx(w) at(d4 = 0 d5 = 0 d6 = 1 f02 = 0 f03 = 0 f04 = 0 f05 = 0 f06 = 1) /// > subpop(if d6 == 1) noestimcheck post 19. return scalar tau66 = _b[1.w] 20. . return scalar tauavg = (tau44 + tau45 + tau46 + tau55 + tau56 + tau66)/6 21. . end . . bootstrap r(tau44) r(tau45) r(tau46) r(tau55) r(tau56) r(tau66) r(tauavg), /// > reps(50) seed(123) cluster(id) idcluster(newid): aggregate_boot (running aggregate_boot on estimation sample) Bootstrap replications (50) ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 .................................................. 50 Bootstrap results Number of obs = 6,000 Replications = 50 Command: aggregate_boot _bs_1: r(tau44) _bs_2: r(tau45) _bs_3: r(tau46) _bs_4: r(tau55) _bs_5: r(tau56) _bs_6: r(tau66) _bs_7: r(tauavg) (Replications based on 1,000 clusters in id) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based | coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- _bs_1 | 1.017501 1.119012 0.91 0.363 -1.175723 3.210725 _bs_2 | 6.00713 1.978965 3.04 0.002 2.128431 9.885829 _bs_3 | 4.569667 1.379358 3.31 0.001 1.866174 7.273159 _bs_4 | 7.170127 2.957668 2.42 0.015 1.373203 12.96705 _bs_5 | 7.185492 2.297489 3.13 0.002 2.682496 11.68849 _bs_6 | 13.73294 12.83413 1.07 0.285 -11.42151 38.88738 _bs_7 | 6.613809 . . . . . ------------------------------------------------------------------------------
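A hedged guess at the cause (not a confirmed fix): inside the program, tau44 through tau66 are only ever stored with -return scalar-, so the bare names in the tauavg line refer to scalars that do not exist and the expression evaluates to missing. Holding each effect in a local first and returning it afterwards should give tauavg a real value, roughly along these lines:
Code:
* hypothetical sketch of the relevant lines inside aggregate_boot
margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 1 f05 = 0 f06 = 0) ///
    subpop(if d4 == 1) noestimcheck post
local tau44 = _b[1.w]                 // keep the value in a local ...
return scalar tau44 = `tau44'         // ... and also return it
* ... repeat the same pattern for tau45, tau46, tau55, tau56, tau66 ...
return scalar tauavg = (`tau44' + `tau45' + `tau46' + `tau55' + `tau56' + `tau66')/6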