Dear Statalist users,
I am wondering whether I can estimate a regression like this:
Y = X + X * Y_Q2 + X * Y_Q3 + X * Y_Q4 + X * Y_Q5 + Y_Q2 + Y_Q3 +Y_Q4 + Y_Q5
Y_Q2 is an indicator variable for whether Y is in the 2nd quintile of the sample, and Y_Q3 through Y_Q5 are defined similarly.
In other words, I want to include interaction terms between X and indicators of the Y quintile.
The reason I want these interaction terms is that I suspect the relationship between Y and X is non-linear, and I want to see how the coefficient on X changes as Y changes.
The reason I don’t flip Y and X (that is, making Y the independent variable and X the dependent variable) is that it is the convention in the literature that Y is the dependent variable.
Technically I can run the regression. But I am wondering if there is any problem econometrics-wise.
I would appreciate any comments/advice. Thank you so much!
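A minimal sketch of the mechanics with factor variables, assuming the quintile indicators are built from Y itself; this only shows how to code the specification, not whether conditioning on quintiles of the dependent variable is econometrically sound, which is the question being asked:

Code:
xtile yq = Y, nq(5)                      // quintile of Y (1-5)
regress Y c.X i.yq c.X#i.yq, vce(robust)
margins yq, dydx(X)                      // slope of X within each Y quintile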
Thursday, June 30, 2022
how to replace/recode variables that are a multiple of some number
Hi, I have a variable "level" that ranges from 1 to 125.
I would like to recode it into a variable "pitch", such that values 1, 6, 11, 16, 21, ... are coded as "1" for pitch, values 2, 7, 12, 17, 22, ... are coded as "2" for pitch, and so on.
I could do that manually with recode, e.g.

Code:
recode level 1=1 6=1 11=1 etc, gen(obj)

but it would take a long time. Is there a quicker way to do this?
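A minimal sketch using mod(), assuming the pattern really does cycle in steps of 5 (1, 6, 11, ... get 1; 2, 7, 12, ... get 2; and so on up to 5):

Code:
gen pitch = mod(level - 1, 5) + 1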
Shortcut for generating new variables based on many existing
I have groups of variables contactoutcome_con_<y>_<x> and contactmethod_con_<y>_<x>, where x is a program from 1-8 and y indexes the contact attempt (1-30) associated with that program. For example, contactoutcome_con_30_8 is the outcome of the 30th contact attempt made by the 8th program, and contactmethod_con_30_8 is the contact method that was used for that attempt. I need to generate a new variable "contact_success" based on the values of these two variables. For example, contact_success should = 0 if contactoutcome_con_<y>_<x>==9 AND contactmethod_con_<y>_<x>==11. What is the shortcut for doing this for all 240 matching contactoutcome_con_<y>_<x> and contactmethod_con_<y>_<x> pairs? Thanks in advance!
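A minimal sketch with nested forvalues loops, assuming contact_success is meant to be a single variable updated across all 240 pairs (extend the replace logic to whatever other outcome/method combinations define success or failure):

Code:
gen byte contact_success = .
forvalues x = 1/8 {
    forvalues y = 1/30 {
        replace contact_success = 0 if contactoutcome_con_`y'_`x' == 9 ///
            & contactmethod_con_`y'_`x' == 11
    }
}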
create a loop to convert string variables in numeric variables
Hello,
I am Salvatore, happy to join the Stata Forum community.
I am a new user who recently started using Stata. For my thesis research, I am using panel data. When I imported the data into Stata, all my variables were converted to strings. Since my dataset consists of over 60 columns (six variables over the 12-year period 2010-2021), I would like to set up a loop that converts all the strings to numbers without typing a separate command for each one. Please help me and I will give any information that would be helpful. The command I am trying to use is foreach, but I don't know how to use it.
[foreach var...]
Thank you very much. Sorry for my mistakes in following the posting rules; this is my first post and I am still learning. I apologize for not being able to use dataex, but Stata says "input statement exceeds linesize limit. Try specifying fewer variables". I hope you will understand.
Salvatore
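A minimal sketch, assuming every string variable should become numeric and contains only numeric text (non-numeric entries would need cleaning, or destring's options, first):

Code:
destring, replace                       // with no varlist, destring tries every string variable
* or, with an explicit foreach loop:
foreach var of varlist _all {
    capture confirm string variable `var'
    if !_rc destring `var', replace
}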
Matching without replacement from a file of pairs for case-control and other applications
Short version: I’m seeking a solution to how to do 1:m matching of cases and controls without replacement, *given a file in long (“edge”) format of all possible matching pairs.* Code to create sample data occurs at the end of this post.
Longer version:
In corresponding off-list with Rima Saliba about this problem, I discovered that the code I had posted here several years ago in response to a question about this problem is wrong.
While there have been a number of threads over the years about case-control matching (see e.g. here), I have now encountered and refined the problem in a way that seems different enough from previous work to be worth posting in a more generalized way. What follows is my refined version of the problem, to which I'm interested in solutions.
Via the use of -joinby- or perhaps -rangejoin- or -cross-, one can have a file that pairs up cases or treatment subjects with matching potential controls. In such situations, analysts may want to have 1:m matching *without replacement.* Complications include that:
1) The controls that match one case/treatment subject may match many others.
2) There are varying numbers of controls available for each case, in general more (or perhaps fewer) than m.
3) If possible, one wants to avoid too “greedy” an algorithm, which can result in the extreme in one case getting assigned all m controls and some other similar case getting 0.
I have the idea that some solution involving -merge- should be possible, per some earlier threads, but I have not successfully figured out how to do that. I also have the thought that one of the many built-in or community-contributed matching commands might be used, but I have not worked that out either. I *have* discovered that some of the "greediness" problem can be avoided by having an algorithm that picks only *1* control without replacement for every case and then applying this iteratively, so a solution that only picks one control per case would solve the problem.
In that context, here is a code snippet to create what I’d consider a representative kind of data set with which to work:
Code:
// Create an "edge" file of matched pairs.
set seed 82743
local ncases = 100
local maxmatch = 100
local maxcontrolid = 2000
clear
set obs `ncases'
gen int caseid = _n
gen navail = ceil(runiform() * `maxmatch')
label var navail "# controls matched to this case"
expand navail
gen int controlid = ceil(runiform() * `maxcontrolid')
summ navail
order caseid controlid
I realize that “without replacement“ is not necessarily analytically preferable, but that’s another issue.
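For what it's worth, here is a minimal greedy sketch of the "one control per case, without replacement" step, assuming the pairs file created by the snippet above is in memory; it loops over observations, so it only illustrates the logic and will be slow on large files:

Code:
set seed 12345
gen double u = runiform()
sort u                                  // process candidate pairs in random order
gen byte matched_case = 0               // case already has its control
gen byte used_control = 0               // control already assigned to some case
gen byte keep_pair    = 0
local N = _N
forvalues i = 1/`N' {
    if matched_case[`i'] == 0 & used_control[`i'] == 0 {
        quietly replace keep_pair = 1 in `i'
        quietly replace matched_case = 1 if caseid == caseid[`i']
        quietly replace used_control = 1 if controlid == controlid[`i']
    }
}
keep if keep_pair
* repeating this on the remaining pairs (after dropping used controls and
* already-matched pairs) would build up to 1:m matching iteratively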
bysort query issues
Hello all:
I am trying to replace the first row of path with "CLL" and the second row of path with "RT" for _merge == 2 cases. The code below replaces both rows with either CLL or RT. What am I doing wrong? studyid 4 and 5 would be examples.
bys studyid: replace path[`1'] = "CLL" if _merge == 2
bys studyid: replace path[`_N'] = "RT" if _merge == 2 // not working
Code:
* Example generated by -dataex-. For more info, type help dataex clear input int studyid str55 path byte _merge 1 "Normal" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL/PL" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL" 3 1 "CLL" 3 2 "CLL" 3 3 "CLL" 3 3 "Mixed CLL-RT" 3 3 "Mixed CLL-RT" 3 3 "Mixed CLL-RT" 3 4 "" 2 4 "" 2 5 "" 2 5 "" 2 6 "CLL" 3 6 "Mixed CLL-RT" 3 6 "CLL" 3 6 "CLL" 3 6 "CLL" 3 6 "CLL" 3 6 "RT" 3 6 "CLL" 3 7 "Mixed CLL-RT" 3 8 "" 2 8 "" 2 9 "CLL" 3 9 "CLL" 3 9 "CLL" 3 9 "CLL" 3 10 "" 2 10 "" 2 11 "" 2 11 "" 2 12 "Normal" 3 12 "CLL" 3 12 "Normal" 3 12 "RT" 3 12 "Normal" 3 12 "Normal" 3 12 "Normal" 3 13 "CLL" 3 13 "CLL" 3 14 "Other" 3 14 "Other" 3 14 "Other" 3 14 "Other" 3 15 "CLL" 3 15 "Mixed CLL-RT" 3 15 "CLL" 3 15 "CLL" 3 15 "CLL" 3 16 "Mixed CLL-RT" 3 16 "Mixed CLL-RT" 3 16 "CLL" 3 16 "CLL" 3 17 "CLL" 3 17 "Mixed CLL-RT" 3 17 "CLL" 3 18 "CLL" 3 18 "RT" 3 18 "Normal" 3 18 "Normal" 3 18 "." 3 18 "Normal" 3 18 "Normal" 3 19 "CLL" 3 19 "CLL" 3 19 "CLL" 3 20 "" 2 20 "" 2 21 "Normal" 3 22 "" 2 22 "" 2 23 "CLL" 3 23 "Normal" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 23 "CLL" 3 24 "." 3 24 "CLL" 3 24 "CLL" 3 25 "CLL" 3 25 "." 3 25 "Mixed CLL-RT" 3 25 "Mixed CLL-RT" 3 26 "RT" 3 26 "CLL" 3 26 "CLL" 3 26 "CLL" 3 26 "Mixed CLL-RT" 3 27 "Normal" 3 27 "CLL" 3 end label values _merge _merge label def _merge 2 "Using only (2)", modify label def _merge 3 "Matched (3)", modify
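A minimal sketch of one alternative, assuming (as in the example) that every row of a _merge==2 studyid is itself _merge==2, so _n indexes the rows of those groups directly; note that replace does not accept a subscript on the left-hand side, which is why the posted lines do not behave as intended:

Code:
bysort studyid: replace path = "CLL" if _merge == 2 & _n == 1
bysort studyid: replace path = "RT"  if _merge == 2 & _n == 2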
Replace command with non-mutually exclusive categorical data
Hello,
I am working with a dataset from a Twitter content analysis project and am stuck trying to figure out how to take 8 categorical tweet characteristic variables (resource, news, personal experience, personal opinion, marketing, spam, question, jokes/parody) and create one "tweet characteristic" variable (code below).
The problem I am having is that the categories are not mutually exclusive. The n for JokesParody is 22, but when I run this code it reduces it to 5 since a tweet can have several of these characteristics. Any help you can provide would be very much appreciated.
gen Characteristics=.
replace Characteristics = 0 if JokesParody==1
replace Characteristics = 1 if Resource==1
replace Characteristics = 2 if News==1
replace Characteristics = 3 if PersonalExperience==1
replace Characteristics = 4 if PersonalOpinion==1
replace Characteristics = 5 if Marketing==1
replace Characteristics = 6 if Spam==1
replace Characteristics = 7 if Question==1
label var Characteristics "Tweet Characteristics"
label define Characteristics 0 "Jokes/Parody" 1 "Resource" 2 "News" 3 "Personal Experience" 4 "Personal Opinion" 5 "Marketing" 6 "Spam" 7 "Question"
label val Characteristics Characteristics
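A minimal sketch of a different angle, assuming the aim is to describe the overlap rather than to force each tweet into one category (variable names as posted); note that the cascade of replace commands above implicitly gives later categories priority over earlier ones:

Code:
egen byte n_characteristics = rowtotal(Resource News PersonalExperience PersonalOpinion Marketing Spam Question JokesParody)
tab n_characteristics    // how many tweets carry 1, 2, 3, ... characteristics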
Wednesday, June 29, 2022
Number of lags too high
Hi everyone,
I am writing as I am currently working on my bachelor's thesis with Stata.
I am doing a time series analysis.
I have found that 4 is the optimal number of lags, I have one cointegrating equation, and the variables are stationary at first difference.
However, my professor pointed out that, because I am estimating a VAR of dimension 5 with 4 lags, there are far more estimated parameters than observations.
The problem is that I do not really know what to do. Because of my research topic, I cannot really change my data.
Therefore, what would you recommend I do? Perform an additional test? Simply state this issue in the limitations of the thesis?
Thank you in advance for the help (I'm still a beginner in Stata)!
pweights in reghdfe allow a collinear variable to generate a coefficient?
Hi,
This is my first post here. The question pertains to the use of pweights in reghdfe. I have a variable that is collinear with the fixed effects (see the output of the first regression). But when I add the probability weights, reghdfe does manage to output a coefficient (see the second regression output below). I was hoping someone could point me to some resources or an explanation to help me understand why this is happening.
Thank you,
Claire
[The two regression outputs were attached as an image in the original post.]
VAR or PVAR different lags for endgenous variables
Dear community,
In a VAR or PVAR model, would it be possible for the endogenous variables to have different lags?
The PVAR seems to always use an equal number of lags, but would it be possible in a PVAR model to have, for example, 5 lags for Endegenous_var_A and 3 lags for Endegenous_var_B?
Best regards,
Farid
comparing the parametric survival regression models
What is the command for comparing parametric survival regression models? For example, I estimate the model with both exponential and Weibull distributions, but I would also like to assess which distribution is the better one to use.
Thank you!
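A minimal sketch using information criteria, assuming the data are already stset and x1 x2 stand in for your covariates; since the exponential model is a Weibull with p = 1, the test of ln(p) = 0 reported in the Weibull output is also informative:

Code:
streg x1 x2, distribution(exponential)
estimates store expo
streg x1 x2, distribution(weibull)
estimates store weib
estimates stats expo weib    // compare AIC and BIC across the two fits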
Tuesday, June 28, 2022
Dummy for export-starters and non-exporters in the period just before the export-starter enters the export market
I want to create a dummy that distinguishes export starters from non-exporters in the year before the starter enters the export market. How can I code this?
When Export[_n]==1 & Export[_n-1]==0, generate Starter=1 and Non-exporter=0 in the previous year.
Year | Firm | Export | Export_Starter(Desired dummy) |
2010 | 1 | 0 | 0 |
2011 | 1 | 0 | . |
2012 | 1 | 0 | . |
2010 | 2 | 0 | 1 |
2011 | 2 | 1 | . |
2012 | 2 | 0 | . |
2010 | 3 | 1 | . |
2011 | 3 | 1 | . |
2012 | 3 | 1 | . |
2010 | 4 | 0 | 0 |
2011 | 4 | 1 | . |
2012 | 4 | 1 | . |
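A minimal sketch following the stated rule, assuming the panel is xtset by firm and year and using the variable names from the table (the desired-output column for firm 4 in 2010 does not quite match that rule, so the exact definition may still need adjusting):

Code:
xtset Firm Year
gen byte Export_Starter = 1 if Export == 0 & F.Export == 1   // non-exporter that starts next year
replace Export_Starter  = 0 if Export == 0 & F.Export == 0   // non-exporter that stays out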
How to inform readers the context of hierarchical logistic regression that removes significant main effects
Hello all,
I'm examining a hypothetical scenario to determine how living alone and the mechanism of feedback affect a person's willingness to turn themselves in to the police for a crime. Mechanism of feedback refers to the person being told the positive or negative consequences of turning themselves in (e.g. a positive mechanism would tell the person all of the good things that come with turning himself or herself in, while a negative mechanism would tell the person all the bad things; a neutral mechanism doesn't mention any good or bad things).
I'm running a hierarchical logistic regression model using two steps: the main effects in the first step and the interaction in the second step, where v1 is the dichotomous variable "Lives alone" (yes/no) and v2 is the categorical variable "Feedback mechanism" with three categories (positive/negative/neutral).
Code:
nestreg, lr: logistic turnselfin (i.livesalone ib0.feedbackmech) (i.livesalone#ib0.feedbackmech)
Which spits out:

Code:
note: 0.livesalone omitted because of estimability.
note: 0.feedbackmech omitted because of estimability.
note: 0.livesalone#0.feedbackmech omitted because of estimability.
note: 0.livesalone#1.feedbackmech omitted because of estimability.
note: 0.livesalone#2.feedbackmech omitted because of estimability.
note: 1.livesalone#0.feedbackmech omitted because of estimability.

Block 1: 1.livesalone 1.feedbackmech 2.feedbackmech

Logistic regression                                     Number of obs =    308
                                                        LR chi2(3)    =  31.75
                                                        Prob > chi2   = 0.0000
Log likelihood = -186.56561                             Pseudo R2     = 0.0784

------------------------------------------------------------------------------
  turnselfin | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  livesalone |
 Lives alone |   1.935901    .499876     2.56   0.011     1.167055    3.211256
             |
feedbackmech |
    Positive |   4.574841   1.510175     4.61   0.000      2.39547    8.736979
     Neutral |   3.450492    1.14231     3.74   0.000     1.803369     6.60203
             |
       _cons |    .142871   .0454746    -6.11   0.000     .0765622    .2666085
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Block 2: 1.livesalone#1.feedbackmech 1.livesalone#2.feedbackmech

Logistic regression                                     Number of obs =    308
                                                        LR chi2(5)    =  33.36
                                                        Prob > chi2   = 0.0000
Log likelihood = -185.76036                             Pseudo R2     = 0.0824

-----------------------------------------------------------------------------------------
             turnselfin | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
------------------------+----------------------------------------------------------------
             livesalone |
            Lives alone |    1.93617    1.10679     1.16   0.248     .6314858    5.936405
                        |
           feedbackmech |
               Positive |   5.727273   3.287981     3.04   0.002     1.859002    17.64476
                Neutral |   2.757576   1.597961     1.75   0.080     .8856718    8.585826
                        |
livesalone#feedbackmech |
   Lives alone#Positive |   .6943834    .486771    -0.52   0.603     .1757507    2.743479
    Lives alone#Neutral |   1.451546   1.028417     0.53   0.599     .3620396    5.819764
                        |
                  _cons |   .1428571   .0682988    -4.07   0.000     .0559693    .3646315
-----------------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

  +----------------------------------------------------------------+
  | Block |        LL       LR     df  Pr > LR       AIC       BIC |
  |-------+--------------------------------------------------------|
  |     1 | -186.5656    31.75      3   0.0000  381.1312  396.0516 |
  |     2 | -185.7604     1.61      2   0.4470  383.5207  405.9013 |
  +----------------------------------------------------------------+
My question is how to interpret and explain this in lay terms. My two real struggles are:
1. Living alone loses its significance when controlling for the interaction between living alone and the mechanism of feedback.
2. Positive mechanism of feedback retains significance after the interaction, but the interactions themselves are not significant.
And I still struggle to understand exactly how to interpret interaction effects in this way. Is it correct to say that living alone does not significantly impact someone turning themselves in when they live alone and are presented with feedback (Block 2)? But, living alone without considering feedback does significantly increase turning themselves in (Block 1)?
Asking how to interpret and discuss these results for the lay person might be elementary, but I've looked everywhere and virtually all threads, videos, etc. interpret hierarchical regression with interactions in a statistical way. I suppose I'm asking how I would explain this result to my grandmother, as the saying goes.
Thanks for all the help.
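Not an answer to the interpretation question itself, but a minimal sketch of one way to put the interaction on a scale that is easier to explain to a lay audience (predicted probabilities rather than odds ratios), assuming the same model as above:

Code:
quietly logistic turnselfin i.livesalone##ib0.feedbackmech
margins livesalone#feedbackmech          // predicted probability of turning oneself in, by cell
margins feedbackmech, dydx(livesalone)   // effect of living alone within each feedback condition
marginsplot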
PPML - generating a time dependent threshold
Dear Stata community,
I am working with a gravity model and want to cluster my observations depending on whether they are in the highest, middle or lowest third of observations in the respective year, but I am open to different suggestions. More on that after I explain my data and method. I use the ppmlhdfe command written by Sergio Correia, Paulo Guimarães and Thomas Zylkin. The tool can be installed with the following commands:
Code:
ssc install ftools
ssc install reghdfe
ssc install ppmlhdfe
I am looking only at data where China is the import or export partner:

The variable ln_gdp_2_pop controls for the market size the two trading partners form together; it is the product of both partners' GDPs over the product of both populations (log-linearized). ln_preimp is the average value of imports over the previous three years (my data go further back than 1990, which is why this value is always available if the country existed in 1987; it is log-linearized too). rta is a dummy for regional trade agreements with China, and BRI_mem_o is the variable of interest. country_pair is a unique identifier for each country pair; using it as fixed effects controls for all time-invariant country-pair-specific variables like distance, contiguity, etc.

Code:
* Example generated by -dataex-. For more info, type help dataex clear input int year str3 iso3_o byte rta float(BRI_mem_o ln_exports_to_china ln_imports_from_china ln_total_china_trade) str6 country_pair float(ln_gdp_2_pop ln_preimp ln_preexpo ln_pretotal) 1990 "ABW" 0 0 . . . "ABWABW" . . . . 1991 "ABW" 0 0 . . . "ABWABW" . . . . 1992 "ABW" 0 0 . . . "ABWABW" . . . . 1993 "ABW" 0 0 . . . "ABWABW" . . . . 1994 "ABW" 0 0 . . . "ABWABW" 19.52183 . . . 1995 "ABW" 0 0 . . . "ABWABW" 19.415113 . . . 1996 "ABW" 0 0 . . . "ABWABW" 19.43265 . . . 1997 "ABW" 0 0 . . . "ABWABW" 19.588173 . . . 1998 "ABW" 0 0 . . . "ABWABW" 19.712955 . . . 1999 "ABW" 0 0 . . . "ABWABW" 19.74156 . . . 2000 "ABW" 0 0 . . . "ABWABW" 19.86799 . . . 2001 "ABW" 0 0 . . . "ABWABW" 19.87303 . . . 2002 "ABW" 0 0 . . . "ABWABW" 19.849876 . . . 2003 "ABW" 0 0 . . . "ABWABW" 19.888773 . . . 2004 "ABW" 0 0 . . . "ABWABW" 20.04846 . . . 2005 "ABW" 0 0 . . . "ABWABW" 20.11266 . . . 2006 "ABW" 0 0 . . . "ABWABW" 20.172903 . . . 2007 "ABW" 0 0 . . . "ABWABW" 20.32564 . . . 2008 "ABW" 0 0 . . . "ABWABW" 20.44747 . . . 2009 "ABW" 0 0 . . . "ABWABW" 20.224247 . . . 2010 "ABW" 0 0 . . . "ABWABW" 20.19557 . . . 2011 "ABW" 0 0 . . . "ABWABW" 20.281445 . . . 2012 "ABW" 0 0 . . . "ABWABW" . . . . 2013 "ABW" 0 0 . . . "ABWABW" . . . . 2014 "ABW" 0 0 . . . "ABWABW" . . . . 2015 "ABW" 0 0 . . . "ABWABW" . . . . 2016 "ABW" 0 0 . . . "ABWABW" . . . . 2017 "ABW" 0 0 . . . "ABWABW" 20.55063 . . . 2018 "ABW" 0 0 . . . "ABWABW" . . . . 2019 "ABW" 0 0 . . . "ABWABW" . . . . 1990 "ABW" 0 0 . . . "ABWAFG" . . . . 1991 "ABW" 0 0 . . . "ABWAFG" . . . . 1992 "ABW" 0 0 . . . "ABWAFG" . . . . 1993 "ABW" 0 0 . . . "ABWAFG" . . . . 1994 "ABW" 0 0 . . . "ABWAFG" . . . . 1995 "ABW" 0 0 . . . "ABWAFG" . . . . 1996 "ABW" 0 0 . . . "ABWAFG" . . . . 1997 "ABW" 0 0 . . . "ABWAFG" . . . . 1998 "ABW" 0 0 . . . "ABWAFG" . . . . 1999 "ABW" 0 0 . . . "ABWAFG" . . . . 2000 "ABW" 0 0 . . . "ABWAFG" . . . . 2001 "ABW" 0 0 . . . "ABWAFG" 14.68416 . . . 2002 "ABW" 0 0 . . . "ABWAFG" 15.150466 . . . 2003 "ABW" 0 0 . . . "ABWAFG" 15.234106 . . . 2004 "ABW" 0 0 . . . "ABWAFG" 15.418114 . . . 2005 "ABW" 0 0 . . . "ABWAFG" 15.587377 . . . 2006 "ABW" 0 0 . . . "ABWAFG" 15.704497 . . . 2007 "ABW" 0 0 . . . "ABWAFG" 16.085983 . . . 2008 "ABW" 0 0 . . . "ABWAFG" 16.15592 . . . 2009 "ABW" 0 0 . . . "ABWAFG" 16.222836 . . . 2010 "ABW" 0 0 . . . "ABWAFG" 16.427858 . . . 2011 "ABW" 0 0 . . . "ABWAFG" 16.560684 . . . 2012 "ABW" 0 0 . . . "ABWAFG" . . . . 2013 "ABW" 0 0 . . . "ABWAFG" . . . . 2014 "ABW" 0 0 . . . "ABWAFG" . . . . 2015 "ABW" 0 0 . . . "ABWAFG" . . . . 2016 "ABW" 0 0 . . . "ABWAFG" . . . . 2017 "ABW" 0 0 . . . "ABWAFG" 16.528923 . . . 2018 "ABW" 0 0 . . . "ABWAFG" . . . . 2019 "ABW" 0 0 . . . "ABWAFG" . . . . 1990 "ABW" 0 0 . . . "ABWAGO" . . . . 1991 "ABW" 0 0 . . . "ABWAGO" . . . . 1992 "ABW" 0 0 . . . "ABWAGO" . . . . 1993 "ABW" 0 0 . . . "ABWAGO" . . . . 1994 "ABW" 0 0 . . . "ABWAGO" 15.6064 . . . 1995 "ABW" 0 0 . . . "ABWAGO" 15.739015 . . . 1996 "ABW" 0 0 . . . "ABWAGO" 16.120626 . . . 1997 "ABW" 0 0 . . . "ABWAGO" 16.187563 . . . 1998 "ABW" 0 0 . . . "ABWAGO" 16.05207 . . . 1999 "ABW" 0 0 . . . "ABWAGO" 15.991986 . . . 2000 "ABW" 0 0 . . . "ABWAGO" 16.419592 . . . 2001 "ABW" 0 0 . . . "ABWAGO" 16.36816 . . . 2002 "ABW" 0 0 . . . "ABWAGO" 16.657751 . . . 2003 "ABW" 0 0 . . . "ABWAGO" 16.76887 . . . 2004 "ABW" 0 0 . . . "ABWAGO" 17.138466 . . . 2005 "ABW" 0 0 . . . "ABWAGO" 17.498556 . . . 2006 "ABW" 0 0 . . . "ABWAGO" 17.886463 . . . 2007 "ABW" 0 0 . . . 
"ABWAGO" 18.298084 . . . 2008 "ABW" 0 0 . . . "ABWAGO" 18.656734 . . . 2009 "ABW" 0 0 . . . "ABWAGO" 18.403341 . . . 2010 "ABW" 0 0 . . . "ABWAGO" 18.445055 . . . 2011 "ABW" 0 0 . . . "ABWAGO" 18.689266 . . . 2012 "ABW" 0 0 . . . "ABWAGO" . . . . 2013 "ABW" 0 0 . . . "ABWAGO" . . . . 2014 "ABW" 0 0 . . . "ABWAGO" . . . . 2015 "ABW" 0 0 . . . "ABWAGO" . . . . 2016 "ABW" 0 0 . . . "ABWAGO" . . . . 2017 "ABW" 0 0 . . . "ABWAGO" 18.593037 . . . 2018 "ABW" 0 0 . . . "ABWAGO" . . . . 2019 "ABW" 0 0 . . . "ABWAGO" . . . . 1990 "ABW" 1 0 . . . "ABWAIA" . . . . 1991 "ABW" 1 0 . . . "ABWAIA" . . . . 1992 "ABW" 1 0 . . . "ABWAIA" . . . . 1993 "ABW" 1 0 . . . "ABWAIA" . . . . 1994 "ABW" 1 0 . . . "ABWAIA" . . . . 1995 "ABW" 1 0 . . . "ABWAIA" . . . . 1996 "ABW" 1 0 . . . "ABWAIA" . . . . 1997 "ABW" 1 0 . . . "ABWAIA" . . . . 1998 "ABW" 1 0 . . . "ABWAIA" . . . . 1999 "ABW" 1 0 . . . "ABWAIA" . . . . end
I then estimate the following PPML regressions (one each for imports, exports and total trade) with high-dimensional fixed effects:
Code:
ppmlhdfe imports_from_china ln_gdp_2_pop ln_preimp BRI_mem_o rta if year > 1990, absorb(year country_pair) vce(robust)
ppmlhdfe exports_to_china ln_gdp_2_pop ln_preexpo BRI_mem_o rta if year > 1990, absorb(year country_pair) vce(robust)
ppmlhdfe total_china_trade ln_gdp_2_pop ln_pretotal BRI_mem_o rta if year > 1990, absorb(year country_pair) vce(robust)
I obtain the following results, for total trade for example:

Code:
(dropped 3 observations that are either singletons or separated by a fixed effect)
warning: dependent variable takes very low values after standardizing (2.1798e-07)
Iteration 1:  deviance = 2.1018e+10  eps = .         iters = 4  tol = 1.0e-04  min(eta) =  -4.02  P
Iteration 2:  deviance = 5.7823e+09  eps = 2.63e+00  iters = 3  tol = 1.0e-04  min(eta) =  -6.11
Iteration 3:  deviance = 1.6919e+09  eps = 2.42e+00  iters = 3  tol = 1.0e-04  min(eta) =  -8.86
Iteration 4:  deviance = 8.5597e+08  eps = 9.77e-01  iters = 3  tol = 1.0e-04  min(eta) = -10.70
Iteration 5:  deviance = 7.2615e+08  eps = 1.79e-01  iters = 3  tol = 1.0e-04  min(eta) = -11.49
Iteration 6:  deviance = 6.9979e+08  eps = 3.77e-02  iters = 3  tol = 1.0e-04  min(eta) = -12.25
Iteration 7:  deviance = 6.9484e+08  eps = 7.13e-03  iters = 2  tol = 1.0e-04  min(eta) = -12.74
Iteration 8:  deviance = 6.9421e+08  eps = 9.11e-04  iters = 2  tol = 1.0e-04  min(eta) = -13.17
Iteration 9:  deviance = 6.9417e+08  eps = 5.78e-05  iters = 2  tol = 1.0e-04  min(eta) = -13.31
Iteration 10: deviance = 6.9417e+08  eps = 9.69e-07  iters = 2  tol = 1.0e-05  min(eta) = -13.32
Iteration 11: deviance = 6.9417e+08  eps = 1.74e-09  iters = 2  tol = 1.0e-06  min(eta) = -13.32  S O
------------------------------------------------------------------------------------------------------------
(legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below tolerance)
Converged in 11 iterations and 29 HDFE sub-iterations (tol = 1.0e-08)

HDFE PPML regression                              No. of obs      =      5,145
Absorbing 2 HDFE groups                           Residual df     =      4,909
                                                  Wald chi2(4)    =    3346.53
Deviance             =  694167322.6               Prob > chi2     =     0.0000
Log pseudolikelihood = -347121413.4               Pseudo R2       =     0.9971
------------------------------------------------------------------------------
             |               Robust
total_chin~e | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
ln_gdp_2_pop |   .2087208   .0355928     5.86   0.000     .1389601    .2784815
 ln_pretotal |   .7336988   .0189645    38.69   0.000      .696529    .7708685
   BRI_mem_o |  -.0624149   .0221315    -2.82   0.005    -.1057918     -.019038
         rta |    -.03272   .0221649    -1.48   0.140    -.0761623    .0107223
       _cons |   1.186367    .494268     2.40   0.016     .2176198    2.155115
------------------------------------------------------------------------------

Absorbed degrees of freedom:
------------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs  |
-------------+----------------------------------------|
        year |         29           0           29    |
country_pair |        204           1          203    |
------------------------------------------------------+
I suspect that I need to cluster my error terms to reduce the effect of heteroskedasticity (I know PPML already helps with that). So I want to assign each import, export and total trade value to a group depending on whether it is in the lowest, middle or highest third of observations in the respective year. Then I could cluster on these three groups, because I have read that in gravity models the error term is affected by how large the trade flow is.

The reason I suspect heteroskedasticity is that the estimation does not work consistently across different subsamples of countries, and the rta variable changes sign and has very different levels of significance. But as mentioned before, I am open to any other suggestion about what else I could change.
Kind regards
Michael
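A minimal sketch of the mechanics of building a within-year tercile to cluster on, assuming total_china_trade is the levels variable used in the regression; note that three clusters is very few for cluster-robust inference, so this only illustrates the construction described above:

Code:
gen byte trade_tercile = .
levelsof year, local(years)
foreach y of local years {
    xtile t_`y' = total_china_trade if year == `y', nq(3)
    replace trade_tercile = t_`y' if year == `y'
    drop t_`y'
}
ppmlhdfe total_china_trade ln_gdp_2_pop ln_pretotal BRI_mem_o rta if year > 1990, ///
    absorb(year country_pair) vce(cluster trade_tercile)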
Generating dummy observations to balance a panel
I hope this request makes sense, as it is just to aid in my estimation. Below is the dataex of a dummy dataset resembling my original, and below that I will describe my problem.
In my previous post, I had requested a way to track an individual's membership changes between phases. The advice given in that post was very good. I was able to generate variables which described whether an individual gained, lost, retained, or retained lack of a membership between any two consecutive phases.
The problem with my actual full-fledged dataset is that there are individuals who don't always have consecutive phases. For example, in the given dataex, individual A has observations only in phase 1 and phase 3; we don't know anything about him in phase 2. Therefore, with the solution code given in my previous post, the generated variables could not capture anything for individual A. It is my mistake that when I provided a dummy representative dataset, I made it balanced instead of unbalanced.
To counter this problem, is there any code or solution in Stata by which I can generate dummy observations for individuals whose observations are not in every phase? And of course the value of HasMembership for those dummy observations would be missing. This is only to counter the problem that the solutions won't work for non-consecutive periods. Since individual A has no observations in phase 2, his p2_p3 variable is missing, but I still want to capture that some time between phase 1 and phase 3 he did gain membership.
Otherwise if there is any other viable solution, I would be grateful to know.
EDIT: Thanks to Mr. Schechter for pointing out the mistake in the dataex; I have updated it.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 ID float(phase HasMembership)
"A" 1 0
"A" 3 1
"B" 1 1
"B" 2 0
"B" 3 1
"C" 1 0
"C" 2 1
"C" 3 1
"D" 2 1
"D" 3 0
"E" 1 1
"E" 3 0
"F" 1 1
"F" 3 0
end
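A minimal sketch using fillin, assuming every ID should get a row for every phase that appears anywhere in the data, with HasMembership left missing on the added rows:

Code:
fillin ID phase
list ID phase HasMembership _fillin, sepby(ID)   // _fillin == 1 marks the added observations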
Determining the explanatory power of an interaction term
Hi everyone,
I'm trying to fill a table with each line representing the explanatory power of a particular part of my model (such as fixed effects, independent variables, the residuals...), that is, the variance of this specific part divided by the variance of the model.
Independent variables | Var(xb)/Var(model) |
Fixed effect 1 | Var(fe1)/Var(model) |
Residuals | Var(residuals)/Var(model) |
Interaction term | ? |
I am trying to find the last cell of column 2, that is, the explanatory power of an interaction term that I'd like to isolate from the predicted xb. So what I do is use the command reghdfe (from SSC) to store my fixed effects in a variable, as well as the command predict to save the xb and the residuals in variables.
Code:
reghdfe y var1 var2 var3 i.var4##i.var5, absorb(fe1 fe2, savefe) resid
predict xb, xb
predict residuals, r
Then I summarize the different variables I obtained to fill the table with my data:

Code:
sum xb
display r(Var)/`variance'
sum __hdfe1__
display r(Var)/`variance'
* __hdfe1__ is obtained with the savefe option
with `variance' being the model's variance, defined in a local previously.

Now my problem is that xb covers all the independent variables, including the interaction term. How can I isolate the variance of the different levels of the interaction term to fill the last cell of the table? My lead so far has been to generate manually a variable representing the interaction between var4 and var5 and to put it in the fixed-effects option of the regression, but the command xi generates a large number of variables, one for each combination of var4 and var5, and I'm not sure this is what I want. Apologies if my post isn't clear (and it probably isn't!). I can explain further if needed.
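A minimal sketch of one way to isolate the interaction's part of xb, assuming it is run immediately after the reghdfe call above (so that _b[] is still available) and reusing your `variance' local; it simply rebuilds the var4#var5 portion of the linear predictor from the estimated coefficients:

Code:
gen double xb_inter = 0
levelsof var4, local(L4)
levelsof var5, local(L5)
foreach a of local L4 {
    foreach b of local L5 {
        capture scalar bi = _b[`a'.var4#`b'.var5]
        if !_rc quietly replace xb_inter = xb_inter + bi*(var4 == `a')*(var5 == `b')
    }
}
sum xb_inter
display r(Var)/`variance'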
Conditional loop analysis with sums in panel data
I have an unbalanced panel dataset of the following form:
company id (i) | country code (c) | year (y) | ratio-1 (r1) | ratio-2 (r2) | ratio-3 (r3) | ratio-4 (r4) | ratio-5 (r5) |
1 ≤ i ≤ 1587 | 1 ≤ c ≤ 18 | 2005 ≤ y ≤ 2021 | 0 ≤ r1 ≤ 1 | 0 ≤ r2 ≤ 1 | 0 ≤ r3 ≤ 1 | 0 ≤ r4 ≤ 1 | 0 ≤ r5 ≤ 1 |
where 1587 companies from 18 different countries report 5 different financial ratios annually from 2005 to 2021.
As such, there are 26979 rows (1587 * 17) in the data (1587 companies and 17 years) in which each company is assigned to a single country in all years.
I want to generate a new variable (Xa,y) for each and every row. Since I couldn't properly post the formula for Xa,y, I have attached it as a PDF file.
Eventually, all companies should have a unique "Xa,y" for each year, and 17 (the number of years) different "Xa,y"s in total.
I appreciate your help with the code for Xa,y.
Best,
LĂ¼tfi
Monday, June 27, 2022
Using a loop to calculate new variable
Hello all,
I want to find the new sale price for each quarter, represented through new_sales. I will use the sale price in 2000q4 as the baseline (100). For example, new_sales in 2001q1 should be calculated as new_sales = 100 + 0.08*100 = 108. Then I want to use this calculated new_sales value (108) to calculate new_sales for 2001q2, then find the value for 2001q3 from the value in 2001q2, and so on.
I have many different new_sales columns that need to be calculated given different scenarios, so I think some sort of loop would work best.
Here is an example of my data:
I would appreciate any assistance with this!
Code:
input str7 year float(sales_percent new_sales)
"2000q4" 0.01  100
"2001q1" 0.08  .
"2001q2" 0.06  .
"2001q3" 0.07  .
"2001q4" 0.001 .
"2002q1" 0.02  .
"2002q2" 0.02  .
"2002q3" 0.03  .
"2002q4" 0.05  .
"2003q1" 0.02  .
"2003q2" 0.03  .
"2003q3" 0.001 .
"2003q4" 0.01  .
end
Thanks,
Anoush K.
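A minimal sketch, assuming new_sales should grow by sales_percent each quarter from the 2000q4 base of 100; the recursion can be written directly with replace, which processes observations in order, and wrapped in a loop for the other scenario variables:

Code:
sort year                                        // the string quarters sort correctly here
gen double new_sales2 = 100 in 1
replace new_sales2 = new_sales2[_n-1]*(1 + sales_percent) in 2/l
* for several scenarios, loop over the percent variables (add the others to the varlist):
foreach p of varlist sales_percent {
    gen double ns_`p' = 100 in 1
    replace ns_`p' = ns_`p'[_n-1]*(1 + `p') in 2/l
}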
How to define treatment & control groups properly?
I’m working on a project examining the effect of a 2016 cash transfer on fertility.
Who is eligible for the cash?
All families with: 1.) 2+ children, or 2.) 1 low-income or disabled child.
The data doesn’t have a variable indicating who got the cash transfer, so as I understand it, I would be doing an “intent to treat” analysis by defining the treatment & control groups based on eligibility.
However, I keep getting stuck on how to define the treatment & control groups. I guess my question is since the cash transfer is universal for ALL families with 2+ kids, what would be the control group then? Theoretically, there should be two similar groups of families with 2+ kids (one who get the cash transfer and the other who don’t), but that’s not possible in this case?
Comparing eligible families (2+ kids or 1 poor/disabled kid) to ineligible families (1 kid that is not poor/disabled or zero kids) would violate one of the core assumptions of causal inference (that the treatment and control groups be similar and only differ in the “treatment”).
I think I’m getting tripped up by how the cash transfer is both universal and birth-dependent.
I’m exploring using a linear probability model with FE or a DID model, but not sure if a DID makes sense? Is synthetic control more appropriate? Any thoughts on modeling strategies?
More context: The data comes from a household survey, which I’ve organized into a panel with fertility histories for each childbearing-aged woman (e.g. each woman has 17 observations, or 18 years containing her time-variant birth information). I have data from 2010-2018 and the program started in 2016. The program grandfathers in anyone who falls in either one of two eligibility categories. The cash transfer is not means or work tested.
What does the miss option mean in gunique?
Dear Stata users,
I found the gunique command, and using the same variable list I got different total observations and unique observations depending on whether or not I add the miss option.

Does anyone know why N and the number of unbalanced groups are different in these two cases? And what does "sizes 1 to 6,026" mean?
Thanks a lot!
Code:
gunique id name_first name_last
N = 19,632,203; 4,678,062 unbalanced groups of sizes 1 to 6,026
Code:
gunique id name_first name_last, miss
N = 19,632,557; 4,678,416 unbalanced groups of sizes 1 to 6,026
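For reference, "sizes 1 to 6,026" describes the range of group sizes (the smallest group has 1 observation, the largest 6,026). A minimal check of the miss option's effect, assuming gunique (from gtools on SSC) excludes observations with a missing value in any of the listed variables unless miss is specified:

Code:
count if missing(id) | missing(name_first) | missing(name_last)
* if the assumption holds, this count should equal the difference in N
* (19,632,557 - 19,632,203 = 354), which also matches the difference in
* the number of groups (4,678,416 - 4,678,062 = 354)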
CI Decomposition Results not Showing Percentage Contribution
Hi All,
I am using Stata 16 and trying to decompose the concentration index, but the results show nothing for the percentage contribution. My dependent variable is binary, obese or not obese. My independent variables are quintile, feedingscheme (whether a student participates in the school feeding program or not), adorace (ado for adolescent; African/White/Coloured/Asian), adogender (male/female), empl_m (whether the mother is employed), educ_m and wgt_m.
The commands are as follows, using Erreygers:
Code:
sca CI=r(CI)
global X quintile feedingscheme adorace adogender empl_m educ_m wgt_m
qui sum obese [aw=wt]
sca m_obese=r(mean)
qui glm obese $X [aw=wt], family(binomial) link(logit)
qui margins , dydx(*) post
foreach x of varlist $X {
    sca b_`x'=_b[`x']
}
foreach x of varlist $X {
    qui{
        conindex `x' [aw=wt], rankvar(quintile) truezero
        sca CI_`x' = r(CI)
        sum `x' [aw=wt]
        sca m_`x'=r(mean)
        sca elas_`x' = b_`x'*m_`x'
        sca con_`x' = 4*elas_`x'*CI_`x'
        sca prcnt_`x' = (con_`x'/CI)*100
    }
    di "`x' elasticity:", elas_`x'
    di "`x' concentration index:", CI_`x'
    di "`x' contribution:", con_`x'
    di "`x' percentage contribution:", prcnt_`x'
}
matrix Aaa = nullmat(Aaa) \ (elas_`x', CI_`x', con_`x', prcnt_`x')
}
matrix rownames Caa= $X
matrix colnames Caa = "Elasticity""CI""Absolute""%"
matrix list Caa, format(%8.4f)
clear matrix
The results show just a full stop next to each variable's percentage contribution, and it also gives the error message "elas_ not found", even though I have defined the elasticity.
quintile elasticity: .02855949
quintile concentration index: .26154029
quintile contribution: .02987783
quintile percentage contribution: .
feedingscheme elasticity: .03727037
feedingscheme concentration index: .05987572
feedingscheme contribution: .00892636
feedingscheme percentage contribution: .
adorace elasticity: -.01348892
adorace concentration index: .01036392
adorace contribution: -.00055919
adorace percentage contribution: .
adogender elasticity: .02694672
adogender concentration index: -.00403962
adogender contribution: -.00043542
adogender percentage contribution: .
empl_m elasticity: .01834728
empl_m concentration index: .06688754
empl_m contribution: .00490882
empl_m percentage contribution: .
educ_m elasticity: .02113921
educ_m concentration index: .08112557
educ_m contribution: .00685972
educ_m percentage contribution: .
wgt_m elasticity: .15021019
wgt_m concentration index: .01183257
wgt_m contribution: .00710949
wgt_m percentage contribution: .
. matrix Aaa = nullmat(Aaa) \ (elas_`x', CI_`x', con_`x', prcnt_`x')
elas_ not found
r(111);
My dataex results are as follows:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte quintile float(feedingscheme adorace adogender empl_m educ_m wgt_m) . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 1 1 2 2 1 1 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1 3 . . . . 2 1 3 . . . . 2 1 3 . . . . . . . 2 2 2 2 2 1 3 . . . . 2 1 3 . . . . . . . . . . . . . . . . . . . . . 3 1 2 2 . . . 3 1 2 1 . . . . . . . . . . . . . . . . . 3 1 2 2 . . . . . . . 1 1 3 . . . . . . . . . . . . . . 3 1 2 1 2 1 2 3 1 2 2 2 1 2 . . . . 2 1 2 . . . . 2 1 2 1 1 2 1 . . . 1 1 2 1 . . . 1 . 2 1 . . . 1 1 2 1 . . . . . . . . . . . . . . . . . 1 . 2 2 . . . 1 1 2 2 1 1 3 . . . . . . . . . . . . . . . . . . . . . 1 1 2 1 1 1 3 1 1 2 2 1 1 2 1 1 2 2 1 1 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 2 . . . . . . . . . . . . . . . . . 1 1 2 2 . . . 1 1 2 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 2 . . . . . . . . . . . . . . . . . . 1 1 2 . . . . . . . . . . . 1 2 2 2 1 2 2 1 2 2 . . . . 1 2 2 1 . 2 1 1 1 1 . . . . . . . . . . . . . . 2 1 2 1 2 1 3 2 2 2 2 2 1 3 2 . 2 2 2 1 3 . . . . . . . 2 1 2 2 2 1 3 . . . . 1 3 3 2 1 2 2 2 1 3 . . . . 1 1 3 . . . . 1 1 3 . . . . . . . . . . . 1 2 3 . . . . 1 2 3 . . . . 1 2 3 . . . . 1 2 3 . . . . 1 1 3 . . . . . . . 1 2 2 2 1 1 3 . . . . . . . . . . . 1 2 3 . . . . 1 2 3 1 1 2 1 1 1 3 1 1 2 2 1 1 3 . . . . 1 1 3 . . . . 1 1 3 . . . . . . . 1 2 2 1 1 2 3 end label values feedingscheme feedingscheme5 label def feedingscheme5 1 "fscheme", modify label def feedingscheme5 2 "nfscheme", modify
Please assist.

Regards
Nthato
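A possible repair, sketched under two assumptions: the matrix accumulation line sits after the loop's closing brace, so `x' is empty there (hence "elas_ not found"), and the overall CI scalar is missing because no conindex run on the outcome precedes sca CI=r(CI) (hence the "." percentage contributions). The matrix is also built as Aaa but labelled and listed as Caa. The tail of the code could then look like this:

Code:
* define the overall CI from a conindex run on the outcome first (illustrative call)
conindex obese [aw=wt], rankvar(quintile) truezero
sca CI = r(CI)

foreach x of varlist $X {
    qui {
        conindex `x' [aw=wt], rankvar(quintile) truezero
        sca CI_`x' = r(CI)
        sum `x' [aw=wt]
        sca m_`x' = r(mean)
        sca elas_`x' = b_`x'*m_`x'
        sca con_`x' = 4*elas_`x'*CI_`x'
        sca prcnt_`x' = (con_`x'/CI)*100
    }
    matrix Aaa = nullmat(Aaa) \ (elas_`x', CI_`x', con_`x', prcnt_`x')   // inside the loop
}
matrix rownames Aaa = $X
matrix colnames Aaa = Elasticity CI Absolute Pct
matrix list Aaa, format(%8.4f)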
synth_runner automatically generated predictorvars
I am using synth_runner in Stata 17. I get exactly the same results when I run two specifications; depvar remains the same. In the second specification, I added one year of the outcome variable to the predictorvars, and it doesn't make any difference. I suspect that this is because training_propr(real>0) automatically generates predictors from both depvar and predictorvars. From the help file, it is not clear whether depvar is considered a potential predictor. Is there any other reason that could lead to the same results from these two specifications? Is there any way that I can use the training_propr(real>0) option but also use only one year's outcome variable as a predictor (I have multiple treatment units)?
I will use the data and code from synth_runner help file to illustrate the problem.
Code:
clear all
use smoking, clear
tsset state year
capture drop D

program my_pred, rclass
    args tyear
    return local predictors "beer(`=`tyear'-4'(1)`=`tyear'-1') lnincome(`=`tyear'-4'(1)`=`tyear'-1')"
end

program my_drop_units
    args tunit
    if `tunit'==39 qui drop if inlist(state,21,38)
    if `tunit'==3 qui drop if state==21
end

program my_xperiod, rclass
    args tyear
    return local xperiod "`=`tyear'-12'(1)`=`tyear'-1'"
end

program my_mspeperiod, rclass
    args tyear
    return local mspeperiod "`=`tyear'-12'(1)`=`tyear'-1'"
end

generate byte D = (state==3 & year>=1989) | (state==7 & year>=1988)

* Specification 1
synth_runner cigsale retprice age15to24, d(D) pred_prog(my_pred) trends training_propr(`=13/18') drop_units_prog(my_drop_units) xperiod_prog(my_xperiod) mspeperiod_prog(my_mspeperiod)

* Specification 2: add cigsale(1980) as a potential predictor
synth_runner cigsale cigsale(1980) retprice age15to24, d(D) pred_prog(my_pred) trends training_propr(`=13/18') drop_units_prog(my_drop_units) xperiod_prog(my_xperiod) mspeperiod_prog(my_mspeperiod)
Specification 1 | |||
Post-treatment results: Effects, p-values, standardized p-values | |||
estimates | pvals | pvals_std | |
c1 | -0.027493 | 0.3002191 | 0.0021914 |
c2 | -0.0485773 | 0.1775018 | 0.0043828 |
c3 | -0.0921521 | 0.0394449 | 0 |
c4 | -0.1017043 | 0.0409058 | 0 |
c5 | -0.1270111 | 0.0241052 | 0 |
c6 | -0.1352273 | 0.0219138 | 0 |
c7 | -0.141674 | 0.0262966 | 0 |
c8 | -0.196867 | 0.0051132 | 0 |
c9 | -0.1754307 | 0.0124178 | 0 |
c10 | -0.1833944 | 0.0197224 | 0 |
c11 | -0.1910038 | 0.0233747 | 0 |
c12 | -0.1889059 | 0.0219138 | 0 |
Specification 2
Post-treatment results: Effects, p-values, standardized p-values
| estimates | pvals | pvals_std |
c1 | -0.027493 | 0.3002191 | 0.0021914 |
c2 | -0.0485773 | 0.1775018 | 0.0043828 |
c3 | -0.0921521 | 0.0394449 | 0 |
c4 | -0.1017043 | 0.0409058 | 0 |
c5 | -0.1270111 | 0.0241052 | 0 |
c6 | -0.1352273 | 0.0219138 | 0 |
c7 | -0.141674 | 0.0262966 | 0 |
c8 | -0.196867 | 0.0051132 | 0 |
c9 | -0.1754307 | 0.0124178 | 0 |
c10 | -0.1833944 | 0.0197224 | 0 |
c11 | -0.1910038 | 0.0233747 | 0 |
c12 | -0.1889059 | 0.0219138 | 0 |
Extract country names from affiliations
Hi,
I have a dataset of about 1000 articles with variables such as id, title, abstract and affiliation. I was unable to get a dataex due to an error:
I have to separate the affiliation details (e.g. school, department, email), street address and country names into different columns, but my main interest is in the country names for each row. Is it possible to do so using Stata 16? Here are first 2 rows.
Code:
input strL affiliation
data width (7267 chars) exceeds max linesize. Try specifying fewer variables
r(1000);
Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and University Department of Infectious and Tropical Diseases, University of Brescia and and ASST Spedali Civili, Brescia, Italy. and Department of Molecular and Translational Medicine and Clinical Chemistry and Laboratory ASST Spedali Civili, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy; a.e.guerini@gmail.com. and Department of Radiation Oncology, University of Brescia and Spedali Civili and Hospital, Brescia, Italy.
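No answer appears in the post, but here is a minimal, untested sketch of one heuristic that could be tried: in affiliation strings like these, the country is usually the last comma-separated piece, so keep the text after the last comma and strip the trailing period. The variable name country is hypothetical, and messy entries (emails, several affiliations mashed together) would still need manual cleaning.
Code:
* hypothetical sketch, not from the original thread
gen country = strtrim(substr(affiliation, strrpos(affiliation, ",") + 1, strlen(affiliation)))
replace country = subinstr(country, ".", "", .)   // drop trailing periods
tab country, sort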
Sunday, June 26, 2022
Flexible case-control matching command
Hello,
First, thanks in advance for anyone who can help me with this. I haven't had much luck recently with other avenues so I hoped I might find some advice here.
I'm performing a case-control analysis on a dataset of mine in which I want to match 2 controls to 1 case based on a date value (matched on day). To increase the power I'm trying to add some flexibility by allowing controls to be matched +/- 1 day, rather than only on the specific day. For example, if I have 1 case whose date is 06/29/2018, and there's only 1 control who shares that specific date, I want to be able to match a second control whose date is either 06/28/2018 or 06/30/2018.
Variables:
case (binary, 0 or 1)
date (str11, format is mm/dd/yyyy)
I'm using the ccmatch command, which does match entries on one variable based on a given set of other variables. However it does not match multiple controls to a case and doesn't account for flexible matching criteria like mine.
The only way I can do this now is by painstakingly poring through the (rather large) dataset and individually matching a second control to each identified case, and since ccmatch matches 1:1 at random, such additional code is difficult to replicate.
Let me know if any more information is needed, I'm happy to explain myself better. This is my first post so apologies if I've made an error with this post.
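No answer is shown in the post; the following is only a rough, untested sketch of one way to build the +/- 1 day candidate pairs with official commands, before any sampling step. The variables case and date come from the post; ddate, caseid, ctrl_id and match_date are hypothetical names added for illustration.
Code:
* hypothetical sketch: pair each case with every control dated within 1 day
gen ddate = daily(date, "MDY")                  // str11 "mm/dd/yyyy" -> daily date
format ddate %td

preserve
keep if case == 0
gen long ctrl_id    = _n
gen      match_date = ddate
keep ctrl_id match_date
tempfile controls
save `controls'
restore

keep if case == 1
gen long caseid = _n
expand 3                                        // three candidate dates per case
bysort caseid: gen match_date = ddate + _n - 2  // -1, 0, +1 days
joinby match_date using `controls', unmatched(none)
* each row now pairs a case with a control dated within one day of it; choosing
* two controls per case (and not reusing controls) still needs an assignment step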
Working with Time and DateTime variables from Excel
Hi,
I noticed that whenever I import an excel file that contains a time or datetime variable onto Stata (time up to seconds), the values for those variables in Stata appear to be slightly off sometimes (generally by a second). I understand that there is a difference in the way Excel and Stata read dates and times and there might be some rounding issue at play here. I am looking for a way for Stata to read the excel time/datetime variables accurately. I see some similar questions and documentation online that address my question but I am still a little confused. Any help with this would be appreciated. Thank you!
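One workaround that is sometimes suggested (hedged here, since the exact cause in this file is unknown): Excel stores times as fractional days in a double, so the converted %tc value can be a few milliseconds off, and rounding the imported datetime to the nearest second usually removes the one-second discrepancies. The variable name dt is hypothetical.
Code:
* hypothetical sketch: after -import excel-, round a %tc datetime to whole seconds
gen double dt_fixed = round(dt, 1000)   // %tc values are stored in milliseconds
format dt_fixed %tc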
scatter graph with different styles and specific axis
Hi statalist,
I have the following data:
I want to make a graph out of it so that I get points for all of the values of the variables return, predicted_return and abnormal_return.
The x-axis should be the time t-1 to t+3.
The y-axis should be the values.
It would also be great if the three variables got different shapes, with period 1 in blue and period 2 in red.
Do you have any recommendations on how to do that?
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte period float(time return predicted_return abnormal_return)
1 1 -.0015407993 -.0008083973 -.000732402
1 2 .0019429044 -.0010541489 .002997053
1 3 -.023466034 .0014670364 -.02493307
1 4 -.0030919516 -.0022815885 -.0008103631
1 5 -.0015958005 .0001712715 -.001767072
2 1 .003430538 .001985073 .001445465
2 2 -.001894033 .0004734376 -.0023674704
2 3 -.020554027 .0014578053 -.02201183
2 4 .002066248 .002363168 -.0002969196
2 5 -.00005386278 .0003358708 -.0003897336
end
label values time t
label def t 1 "t-1", modify
label def t 2 "t", modify
label def t 3 "t+1", modify
label def t 4 "t+2", modify
label def t 5 "t+3", modify
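No solution is given in the post, but a minimal sketch along these lines might be a starting point (marker symbols and colors are arbitrary choices, not from the thread):
Code:
* hypothetical sketch: one scatter per variable and period, colored by period
twoway (scatter return           time if period==1, msymbol(O) mcolor(blue)) ///
       (scatter predicted_return time if period==1, msymbol(D) mcolor(blue)) ///
       (scatter abnormal_return  time if period==1, msymbol(T) mcolor(blue)) ///
       (scatter return           time if period==2, msymbol(O) mcolor(red))  ///
       (scatter predicted_return time if period==2, msymbol(D) mcolor(red))  ///
       (scatter abnormal_return  time if period==2, msymbol(T) mcolor(red)), ///
       xlabel(1(1)5, valuelabel) xtitle("") ytitle("Value")                   ///
       legend(order(1 "return" 2 "predicted_return" 3 "abnormal_return"))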
Twoway Line: Deleting a straight line
Hello, I am trying to create a figure using the code: twoway line sum mdate. The figure shows both an actual line and a straight fitted line (please find attached). Is there any way I can delete the straight fitted line? Any advice would be appreciated.
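Without the attachment this is only a guess, but one common cause of a spurious straight segment is that the data are not sorted on the x variable, so -twoway line- connects the points in data order. The -sort- option (or sorting beforehand) may be all that is needed:
Code:
* hypothetical sketch: draw the line in mdate order
twoway line sum mdate, sort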
Panel Data fixed effects and time effects
Hello there,
For my master thesis I am conducting research on the effects of the digital divide on educational attainment in Europe. For this research I gathered data on 29 countries over a period of 14 years.
My dependent variable is the % of the population that completed tertiary education (age group 24-34).
The independent variables are: the population that has access to broadband internet (in %) and the Gini score (from 0 to 100, lower means better).
Then I looked for some control variables: population (total) and mean income (still thinking about adding the unemployment rate as another control variable).
Upon using fixed and random effects:
Fixed:
Code:
. xtreg educ population gini broadband incomeMean, fe Fixed-effects (within) regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7214 min = 11 Between = 0.3053 avg = 13.7 Overall = 0.3832 max = 14 F(4,365) = 236.33 corr(u_i, Xb) = -0.3124 Prob > F = 0.0000 ------------------------------------------------------------------------------ educ | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- population | -9.70e-08 2.92e-07 -0.33 0.740 -6.71e-07 4.77e-07 gini | -.150809 .1070503 -1.41 0.160 -.3613219 .0597038 broadband | .2142734 .0098261 21.81 0.000 .1949504 .2335963 incomeMean | .0004968 .0000762 6.52 0.000 .000347 .0006467 _cons | 20.37698 5.580489 3.65 0.000 9.403037 31.35093 -------------+---------------------------------------------------------------- sigma_u | 7.6449458 sigma_e | 2.5569006 rho | .89939297 (fraction of variance due to u_i) ------------------------------------------------------------------------------ F test that all u_i=0: F(28, 365) = 91.84 Prob > F = 0.0000
random:
Code:
xtreg educ population gini broadband incomeMean, re Random-effects GLS regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7206 min = 11 Between = 0.3157 avg = 13.7 Overall = 0.3976 max = 14 Wald chi2(4) = 945.94 corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ educ | Coefficient Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- population | -8.62e-08 5.70e-08 -1.51 0.130 -1.98e-07 2.54e-08 gini | -.0609734 .1026333 -0.59 0.552 -.262131 .1401841 broadband | .2166468 .0096297 22.50 0.000 .1977729 .2355207 incomeMean | .0004461 .0000651 6.85 0.000 .0003184 .0005737 _cons | 18.24877 3.490679 5.23 0.000 11.40716 25.09037 -------------+---------------------------------------------------------------- sigma_u | 6.9655036 sigma_e | 2.5569006 rho | .88125285 (fraction of variance due to u_i) ------------------------------------------------------------------------------ .
I used the Hausman test to confirm that fixed effects would be the better method to use:
Code:
hausman fixed random Note: the rank of the differenced variance matrix (3) does not equal the number of coefficients being tested (4); be sure this is what you expect, or there may be problems computing the test. Examine the output of your estimators for anything unexpected and possibly consider scaling your variables so that the coefficients are on a similar scale. ---- Coefficients ---- | (b) (B) (b-B) sqrt(diag(V_b-V_B)) | fixed random Difference Std. err. -------------+---------------------------------------------------------------- population | -9.70e-08 -8.62e-08 -1.08e-08 2.86e-07 gini | -.150809 -.0609734 -.0898356 .0304333 broadband | .2142734 .2166468 -.0023734 .0019549 incomeMean | .0004968 .0004461 .0000508 .0000395 ------------------------------------------------------------------------------ b = Consistent under H0 and Ha; obtained from xtreg. B = Inconsistent under Ha, efficient under H0; obtained from xtreg. Test of H0: Difference in coefficients not systematic chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B) = 8.95 Prob > chi2 = 0.0299 (V_b-V_B is not positive definite)
I added robust to cluster my standard errors and got this as a result:
Code:
. xtreg educ population gini broadband incomeMean, fe robust Fixed-effects (within) regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7214 min = 11 Between = 0.3053 avg = 13.7 Overall = 0.3832 max = 14 F(4,28) = 35.53 corr(u_i, Xb) = -0.3124 Prob > F = 0.0000 (Std. err. adjusted for 29 clusters in country) ------------------------------------------------------------------------------ | Robust educ | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- population | -9.70e-08 4.97e-07 -0.20 0.847 -1.12e-06 9.21e-07 gini | -.150809 .1814625 -0.83 0.413 -.5225182 .2209001 broadband | .2142734 .0252991 8.47 0.000 .1624506 .2660962 incomeMean | .0004968 .0001977 2.51 0.018 .000092 .0009017 _cons | 20.37698 9.725685 2.10 0.045 .4548208 40.29915 -------------+---------------------------------------------------------------- sigma_u | 7.6449458 sigma_e | 2.5569006 rho | .89939297 (fraction of variance due to u_i) ------------------------------------------------------------------------------
Now two of my independent variables are significant, and overall the model also seems significant if I read the F statistic.
Upon adding i.year to the xtreg command like this:
Code:
. xtreg educ population gini broadband incomeMean i.year, fe robust Fixed-effects (within) regression Number of obs = 398 Group variable: country Number of groups = 29 R-squared: Obs per group: Within = 0.7746 min = 11 Between = 0.0199 avg = 13.7 Overall = 0.0545 max = 14 F(17,28) = 24.52 corr(u_i, Xb) = -0.8738 Prob > F = 0.0000 (Std. err. adjusted for 29 clusters in country) ------------------------------------------------------------------------------ | Robust educ | Coefficient std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- population | -8.02e-07 4.62e-07 -1.73 0.094 -1.75e-06 1.45e-07 gini | -.1726822 .165679 -1.04 0.306 -.5120602 .1666957 broadband | .0004282 .0541787 0.01 0.994 -.1105518 .1114082 incomeMean | -.0000562 .0001719 -0.33 0.746 -.0004084 .000296 | year | 2008 | 1.393604 .514999 2.71 0.011 .3386763 2.448531 2009 | 2.785943 .9833509 2.83 0.008 .7716395 4.800246 2010 | 3.947826 1.240529 3.18 0.004 1.406718 6.488935 2011 | 4.918202 1.571527 3.13 0.004 1.699074 8.137329 2012 | 6.289633 1.930353 3.26 0.003 2.335484 10.24378 2013 | 7.574748 2.091623 3.62 0.001 3.290252 11.85924 2014 | 9.325942 2.29589 4.06 0.000 4.623026 14.02886 2015 | 9.79276 2.449228 4.00 0.000 4.775744 14.80978 2016 | 10.65857 2.584618 4.12 0.000 5.364219 15.95292 2017 | 11.28827 2.743002 4.12 0.000 5.669486 16.90705 2018 | 12.10324 2.847884 4.25 0.000 6.269612 17.93686 2019 | 12.90674 3.009904 4.29 0.000 6.741226 19.07224 2020 | 13.88196 3.176958 4.37 0.000 7.374253 20.38966 | _cons | 50.14916 8.640671 5.80 0.000 32.44955 67.84878 -------------+---------------------------------------------------------------- sigma_u | 19.538604 sigma_e | 2.3420248 rho | .98583553 (fraction of variance due to u_i) ----
with testparm for year:
Code:
. testparm i.year ( 1) 2008.year = 0 ( 2) 2009.year = 0 ( 3) 2010.year = 0 ( 4) 2011.year = 0 ( 5) 2012.year = 0 ( 6) 2013.year = 0 ( 7) 2014.year = 0 ( 8) 2015.year = 0 ( 9) 2016.year = 0 (10) 2017.year = 0 (11) 2018.year = 0 (12) 2019.year = 0 (13) 2020.year = 0 F( 13, 28) = 3.71 Prob > F = 0.0018
Now my question is: am I doing this right by adding i.year to the regression? It seems that the independent variables that were significant before are no longer significant. The R-squared also changed drastically here, but the F statistic still says the model is significant.
How can I fix this? Help or hints would greatly help me and is enormously appreciated.
Thank you and sorry for this very long message, but I tried to be as clear as possible by adding every step I took.
Kind regards,
Karim
Question on metacumbounds
Hi everyone,
I would like to seek advice on the metacumbounds package used for trial sequential analysis.
I have 2 questions:
1. Error message on finding the R pathname:
For the metacumbounds package, R is required. I have downloaded both foreign and ldbounds in R (v 3.3.3).
I have tried this on both a Mac and a Windows PC, but I receive this error message on my Mac (with the default settings in the metacumbounds dialog box):
R executable Rterm.exe not found in C:\Program Files\R\R-2.12.2\bin\i386\, cannot access R software
I have modified the pathname to /Library/Frameworks/R.framework (the location on a Mac), but I received the same error.
Is there anyone that can help with this?
2. "observation numbers out of range" error
The error in 1 (the R pathname error) was obtained when I used the sample dataset provided alongside the metacumbounds package. This is the setting used:
[screenshot of the metacumbounds dialog settings]
When I used my own data set (shown in the screenshot), I obtained this error:
observation numbers out of range
r(198);
How should I deal with this error?
Thanks for your help.
Saturday, June 25, 2022
Populating a column with values
I need to populate the empty GDP cells with the corresponding value within that year; that is, I have to repeat the entry of the value (for instance, fill 2004 with 128, 2005 with 131, 2006 with 130, and 2007 with 127). Any suggestions for the right code to use? This is just an example; I have over 6 million observations covering 24 countries.
year | country | GDP
2004 | SE      | 128
2004 | SE      |
2004 | SE      |
2005 | SE      | 131
2005 | SE      |
2005 | SE      |
2006 | SE      | 130
2006 | SE      |
2006 | SE      |
2007 | SE      | 127
2007 | SE      |
2007 | SE      |
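A minimal sketch of one common idiom (assuming GDP is numeric and there is at most one distinct non-missing GDP value per country-year): sort each group so the non-missing value comes first and copy it to the empty rows.
Code:
* hypothetical sketch: spread the recorded GDP value across its country-year group
* (for numeric GDP, missing values sort last, so GDP[1] is the recorded value)
bysort country year (GDP): replace GDP = GDP[1] if missing(GDP)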
Failing to convert .shp and .dbf dile to .dta format
Hello,
I'm trying to convert a gis file to .dta format. To answer a few questions: .shp and .dbf are in the same directory and there is no mistake in selecting the right directory. I used the same coding to convert other gis files before and it worked. But, for some reason it's not working. Any idea why ?
Code:
spshape2dta "Georgia.shp", replace saving(Georgia2014) (importing .shp file) st_sstore(): 3300 argument out of range import_shp_polym(): - function returned error import_shp_read_shapes(): - function returned error import_shp_read_write_data(): - function returned error import_shp(): - function returned error <istmt>: - function returned error could not import .shp file r(3300);
Changing values of a variable
Dear Listers,
I am using Stata 15.1.
I am working with a dataset with more than a million observations.
I would like to change the values of a variable in order to merge two datasets: the "ID" in the using data is one digit longer than in the master data, so I would like to drop the last digit of the ID in the using data before merging.
What I want: in the following synthetic data, I would like to get rid of the last digit of the values of the variable length, so that each value shortens to a two-digit value.
I tried to use levelsof with usubstr. It did not work. The closest I found is this: https://www.statalist.org/forums/for...velsof-command.
Thank you in advance for any tip.
Code:
sysuse auto, clear
levelsof length, local(levels)
foreach l of local levels {
    gen length2 = usubstr("`levels'", 1, 2)
}
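Since the post's own attempt is shown as not working, here is a hedged sketch of one alternative (not from the thread): for a numeric variable, dropping the last digit is just integer division by 10; for a string ID, keep all but the last character. The names length2, idstr and id2 are illustrative.
Code:
* hypothetical sketch
sysuse auto, clear
gen length2 = floor(length/10)                    // numeric: drop the last digit
* for a string ID variable, something like:
* gen id2 = usubstr(idstr, 1, strlen(idstr) - 1)  // keep all but the last character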
log differenced model and GMM Estimator
Hi all,
I have panel data (T=30), and due to the potential threat of non-stationarity I have transformed my data into log differences.
There is also endogeneity bias in my model, so I want to apply a GMM estimator. Since GMM applies a differencing transformation to the data, this would amount to double differencing in my case.
In this case, can I use difference or system GMM on the model? If not, which estimator would be appropriate?
I would appreciate an early response.
Thanks
heterofactor and ML maximization
I am using the heterofactor command (https://www.stata-journal.com/articl...article=st0431), but it is not interacting properly with the ml maximize options. For example, I want to stop after one iteration, so I write:
local dep_vars "ln_mean_wage high_shool college black latino hgc_mother_1979"
local controls "black latino hgc_mother_1979 highest_grade_complete"
local testslist " asvab_8_1981 asvab_6_1981 asvab_5_1981 asvab_4_1981 asvab_10_1981 rotter_score_1979 rosenberg_esteem_score_1980"
heterofactor unwanted `dep_vars' , indvarsc(`controls') scores(`testslist') factors(2) numf1tests(5) numf2tests(2) triangular difficult fdistonly initialreg nohats nochoice nodes(4)
but the output that I get is:
Estimating Initial Values Vector
Running Factor Model
Twostep option specified
Step: 1
Factor: 1
Iteration 0: log likelihood = -41281.163 (not concave)
Iteration 1: log likelihood = -40578.565 (not concave)
Iteration 2: log likelihood = -39899.142 (not concave)
Iteration 3: log likelihood = -39490.265 (not concave)
Iteration 4: log likelihood = -39334.409 (not concave)
Iteration 5: log likelihood = -39276.052 (not concave)
Iteration 6: log likelihood = -39235.622 (not concave)
Iteration 7: log likelihood = -39198.96
Iteration 8: log likelihood = -39130.351 (not concave)
Iteration 9: log likelihood = -39113.587
Iteration 10: log likelihood = -39095.938
Iteration 11: log likelihood = -39085.982
Iteration 12: log likelihood = -39083.795
Iteration 13: log likelihood = -39072.259
I have the same problem with ltolerance(#) or any maximize option.
Does someone know how to solve it?
Significant in one-way, but not in two-way ANOVA
Hey, I'm currently doing my data analysis for my thesis but I encountered a problem. The main effect is significant in the one-way ANOVA, but insignificant in the two-way ANOVA. Can somebody help me?
Thanks in advance!
Stset with durations
Hello, I have wide-format data with different dates, which I used to create different durations since the beginning of observation.
I am hesitating between two setup methods.
To set up Stata, I use the duration until last news, DURATION1, and as failure I use the duration until death, DURATION2, for those who died.
So I do:
stset DURATION1, fail(DURATION2)
I had doubts, so I tried another way: I use DURATION2, but I replace the missing values for those who did not die with the value of DURATION1, and I create a dichotomous failure variable.
This leads to:
stset DURATION2modified, fail(dead)
The results are different, and I am inclined to use the second method. Could anyone give me feedback?
Thank you
Mathieu
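For what it is worth, a hedged note on why the two setups differ: with -stset-, failure(varname) without == marks failure whenever varname is nonzero and nonmissing, so fail(DURATION2) does not treat DURATION2 as a time at all. The second setup matches the usual pattern, roughly as follows (a sketch built from the variables named in the post):
Code:
* hypothetical sketch of the second setup described in the post
gen DURATION2modified = DURATION2
replace DURATION2modified = DURATION1 if missing(DURATION2)
gen byte dead = !missing(DURATION2)          // 1 = died, 0 = censored at last news
stset DURATION2modified, failure(dead)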
Question: Bayesian Vector Autoregression (BVAR)
Hello everyone, is there any code to summarise and/or combine results from a BVAR model?
For instance, quietly, esttab, and eststo do not work for the BVAR model. Kindly advise, please.
Split population duration
Hello, I'm trying to fit a split-population duration model with the spsurv command, but I can't understand how to get the survival determinants on one side and the lifetime determinants on the other.
stset duree, failure(survie)
spsurv survie duree region secteur regime_jur FCEFFE lcapsocial cap_class, id(rge_id) seq(duree)
I need your help
Merge accuracy using str format when most contain only numbers
Dear stata user,
I have a question regarding the accuracy of merging on string keys. I have dataset A, whose firm_id is in string format; most of the values actually contain only numbers, but those that are not just numbers look like the following:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str8 patent_id "RE43814" "RE43864" "RE43868" "RE43956" "RE43986" "RE43997" "RE44164" "RE44215" "RE44861" "RE44874" "RE44924" "RE44930" "RE44958" "RE44993" "RE45248" "RE45348" "RE45418" "RE45473" "RE45539" "RE45733" "RE45782" "RE45804" "RE45956" "RE45962" "RE45990" "RE46020" "RE46089" "RE46096" "RE46176" "RE46193" "RE46351" "RE46409" "RE46436" "RE46436" "RE46436" "RE46473" "RE46488" "RE46518" "RE46558" "RE46564" "RE46630" "RE46686" "RE46703" "RE46746" "RE46850" "RE46891" "RE47055" "RE47257" "RE47341" "RE47342" "RE47351" "RE47425" "RE47487" "RE47553" "RE47663" "RE47698" "RE47715" "RE47736" "RE47737" "RE47761" "RE47763" "RE47813" "RE47857" "RE47949" "RE48267" "RE48274" "RE48308" "RE48359" "RE48378" "RE48446" "RE48524" "RE48532" "RE48599" "RE48641" "RE48695" "RE48702" "T100501" "T958006" "T962010" "T964006" "T965001" "T988005" end
And I have dataset B, whose firm_id contains only numbers and is stored as a numeric (long) variable.
Now I want to merge them using firm_id as a key. I have two options:
1. turn str to long
2. turn long to str
For the first option, I think merging on the numeric (long) format will be more accurate, but I have to drop the firms whose IDs contain characters, and I don't know how to test which IDs contain characters and drop them.
For the second option, I don't need to give up any observations, but I wonder about the accuracy of merging on the string format: will it be accurate when most IDs contain only numbers?
Thanks!
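No answer is shown in the post, but here is a hedged sketch of how one might inspect and handle the non-numeric IDs before choosing a key type (has_chars and num_id are hypothetical names, not from the thread):
Code:
* hypothetical sketch
gen byte has_chars = !regexm(firm_id, "^[0-9]+$")   // 1 if the ID has any non-digit
list firm_id if has_chars, noobs

* option 1: numeric key (loses the IDs with characters)
destring firm_id, gen(num_id) force                 // non-numeric IDs become missing

* option 2: a string key keeps everything; an exact string merge is reliable as long
* as both files store the same text (watch for leading zeros or blanks), e.g.
* tostring firm_id, replace format(%12.0f)          // in the all-numeric dataset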
Friday, June 24, 2022
Summing over a value of variable for repeating county
Hello respected stata community,
I have a dummy variable which takes the values 0 and 1. The observations for different counties in a single year, 2017, are given below; the counties are repeated several times within the year. I need to find the combined value of the variable length for each unique county of a unique state in a single year when the dummy variable takes the value 1. And if a unique county never takes the value 1, the desired combined-length variable should be zero.
Can anyone kindly tell me how I can do this? I've thought about it for a couple of days but haven't come up with any efficient code yet.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input int Year_Recor byte State_Code int County_Cod byte dummy float length 2017 1 89 0 .1 2017 1 1 1 .043 2017 1 123 0 .1 2017 1 45 0 .097 2017 1 97 0 .438 2017 1 123 0 .1 2017 1 117 1 .1 2017 1 25 0 .1 2017 1 123 1 .1 2017 1 51 0 .1 2017 1 123 0 .1 2017 1 3 0 .1 2017 1 115 0 .1 2017 1 71 0 .1 2017 1 77 0 .034 2017 1 61 0 .1 2017 1 39 0 .1 2017 1 125 0 .1 2017 1 59 0 .1 2017 1 15 0 .1 2017 1 89 1 .1 2017 1 81 1 .1 2017 1 77 0 .1 2017 1 127 0 .036 2017 1 103 0 .1 2017 1 49 0 .018 2017 1 81 0 .1 2017 1 101 0 .1 2017 1 91 0 .1 2017 1 91 0 .087 2017 1 49 1 .1 2017 1 115 0 .1 2017 1 107 0 .1 2017 1 89 0 .004 2017 1 43 0 2.22 2017 1 31 0 .1 2017 1 79 0 1.013 2017 1 125 1 .056 2017 1 117 0 .1 2017 1 3 1 .04 2017 1 49 1 .1 2017 1 125 1 .052 end
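A minimal sketch of one common idiom for this (not from the thread): total length over the rows where the dummy is 1, within each state-county-year; groups with no such rows get 0 automatically because no observation contributes to the sum.
Code:
* hypothetical sketch, using the variable names from the dataex
egen combined_length = total(length * (dummy == 1)), ///
     by(Year_Recor State_Code County_Cod)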
Box plot help
Hello everyone,
I am trying to make this visual from a book by Edward Tufte where he talks about using a stripped down version of the box plot as practice.
I wrote some code, but I am unable to figure out why the length of the whiskers keeps coming out incorrect. Can someone help me out here?
The plot type 1 whiskers are different from those in plot type 2. Can someone kindly check what I am doing wrong in this code? I would be grateful.
Code:
sysuse auto, clear * Name of variable to use for box plot: local variable price * Display boxplot by which group? local group foreign * Plot type 1 capture separate `variable', by(`group') levelsof `group', local(lvl) foreach level of local lvl { sort `variable' quietly summ `group' local max = `r(max)' local min = `r(min)' local scale = `r(max)' - `r(min)' local offset : display abs(`scale'*0.02) quietly summ `variable' if `group' == `level', detail local level = `level' + 1 local xlab "`xlab' `level' `" "`:lab (`group') `=`level'-1''" "'" local med_p_`level' = `r(p50)' local p75_`level' = `r(p75)' local p25_`level' = `r(p25)' local iqr_`level' = `p75_`level'' - `p25_`level'' display "Median = `med_p_`level''" display "P75 = `p75_`level''" display "P25 = `p25_`level''" display "IQR = `iqr_`level''" display "Low = `=`p25_`level''-(1.5*`iqr_`level'')'" display "Max = `=`p75_`level''+(1.5*`iqr_`level'')'"" display "Varname = `variable'`=`level'-1'" egen llw_`level' = min(max(`variable'`=`level'-1', `=`p25_`level''-(1.5*`iqr_`level'')')) egen uuw_`level' = max(min(`variable'`=`level'-1', `=`p75_`level''+(1.5*`iqr_`level'')')) quietly summ uuw_`level' local max_`level' = `r(mean)' quietly summ llw_`level' local min_`level' = `r(mean)' local lines `lines' /// (scatteri `p75_`level'' `level' `max_`level'' `level', recast(line) lpattern(solid) lcolor(black) lwidth(1)) || /// (scatteri `p25_`level'' `level' `min_`level'' `level', recast(line) lpattern(solid) lcolor(black) lwidth(1)) || /// (scatteri `p75_`level'' `=`level' + `offset'' `p25_`level'' `=`level' + `offset'', recast(line) lpattern(solid) lcolor(black) lwidth(1)) || /// (scatteri `med_p_`level'' `=`level' + `offset'', ms(square) mcolor(background)) || } *drop llw* uuw* twoway `lines', /// ytitle("`: variable label `variable''") /// ylabel(2000(2000)10000) xtitle("") /// xlabel(`xlab', nogrid) /// xscale(range(`=`min' + 0.5' `=`max' + 1.5')) /// scheme(white_tableau) /// title("{bf}Tufte Styled Box Plot", pos(11) margin(b+3) size(*.7)) /// subtitle("`: variable label `variable'' grouped by `: variable label `group''", pos(11) margin(b+6 t=-3) size(*.6)) /// legend(off) * Tufte style box plot version 2 graph box mpg, box(1, color(white%0)) medtype(marker) medmarker(mcolor(black) mlwidth(0)) cwhiskers alsize(0) intensity(0) over(foreign) lintensity(1) lines(lpattern(solid) lwidth(medium) lcolor(black)) nooutside ylabel(, nogrid) scheme(white_tableau)
Thursday, June 23, 2022
How to do a Box Plot with mean instead of median and SD instead of quartiles?
Dear Statalisters,
Please have a look at my data:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float event byte countrycode 0 2 0 4 0 7 0 7 0 5 1 3 0 9 1 4 0 8 0 5 0 8 0 5 0 5 0 5 0 5 0 3 0 5 0 3 0 5 0 7 end
event is a binary variable. I'd like to generate some visual descriptive statistics for event. Given that it is a binary variable, a box plot is inappropriate, as there would be no boundaries. But I would like something that looks like a box plot, showing the mean where a box plot traditionally shows the median, with the borders at the mean +/- the standard deviation. To make things clearer, here's an image of how it would look:
[mock-up image attachment]
For each country code, the center of the box would be the mean of event, the left boundary of the box would be the mean - sd of event, and the left boundary of the box would be the mean + the sd of event, so the mean would be at the center of each box.
Is there a way to generate such plots? I tried the command -graph box- but my attempts remained unsuccessful. Any help would be appreciated. Thanks a lot!
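-graph box- cannot do this directly, but a hedged sketch of one workaround (not from the thread) is to collapse to the mean and SD by country and draw the "boxes" with -twoway rbar-; m, s, lo and hi are hypothetical names.
Code:
* hypothetical sketch
preserve
collapse (mean) m = event (sd) s = event, by(countrycode)
gen lo = m - s
gen hi = m + s
twoway (rbar lo hi countrycode, horizontal barwidth(0.6) fcolor(none) lcolor(black)) ///
       (scatter countrycode m, msymbol(O) mcolor(black)),                            ///
       ytitle("Country code") xtitle("event: mean +/- SD") legend(off)
restore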
Help with bootstrap in obtaining a standard error.
Hi Everyone: I think my problem has nothing to do with the data set and so I'm not showing a data example. In the bootstrap, I want a standard error for the average of six estimated marginal effects but I get a missing value. I get the standard errors for the marginal effects themselves, but not the average. Clearly I don't know how to compute a function of coefficients within a bootstrap program. See my calculation of tauavg. Thanks for any hints.
Code:
. capture program drop aggregate_boot . . program aggregate_boot, rclass 1. . poisson y i.w#c.d4#c.f04 i.w#c.d4#c.f05 i.w#c.d4#c.f06 /// > i.w#c.d5#c.f05 i.w#c.d5#c.f06 /// > i.w#c.d6#c.f06 /// > i.w#c.d4#c.f04#c.x i.w#c.d4#c.f05#c.x i.w#c.d4#c.f06#c.x /// > i.w#c.d5#c.f05#c.x i.w#c.d5#c.f06#c.x /// > i.w#c.d6#c.f06#c.x /// > f02 f03 f04 f05 f06 /// > c.f02#c.x c.f03#c.x c.f04#c.x c.f05#c.x c.f06#c.x /// > d4 d5 d6 x c.d4#c.x c.d5#c.x c.d6#c.x, noomitted 2. estimates store beta 3. . margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 1 f05 = 0 f06 = 0) /// > subpop(if d4 == 1) noestimcheck post 4. return scalar tau44 = _b[1.w] 5. estimates restore beta 6. margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 1 f06 = 0) /// > subpop(if d4 == 1) noestimcheck post 7. return scalar tau45 = _b[1.w] 8. estimates restore beta 9. margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 0 f06 = 1) /// > subpop(if d4 == 1) noestimcheck post 10. return scalar tau46 = _b[1.w] 11. estimates restore beta 12. margins, dydx(w) at(d4 = 0 d5 = 1 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 1 f06 = 0) /// > subpop(if d5 == 1) noestimcheck post 13. return scalar tau55 = _b[1.w] 14. estimates restore beta 15. margins, dydx(w) at(d4 = 0 d5 = 1 d6 = 0 f02 = 0 f03 = 0 f04 = 0 f05 = 0 f06 = 1) /// > subpop(if d5 == 1) noestimcheck post 16. return scalar tau56 = _b[1.w] 17. estimates restore beta 18. margins, dydx(w) at(d4 = 0 d5 = 0 d6 = 1 f02 = 0 f03 = 0 f04 = 0 f05 = 0 f06 = 1) /// > subpop(if d6 == 1) noestimcheck post 19. return scalar tau66 = _b[1.w] 20. . return scalar tauavg = (tau44 + tau45 + tau46 + tau55 + tau56 + tau66)/6 21. . end . . bootstrap r(tau44) r(tau45) r(tau46) r(tau55) r(tau56) r(tau66) r(tauavg), /// > reps(50) seed(123) cluster(id) idcluster(newid): aggregate_boot (running aggregate_boot on estimation sample) Bootstrap replications (50) ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 .................................................. 50 Bootstrap results Number of obs = 6,000 Replications = 50 Command: aggregate_boot _bs_1: r(tau44) _bs_2: r(tau45) _bs_3: r(tau46) _bs_4: r(tau55) _bs_5: r(tau56) _bs_6: r(tau66) _bs_7: r(tauavg) (Replications based on 1,000 clusters in id) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based | coefficient std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- _bs_1 | 1.017501 1.119012 0.91 0.363 -1.175723 3.210725 _bs_2 | 6.00713 1.978965 3.04 0.002 2.128431 9.885829 _bs_3 | 4.569667 1.379358 3.31 0.001 1.866174 7.273159 _bs_4 | 7.170127 2.957668 2.42 0.015 1.373203 12.96705 _bs_5 | 7.185492 2.297489 3.13 0.002 2.682496 11.68849 _bs_6 | 13.73294 12.83413 1.07 0.285 -11.42151 38.88738 _bs_7 | 6.613809 . . . . . ------------------------------------------------------------------------------
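A hedged guess at the cause (not a confirmed fix): inside the program, tau44 through tau66 are only ever stored with -return scalar-, so the bare names in the tauavg line refer to scalars that do not exist and the expression evaluates to missing. Holding each effect in a local first and returning it afterwards should give tauavg a real value, roughly along these lines:
Code:
* hypothetical sketch of the relevant lines inside aggregate_boot
margins, dydx(w) at(d4 = 1 d5 = 0 d6 = 0 f02 = 0 f03 = 0 f04 = 1 f05 = 0 f06 = 0) ///
    subpop(if d4 == 1) noestimcheck post
local tau44 = _b[1.w]                 // keep the value in a local ...
return scalar tau44 = `tau44'         // ... and also return it
* ... repeat the same pattern for tau45, tau46, tau55, tau56, tau66 ...
return scalar tauavg = (`tau44' + `tau45' + `tau46' + `tau55' + `tau56' + `tau66')/6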