BJ Data Tech Solutions specializes in data processing, data management implementation plans, data collection tools (electronic and paper-based), data cleaning specifications, data extraction, transformation and loading, analytical datasets, and data analysis. We teach the design and development of electronic data collection tools using CSPro and Stata commands for data manipulation, and the setup of data management systems using modern data technologies such as relational databases, C#, PHP, and Android.
Sunday, February 28, 2021
Time to event graph
Hi,
I am really appreciative of all the help that is provided here. Thank you very much. I am trying to build a bar graph with a superimposed line.
The x axis is the time in days from admission to the procedure.
The y axis is the number of procedures.
The dataset is as follows:
patient_id timetoprocedure
1 2
2 5
3 2
4 1
5 6
6 8
7 11
8 3
9 4
10 4
11 3
12 5
13 1
14 2
15 3
I will really appreciate the answer.
Kind regards, WQ
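A minimal sketch of one approach, assuming the posted variables patient_id and timetoprocedure are in memory: collapse the data to daily counts with contract, then overlay a line on the bars.
Code:
* count procedures per day-to-procedure value, then draw bars with a line on top
contract timetoprocedure, freq(n_procedures)
twoway (bar n_procedures timetoprocedure) ///
       (line n_procedures timetoprocedure, sort), ///
       xtitle("Days from admission to procedure") ///
       ytitle("Number of procedures") legend(off)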
append error (r198)
Dear Madam/Sir,
I am a beginner with Stata. I would highly appreciate it if you could advise me how to fix this error, which is potentially related to "outreg2".
regress ln_audit gafscore mascore big4 cspec ln_nonaudit icw restatement gc auchange merger financing yearend abaccrual ln_at mb leverage roa loss fsalepro SQ_SEGS ar_in special_item ln_tenure i.cyear i.sic2, robust cluster(gvkey) outreg2 using "C:\Users\hakjoon\Documents.out", append bdec(3) tstat excel bracket
invalid 'append'
r(198);
Thank you
Joon1
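One thing worth checking, offered as a guess since the original submission is not fully visible here: outreg2 is normally issued as its own command after the estimation command, not on the same line as regress. A minimal sketch (the covariate list is abbreviated in a comment; the output path is the one from the post):
Code:
* run the regression first (full covariate list as in the post)
regress ln_audit gafscore mascore /* ...remaining covariates... */ i.cyear i.sic2, robust cluster(gvkey)
* then call outreg2 as a separate command on the stored results
outreg2 using "C:\Users\hakjoon\Documents.out", append bdec(3) tstat excel bracket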
synth_runner package
Hello,
I am working on a comparative case-study using the synthetic control method (traditionally implemented by synth package).
However, I find the synth_runner package handier and easier to implement than synth.
But, when I use synth_runner, the reference line for the policy/treatment implementation is plotted one period (say year) prior to the actual assigned date.
For instance, see the following codes:
. ssc install synth, all
. cap ado uninstall synth_runner //in-case already installed
. net install synth_runner, from(https://raw.github.com/bquistorff/synth_runner/master/) replace
. sysuse synth_smoking, clear
. tsset state year
1. The following code uses synth (the traditional package) and places the reference line at the actual treatment period. This makes sense to me because the reference period ought to be the actual treatment period.
. synth cigsale beer lnincome(1980&1985) retprice cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) fig
[attached graph]
2. The following code uses synth_runner; the treatment period is 1989, but the reference line (the vertical red solid line in the second figure) is at 1988, not 1989.
. synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) gen_vars
. single_treatment_graphs, trlinediff(-1) raw_gname(cigsale1_raw) effects_gname(cigsale1_effects) effects_ylabels(-30(10)30) effects_ymax(35) effects_ymin(-35)
. effect_graphs , trlinediff(-1) effect_gname(cigsale1_effect) tc_gname(cigsale1_tc)
[attached graph]
Now, my question is, after using the synth_runner, is it appropriate to manually change the reference line to reflect the actual treatment date?
Would that alter the treatment effect?
Thank you.
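Not a definitive answer, but since the posted commands already shift the line with trlinediff(-1), one thing to try is trlinediff(0) so that the line is drawn at the assigned treatment period:
Code:
* sketch: redraw the graphs without shifting the treatment line back one period
single_treatment_graphs, trlinediff(0) raw_gname(cigsale1_raw) effects_gname(cigsale1_effects)
effect_graphs, trlinediff(0) effect_gname(cigsale1_effect) tc_gname(cigsale1_tc)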
Convert export values to USD
Hello,
I have a dataset with trade data from 1988-2017 that looks like the first code block below.
The issue is the column export_CAD: I want to convert it to USD. I have a separate Excel file with a column of annual average exchange rates for 1988-2017, similar to the table further below. I need to multiply each year's export value by that year's exchange rate. Is there a way to do this by bringing my Excel exchange-rate data into the trade dataset that already exists in Stata?
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input float year str43 hs06_name str28 country double export_CAD 2014 "Coconut, abaca, ramie/veg tex fib" "United States" 7613 2015 "Coconut, abaca, ramie/veg tex fib" "United States" 22107 2016 "Coconut, abaca, ramie/veg tex fib" "United States" 123544 2017 "Coconut, abaca, ramie/veg tex fib" "United States" 8401 2002 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 7500 2003 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 7500 2004 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 23250 2005 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 4000 2006 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 21085 2007 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 27562 2008 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 160485 2009 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 37155 2010 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 60613 2011 "Horses,asses,mules,hinnies,pure-bred" "United Kingdom" 66815 1988 "Horses, pure-bred breeding" "United Kingdom" 478979 end
year | avg. exchange rate |
1988 | 0.81 |
1989 | 0.84 |
1990 | 0.85 |
1991 | 0.87 |
1992 | 0.82 |
1993 | 0.77 |
1994 | 0.73 |
1995 | 0.72 |
Thanks
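A minimal sketch of one way to do the conversion, assuming the Excel file has columns named year and avg on its first sheet; the file and dataset names here are placeholders:
Code:
* import the exchange rates and save them as a temporary Stata file
import excel using "exchange_rates.xlsx", firstrow clear
rename avg exch_rate
tempfile rates
save `rates'

* return to the trade data and attach the matching year's rate
use tradedata, clear
merge m:1 year using `rates', keep(master match) nogenerate
gen export_USD = export_CAD * exch_rate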
description of tsappend1 data set
Hello, does anybody have a quick description of the tsappend1 data set? Even if it is fictitious, I guess it has descriptions of the variables.
thanks
How to analyze Panel Survey Data (Likert scale questions)
Hi all,
I would like to analyze two-period survey data with attrition, where both the dependent and independent variables are Likert-scale questions. However, I have struggled to find answers to all my questions.
- Should I dichotomize the Likert-scale variables, i.e. go from a strongly agree/agree/neither agree nor disagree/disagree/strongly disagree grading to a not agree/agree grading? It seems that keeping the variables as they are would better identify changes in attitude (e.g. from not agree to neither agree nor disagree) that would not be captured by dichotomizing. However, I've talked to a couple of economists who support dichotomizing variables (I forgot why!).
- How should I generate composite indicators? Based on what I read, there seem to be two schools of thought: the "naïve" approach, i.e. just averaging the scores; and using exploratory factor analysis to identify factors and use them instead of the variables. However, I've been exploring confirmatory factor analysis to test whether variables measuring different aspects of a concept share common variance (i.e. to test the validity of the theory).
- Given the above, I was originally looking at a Wald difference-in-differences approach, using indicators for both the dependent and independent variables. However, I'm not sure whether I should instead be using a Wald ordinary logit/probit difference-in-differences method. If so, I'm not sure whether such an approach exists and how it would be implemented.
Thank you for your help
Lexis-diagram
I have wondered why there isn't any good Lexis-plot command in Stata. Most of the existing ones are very old and do not use newer commands such as stsplit. evhistplot, grlexis2, and stlexis are quite old and lack many features. Clayton & Hills's stlexis is interesting, but it is for Stata 5! Does anybody know of a good ado-file for Lexis diagrams? And why isn't there one among the official st commands?
I was just wondering...
lincom calculation for a subset of data
Hello,
I'm trying to calculate the difference between certain Times on the basis of a mixed-effects linear regression model.
The model output looks like the following:
[attached output]
The consecutive margins output is then giving me this:
[attached output]
Well, since we clearly know that there aren't any data for Saentis == 0 and Time == 4, the linear combination (lincom 0#15 - 0#4) should result in an error.
But instead the output is unclear to me:
[attached output]
For your help, I am very grateful.
Best regards, Simon
Function to give year of start
Dear all,
I have a dataset with the US census for the 1970-1999 period combined with a dataset with information about different public policies concerning the labor market. I have a cross-section of several million individuals, with a variable indicating the state of birth. I also have a series of dummy variables indicating if a policy was active at each of the available years. For example, if in Nevada the policy called G started in 1980 and was active for the rest of the period, for individuals born in Nevada, the dummy variables g_1970birth-g_1979birth will be 0, while the variables g_1980birth-g_1999birth will be 1.
I want to create a new variable, call it year_g, which will give me the year policy G started in the state of birth of each individual (year_g=1980 for the previous example). For that purpose, I have written the line shown in the code block below.
The idea is that since it requires the previous year to be 0, this can only match the starting year. However, I'm obtaining year_g=1999 for all the observations, and I can't seem to figure out what is wrong with the code. I'd be very grateful if someone could help me figure it out.
Code:
forval w=1970(1)1999{
    replace year_g=`w' if g_`w'birth==1 & g_`w-1'birth==0
}
Thank you very much for your time.
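One possible culprit, offered as a guess: inside a macro reference, `w-1' is not evaluated as arithmetic, so g_`w-1'birth does not point to the previous year's dummy. A sketch of a workaround that computes the previous year in its own local first (1970 is handled separately because there is no 1969 dummy):
Code:
gen year_g = .
forval w = 1971/1999 {
    local v = `w' - 1
    replace year_g = `w' if g_`w'birth == 1 & g_`v'birth == 0
}
* policies already active in the first observed year
replace year_g = 1970 if g_1970birth == 1 & missing(year_g)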
Percentile based on age and year using loop
Hi all,
I have a dataset that includes total income, an age variable ranging from 25 to 54, year from 1982 to 2018, and an immigrant dummy variable. I want to generate two dummy variables for those who are in the top 1% and top 10% of the income distribution by year and age, and then plot the share of immigrants in the top 1% and top 10% over the years. So, basically, I generated the percentiles for each age group and each year as follows (code block below):
This code covers only two years; doing this for every year requires far too many lines and also creates too many variables. I am just wondering whether there is a neater syntax for creating the top 1% and top 10% dummy variables. Any guidance on this is much appreciated.
Code:
gen ptile_inc=.
forvalues a = 25/54 {
    xtile p`a' = totinc if age==`a' & year==1982 [aw=weight], nq(100)
    replace ptile_inc=p`a' if age==`a' & year==1982
}
gen ptile_inc1=.
forvalues a = 25/54 {
    xtile p1_`a' = totinc if age==`a' & year==1983 [aw=weight], nq(100)
    replace ptile_inc1=p1_`a' if age==`a' & year==1983
}
gen top1_1982=ptile_inc==100
gen top1_1983=ptile_inc1==100
gen top10_1982=ptile_inc>90
gen top10_1983=ptile_inc1>90
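A sketch of one way to cover all years with a single percentile variable, assuming totinc, age, year, and weight are as described; it still loops, but it creates no per-year variables (add -capture- around xtile if some age-year cells are empty):
Code:
gen ptile_inc = .
forvalues y = 1982/2018 {
    forvalues a = 25/54 {
        xtile _p = totinc if age==`a' & year==`y' [aw=weight], nq(100)
        replace ptile_inc = _p if age==`a' & year==`y'
        drop _p
    }
}
gen top1  = ptile_inc == 100 if !missing(ptile_inc)
gen top10 = ptile_inc  >  90 if !missing(ptile_inc)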
Calculating proportional effects with a GLM model
Dear Statalist Members
When comparing the output of margins, eydx() for semielasticities (proportional effects) from a GLM with a log link and from one with an identity link, I found that they are very close. The estimates could of course differ: with the identity link no transformation is done, while with the log link the linear index is exponentiated. The output below shows the estimates, by class origin, of the white group's advantage in children's income for Brazil.
“GLM with a log link models the logarithm of the expected value of Y, conditional on X,” as Partha Deb and Edward C. Norton explain. I make this comparison to assess whether the semielasticity estimates (margins, eydx) made from the results of a GLM with a log link would be distorted. It seems not. If margins with eydx had strictly calculated the logarithm from estimates already on a logarithmic scale, the compression of the differences would be very large (just take the logarithm of a value that is already a logarithm to see the degree of compression of its average). In the estimates presented no such compression occurred. It appears that margins identifies the situation and uses the predicted average income retransformed to the original metric, as shown by the “Expression” line in the output for both models: Predicted mean income, predict().
Note that a similar procedure cannot be done with OLS models with a logged dependent variable. “OLS regression with a log-transformed dependent variable models the expected value of the logarithm of Y conditional on X,” as Partha Deb and Edward C. Norton explain. Since the dependent variable is already in logs, there will be strong compression of the estimated value.
Code:
GLM model with family(gamma) link(log)
Average marginal effects Number of obs = 30414
Model VCE : OIM
Expression : Predicted mean income, predict()
ey/dx w.r.t. : 1.white
------------------------------------------------------------------------------
| Delta-method
| ey/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1. white |
class |
social top | .535371 .0523935 10.22 0.000 .4326816 .6380604
skilled | .5341033 .0568641 9.39 0.000 .4226518 .6455549
small assets| .4788623 .0274684 17.43 0.000 .4250253 .5326994
worker | .3075652 .0313738 9.80 0.000 .2460737 .3690567
destitute | .353975 .025128 14.09 0.000 .3047249 .403225
------------------------------------------------------------------------------
GLM model with family(gamma) link(identity)
Average marginal effects Number of obs = 30414
Model VCE : OIM
Expression : Predicted mean income, predict()
ey/dx w.r.t. : 1. white
------------------------------------------------------------------------------
| Delta-method
| ey/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1. white |
class |
social top | .567763 .0560234 10.13 0.000 .4579592 .6775669
skilled | .5688268 .0608277 9.35 0.000 .4496066 .688047
small assets| .4061208 .024772 16.39 0.000 .3575686 .4546731
worker | .3342984 .0340144 9.83 0.000 .2676314 .4009654
destitute | .3144392 .0246612 12.75 0.000 .2661042 .3627742
A comment is welcome,
José Alcides
Invalid IVs with Fixed Effects XTIVREG2
Hi,
I am currently writing my thesis, using panel data to assess the relationship between financial development and economic growth. I want to control for alleged endogeneity in the financial development indicators, so I have used xtivreg2, but the Hansen J statistic is significant, which suggests that the instruments are invalid. What would you suggest I do now?
This is the code I used:
xtivreg2 loggdppercap GrosscapitalformationofGD Schoolenrollmentsecondary (StockmarketcapitalizationtoG LiquidliabilitiestoGDPG = l.StockmarketcapitalizationtoG l2.StockmarketcapitalizationtoG l.LiquidliabilitiestoGDPG l2.LiquidliabilitiestoGDPG ) , fe cluster(c_id) endog( LiquidliabilitiestoGDPG StockmarketcapitalizationtoG)
Results:
------------------------
Number of groups = 89 Obs per group: min = 3
avg = 18.1
max = 28
IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity and clustering on c_id
Number of clusters (c_id) = 89 Number of obs = 1610
F( 4, 88) = 16.32
Prob > F = 0.0000
Total (centered) SS = 51.85112797 Centered R2 = 0.3885
Total (uncentered) SS = 51.85112797 Uncentered R2 = 0.3885
Residual SS = 31.70731729 Root MSE = .1444
----------------------------------------------------------------------------------------------
| Robust
loggdppercap | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-----------------------------+----------------------------------------------------------------
StockmarketcapitalizationtoG | .000945 .0005407 1.75 0.080 -.0001147 .0020047
LiquidliabilitiestoGDPG | .000902 .0005874 1.54 0.125 -.0002492 .0020533
GrosscapitalformationofGD | .0047421 .0028177 1.68 0.092 -.0007804 .0102646
Schoolenrollmentsecondary | .0092139 .0014798 6.23 0.000 .0063135 .0121143
----------------------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic): 18.271
Chi-sq(3) P-val = 0.0004
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic): 1543.425
(Kleibergen-Paap rk Wald F statistic): 2035.606
Stock-Yogo weak ID test critical values: 5% maximal IV relative bias 11.04
10% maximal IV relative bias 7.56
20% maximal IV relative bias 5.57
30% maximal IV relative bias 4.73
10% maximal IV size 16.87
15% maximal IV size 9.93
20% maximal IV size 7.54
25% maximal IV size 6.28
Source: Stock-Yogo (2005). Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments): 9.791
Chi-sq(2) P-val = 0.0075
-endog- option:
Endogeneity test of endogenous regressors: 12.028
Chi-sq(2) P-val = 0.0024
Regressors tested: LiquidliabilitiestoGDPG StockmarketcapitalizationtoG
------------------------------------------------------------------------------
Instrumented: StockmarketcapitalizationtoG LiquidliabilitiestoGDPG
Included instruments: GrosscapitalformationofGD Schoolenrollmentsecondary
Excluded instruments: L.StockmarketcapitalizationtoG
L2.StockmarketcapitalizationtoG L.LiquidliabilitiestoGDPG
L2.LiquidliabilitiestoGDPG
Any help is much appreciated!
Set https proxy
Hi Statalisters,
I have some problems importing data directly from the internet.
I am using Stata on my company laptop, and I need to set up a proxy to get through the company firewall. Therefore, I set the http proxy via netio (first code block below).
The proxy is correctly set. Nonetheless, I have problems importing data from https URLs.
For instance, the import from an http URL in the second code block works fine, whereas the https imports in the third block do not and return the error shown there.
I guess I also need to set a proxy for https websites.
Do you know how to set an https proxy, or a workaround to solve this issue? I looked for a solution in help netio but could not find anything.
My version of Stata is shown in the last code block.
Thanks in advance for your help!
Code:
set httpproxyhost my_proxy_host
set httpproxyport my_port
set httpproxyuser my_username
set httpproxypw my_password
set httpproxy on
set httpproxyauth on
Code:
import delimited http://www.stata.com/examples/auto.csv, clear
Code:
import delimited https://covid.ourworldindata.org/data/owid-covid-data.csv, clear
import delimited https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv, clear
PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
could not open url
r(603)
Code:
. about
Stata/MP 16.0 for Windows (64-bit x86-64)
Revision 16 Oct 2019
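Not a netio setting, but one possible workaround while the https proxy issue is unresolved: fetch the file with an external tool that honors the proxy (curl here, assuming it is installed and that my_proxy_host and my_port are the same placeholder values as above), then import the local copy.
Code:
shell curl -x http://my_proxy_host:my_port -o owid-covid-data.csv "https://covid.ourworldindata.org/data/owid-covid-data.csv"
import delimited owid-covid-data.csv, clear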
Sorting the order of bar graph values while using by
Hello!
I want to order the results of a bar graph in descending order. I have two facets/panels as I want to disaggregate results by gender, male and female. But I want to order both panels/bars in descending order based on the order of how women responded. I am using this code:
graph hbar Q101A , over( Q102 ) by( Q1002 )
to get the attached graph, but if I use the descending option the two panels are ordered differently. Any tips? Thanks!
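A sketch of one way to force both panels into the same order, under the assumption that a specific value of Q1002 (2 is used here purely as a placeholder) identifies women: compute the female mean per Q102 category and sort both panels on it.
Code:
* order both panels by the female group's mean response (the value 2 is assumed to mark women)
egen order_f = mean(cond(Q1002 == 2, Q101A, .)), by(Q102)
graph hbar Q101A, over(Q102, sort(order_f) descending) by(Q1002)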
Sequential Treatment Effects
Hi,
I have been struggling with a model related to sequential treatment effects and desperately need help.
I would greatly appreciate it if you could guide me to resources or advise me on this matter.
In some real-world situations, treatment affects the outcome, and the outcome affects treatment in the next period.
For example, consider the effect of constructing post boxes in a municipality on the volume of mail, where the increase in mail then leads to more post box construction.
It could be interpreted as "reverse causality", but, in the sense that the outcome does not affect treatment in the past, it follows a sequential structure, I guess (the arrows below indicate causal relationships).
T0 -> Y0 -> T1 -> Y1 -> T2 -> Y2 -> T3 -> Y3.....
To be specific:
There are several periods t = 1,2,3,...,n; T is a treatment/intervention variable (dummy) and Y is the outcome (continuous; it can be converted into a dummy if needed).
Code:
Y_it = a + B*T_it + u_it
T_it = c + b*T_it-1 + e_it-1
Can this model be appropriately estimated simply by using lag terms?
For causal inference, what would be the best way to estimate this kind of mechanism?
Please kindly give your advice on this issue. Thank you in advance.
Stata - statistical graphing using "coefplot" by presenting odd ratio with 95%CI
Dear colleagues, I wish this message find you well.
I am wondering how to use "coefplot" to present odds ratios with 95% CIs.
If "coefplot" does not work for this, is there any other solution?
I don't have raw data; I only want to plot the odds ratios with 95% CIs from the simple data below.
Looking forward to hearing from you.
Thank you very much.
Best regards, Jiancong
Name | Lower | Upper | Odds
G-LIS | 1.0 | 2.3 | 1.5
Swedish BAM | 2.0 | 7.0 | 3.7
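Since there are no stored estimation results for coefplot to read, one alternative is to enter the summary numbers directly and draw the plot with twoway rcap and scatter; a minimal sketch using the two rows from the table above:
Code:
clear
input str20 name double(lower upper odds)
"G-LIS"       1.0 2.3 1.5
"Swedish BAM" 2.0 7.0 3.7
end
gen row = _n
twoway (rcap lower upper row, horizontal) ///
       (scatter row odds, msymbol(D)), ///
       ylabel(1 "G-LIS" 2 "Swedish BAM", angle(0)) ///
       xline(1, lpattern(dash)) xtitle("Odds ratio (95% CI)") ///
       ytitle("") legend(off)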
Durbin Watson d-statistic
Hi, I am running a regression with multiple lagged values of the dependent variable (first code block below). Can one use the Durbin-Watson d-statistic (second block) to check whether serial correlation has been removed from my initial model (third block)?
Code:
reg CSAD RtnMrktPort AbsRtnMrktPort SquRtnMrktPort CSAD_L1 CSAD_L2 CSAD_L3 CSAD_L4 CSAD_L5 CSAD_L6 CSAD_L7 CSAD_L8
Code:
estat dwatson
Code:
reg CSAD RtnMrktPort AbsRtnMrktPort SquRtnMrktPort
I am concerned that the Durbin-Watson d-statistic can only be used when there is one lag of the dependent variable, from what I have understood online, but I am not sure; could someone clarify this for me, please?
It is also confusing that when running the Durbin-Watson test I get a value closer to 2 (about 2.005) with only 2 lags of the dependent variable, whereas with 8 lags the statistic is around 1.95.
Is this because the Durbin-Watson d-statistic cannot be used for regressions with more than one lag of the dependent variable on the right-hand side?
Thank you.
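Not a direct answer on the d-statistic itself, but two tests that remain valid when lagged dependent variables sit on the right-hand side are Durbin's alternative test and the Breusch-Godfrey test; a sketch, assuming the data have been tsset on a time variable:
Code:
reg CSAD RtnMrktPort AbsRtnMrktPort SquRtnMrktPort CSAD_L1 CSAD_L2 CSAD_L3 CSAD_L4 ///
    CSAD_L5 CSAD_L6 CSAD_L7 CSAD_L8
estat durbinalt, lags(1/8)
estat bgodfrey, lags(1/8)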
Non-linear fixed effects regression
Hi all!
I am currently working on my University dissertation and would greatly appreciate any help, as I am completely new to Stata and my University doesn't have any resources on it. I am trying to run a regression with several studies, looking at age of participants (x axis) and the concentration of a specific biomarker (y axis). I think the appropriate analysis to conduct in order to achieve a plot would be a non-linear fixed effects regression.
Would anyone please be able to advise me on whether this would be the correct analysis, and if so, whether this is achievable on Stata 15. I have been scouring the internet for help and have struggled to find anything, so any help (either on this forum or private messaged) would be greatly appreciated!
Many thanks
Predict Residuals in SQREG - DO File
Hi there. I am using the following code to predict the residuals after sqreg and would like some guidance on whether this is the right way to do it; I didn't find any example that uses predict with sqreg, so I need help.
Once I predict the residuals, I need to run sqreg again on the residuals. I used the commands in the second block below and get separate output for each; is there any way to get the output in one table?
Code:
set seed 896321
sqreg y rmrf hml smb mom, quantiles(10 25 50 75 95 99) rep(100)
predict y1, equation(q10) residuals
predict y2, equation(q25) residuals
predict y3, equation(q50) residuals
predict y4, equation(q75) residuals
predict y5, equation(q95) residuals
predict y6, equation(q99) residuals
Code:
sqreg y1 rm rm2, quantiles(10) rep(100)
sqreg y2 rm rm2, quantiles(25) rep(100)
sqreg y3 rm rm2, quantiles(50) rep(100)
sqreg y4 rm rm2, quantiles(75) rep(100)
sqreg y5 rm rm2, quantiles(95) rep(100)
sqreg y6 rm rm2, quantiles(99) rep(100)
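On combining the six outputs into one table, a sketch using eststo/esttab; this assumes the community-contributed estout package is installed (ssc install estout) and the output file name is a placeholder:
Code:
eststo clear
local i = 1
foreach q in 10 25 50 75 95 99 {
    eststo q`q': sqreg y`i' rm rm2, quantiles(`q') rep(100)
    local ++i
}
esttab using "sqreg_residuals.csv", se replace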
Merge 1:1 problem
Hi,
I am trying to merge two Excel datasets that have unemployment and education data at the US county level (by FIPS code). Both datasets have exactly the same number of rows (one per county), and importing them into Stata works fine. However, when I try to merge them with "merge 1:1 FIPS using educationdata" it keeps telling me "variable FIPS does not uniquely identify observations in the master data". Does anyone know why this may be happening? I also tried m:1 and 1:m (although I think 1:1 is correct given I am just adding columns), but I get the same problem.
Thank you very much!
Joan
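The message means that at least one FIPS value appears more than once in the dataset in memory; a sketch of how to locate the offending codes before merging:
Code:
duplicates report FIPS
duplicates tag FIPS, gen(dupflag)
sort FIPS
list FIPS if dupflag > 0, sepby(FIPS)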
Neumark or Oaxaca-Ransom
Quick question regarding decomposition methods and the use of omega with the Oaxaca command.
https://core.ac.uk/download/pdf/6442665.pdf page 16 indicates the use of omega with the Oaxaca command gives you the Oaxaca and Ransom's decomposition.
https://www.stata.com/statalist/arch.../msg00585.html indicates the use of omega with the Oaxaca command gives you the Neumark decomposition.
Which one is correct? Sorry to be a pain, just a tad confused!
Loop and Append .dta files
Hello All,
I am trying to import .dta files and append them; what is wrong with this code? For some reason it only uses the first file, from 1995, and does not continue with the subsequent years. Thanks in advance.
clear all
cd "mydir"
save file_all_green, replace emptyok
forvalues i= 1995(1)2018 {
use country_partner_hsproduct6digit_year_`i',clear
keep year export_value location_code partner_code hs_product_code
append using "file_all_green"
save file_all_green,replace
}
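An alternative pattern that sidesteps the empty starting file, sketched under the assumption that every yearly file contains the listed variables: append each file directly into memory and save once at the end.
Code:
clear all
cd "mydir"
forvalues i = 1995/2018 {
    * append each yearly file, keeping only the needed variables
    append using country_partner_hsproduct6digit_year_`i', ///
        keep(year export_value location_code partner_code hs_product_code)
}
save file_all_green, replace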
graph with many lines
Dear All, suppose that I have the data shown in the first code block below, and that the code in the second block produces the attached graph.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input long id float(year y) 1 2006 11.695 1 2007 16.758 1 2008 21.342 1 2009 23.993 1 2010 26.932 1 2011 28.151 1 2012 32.635 1 2013 34.133 1 2014 38.974 1 2015 34.534 1 2016 37.355 2 2006 18.837 2 2007 22.279 2 2008 27.338 2 2009 30.184 2 2010 33.188 2 2011 32.255 2 2012 35.877 2 2013 37.679 2 2014 45.389 2 2015 43.858 2 2016 54.987 3 2006 102.782 3 2007 84.356 3 2008 94.765 3 2009 539.106 3 2010 1228.086 3 2011 1290.933 3 2012 2019.757 3 2013 2263.635 3 2014 3393.346 3 2015 3437.425 3 2016 2845.547 4 2006 4.616 4 2007 4.786 4 2008 5.955 4 2009 7.835 4 2010 7.253 4 2011 10.058 4 2012 11.42 4 2013 13.421 4 2014 16.156 4 2015 17.725 4 2016 20.976 5 2006 3.866 5 2007 4.401 5 2008 4.724 5 2009 6.238 5 2010 6.32 5 2011 6.717 5 2012 8.057 5 2013 8.715 5 2014 10.17 5 2015 10.911 5 2016 12.611 6 2006 9.166 6 2007 11.723 6 2008 18.203 6 2009 19.796 6 2010 18.498 6 2011 14.654 6 2012 15.82 6 2013 15.202 6 2014 20.151 6 2015 19.021 6 2016 21.341 7 2006 8.055 7 2007 10.428 7 2008 12.545 7 2009 13.285 7 2010 19.067 7 2011 31.286 7 2012 34.669 7 2013 35.08 7 2014 36.112 7 2015 32.004 7 2016 32.868 8 2006 1.165 8 2007 1.242 8 2008 1.802 8 2009 2.518 8 2010 1.763 8 2011 1.604 8 2012 1.972 8 2013 2.332 8 2014 2.649 8 2015 2.975 8 2016 3.617 9 2006 721.351 9 2007 737.638 9 2008 674.506 9 2009 662.644 9 2010 701.561 9 2011 902.733 9 2012 1037.067 9 2013 1019.113 9 2014 1195.34 9 2015 1187.552 9 2016 1379.233 10 2006 77.254 10 2007 102.233 10 2008 124.216 10 2009 64.794 10 2010 99.241 10 2011 133.793 10 2012 158.205 10 2013 241.44 10 2014 390.768 10 2015 508.226 10 2016 693.224 11 2006 59.295 11 2007 82.017 11 2008 95.538 11 2009 107.414 11 2010 209.93 11 2011 230.368 11 2012 262.935 11 2013 307.755 11 2014 339.11 11 2015 321.27 11 2016 339.71 12 2006 1.574 12 2007 2.089 12 2008 2.502 12 2009 2.875 12 2010 3.528 12 2011 3.928 12 2012 4.923 12 2013 5.177 12 2014 6.048 12 2015 6.223 12 2016 8.335 13 2006 5.164 13 2007 7.853 13 2008 9.425 13 2009 11.877 13 2010 15.165 13 2011 19.709 13 2012 23.011 13 2013 20.953 13 2014 25.282 13 2015 30.153 13 2016 48.198 14 2006 98.383 14 2007 115.322 14 2008 121.286 14 2009 121.988 14 2010 157.046 14 2011 214.113 14 2012 243.034 14 2013 273.245 14 2014 278.919 14 2015 290.18 14 2016 292.24 15 2006 5.456 15 2007 7.543 15 2008 9.091 15 2009 10.93 15 2010 14.164 15 2011 23.195 15 2012 27.525 15 2013 37.062 15 2014 34.344 15 2015 52.64 15 2016 110.116 16 2006 86.239 16 2007 92.724 16 2008 97.854 16 2009 103.557 16 2010 111.093 16 2011 101.673 16 2012 106.843 16 2013 111.016 16 2014 131.714 16 2015 143.14 16 2016 166.044 17 2006 .614 17 2007 .545 17 2008 .629 17 2009 .512 17 2010 .589 17 2011 .689 17 2012 .67 17 2013 .601 17 2014 .698 17 2015 .726 17 2016 .731 end label values id id label def id 1 "AUS", modify label def id 2 "CAN", modify label def id 3 "CHN", modify label def id 4 "DEU", modify label def id 5 "FRA", modify label def id 6 "GBR", modify label def id 7 "IDN", modify label def id 8 "ITA", modify label def id 9 "JPN", modify label def id 10 "KOR", modify label def id 11 "MYS", modify label def id 12 "NLD", modify label def id 13 "PHL", modify label def id 14 "SGP", modify label def id 15 "THA", modify label def id 16 "USA", modify label def id 17 "ZAF", modify
Code:
xtline y, overlay ytitle("y") ylabel(0(1000)4000) legend(position(11) cols(1) ring(0))
I wonder if anyone can suggest a way to improve the quality of graph as above. Thanks.
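Given that y spans roughly 0.5 to over 3,000, one modest improvement to try is a log scale, so the smaller series are not flattened against the axis; a sketch based on the posted command:
Code:
xtline y, overlay yscale(log) ylabel(1 10 100 1000) ///
    ytitle("y (log scale)") legend(position(11) cols(1) ring(0))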
Saturday, February 27, 2021
How to Drop Duplicate ID Observations if There are Multiple Conditions I Want to Apply
Hello Everyone,
I hope that you could help me with the below.
I have a cross-sectional dataset of 343 observations of students' scores in a test. Students are from different schools and grades. However, some students have solved the test multiple times and thus resulting in duplicates.
I have multiple conditions that I would like to tell Stata in order to drop specific duplicates:
1. I would like to drop the duplicate with a missing "Score".
2. If the duplicate does not have any missing scores, I would like to drop the duplicate with the earliest recorded date "StartDate".
A snippet of my data is shown in the first code block below, followed by the syntax I have tried so far (second block). However, I could not come up with code that achieves the above conditions.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double(StartDate id) float(schoolname2 gender2 grade2) byte(Score tag) float dup
1928621099000 505711 0 0 0 . 1 1
1928621827000 505711 0 0 0 9 1 2
1928624421000.0002 505713 0 0 0 19 1 1
1928624452000 505713 0 0 0 15 1 2
1928623906000 505715 0 0 0 20 0 0
1928621142000 505716 0 0 0 14 0 0
1928621051000.0002 505718 0 0 0 18 0 0
1928623971000 505724 0 0 0 13 0 0
1928614160000 505726 0 0 0 15 1 1
1928627513000.0002 505726 0 0 0 16 1 2
end
format %tcnn/dd/ccYY_hh:MM StartDate
Code:
duplicates report id schoolname2
duplicates list id schoolname2, sepby(id)
duplicates tag id schoolname2, gen(tag)
duplicates list id schoolname2 if tag >= 1, sepby(id)
sort schoolname2 id
quietly by schoolname2 id: gen dup = cond(_N==1,0,_n) if schoolname!="" | id!=.
sort schoolname2 id StartDate
Thank you. Looking forward.
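A sketch of one way to apply the two conditions, under two assumptions: at most one record per id has a missing Score, and the latest StartDate is the one to keep when no Score is missing.
Code:
duplicates tag id, gen(dupN)
* 1. among duplicate ids, drop the record with a missing Score
drop if dupN > 0 & missing(Score)
* 2. among whatever duplicates remain, keep the most recent StartDate
bysort id (StartDate): keep if _n == _N
drop dupN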
Synthetic Control - minimum number of obs?
Can I build a synthetic control with 6 pre-treatment observations and 4 post-treatment observations? Is it valid, or does it raise questions about suitability? Do you know of any paper with so few pre-/post-treatment observations?
The intersection of two Chinese variables
Dear All, I found this question here (in Chinese). The data set is shown below.
Given "var1" and "var2", the desired result is "newvar". Basically, "newvar" consists of the characters appearing in both "var1" and "var2". Thanks.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str20(var1 var2 newvar)
"中国人民大学" "中国农业大学" "中国大学"
"北京大学" "清华大学" "大学"
"北京科技大学" "北京大学" "北京大学"
"北京师范大学" "上海师范大学" "师范大学"
end
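One possible reading of the example is the ordered set of characters of var1 that also occur in var2; a sketch using Stata's Unicode string functions under that assumption (the result is put in newvar2 to avoid clashing with the target shown above):
Code:
gen str20 newvar2 = ""
gen len1 = ustrlen(var1)
summarize len1, meanonly
local maxlen = r(max)
forvalues i = 1/`maxlen' {
    * append the i-th character of var1 whenever it also appears in var2
    replace newvar2 = newvar2 + usubstr(var1, `i', 1) ///
        if `i' <= len1 & ustrpos(var2, usubstr(var1, `i', 1)) > 0
}
drop len1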
Building cross-lagged panel models with categorical variables on Stata
Can someone here suggest a suitable way to build a cross-lagged panel model with categorical variables in Stata MP 15.1? I built a cross-lagged panel model with categorical variables in Stata. The Xs are binary variables, while the Ys are continuous variables. They are repeated measurements across 5 waves. Equality constraints (i.e., a and b) were added as they have reciprocal associations. The causal directions of these variables are shown in the attached diagram.
In the first place, I used generalised response variables (i.e., family/link: Bernoulli / logit) to construct the variables of x, but I got errors when Stata displayed the GSEM results. Then, I changed the generalised response variables to observed variables (i.e., squares), and it ran smoothly. However, the effects of y on categorical x can’t be explained by the coefficients. That is, squares are for continuous variables, but Xs are binary variable. Alternatively, I should not use squares to construct Xs. What techniques should I apply to deal with this issue?
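A minimal two-wave sketch of the usual gsem setup for this kind of model, with placeholder variable names (x1 x2 y1 y2); the binary outcome gets a logit family while the continuous one stays Gaussian. It does not diagnose the earlier errors, and equality constraints across waves would still need to be added.
Code:
gsem (x2 <- x1 y1, logit) (y2 <- y1 x1), vce(robust)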
Seasonality in annual data? Problem or not?
Hello, I have panel data at an annual frequency. When I use the xtline command to get the line plots, the data show obvious fluctuations, and when I do further analyses the results do not come out as desired. I wonder whether I should care about seasonality in annual data. Thank you!
Ranking the most important predictor variables from a series of regression models.
Hi,
I came across an interesting study that did the following:
'number of births in the last 5 years and the number of household members were the “most important” features for predicting whether a mother reported the death of a neonate. Out of the 20 models (10 countries and 2 DHS surveys per country) trained to predict neonatal mortality, the number of births in the last 5 years ranked first in all models except Burkina Faso 2003, Tanzania 2015, and Zambia 2007, where the feature ranked second.'
I am wondering whether this is possible to do using Stata.
Please let me know if you are aware of any process similar to this.
Thank you.
reshape or transpose?
Dear All, I was asked this question here (https://bbs.pinggu.org/forum.php?mod...a=#pid74912031). The data set is
The desired output is that, for each "idstd", all the information is restructured in the same row (with elements in "type" as the variable names), i.e.,
Any suggestions are highly appreciated.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long idstd float(lat lon) str13 longname
518601  -1.95972 30.12857 "56"
518601  -1.95972 30.12857 "RN3"
518601  -1.95972 30.12857 "Nyarugunga"
518602 -1.936392 30.09101 "23"
518602 -1.936392 30.09101 "KG 594 Street"
518602 -1.936392 30.09101 "Kacyiru"
end
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long idstd float(lat lon) str13(street_number route political)
518601  -1.95972 30.12857 "56" "RN3" "Nyarugunga"
518602 -1.936392 30.09101 "23" "KG 594 Street" "Kacyiru"
end
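A sketch of one way to get the desired layout, assuming each idstd has exactly three rows whose original order is street number, route, political area:
Code:
* preserve the original within-idstd row order, then reshape wide
gen long obsno = _n
bysort idstd (obsno): gen seq = _n
drop obsno
reshape wide longname, i(idstd lat lon) j(seq)
rename (longname1 longname2 longname3) (street_number route political)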
using forvalues loop and postfile command
Dear statalist
I have a sample of only 3 observations. I am having problems calculating the outer product of the gradient, the gradient itself, and the updating formula using a forvalues loop and the postfile command; I get the error message "post sim not found". Another question: could I use a variable or macro to hold opg, gr, and theta instead of scalars? Any help would be greatly appreciated. My do-file is as follows:
clear all
set obs 100
input y
3.5
1
1.5
end
sca theta1 =1
gen f = -1/theta1 + y/theta1^2
tempname sim
tempfile result
postfile `sim' opg gr theta using result, replace
forvalues i=1/100 {
qui {
sca opg`i' = 1/3*(-1/theta`i' + y[1]/theta`i'^2 )^2 + 1/3*(-1/theta`i' + y[2]/theta`i'^2)^2+ 1/3*(-1/theta`i' + y[3]/theta`i'^2)^2 /* calculating OPG for every iteration*/
sca gr`i' = 1/3*(-1/theta`i' + y[1]/theta`i'^2 ) + 1/3*(-1/theta`i' + y[2]/theta`i'^2)+ 1/3*(-1/theta`i' + y[3]/theta`i'^2) /* calculating the gradient at every iteration*/
loc j=`i'+1
sca theta`j' = theta`i' + (opg`i')^-1*gr`i' /* updating formula*/
post sim (opg`i') (gr`i') (theta`j')
}
}
postclose `sim'
use result, clear
i get error message!!!
post sim not found
r(111);
end of do-file
r(111);
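The immediate error looks like a macro issue: post needs the handle stored by tempname, i.e. `sim', not the literal word sim (similarly, the tempfile result is declared but the file is then written and read under the plain name result). A one-line sketch of the corrected post call:
Code:
post `sim' (opg`i') (gr`i') (theta`j')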
Differences between dates using panel data
Hi there,
I'm working with some panel data, whereby for each subject I have multiple events (lab tests) recorded in long form. Each event is dated and I need to calculate whether two specific tests (test_id 12 and 147) are conducted within 3 months of each other (to assess whether testing/diagnostic guidelines are being followed). Subjects may have multiple instances of each test.
At this moment in time, I'm considering looping through all instances of these tests to ascertain whether the other test occurs within 3 months. Given some subjects have 100s of test records and the dataset contains 100,000s of subjects, I'm wondering whether there is a more efficient method to derive my requirements.
(I found plenty of existing threads on the topic of dates and panel data, but nothing akin to the above)
Thanks in advance,
Rob.
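One loop-free sketch, assuming the variables are named subject_id, test_id, and test_date (a daily %td date) and that "within 3 months" can be approximated by 90 days: pair every test 12 with every test 147 per subject via joinby and flag close pairs. Note that joinby forms all within-subject pairs, so memory use grows with the number of tests per subject.
Code:
* keep the dates of test 147 in a temporary file
preserve
keep if test_id == 147
keep subject_id test_date
rename test_date date147
tempfile t147
save `t147'
restore

* pair each test 12 with every test 147 of the same subject and flag pairs within 90 days
keep if test_id == 12
joinby subject_id using `t147'
gen byte within3m = abs(test_date - date147) <= 90
bysort subject_id: egen any_within3m = max(within3m)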
Graph with conditioning
Hi there, I would like to create a graph with two lines, one for each condition. This is what I tried:
graph twoway (line mean_BARTAdj weekday_ordered if origin_condition==0, line mean_BARTAdj weekday_ordered if origin_condition==1)
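A sketch of the corrected syntax: each plot goes in its own set of parentheses rather than being separated by a comma inside one (the legend labels are placeholders).
Code:
graph twoway (line mean_BARTAdj weekday_ordered if origin_condition==0) ///
             (line mean_BARTAdj weekday_ordered if origin_condition==1), ///
             legend(order(1 "Condition 0" 2 "Condition 1"))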
Continuous-Continuous Interaction when one of the variables has many zeros
Hello,
I am interested in measuring how previous purchases from a given platform and time spent on the platform affect the probability of purchasing a given item. I am using the following model:
However, given that many customers are first-time customers, my interaction term is often zero. In other words, although a customer might spend a lot of time on the platform, if they are a first-time buyer the interaction term always takes the value zero.
Code:
probit purchase c.lntimespent##c.lnpreviouspurchases i.categoryFE i.timeFe, cluster(customer)
What should I do in this case? Center the variables? Any good references about what to do?
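Independent of centering, one way to see what the interaction implies here is to evaluate the marginal effect of time spent at several values of previous purchases, including zero for first-time buyers; a sketch based on the posted model (the evaluation points are arbitrary):
Code:
probit purchase c.lntimespent##c.lnpreviouspurchases i.categoryFE i.timeFe, cluster(customer)
margins, dydx(lntimespent) at(lnpreviouspurchases = (0 1 2 3))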
Using returned results in r() with -egen- is not working. Is that a bug?
Why is this not working?
Code:
. sysuse auto, clear
(1978 Automobile Data)

. summ price

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906

. egen meanprice = mean(price/r(max)), by(rep)
(74 missing values generated)

. summ meanprice

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   meanprice |          0           .
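A likely explanation, offered with a hedge: egen runs other commands internally, so r(max) is no longer defined by the time the expression is evaluated. Saving the value in a local (or scalar) first avoids the problem; a sketch:
Code:
sysuse auto, clear
summarize price
local pmax = r(max)
egen meanprice = mean(price/`pmax'), by(rep78)
summarize meanprice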
Spearman correlation table export into Excel
Hi everyone,
I have a problem trying to export my Spearman correlation results into Excel.
I use:
matrix A = r(Rho)
esttab matrix(A, fmt(%5.2f)) using "Results.csv", replace
Using this command I get the correlation table, but without the stars (0.01); I also need (0.1) printed with 4 digits after the decimal point, but in Excel everything is exported with 2 digits.
Is it possible to export only the needed results, and with the stars?
I can find a solution only for "correlation" and not for "spearman".
Thank you a lot in advance!
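On the decimals, a minimal sketch: widening the display format keeps four digits after the decimal point in the exported file (adding significance stars to an exported r(Rho) matrix needs extra work and is not shown here). The variable names are placeholders.
Code:
spearman var1 var2 var3
matrix A = r(Rho)
esttab matrix(A, fmt(%9.4f)) using "Results.csv", replace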
Interpretation Interactionterm in fixed effect regression
Dear all
,
I have performed a fixed effect regression with cluster(id)
xtreg y i.FFF i.young FFF#young Controls, fe cluster(id)
I use an interaction term with two dummy variables:
FFF = 1 for a family firm, else FFF = 0
young = 1 if firm age < 20 years, else young = 0
My dependent variable is a measure of innovation.
Var           | Coef.     | Std. Err. | t     | P>|t|
1.FFF         | .0010349  | .0007549  | 1.37  | 0.171
young         | .0000637  | .0007209  | 0.09  | 0.930
1.young       | 0 (omitted)
1.FFF#1.young | -.0001431 | .001038   | -0.14 | 0.890
margins i.iFFF_PU##i.young
Predictive margins
Model VCE : Robust
              |  Margin   | Std. Err.
FFF           |
            0 | .0045633  | .0001112
            1 | .0055404  | .0003584
young         |
            0 | .0047803  | .0002522
            1 | .0048104  | .0003708
iFFF_PU#young |
          0 0 | .0045375  | .0003248
          0 1 | .0046012  | .0004298
          1 0 | .0055725  | .0006025
          1 1 | .005493   | .000499
I am not sure how to interpret the interaction term.
Is my assumption correct that the family firm's influence on the mean of the innovation measure decreases by .0001431 when FFF=1 and young=1?
I also don't understand why the interaction term in the regression is negative while the margins are all positive.
Maybe someone has a hint or advice for me.
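A hedged sketch of how such an interaction is usually read after this kind of model: the cell predictions and the effect of FFF at each value of young (here "Controls" stands in for the control variables, which are not listed in the post).
Code:
xtreg y i.FFF##i.young Controls, fe cluster(id)
margins FFF#young                    // predicted means for the four cells
margins, dydx(FFF) at(young=(0 1))   // effect of being a family firm, by age group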
xtprobit margin problem
Hello to All Stata Users!
is the model I run . However when I run the margin commands:
the stata gives me following output:
. I know the importance of running probit with exp.variables respected according to whether fact/cont. However y is dependent variable here !
Thank you for your advice
HTML Code:
xtoprobit y x z, vce(cluster id)
HTML Code:
margins, predict(pu0) dydx(*)
HTML Code:
"y" not found in list of covariates
Thank you for your advice
merging datasets with different dimensions
Hi all,
I would like to merge together 16 datasets which are defined by country and are made as follows:
AUS.dta is:
var1
Mol1_a a1
Mol2_a a2
Mol3_a a3
Mol4_a a4
Mol5_a a5
...
Molk_a ak
GER.dta is:
var1
Mol1_g g1
Mol2_g g2
Mol3_g g3
Mol4_g g4
Mol5_g g5
...
Molk_g gk
where Moli_g is different from Moli_a. Both are string variables. And so on for the other 14 country .dta files.
Now what I would like to do (possibly in a loop) is to merge the datasets together so that if a value (molecule) is missing in a country, it is kept in the country/ies where var1 is not missing and set to missing in the others.
For instance, say Mol2_a and Mol5_a are present in AUS.dta but not in GER.dta, and at the same time Mol1_g in GER.dta is not present in AUS.dta, while other values are in common (for instance Mol3_a coincides with Mol6_g, Mol4_a with Mol5_g, Mol6_a with g23, and so on). The resulting dataset should look like this:
GER AUS
Moll1_g g1 .
Moll1_a g2 .
Moll2_g g2 .
Moll2_a . a2
Moll3_g g3 a7
Moll3_a g6 a3
Moll4_g g2 .
Moll4_a g5 a4
Moll5_g g5 a8
Moll5_a . g5
Moll6_g g6 a3
Moll6_a a2 g23
Then I'll drop the duplicates...
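A minimal sketch of one way to set this up, under the assumption (hypothetical, since the listings above do not name it) that each country file contains a string molecule identifier called molecule plus the value variable var1; molecules absent from a country end up missing in that country's column.
Code:
* start from one country, rename its value column after the country,
* then merge the remaining files one by one on the molecule identifier
use "AUS.dta", clear
rename var1 AUS
foreach c in GER {                          // add the other 14 country codes here
    merge 1:1 molecule using "`c'.dta", nogenerate
    rename var1 `c'                         // the using file's var1 becomes this country's column
}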
Impact of an average change on dependent variable over time
Hello! For my thesis I wanted to test a hypothesis which requires a model that I wouldn't know how to design myself. For a panel data (12 years) sample of 500 companies, I have a dummy variable that equals 1 if a company discloses a certain information variable, and another dummy variable that equals 1 if a company is loss-making (not profitable). I should test whether companies stop disclosing the information when they become profitable.
example: company A is loss-making from 2007-2013 and discloses the information, and is profitable from 2014 and onwards and stops disclosing the information
--> I want to test if this, on average, holds for every company
My Stata knowledge is kind of limited, I only got a minor course about it last year, and I really have no idea how I would be able to test this
I hope there is someone here who could help me!
Using reghdfe command with if-statements
Hello, bit of a complex one here:
I’m currently working as a research assistant, using my supervisor’s code, which uses employee-level data for a firm which “de-trashes” stock coming into its warehouse i.e., removes transit packaging.
The code is designed to estimate productivity, measured in units [de-trashed] per minute (upm). It uses the reghdfe command, a linear regression that absorbs multiple layers of fixed effects. It also uses an independent variable called PLANNED_UPH which is a target that, if reached, workers get paid a bonus.
The fixed effects used in the regression equation are:
- fe3_j (SKU code i.e., product fixed effects)
- fe3_i (worker fixed effects)
- fe3_t (date fixed effects)
- fe3_dow (day of week fixed effects)
- fe3_shift (shift type fixed effects i.e., day, early or late shift)
- fe3_h (hour of the day fixed effects)
- fe3_handle (handling class fixed effects)
- fe3_station (warehouse workstation fixed effects)
- fe3_group (group of workers fixed effects)
reghdfe uph PLANNED_UPH, ///
absorb(fe3_j=SKU_ID fe3_i=user_code fe3_t=date_code fe3_dow=dow fe3_shift=shift_type fe3_h=HourDay1 ///
fe3_handle=HANDLING_CLASS fe3_station=STATION_ID fe3_group=GROUP_ID)
quietly estadd local controls "Yes"
quietly estadd local FE_t "Yes"
quietly estadd local FE_i "Yes"
quietly estadd local FE_j "Yes"
est store H3
The output (H3) is as follows:
HDFE Linear regression                         Number of obs   =  2,480,900
Absorbing 9 HDFE groups                        F(1, 2454358)   =       1.66
                                               Prob > F        =     0.1971
                                               R-squared       =     0.5447
                                               Adj R-squared   =     0.5398
                                               Within R-sq.    =     0.0000
                                               Root MSE        =     0.2292

         uph |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 PLANNED_UPH |  -2.25e-06   1.75e-06    -1.29   0.197    -5.68e-06    1.17e-06
       _cons |   .4962852    .002311   214.75   0.000     .4917558    .5008146

Absorbed degrees of freedom:

    Absorbed FE |  Categories  Redundant  Num. Coefs
----------------+-----------------------------------
         SKU_ID |       25692          0       25692
      user_code |         567          1         566
      date_code |         232          1         231
            dow |           7          7           0
     shift_type |           3          1           2
       HourDay1 |           9          1           8
 HANDLING_CLASS |           2          2           0
     STATION_ID |          38          1          37
       GROUP_ID |           7          2           5
What I have been asked to do is, first, to split the data in half by date (I did this by creating binary dummies called split1 and split2 for the first and second halves of the year, respectively). I then have to run the same regression again for just the first half, and then copy the values of the fixed-effects coefficients into the data subset from the second half.
To run the regression on the first half of the data, I thought of adding if-qualifiers so that the regression would only use observations with split1==1. Then, for each user ID (worker), I could somehow copy the coefficients from split1 to split2 and run the code only for split2. However, wherever I place the if-qualifier in the code, it returns errors. I'm grateful for any ideas, thanks.
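A minimal sketch of where the if-qualifier goes, assuming split1 and split2 are the 0/1 half-year indicators described above; the absorb(newvar=groupvar) syntax stores the estimated fixed effects, but only for the estimation sample, so one (hypothetical) way to carry a worker's first-half effect into the second half is to copy it within user_code.
Code:
* run the regression on the first half only: the if-qualifier sits between
* the variable list and the comma that opens the options
reghdfe uph PLANNED_UPH if split1==1, ///
    absorb(fe3_j=SKU_ID fe3_i=user_code fe3_t=date_code fe3_dow=dow ///
           fe3_shift=shift_type fe3_h=HourDay1 fe3_handle=HANDLING_CLASS ///
           fe3_station=STATION_ID fe3_group=GROUP_ID)
est store H3_half1

* fe3_i is filled in only for split1 observations; copy each worker's
* first-half effect to their second-half rows (missing values sort last)
bysort user_code (fe3_i): replace fe3_i = fe3_i[1] if split2==1 & missing(fe3_i)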
AR(1) is insignificant in difference GMM
Hello !
I am trying to estimate the following model in both difference and system GMM. However, the results change substantially, both in significance and in the coefficients. Moreover, in difference GMM AR(1)=0.8, whereas in system GMM AR(2)=0.979 but AR(3)=0.031. What does this suggest, and how can I fix this problem?
I have 33 groups and 5 time periods.
Difference GMM
xtabond2 diff_gdp log_initial log_Mcap log_Liab log_trade log_school log_govsize log_infl td*, gmm( log_initi
> al , collapse sp) gmm( L.( log_Mcap log_Liab log_trade log_govsize log_school log_infl ), collapse) iv( td*
> ) robust two small ar(3) nolevel
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.
Warning: split has no effect in Difference GMM.
td5 dropped due to collinearity
Warning: Two-step estimated covariance matrix of moments is singular.
Using a generalized inverse to calculate optimal weighting matrix for two-step estimation.
Difference-in-Sargan/Hansen statistics may be negative.
Dynamic panel-data estimation, two-step difference GMM
------------------------------------------------------------------------------
Group variable: Country_ Number of obs = 126
Time variable : period Number of groups = 33
Number of instruments = 26 Obs per group: min = 2
F(0, 33) = . avg = 3.82
Prob > F = . max = 4
------------------------------------------------------------------------------
| Corrected
diff_gdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
log_initial | -.0766945 .059806 -1.28 0.209 -.1983708 .0449818
log_Mcap | .0410073 .0179618 2.28 0.029 .0044639 .0775508
log_Liab | .0198893 .0579202 0.34 0.733 -.0979503 .1377289
log_trade | -.1288009 .1143785 -1.13 0.268 -.3615057 .1039039
log_school | -.0846866 .0904975 -0.94 0.356 -.2688052 .099432
log_govsize | .0193566 .0762848 0.25 0.801 -.135846 .1745592
log_infl | .0061522 .0101976 0.60 0.550 -.0145949 .0268993
td1 | -.0457857 .0314332 -1.46 0.155 -.109737 .0181657
td2 | -.0283545 .0247333 -1.15 0.260 -.0786748 .0219658
td3 | -.0247058 .0176698 -1.40 0.171 -.0606554 .0112437
td4 | -.0070187 .0076711 -0.91 0.367 -.0226256 .0085883
------------------------------------------------------------------------------
Instruments for first differences equation
Standard
D.(td1 td2 td3 td4 td5)
GMM-type (missing=0, separate instruments for each period unless collapsed)
L(1/4).(L.log_Mcap L.log_Liab L.log_trade L.log_govsize L.log_school
L.log_infl) collapsed
L(1/4).log_initial collapsed
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z = -0.25 Pr > z = 0.804
Arellano-Bond test for AR(2) in first differences: z = 0.56 Pr > z = 0.576
Arellano-Bond test for AR(3) in first differences: z = -1.07 Pr > z = 0.285
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(15) = 12.90 Prob > chi2 = 0.610
(Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(15) = 14.73 Prob > chi2 = 0.471
(Robust, but weakened by many instruments.)
Difference-in-Hansen tests of exogeneity of instrument subsets:
gmm(log_initial, collapse lag(1 .))
Hansen test excluding group: chi2(11) = 11.10 Prob > chi2 = 0.435
Difference (null H = exogenous): chi2(4) = 3.62 Prob > chi2 = 0.459
iv(td1 td2 td3 td4 td5)
Hansen test excluding group: chi2(11) = 8.51 Prob > chi2 = 0.667
Difference (null H = exogenous): chi2(4) = 6.22 Prob > chi2 = 0.183
System GMM
xtabond2 diff_gdp log_initial log_Mcap log_Liab log_trade log_school log_govsize log_infl td*, gmm( log_initi
> al , collapse sp) gmm( L.( log_Mcap log_Liab log_trade log_govsize log_school log_infl ), collapse) iv( td*
> ) robust two small ar(3)
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.
td2 dropped due to collinearity
Warning: Number of instruments may be large relative to number of observations.
Warning: Two-step estimated covariance matrix of moments is singular.
Using a generalized inverse to calculate optimal weighting matrix for two-step estimation.
Difference-in-Sargan/Hansen statistics may be negative.
Dynamic panel-data estimation, two-step system GMM
------------------------------------------------------------------------------
Group variable: Country_ Number of obs = 160
Time variable : period Number of groups = 33
Number of instruments = 34 Obs per group: min = 3
F(11, 32) = 34.48 avg = 4.85
Prob > F = 0.000 max = 5
------------------------------------------------------------------------------
| Corrected
diff_gdp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
log_initial | -.0108161 .0085561 -1.26 0.215 -.0282444 .0066121
log_Mcap | .0029089 .0072577 0.40 0.691 -.0118746 .0176924
log_Liab | -.0101865 .0105349 -0.97 0.341 -.0316455 .0112725
log_trade | .0107354 .0097941 1.10 0.281 -.0092146 .0306853
log_school | .0400703 .0233032 1.72 0.095 -.0073967 .0875373
log_govsize | .0007092 .0124605 0.06 0.955 -.0246721 .0260904
log_infl | -.0005491 .0050937 -0.11 0.915 -.0109245 .0098264
td1 | -.0057598 .0053446 -1.08 0.289 -.0166465 .0051269
td3 | -.0054331 .0042171 -1.29 0.207 -.014023 .0031569
td4 | -.006852 .0052233 -1.31 0.199 -.0174916 .0037875
td5 | -.0181138 .0077344 -2.34 0.026 -.0338683 -.0023593
_cons | .025506 .1402732 0.18 0.857 -.2602211 .3112332
------------------------------------------------------------------------------
Instruments for first differences equation
Standard
D.(td1 td2 td3 td4 td5)
GMM-type (missing=0, separate instruments for each period unless collapsed)
L(1/4).(L.log_Mcap L.log_Liab L.log_trade L.log_govsize L.log_school
L.log_infl) collapsed
L(1/4).log_initial collapsed
Instruments for levels equation
Standard
td1 td2 td3 td4 td5
_cons
GMM-type (missing=0, separate instruments for each period unless collapsed)
D.(L.log_Mcap L.log_Liab L.log_trade L.log_govsize L.log_school
L.log_infl) collapsed
D.log_initial collapsed
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z = -1.87 Pr > z = 0.062
Arellano-Bond test for AR(2) in first differences: z = 0.03 Pr > z = 0.979
Arellano-Bond test for AR(3) in first differences: z = -2.16 Pr > z = 0.031
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(22) = 43.85 Prob > chi2 = 0.004
(Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(22) = 25.91 Prob > chi2 = 0.255
(Robust, but weakened by many instruments.)
Difference-in-Hansen tests of exogeneity of instrument subsets:
GMM instruments for levels
Hansen test excluding group: chi2(15) = 15.40 Prob > chi2 = 0.423
Difference (null H = exogenous): chi2(7) = 10.52 Prob > chi2 = 0.161
gmm(log_initial, collapse eq(diff) lag(1 4))
Hansen test excluding group: chi2(18) = 23.43 Prob > chi2 = 0.175
Difference (null H = exogenous): chi2(4) = 2.49 Prob > chi2 = 0.647
gmm(log_initial, collapse eq(diff) lag(1 4)) eq(level) lag(0 0))
Hansen test excluding group: chi2(21) = 25.91 Prob > chi2 = 0.210
Difference (null H = exogenous): chi2(1) = -0.00 Prob > chi2 = 1.000
iv(td1 td2 td3 td4 td5)
Hansen test excluding group: chi2(18) = 24.99 Prob > chi2 = 0.125
Difference (null H = exogenous): chi2(4) = 0.92 Prob > chi2 = 0.921
doflist command in reghdfe
Hi there,
I am currently having issues with collinearity of fixed effects in my reghdfe regression. I read online that the doflist (degrees-of-freedom adjustments) could help, yet I do not know how to implement it in my regression.
The do-file for the regression looks like this:
eststo FullSample: reghdfe CashtoAt TXUNCER fiveyearRTC fiveyearCASHETR fcons nol lossfirm NWC lev EBITDA MTBR size divpo capEx aquist atCF resDev, absorb(industry* year*) vce(cluster gvkey fyear)
Where and how could I implement the doflist now to control for collinearity?
I would be very grateful for any reply. Thanks in advance.
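A hedged sketch, assuming the installed reghdfe supports the dof() option described in help reghdfe (the "doflist" is the list of adjustments passed to that option); check the help file for the suboptions your version accepts.
Code:
eststo FullSample: reghdfe CashtoAt TXUNCER fiveyearRTC fiveyearCASHETR fcons ///
    nol lossfirm NWC lev EBITDA MTBR size divpo capEx aquist atCF resDev, ///
    absorb(industry* year*) vce(cluster gvkey fyear) ///
    dof(pairwise clusters continuous)   // degrees-of-freedom adjustments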
Using starting values to overcome convergence problem - always getting r(503)
Dear all,
I am trying to fix a convergence problem in my model using starting values, as suggested many times here on Statalist. However, I always get the following error:
a 0 x 0 is not a vector
an error occurred when mi estimate executed melogit on m=1
r(503)
A simplified example of my models and of how I use the starting values:
Code:
mi estimate, dots cmdok: melogit volba X Y Z [pweight=VAHA] || okres_cd:, covariance(unstructured) noestimate
mat a = e(b)
mi estimate, dots cmdok: melogit volba X Y Z [pweight=VAHA] || okres_cd: Z, covariance(unstructured) from(a)
My version of Stata is 16.1 MP.
Could you tell me where the mistake is and suggest a change to the code? Thanks in advance!
Pooled OLS Firm effect vs Industry Effect
Hello
I have explored this forum for the question of firm vs industry fixed effects, but the answers always concern panel regression with xtreg.
In my case I have an unbalanced panel, and whenever I use panel regression the results are insignificant and cannot be used to support my inferences. So, following Wooldridge (2010) and given the unbalanced nature of my panel, I use pooled OLS (POLS).
Since I am working on firms' share price returns, I want to control for firms so as to avoid any firm effects. I have 18 industries and more than 4,000 firms per year for 13 years.
Code:
regress y x1 x2 x3 x4 x5 x6 i.industry i.year, robust
With this code I am able to control for year and industry, but it does not seem practical to control for firms with i.firms because of the number of observations.
My questions:
1. Is it sufficient to control for industry and exclude the firm controls?
2. My friend suggested using clustering:
Code:
regress y x1 x2 x3 x4 x5 x6 i.industry i.year, robust cluster(firms)
This code only adjusts the standard errors; it does not fix the firm effect (as far as I know).
Can you please give me any suggestion to fix the firm effect in POLS?
Thanks!
Qazi
Here is a data example generated with dataex:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input long Firms int Data_Year long Sector1 float(y x1 x2 x3 x4 x5 x6) 2 2006 1 . . . . . . . 2 2007 1 . . . . . . . 2 2008 1 . . . . . . . 2 2009 1 . . . . . . . 2 2010 1 . . . . . . . 2 2011 1 . . . . . . . 2 2012 1 . . . . . . . 2 2013 1 . . . . . . . 2 2014 1 . . . . . . . 2 2015 1 . . . . . . . 2 2016 1 . . . . . . . 2 2017 1 . . . . . . . 2 2018 1 . . . . . . . 3 2006 15 . .0588601 1.94666 .22520296 2.539871 0 1 3 2007 15 . .033332642 2.93485 .192904 2.547168 0 1 3 2008 15 . .01940063 1.84911 .14878628 2.5697694 0 1 3 2009 15 . .05142462 .953542 .246432 2.456603 0 1 3 2010 15 . .20495068 1.22109 .178483 2.408386 0 1 3 2011 15 . .009054668 1.4233 .11290167 2.4134254 0 1 3 2012 15 . .010843783 1.39291 .05148486 2.418654 0 1 3 2013 15 -.45762715 .00931621 2.33558 0 2.398067 0 1 3 2014 15 . .010274763 2.00774 0 2.427436 0 1 3 2015 15 . .013881045 3.23829 .23651055 2.69642 0 0 3 2016 15 -.8986493 .04266848 2.41774 .19479223 2.701517 0 0 3 2017 15 -.8433397 .08782447 2.25026 .18349776 2.7423086 0 0 3 2018 15 . .013235296 2.57593 .16161986 2.756552 0 0 5 2006 8 . . .978182 .00024064996 1.5905855 0 0 5 2007 8 . . .647546 0 1.549604 . 0 5 2008 8 . . . . . . . 5 2009 8 . . . . . . . 5 2010 8 . . . . . . . 5 2011 8 . . . . . . . 5 2012 8 . . . . . . . 5 2013 8 . . . . . . . 5 2014 8 . . . . . . . 5 2015 8 -1.2913605 . . 1.524727 -1.555955 . 0 5 2016 8 . . .235211 .013280518 1.1000602 . 0 5 2017 8 . . . . . . . 5 2018 8 . . . . . . . 13 2006 4 . .56787515 . .54185754 1.618226 0 0 13 2007 4 . . . . . . . 13 2008 4 . . . . . . . 13 2009 4 . . . . . . . 13 2010 4 . . . . . . . 13 2011 4 . . . . . . . 13 2012 4 . . . . . . . 13 2013 4 . . . . . . . 13 2014 4 . . . . . . . 13 2015 4 . . . . . . . 13 2016 4 . . . . . . . 13 2017 4 . . . . . . . 13 2018 4 . . . . . . . 14 2006 16 . .3742465 1.62565 .15619987 3.3326585 .845098 1 14 2007 16 . .04114099 1.02707 .14252478 3.24923 .69897 1 14 2008 16 . .009839674 1.51208 .08373041 3.25896 .60206 1 14 2009 16 . . . . . .60206 . 14 2010 16 . . . . . . . 14 2011 16 . . . . . . . 14 2012 16 . . . . . . . 14 2013 16 . . . . . . . 14 2014 16 . . . . . . . 14 2015 16 . . . . . . . 14 2016 16 . . . . . . . 14 2017 16 . . . . . . . 14 2018 16 . . . . . . . 15 2006 7 . .02684195 4.37205 .14560093 2.2206154 .60206 0 15 2007 7 . .011010598 3.26782 .05175494 2.2237165 .4771213 0 15 2008 7 . .0012241866 1.73697 .05533915 2.1846972 .4771213 0 15 2009 7 . .03988571 2.45002 .05487922 2.1772566 .4771213 0 15 2010 7 . .05301388 5.52119 .03857759 2.3197305 .4771213 0 15 2011 7 -.6537684 .010773852 2.86267 .2996194 2.665557 .69897 0 15 2012 7 -1.2263236 .013061752 6.60859 .12986204 2.830872 .845098 0 15 2013 7 -.9993629 .006437473 10.2847 .017026918 3.040543 1.20412 0 15 2014 7 . .0007668933 2.83528 .005819082 3.1847794 1.3802112 0 15 2015 7 . .006556697 1.48682 .009178674 2.950345 1.30103 0 15 2016 7 . .06107769 2.46043 .008934786 2.928986 1.2552725 0 15 2017 7 . .04057406 1.60284 .007892824 2.952678 1.230449 0 15 2018 7 . .01205367 2.03453 .03801258 2.9168916 1.0791812 0 16 2006 7 . . . . . . . 16 2007 7 . . . . . . . 16 2008 7 . . . . . . . 16 2009 7 . . . . . . . 16 2010 7 . . . . . . . 16 2011 7 . . . . . . . 16 2012 7 . . . . . . . 16 2013 7 . . . . . . . 16 2014 7 . . . . . . . 16 2015 7 . . . . . . . 16 2016 7 . . . . . . . 16 2017 7 . . . . . . . 16 2018 7 . . . . . . . 17 2006 14 . .0036297094 5.74644 .05222128 4.328257 1.1139433 1 17 2007 14 . 
.0083065005 5.09032 .1655463 4.3925915 1.1760913 1 17 2008 14 . .00050353137 4.03952 .20253557 4.411502 1.1760913 1 17 2009 14 . .02329066 4.60241 .19097248 4.4353666 1.146128 1 17 2010 14 . .008952872 3.92285 .14182915 4.479374 1.1760913 1 17 2011 14 . .0042309016 3.68352 .14432566 4.499907 1.2787536 1 17 2012 14 -1.780515 .016203607 3.62995 .14721337 4.529892 1.230449 1 17 2013 14 . .02400288 5.31523 .13067064 4.5256925 1.2552725 1 17 2014 14 -2.93236 .01348641 7.96134 .21756545 4.49428 1.230449 1 end label values Firms Firms label def Firms 2 "1-800-Attorney, Inc.", modify label def Firms 3 "1-800-FLOWERS.COM, Inc. Class A", modify label def Firms 5 "1PM Industries, Inc.", modify label def Firms 13 "360 Global Wine Co Com New", modify label def Firms 14 "3Com Corp", modify label def Firms 15 "3D Systems Corporation", modify label def Firms 16 "3Dfx Interactive", modify label def Firms 17 "3M Company", modify label values Sector1 Sector1 label def Sector1 1 "Commercial Services", modify label def Sector1 4 "Consumer Non-Durables", modify label def Sector1 7 "Electronic Technology", modify label def Sector1 8 "Energy Minerals", modify label def Sector1 14 "Producer Manufacturing", modify label def Sector1 15 "Retail Trade", modify label def Sector1 16 "Technology Services", modify
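A hedged sketch of one common workaround (not necessarily the right model for this application): absorb the firm indicators instead of typing i.Firms, so the thousands of firm dummies are swept out rather than estimated. Note that industry dummies are collinear with firm effects and drop out once firms are absorbed.
Code:
areg y x1 x2 x3 x4 x5 x6 i.year, absorb(Firms) vce(cluster Firms)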
Quadratic term and concave relationship threshold
Hello everyone,
I am exploring the hypothesis that, above a certain threshold, financial depth has a negative effect on economic growth. For this purpose, I include the quadratic term of my variable of interest, bank credit to the private sector. First I estimate a simple OLS with robust standard errors, which shows that there is indeed a quadratic relationship and that beyond a certain threshold financial depth has a negative impact on growth.
reg gr linitial prcreditBI prcreditBI2 log_trade log_govsize log_school log_infl , robust
Linear regression Number of obs = 64
F(7, 56) = 8.36
Prob > F = 0.0000
R-squared = 0.5037
Root MSE = 1.1029
------------------------------------------------------------------------------
| Robust
gr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
linitial | -.9851453 .335963 -2.93 0.005 -1.65816 -.3121306
prcreditBI | 6.026801 2.051998 2.94 0.005 1.916155 10.13745
prcreditBI2 | -3.052359 .9509696 -3.21 0.002 -4.95738 -1.147338
log_trade | -.001041 .2636371 -0.00 0.997 -.5291696 .5270876
log_govsize | -1.297607 .5570733 -2.33 0.023 -2.413559 -.1816553
log_school | 2.171247 .8182169 2.65 0.010 .532161 3.810332
log_infl | .0640021 .1290749 0.50 0.622 -.1945661 .3225703
_cons | 7.712365 1.98893 3.88 0.000 3.72806 11.69667
As you can see, the coefficient of prcreditBI2 is negative. Nevertheless, I do not know how to find the exact threshold at which bank credit to the private sector starts having a negative effect on growth (e.g., credit to the private sector starts having a negative effect when it reaches 80% of GDP). Can somebody help me with this?
Thank you!
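For a quadratic specification like this, the turning point can be computed directly from the two coefficients; a minimal sketch, run right after the regression above (reading the result as a share of GDP is an assumption about how prcreditBI is measured).
Code:
* turning point of b1*x + b2*x^2 is x* = -b1/(2*b2)
nlcom turning_point: -_b[prcreditBI]/(2*_b[prcreditBI2])
* with the estimates shown above this is about 6.0268/(2*3.0524) = 0.99,
* i.e. roughly 99% of GDP if prcreditBI is credit as a share of GDP
display -_b[prcreditBI]/(2*_b[prcreditBI2])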
Missing values in snapspan command @StephenJenkins
Dear Stephen Jenkins,
I'm following your Survival Analysis guideline (https://www.iser.essex.ac.uk/resourc...sis-with-stata) and An Introduction to Survival Analysis Using Stata by Cleves et al. (2008).
I would like to ask for your help in transforming panel data into duration data. I have panel data with monthly observations of individuals' depression status. My research question is how individual characteristics affect the duration of the depression stage. The problem is that when I use the snapspan command I cannot control for interval truncation in the data, and missing observations are counted as if they were observed.
The current format of my data for individual A is:
Code:
     +------------------------+
     | month   event   gender |
     |------------------------|
  A. |     1       0   Female |
  A. |     6       0   Female |
  A. |     8       1   Female |
  A. |     9       1   Female |
  A. |    10       1   Female |
     +------------------------+
For individual A, month shows the month of the interview, event=1 if the individual was depressed in that month, and gender is one of the individual characteristics. As you can see, I only observe individual A in January, June, August, September and October, while she missed the other 7 interviews. When I use the snapspan command
Code:
snapspan individualidsys month event, gen(date0) replace
rename month date1
the transformed data become:
Code:
     +--------------------------------+
     | date0   date1   event   gender |
     |--------------------------------|
  A. |     .       1       0        . |
  A. |     1       6       0   Female |
  A. |     6       8       1   Female |
  A. |     8       9       1   Female |
  A. |     9      10       1   Female |
     +--------------------------------+
In the duration data it looks as if I observed individual A from month 1 (January) to month 6 (June) and from month 6 to month 8, which is not the case. My questions:
1) Would that be a problem, and if so, how should I fix the missing-observation problem when transforming the data?
2) I am planning to use the following stset command:
Code:
stset date1, id(id) time0(date0) failure(event==1)
Do I need to define the enter, exit or origin options in the case of multiple failures?
Best regards,
John
Bar graph with multiple means of variables
Hello!
I want to create a bar graph with the means of several variables on the x axis. Any idea how I would do something like that?
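A minimal sketch using the auto data shipped with Stata: by default, graph bar plots the mean of each listed variable as its own bar.
Code:
sysuse auto, clear
graph bar (mean) mpg trunk turn    // one bar per variable, height = mean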
Friday, February 26, 2021
dtalink vs reclink2 vs matchit
Hello, I've tried to tag this topic on an older thread, but haven't gotten any responses, so was trying to create a new conversation.
I have seen different descriptions comparing -matchit- and -reclink2-. There is also a Stata pdf presentation on -dtalink-. Some of the materials are pretty complex for these packages, but would any interested parties be able to give even a brief overview on when you would use one of these vs the other and which is better for what type of tasks? I will be using large data sets, (~500,000 observations), trying to perform fuzzy/imperfect matches on names, date of birth, event dates, identification numbers. Thanks!
Business calendar generates year 1967 dates
Hi Statalisters,
I have successfully loaded my business calendar using the code below. However, when I generate my "bdate" variable it produces dates from 1967. I have searched through the different threads here but have not found a solution to my problem. My original date variable is named "date". I am new to Stata, so I generated the business calendar based on a post from statalist.org.
// Business calender code
purpose "Converting daily financial data into business calendar dates"
dateformat dmy
range 01jan2010 24feb2022
centerdate 01jan2010
omit dayofweek (Sa Su)
omit date 17feb2020
// Load business calender
bcal load daily
// Generate new date variable based on business calender
gen bdate = bofd("daily",date)
format %td bdate
//Time series
tsset ID bdate
When I run this code it generates the following output
panel variable: ID (strongly balanced)
time variable: bdate, 16mar1967 to 19apr1967
delta: 1 day
Can someone help me out on this issue?
Best regards,
Jeppe S
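A hedged sketch of the likely fix: bofd() returns a business-calendar date, which should be displayed with the calendar's own %tb format; formatting it as %td makes the small business-date integers display as (wrong) 1967 calendar dates.
Code:
gen bdate = bofd("daily", date)
format bdate %tbdaily          // use the calendar's own format, not %td
tsset ID bdate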
Verification of logic behind IV approach
I hope this is not too elementary a post for this forum. However, since I am not sure, I thought I would give it a shot.
I would like to know whether the following reasoning regarding the instrumental variable approach is acceptable. I understand there are case-by-case factors that affect the applicability of instruments that I am discussing below. But I just want to know if the general logic is correct or if I am missing something.
I am studying the effect of a state-level policy on a state-level outcome, y1. The simplest strategy would be to run a regression with the policy variable (p1) as a covariate along with a set of controls (x1-x5):
regress y1 p1 x1 x2 x3 x4 x5
But it is possible that there is reverse causality. Let's say that one of the factors driving the reverse causality is that there are special interests that would benefit from the policy, and states with higher values for the outcome variable have stronger special interests that lobby policymakers to implement it. If campaign contributions (camcon) only have an indirect causal influence on the dependent variable through the policy, then it would satisfy the exclusion restriction for instruments.
ivregress 2sls y1 x1 x2 x3 x4 x5 (p1 = camcon)
However, if there are multiple policies that benefit the special interests, it is likely that the special interests lobby for the other policies as well. If these policies were exogenous, then we would just need to include these other policies in the second-stage regression.
ivregress 2sls y1 p2 x1 x2 x3 x4 x5 (p1 = camcon)
But if the special interests are lobbying for the second policy, the policy would not be exogenous; there is reverse causality, as with the first policy. Thus, we need to treat the second policy as an endogenous variable as well. Therefore, we should try to find all the policies that could influence the outcome variable and treat them as endogenous variables. To do so, we would need as many instruments as there are endogenous variables. Assuming that the policies are the only endogenous variables, the instruments have no direct causal impact on the dependent variable, and the instruments are not correlated with the error in the second-stage regression, we should be able to adequately control for endogeneity in the estimation of the second-stage regression. If there are three endogenous policies and thus three instruments (i.e., camcon z2 z3), we would run the following:
ivregress 2sls y1 x1 x2 x3 x4 x5 (p1 p2 p3 = camcon z2 z3)
Note: Crossposted here:
https://stats.stackexchange.com/ques...iable-approach
https://www.reddit.com/r/econometric...e_iv_approach/
Exporting frequency table using esttab
Hi, I am trying to export a frequency table using the esttab command. My variable takes values with decimal points. When I use the following commands, I get the attached output. Can anybody please help me with the code to get the correct observations in the frequency table? Thanks a lot:
bcuse wage1
estpost tab lwage
esttab using "educ_frequency_esttab.csv", cells ("b(label(freq)fmt(0)) pct (fmt(2))") nomtitle nonumber replace
Encountering error r(198) while running a loop in stata
HI,
I am encountering an error while running a loop in Stata. The following is my code:
HTML Code:
forvalues i = 1/`MaxLPLags' {
    foreach var in state_gdp state_exp TE_shock {
        gen bad`var'`i' = L`i'.`var'*`state'
        gen good`var'`i' = L`i'.`var'*(1-`state')
    }
}
I have declared state and MaxLPLags as locals. When I run this loop, I come across
HTML Code:
invalid syntax
r(198);
I am not sure why I am getting this error. Some help is appreciated.
Thank you.
Regards
Indrani
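A hedged sketch of the usual cause of r(198) here: if `MaxLPLags' and `state' were defined in a different run (for example, by highlighting and running only part of the do-file), they are empty inside the loop, so 1/`MaxLPLags' expands to 1/ and Stata reports invalid syntax. Defining them in the same run as the loop avoids this; the values below are hypothetical.
Code:
local MaxLPLags 4            // hypothetical number of lags
local state recession        // hypothetical 0/1 state variable
forvalues i = 1/`MaxLPLags' {
    foreach var in state_gdp state_exp TE_shock {
        gen bad`var'`i'  = L`i'.`var' * `state'
        gen good`var'`i' = L`i'.`var' * (1 - `state')
    }
}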
Kaplan Meier, Multi-survival analysis on one graph
Hello,
I am able to graph my trade spell duration survival over a 30-year period using
Code:
sts graph if export_CAD <= 100000
to show the survival time of exports worth less than $100,000. If possible, how would I show a second line on the same graph for exports over $100,000? And is it possible to see the values for each year in a table format? Below is an example of what I was able to produce (graph attached).
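A hedged sketch of one way to get both groups on one graph and the estimates in a table, assuming a (hypothetical) indicator for exports above $100,000.
Code:
gen big_export = export_CAD > 100000 if !missing(export_CAD)
sts graph, by(big_export)          // two survivor curves on the same graph
sts list, by(big_export)           // survivor-function estimates as a table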
Which method does xtlogit/clogit use to estimate a fixed effects model (i.e. mean difference/first difference/LSDV)?
In Asymmetric Fixed-effects Models for Panel Data (available here open access: https://journals.sagepub.com/doi/10....78023119826441) Paul D. Allison indicates the following:
For the two-period case, there are several equivalent ways to estimate the fixed-effects model, all producing identical estimates. Here are the three most common methods:
1. Least squares dummy variables (LSDV).
2. Mean deviation.
3. First difference.
I have also seen mean deviation referred to as mean difference. My question is: which method does xtlogit use and which does clogit use, and is it the same method regardless of the number of time periods?
I have read the manuals (xtlogit: https://www.stata.com/manuals/xtxtlogit.pdf and clogit: https://www.stata.com/manuals/rclogit.pdf) but find myself lost in some of the more technical terms...
Any help would be appreciated!
John
Making tables from data
Hi,
I've district-level migration data of India.
Say-
State no | district no | migration type 1 (percentage share) |
1 | 1 | 39.7 |
1 | 2 | 21.5 |
2 | 3 | 15.3 |
2 | 4 | 17.2 |
I want to make tables like-
migration type 1 (%) | districts that fall under this |
1-10 | 0 |
10-20 | 3,4 |
20-30 | 2 |
30-40 | 1 |
40-50 | 0 |
How can I do that?
Also, later I want to see which state is affected most.
In the above tables, you can see that in state 2, nearly 10-20 per cent of type 1 migration is seen.
If I understand this I can also do the same for migration type 2,3,4 & 5.
Please advise,
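A minimal sketch of one way to build such a table, assuming the percentage-share variable is called mig1 and the district identifier is district (both hypothetical names for the columns shown above).
Code:
* bin the shares in 10-point intervals (bin 0 = 0-10, bin 1 = 10-20, ...)
egen mig1_bin = cut(mig1), at(0(10)100) icodes
tabulate mig1_bin
* districts (and their states) falling in the 10-20 per cent band
list state district if mig1_bin == 1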
Missing values in Stata
I am creating my own dataset for my thesis and have run into a problem with missing values. I am building the dataset in Excel and coded missing values as ".", but when I import it into Stata, "." is not recognized as a missing value. The count variables are not read as numeric count variables, and I am not able to run any statistical tests on them. How do I deal with this?
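A hedged sketch of the usual fix, with hypothetical file and variable names: when the "." cells force Excel columns to be read in as strings, destring converts them back to numeric, treating "." as missing.
Code:
import excel "thesis_data.xlsx", firstrow clear   // hypothetical file name
destring countvar, replace                        // hypothetical count variable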
Interpretation of AMEs: CIs overlapping each other but not zero, can I conclude that there is an interaction?
Hi all,
I'm fitting a simple OLS-model including an interaction term. To interpret the interaction I compute AMEs:
Here is the output:
------------------------------------------------------------------------------
My question is a basic and general one - I'm wondering whether I can conclude that "the effect of CV depends on function group"? What makes me doubt such a conclusion is the fact that, while the effect of CV is only significant in two of the function groups, all three CIs overlap each other?
Best,
Caroline
I'm fitting a simple OLS-model including an interaction term. To interpret the interaction I compute AMEs:
Code:
reg ITL OS LFR GD IW CWL LS noempl i.sex i.poor_srh cheftot c.CV##i.function margins, dydx(CV) at (function=(1 2 3)) marginsplot
Code:
Average marginal effects Number of obs = 220 Model VCE : OLS Expression : Linear prediction, predict() dy/dx w.r.t. : CV 1._at : function = 1 2._at : function = 2 3._at : function = 3 ------------------------------------------------------------------------------ | Delta-method | dy/dx Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- CV | _at | 1 | .3672521 .1679337 2.19 0.030 .0361438 .6983604 2 | .0134818 .1664265 0.08 0.936 -.3146549 .3416185 3 | .4481599 .2125448 2.11 0.036 .0290936 .8672261
My question is a basic and general one - I'm wondering whether I can conclude that "the effect of CV depends on function group". What makes me doubt such a conclusion is the fact that, while the effect of CV is significant in only two of the function groups, all three CIs overlap one another.
Best,
Caroline
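Overlapping confidence intervals for the individual AMEs do not by themselves show whether the AMEs differ from one another; the differences can be tested directly. A sketch reusing the names from the post, not a definitive answer:
Code:
* Joint test of the interaction terms from the model above
testparm c.CV#i.function

* Pairwise differences between the AMEs of CV across function groups
margins, dydx(CV) at(function=(1 2 3)) pwcompare(effects)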
Strange macro behaviour
Hi guys,
I am trying, with the following code, to drop some of my observations: specifically, the ones whose quarters lie before the first quarter of the latest-starting molecule or after the last quarter of the earliest-ending molecule (so if mol1 first appears in 2012q3, mol2 ends in 2015q3, and the other molecules range from 2009q3 to 2020q3, I would like to keep only the observations from 2012q3 to 2015q3):
Code:
quietly drop if trimestre < `min' | trimestre > `max'
Now I would like to reproduce it on several datasets, but it turns out that, while the code works as expected on a single dataset, when I use it in the following loop it does not (i.e. it raises no error but makes no change to the data, as if a local is not being read or something like that):
Code:
use "/Users/federiconutarelli/Dropbox/PhD/Elasticities/2008_2020_db/dati_per_paese/2008_2020_prd.dta", clear drop if Country =="ITALY" levelsof Countr, local(levels) foreach l of local levels { use "/Users/federiconutarelli/Dropbox/PhD/Elasticities/2008_2020_db/dati_per_paese/data_ctry/`l'_new.dta", clear quietly levelsof Mole, local(molecules) foreach k of local molecules { local rex = strtoname("`k'") use "/Users/federiconutarelli/Desktop/here/`l'/`rex'.dta", clear gen price = sales/stdunits gen ln_price=ln(price) gen ln_stdunits = ln(stdunits) quietly sum panelsize local min_panelsize = r(min) sum trimestre if panelsize == `min_panelsize' local min = r(min) local max = r(max) quietly drop if trimestre < `min' | trimestre > `max' quietly drop panels* bysort id_prd id_mole: gen panelsize = _N save "/Users/federiconutarelli/Desktop/here/`l'/`rex'.dta",replace } }
Is there something wrong I am doing here?
Thank you,
Federico
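One way to see why the drop appears to do nothing is to echo the locals just before the drop inside the inner loop; a minimal debugging sketch, reusing the names from the code above:
Code:
* Place just before the -quietly drop- line inside the inner loop
display "`l' / `rex':  min_panelsize = `min_panelsize'  min = `min'  max = `max'"
count if trimestre < `min' | trimestre > `max'   // rows the condition would remove
* set trace on   // uncomment to watch each macro expand line by line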
F-keys in profile.do
Dear all,
I am generating my profile.do file (shown below). Apparently, when I open Stata, I cannot use the shortcuts in the do-file. Is there any way I can do that, or does it work only from the Command window?
Code:
/*===========================================================================
project:        profile
Author:         Dario Maimone Ansaldo Patti
---------------------------------------------------------------------------
Creation Date:  February 19, 2021
===========================================================================*/

/*==============================================================================
                                Program set up
==============================================================================*/

set more off, permanently
set r on, permanently
clear
clear matrix
clear mata
set matsize 10000

/*==============================================================================
                            Start push notifications
==============================================================================*/

statapushpref, token(o.NSrMrdnu0LsjkEsr3rhtmGwtCDyJ4ptf) userid(dmaimone@unime.it) provider(pushbullet)

/*==============================================================================
                             Setting graph scheme
==============================================================================*/

set scheme s2mono
grstyle init
grstyle set plain, nogrid

/*==============================================================================
                                   Shortcuts
==============================================================================*/

global F2 "`"
global F3 "'"
Thanks for your kind attention.
Dario
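A quick way to confirm that profile.do actually ran and that the shortcut globals are defined is to list them from the Command window; a minimal check:
Code:
* Lists the contents of the global macros F2 and F3 set in profile.do
macro list F2 F3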