Hi,
In the context of corporate finance, some studies claim to use firm and industry fixed effects together in panel data regressions. However, since the firm effects absorb all time-invariant variables, how can a researcher also include industry effects (the industry of a firm generally remains the same over time) in the same regression? I understand that if the industry of even a single firm in the dataset changes from one year to another, it becomes mechanically possible to obtain results for the fixed-effects regression. But since the industry of a firm usually remains the same across time for almost the entire sample, how reliable are the beta coefficients of the independent variables in a regression with both firm and industry effects?
Here are some papers which employ firm and industry effects together:
Thakur, B., & Kannadhasan, M. (2018). Corruption and cash holdings: Evidence from emerging market economies. Emerging Markets Review, 38, 1-17. doi: 10.1016/j.ememar.2018.11.008
Venkiteshwaran, V. (2011). Partial adjustment toward optimal cash holding levels. Review of Financial Economics, 20(3), 113-121. doi: 10.1016/j.rfe.2011.06.002
Thanks!
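For intuition, here is a minimal sketch (with hypothetical variable names) of what happens mechanically when a time-invariant industry indicator is added to a firm fixed-effects regression:
Code:
* With firm fixed effects, industry dummies are constant within each panel
* and are collinear with the firm effects unless some firm switches industry.
xtset firm_id year
xtreg y x i.industry, fe
* Stata reports the i.industry levels as omitted because of collinearity;
* any industry coefficients that survive are identified only off switchers.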
Monday, December 31, 2018
Stata: type mismatch error with substr()
Hello. I select the Excel file and import it, then run the following code in Stata:
gen area = substr(ADDRESS, 1, 1)
It works for all my files, but in one case I get the following error:
type mismatch
r(109);
What is the problem?
Thanks for the help.
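A type mismatch from substr() usually means that in the failing file ADDRESS was imported as a numeric variable. A minimal defensive sketch, assuming conversion to string is acceptable:
Code:
* If ADDRESS came in as numeric (common when the column looks like numbers),
* substr() fails with r(109). Convert it to string first.
capture confirm string variable ADDRESS
if _rc tostring ADDRESS, replace
gen area = substr(ADDRESS, 1, 1)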
variable names as elements?
Dear All, I came across the following question. The data are shown below,
and the desired result follows after.
The rule is, for example: for ExpertID=290, only domain_C=1, so the desired result is domain=C. As another example, for ExpertID=11, domain_A, domain_B, and domain_C are all equal to 1, so the desired result is domain=ABC, and so on. Any suggestion is appreciated.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int ExpertID byte(domain_A domain_B domain_C domain_D domain_E)
290 . . 1 . .
 90 1 . . . .
149 1 . . . .
 11 1 1 1 0 0
181 1 1 1 0 0
 17 1 . . 1 .
142 1 . 1 . .
 40 1 1 . . .
106 . . . 1 .
182 1 0 0 0 0
end
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int ExpertID str10 domain
290 C
 90 A
149 A
 11 ABC
181 ABC
 17 AD
142 AC
 40 AB
106 D
182 A
end
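One way to build the concatenated string, as a minimal sketch using the posted example data:
Code:
* Append each domain letter whenever the corresponding indicator equals 1.
gen str10 domain = ""
foreach l in A B C D E {
    replace domain = domain + "`l'" if domain_`l' == 1
}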
Variable not found in nlcom
Dear Everyone,
I'm new here, and I would like to measure willingness to pay using the double-bounded method. I have the constant and the coefficients for each independent variable, but I get a 'not found' message for one of my variables. Is there a mistake in my command?
nlcom (wtp: _b[_cons]+Emplo_m*b[Emplo]+Income_m*b[Income]+TriedtoQuit_m*b[TriedtoQuit]+Heal_Res_m*b[Heal_Res]+PeerInfluence_m*b[PeerInfluence]+Toquit_m*b[Toquit]+Notice_m*b[Notice]), noheader
I have seven independent variables and this is what I received:
Emplo_m not found
r(111);
Thanks for your kind assistance.
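Two things are probably going wrong: nlcom only understands coefficients (written _b[...], with the underscore) and constants, and the *_m terms must exist as scalars or macros rather than as variables. A hedged sketch, assuming the *_m terms are meant to be sample means:
Code:
* Compute each mean into a local macro, then reference _b[] correctly.
foreach v in Emplo Income TriedtoQuit Heal_Res PeerInfluence Toquit Notice {
    quietly summarize `v', meanonly
    local `v'_m = r(mean)
}
nlcom (wtp: _b[_cons] + `Emplo_m'*_b[Emplo] + `Income_m'*_b[Income]      ///
    + `TriedtoQuit_m'*_b[TriedtoQuit] + `Heal_Res_m'*_b[Heal_Res]        ///
    + `PeerInfluence_m'*_b[PeerInfluence] + `Toquit_m'*_b[Toquit]        ///
    + `Notice_m'*_b[Notice]), noheader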
Sunday, December 30, 2018
Insufficient observations to compute bootstrap standard errors
I am trying to perform mi imputation with bootstrap, using the following syntax:
mi set wide
program define myboot, rclass
mi register imputed varlist....
mi impute mvn varlist....., add(187)
egen country1 = group(country)
mi xtset country1 year, yearly
mi estimate: xtreg varlist.....
return scalar b_a = el(e(b_mi),1,1)
return scalar b_b = el(e(b_mi),1,2)
return scalar b_c = el(e(b_mi),1,3)
return scalar b_d = el(e(b_mi),1,4)
return scalar b_e = el(e(b_mi),1,5)
return scalar b_f = el(e(b_mi),1,6)
return scalar b_g = el(e(b_mi),1,7)
end
set seed 23543
bootstrap b_va1=r(b_a) b_var2=r(b_b) b_var3=r(b_c) b_var4=r(b_d) b_var5=r(b_e) b_var6=r(b_f) intercept=r(b_g), reps(2000) : myboot
I am facing the following problem after execution:
Bootstrap replications (2000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxx
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
Please guide me.
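This error usually means the replications themselves are failing, so bootstrap never receives the r() scalars. A common first debugging step, as a sketch, is to run a few replications verbosely to see the underlying error:
Code:
* noisily displays the output of myboot inside each replication, which
* usually reveals why the r() results are missing.
bootstrap b_va1=r(b_a), reps(5) noisily: myboot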
Problem with a big database
Hello, I want to import an Excel file, but Stata 14 will not open it: the file is 53.6 MB, and a pop-up window tells me that the maximum capacity for this type of file is 40 MB. Could you guide me on how to proceed? The database has been worked on in SPSS, but since I have Stata, I prefer to work on it in this program.
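A common workaround, as a hedged sketch assuming the sheet can be re-saved from Excel: export the sheet to CSV, which import delimited reads without going through that dialog's size limit.
Code:
* "mydata.csv" is a hypothetical name for the CSV exported from Excel.
import delimited using "mydata.csv", varnames(1) clear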
area specific linear time trend
Hi All,
Question 1:
I'm running a difference-in-differences analysis on yearly repeated cross-sectional data, and I'd like to include area-specific linear time trends.
xtset id year
xtreg Y x1 x2 c.year_sequence # i.id, fe r
(note: year runs from 2005 to 2018; year_sequence runs from 1 to 14)
Is this code correct?
Question 2:
I'm running a difference-in-differences analysis on yearly data (t). I have multiple areas (i) and multiple jobs (j). How do I control for an area-specific trend that is not linear?
Is i.year##i.area correct?
xtset area year
xtreg Y x1 x2 i.job i.area ## i.year , fe r
Thanks, much appreciated!
Happy New Year!
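On the syntax itself: factor-variable operators take no surrounding spaces, and with fe the panel-level (id or area) main effects are already absorbed. A hedged sketch of both specifications:
Code:
* Question 1: area-specific linear trends (no spaces around #).
xtset id year
xtreg Y x1 x2 c.year_sequence#i.id, fe r
* Question 2: fully flexible area-by-year effects. Note i.area#i.year
* absorbs all area-time variation, so identification must come from
* variation within area-year (e.g., across jobs).
xtset area year
xtreg Y x1 x2 i.job i.area#i.year, fe r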
new command -rdcont- on SSC: test of running variable continuity in RDD
Hello all! Thanks to Kit Baum, a new package rdcont is now downloadable from SSC! This program can be installed from SSC by typing ssc install rdcont in the Stata command window.
Description: A common practice in the regression discontinuity design (RDD) is to test the hypothesis that the running variable has a continuous density at the threshold. rdcont tests this hypothesis using an approximate sign test, as detailed in Bugni and Canay (2019). Relative to competing tests, the approximate sign test is asymptotically valid under mild conditions. The rdcont test is implemented by default using the data-dependent choice of “q” provided by Bugni and Canay (2019).
Example: The example below uses data from Lee (2008), which uses RDD to estimate the effect of the incumbency advantage in US elections, to test the assumption of continuity in the running variable, difference in vote share between parties.
Code:
use http://fmwww.bc.edu/repec/bocode/t/table_two_final.dta, clear
rdcont difdemshare if use==1
Happy coding,
Joe
Changing many values at once
Hi all,
I am working with panel data from a household survey. For each household (nohhold), multiple observations are made in each year (one per household member). eqin (income) is reported only for the head of the household, but I want this value extended to every household member, since my research focuses on spouses. Is there an easy way to do this? That is, to give all observations of eqin for nohhold 106 in 2008 the value 6481.481?
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double nohhold float(year eqin)
106 2007 .  106 2007 .  106 2007 .
106 2007 .  106 2007 .  106 2007 .
106 2007 .  106 2007 .  106 2008 6481.481
106 2008 .  106 2008 .  106 2008 .
106 2009 .  106 2009 7711.111  106 2009 .
106 2009 .  106 2010 .  106 2010 .
106 2010 .  106 2010 .  106 2011 .
106 2011 8888.889  106 2011 .  106 2011 .
106 2012 .  106 2012 8888.889  106 2012 .
106 2012 .  106 2013 .  106 2013 .
106 2013 .  106 2013 8888.889  106 2014 .
106 2014 .  106 2014 8888.889  106 2014 .
106 2015 9796.296  106 2015 .  106 2015 .
106 2015 .  106 2016 .  106 2016 .
106 2016 8888.889  106 2016 .  106 2017 11111.11
106 2017 .  106 2017 .  106 2017 .
318 2007 .  318 2007 .  318 2007 .
318 2007 .  318 2007 .  318 2007 .
end
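A minimal sketch of spreading the head's value to all household members, assuming at most one non-missing eqin per household-year:
Code:
* Non-missing values sort first, so eqin[1] is the head's value (if any).
bysort nohhold year (eqin): replace eqin = eqin[1] if missing(eqin)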
Fairlie decomposition
Hello,
I am using the fairlie Stata module for decomposition (https://ideas.repec.org/c/boc/bocode/s456727.html) to analyze the following model:
Independent variables: aa001, aa004, ba016, ea104, eb001, eb002, ec023
Dependent variable: eh041 (binary: 0, 1)
Group variable: groupvar (binary: 0, 1)
However, I am unable to find any information, on Statalist.org or elsewhere, that helps me interpret these results correctly. The publications by Fairlie did not get me any further either.
Question: Can anyone please point me in the right direction for interpreting the results below?
Your response is highly appreciated!
The fairlie module is run using the following command:
Code:
fairlie eh041 aa001 aa004 ba016 ea104 eb001 eb002 ec023, by(groupvar)
This produces the following output:
Code:
Iteration 0:   log likelihood = -877.38553
Iteration 1:   log likelihood = -862.66744
Iteration 2:   log likelihood = -862.24213
Iteration 3:   log likelihood = -862.24169

Logistic regression                             Number of obs =       2976
                                                LR chi2(7)    =      30.29
                                                Prob > chi2   =     0.0001
Log likelihood = -862.24169                     Pseudo R2     =     0.0173
------------------------------------------------------------------------------
    groupvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       aa001 |   .4093181   .1622917     2.52   0.012     .0912323     .727404
       aa004 |   .0152344   .0065036     2.34   0.019     .0024875    .0279812
       ba016 |    -.16377   .0795174    -2.06   0.039    -.3196213   -.0079188
       ea104 |  -.0061618   .0080322    -0.77   0.443    -.0219046      .009581
       eb001 |   .2024671   .1832657     1.10   0.269    -.1567272    .5616613
       eb002 |  -.2667996   .1977646    -1.35   0.177    -.6544111    .1208119
       ec023 |   .1391324   .0841333     1.65   0.098    -.0257658    .3040306
       _cons |   1.896858   .6315424     3.00   0.003     .6590576    3.134658
------------------------------------------------------------------------------

Decomposition replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

Non-linear decomposition by groupvar (G)        Number of obs   =      6,312
                                                N of obs G=0    =       2976
                                                N of obs G=0    =       3336
                                                Pr(Y!=0|G=0)    =  .91330645
                                                Pr(Y!=0|G=1)    =  .89868106
                                                Difference      =   .0146254
                                                Total explained =  .00011247
------------------------------------------------------------------------------
    groupvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       aa001 |   .0004943   .0005906     0.84   0.403    -.0006633    .0016518
       aa004 |  -.0010569   .0006778    -1.56   0.119    -.0023854    .0002716
       ba016 |  -.0001179   .0005283    -0.22   0.823    -.0011533    .0009176
       ea104 |  -.0002608   .0004181    -0.62   0.533    -.0010804    .0005587
       eb001 |  -.0001183    .000191    -0.62   0.536    -.0004927    .0002561
       eb002 |   .0006998   .0005598     1.25   0.211    -.0003974     .001797
       ec023 |   .0004668   .0003892     1.20   0.230    -.0002961    .0012296
------------------------------------------------------------------------------
Issue with three-dimensional panel data analysis
Hi,
I am new to Stata and would like some advice on the following problem. I am dealing with a panel count data model. My dependent variable is the number (count) of investment projects in each host country (i), in each sector (j), in a given year (t). The data are a panel from 2003 to 2016, with 12 industries and 105 host nations. My two main explanatory variables vary by industry and time (jt) and by country, industry, and time (ijt). Control variables vary by i and t. Therefore,
I am dealing with three-dimensional panel data analysis:
i = country, 105
j = industry, 12
t = year, 14
I am using xtpoisson with the fe approach and robust standard errors.
After reading many previous Statalist posts, I realized that in order to xtset my data I need to combine country and industry into a single panel identifier:
egen panelid = group(country industry)
xtset panelid year
xtpoisson Y X i.year, fe robust
However, because I have many countries (105) and many zeros in my dependent variable, an important downside of this estimation is the loss of degrees of freedom from including all these dummy variables. Instead of interacting country and industry, and because I do not want to combine the country and industry fixed effects, I also tried entering them separately.
For a model with industry fixed effects:
xtset industry
xtpoisson Y X i.year i.country, fe
I also incorporated regional dummies, to group my 105 countries, and included them in the model:
xtpoisson Y X i.year i.region, fe
My question: is there any other way to model three-dimensional panel data without combining industries and countries, which generates so many dummies? At the same time, grouping industries and countries together does not let me recover separate information about industries or countries.
I found this older post helpful for my decision:
https://www.statalist.org/forums/for...ata-regression
Any suggestions or advice will be greatly appreciated.
Thank you very much,
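One hedged alternative using only built-in commands: for Poisson (unlike most nonlinear models), the fixed-effects estimates can equivalently be obtained from a pooled Poisson with explicit dummies, which allows separate, additive country, industry, and year effects rather than a combined panel id. A sketch, not the poster's exact model:
Code:
* Separate additive fixed effects; slow with 105 countries but feasible.
poisson Y X i.country i.industry i.year, vce(cluster panelid)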
Questions about Data Setting
Dear Statalist,
I am now setting up my data for a difference-in-differences (DID) analysis.
However, there is a problem with how my data are arranged for DID.
My data look like this:
ID | revenue2007 | revenue2008 | revenue2009 | asset2009 | asset2010 | asset2011 | manufacture | wholesale | others
Here manufacture, wholesale, and others are 0/1 dummy variables.
I would like to set the above data as follows.
ID year revenue asset type of business
1 0(2007)
1 1(2008)
1 2(2009)
1 0
1 1
1 2
1 0
1 1
1 2
2 0
.
.
where year 0 represents 2007, 1 represents 2008, and 2 represents 2009.
(Suppose the treatment occurred in 2008.)
I tried to use 'reshape', but I do not know how to do it.
Thanks in advance.
HJ
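A hedged sketch of the reshape, assuming the variable stubs are exactly revenue#### and asset#### (note the revenue and asset years do not fully overlap, so some cells will be missing):
Code:
reshape long revenue asset, i(ID) j(year)
* recode calendar years to 0, 1, 2, ... with 2007 as the base
gen byte yearindex = year - 2007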
Marginsplot, addplot - adjustment
Hello,
Using the command -marginsplot, addplot(hist...)-, I got this graph:
[attached graph omitted]
I would like to move the histogram to the bottom of the graph (e.g., so it sits where the vertical (y) axis equals -0.2).
Thank you in advance!
Have a nice and creative new year!!!
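One common trick, as a hedged sketch: put the histogram on a second y axis and compress that axis's range so the bars sit at the bottom of the plot region. Here x is a stand-in for the variable underlying the margins:
Code:
marginsplot, addplot(histogram x, yaxis(2) fraction) ///
    yscale(axis(2) off range(0 4))
* range(0 4) on axis 2 squeezes the bar heights (fractions <= 1) into the
* bottom quarter of the plot; adjust the upper bound to taste.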
Twoway line by county
Hi everyone,
For my panel data descriptive analysis I am trying to graph the development of charging stations per km for my 18 counties.
Since I have monthly data, I first created yearly means for every county (to reduce the number of data points).
For my graph I am using the following code; the resulting graph is in the attached picture.
Code:
egen meanCHS4 = mean(ChStationsRoadKm), by(Year county)
Code:
twoway line meanCHS4 Year, by(county)
My problem now is that Oslo has a much higher amount of charging stations per Km than the other 17 counties and since they all use the same scale there is not much information I can see in the other 17 graphs. Is there a way to scale Oslo differently than the other 17 counties?
Thank you in advance,
Alex
[attached graph: twoway line of meanCHS4 by county]
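The by() option has a suboption for exactly this; a sketch:
Code:
* yrescale lets each county's panel use its own y-axis scale, so Oslo no
* longer compresses the remaining panels.
twoway line meanCHS4 Year, by(county, yrescale)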
Saturday, December 29, 2018
Merging 3 data sets
Good evening all. I am looking for some help merging 3 data sets.
All 3 data sets are sorted by patient ID (ptid) and I would like to merge by ptid. The issue is that the master data set has one row per ptid, but the other two have multiple rows of data with the same ptid.
I was able to merge the master data set with one of the two other data sets without a problem, using a 1:m merge.
Code:
use "apap_analysis"
merge 1:m ptid using "apap_meds"
Now I am unsure of how to merge in the 3rd data set, which contains the same ptid variable, but otherwise contains different variables than the first two datasets.
I tried using an m:m merge, but it created issues in the data, mainly duplicating rows that I do not want duplicated.
Does anyone know how I can merge in the 3rd data set? Can I tag the ptid in all 3 data sets and merge based on the tag ptid?
I can provide more detail/clarification if needed.
Thanks!!
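m:m merge is almost never what is wanted. If each meds row should be paired with each row of the third file for the same ptid, joinby forms those pairwise combinations. A hedged sketch ("apap_third" is a hypothetical filename):
Code:
use "apap_analysis", clear
merge 1:m ptid using "apap_meds", nogenerate
joinby ptid using "apap_third"
If instead the third file belongs at a different level, merging it before expanding the master (or keeping the files separate) avoids the unwanted duplication.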
Different results between xtreg and xtivreg2
I recently ran into an issue with xtivreg2. The coefficient estimates differ greatly between the xtivreg2 and xtreg estimations, although I have the same number of observations in both cases, as you can see below. Do you know why this might be the case? Thanks very much for your help.
Ken
. xi: xtivreg2 income_ln (l_nooutage=l_nooutage_other) i.year, fe robust
i.year _Iyear_2012-2016 (naturally coded; _Iyear_2012 omitted)
FIXED EFFECTS ESTIMATION
------------------------
Number of groups = 3563 Obs per group: min = 2
avg = 2.7
max = 3
IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity
Number of obs = 9557
F( 3, 5991) = 52.73
Prob > F = 0.0000
Total (centered) SS = 1979.58912 Centered R2 = -0.0393
Total (uncentered) SS = 1979.58912 Uncentered R2 = -0.0393
Residual SS = 2057.374316 Root MSE = .5859
------------------------------------------------------------------------------
| Robust
income_ln | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l_nooutage | 2.12346 .6731734 3.15 0.002 .8040644 3.442856
_Iyear_2014 | .0671006 .0173548 3.87 0.000 .0330859 .1011154
_Iyear_2016 | .1476405 .0245762 6.01 0.000 .0994721 .195809
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic): 164.081
Chi-sq(1) P-val = 0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic): 184.444
(Kleibergen-Paap rk Wald F statistic): 181.606
Stock-Yogo weak ID test critical values: 10% maximal IV size 16.38
15% maximal IV size 8.96
20% maximal IV size 6.66
25% maximal IV size 5.53
Source: Stock-Yogo (2005). Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments): 0.000
(equation exactly identified)
------------------------------------------------------------------------------
Instrumented: l_nooutage
Included instruments: _Iyear_2014 _Iyear_2016
Excluded instruments: l_nooutage_other
------------------------------------------------------------------------------
. xi: xtreg income_ln l_nooutage i.year, fe robust
i.year _Iyear_2012-2016 (naturally coded; _Iyear_2012 omitted)
Fixed-effects (within) regression Number of obs = 9,557
Group variable: hh_ID Number of groups = 3,563
R-sq: Obs per group:
within = 0.0299 min = 2
between = 0.0287 avg = 2.7
overall = 0.0156 max = 3
F(3,3562) = 52.80
corr(u_i, Xb) = 0.0359 Prob > F = 0.0000
(Std. Err. adjusted for 3,563 clusters in hh_ID)
------------------------------------------------------------------------------
| Robust
income_ln | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l_nooutage | -.1175234 .1064316 -1.10 0.270 -.3261965 .0911496
_Iyear_2014 | .1045732 .01321 7.92 0.000 .0786732 .1304731
_Iyear_2016 | .2107709 .0169039 12.47 0.000 .1776287 .2439131
_cons | 11.66598 .6202094 18.81 0.000 10.44998 12.88199
-------------+----------------------------------------------------------------
sigma_u | .74279106
sigma_e | .56616934
rho | .63252005 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Reshape Long Missing Values Error
My dataset is in time series format.
I am converting it to a panel using the following code:
Code:
reshape long var, i(date)
But I keep getting the following error:
variable _j contains all missing values
r(498);
Here is the data sample:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int date double(var1HK0000040383 var2HK0000050325) byte var3KR7000010009 double(var4KR7000020008 var5KR7000030007 var6KR7000040006)
16072 0 0 0 5243.082 11930.294 532.714
16075 0 0 0 5243.082 11930.294 532.714
16076 0 0 0 5243.082 11930.294 532.714
16077 0 0 0 5243.082 11930.294 532.714
16078 0 0 0 5243.082 11930.294 532.714
16079 0 0 0 5243.082 11930.294 532.714
16082 0 0 0 5243.082 11930.294 532.714
16083 0 0 0 5243.082 11930.294 532.714
16084 0 0 0 5243.082 11930.294 532.714
16085 0 0 0 5243.082 11930.294 532.714
16086 0 0 0 5243.082 11930.294 532.714
16089 0 0 0 5243.082 11930.294 532.714
16090 0 0 0 5243.082 11930.294 532.714
16091 0 0 0 5243.082 11930.294 532.714
16092 0 0 0 5243.082 11930.294 532.714
16093 0 0 0 5243.082 11930.294 532.714
16096 0 0 0 5243.082 11930.294 532.714
16097 0 0 0 5243.082 11930.294 532.714
end
format %tdnn/dd/CCYY date
What is the problem here? Thank you.
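The suffixes after the stub var (e.g., 1HK0000040383) are not numeric, and reshape's j() is numeric by default, hence the all-missing _j. A sketch of the likely fix:
Code:
* Declare j as string so the non-numeric suffixes are kept.
reshape long var, i(date) j(code) string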
Doubt with power calculations
Greetings, I am new to the forum. I am working with a categorical data set and am trying to calculate the sample size for one variable, in my case epilepsy episode (yes/no). I want to count all the yes responses for further analysis, but I also want to include some N in the analysis. Should I use power oneproportion or power twoproportions in Stata?
Thanks for your help
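If the design involves a single group compared against a reference proportion, power oneproportion applies; comparing two independent groups calls for power twoproportions. A sketch with made-up proportions:
Code:
* one group against a null proportion of 0.5, alternative 0.6
power oneproportion 0.5 0.6, alpha(0.05) power(0.8)
* two independent groups with proportions 0.5 and 0.6
power twoproportions 0.5 0.6, alpha(0.05) power(0.8)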
Parallel loop with numlist and varlist
I built a foreach loop with locals in it so that I can loop over a varlist and a numlist in parallel. I want to generate a new variable in each iteration equal to the product of the variable from the varlist and the number from the numlist, but I ended up getting a repeating string (see the screenshot below). I also attached my code. How can I get the product I want, e.g. 136*7, instead of seven copies of "136" concatenated together?
[attached screenshots of the code and output omitted]
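The concatenation is the clue: in Stata expressions, "136"*7 repeats the string, so the variable is almost certainly a string and needs to become numeric first. A hedged sketch of a parallel loop over made-up names v1-v3 and a numlist:
Code:
destring v1 v2 v3, replace     // only needed if the variables are string
local nums 7 5 3
local i = 1
foreach v of varlist v1 v2 v3 {
    local n : word `i' of `nums'
    gen double `v'_prod = `v' * `n'
    local ++i
}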
kdensity for 10,000 variables
Hello, I plan to make an illustrative graph showing kdensity curves for about 10,000 groups. I use the code below:
Code:
forvalues j = 1(1)10000 {
    local call `call' (kdensity norm if id == `j', legend(off)) ||
}
twoway `call'
However, twoway returns an error saying there are too many plots. Is there another way to draw a joint kdensity graph for this many groups?
Thank you.
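twoway caps the number of overlaid plots, so thousands of kdensity calls cannot be combined directly. One hedged workaround: precompute each density, stack the results with a gap row between groups, and draw everything as a single line plot:
Code:
tempfile stacked
preserve
clear
save `stacked', emptyok
restore
levelsof id, local(ids)
foreach j of local ids {
    preserve
    keep if id == `j'
    kdensity norm, generate(x dens) nograph
    keep x dens
    drop if missing(x)
    gen id = `j'
    set obs `=_N+1'          // blank row so the line breaks between ids
    append using `stacked'
    save `stacked', replace
    restore
}
preserve
use `stacked', clear
twoway line dens x, cmissing(n)
restore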
Importing data from Excel
Dear All,
I've got an Excel file with about 37 sheets. The sheets are identical in layout (e.g., in terms of number of columns, rows, etc.). How can I import them all at once into a single Stata file?
Thanks,
Dapel
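A hedged sketch, assuming the sheets are named Sheet1 through Sheet37 (adjust as needed) and each has a header row:
Code:
tempfile all
clear
save `all', emptyok
forvalues s = 1/37 {
    import excel using "myfile.xlsx", sheet("Sheet`s'") firstrow clear
    gen int sheetno = `s'    // keep track of the source sheet
    append using `all'
    save `all', replace
}
use `all', clear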
Reporting results ordered logit regression: individual predictors or entire model?
Hello,
I am running an ordered logit regression to predict eh041 from the variables aa001, aa004, ba016, ca001, ea104, eb001, eb002, ec023, and dummy.
My question: is it better to report the coefficients, standard errors, and p-values for each individual predictor, or for the entire model (and if the latter, which statistics should be reported)?
Example of the dataset:
Command for ordered logit regression:
Output:
Example of the dataset:
Code:
input long id int year byte(ca001 aa001) float aa004 byte(eb001 eb002) float(ea104 ec023) byte(eh041 ba016) int(dummy)
11001 2004 1 1 60 0 1 10 3 2 4 0
11001 2006 . . . . . . . . . 1
11002 2004 . 2 65 . . . . . 4 0
11002 2006 . . . . . . . . . 1
25601 2004 1 1 50 0 1 36 5 2 6 0
25601 2006 1 1 52 0 1 36 4 1 6 1
end
Code:
ologit eh041 aa001 aa004 ba016 ca001 ea104 eb001 eb002 ec023 dummy
Code:
note: ca001 omitted because of collinearity
Iteration 0:   log likelihood = -5928.1906
Iteration 1:   log likelihood = -5880.5609
Iteration 2:   log likelihood = -5880.4552
Iteration 3:   log likelihood = -5880.4552

Ordered logistic regression                     Number of obs =      6,312
                                                LR chi2(8)    =      95.47
                                                Prob > chi2   =     0.0000
Log likelihood = -5880.4552                     Pseudo R2     =     0.0081
------------------------------------------------------------------------------------
             eh041 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
             aa001 |  -.3323744   .0608015    -5.47   0.000    -.4515431   -.2132056
             aa004 |  -.0066495   .0024361    -2.73   0.006    -.0114241   -.0018748
             ba016 |   .0874215      .0311     2.81   0.005     .0264666    .1483763
             ca001 |          0  (omitted)
             ea104 |  -.0065987   .0030108    -2.19   0.028    -.0124998   -.0006976
             eb001 |   .0052226   .0671317     0.08   0.938    -.1263531    .1367982
             eb002 |  -.2247139   .0795934    -2.82   0.005    -.3807141   -.0687136
             ec023 |  -.2138397   .0323412    -6.61   0.000    -.2772272   -.1504521
             dummy |   .0544896   .0502734     1.08   0.278    -.0440444    .1530237
-------------------+----------------------------------------------------------------
             /cut1 |  -2.209866   .2457687                     -2.691564   -1.728168
             /cut2 |   .8122446   .2445497                       .332936    1.291553
             /cut3 |   3.017259   .2680799                      2.491832    3.542686
------------------------------------------------------------------------------------
Wald chi2 disappears after I apply the robust variance estimate
Dear Stata users and experts,
I am running an analysis with a time-invariant variable, cross-sectional variables, and longitudinal variables. The data cover 157 firms, with observations of the dependent variable for the years 2008-2013. I included year and industry dummies in the GEE model; without the robust option I got the Wald chi2, but when I added the robust variance estimate the Wald chi2 disappeared. Any suggestions?
New command -oaxaca_rif-
Dear all,
Thanks to Prof. Baum a new command named oaxaca_rif is now available in the SSC archive.
This command is a wrapper for the -oaxaca- command that allows for the estimation of reweighted RIF (recentered Influence Function) decomposition for a large set of distributional statistics.
Hope you find it useful.
Fernando
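A hedged usage sketch, with hypothetical variables; the statistic names follow the package's rif() option, so check -help oaxaca_rif- for the exact spellings:
Code:
ssc install oaxaca_rif
* decompose the group gap in log wages at the median (q(50))
oaxaca_rif lnwage educ exper tenure, by(female) rif(q(50))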
Quicker way to export correlation coefficients into Excel?
Hello, I am running correlations over hundreds of variables and storing the output of the correlation coefficients, the variable names, number of observations and the confidence interval into Excel.
Due to the number of correlations I'm running, I am wondering whether there is a quicker way for my computer to run the task. This is my current code:
quietly {
    putexcel set coef3, modify
    local i = 0
    foreach var of varlist ea_* {
        foreach var2 of varlist wdi_* {
            local i = `i' + 1
            esize unp `var'==`var2', pbcorr
            return list
            putexcel A`i'=`r(r_pb)' B`i'=`r(lb_r_pb)' C`i'=`r(ub_r_pb)' D`i'=`r(N_1)' E`i'="`var'" F`i'="`var2'", nformat(excelnfmt)
        }
    }
}
Thank you and happy holidays
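The cell-by-cell putexcel calls are the usual bottleneck. A hedged sketch of a faster pattern: accumulate the numbers in a Stata matrix and write it to Excel once at the end (the return list call can also be dropped):
Code:
quietly {
    foreach var of varlist ea_* {
        foreach var2 of varlist wdi_* {
            esize unpaired `var' == `var2', pbcorr
            * append one row of results per pair
            matrix R = nullmat(R) \ (r(r_pb), r(lb_r_pb), r(ub_r_pb), r(N_1))
        }
    }
}
putexcel set coef3, modify
putexcel A1 = matrix(R), nformat(excelnfmt)
* the variable-name columns can likewise be written in a single pass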
Help with output/results window cutting off variable names
The value labels in my table are not shown in full. Is there any way of getting Stata to tabulate so the output shows the full labels? There is plenty of space in the Results window.
Code:
. tab stilling_i_husstand_std kn_std if Særbarn_in_household==1 & ægtefælle_in_household==0

stilling_i_husstand_s |              kn_std
                   td |        ??          K          M |     Total
----------------------+---------------------------------+----------
Barn af Enke hos hu.. |         0          1          0 |         1
Enke hos husstandso.. |         0          6          1 |         7
Faglig medarbejder .. |         0         10          4 |        14
   Husstandsoverhoved |         0      1,697        459 |     2,156
Husstandsoverhoveds.. |         0         10          2 |        12
Husstandsoverhoveds.. |         0         16          3 |        19
Husstandsoverhoveds.. |         0      1,230        356 |     1,586
Husstandsoverhoveds.. |         0         15         11 |        26
Husstandsoverhoveds.. |         0          1          0 |         1
Husstandsoverhoveds.. |         0          5          0 |         5
Husstandsoverhoveds.. |         0        104          9 |       113
Husstandsoverhoveds.. |         0          6          0 |         6
Husstandsoverhoveds.. |         1        201        464 |       666
Husstandsoverhoveds.. |         0         11         11 |        22
Husstandsoverhoveds.. |         0          1          0 |         1
Husstandsoverhoveds.. |         0          4          0 |         4
Husstandsoverhoveds.. |         0          2          0 |         2
Husstandsoverhoveds.. |         0          1          4 |         5
Husstandsoverhoveds.. |         0         55         51 |       106
Husstandsoverhoveds.. |         0          0          5 |         5
Husstandsoverhoveds.. |         0        219      2,914 |     3,133
Husstandsoverhoveds.. |         0         14         10 |        24
Husstandsoverhoveds.. |         0         29          2 |        31
Husstandsoverhoveds.. |         0          8          3 |        11
Husstandsoverhoveds.. |         0        467         51 |       518
Husstandsoverhoveds.. |         0          0          1 |         1
Husstandsoverhoveds.. |         0          5          0 |         5
Husstandsoverhoveds.. |         0          1          0 |         1
Husstandsoverhoveds.. |         0          6          1 |         7
                OTHER |         0        233         73 |       306
Opholdende hos huss.. |         0         13          2 |        15
Tjenestefolk hos op.. |         0          2          0 |         2
----------------------+---------------------------------+----------
                Total |         1      4,373      4,437 |     8,811
Detrending/deseasonalizing data: finding a weekly mean
Hiya,
For my project I have daily stock market data (returns and volatility) and daily weather data (cloud cover, rain, temperature).
How do I get Stata to take a weekly average and then subtract it from the daily value, in order to see just the excess over the weekly mean?
Also, does anyone know how to make a graph showing the returns for specific observation values only? By this I mean: cloud cover is measured on a 0 to 8 scale; how do I graph the returns for only the values 0 and 8, omitting the other observation values?
Thank you so much!
If you could write the do-file commands, that would be great.
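A hedged sketch, assuming a daily date variable named date and variables named returns and cloudcover:
Code:
gen int week = wofd(date)                      // weekly date from daily date
egen weekmean = mean(returns), by(week)
gen excess = returns - weekmean
* returns for cloud cover values 0 and 8 only
twoway scatter returns cloudcover if inlist(cloudcover, 0, 8)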
IDs in different categories : how to count ?
Dear statalist members,
I have a sample of about 1 million people (id), each with one or more records in one or more categories (cat), 15 categories in total. In summary:
id cat
id1 cat1
id2 cat1
id2 cat1
id3 cat1
id3 cat3
… …
I'm trying to find out how many people have at least one record in two or more different categories, and which categories those are. I am not interested in the other people (those with just one record, or several records all in the same category). In summary, I would like a result of the type:
At least one record in cat1 and cat2: 1000 people;
At least one record in cat1 and cat3: 500 people;
At least one record in cat1, cat2 and cat3: 200 people;
...
For now, I have only managed to count each person's number of records:
bysort id: gen obs = _N
bysort id: gen obs2 = _n
keep if obs2 == 1
tab obs
Could someone tell me how I could solve this problem?
Many thanks,
Maxime
(Stata 13.1)
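A hedged sketch (assuming cat is a string variable): collapse to one row per person carrying the set of distinct categories as a pattern, then tabulate the patterns:
Code:
bysort id cat: gen byte first = _n == 1     // first record of each id-cat pair
bysort id (cat): gen pattern = cat if _n == 1
by id: replace pattern = pattern[_n-1] + cond(first, "+" + cat, "") if _n > 1
by id: egen ncat = total(first)             // number of distinct categories
by id: keep if _n == _N                     // one row per person
tab pattern if ncat >= 2, sort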
Cox regression with enormous hazard ratios (logarithmic)
Dear forum,
I have encountered a problem: for my Cox regression, the output gives enormous hazard ratios for my outcome (disease recurrence), such as 1.33e+10.
A) First, the specifics:
In this project I assess the impact of response to chemotherapy (i.e., "pres", a variable with 3 levels: complete, partial, no response) on disease recurrence within my follow-up. Displayed graphically (Kaplan-Meier plot), the outcome is quite striking:
[Kaplan-Meier plot omitted]
However, when I run a Cox regression adjusted for other variables (age, smoking status, etc.), my output displays grotesque hazard ratios:
[Cox regression output omitted]
The problem remains even if I run a univariable model. I believe the issue is collinearity, in that, for example, "no response" predicts my outcome (disease recurrence) almost perfectly and therefore has a very large HR.
B) My questions are the following:
- Do you find my explanation plausible (see above)?
- Is there a solution, i.e., a way to run the Cox model (uni- or multivariable; as I only have 37 events, I fear overfitting) and obtain more approachable HRs?
- Lastly, if I run the Cox regression without the factor-variable prefix (that is, omitting "i." for the categorical), I get an HR of approximately 6 for my independent variable of choice ("pres"). I do not know, however, how Stata runs that regression when "pres" is not specified as a factor:
Does it treat the first level of the variable as the reference against the other two levels, "partial" and "no response"?
[stcox output omitted]
Thank you very much for your help and for taking the time to read this!
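On the last question: without the i. prefix, stcox treats pres as a continuous covariate, so the HR of about 6 is per one-step increase in the 1/2/3 coding, not a comparison against a reference level. A sketch of the factor-variable form (age stands in for the other adjusters):
Code:
stcox i.pres age
* each i.pres level is then compared with the base (first) category;
* with near-perfect separation and only 37 events, penalized methods or
* coarser categories are common remedies (a general remark, not advice
* specific to these data).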
Complex dummy variable
Hi Statalist,
Merry Christmas to those celebrating it and happy new Year.
I have a question about creating a dummy variable. I have a list of product names, each appearing once, that I have merged with a panel dataset in which the products appear repeatedly over time. The panel dataset contains more than 9,000 products repeated in time; my list contains about 500 products, not repeated (it is just a simple list). I would like to create a dummy variable taking the value 1 in the panel whenever the product name in the panel also appears in the list. A dataex follows below.
Of course the panel is the variable prd, which continues (and is much, much longer than recalled_products). For instance, for the first product, ADENOSINE, I would like a variable taking the value 1 whenever prd equals ADENOSINE (I am sure that all the names in recalled_products are also present in prd), in this case 12 times (the panel repeats 12 times for ADENOSINE).
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 recalled_products str18 prd
"ADENOSINE " "ALPHA-KETOGLUTARIC"  "ADRUCIL " "ISOPROPYL ALC/BENZ"
"TROVAN" "ISOPROPYL ALC/BENZ"  "TEKTURNA" "ISOPROPYL ALC/BENZ"
"ALOSETRON HCL" "ISOPROPYL ALC/BENZ"  "ORLAAM" "ISOPROPYL ALC/BENZ"
"DIETHYLPROPION HCL" "ISOPROPYL ALCOHOL"  "AMPHETAMINE SALTS" "ISOPROPYL ALCOHOL"
"CYTADREN " "ISOPROPYL ALCOHOL"  "AMINOPHYLLINE " "ALCOHOL"
"AMYTAL SOD" "ALCOHOL"  "PRAMOXINE/HC" "ALCOHOL"
"LOVENOX" "ALCOHOL"  "DIET SUPP EPHEDRA" "ALCOHOL"
"ASPIRIN " "ALCOHOL"  "AUVI-Q" "ALCOHOL"
"CLINIMIX" "ALCOHOL"  "BIOSCANNER KETONE" "ALCOHOL"
"PFIZERPEN G" "ALCOHOL"  "VASCOR" "ALCOHOL"
"PAMPRIN" "ALCOHOL"  "BICALUTAMIDE " "20/20 EYE GLSS CLN"
"BISMUTH SUBGAL" "20/20 EYE GLSS CLN"  "ANTIVENIN" "20/20 EYE GLSS CLN"
"BLEPHAMIDE" "20/20 EYE GLSS CLN"  "LIPO 6" "20/20 REWETTING"
"BOOST " "360 OTC EXTRA STR"  "BORIC ACID " "360 OTC EXTRA STR"
"HEPARIN SOD" "4-WAY"  "BROMFENAC SOD" "4-WAY"
"BROMOCRIPTINE MESY" "4-WAY"  "BUPRENORPHINE HCL" "4-WAY"
"BUPROPION HCL SR W" "4-WAY"  "BURN " "4-WAY"
"BHT" "4-WAY"  "CARBINOXAMINE CMPD" "4-WAY"
"CARISOPRODOL " "4-WAY"  "SHARK CARTILAGE" "4-WAY"
"ZYMAR" "4-WAY"  "CELECOXIB " "4-WAY"
"CERTA-VITE SENIOR" "666"  "AQUACHLORAL" "666"
"CHLORAMPHENICOL " "666"  "LOBAC" "666"
"CHLOROFORM " "666"  "CHLOROQUINE PHOS" "666"
"CHORIONIC GONADO" "666"  "CLIOQUINOL " "666"
"NEOCIDIN" "666"  "CLOMIPRAMINE HCL" "666"
"CLOZAPINE " "666"  "CD/PSE" "666"
"ACETAMINOPHEN PM" "7-KETO DHEA"  "COUMADIN " "7-KETO DHEA"
"CUBICIN" "7-KETO DHEA"  "VASODILAN" "7-KETO DHEA"
"CYPROHEPTADINE HCL" "7-KETO DHEA"  "DRISTAN" "7-KETO DHEA"
"ALEVAZOL" "7-KETO DHEA"  "DEXAMFETAMINE " "7-KETO DHEA"
"PROPOXYPHEN-N/APAP" "7-KETO DHEA"  "DICLOFENAC SOD" "7-KETO DHEA"
"DICYCLOMINE HCL" "7-KETO DHEA"  "ORTHO DIENESTROL" "7-KETO DHEA"
"DIETHYLSTILBESTROL " "A & D PERSONAL CAR"  "MOTOFEN" "A & D PERSONAL CAR"
"GUANIDINE" "A & D PERSONAL CAR"  "LOMOTIL" "A & D PERSONAL CAR"
"TRANDATE" "A & D PERSONAL CAR"  "TIKOSYN" "A & D PERSONAL CAR"
"ANZEMET" "A & D PERSONAL CAR"  "DOMPERIDONE " "A & D PERSONAL CAR"
"DOXYCYCLINE HYCLAT" "A & D PERSONAL CAR"  "DICYCLOMINE HCL" "GARLIC/PARSLEY"
"DROPERIDOL " "GARLIC/PARSLEY"  "RAPTIVA" "GARLIC/PARSLEY"
"EPINEPHRINE " "GARLIC/PARSLEY"  "ERYTHROMYCIN" "GARLIC/PARSLEY"
"ERYTHROMYCIN ESTOL" "GARLIC/PARSLEY"  "ALCOHOL SWABS" "GARLIC/PARSLEY"
"PLACIDYL" "GARLIC/PARSLEY"  "ESTINYL" "GARLIC/PARSLEY"
"PEPPERMINT SPIRIT" "GARLIC/PARSLEY"  "ETOMIDATE " "GARLIC/PARSLEY"
"OBIZUR" "GARLIC/PARSLEY"  "FELBAMATE " "GARLIC/PARSLEY"
"FLUVOXAMINE MAL" "GARLIC/PARSLEY"  "FENTANYL " "GARLIC/PARSLEY"
"SULFISOXAZOLE" "GARLIC/PARSLEY"  "DURALGINA" "GARLIC/PARSLEY"
"GATIFLOXACIN " "GARLIC/PARSLEY"  "GELATIN " "GARLIC/PARSLEY"
"GEMFIBROZIL " "GARLIC/PARSLEY"  "GENTAMICIN SULF" "A&D CRKD SKIN RLF"
"GLUCOSAMINE SULF " "A&D CRKD SKIN RLF"  "ISMELIN" "A&D CRKD SKIN RLF"
"DYNABAC" "A+D FIRST AID"  "MITOXANTRONE HCL" "A+D FIRST AID"
"PHENYLPROPANOLAMIN" "A+D FIRST AID"  "SORINE" "A+D FIRST AID"
end
Thank you very much,
Federico
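A hedged sketch, assuming the 500-name list sits in its own file (recall_list.dta, a hypothetical name) with the variable recalled_products, and that trailing blanks are the only formatting difference between the two name variables:
Code:
use recall_list, clear
gen prd_t = strtrim(recalled_products)
keep prd_t
duplicates drop
tempfile recalls
save `recalls'

use panel, clear              // hypothetical name for the panel file
gen prd_t = strtrim(prd)
merge m:1 prd_t using `recalls', keep(master match)
gen byte recalled = _merge == 3
drop _merge prd_t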
Friday, December 28, 2018
Simple Time Series Regression
Hello everyone,
I have a fairly simple question and hope you can help me out. I have already studied quite a lot of the questions and answers here in the forum, but most of them deal with different, more sophisticated problems.
My question: I want to figure out the correlation between y and x, for which I have time-series data (for example, y = unemployment and x = CPI).
I have already exponentially smoothed x = CPI (tssmooth exponential).
Now, as I am only interested in the correlation between y and x (y_t = ß0 + ß1 x_t + u_t), I was wondering whether a simple -reg y x- would give the desired results.
I am trapped in my own thoughts right now and need some clarity, as this approach seems way too simple.
I am very thankful for every reply.
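A simple regression does recover the association, but with time series the usual OLS standard errors are unreliable under serially correlated errors. A hedged sketch, assuming a time variable t (the lag choice is an assumption):
Code:
tsset t
regress y x
newey y x, lag(4)     // same slope, HAC (Newey-West) standard errors
correlate y x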
FMM lcprob variables
Hello experts,
In finite mixture models (fmm), the main model(s) can have certain IVs. I can then add variables to lcprob() to specify what determines the probability of being in each class. For example, the Stata help document predicts total medical expenditure (the DV) from gender, age, and income. In the basic model it uses only those IVs, and notes that this assumes the prior probability of being in each class is the same for all individuals; it then suggests it would make better sense to include each person's total number of chronic conditions in the lcprob() part of the model.
Now, my question is: what are the criteria for deciding that a variable belongs in the main model rather than in the lcprob() part? In other words, in the example, the total number of chronic conditions could equally have been used as one of the IVs in the main model.
I hope my question is clear.
Thanks in advance
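For concreteness, a hedged sketch of the two placements being contrasted (variable names mirror the help-file example but are assumptions here):
Code:
* chronic conditions shifting class membership:
fmm 2, lcprob(totchr): regress medexp age income i.female
* chronic conditions as a regressor within each class:
fmm 2: regress medexp age income i.female totchr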
calculating and graphing marginal effects from logit with interaction effect of two categorical variables
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(owndecision treat gender)
1 1 1
1 1 1
1 1 1
1 1 1
0 1 1
1 1 1
1 1 0
1 1 0
1 1 0
1 1 1
0 1 1
1 1 0
1 1 1
1 1 1
0 1 0
1 1 1
end
treat = 0, 1, 2 (three treatments: 0 = common, 1 = asymmetric, 2 = private)
gender = 1 if female, 0 otherwise
I would like the average marginal effects of defection (owndecision = 1) by gender for asymmetric and private, and to produce a graph that looks like this:
[desired graph omitted]
Code:
logit owndecision i.gender#i.treat

------------------------------------------------------------------------------
 owndecision |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender#treat |
        0 1  |   1.466337   .7372854     1.99   0.047     .0212843     2.91139
        0 2  |   2.590267   1.179689     2.20   0.028     .2781187    4.902415
        1 0  |   1.041454   .6522961     1.60   0.110     -.237023    2.319931
        1 1  |   1.977163   .6868733     2.88   0.004     .6309158     3.32341
        1 2  |          0  (empty)
------------------------------------------------------------------------------

margins, dydx(treat) over(gender)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.treat      |
      gender |
          0  |   .3472222   .1606046     2.16   0.031      .032443    .6620015
          1  |   .1828704   .1157482     1.58   0.114    -.0439919    .4097326
-------------+----------------------------------------------------------------
2.treat      |
      gender |
          0  |   .5138889   .1600699     3.21   0.001     .2001576    .8276201
          1  |          .  (not estimable)
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

marginsplot
[marginsplot graph omitted]
Any help would be appreciated. Thank you.
Writing loop for multiple regressions
I have 10 dependent variables, y1-y10, and their respective lagged variables, lagy1-lagy10. I would like to regress each dependent variable on its own lagged variable plus five fixed controls, e.g. regress y1 lagy1 x1-x5. How can I write a loop to run the 10 regressions and store the estimates?
Currently I wrote the following code, and the problems are that 1) it runs some meaningless regressions, e.g. regress y1 lagy2 x1-x5; and 2) the estimates could not be stored.
local dependant y1-y10
local independant lagy1-lagy10
local x = 1
foreach p of local dependant {
    foreach q of local independant {
        regress `p' x1 x2 x3 x4 x5 `q'
        est sto m_`x'
        local x = `x' + 1
    }
}
This is the first time I have written a loop in Stata, and I checked previous posts but still could not find a solution. I really appreciate any help or comments. Thank you very much for your time and consideration!
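Since the pairing is by index, a single loop over the index avoids the crossed (meaningless) regressions. Note also that a local defined as y1-y10 holds that literal text, not ten variable names; looping over the numbers sidesteps that as well:
Code:
forvalues k = 1/10 {
    regress y`k' lagy`k' x1 x2 x3 x4 x5
    estimates store m_`k'
}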
using "spmap"
Hi guys,
I am trying to map results using the spmap command, but I keep running into the error "master data not sorted"...
Below is the code I used:
------------------------------------------------
use "$processed/production_regional.dta", clear
format weight_edible_ameday %4.2f
spmap weight_edible_ameday using vietmap_province_region3.dta, id(_ID) fcolor($colorscale) ///
legend(symy(*1) symx(*1) size(3) pos(4)) ///
title("Total harvest (kg/day/AME)", size($titlesize)) cln(7) ///clm(c) clb($cats) ///
note("Source: ***** crop production", size($notesize))
graph export "$maps/prod_region_weight.png", as(png) replace
------------------------------------------------
Can anyone tell me how I can resolve this problem?
Many thanks,
Manny
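A hedged guess at the fix: -spmap- merges the master data with the coordinate file on the id() variable and expects the master to be sorted on it, so sorting right before the call often clears this error:
Code:
* minimal sketch, assuming _ID is the id variable in the master data
use "$processed/production_regional.dta", clear
sort _ID
spmap weight_edible_ameday using vietmap_province_region3.dta, id(_ID)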
How to present vignettes in a tabular format
Hello everyone,
Could you please help me to present vignettes in a tabular format rather than as running text?
My data look like this:
Code:
use setup, clear
gen phrase_A1 = "error"
replace phrase_A1 = "male" if gender ==1
replace phrase_A1 = "female" if gender ==2
gen phrase_A2 = "error"
replace phrase_A2 = "yes at the employer's premises" if experience_and_internship ==1
replace phrase_A2 = "yes, but in a different firm" if experience_and_internship ==2
replace phrase_A2 = "no" if experience_and_internship ==3
gen phrase_A3 ="error"
replace phrase_A3 ="Omani" if nationality == 1
replace phrase_A3 ="non-Omani" if nationality == 2
gen phrase_A4 = "error"
replace phrase_A4 = "leading university in Oman" if place_of_study ==1
replace phrase_A4 = "non-leading university in Oman" if place_of_study ==2
replace phrase_A4 = "leading university in Oman" if place_of_study ==3
replace phrase_A4 = "non-leading university abroad" if place_of_study ==4
gen phrase_A5 = "error"
replace phrase_A5 = "College Diploma" if level_of_education ==1
replace phrase_A5 = "College Higher Diploma" if level_of_education ==2
replace phrase_A5 = "Bachelor" if level_of_education ==3
replace phrase_A5 = "masters" if level_of_education ==4
gen phrase_A6 = "error"
replace phrase_A6 = "Engineering" if field_of_study ==1
replace phrase_A6 = "Business and Management" if field_of_study ==2
replace phrase_A6 = "Inforamtion and Technology" if field_of_study ==3
gen phrase_A7 = "error"
replace phrase_A7 = "high" if grade ==1
replace phrase_A7 = "fair" if grade ==2
replace phrase_A7 = "low" if grade ==3
gen phrase_A8 = "error"
replace phrase_A8 = "yes" if extra_curricular_activities ==1
replace phrase_A8 = "no" if extra_curricular_activities ==2
gen phrase_A9 = "error"
replace phrase_A9 = "yes by an exisiting employee" if referred ==1
replace phrase_A9 = "yes through school-linkages" if referred ==2
replace phrase_A9 = "no" if referred ==3
assert phrase_A1 ~= "error"
assert phrase_A2 ~= "error"
assert phrase_A3 ~= "error"
assert phrase_A4 ~= "error"
assert phrase_A5 ~= "error"
assert phrase_A6 ~= "error"
assert phrase_A7 ~= "error"
assert phrase_A8 ~= "error"
assert phrase_A9 ~= "error"
gen vigA = phrase_A1 + phrase_A2 + phrase_A3 + phrase_A4 + phrase_A5 + phrase_A6 + phrase_A7 + phrase_A8 + phrase_A9
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id_quest byte(vignr deck) float(id_vignette gender experience_and_internship field_of_study)
 1  1 11 105 1 2 1
 1  2 11 102 2 1 3
 1  3 11 109 2 2 2
 1  4 11 110 2 2 2
 1  5 11 108 1 1 3
 1  6 11 101 1 1 1
 1  7 11 104 2 3 2
 1  8 11 103 2 3 2
 1  9 11 107 1 2 3
 1 10 11 106 1 2 3
 2  1 13 125 1 2 1
 2  2 13 128 2 3 1
 2  3 13 130 1 2 2
 2  4 13 126 2 3 2
 2  5 13 129 2 3 1
 2  6 13 124 1 3 3
 2  7 13 121 2 3 1
 2  8 13 123 2 2 1
 2  9 13 127 1 2 1
 2 10 13 122 1 1 2
 3  1 19 184 1 2 1
 3  2 19 188 2 3 1
 3  3 19 181 1 3 2
 3  4 19 183 2 2 3
 3  5 19 189 2 1 1
 3  6 19 186 2 2 3
 3  7 19 182 2 2 1
 3  8 19 190 2 3 2
 3  9 19 185 2 1 1
 3 10 19 187 2 3 2
 4  1  4  40 1 3 2
 4  2  4  37 2 3 1
 4  3  4  35 2 1 1
 4  4  4  36 1 3 1
 4  5  4  39 1 3 1
 4  6  4  31 1 2 1
 4  7  4  32 1 3 3
 4  8  4  38 2 2 3
 4  9  4  33 1 3 3
 4 10  4  34 1 3 3
 5  1 12 111 1 1 3
 5  2 12 115 2 1 3
 5  3 12 117 1 2 3
 5  4 12 120 2 1 1
 5  5 12 116 1 1 3
 5  6 12 119 1 1 3
 5  7 12 113 1 1 3
 5  8 12 112 1 3 3
 5  9 12 118 1 2 2
 5 10 12 114 1 2 3
 6  1  5  41 1 3 2
 6  2  5  44 1 2 1
 6  3  5  43 2 1 2
 6  4  5  46 1 2 3
 6  5  5  42 2 2 1
 6  6  5  50 1 2 2
 6  7  5  48 1 1 2
 6  8  5  49 1 1 3
 6  9  5  47 2 2 2
 6 10  5  45 2 3 1
 7  1  7  69 1 2 2
 7  2  7  66 2 3 2
 7  3  7  63 2 3 1
 7  4  7  67 2 2 3
 7  5  7  64 2 3 2
 7  6  7  62 1 3 3
 7  7  7  70 1 1 1
 7  8  7  68 2 2 3
 7  9  7  65 1 3 3
 7 10  7  61 1 3 3
 8  1 17 165 1 2 1
 8  2 17 170 2 3 1
 8  3 17 167 1 3 1
 8  4 17 166 1 3 2
 8  5 17 168 2 2 2
 8  6 17 161 2 1 2
 8  7 17 163 1 1 1
 8  8 17 164 1 2 3
 8  9 17 169 1 3 3
 8 10 17 162 2 2 2
 9  1  8  75 1 2 1
 9  2  8  76 1 1 1
 9  3  8  73 2 1 1
 9  4  8  79 2 2 1
 9  5  8  72 2 3 2
 9  6  8  80 1 3 3
 9  7  8  71 2 3 3
 9  8  8  74 2 1 2
 9  9  8  77 1 1 2
 9 10  8  78 2 1 2
10  1 15 143 2 2 2
10  2 15 147 2 3 1
10  3 15 149 2 3 2
10  4 15 144 1 2 3
10  5 15 141 1 1 1
10  6 15 145 1 1 3
10  7 15 150 1 1 3
10  8 15 146 2 3 3
10  9 15 142 1 3 2
10 10 15 148 1 3 1
end
label values gender gender
label def gender 1 "male", modify
label def gender 2 "female", modify
label values experience_and_internship experience_and_internship
label def experience_and_internship 1 "yes at the employer's premises", modify
label def experience_and_internship 2 "yes, but in a different firm", modify
label def experience_and_internship 3 "no", modify
label values field_of_study field_of_study
label def field_of_study 1 "Engineering", modify
label def field_of_study 2 "Business and Management", modify
label def field_of_study 3 "Inforamtion and Technology", modify
I want a table like this:
table 1:
gender | male |
experience and internship | no |
field of study | engineering |
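A rough, untested sketch of one way to print such a table for a single vignette, using the labelled variables from the data example above (extend the varlist to the other attributes):
Code:
* print attribute | level pairs for the first observation in memory
foreach v of varlist gender experience_and_internship field_of_study {
    local lev : label (`v') `=`v'[1]'
    display as text %-28s subinstr("`v'", "_", " ", .) " | " as result "`lev'"
}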
The base year for finding yearly effects of the shock in DID
I wanted to estimate a difference-in-differences model in Stata looking at the effects of a trade shock (in 2007) on households' income. I have repeated cross-sectional data for the years 1995-2015. So, I estimated this model:
reg income Treat##Post i.year
where Treat is a dummy variable (1 for the treated group and 0 for the control group) and Post is a dummy variable (1 for years after the shock and 0 for years before the shock). I included year fixed effects (i.year) to control for time-varying macroeconomic changes. Treated households used to be richer than the control group before the shock; however, their income trends were parallel (so their income differences are not zero before the shock). The coefficient of interest (Treat*Post) is significantly negative.
I am also interested to find the effect of the shock for each year because I believe that the effect of shock has decreased over time. So, I estimated this model:
reg income ib2006.year##i.Treat
I have two questions regarding defining the base year:
(1) I defined the year before the shock (2006) as the base year. This assumes there is no difference between the two groups in terms of income in 2006, which is not correct, because, as I said before, treated households used to be richer than the control group before the shock, so their income differences are not zero before the shock.
(2) Although Treat*Post coefficient is significant in the first model, Treat*year2007-Treat*year2015 are not significant in the second model. Why?
(if I change the base year to 2007, the coefficients will be significant because all values shift down.)
GEE and distributional assumptions
Hello all,
I am using GEE to model my dependent variable. The dependent variable has a lower bound of 0 (its observed value, not censored or truncated) and can take on larger values as well. However, in my dataset there are a lot of zeros for the dependent variable (about 80% of the time). Would it be acceptable to run a linear GEE model here (assuming that I probe my results using alternative approaches)? From what I understand, GEE is a quasi-likelihood estimator with weaker distributional assumptions, so my thought was that this would be okay, but I'd be interested in hearing others' thoughts. To be clear, I am interested in using and defending this approach for my analysis.
Thank you in advance!
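A hedged sketch of a comparison worth running, with hypothetical variable names: a linear (Gaussian/identity) GEE next to a Poisson/log GEE. The log link keeps fitted values nonnegative, and since GEE only requires a correctly specified mean and a workable working correlation, the mass of zeros does not by itself rule either one out:
Code:
xtset id wave
* linear GEE
xtgee y x1 x2, family(gaussian) link(identity) corr(exchangeable) vce(robust)
* Poisson-family GEE as a robustness check; only the mean model is assumed
xtgee y x1 x2, family(poisson) link(log) corr(exchangeable) vce(robust)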
repeated time values within panel
Dear Statalisters!
I have problems with my time variable. I tried to use this command:
xtset importer1 Year
But I get this error message:
repeated time values within panel
My data looks like this:
(screenshot of the dataset attached)
I encoded Importer and Year with these commands:
encode Importer, gen(importer1)
encode Year1, gen(Year)
(screenshot attached)
In my data, I have 27 Importers and around 165 Exporters. I want to examine how imports from the Exporters to the Importers change during the crisis, depending on whether the Importer has the euro or not.
The problem seems to be that the same year appears several times for the same Importer. Is it even possible for me to use panel data with my data set? If it is, how should I proceed?
Best reg(ression)ards,
Gabriel Bladh
Stockholm
Sweden
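A tentative sketch: with one row per importer-exporter-year, the panel unit has to be the importer-exporter pair; the exporter identifier below is an assumption. Note too that -encode- on a year string yields arbitrary codes, so a real numeric year is safer:
Code:
duplicates report importer1 Year      // confirms several rows per importer-year
destring Year1, gen(year)             // numeric year instead of encode
egen pairid = group(Importer Exporter)
xtset pairid year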
discrepancy between mixed results and contrast command
Hi, I'm running a mixed model for longitudinal data with a two-by-two categorical interaction (all other variables being continuous). grceintra is coded 0 for low EC and 1 for high EC. time is coded 1 for time1, 2 for time2, 3 for time3, and 4 for time4.
here is the mixed command and the results :
Code:
xtmixed rmssd i.grceintra##i.time alc caf cig bmi ||id:alc caf cig bmi , residuals(un, t(time))
Code:
Mixed-effects ML regression                     Number of obs      =       268
Group variable: id                              Number of groups   =        68
                                                Obs per group: min =         3
                                                               avg =       3.9
                                                               max =         4
                                                Wald chi2(11)      =     45.69
Log likelihood = -56.68586                      Prob > chi2        =    0.0000

--------------------------------------------------------------------------------
         rmssd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
   1.grceintra |  -.1516471   .1237396    -1.23   0.220    -.3941723    .0908781
               |
          time |
            2  |  -.0828655   .0517872    -1.60   0.110    -.1843665    .0186355
            3  |   -.235263   .0556458    -4.23   0.000    -.3443267   -.1261993
            4  |  -.1622465     .04143    -3.92   0.000    -.2434478   -.0810453
               |
grceintra#time |
          1 2  |   .1037683   .0721843     1.44   0.151    -.0377103    .2452469
          1 3  |   .0890882   .0782335     1.14   0.255    -.0642466     .242423
          1 4  |   .1410346   .0577477     2.44   0.015     .0278511    .2542181
               |
           alc |  -.1268449   .0641427    -1.98   0.048    -.2525624   -.0011275
           caf |  -.0009671   .0437007    -0.02   0.982    -.0866189    .0846846
           cig |   .0116784   .0402781     0.29   0.772    -.0672652    .0906221
           bmi |   .0007147    .019349     0.04   0.971    -.0372087     .038638
         _cons |   4.209916   .4915439     8.56   0.000     3.246508    5.173325
--------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Independent              |
                     sd(alc) |   1.81e-09          .             .           .
                     sd(caf) |   1.99e-09          .             .           .
                     sd(cig) |   1.65e-10          .             .           .
                     sd(bmi) |    .022004          .             .           .
-----------------------------+------------------------------------------------
Residual: Unstructured       |
                      sd(e1) |   .1679547          .             .           .
                      sd(e2) |   .2756804          .             .           .
                      sd(e3) |   .3655816          .             .           .
                      sd(e4) |    .129396          .             .           .
                 corr(e1,e2) |   .1695978          .             .           .
                 corr(e1,e3) |   .5009059          .             .           .
                 corr(e1,e4) |  -.2689602          .             .           .
                 corr(e2,e3) |   .6809035          .             .           .
                 corr(e2,e4) |  -.0108121          .             .           .
                 corr(e3,e4) |   .4826205          .             .           .
------------------------------------------------------------------------------
LR test vs. linear model: chi2(13) = 330.39               Prob > chi2 = 0.0000
when I use the contrast command to test the main and interaction effects, the result is the following:
Code:
contrast time##grcetot

Contrasts of marginal linear predictions

Margins      : asbalanced

------------------------------------------------
             |         df        chi2     P>chi2
-------------+----------------------------------
rmssd        |
        time |          3       34.92     0.0000
             |
     grcetot |          1        3.15     0.0760
             |
time#grcetot |          3        6.26     0.0998
------------------------------------------------
so, it's a little bit disturbing, as:
1. the mixed results show that high EC participants have lower rmssd (the DV) than low EC (coef = -.15), but the contrast command says there is no main effect of the IV (grcetot chi2 = 3.15, p = .076).
2. the interaction term shows that the high EC group exhibits a significant gain of .14 between time1 and time4 relative to the low EC group, but again the overall interaction term is not significant (chi2 = 6.26, p = .0998).
In social science we are not used to computing follow-up analyses after regressions, because the coefficients in the mixed table are considered sufficient. But I'm a little bit obsessive with stats!!! (sorry).
I don't know what to conclude from such a discrepancy. Any help is welcome...
best
carole
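One hedged way to probe this: the joint Wald tests from contrast average over the other factor (asbalanced), while each coefficient in the mixed table is a simple effect at the other factor's base level, so they can legitimately disagree. Note also that the model was fit with grceintra while the contrast above was run on grcetot; if those are different variables, that alone could explain part of the discrepancy. A sketch that tests the variable actually in the model and looks at the simple effects:
Code:
* test the interaction variable that is actually in the model
contrast grceintra##time
* group gap at the first and last waves (simple effects, not averaged effects)
margins, dydx(grceintra) at(time = (1 4))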
How to remove observations with no change in the dependent variable in a regression ?
I have a panel data set that is of the form
where shpro is a variable that identifies the same product within the same shop, i.e. it is product-shop specific. date is the date of the price reading, and price is the price of the product in that shop.
I am carrying out a fixed-effects regression with time fixed effects and shop-product fixed effects. I wish to condition my regression on the price of each product at date=1 being different from the price of that same product at date=4. I initially generated variables for the price at date=1 and date=4, but of course this does not work, since each observation has only one date. I have an inkling that I may need to reshape the data, but I am not entirely sure how to do this.
Any help will be so much appreciated.
shpro | date | price |
1 | 1 | 100 |
1 | 2 | 100 |
1 | 3 | 100 |
1 | 4 | 100 |
2 | 1 | 98 |
2 | 2 | 100 |
2 | 3 | 102 |
2 | 4 | 104 |
3 | 1 | 99 |
3 | 2 | 100 |
etc. | etc. | etc. |
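A sketch that avoids reshaping (untested, and it assumes the structure shown in the table): copy each shop-product's date-1 and date-4 prices onto all of its rows, then restrict the estimation sample:
Code:
gen p1 = price if date == 1
gen p4 = price if date == 4
bysort shpro (p1): replace p1 = p1[1]   // missings sort last, so [1] is the value
bysort shpro (p4): replace p4 = p4[1]
xtset shpro date
xtreg price i.date if p1 != p4, fe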
Panel Regression - Top 10% of income of each industry each year
Dear all,
Unfortunately I am new to Stata and I don't really know how to proceed. I want to run a regression on the top 10% of income within each industry in each year. I have 10 different industries and 14 years. I thought about creating dummy variables, and I already generated dummies for the industries (industry1, industry2, industry3, etc.) and the years (year1, year2, year3, etc.). But now the problem is: how can I tell Stata to create a dummy for the top 10% of each industry in every year? Or am I overcomplicating this and there is another command that does it?
Thank you in advance!
Best regards,
Corn
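A minimal sketch with assumed variable names income, industry, and year; egen's pctile() function respects by-groups, so no industry or year dummies are needed for the flag itself:
Code:
bysort industry year: egen p90 = pctile(income), p(90)
gen byte top10 = income >= p90 if !missing(income)
regress depvar x1 x2 if top10 == 1    // depvar, x1, x2 are placeholders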
Three way tables using svyset
Hello,
I am using Stata 13 and I have a question regarding three way tables while using complex survey data. My dataset is weighted and stratified and I would like to make a three way table. Unfortunately, the table command doesn't work for svy.
I have tried:
Code:
svy: prop var1, over(var2 var3)
However, the proportions this command provides are the proportions of a given group within a group of var1, and I would like to know the proportion of this given group over all observations. Would this be possible?
Thank you!
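One hedged way to get shares of the total: build a single cell identifier from the three variables and take survey proportions of that variable, so every share is relative to all observations rather than within var1:
Code:
egen cell = group(var1 var2 var3), label
svy: proportion cell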
factor variables and time-series operators not allowed
Hi,
I am trying to run the following code, but get the error message "factor variables and time-series operators not allowed". Steps one through three work, but at steps four and five I get the error message.
(1) xtologit CSRRS_n PPE INTAN RND CH LEV ROA OI Growth NLCF CETR ln_employees i.DataYearFiscal, vce(robust)
est store r1
(2) xtologit CSRRS_n PPE INTAN RND CH LEV ROA OI Growth NLCF GETR ln_employees i.DataYearFiscal, vce(robust)
est store r2
(3) esttab r1 r2 using "Regression.rtf",
(4) replace stats(N chi2 p) b(3) aux(se 3) star(* 0.10 ** 0.05 *** 0.01) obslast onecell nogaps
(5) compress title(Regressions) addnotes(p-levels are two-tailed, * p < 0.10, ** p < 0.05, *** p < 0.01; the numbers within the round parentheses are robust standard errors.)
Any help will be greatly appreciated.
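A hedged reading of the error: steps (3)-(5) have to be a single -esttab- call. Typed on its own, line (4) starts with replace, which Stata parses as the -replace- command, and that is what complains about factor variables. In a do-file, continue the line with /// (options copied from the post):
Code:
esttab r1 r2 using "Regression.rtf", replace                        ///
    stats(N chi2 p) b(3) aux(se 3) star(* 0.10 ** 0.05 *** 0.01)    ///
    obslast onecell nogaps compress title(Regressions)              ///
    addnotes("p-levels are two-tailed; robust standard errors in parentheses.")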
Thursday, December 27, 2018
Is it possible to divide a variable by the mean across individuals for a regression???
Hello,
is it okay to divide each individual's value of a variable by the sample mean and then use this transformed variable in a regression?
For example:
Code:
sysuse auto, clear

. reg price trunk weight displacement gear_ratio

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(4, 69)        =      8.54
       Model |   210211246         4  52552811.6   Prob > F        =    0.0000
    Residual |   424854150        69  6157306.52   R-squared       =    0.3310
-------------+----------------------------------   Adj R-squared   =    0.2922
       Total |   635065396        73  8699525.97   Root MSE        =    2481.4

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trunk |  -63.64507   91.74253    -0.69   0.490    -246.6664    119.3763
      weight |   2.160798   .8998892     2.40   0.019     .3655685    3.956028
displacement |   10.36613   8.266774     1.25   0.214    -6.125634    26.85789
  gear_ratio |   2192.778   1140.727     1.92   0.059    -82.91105    4468.466
       _cons |  -8139.774   4688.715    -1.74   0.087     -17493.5    1213.956

egen meanprice = mean(price)
gen dividedprice = price/meanprice

. reg dividedprice trunk weight displacement gear_ratio

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(4, 69)        =      8.54
       Model |  5.53036265         4  1.38259066   Prob > F        =    0.0000
    Residual |   11.177316        69  .161990087   R-squared       =    0.3310
-------------+----------------------------------   Adj R-squared   =    0.2922
       Total |  16.7076786        73   .22887231   Root MSE        =    .40248

------------------------------------------------------------------------------
dividedprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trunk |  -.0103232   .0148806    -0.69   0.490    -.0400091    .0193627
      weight |   .0003505    .000146     2.40   0.019     .0000593    .0006417
displacement |   .0016814   .0013409     1.25   0.214    -.0009936    .0043563
  gear_ratio |   .3556669   .1850251     1.92   0.059    -.0134481    .7247819
       _cons |  -1.320265    .760506    -1.74   0.087    -2.837433    .1969028
------------------------------------------------------------------------------
Would this cause any trouble? The motivation is to see what factors affect whether the price lies above the sample average.
Thank you!
Assigning variable values to observations based common values of other variables...
I have a data set with 3 variables: one identifies the contract number; one identifies the type of contract (two values: prime or sub); and one lists which agency let the contract. The authorizing agency is only listed for the prime contracts. I need to assign each sub contract the same agency value as its prime.
For Example...
Contract Number | Contract Type | Agency |
121212 | Prime | XX |
121212 | Sub | |
121212 | Sub | |
343434 | Prime | SS |
343434 | Sub | |
343434 | Sub | |
565656 | Prime | ZZ |
565656 | Sub | |
565656 | Sub |
What Stata code can I use so that each Sub gets the Agency value of its Prime? I have 2,147 contracts, with 435 primes, 1,712 subs, and 25 agencies.
Thanks!
Steven Pitts
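A minimal sketch, assuming the variables are named contractno, type, and agency (agency blank for subs, as in the table). Since "Prime" sorts before "Sub", the prime row comes first within each contract:
Code:
bysort contractno (type): replace agency = agency[1] if missing(agency)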
Merging dates
Hi,
I'm trying to combine two date variables from the same dataset, but I would like to prioritize one over the other:
considering date1 and date2
I would like to generate date3 = date1, except if date1 is missing, then replacing by date2
I tried :
gen date3 = date1
replace date3 = date2 if date 1 == "."
but I got a type mismatch message, even though my variables are all numeric daily dates (float).
I hope I'm clear and someone can help me,
Many thanks
El
Ps: Merry Christmas!
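A hedged sketch of the likely fix: for a numeric daily date, missingness is tested with missing() (or == .), not against the string ".", which is what triggers the type mismatch:
Code:
gen date3 = date1
replace date3 = date2 if missing(date1)
format date3 %td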
getting estimates when using bayes prefix for melogit
Hi Stata forum members,
I need some advice on how to get estimates after fitting melogit in a Bayesian framework. I have tried using the -parmest- command, but I get an error that says "Estimates matrix e(b) must have exactly 1 row".
Below is my example code:
Code:
sysuse auto, clear
bayes: melogit foreign trunk || rep78:
parmest, format(estimate min95 max95 %8.2f p %8.1e) list(,)
Can someone help? Thanks in anticipation.
Madu
Just to add that I use Stata/SE 15.1 and the error number is r(498);
1-to-(n) Propensity score matching without replacement
Hi,
I was hoping someone could help me with this. I have a data set with about 100 cases and 6000 controls. I want to create a propensity-score-matched cohort of 1 case : 3 controls (propensity score generated from a set of baseline variables like age, gender, kidney function, etc.). The -psmatch2- command does not let me do 1-to-many matching without replacement when using the n() option:
"psmatch2 treatment_variable , pscore(logit1) caliper (.2) noreplacement n(3)"- returns error message
"psmatch2 treatment_variable , pscore(logit1) caliper (.2) n(3)"- does propensity matching with replacement (not what I am looking for)
Can anyone please suggest how to do this or share code to overcome it? I am using Stata 15. I can't use the -teffects- command because I need the IDs of the matched controls to do survival analysis on the final matched cohort.
Thank you so much in advance.
Concatenate of a string and a number
Dear statalister,
I am trying to merge two databases, and I would like to use a concatenation of country and year as the key; the first is a string and the second a number. Is there a function or command to do this? I tested strcat and something else I found in the forum, but one is for two strings and the other for two numbers.
Thank you for your kind help.
Best regards,
Alejandro
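A minimal sketch, with assumed names country (string) and year (numeric): convert the number with string() and concatenate, or let egen's concat() function do the conversion:
Code:
gen key = country + "_" + string(year)
* or equivalently
egen key2 = concat(country year), punct("_")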
Convert string to time including milliseconds
I have a variable containing strings in the following format
StringVar
"2018-12-27 14:28:41.4861930"
I would like to convert it into a variable with a time format Stata will recognize, keeping precision to the millisecond level. E.g. the number of milliseconds since 1960 would be perfect.
I tried among other things the following, but it delivered only missing values.
gen time2=clock(StringVar,"DMYhms")
Any ideas?
Thank you for the help...
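A hedged sketch of the two likely fixes: the string is year-month-day, so the mask must be "YMDhms" rather than "DMYhms", and clock values need a double to keep millisecond precision (digits beyond milliseconds should simply be rounded away):
Code:
gen double time2 = clock(StringVar, "YMDhms")
format time2 %tc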
Joinby two variables
Dear statalisters,
I am trying to merge two datasets and I have some problems. I started yesterday merging using
Code:
joinby firm
and everything was ok. Today I am trying to use
Code:
joinby country year
but I have a problem: I think I have created duplicate data. My master dataset has about 1 million observations (roughly 1.3 GB) and the second dataset about 170,000 observations (roughly 10 MB). The joined dataset is about 20 GB and 20 million observations.
Do you know why the size and the number of observations changed like that? I think there are duplicates; how can I check whether there are, and what can I do if so?
Thank you very much for your help.
Alejandro
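A hedged diagnosis: -joinby- forms all pairwise combinations of records within each country-year group, so duplicates on the keys in either dataset multiply the result. Checking both files before joining:
Code:
duplicates report country year        // run in each dataset
duplicates tag country year, gen(dup)
list country year if dup > 0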
Generate multiple variables from a variable containing symbols and numbers
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 salary_today
"243,250 (307,840) (253,454)"
"322,043 (342,970)"
"279,102 (365,736)"
"126,025[12]"
"247,579††"
"166218"
"138,740†††"
"161349"
"130,646 (204,309)"
"254160"
"238908"
"129,517 (175,081)***‡‡‡"
"228190"
"117763"
""
"188,723"
"135586"
"161,349 (197,454)"
"162056"
end
Code:
. list

     +-----------------------------+
     | salary_today                |
     |-----------------------------|
  1. | 243,250 (307,840) (253,454) |
  2. | 322,043 (342,970)           |
  3. | 279,102 (365,736)           |
  4. | 126,025[12]                 |
  5. | 247,579††                   |
     |-----------------------------|
  6. | 166218                      |
  7. | 138,740†††                  |
  8. | 161349                      |
  9. | 130,646 (204,309)           |
 10. | 254160                      |
     |-----------------------------|
 11. | 238908                      |
 12. | 129,517 (175,081)***‡‡‡     |
 13. | 228190                      |
 14. | 117763                      |
 15. |                             |
     |-----------------------------|
 16. | 188,723                     |
 17. | 135586                      |
 18. | 161,349 (197,454)           |
 19. | 162056                      |
     +-----------------------------+
I would like to generate salary, salary_p1, salary_p2, and salary_note from salary_today. For example, for observation 12, salary will be 129517, salary_p1 will be 175081, salary_p2 will be missing, and salary_note will be ***‡‡‡.
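A rough sketch using Stata's unicode regex functions (Stata 14+); the variable names are taken from the sentence above, and bracketed footnotes like [12] would need an extra rule:
Code:
gen clean = subinstr(salary_today, ",", "", .)
gen salary      = real(ustrregexs(1)) if ustrregexm(clean, "^([0-9]+)")
gen salary_p1   = real(ustrregexs(1)) if ustrregexm(clean, "\(([0-9]+)\)")
gen salary_p2   = real(ustrregexs(2)) if ustrregexm(clean, "\(([0-9]+)\).*\(([0-9]+)\)")
gen salary_note = ustrregexs(1)       if ustrregexm(clean, "([^0-9()\[\] ]+)$")
drop clean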
margins not estimable
Hi all,
I ran a panel-data fixed-effects regression with an interaction term in the model. From the results I can see the marginal effect of the dummy. However, when I try to plot the margins graph, it returns "not estimable".
Please see attached.
What should I do then? What's the problem here?
Thanks!
Best,
Linda
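Without the attachment one can only guess, but "not estimable" after a fixed-effects model often comes from interactions built by hand. A sketch with hypothetical names, using factor-variable notation so margins can move the pieces jointly:
Code:
xtreg y i.dummy##c.x control1, fe
margins, dydx(dummy) at(x = (0(0.5)2))
marginsplot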
Foreach vs. Forvalues when using char() function to remove special characters in a string variable
Hello all,
Using Stata 15.1/IC
I need to submit a bulk file with a string variable ("NAME" in this example) that is required to have no special characters besides ampersand and dash. I am able to accomplish this using the following series of commands:
charlist NAME //shows which characters are in my string var NAME
"&',-./01234689ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnop qrstuvwxyz
egen NEWNAME= sieve(NAME), omit(,./`"""'`"'"') // generates new variable with the special characters omitted but retains & and -
Results:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str86(NAME NEWNAME)
"Single-Benefits, Inc." "Single-Benefits Inc"
"Superstar, LLC" "Superstar LLC"
"RML Agency, Inc." "RML Agency Inc"
"A & M Company, Inc." "A & M Company Inc"
end
While this approach works as intended, I wanted a command that does not depend on the specific characters to be omitted, which could change between datasets (e.g. a character like "+" or "@" would not be excluded by my code; I'd have to update the command manually). Plus, the way you have to set off double and single quote marks makes the log file hard to read.
I thought I could use the char() function to generalize the command by looping over the integer values of the ASCII characters with a forvalues loop (under the assumption that I will not run into any non-ASCII special characters), but I get the following error:
. forvalues i = 33/37 39/44 46/47 58/64 91/96 123/126 {
2. replace NAME = subinstr(NAME, char(`i'), "", .)
3. }
invalid syntax
r(198);
I am, however, able to use the foreach command without error:
. foreach i in 33 34 35 36 37 39 40 41 42 43 44 46 47 58 59 60 61 62 63 64 91 92 93 94 95 96 123 124 125 126 {
2. replace NAME =subinstr(NAME, char(`i'), "", .)
3. }
My question is why the forvalues command doesn't work. My presupposition is that I did something wrong syntax-wise, but I also wondered whether Stata treats values in the char() function differently than I thought when used with forvalues.
Of course, if there is an even better way to accomplish the elimination of all special characters besides ampersands and dashes, I am all ears. Thanks for any advice.
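On the forvalues question: -forvalues- accepts exactly one range, which is why the multi-range syntax is invalid. -foreach ... of numlist- takes several ranges, so the original idea works with one small change (ranges copied from the post):
Code:
foreach i of numlist 33/37 39/44 46/47 58/64 91/96 123/126 {
    replace NAME = subinstr(NAME, char(`i'), "", .)
}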
Using -cmp- to estimate and interpret a three-stage Heckman model
Good morning all,
I am using the -cmp- package developed by Roodman to estimate a three-stage Heckman selection model. I am using the following code:
Code:
cmp (stage3 = )(stage2 = ) (stage1 =), ind(stage2*$cmp_probit stage1*$cmp_probit $cmp_probit)
While the model has been estimated and I am generally able to interpret it, I had a few questions.
First, standard Heckman models have a rho parameter tied to the inverse Mills ratio; it controls for selection bias in the second stage of the regression. When estimating the above Heckman model, however, there are three rho parameters, each with numbers attached: rho_12, rho_13, and rho_23. I assume this means there is a rho parameter used in stage 2 that comes from stage 1, one in stage 3 that comes from stage 1, and one in stage 3 that comes from stage 2. While this interpretation makes sense, why does rho_13 exist? Should the inverse Mills ratio of stage 1 really be put into stage 3? Would I need to constrain that parameter to zero? Some advice would be appreciated, as constraining the parameter to zero substantively changes my results.
Second, I am using probit models and want to interpret the coefficients using margins. I cannot, however, seem to write the code necessary to get marginal effects at the third stage of my model conditional on the first two stages. Here is the code from the cmp help file that is closest to what I want:
Code:
cmp (wage2 = education age) (selectvar = married children education age), ind(selectvar*$cmp_probit $cmp_probit)
qui margins, dydx(*) predict(pr eq(wage2) condition(0 ., eq(selectvar)))
This code replicates the margins, predict(pcond) approach for getting marginal effects in the second stage of a Heckman probit model in base Stata: it conditions margins on the first stage being equal to 1. I want to do the same with cmp, except conditioning on both the first and second stages being equal to 1. How would I do this?
Thanks in advance to anyone who can help. I greatly appreciate it!
- Garrett
Hyperlink to the file generated/modified by putexcel
This is a very minor request/question. Several of the user-written commands I use (e.g. estout and iebaltab) have a nifty feature where they provide a hyperlink to the file that they write, so that you can just click from the Results window rather than browsing through your files. Is there any way to get putexcel to do this as well? I've been searching around but can't find much information about how this works. Thanks!
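A hedged workaround: the clickable links those commands print are SMCL {browse} directives, and you can display one yourself after writing the file (the file name here is hypothetical):
Code:
putexcel set "results.xlsx", replace
putexcel A1 = "hello"
display as smcl `"{browse "results.xlsx":results.xlsx}"'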
grouped variables
Hi
I currently have a variable for income following this structure:
Therefore, when I run summary statistics, Stata gives me the mean of the category code ("option"), not of the labelled range. Is there any way I can re-code the variable to the ranges shown, or alternatively run summary statistics so I get a meaningful mean for the grouped variable? In general I do not understand how to deal with a grouped variable and have struggled to find the relevant information. Many thanks.
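One common workaround, sketched with placeholder bands and midpoints (substitute the real ranges from the value labels): recode the band codes to interval midpoints before summarizing:
Code:
recode income (1 = 5000 "0-9,999") (2 = 15000 "10,000-19,999") (3 = 25000 "20,000+"), gen(income_mid)
summarize income_mid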
Standard errors using Frisch-Waugh-Lovell theorem
Hi,
I need to implement the Frisch-Waugh-Lovell theorem in Stata 15 MP (64-bit) for a research project. To illustrate, I'd like to abstract from my actual problem and focus on the following MWE. In the example, I'd like to show that the coefficient on headroom can be obtained in two ways: either through a standard OLS estimation with two regressors in total, or through partialling out the first regressor, trunk.
Code:
sysuse auto2, clear

* Multivariate regression
reg price trunk headroom

* Partialling out
reg headroom trunk, vce(robust)
predict double resid_x2, res
reg price trunk
predict double resid_y, res
reg resid_y resid_x2
My trivial question is: why are the standard error of headroom from the multiple regression and the standard error of the partialled-out coefficient on headroom not exactly equal? The coefficients themselves correspond (which is what I wanted to see); however, I seem to not understand the procedure properly, since the standard errors should also correspond, right? Where is my mistake?
Thank you very much in advance.
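A hedged explanation: the point estimates agree exactly, but the residual-on-residual regression divides the same residual sum of squares by n-2 (it believes only one slope plus a constant were estimated), while the full model divides by n-3, so the reported standard errors differ by a degrees-of-freedom factor. A minimal sketch of the check, under that assumption:
Code:
sysuse auto2, clear
reg price trunk headroom
scalar se_full = _se[headroom]
reg headroom trunk
predict double rx2, resid
reg price trunk
predict double ry, resid
reg ry rx2
scalar se_adj = _se[rx2] * sqrt((e(N) - 2)/(e(N) - 3))
display se_adj "  vs  " se_full    // should agree up to rounding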
Wednesday, December 26, 2018
questions about model selection with lassopack
Dear Stata users,
Sorry to ask you 3 simple questions.
1.When we used lassopack for selecting predictors, if the predictor is a categorical variable, should we just put it in the code, or add "i." before the variable?
Should we use this code:
lasso2 AO agec i.sex i.edu3 i.jobm i.incomef i.snec i.dnec1 , plotpath(lambda)
cvlasso AO agec i.sex i.edu3 i.jobm i.incomef i.snec i.dnec1 , lopt seed(123)
Or this code:
lasso2 AO agec sex edu3 jobm incomef snec dnec1 , plotpath(lambda)
cvlasso AO agec sex edu3 jobm incomef snec dnec1 , lopt seed(123)
2. Must we use cvlasso to select the predictors?
When we finish the lasso2 run, at the bottom of the results there is an explanation: Type "lasso2, lic(ebic)" to run the model selected by EBIC.
My question is: which one should model selection be based on, EBIC or lambda?
3. After we run the lasso code and get the final model, the p-values for some predictors are above 0.05. Is that OK?
Many thanks and best wishes!
Jing Pan
Is there a way to rename large number of variables with a single command? (Details)
I have a ton of variables, for example var1_m, var2_m, var3_m, etc. I want to turn them into var1_2016, var2_2016, var3_2016, etc. Basically, changing the _m at the end into _2016. Thanks!
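The built-in rename-group syntax (Stata 12 and later) does this in one line:
Code:
rename *_m *_2016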

Generate with tempfiles
I used Stata tempfile code from one of the earlier posts to append multiple years of NHIS mortality data. The code worked perfectly. However, I had to manually generate an interview-year variable in each data set before appending them. I'm hoping someone can show me how to generate a new variable, year, inside the loop, as it will help me generate other variables. Here is the code I used:
Code:
clear
set more off
local flist: dir "." files "*.dta"
use NHIS_1986_MORT_2011_PUBLIC  // is there a way to run the code without specifying the using dataset?
local mort = 0
foreach fname of local flist {
    local ++mort
    tempfile temp`mort'
    save "`temp`mort''"
}
forval i = 1/`mort' {
    append using "`temp`i''"
}
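A hedged rework: loading each file inside the loop removes the need for the initial -use-, and the interview year can be pulled from file names like NHIS_1986_MORT_2011_PUBLIC.dta (treating the first 4-digit run as the interview year is an assumption about the naming scheme):
Code:
clear
local flist : dir "." files "*.dta"
tempfile building
local first = 1
foreach fname of local flist {
    use "`fname'", clear
    * derive the year from the file name
    if ustrregexm("`fname'", "[0-9][0-9][0-9][0-9]") {
        gen int year = real(ustrregexs(0))
    }
    if `first' {
        save "`building'"
        local first = 0
    }
    else {
        append using "`building'"
        save "`building'", replace
    }
}
use "`building'", clear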
Looking for examples of OSIRIS dictionaries and data
Dear Statalisters,
I am writing a custom converter of data from OSIRIS dictionaries and need examples for testing.
I know ICPSR has a bunch of old datasets in this format, but I can't get access to it since it is all behind their login screen.
If anyone knows of publicly available examples of OSIRIS dictionaries (type I, or any other type), please point me to those resources.
If you can share data privately (a few observations should be sufficient), please sent me a message directly.
Thank you, Sergiy Radyakin
Strategy to choose the right controls? Conceptual questions
Hello,
I have a problem with choosing the right controls for my model and hope someone can help me along.
With the model I want to explain savings, human capital, and labour supply by the timing of the demographic transition (DT). DT is the variable of interest. The model is the same for all three:
dependent_2010 = ß_0 + ß_1*dependent_1990 + y1*DT + y2*DT² + c*Control, with subscripts i on the variables.
I have already chosen a proxy for urbanity and a dummy for war during the investigation period 1990-2010 as control variables.
Question 1: Do I have to expect that urbanity/war is both correlated with an independent variable AND affects the dependent variable, or is it enough to assume that urbanity/war affects the dependent variable WITHOUT correlation to any independent variable? In case correlation with an independent variable is needed: my model includes one lagged value. If I assume that a control variable affects the dependent variable, I simultaneously assume it affects one of the "independent" variables, the lagged value, too. However, I have a feeling this would not force me to include such a control variable in the model, does it? I hope this makes sense.
Question 2: I have thought about including the level of income in 1990, but would this be purposeful? Economic models usually explain income as a function of savings, human capital, and labour supply, so explaining these variables with income would be misleading, although it sounds logical that poor nations save less or can only afford little education. Should I exclude variables where reverse causality could occur?
Question 3: Controlling for life expectancy would seem reasonable, because expecting a longer life, people could tend to save more, work more, and get more education. Now, this is tricky for me: the demographic transition (DT) is initiated by falling mortality rates and, in this manner, also by rising life expectancy. That means life expectancy and DT must be correlated. Is it still okay to include life expectancy in the model? I fear that, because DT is a result of life expectancy, I could erase potential effects of DT on the dependent variable.
Off Topic- Question 4: After I receive the estimation results and, say, find significant effects of DT on the dependent variable. What phrases am I "allowed" to state? I surely can't say, 'The result is that DT causes the dependent variable to rise/fall.' Is this really all I can say: 'We cannot reject that the effect of DT on the dependent variable is non-existent'?
I apologize as this is no direct question about Stata, but the help received on this forum is very valuable and I don't know where else to ask.
As always, thank you!!
Generate a new variable by deleting everything after a certain character ("/")
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str28 country
"UK"
"France"
"France / Singapore / UAE"
"Switzerland"
"Spain / US"
"Italy"
"Switzerland"
"France"
"Netherlands"
"UK"
"FR / GB / DE / ES / IT / PL"
end
Code:
     +-----------------------------+
     | country                     |
     |-----------------------------|
  1. | UK                          |
  2. | France                      |
  3. | France / Singapore / UAE    |
  4. | Switzerland                 |
  5. | Spain / US                  |
     |-----------------------------|
  6. | Italy                       |
  7. | Switzerland                 |
  8. | France                      |
  9. | Netherlands                 |
 10. | UK                          |
     |-----------------------------|
 11. | FR / GB / DE / ES / IT / PL |
     +-----------------------------+
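A minimal sketch: keep the text before the first "/" and trim the surrounding blanks; rows without a "/" are left unchanged:
Code:
gen country_first = country
replace country_first = strtrim(substr(country, 1, strpos(country, "/") - 1)) if strpos(country, "/")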
Non-linear hypotheses testing with a GSEM
Hello, I'm having trouble with the testnl command after a gsem model. Specifically, my problem is that I can't refer to the covariance coefficient of my gsem estimation within the testnl command.
Here is a minimal working example of my problem:
Code:
sysuse auto
gsem (price <- mpg rep78) (trunk <- length turn), cov(e.price*e.trunk)
gsem, coeflegend
testnl _b[trunk:length] = 0
testnl _b[/var(e.price)] = 0
testnl _b[/cov(e.price,e.trunk)] = 0
In the previous example, everything works fine until I test the hypothesis that the covariance between the error terms is 0, where I get an "option e.trunk not allowed" error.
I guess the comma inside the covariance name is interpreted as an option separator, but I don't know how else I can refer to this covariance. I inspected the e(b) matrix, but the coefficient has the same name there.
Any help would be appreciated. Thanks in advance for your responses.