Monday, December 31, 2018

Firm and Industry Effects Together?

Hi,

In the context of corporate finance, some studies claim to use firm and industry fixed effects together in panel data regressions. However, since firm fixed effects absorb all time-invariant variables, how can a researcher also include industry effects (a firm's industry generally remains the same over time) in the same regression? I understand that if the industry of even a single firm in the dataset changes from one year to the next, it becomes mechanically possible to estimate the fixed-effects regression. But since the industry of a firm usually stays the same across time for almost the entire sample, how reliable are the beta coefficients of the independent variables in a regression with both firm and industry effects?
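As an aside, a minimal illustration of the collinearity at issue (a hedged sketch with hypothetical variable names y, x, and firm_id): once firm fixed effects are absorbed, Stata drops the time-invariant industry dummies automatically.
Code:
* hedged sketch, hypothetical names: the industry dummies are collinear
* with the absorbed firm effects and will be reported as omitted
areg y x i.industry, absorb(firm_id)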

Here are some papers which employ firm and industry effects together:

Thakur, B., & Kannadhasan, M. (2018). Corruption and cash holdings: Evidence from emerging market economies. Emerging Markets Review, 38, 1-17. doi: 10.1016/j.ememar.2018.11.008

Venkiteshwaran, V. (2011). Partial adjustment toward optimal cash holding levels. Review of Financial Economics, 20(3), 113-121. doi: 10.1016/j.rfe.2011.06.002

Thanks!

stata

Hello. I select the Excel file and import it, then run the following code in Stata:
gen area = substr(ADDRESS, 1, 1)
This works for all my files, but in one case I get the following error:
type mismatch
r(109);

What is the problem?
Thanks for the help.
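A plausible cause, offered as an assumption: in the failing file, ADDRESS was imported as a numeric variable, and substr() only accepts strings. A minimal sketch of a fix:
Code:
* check the storage type, then convert to string if needed
describe ADDRESS
tostring ADDRESS, replace
gen area = substr(ADDRESS, 1, 1)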

variable names as elements?

Dear All, I came across the following question. The data are:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int ExpertID byte(domain_A domain_B domain_C domain_D domain_E)
290 . . 1 . .
 90 1 . . . .
149 1 . . . .
 11 1 1 1 0 0
181 1 1 1 0 0
 17 1 . . 1 .
142 1 . 1 . .
 40 1 1 . . .
106 . . . 1 .
182 1 0 0 0 0
end
and the desired result is
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int ExpertID str10 domain
290 C
 90 A 
149 A 
 11 ABC
181 ABC
 17 AD
142 AC 
 40 AB
106 D
182 A 
end
The rule is, for example: for ExpertID=290, only domain_C equals 1, so the desired result is domain = C. Likewise, for ExpertID=11, domain_A, domain_B, and domain_C all equal 1, so the desired result is domain = ABC, and so on. Any suggestion is appreciated.
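One way to build this (a minimal sketch): loop over the domain letters and append each letter wherever the corresponding indicator equals 1.
Code:
gen str10 domain = ""
foreach l in A B C D E {
    replace domain = domain + "`l'" if domain_`l' == 1
}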

Variable not found in nlcom

Dear Everyone,

I'm new here, and I would like to measure willingness to pay using a double-bounded analysis. I have the constant and the coefficients for each independent variable, but I get a 'not found' message for my variable. Is there a mistake in my command?

nlcom (wtp_b[_cons]+Emplo_m*b[Emplo]+Income_m*b[Income]+TriedtoQuit_m*b[TriedtoQuit]+Heal_Res_m*b[Heal_Res]+PeerInfluence_m*b[PeerInfluence]+Toquit_m*b[Toquit]+Notice_m*b[Notice])), noheader

I have seven independent variables and this is what I received:
Emplo_m not found
r(111);

Thanks for your kind assistance.
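A hedged reading of the error: nlcom refers to coefficients as _b[name], and any other name in the expression (such as Emplo_m) must already exist as a variable or scalar. A sketch with two of the covariates, assuming the *_m terms are meant to be sample means:
Code:
summarize Emplo, meanonly
scalar Emplo_m = r(mean)
summarize Income, meanonly
scalar Income_m = r(mean)
nlcom (wtp: _b[_cons] + Emplo_m*_b[Emplo] + Income_m*_b[Income]), noheader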


Sunday, December 30, 2018

Insufficient observations to compute bootstrap standard errors

I am trying to perform mi imputation with bootstrap using the following syntax:
mi set wide
program define myboot, rclass
    mi register imputed varlist....
    mi impute mvn varlist....., add(187)
    egen country1 = group(country)
    mi xtset country1 year, yearly
    mi estimate: xtreg varlist.....
    return scalar b_a = el(e(b_mi),1,1)
    return scalar b_b = el(e(b_mi),1,2)
    return scalar b_c = el(e(b_mi),1,3)
    return scalar b_d = el(e(b_mi),1,4)
    return scalar b_e = el(e(b_mi),1,5)
    return scalar b_f = el(e(b_mi),1,6)
    return scalar b_g = el(e(b_mi),1,7)
end
set seed 23543
bootstrap b_var1=r(b_a) b_var2=r(b_b) b_var3=r(b_c) b_var4=r(b_d) b_var5=r(b_e) b_var6=r(b_f) intercept=r(b_g), reps(2000) : myboot
I am facing the following problem after execution:
Bootstrap replications (2000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxx
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
Please guide
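A hedged debugging suggestion rather than a guaranteed fix: run a couple of noisy replications to see the error raised inside the program. Note that a statement like egen country1 = group(country) will fail on the second replication because the variable already exists, which alone can produce this message.
Code:
* run the program once on its own, then watch two noisy replications
myboot
bootstrap b_var1=r(b_a), reps(2) noisily : myboot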

Problem with a big database

Hello, I want to import an Excel file into Stata 14, but it will not let me: the file is 53.6 MB, and a pop-up window tells me that the maximum capacity for this type of file is 40 MB. Could you guide me on how to proceed? This database has been worked on in SPSS, but since I have Stata, I prefer to work on it in this program.
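One workaround, offered as an assumption about the setup: re-save the spreadsheet from Excel as a CSV, which is not subject to the 40 MB spreadsheet-import limit, and read it with import delimited (the filename below is hypothetical).
Code:
import delimited using "mydata.csv", clear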

area specific linear time trend

Hi All,



Question 1:

I'm running a difference-in-differences analysis on yearly repeated cross-sectional data. I'd like to include an area-specific linear time trend.

xtset id year

xtreg Y x1 x2 c.year_sequence#i.id, fe r


(note: year runs from 2005 to 2018; year_sequence runs from 1 to 14)


Is this code correct?



Question 2:


I'm running a difference-in-differences analysis on yearly data (t). I have multiple areas (i) and multiple jobs (j). How do I control for an area-specific trend (not linear)?

Is i.year##i.area correct?

xtset area year

xtreg Y x1 x2 i.job i.area##i.year, fe r
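A hedged aside on Question 2: with xtset area year and the fe option, the area main effects are already absorbed, so a fully flexible area-specific trend is usually written with the interaction alone. This sketch assumes the data vary by job within each area-year cell; otherwise the interaction saturates the model.
Code:
xtreg Y x1 x2 i.job i.area#i.year, fe vce(robust)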


Thanks, appreciated!!!!


Happy New Year!

new command -rdcont- on SSC: test of running variable continuity in RDD

Hello all! Thanks to Kit Baum, a new package rdcont is now downloadable from SSC! This program can be installed from SSC by typing ssc install rdcont in the Stata command window.

Description: A common practice in the regression discontinuity design (RDD) is to test the hypothesis that the running variable has a continuous density at the threshold. rdcont tests this hypothesis using an approximate sign test, as detailed in Bugni and Canay (2019). Relative to competing tests, the approximate sign test is asymptotically valid under mild conditions. The rdcont test is implemented by default using the data-dependent choice of “q” provided by Bugni and Canay (2019).

Example: The example below uses data from Lee (2008), which uses RDD to estimate the incumbency advantage in US elections, to test the assumption of continuity in the running variable (the difference in vote share between parties).
Code:
use http://fmwww.bc.edu/repec/bocode/t/table_two_final.dta, clear
rdcont difdemshare if use==1
Happy coding,
Joe

Changing many values at once

Hi all,

I am working with panel data from a household survey. For each household (nohhold), multiple observations are made in each year (one for each member of the household). Eqin (income) is reported only for the head of the household, but I want this value extended to all household members, since my research focuses on spouses. Is there an easy way to do this? For example, to give all observations of eqin for nohhold 106 in 2008 the value 6481.481?

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double nohhold float(year eqin)
106 2007         .
106 2007         .
106 2007         .
106 2007         .
106 2007         .
106 2007         .
106 2007         .
106 2007         .
106 2008  6481.481
106 2008          .
106 2008          .
106 2008          .
106 2009          .
106 2009  7711.111
106 2009          .
106 2009          .
106 2010          .
106 2010          .
106 2010          .
106 2010          .
106 2011          .
106 2011  8888.889
106 2011          .
106 2011          .
106 2012          .
106 2012  8888.889
106 2012          .
106 2012          .
106 2013         .
106 2013          .
106 2013          .
106 2013  8888.889
106 2014          .
106 2014          .
106 2014  8888.889
106 2014          .
106 2015  9796.296
106 2015          .
106 2015          .
106 2015          .
106 2016          .
106 2016          .
106 2016  8888.889
106 2016          .
106 2017  11111.11
106 2017          .
106 2017          .
106 2017          .
318 2007         .
318 2007         .
318 2007         .
318 2007         .
318 2007         .
318 2007         .
end
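A minimal sketch: because eqin is missing for everyone except the head, the household-year maximum recovers the head's value for all members.
Code:
bysort nohhold year: egen eqin_all = max(eqin)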

Fairlie decomposition

Hello,

I am using the fairlie Stata module for decomposition (https://ideas.repec.org/c/boc/bocode/s456727.html) to analyze the following model:

Independent variables: aa001, aa004, ba016, ea104, eb001, eb002, ec003
Dependent variable: eh041 (binary: 0, 1)
Group variable: groupvar (binary: 0, 1)

However, I am unable to locate any information, either on Statalist or elsewhere, that helps me interpret these results correctly. The publications by Fairlie didn't get me any further either.

Question: Can anyone please point me in the right direction for interpreting the results below?

Your response is highly appreciated!


The fairlie module is run using the following command:

Code:
fairlie eh041 aa001 aa004 ba016 ea104 eb001 eb002 ec023, by(groupvar)
This produces the following output:

Code:
Iteration 0:   log likelihood = -877.38553
Iteration 1:   log likelihood = -862.66744
Iteration 2:   log likelihood = -862.24213
Iteration 3:   log likelihood = -862.24169

Logistic regression                               Number of obs   =       2976
                                                  LR chi2(7)      =      30.29
                                                  Prob > chi2     =     0.0001
Log likelihood = -862.24169                       Pseudo R2       =     0.0173

------------------------------------------------------------------------------
groupvar     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       aa001 |   .4093181   .1622917     2.52   0.012     .0912323     .727404
       aa004 |   .0152344   .0065036     2.34   0.019     .0024875    .0279812
       ba016 |    -.16377   .0795174    -2.06   0.039    -.3196213   -.0079188
       ea104 |  -.0061618   .0080322    -0.77   0.443    -.0219046     .009581
       eb001 |   .2024671   .1832657     1.10   0.269    -.1567272    .5616613
       eb002 |  -.2667996   .1977646    -1.35   0.177    -.6544111    .1208119
       ec023 |   .1391324   .0841333     1.65   0.098    -.0257658    .3040306
       _cons |   1.896858   .6315424     3.00   0.003     .6590576    3.134658
------------------------------------------------------------------------------

Decomposition replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

Non-linear decomposition by groupvar (G)

                                                Number of obs     =      6,312
                                                  N of obs G=0    =       2976
                                                  N of obs G=1    =       3336
                                                  Pr(Y!=0|G=0)    =  .91330645
                                                  Pr(Y!=0|G=1)    =  .89868106
                                                  Difference      =   .0146254
                                                  Total explained =  .00011247
------------------------------------------------------------------------------
groupvar     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       aa001 |   .0004943   .0005906     0.84   0.403    -.0006633    .0016518
       aa004 |  -.0010569   .0006778    -1.56   0.119    -.0023854    .0002716
       ba016 |  -.0001179   .0005283    -0.22   0.823    -.0011533    .0009176
       ea104 |  -.0002608   .0004181    -0.62   0.533    -.0010804    .0005587
       eb001 |  -.0001183    .000191    -0.62   0.536    -.0004927    .0002561
       eb002 |   .0006998   .0005598     1.25   0.211    -.0003974     .001797
       ec023 |   .0004668   .0003892     1.20   0.230    -.0002961    .0012296
------------------------------------------------------------------------------

Issue with three-dimensional panel data analysis

Hi,

I am new to Stata and would like some advice on the following problem: I am dealing with a panel count-data model. My dependent variable is the number (count) of investment projects in each host country (i), in each sector (j), in a given year (t). The panel runs from 2003 to 2016, and I have 12 industries and 105 host nations. My two main explanatory variables vary by industry and time (ij) and by country, industry, and time (ijt); the control variables vary by i and t. I am therefore dealing with a three-dimensional panel:
i = country (105)
j = industry (12)
t = year (14)
I am using xtpoisson with the fe approach and robust standard errors.

After reading a lot of previous Statalist posts, I realized that in order to xtset my data I need a combined country*industry fixed effect:
egen panelid = group(country industries)
xtset panelid Year
xtpoisson Y X i.Year, fe vce(robust)

However, because I have a lot of countries (105) and a lot of zeros in my dependent variable, an important downside of this estimation is the loss of degrees of freedom from including all these dummy variables. Instead of interacting countries*industries, and because I don't want to combine the country and industry FE, I also tried putting them in the model separately:

For a model with industry FE:
xtset industry
xtpoisson Y X i.Year i.country, fe
I also created regional dummies in order to group my 105 countries, and incorporated them in the model:
xtpoisson Y X i.Year i.region, fe


My question is: is there any other way to model three-dimensional panel data without combining industries and countries, which generates so many dummies? At the same time, grouping industries and countries does not let me keep separate information about industries or countries.

I found this older post that was helpful for my decision:

https://www.statalist.org/forums/for...ata-regression

Any suggestions or advice will be greatly appreciated.

Thank you very much,



Questions about Data Setting

Dear Statalist,

I am now setting up my data for a DID analysis.
However, there is a problem with the arrangement of my data for DID.

My data look like this:

ID | revenue2007 | revenue2008 | revenue2009 | asset2009 | asset2010 | asset2011 | manufacture | wholesale | others

where manufacture, wholesale, and others are dummy variables (0/1 form).

I would like to rearrange the above data as follows:

ID year revenue asset type of business
1 0(2007)
1 1(2008)
1 2(2009)
1 0
1 1
1 2
1 0
1 1
1 2
2 0
.
.
where year 0 represents 2007, 1 represents 2008, and 2 represents 2009.
(Suppose the treatment occurred in 2008.)


I tried to use reshape, but I do not know how to apply it here.
Thanks in advance.

HJ
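A hedged sketch of the reshape: reshape long handles stubs whose year suffixes do not fully overlap (revenue2007-2009, asset2009-2011) by filling the missing combinations with missing values.
Code:
reshape long revenue asset, i(ID) j(year)
replace year = year - 2007
gen str11 type = cond(manufacture==1, "manufacture", cond(wholesale==1, "wholesale", "others"))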

Marginsplot, addplot - adjustment

Hello,
using the command -marginsplot, addplot(hist ...)- I got this graph:
[graph attached]

I would like to move the histogram to the bottom of the graph (e.g., to where the y (vertical) axis equals -0.2).

Thank you in advance!
Have a nice and creative new year!!!

Twoway line by county

Hi everyone,

For my Panel Data descriptive Analysis I am trying to graph the development of Charging stations per Km for my 18 counties.

Since I have monthly data, I first created yearly means for every county (to reduce the number of data points):

Code:
egen meanCHS4 = mean(ChStationsRoadKm), by(Year county)
For my graph I am using the following code

Code:
twoway line meanCHS4 Year, by(county)
see attached picture.

My problem now is that Oslo has a much higher number of charging stations per km than the other 17 counties, and since all the subgraphs use the same scale, there is not much information visible in the other 17 graphs. Is there a way to scale Oslo differently from the other 17 counties?

Thank you in advance,
Alex
[graph attached]
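A minimal sketch: the by() option's yrescale suboption lets each subgraph use its own y scale.
Code:
twoway line meanCHS4 Year, by(county, yrescale)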

Saturday, December 29, 2018

Merging 3 data sets

Good evening all. I am looking for some help merging 3 data sets.
All 3 data sets are sorted by patient ID (ptid), and I would like to merge on ptid. The issue is that the master data set has one row per ptid, but the other two have multiple rows of data with the same ptid.
I was able to merge the master data set with one of the two other data sets without problems using a 1:m merge.
Code:
use "apap_analysis"
merge 1:m ptid using "apap_meds"

Now I am unsure how to merge in the 3rd data set, which contains the same ptid variable but otherwise different variables from the first two datasets.
I tried an m:m merge, but it created issues in the data, mainly duplicating rows that I did not want duplicated.

Does anyone know how I can merge in the 3rd data set? Can I tag the ptid in all 3 data sets and merge based on the tag ptid?

I can provide more detail/clarification if needed.

Thanks!!
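A hedged pointer rather than a definitive answer: m:m merge is almost never what is wanted. If the third data set also has many rows per ptid and every pairwise combination within ptid is acceptable, joinby is the usual tool (the filename below is hypothetical).
Code:
joinby ptid using "apap_labs"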

Different results between xtreg and xtivreg2


I recently ran into an issue with xtivreg2. The coefficient estimates are very different between the xtivreg2 and xtreg estimations, although I have the same observations in both cases, as you can see below. Do you know why this might be the case? Thanks very much for your help.

Ken


. xi: xtivreg2 income_ln (l_nooutage=l_nooutage_other) i.year, fe robust
i.year _Iyear_2012-2016 (naturally coded; _Iyear_2012 omitted)

FIXED EFFECTS ESTIMATION
------------------------
Number of groups = 3563
Obs per group: min = 2, avg = 2.7, max = 3

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity

Number of obs = 9557
F( 3, 5991) = 52.73
Prob > F = 0.0000
Total (centered) SS = 1979.58912 Centered R2 = -0.0393
Total (uncentered) SS = 1979.58912 Uncentered R2 = -0.0393
Residual SS = 2057.374316 Root MSE = .5859

------------------------------------------------------------------------------
| Robust
income_ln | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l_nooutage | 2.12346 .6731734 3.15 0.002 .8040644 3.442856
_Iyear_2014 | .0671006 .0173548 3.87 0.000 .0330859 .1011154
_Iyear_2016 | .1476405 .0245762 6.01 0.000 .0994721 .195809
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic): 164.081
Chi-sq(1) P-val = 0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic): 184.444
(Kleibergen-Paap rk Wald F statistic): 181.606
Stock-Yogo weak ID test critical values: 10% maximal IV size 16.38
15% maximal IV size 8.96
20% maximal IV size 6.66
25% maximal IV size 5.53
Source: Stock-Yogo (2005). Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments): 0.000
(equation exactly identified)
------------------------------------------------------------------------------
Instrumented: l_nooutage
Included instruments: _Iyear_2014 _Iyear_2016
Excluded instruments: l_nooutage_other
------------------------------------------------------------------------------


. xi: xtreg income_ln l_nooutage i.year, fe robust
i.year _Iyear_2012-2016 (naturally coded; _Iyear_2012 omitted)

Fixed-effects (within) regression Number of obs = 9,557
Group variable: hh_ID Number of groups = 3,563

R-sq: within = 0.0299, between = 0.0287, overall = 0.0156
Obs per group: min = 2, avg = 2.7, max = 3

F(3,3562) = 52.80
Prob > F = 0.0000
corr(u_i, Xb) = 0.0359

(Std. Err. adjusted for 3,563 clusters in hh_ID)
------------------------------------------------------------------------------
| Robust
income_ln | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
l_nooutage | -.1175234 .1064316 -1.10 0.270 -.3261965 .0911496
_Iyear_2014 | .1045732 .01321 7.92 0.000 .0786732 .1304731
_Iyear_2016 | .2107709 .0169039 12.47 0.000 .1776287 .2439131
_cons | 11.66598 .6202094 18.81 0.000 10.44998 12.88199
-------------+----------------------------------------------------------------
sigma_u | .74279106
sigma_e | .56616934
rho | .63252005 (fraction of variance due to u_i)
------------------------------------------------------------------------------


Reshape Long Missing Values Error

My dataset is in time series format.

I am converting it to a panel using the following code:

Code:
reshape long var, i(date)
But keep getting the following error:
variable _j contains all missing values
r(498);

Here is the data sample:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int date double(var1HK0000040383 var2HK0000050325) byte var3KR7000010009 double(var4KR7000020008 var5KR7000030007 var6KR7000040006)
16072 0 0 0 5243.082 11930.294 532.714
16075 0 0 0 5243.082 11930.294 532.714
16076 0 0 0 5243.082 11930.294 532.714
16077 0 0 0 5243.082 11930.294 532.714
16078 0 0 0 5243.082 11930.294 532.714
16079 0 0 0 5243.082 11930.294 532.714
16082 0 0 0 5243.082 11930.294 532.714
16083 0 0 0 5243.082 11930.294 532.714
16084 0 0 0 5243.082 11930.294 532.714
16085 0 0 0 5243.082 11930.294 532.714
16086 0 0 0 5243.082 11930.294 532.714
16089 0 0 0 5243.082 11930.294 532.714
16090 0 0 0 5243.082 11930.294 532.714
16091 0 0 0 5243.082 11930.294 532.714
16092 0 0 0 5243.082 11930.294 532.714
16093 0 0 0 5243.082 11930.294 532.714
16096 0 0 0 5243.082 11930.294 532.714
16097 0 0 0 5243.082 11930.294 532.714
end
format %tdnn/dd/CCYY date
What is the problem here?

Thank you.
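A hedged reading of the error: with stub var, the suffixes (1HK0000040383, ...) are not numeric, so the default numeric _j ends up all missing. The string option tells reshape to treat the suffix as a string (secid is a hypothetical name for the new j variable).
Code:
reshape long var, i(date) j(secid) string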



Doubt with power calculations

Greetings, I am new to the forum. I am working with a categorical data set and am trying to calculate the sample size for one variable, in my case epilepsy episode (yes/no). I want to count all the 'yes' responses for further analysis, but I also want to include some 'no' responses in the analysis. Should I use power oneproportion or power twoproportions in Stata?
Thanks for your help.
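For reference, a hedged sketch of the two commands with hypothetical proportions: oneproportion tests a single proportion against a reference value, while twoproportions compares two groups.
Code:
power oneproportion 0.5 0.6, power(0.8)
power twoproportions 0.5 0.6, power(0.8)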

Parallel loop with numlist and varlist

I built a for loop with locals in it so that I can iterate in parallel over a varlist and a numlist. I want to generate a new variable in each iteration equal to the product of the variable from the varlist and the number from the numlist, but I ended up getting a repeated string (see the screenshots below, where my code is also attached). How can I get the product I want, e.g. 136*7 instead of seven copies of 136 concatenated together? Thanks!

[screenshots of the output and the code attached]
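A minimal sketch of parallel iteration (the variable and number lists here are hypothetical). The concatenation behaviour described above suggests the variables are stored as strings, in which case they need destring or real() first.
Code:
local vars  price weight length
local nums  7 5 3
local i = 1
foreach v of local vars {
    local n : word `i' of `nums'
    gen `v'_prod = `v' * `n'
    local ++i
}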

kdensity for 10,000 variables

Hello, I plan to make an illustrative graph showing kdensity plots for about 10,000 groups. I use the code below:
Code:
forvalues j = 1(1)10000 {
    local call `call' (kdensity norm if id == `j', legend(off)) ||
}
twoway `call'

However, twoway returns an error saying there are too many graphs. Is there another way to make a joint kdensity graph for this many groups?

Thank you.
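A hedged workaround: twoway caps the number of overlaid plots, so one option is to draw a random subset of ids (50 here is an arbitrary choice) rather than all 10,000. Note that legend(off) belongs on the overall twoway call, not on each plot.
Code:
local call
forvalues j = 1/50 {
    local call `call' (kdensity norm if id == `j')
}
twoway `call', legend(off)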

Importing data from Excel

Dear All,
I've got an Excel file with about 37 sheets. The sheets are identical in structure (e.g., in terms of the number of columns, rows, etc.). How can I import them all at once into a single Stata file?

Thanks,
Dapel
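A sketch under stated assumptions: the filename and the sheet names Sheet1...Sheet37 are hypothetical (import excel with the describe option lists the real sheet names). Import each sheet and append.
Code:
clear
tempfile building
save `building', emptyok
forvalues s = 1/37 {
    import excel using "myfile.xlsx", sheet("Sheet`s'") firstrow clear
    append using `building'
    save `building', replace
}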

Reporting results ordered logit regression: individual predictors or entire model?

Hello,

I am running an ordered logit regression to predict eh041 from the variables aa001, aa004, ba016, ca001, ea104, eb001, eb002, ec023, and dummy.

My question is: is it better to report the coefficients, standard errors, and p-values for each individual predictor, or for the entire model (and if the latter, which statistics should I report)?

Example of the dataset:
Code:
input long id int year byte(ca001 aa001) float aa004 byte(eb001 eb002) float(ea104 ec023) byte(eh041 ba016) int(dummy)
11001 2004 1 1 60 0 1   10 3 2 4 0
11001 2006 . .  . . .    . . . . 1
11002 2004 . 2 65 . .    . . . 4 0
11002 2006 . .  . . .    . . . . 1
25601 2004 1 1 50 0 1   36 5 2 6 0
25601 2006 1 1 52 0 1   36 4 1 6 1
Command for ordered logit regression:
Code:
ologit eh041 aa001 aa004 ba016 ca001 ea104 eb001 eb002 ec023 dummy
Output:
Code:
note: ca001 omitted because of collinearity
Iteration 0:   log likelihood = -5928.1906  
Iteration 1:   log likelihood = -5880.5609  
Iteration 2:   log likelihood = -5880.4552  
Iteration 3:   log likelihood = -5880.4552  

Ordered logistic regression                     Number of obs     =      6,312
                                                LR chi2(8)        =      95.47
                                                Prob > chi2       =     0.0000
Log likelihood = -5880.4552                     Pseudo R2         =     0.0081

------------------------------------------------------------------------------------
             eh041 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
             aa001 |  -.3323744   .0608015    -5.47   0.000    -.4515431   -.2132056
             aa004 |  -.0066495   .0024361    -2.73   0.006    -.0114241   -.0018748
             ba016 |   .0874215      .0311     2.81   0.005     .0264666    .1483763
             ca001 |          0  (omitted)
             ea104 |  -.0065987   .0030108    -2.19   0.028    -.0124998   -.0006976
             eb001 |   .0052226   .0671317     0.08   0.938    -.1263531    .1367982
             eb002 |  -.2247139   .0795934    -2.82   0.005    -.3807141   -.0687136
             ec023 |  -.2138397   .0323412    -6.61   0.000    -.2772272   -.1504521
             dummy |   .0544896   .0502734     1.08   0.278    -.0440444    .1530237
-------------------+----------------------------------------------------------------
             /cut1 |  -2.209866   .2457687                     -2.691564   -1.728168
             /cut2 |   .8122446   .2445497                       .332936    1.291553
             /cut3 |   3.017259   .2680799                      2.491832    3.542686
------------------------------------------------------------------------------------

Wald chi2 disappears after I apply the robust variance estimate

Dear Stata users and experts,
I am running an analysis with a time-invariant variable, cross-sectional variables, and longitudinal variables. The data cover 157 firms in total, with observations for years 2008-2013 on the dependent variable. I included year and industry dummies in the GEE model; without robust I got the Wald chi2, but when I added the robust variance estimate the Wald chi2 disappeared. Any suggestions?

New command -oaxaca_rif-

Dear all,
Thanks to Prof. Baum a new command named oaxaca_rif is now available in the SSC archive.
This command is a wrapper for the -oaxaca- command that allows estimation of reweighted RIF (recentered influence function) decompositions for a large set of distributional statistics.
Hope you find it useful.
Fernando
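It can be installed in the usual way:
Code:
ssc install oaxaca_rif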

Quicker way to export correlation coefficients into Excel?

Hello, I am running correlations over hundreds of variables and storing the correlation coefficients, the variable names, the number of observations, and the confidence intervals in Excel.

Due to the number of correlations I'm running, I am wondering whether there is a quicker way for my computer to run the task. This is my current code:

quietly {
    putexcel set coef3, modify
    local i = 0
    foreach var of varlist ea_* {
        foreach var2 of varlist wdi_* {
            local i = `i' + 1
            esize unpaired `var' == `var2', pbcorr
            return list
            putexcel A`i'=`r(r_pb)' B`i'=`r(lb_r_pb)' C`i'=`r(ub_r_pb)' D`i'=`r(N_1)' E`i'="`var'" F`i'="`var2'", nformat(excelnfmt)
        }
    }
}

Thank you and happy holidays
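A hedged suggestion: each putexcel call re-writes the workbook, so writing cell-by-cell is the slow part (and the return list inside the loop only prints to the screen). One common pattern is to accumulate the numeric results in a matrix and write it once; the variable-name columns could be written in a second, smaller pass.
Code:
capture matrix drop R
foreach var of varlist ea_* {
    foreach var2 of varlist wdi_* {
        quietly esize unpaired `var' == `var2', pbcorr
        matrix R = nullmat(R) \ (r(r_pb), r(lb_r_pb), r(ub_r_pb), r(N_1))
    }
}
putexcel set coef3, modify
putexcel A1 = matrix(R)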

Help with output/results window cutting off variable names

My variable names and value labels are not shown in full. Is there any way to get Stata to tabulate so that the output shows them in full? There is plenty of space in the Results window.


Code:
. tab stilling_i_husstand_std kn_std  if Særbarn_in_household==1 & ægtefælle_in_household==0

stilling_i_husstand_s |              kn_std
                   td |        ??          K          M |     Total
----------------------+---------------------------------+----------
Barn af Enke hos hu.. |         0          1          0 |         1
Enke hos husstandso.. |         0          6          1 |         7
Faglig medarbejder .. |         0         10          4 |        14
   Husstandsoverhoved |         0      1,697        459 |     2,156
Husstandsoverhoveds.. |         0         10          2 |        12
Husstandsoverhoveds.. |         0         16          3 |        19
Husstandsoverhoveds.. |         0      1,230        356 |     1,586
Husstandsoverhoveds.. |         0         15         11 |        26
Husstandsoverhoveds.. |         0          1          0 |         1
Husstandsoverhoveds.. |         0          5          0 |         5
Husstandsoverhoveds.. |         0        104          9 |       113
Husstandsoverhoveds.. |         0          6          0 |         6
Husstandsoverhoveds.. |         1        201        464 |       666
Husstandsoverhoveds.. |         0         11         11 |        22
Husstandsoverhoveds.. |         0          1          0 |         1
Husstandsoverhoveds.. |         0          4          0 |         4
Husstandsoverhoveds.. |         0          2          0 |         2
Husstandsoverhoveds.. |         0          1          4 |         5
Husstandsoverhoveds.. |         0         55         51 |       106
Husstandsoverhoveds.. |         0          0          5 |         5
Husstandsoverhoveds.. |         0        219      2,914 |     3,133
Husstandsoverhoveds.. |         0         14         10 |        24
Husstandsoverhoveds.. |         0         29          2 |        31
Husstandsoverhoveds.. |         0          8          3 |        11
Husstandsoverhoveds.. |         0        467         51 |       518
Husstandsoverhoveds.. |         0          0          1 |         1
Husstandsoverhoveds.. |         0          5          0 |         5
Husstandsoverhoveds.. |         0          1          0 |         1
Husstandsoverhoveds.. |         0          6          1 |         7
                OTHER |         0        233         73 |       306
Opholdende hos huss.. |         0         13          2 |        15
Tjenestefolk hos op.. |         0          2          0 |         2
----------------------+---------------------------------+----------
                Total |         1      4,373      4,437 |     8,811
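One workaround sometimes suggested, offered tentatively: tabulate truncates long row labels to fit its fixed layout, while table gives labels more room.
Code:
table stilling_i_husstand_std kn_std if Særbarn_in_household==1 & ægtefælle_in_household==0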

De trending// De seasonalising data - find weekly mean

Hiya,

For my project I have daily stock market data (returns and volatility) and daily weather data (cloud cover, rain, temperature).

How do I get Stata to take a weekly average and then subtract it from the daily values, in order to see just the excess over the mean?

Also, does anyone know how to make a graph showing the returns for specific observation values only? By this I mean: cloud cover is measured from 0 to 8; how do I graph the returns for only the values 0 and 8, omitting the other observation values?

thank you so much!!

If you could write out the do-file commands, that would be great.
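A minimal sketch, assuming a daily date variable date, a returns variable ret, and a cloud-cover variable cloud (all names hypothetical):
Code:
gen week = wofd(date)
bysort week: egen wkmean = mean(ret)
gen excess = ret - wkmean
* returns for cloud cover 0 or 8 only
twoway scatter ret date if inlist(cloud, 0, 8)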

IDs in different categories : how to count ?

Dear statalist members,

I have a sample of about 1 million people (id), each with one or more records in one or more categories (cat), 15 categories in total. In summary:

id cat
id1 cat1
id2 cat1
id2 cat1
id3 cat1
id3 cat3
… …

I'm trying to find out how many people have at least one record in several different categories, and which categories those are. I am not interested in the other people (those with just one record, or with several records all in one category). In summary, I would like a result of this type:

At least one record in cat1 and cat2: 1000 people;
At least one record in cat1 and cat3: 500 people;
At least one record in cat1, cat2 and cat3: 200 people;
...

For now, I have only managed to count each person's number of records overall:

bysort id: gen obs = _N
bysort id: gen obs2 = _n
keep if obs2 == 1
tab obs

Could someone tell me how I could solve this problem?

Many thanks,

Maxime
(Stata 13.1)
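A hedged sketch (assuming cat is a string variable; with a numeric cat, decode it first): reduce to one row per person-category pair, concatenate each person's categories, and tabulate the combinations.
Code:
duplicates drop id cat, force
bysort id (cat): gen combo = cat if _n == 1
bysort id (cat): replace combo = combo[_n-1] + "+" + cat if _n > 1
bysort id (cat): keep if _n == _N
tab combo if strpos(combo, "+")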


Cox regression with enormous hazard ratios (logarithmic)

Dear forum,

I have encountered a problem: for my Cox regression, the output gives enormous hazard ratios for my outcome (disease recurrence), such as 1.33e+10.


A) First I would like to give you the specifics:
I have a project where I assess the impact of response to chemotherapy (i.e., "pres", a variable with 3 levels: complete, partial, no response) on disease recurrence within the given follow-up. Displayed graphically (Kaplan-Meier plot), the outcome is quite impressive:

[Kaplan-Meier plot attached]

However, when I run a Cox regression adjusted for other variables (age, smoking status, etc.), my output displays grotesque hazard ratios:

[regression output attached]


The problem remains even if I run a univariable model. I believe the issue here is separation rather than ordinary collinearity: for example, "no response" predicts my outcome (disease recurrence) almost perfectly and therefore has a very large HR.

B) My questions are the following:

- Do you find my explanation plausible (see above)?
- Is there a solution, i.e., a way to run the Cox model (uni- or multivariable; as I only have 37 events, I fear overfitting) and obtain more approachable HRs?
- Lastly, if I run the Cox regression without a factor-variable prefix for my independent variable of choice ("pres"), i.e., omitting the i. for the categorical variable, I get an HR of approximately 6. I do not know how Stata runs that specific regression when "pres" is not specified as a factor:
does it treat the first level of the variable as the reference against the other two levels, "partial" and "no response"?

[regression output attached]





Thank you very much for your help and for taking the time to read this!
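On the last question, a hedged illustration (the covariate names age and smoker are hypothetical): without the i. prefix, pres enters the model as a single continuous term, so its coefficient is a per-level trend rather than a comparison of each level against a reference.
Code:
* after stset:
stcox i.pres age smoker    // one HR each for partial and no response vs complete
stcox pres age smoker      // pres treated as a continuous 1-2-3 trend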






Complex dummy variable

Hi Statalist,

Merry Christmas to those celebrating it, and happy New Year!
I have a question about creating a dummy variable. I have a list of product names, each appearing once, that I have merged with a panel dataset in which the products appear repeatedly over time. The panel dataset contains more than 9,000 products repeated over time; my list contains about 500 products, not repeated over time (it is just a simple list). What I would like to do is create a dummy variable taking the value 1 in the panel whenever the product name in the panel also appears in the list. Here is a dataex:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str19 recalled_products str18 prd
"ADENOSINE "          "ALPHA-KETOGLUTARIC"
"ADRUCIL "            "ISOPROPYL ALC/BENZ"
"TROVAN"              "ISOPROPYL ALC/BENZ"
"TEKTURNA"            "ISOPROPYL ALC/BENZ"
"ALOSETRON HCL"       "ISOPROPYL ALC/BENZ"
"ORLAAM"              "ISOPROPYL ALC/BENZ"
"DIETHYLPROPION HCL"  "ISOPROPYL ALCOHOL" 
"AMPHETAMINE SALTS"   "ISOPROPYL ALCOHOL" 
"CYTADREN "           "ISOPROPYL ALCOHOL" 
"AMINOPHYLLINE "      "ALCOHOL"           
"AMYTAL SOD"          "ALCOHOL"           
"PRAMOXINE/HC"        "ALCOHOL"           
"LOVENOX"             "ALCOHOL"           
"DIET SUPP EPHEDRA"   "ALCOHOL"           
"ASPIRIN "            "ALCOHOL"           
"AUVI-Q"              "ALCOHOL"           
"CLINIMIX"            "ALCOHOL"           
"BIOSCANNER KETONE"   "ALCOHOL"           
"PFIZERPEN G"         "ALCOHOL"           
"VASCOR"              "ALCOHOL"           
"PAMPRIN"             "ALCOHOL"           
"BICALUTAMIDE "       "20/20 EYE GLSS CLN"
"BISMUTH SUBGAL"      "20/20 EYE GLSS CLN"
"ANTIVENIN"           "20/20 EYE GLSS CLN"
"BLEPHAMIDE"          "20/20 EYE GLSS CLN"
"LIPO 6"              "20/20 REWETTING"   
"BOOST "              "360 OTC EXTRA STR" 
"BORIC ACID "         "360 OTC EXTRA STR" 
"HEPARIN SOD"         "4-WAY"             
"BROMFENAC SOD"       "4-WAY"             
"BROMOCRIPTINE MESY"  "4-WAY"             
"BUPRENORPHINE HCL"   "4-WAY"             
"BUPROPION HCL SR W"  "4-WAY"             
"BURN "               "4-WAY"             
"BHT"                 "4-WAY"             
"CARBINOXAMINE CMPD"  "4-WAY"             
"CARISOPRODOL "       "4-WAY"             
"SHARK CARTILAGE"     "4-WAY"             
"ZYMAR"               "4-WAY"             
"CELECOXIB "          "4-WAY"             
"CERTA-VITE SENIOR"   "666"               
"AQUACHLORAL"         "666"               
"CHLORAMPHENICOL "    "666"               
"LOBAC"               "666"               
"CHLOROFORM  "        "666"               
"CHLOROQUINE PHOS"    "666"               
"CHORIONIC GONADO"    "666"               
"CLIOQUINOL "         "666"               
"NEOCIDIN"            "666"               
"CLOMIPRAMINE HCL"    "666"               
"CLOZAPINE "          "666"               
"CD/PSE"              "666"               
"ACETAMINOPHEN PM"    "7-KETO DHEA"       
"COUMADIN "           "7-KETO DHEA"       
"CUBICIN"             "7-KETO DHEA"       
"VASODILAN"           "7-KETO DHEA"       
"CYPROHEPTADINE HCL"  "7-KETO DHEA"       
"DRISTAN"             "7-KETO DHEA"       
"ALEVAZOL"            "7-KETO DHEA"       
"DEXAMFETAMINE "      "7-KETO DHEA"       
"PROPOXYPHEN-N/APAP"  "7-KETO DHEA"       
"DICLOFENAC SOD"      "7-KETO DHEA"       
"DICYCLOMINE HCL"     "7-KETO DHEA"       
"ORTHO DIENESTROL"    "7-KETO DHEA"       
"DIETHYLSTILBESTROL " "A & D PERSONAL CAR"
"MOTOFEN"             "A & D PERSONAL CAR"
"GUANIDINE"           "A & D PERSONAL CAR"
"LOMOTIL"             "A & D PERSONAL CAR"
"TRANDATE"            "A & D PERSONAL CAR"
"TIKOSYN"             "A & D PERSONAL CAR"
"ANZEMET"             "A & D PERSONAL CAR"
"DOMPERIDONE "        "A & D PERSONAL CAR"
"DOXYCYCLINE HYCLAT"  "A & D PERSONAL CAR"
"DICYCLOMINE HCL"     "GARLIC/PARSLEY"    
"DROPERIDOL "         "GARLIC/PARSLEY"    
"RAPTIVA"             "GARLIC/PARSLEY"    
"EPINEPHRINE "        "GARLIC/PARSLEY"    
"ERYTHROMYCIN"        "GARLIC/PARSLEY"    
"ERYTHROMYCIN ESTOL"  "GARLIC/PARSLEY"    
"ALCOHOL SWABS"       "GARLIC/PARSLEY"    
"PLACIDYL"            "GARLIC/PARSLEY"    
"ESTINYL"             "GARLIC/PARSLEY"    
"PEPPERMINT SPIRIT"   "GARLIC/PARSLEY"    
"ETOMIDATE "          "GARLIC/PARSLEY"    
"OBIZUR"              "GARLIC/PARSLEY"    
"FELBAMATE "          "GARLIC/PARSLEY"    
"FLUVOXAMINE MAL"     "GARLIC/PARSLEY"    
"FENTANYL "           "GARLIC/PARSLEY"    
"SULFISOXAZOLE"       "GARLIC/PARSLEY"    
"DURALGINA"           "GARLIC/PARSLEY"    
"GATIFLOXACIN "       "GARLIC/PARSLEY"    
"GELATIN "            "GARLIC/PARSLEY"    
"GEMFIBROZIL "        "GARLIC/PARSLEY"    
"GENTAMICIN SULF"     "A&D CRKD SKIN RLF" 
"GLUCOSAMINE SULF "   "A&D CRKD SKIN RLF" 
"ISMELIN"             "A&D CRKD SKIN RLF" 
"DYNABAC"             "A+D FIRST AID"     
"MITOXANTRONE HCL"    "A+D FIRST AID"     
"PHENYLPROPANOLAMIN"  "A+D FIRST AID"     
"SORINE"              "A+D FIRST AID"     
end
Of course the panel variable is prd, which continues (and is much, much longer than the variable recalled_products). So, for instance, for the first product, ADENOSINE, I would like the dummy to take the value 1 whenever prd equals ADENOSINE (I am sure all the names in recalled_products are also present in prd, even though the match is not visible in this excerpt), which in this case happens 12 times (the panel repeats 12 times for ADENOSINE).

Thank you very much,

Federico
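A hedged sketch: put the recall list in its own dataset keyed on the product name, then merge it back m:1 on prd. strtrim() handles the trailing blanks visible in the excerpt, which would otherwise prevent names from matching.
Code:
preserve
keep recalled_products
drop if missing(recalled_products)
replace recalled_products = strtrim(recalled_products)
duplicates drop
rename recalled_products prd
gen byte recalled = 1
tempfile recalls
save `recalls'
restore
replace prd = strtrim(prd)
merge m:1 prd using `recalls', keep(master match)
gen byte recall_dummy = _merge == 3
drop _merge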

Friday, December 28, 2018

Simple Time Series Regression

Hello everyone,

I have a fairly simple question and hope you can help me out. I have already studied quite a lot of the questions and answers here on the forum, but most of them deal with different, more sophisticated problems.

To my question: I want to figure out the correlation between y and x, for which I have time-series data available (for example, y = unemployment and x = CPI).

I already exponentially smoothed x=CPI (tssmooth exponential).

Now, as I am only interested in the correlation between y and x (y_t = b0 + b1*x_t + u_t), I was wondering whether a simple - reg yt xt - would yield the desired results.

I am trapped in my own thoughts right now and actually need some clarity, as this approach seems way too simple.

I am very thankful for every reply.
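A hedged sketch: for a first look, the simple regression (or even correlate) is reasonable, though with time series the errors are often serially correlated, so Newey-West standard errors are a common precaution (t is a hypothetical time variable, and the lag choice is arbitrary).
Code:
tsset t
correlate yt xt
newey yt xt, lag(4)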

FMM lcprob variables

Hello experts,

In FMM (finite mixture models), the main model(s) can include certain IVs. I can then add variables to lcprob() to specify what determines the probability of being in each class. For example, the Stata help document says that total medical expenditure (the DV) could be predicted by gender, age, and income. In the basic model, it uses only those IVs; it then notes that, with that specification, we are assuming the prior probability of being in each class is the same for all individuals, and that it would make better sense to include the total number of chronic conditions each person has in the lcprob() part of the model.

Now, my question is: what are the criteria by which we decide that a variable should go in the main model rather than in the lcprob() part? In other words, in the example mentioned, the total number of chronic conditions could also have been used as one of the IVs in the main model.

I hope my question is clear.

Thanks in advance
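For concreteness, a hedged sketch of the two placements being discussed, echoing the help-file example (the variable names are hypothetical):
Code:
* chronic as a class-membership predictor:
fmm 2, lcprob(chronic): regress medexp gender age income
* chronic as an ordinary covariate in the outcome model instead:
fmm 2: regress medexp gender age income chronic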

calculating and graphing marginal effects from logit with interaction effect of two categorical variables

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(owndecision treat gender)
1 1 1
1 1 1
1 1 1
1 1 1
0 1 1
1 1 1
1 1 0
1 1 0
1 1 0
1 1 1
0 1 1
1 1 0
1 1 1
1 1 1
0 1 0
1 1 1
end
owndecision = 1 if defect, 0 otherwise
treat = 0, 1, 2 (3 treatments: 0 = common, 1 = asymmetric, 2 = private)
gender = 1 if female, 0 otherwise

I would like the average marginal effects of defection (owndecision=1) by gender for the asymmetric and private treatments, and to produce a graph that looks like this:
[example graph attached]

Code:
logit owndecision i.gender#i.treat

------------------------------------------------------------------------------
 owndecision |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender#treat |
        0 1  |   1.466337   .7372854     1.99   0.047     .0212843     2.91139
        0 2  |   2.590267   1.179689     2.20   0.028     .2781187    4.902415
        1 0  |   1.041454   .6522961     1.60   0.110     -.237023    2.319931
        1 1  |   1.977163   .6868733     2.88   0.004     .6309158     3.32341
        1 2  |          0  (empty)

 margins, dydx(treat) over(gender)
------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.treat      |
      gender |
          0  |   .3472222   .1606046     2.16   0.031      .032443    .6620015
          1  |   .1828704   .1157482     1.58   0.114    -.0439919    .4097326
-------------+----------------------------------------------------------------
2.treat      |
      gender |
          0  |   .5138889   .1600699     3.21   0.001     .2001576    .8276201
          1  |          .  (not estimable)
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

marginsplot
Since the coefficient for gender=1, treat=2 is empty, the marginal effect is not estimable, and the graph produced by marginsplot is missing that effect:

[marginsplot output attached]

Any help would be appreciated. Thank you.

Writing loop for multiple regressions

I have 30 dependent variables, y1-y30, and their respective lagged variables, lagy1-lagy30. I would like to regress each dependent variable on its own lag and five fixed controls, e.g., regress y1 lagy1 x1-x5. How can I write a loop to run the 30 regressions and store the estimates?

Currently, I wrote the following code, and the problems are (1) it runs meaningless regressions, e.g., regress y1 lagy2 x1-x5, and (2) the estimates could not be stored properly.


local dependant y1-y10
local independant lagy1-lagy10
local x = 1
foreach p of local dependant {
    foreach q of local independant {
        regress `p' x1 x2 x3 x4 x5 `q'
        est sto m_`x'
        local x = `x' + 1
    }
}

This is the first time I have written loop code in Stata; I checked previous posts but still could not find a solution. I really appreciate any help or comments. Thank you very much for your time and consideration!
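A minimal sketch: index the loop by number, so each outcome is paired with its own lag. This avoids the crossed loops and the meaningless combinations.
Code:
forvalues i = 1/30 {
    regress y`i' lagy`i' x1-x5
    estimates store m_`i'
}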

using "spmap"

Hi guys,

I am trying to map a result using the spmap command, but I keep running into the error "master data not sorted".

Below is the code that I used,
------------------------------------------------
use "$processed/production_regional.dta", clear

format weight_edible_ameday %4.2f
spmap weight_edible_ameday using vietmap_province_region3.dta, id(_ID) fcolor($colorscale) ///
legend(symy(*1) symx(*1) size(3) pos(4)) ///
title("Total harvest (kg/day/AME)", size($titlesize)) cln(7) ///clm(c) clb($cats) ///
note("Source: ***** crop production", size($notesize))
graph export "$maps/prod_region_weight.png", as(png) replace
------------------------------------------------


Can anyone tell me how to resolve this problem?

Many thanks,

Manny
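A hedged guess at the remedy: spmap expects the master data to be sorted on the variable passed to id(), so sorting just before the call often clears this error.
Code:
sort _ID
spmap weight_edible_ameday using vietmap_province_region3.dta, id(_ID)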



How to present vignettes in a tabular format

Hello everyone,

Could you please help me present vignettes in a tabular format rather than as running text?

Code:
use setup, clear

gen phrase_A1 = "error"

    replace phrase_A1 = "male" if gender ==1
    replace phrase_A1 = "female" if gender ==2

gen phrase_A2 = "error" 
    replace phrase_A2 = "yes at the employer's premises" if experience_and_internship ==1
    replace phrase_A2 = "yes, but in a different firm" if experience_and_internship ==2
    replace phrase_A2 = "no" if experience_and_internship ==3

gen phrase_A3 ="error" 
    replace phrase_A3 ="Omani" if nationality == 1 
    replace phrase_A3 ="non-Omani" if nationality == 2 
    
gen phrase_A4 = "error"
    replace phrase_A4 = "leading university in Oman" if place_of_study ==1
    replace phrase_A4 = "non-leading university in Oman" if place_of_study ==2
    replace phrase_A4 = "leading university abroad" if place_of_study ==3
    replace phrase_A4 = "non-leading university abroad" if place_of_study ==4
    
gen phrase_A5 = "error"
    replace phrase_A5 = "College Diploma" if level_of_education ==1
    replace phrase_A5 = "College Higher Diploma" if level_of_education ==2
    replace phrase_A5 = "Bachelor" if level_of_education ==3
    replace phrase_A5 = "masters" if level_of_education ==4
    
gen phrase_A6 = "error"
    replace phrase_A6 = "Engineering" if field_of_study ==1
    replace phrase_A6 = "Business and Management" if field_of_study ==2
    replace phrase_A6 = "Inforamtion and Technology" if field_of_study ==3
    
gen phrase_A7 = "error"
    replace phrase_A7 = "high" if grade ==1
    replace phrase_A7 = "fair" if grade ==2
    replace phrase_A7 = "low" if grade ==3
    
gen phrase_A8 = "error"
    replace phrase_A8 = "yes" if extra_curricular_activities ==1
    replace phrase_A8 = "no" if extra_curricular_activities ==2
    
gen phrase_A9 = "error"
    replace phrase_A9 = "yes by an exisiting employee" if referred ==1
    replace phrase_A9 = "yes through school-linkages" if referred ==2
    replace phrase_A9 = "no" if referred ==3
    

assert phrase_A1 ~= "error"  
assert phrase_A2 ~= "error"
assert phrase_A3 ~= "error"
assert phrase_A4 ~= "error"  
assert phrase_A5 ~= "error"
assert phrase_A6 ~= "error"
assert phrase_A7 ~= "error"  
assert phrase_A8 ~= "error"
assert phrase_A9 ~= "error"


gen vigA = phrase_A1 + phrase_A2 + phrase_A3 + phrase_A4 + phrase_A5 + phrase_A6 + phrase_A7 + phrase_A8 + phrase_A9
My data look like this:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id_quest byte(vignr deck) float(id_vignette gender experience_and_internship field_of_study)
 1  1 11 105 1 2 1
 1  2 11 102 2 1 3
 1  3 11 109 2 2 2
 1  4 11 110 2 2 2
 1  5 11 108 1 1 3
 1  6 11 101 1 1 1
 1  7 11 104 2 3 2
 1  8 11 103 2 3 2
 1  9 11 107 1 2 3
 1 10 11 106 1 2 3
 2  1 13 125 1 2 1
 2  2 13 128 2 3 1
 2  3 13 130 1 2 2
 2  4 13 126 2 3 2
 2  5 13 129 2 3 1
 2  6 13 124 1 3 3
 2  7 13 121 2 3 1
 2  8 13 123 2 2 1
 2  9 13 127 1 2 1
 2 10 13 122 1 1 2
 3  1 19 184 1 2 1
 3  2 19 188 2 3 1
 3  3 19 181 1 3 2
 3  4 19 183 2 2 3
 3  5 19 189 2 1 1
 3  6 19 186 2 2 3
 3  7 19 182 2 2 1
 3  8 19 190 2 3 2
 3  9 19 185 2 1 1
 3 10 19 187 2 3 2
 4  1  4  40 1 3 2
 4  2  4  37 2 3 1
 4  3  4  35 2 1 1
 4  4  4  36 1 3 1
 4  5  4  39 1 3 1
 4  6  4  31 1 2 1
 4  7  4  32 1 3 3
 4  8  4  38 2 2 3
 4  9  4  33 1 3 3
 4 10  4  34 1 3 3
 5  1 12 111 1 1 3
 5  2 12 115 2 1 3
 5  3 12 117 1 2 3
 5  4 12 120 2 1 1
 5  5 12 116 1 1 3
 5  6 12 119 1 1 3
 5  7 12 113 1 1 3
 5  8 12 112 1 3 3
 5  9 12 118 1 2 2
 5 10 12 114 1 2 3
 6  1  5  41 1 3 2
 6  2  5  44 1 2 1
 6  3  5  43 2 1 2
 6  4  5  46 1 2 3
 6  5  5  42 2 2 1
 6  6  5  50 1 2 2
 6  7  5  48 1 1 2
 6  8  5  49 1 1 3
 6  9  5  47 2 2 2
 6 10  5  45 2 3 1
 7  1  7  69 1 2 2
 7  2  7  66 2 3 2
 7  3  7  63 2 3 1
 7  4  7  67 2 2 3
 7  5  7  64 2 3 2
 7  6  7  62 1 3 3
 7  7  7  70 1 1 1
 7  8  7  68 2 2 3
 7  9  7  65 1 3 3
 7 10  7  61 1 3 3
 8  1 17 165 1 2 1
 8  2 17 170 2 3 1
 8  3 17 167 1 3 1
 8  4 17 166 1 3 2
 8  5 17 168 2 2 2
 8  6 17 161 2 1 2
 8  7 17 163 1 1 1
 8  8 17 164 1 2 3
 8  9 17 169 1 3 3
 8 10 17 162 2 2 2
 9  1  8  75 1 2 1
 9  2  8  76 1 1 1
 9  3  8  73 2 1 1
 9  4  8  79 2 2 1
 9  5  8  72 2 3 2
 9  6  8  80 1 3 3
 9  7  8  71 2 3 3
 9  8  8  74 2 1 2
 9  9  8  77 1 1 2
 9 10  8  78 2 1 2
10  1 15 143 2 2 2
10  2 15 147 2 3 1
10  3 15 149 2 3 2
10  4 15 144 1 2 3
10  5 15 141 1 1 1
10  6 15 145 1 1 3
10  7 15 150 1 1 3
10  8 15 146 2 3 3
10  9 15 142 1 3 2
10 10 15 148 1 3 1
end
label values gender gender
label def gender 1 "male", modify
label def gender 2 "female", modify
label values experience_and_internship experience_and_internship
label def experience_and_internship 1 "yes at the employer's premises", modify
label def experience_and_internship 2 "yes, but in a different firm", modify
label def experience_and_internship 3 "no", modify
label values field_of_study field_of_study
label def field_of_study 1 "Engineering", modify
label def field_of_study 2 "Business and Management", modify
label def field_of_study 3 "Inforamtion and Technology", modify


I want a table like this:

Table 1:
gender                        male
experience and internship     no
field of study                Engineering
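A minimal sketch (assuming the phrase_A* variables built above): print one vignette's attributes as an attribute/value table instead of concatenating them into running text.
Code:
* show the vignette in observation 1, one attribute per row
foreach v of varlist phrase_A* {
    display %-12s "`v'" `v'[1]
}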

The base year for finding yearly effects of the shock in DID

I want to estimate a difference-in-differences model in Stata looking at the effects of a trade shock (in 2007) on household income. I have repeated cross-sectional data for the years 1995-2015, so I estimated this model:
reg income Treat##Post i.year
where Treat is a dummy variable (1 for the treated group, 0 for the control group) and Post is a dummy variable (1 for years after the shock, 0 for years before). I included year fixed effects (i.year) to control for time-varying macroeconomic changes. Treated households were richer than the control group before the shock, but their income trends were parallel (so their income differences are not zero before the shock). The coefficient of interest (Treat*Post) is significantly negative.

I am also interested in the effect of the shock for each year, because I believe the effect of the shock has decreased over time. So I estimated this model:
reg income ib2006.year##i.Treat

I have two questions regarding defining the base year:
(1) I defined the year before the shock (2006) as the base year. This assumes there is no income difference between the two groups in 2006, which is not correct: as I said, treated households were richer than the control group before the shock, so their income differences are not zero before the shock.

(2) Although the Treat*Post coefficient is significant in the first model, the Treat*year2007 through Treat*year2015 coefficients are not significant in the second model. Why?
(If I change the base year to 2007, the coefficients become significant, because all values shift down.)

GEE and distributional assumptions

Hello all,

I am using GEE to model my dependent variable. The dependent variable has a lower bound of 0 (its observed value, not censored or truncated) and can take on larger values as well. However, in my dataset there are a lot of zeros in the dependent variable (about 80% of observations). Would it be acceptable to run a linear GEE model here (assuming that I probe my results using alternative approaches)? From what I understand, GEE is a quasi-likelihood estimator with weaker distributional assumptions, so my thought was that this would be okay, but I'd be interested in hearing others' thoughts. To be clear, I am interested in using and defending this approach for my analysis.

Thank you in advance!







repeated time values within panel

Dear Statalisters!

I have a problem with my time variable. I tried to use this command:

xtset importer1 Year
But I get this error message:

repeated time values within panel

My data looks like this:

[screenshot of the data attached]


I encoded Importer and Year with these commands:

encode Importer, gen(importer1)
encode Year1, gen(Year)



[screenshot attached]

In my data, I have 27 importers and around 165 exporters. I want to examine how imports from the exporters to the importers change during a crisis, depending on whether the importer uses the euro.
The problem seems to be that the same year appears several times for the same importer. Is it even possible to use panel methods with my data set? If so, how should I proceed?

Best reg(ression)ards,

Gabriel Bladh
Stockholm
Sweden
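A hedged sketch of the usual remedy: with importer-exporter flows, the panel unit is the importer-exporter pair rather than the importer alone. Also, a year held as a string is better converted with destring than encode, so the numeric values match the calendar years (exporter1 is a hypothetical encoded exporter id).
Code:
destring Year1, gen(year)
egen pairid = group(importer1 exporter1)
xtset pairid year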










discrepancy between mixed results and contrast command

Hi, I'm running a mixed model for longitudinal data with a two-by-two categorical interaction (all other variables being continuous). grceintra is coded 0 for low EC and 1 for high EC; time is coded 1 for time 1, 2 for time 2, 3 for time 3, and 4 for time 4.
Here are the mixed command and the results:
Code:
 xtmixed rmssd i.grceintra##i.time alc caf cig bmi ||id:alc caf cig bmi , residuals(un, t(time))
Code:
Mixed-effects ML regression                     Number of obs     =        268
Group variable: id                              Number of groups  =         68

                                                Obs per group:
                                                              min =          3
                                                              avg =        3.9
                                                              max =          4

                                                Wald chi2(11)     =      45.69
Log likelihood =  -56.68586                     Prob > chi2       =     0.0000

--------------------------------------------------------------------------------
         rmssd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
   1.grceintra |  -.1516471   .1237396    -1.23   0.220    -.3941723    .0908781
               |
          time |
            2  |  -.0828655   .0517872    -1.60   0.110    -.1843665    .0186355
            3  |   -.235263   .0556458    -4.23   0.000    -.3443267   -.1261993
            4  |  -.1622465     .04143    -3.92   0.000    -.2434478   -.0810453
               |
grceintra#time |
          1 2  |   .1037683   .0721843     1.44   0.151    -.0377103    .2452469
          1 3  |   .0890882   .0782335     1.14   0.255    -.0642466     .242423
          1 4  |   .1410346   .0577477     2.44   0.015     .0278511    .2542181
               |
           alc |  -.1268449   .0641427    -1.98   0.048    -.2525624   -.0011275
           caf |  -.0009671   .0437007    -0.02   0.982    -.0866189    .0846846
           cig |   .0116784   .0402781     0.29   0.772    -.0672652    .0906221
           bmi |   .0007147    .019349     0.04   0.971    -.0372087     .038638
         _cons |   4.209916   .4915439     8.56   0.000     3.246508    5.173325
--------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Independent              |
                     sd(alc) |   1.81e-09          .             .           .
                     sd(caf) |   1.99e-09          .             .           .
                     sd(cig) |   1.65e-10          .             .           .
                     sd(bmi) |    .022004          .             .           .
-----------------------------+------------------------------------------------
Residual: Unstructured       |
                      sd(e1) |   .1679547          .             .           .
                      sd(e2) |   .2756804          .             .           .
                      sd(e3) |   .3655816          .             .           .
                      sd(e4) |    .129396          .             .           .
                 corr(e1,e2) |   .1695978          .             .           .
                 corr(e1,e3) |   .5009059          .             .           .
                 corr(e1,e4) |  -.2689602          .             .           .
                 corr(e2,e3) |   .6809035          .             .           .
                 corr(e2,e4) |  -.0108121          .             .           .
                 corr(e3,e4) |   .4826205          .             .           .
------------------------------------------------------------------------------
LR test vs. linear model: chi2(13) = 330.39               Prob > chi2 = 0.0000
When I use the contrast command to test the main and interaction effects, the result is the following:
Code:
 contrast time##grcetot

Contrasts of marginal linear predictions

Margins      : asbalanced

------------------------------------------------
             |         df        chi2     P>chi2
-------------+----------------------------------
rmssd        |
        time |          3       34.92     0.0000
             |
     grcetot |          1        3.15     0.0760
             |
time#grcetot |          3        6.26     0.0998
So, it's a little bit disturbing because:
1. The mixed results show that the high-EC group has lower rmssd (the DV) than the low-EC group (coef = -.15), but the contrast command tells us there is no main effect of the IV (grcetot chi2 = 3.15, p = .076).
2. The interaction terms show that the high-EC group exhibits a significant gain of .14 between time1 and time4 relative to the low-EC group, but again, the overall interaction term is not significant (chi2 = 6.26, p = .0998).

In social science we are not used to computing follow-up analyses after regressions, because the coefficients in the mixed table are usually considered sufficient. But I am a little bit obsessive about statistics (sorry!).
I don't know what to conclude from such a discrepancy. Any help is welcome.
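One note that may explain part of the discrepancy: the coefficient on 1.grceintra in the mixed table is the group difference at the base level of time (time 1), whereas the contrast main effect is a joint Wald test with the groups averaged over all four time points, so the two need not agree. To test the group difference at each time point directly, a hedged sketch (run right after refitting the model):
Code:
* simple effects: the group contrast separately at each time point
contrast r.grceintra@time, effects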
best
carole

How to remove observations with no change in the dependent variable in a regression?

I have a panel data set that is of the form
shpro date price
1 1 100
1 2 100
1 3 100
1 4 100
2 1 98
2 2 100
2 3 102
2 4 104
3 1 99
3 2 100
etc. etc. etc.
where shpro identifies a given product within a given shop, i.e., it is product-shop specific; date is the date of the price reading; and price is the price of the product in that shop.

I am carrying out a fixed-effects regression with time fixed effects and shop-product fixed effects. I wish to condition my regression on the fact that the price of each product at date=1 differs from the price of that same product at date=4. I initially tried generating variables for the price at date=1 and at date=4, but of course this does not work, since each observation has only one date. I have an inkling that I may need to reshape the data, but I am not entirely sure how to do this.

Any help will be so much appreciated.
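One possible approach, as a sketch: flag the shop-products whose price changed between the first and last date, then condition the regression on the flag. This assumes every shpro is observed at dates 1 through 4.
Code:
bysort shpro (date): gen byte changed = (price[1] != price[_N])
* then add "if changed" to the regression command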


Panel Regression - Top 10% of income of each industry each year

Dear all,

unfortunately I am new to Stata and I don't really know how to proceed. I want to run a regression on the top 10% of income within each industry in each year. I have 10 different industries and 14 years. I thought about creating dummy variables, and I have already generated dummies for the industries (industry1, industry2, industry3, etc.) and for the years (year1, year2, year3, etc.). But the problem is: how can I tell Stata to create a dummy for the top 10% of each industry in every year? Or am I overcomplicating this, and there is another command that does it?
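A sketch of one way to build such a dummy, assuming variables named industry, year, and income:
Code:
* 90th percentile of income within each industry-year cell
bysort industry year: egen p90 = pctile(income), p(90)
gen byte top10 = (income >= p90) if !missing(income)
* then restrict the regression to the flagged observations with "if top10"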

Thank you in advance!

Best regards,

Corn

Three way tables using svyset

Hello,

I am using Stata 13 and I have a question about three-way tables with complex survey data. My dataset is weighted and stratified, and I would like to make a three-way table. Unfortunately, the table command doesn't work with svy.

I have tried:

Code:
 svy: prop var1, over(var2 var3)
However, this gives me the proportions of var1 within each var2 × var3 group, whereas I would like each group's proportion over all observations. Would this be possible?
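One possible workaround, as a sketch: build a single cell identifier from all three variables, so that the proportions are taken over the full weighted sample rather than within groups.
Code:
egen cell = group(var1 var2 var3), label
svy: proportion cell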

Thank you!

factor variables and time-series operators not allowed

Hi,

I am trying to run the following code, but I get the error message "factor variables and time-series operators not allowed". Steps one through three work, but at steps four and five I get the error message.

(1) xtologit CSRRS_n PPE INTAN RND CH LEV ROA OI Growth NLCF CETR ln_employees i.DataYearFiscal, vce(robust)
est store r1
(2) xtologit CSRRS_n PPE INTAN RND CH LEV ROA OI Growth NLCF GETR ln_employees i.DataYearFiscal, vce(robust)
est store r2
(3) esttab r1 r2 using "Regression.rtf",
(4) replace stats(N chi2 p) b(3) aux(se 3) star(* 0.10 ** 0.05 *** 0.01) obslast onecell nogaps
(5) compress title(Regressions) addnotes(p-levels are two-tailed, * p < 0.10, ** p < 0.05, *** p < 0.01; the numbers within the round parentheses are robust standard errors.)
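The message likely arises because steps (3) through (5) are really one esttab command split over three physical lines: run separately, line (4) starts with replace, which Stata parses as the replace command. In a do-file, /// joins them into a single command (options kept exactly as posted):
Code:
esttab r1 r2 using "Regression.rtf", replace                      ///
    stats(N chi2 p) b(3) aux(se 3) star(* 0.10 ** 0.05 *** 0.01)  ///
    obslast onecell nogaps compress title(Regressions)            ///
    addnotes(p-levels are two-tailed, * p < 0.10, ** p < 0.05, *** p < 0.01; the numbers within the round parentheses are robust standard errors.)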

Any help will be greatly appreciated.

Thursday, December 27, 2018

Is it possible to divide a variable by the mean across individuals for a regression?

Hello,

is it okay to divide each individual's value of a variable by the sample mean and then use this transformed variable in a regression?

For example:

Code:
sysuse auto, clear

. reg price trunk weight displacement gear_ratio

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(4, 69)        =      8.54
       Model |   210211246         4  52552811.6   Prob > F        =    0.0000
    Residual |   424854150        69  6157306.52   R-squared       =    0.3310
-------------+----------------------------------   Adj R-squared   =    0.2922
       Total |   635065396        73  8699525.97   Root MSE        =    2481.4

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trunk |  -63.64507   91.74253    -0.69   0.490    -246.6664    119.3763
      weight |   2.160798   .8998892     2.40   0.019     .3655685    3.956028
displacement |   10.36613   8.266774     1.25   0.214    -6.125634    26.85789
  gear_ratio |   2192.778   1140.727     1.92   0.059    -82.91105    4468.466
       _cons |  -8139.774   4688.715    -1.74   0.087     -17493.5    1213.956


egen meanprice = mean(price)
gen dividedprice = price/meanprice


. reg dividedprice trunk weight displacement gear_ratio

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(4, 69)        =      8.54
       Model |  5.53036265         4  1.38259066   Prob > F        =    0.0000
    Residual |   11.177316        69  .161990087   R-squared       =    0.3310
-------------+----------------------------------   Adj R-squared   =    0.2922
       Total |  16.7076786        73   .22887231   Root MSE        =    .40248

------------------------------------------------------------------------------
dividedprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       trunk |  -.0103232   .0148806    -0.69   0.490    -.0400091    .0193627
      weight |   .0003505    .000146     2.40   0.019     .0000593    .0006417
displacement |   .0016814   .0013409     1.25   0.214    -.0009936    .0043563
  gear_ratio |   .3556669   .1850251     1.92   0.059    -.0134481    .7247819
       _cons |  -1.320265    .760506    -1.74   0.087    -2.837433    .1969028
------------------------------------------------------------------------------
Would this cause any trouble?
The motivation would be to see which factors affect whether the price lies above the sample average.
Thank you!


Assigning variable values to observations based on common values of other variables...

I have a data set with 3 variables: one identifies the contract number; one identifies the type of contract (two values: prime or sub); and one lists the agency that let the contract. The authorizing agency is listed only for the prime contracts. I need to assign each sub contract the same agency value as its prime.

For Example...
Contract Number Contract Type Agency
121212 Prime XX
121212 Sub
121212 Sub
343434 Prime SS
343434 Sub
343434 Sub
565656 Prime ZZ
565656 Sub
565656 Sub
What Stata code can I use so that each Sub gets the Agency value of its Prime? I have 2,147 contracts, with 435 primes, 1,712 subs, and 25 agencies.
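One possible approach, as a sketch (variable names hypothetical; it relies on "Prime" sorting alphabetically before "Sub", so the prime row comes first within each contract):
Code:
* agency assumed to be a string variable, empty for the subs
bysort contractnum (contracttype): replace agency = agency[1] if agency == ""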

Thanks!

Steven Pitts

Merging dates

Hi,

I'm trying to combine two date variables from the same dataset, but I would like to prioritize one over the other:
considering date1 and date2,
I would like to generate date3 = date1, except when date1 is missing, in which case date3 should take the value of date2.

I tried :
gen date3 = date1
replace date3 = date2 if date 1 == "."

but I got a type mismatch message, even though my variables are all numeric daily dates (float).
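For a numeric date variable, missing is the system missing value . rather than the string "."; comparing a numeric variable against "." is exactly what triggers the type mismatch. A sketch:
Code:
gen date3 = date1
replace date3 = date2 if missing(date1)
format date3 %td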

I hope I'm clear and someone can help me,

Many thanks

El

PS: Merry Christmas!

getting estimates when using bayes prefix for melogit

Hi Stata forum members,

I need some advice on how to get estimates after fitting melogit in a Bayesian framework. I have tried the -parmest- command, but I get an error that says "Estimates matrix e(b) must have exactly 1 row".

Below is my example code:

Code:
sysuse auto, clear
bayes: melogit foreign trunk || rep78:,
parmest,format(estimate min95 max95 %8.2f p %8.1e) list(,)
can someone help?

Thanks in anticipation.

Madu

Just to add that I use Stata/SE 15.1 and the error number is r(498);
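Under the bayes prefix, e(b) is laid out differently from a standard estimation command, which is presumably what parmest objects to. A possible workaround, as a sketch (it sidesteps parmest rather than fixing it), is to pull the posterior summaries directly:
Code:
sysuse auto, clear
bayes: melogit foreign trunk || rep78:
bayesstats summary      // posterior means, sds, MCSEs, and credible intervals
matrix S = r(summary)   // the same numbers as a matrix, one row per parameter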

1-to-(n) Propensity score matching without replacement

Hi,

I was hoping someone could help me with this. I have a data set with about 100 cases and 6,000 controls. I want to create a propensity-score-matched cohort of 1 case : 3 controls (the propensity score generated from a set of baseline variables such as age, gender, kidney function, etc.). The -psmatch2- command does not let me do 1-to-many matching without replacement when using the n() option:

"psmatch2 treatment_variable , pscore(logit1) caliper (.2) noreplacement n(3)"- returns error message

"psmatch2 treatment_variable , pscore(logit1) caliper (.2) n(3)"- does propensity matching with replacement (not what I am looking for)

Can anyone suggest how to do this, or share code to overcome it? I am using Stata 15. I can't use the -teffects- command because I need the IDs of the matched controls in order to run survival analysis on the final matched cohort.

Thank you so much in advance.


Concatenate of a string and a number

Dear statalister,

I am trying to merge two databases, and I would like to use a concatenation of country and year as the key; the first is a string and the second a number. Is there a function or command to do this? I tested strcat and something else I found on the forum, but one works only for two strings and the other only for two numbers.
Thank you for your kind help.
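Two possible routes, as a sketch (assuming the variables are named country and year):
Code:
* convert the number to a string, then concatenate with +
gen key = country + "_" + string(year)

* or let egen handle the type conversion
egen key2 = concat(country year), punct("_")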

Best regards,
Alejandro

Convert string to time including milliseconds

I have a variable containing strings in the following format

StringVar
"2018-12-27 14:28:41.4861930"

I would like to convert it into a variable with a time format Stata will recognize, keeping precision to the millisecond, e.g., the number of milliseconds since 1960 would be perfect.

I tried among other things the following, but it delivered only missing values.
gen time2=clock(StringVar,"DMYhms")
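Two likely culprits: the mask (the string is year-month-day, so "YMDhms" rather than "DMYhms") and precision (clock values overflow a float, so the variable must be created as a double). A sketch:
Code:
gen double time2 = clock(StringVar, "YMDhms")  // "hms" also reads the fractional seconds
format time2 %tc                               // milliseconds since 01jan1960 00:00:00.000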

Any ideas?
Thank you for the help...



Joinby two variables

Dear statalisters,

I am trying to merge two datasets and I have some problems. I started yesterday merging using
Code:
joinby firm
and everything was ok. Today I am trying to use
Code:
joinby country year
but I have a problem: I think I create duplicate data. My master data has 1 million observations and a size of about 1.3 GB, and the second database has about 170,000 observations and a size of 10 MB. The final database is about 20 GB and 20 million observations.

Do you know why the size and the number of observations change like that? I think there are some duplicates. How can I check whether there are duplicates, and what can I do if there are?
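joinby forms all pairwise combinations of the observations that share the key, so if country year identifies many rows in both datasets the match multiplies. A diagnostic sketch (file name hypothetical):
Code:
use using_data, clear
duplicates report country year   // how many rows share each country-year?
isid country year                // errors out unless country-year is a unique key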

Thank you very much for your help.

Alejandro

Generate multiple variables from a variable containing symbols and numbers

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 salary_today
"243,250 (307,840) (253,454)"  
"322,043 (342,970)"            
"279,102 (365,736)"            
"126,025[12]"                  
"247,579††"                
"166218"                       
"138,740†††"             
"161349"                       
"130,646 (204,309)"            
"254160"                       
"238908"                       
"129,517 (175,081)***‡‡‡"
"228190"                       
"117763"                       
""                             
"188,723"                      
"135586"                       
"161,349 (197,454)"            
"162056"                       
end
The variable salary_today looks like the following
Code:
. list

     +-----------------------------+
     |                salary_today |
     |-----------------------------|
  1. | 243,250 (307,840) (253,454) |
  2. |           322,043 (342,970) |
  3. |           279,102 (365,736) |
  4. |                 126,025[12] |
  5. |                   247,579†† |
     |-----------------------------|
  6. |                      166218 |
  7. |                  138,740††† |
  8. |                      161349 |
  9. |           130,646 (204,309) |
 10. |                      254160 |
     |-----------------------------|
 11. |                      238908 |
 12. |     129,517 (175,081)***‡‡‡ |
 13. |                      228190 |
 14. |                      117763 |
 15. |                             |
     |-----------------------------|
 16. |                     188,723 |
 17. |                      135586 |
 18. |           161,349 (197,454) |
 19. |                      162056 |
     +-----------------------------+
I want to generate four variables: salary, salary_p1, salary_p2, and salary_note. salary will contain the number before any parenthesis; salary_p1 the number between the first pair of parentheses; salary_p2 the number between the second pair of parentheses; and salary_note all the symbols (including [12] in observation 4).

For example, for observation 12, salary will be 129517, salary_p1 will be 175081, salary_p2 will be missing, and salary_note will be ***‡‡‡.
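A sketch using Stata's Unicode regular-expression functions (Stata 14 or later); the patterns assume the layout shown above, i.e. a leading number, up to two parenthesized numbers, and trailing symbols:
Code:
* number before any parenthesis
gen salary = real(subinstr(ustrregexs(1), ",", "", .))    ///
    if ustrregexm(salary_today, "^([0-9,]+)")
* number inside the first pair of parentheses
gen salary_p1 = real(subinstr(ustrregexs(1), ",", "", .)) ///
    if ustrregexm(salary_today, "\(([0-9,]+)\)")
* number inside the second pair of parentheses
gen salary_p2 = real(subinstr(ustrregexs(2), ",", "", .)) ///
    if ustrregexm(salary_today, "\(([0-9,]+)\)[^(]*\(([0-9,]+)\)")
* drop the leading number and the parenthesized numbers; the symbols remain
gen salary_note = ustrregexra(salary_today, "^[0-9,]+|\s*\([0-9,]+\)", "")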

margins not estimable

Hi all,

I ran a panel-data fixed-effects regression with an interaction term in the model. From the results I can see the marginal effect of the dummy. However, when I plot the margins graph, it returns "not estimable".

Please see attached.



What should I do then? What's the problem here?

Thanks!


Best,
Linda

Foreach vs. Forvalues when using char() function to remove special characters in a string variable

Hello all,

Using Stata 15.1/IC

I need to submit a bulk file with a string variable ("NAME" variable in this example) that is required to have no special characters besides ampersand and dash. I am able to accomplish this using the following series of commands:


charlist NAME //shows which characters are in my string var NAME
"&',-./01234689ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz



egen NEWNAME= sieve(NAME), omit(,./`"""'`"'"') // generates new variable with the special characters omitted but retains & and -

Results:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str86(NAME NEWNAME)
"Single-Benefits, Inc."                                       "Single-Benefits Inc"                                              
"Superstar, LLC"                                              "Superstar LLC"                                                  
"RML Agency, Inc."                                            "RML Agency Inc"                                        
"A & M Company, Inc."                                         "A & M Company Inc"
end
While this approach works as intended, I wanted a command that does not depend on the specific characters to be omitted, which could change between datasets (e.g., a character like "+" or "@" would slip through my code if a string variable contained one; I'd have to manually update the command). Plus, the way you have to set off double and single quote marks makes the log file hard to read.

I thought I could use the char() function to generalize the command, looping over the integer values of the ASCII characters with a forvalues loop (under the assumption that I will not run into any non-ASCII special characters), but I get the following error:

. forvalues i = 33/37 39/44 46/47 58/64 91/96 123/126 {
2. replace NAME = subinstr(NAME, char(`i'), "", .)
3. }
invalid syntax
r(198);


I am, however, able to use the foreach command without error:

. foreach i in 33 34 35 36 37 39 40 41 42 43 44 46 47 58 59 60 61 62 63 64 91 92 93 94 95 96 123 124 125 126 {
  2. replace NAME = subinstr(NAME, char(`i'), "", .)
  3. }


My question is why the forvalues command doesn't work. My presupposition is that I simply got the syntax wrong, but I also wondered whether Stata treats values in the char() function differently than I thought when it is used with forvalues.

Of course, if there is an even better way to eliminate all special characters besides ampersands and dashes, I am all ears. Thanks for any advice.
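For what it's worth, help forvalues documents exactly one range specification per loop, which would explain the invalid syntax error; several ranges can instead be fed to foreach as a numlist. A sketch:
Code:
foreach i of numlist 33/37 39/44 46/47 58/64 91/96 123/126 {
    replace NAME = subinstr(NAME, char(`i'), "", .)
}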



Using -cmp- to estimate and interpret a three-stage Heckman model

Good morning all,

I am using the -cmp- package developed by Roodman to estimate a three-stage Heckman selection model. I am using the following code:

Code:
cmp (stage3 = )(stage2 = ) (stage1 =), ind(stage2*$cmp_probit stage1*$cmp_probit $cmp_probit)
While the model has been estimated and I am generally able to interpret it, I have a few questions.

First, a standard Heckman model has a rho parameter, the correlation between the errors of the selection and outcome equations, which underlies the inverse Mills ratio correction for selection bias in the second stage. When estimating the above model, however, there are three rho parameters, each with numbers attached: rho_12, rho_13, and rho_23. I assume this means there is an error correlation between stages 1 and 2, between stages 1 and 3, and between stages 2 and 3. While this interpretation makes sense, why does rho_13 exist? Should the selection correction from stage 1 really enter stage 3? Would I need to constrain that parameter to zero? Some advice would be appreciated, as constraining the parameter to zero substantively changes my results.

Second, I am using probit models and want to interpret the coefficients using margins. I cannot, however, seem to write the code necessary to get marginal effects at the third stage of my model conditional on the first two stages. Here is the code from the cmp help file that is closest to what I want:

Code:
cmp (wage2 = education age) (selectvar = married children education age), ind(selectvar*$cmp_probit $cmp_probit) qui
margins, dydx(*) predict(pr eq(wage2) condition(0 ., eq(selectvar)))
This code replicates the margins, predict(pcond) syntax that gets marginal effects in the second stage of a Heckman probit model in base Stata: it conditions margins on the first stage being equal to 1. I want to do the same with cmp, except conditioning on both the first and second stages being equal to 1. How would I do this?

Thanks in advance for anyone who can help. I greatly appreciate it!

- Garrett

Hyperlink to the file generated/modified by putexcel

This is a very minor request/question. Several of the community-contributed commands I use (e.g., estout and iebaltab) have a nifty feature where they provide a hyperlink to the file that they write, so that you can just click in the results window rather than browsing through your files. Is there any way to get putexcel to do this as well? I've been searching around but can't find much information about how this works. Thanks!
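As far as I know, putexcel itself has no option for this, but a clickable SMCL link can be printed right after the file is written. A sketch (file name hypothetical):
Code:
putexcel set "results.xlsx", replace
putexcel A1 = "test"
display `"{browse "results.xlsx":open results.xlsx}"'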

grouped variables

Hi
I currently have a variable for income following this structure:
When I run summary statistics, it therefore reports the mean of the category codes, not of the income ranges they label. Is there any way I can recode the variable to the ranges shown, or alternatively run summary statistics so that I get a mean of the grouped variable? In general I do not understand how to deal with a grouped variable, and I have struggled to find the relevant information. Many thanks.
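Without seeing the exact bands, a sketch of the usual approach: recode each band to a representative value (such as its midpoint) before summarizing. The bands and the variable name income_cat below are hypothetical:
Code:
* hypothetical bands: 1 = "0-9,999", 2 = "10,000-19,999", 3 = "20,000-29,999"
gen income_mid = .
replace income_mid =  5000 if income_cat == 1
replace income_mid = 15000 if income_cat == 2
replace income_mid = 25000 if income_cat == 3
summarize income_mid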

Standard errors using Frisch-Waugh-Lovell theorem

Hi,
I need to implement the Frisch-Waugh-Lovell theorem in Stata 15 MP (64-bit) for a research project. To illustrate my problem, I'd like to abstract from the actual application and focus on the following MWE. In the example, I'd like to show that the coefficient on headroom can be obtained in two ways: either through a standard OLS estimation with two regressors in total, or through partialling out of the first regressor, trunk.

Code:
sysuse auto2, clear

* Multivariate regression
reg price trunk headroom  

* Partialling out
reg headroom trunk, vce(robust)
predict double resid_x2, res

reg price trunk
predict double resid_y, res

reg resid_y resid_x2
My trivial question is: why are the standard error of headroom from the multiple regression and the standard error of the partialled-out coefficient on headroom not exactly equal? The coefficients themselves correspond (which is what I wanted to see); however, I seem not to understand this procedure properly, since the standard errors should correspond as well, right? Where is my mistake?

Thank you very much in advance.
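If the discrepancy is purely the degrees of freedom, this would be why: both regressions have identical residual sums of squares, but the short regression divides by n-2 while the full model divides by n-3, since the latter charges for the partialled-out regressor. A sketch of the correction:
Code:
reg resid_y resid_x2
display _se[resid_x2] * sqrt((e(N) - 2) / (e(N) - 3))   // matches the full-model s.e.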

Wednesday, December 26, 2018

questions about model selection with lassopack

Dear Stata users,
Sorry to ask you 3 simple questions.

1. When we use lassopack to select predictors, if a predictor is a categorical variable, should we just put it in the code as-is, or add "i." before the variable?

Should we use this code:
lasso2 AO agec i.sex i.edu3 i.jobm i.incomef i.snec i.dnec1 , plotpath(lambda)
cvlasso AO agec i.sex i.edu3 i.jobm i.incomef i.snec i.dnec1 , lopt seed(123)

Or this code:
lasso2 AO agec sex edu3 jobm incomef snec dnec1 , plotpath(lambda)
cvlasso AO agec sex edu3 jobm incomef snec dnec1 , lopt seed(123)

2. Must we use cvlasso to select the predictors?
When the lasso2 run finishes, at the bottom of the results there is an explanation: "Type lasso2, lic(ebic) to run the model selected by EBIC."


My question is: which criterion should model selection be based on, EBIC or lambda?

3. After we run the lasso code and get the final model, the p-values for some predictors are greater than 0.05. Is that okay?



Many thanks and best wishes!
Jing Pan




Is there a way to rename a large number of variables with a single command? (Details)

I have a ton of variables, for example var1_m, var2_m, var3_m, etc. I want to turn them into var1_2016, var2_2016, var3_2016, etc., basically changing the _m at the end into _2016. Thanks.
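A sketch using rename groups (Stata 12 or later), which accept wildcards:
Code:
* swap the _m suffix for _2016 on every matching variable
rename *_m *_2016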

Generate with tempfiles

I used Stata tempfile code from one of the earlier posts to append multiple years of NHIS mortality data. The code worked perfectly; however, I had to manually generate an interview-year variable in each data set before appending. I'm hoping someone can show me how to generate a new variable, year, as it will help me generate other variables. Here is the code I used:

Code:
clear

set more off
local flist: dir "." files "*.dta"

use NHIS_1986_MORT_2011_PUBLIC     //is there a way to run the code without specifying the using dataset?

local mort = 0
foreach fname of local flist {
    local ++mort   
    tempfile temp`mort'
    save  "`temp`mort''"
}    

forval i = 1/`mort' {
  append using "`temp`i''"
}
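One way, as a sketch, is to load each file inside the loop and take the year from the file name, which also avoids having to name a starting dataset. It assumes the first 4-digit run in each file name is the interview year (as in NHIS_1986_...):
Code:
clear
local flist : dir "." files "*.dta"

local i = 0
foreach fname of local flist {
    local ++i
    use "`fname'", clear
    * pull the first 4-digit run out of the file name, e.g. 1986
    if ustrregexm("`fname'", "[0-9][0-9][0-9][0-9]") gen year = real(ustrregexs(0))
    tempfile temp`i'
    save "`temp`i''"
}

use "`temp1'", clear
forvalues j = 2/`i' {
    append using "`temp`j''"
}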

Looking for examples of OSIRIS dictionaries and data

Dear Statalisters,

I am writing a custom converter of data from OSIRIS dictionaries and need examples for testing.
I know ICPSR has a bunch of old datasets in this format, but I can't get access to it since it is all behind their login screen.
If anyone knows of publicly available examples of OSIRIS dictionaries (type I, or any other type), please point me to those resources.
If you can share data privately (a few observations should be sufficient), please send me a message directly.

Thank you, Sergiy Radyakin

Strategy to choose the right controls? Conceptual questions

Hello,

I have a problem with choosing the right controls for my model and hope someone can help me along.

With the model I want to explain savings, human capital, and labour supply by the timing of the demographic transition (DT). DT is the variable of interest. The model is the same for all three:

dependent_2010 = β0 + β1·dependent_1990 + γ1·DT + γ2·DT² + c·Control, with subscript i on each variable


I have already chosen a proxy for urbanity and a dummy for war during the investigation period 1990-2010 as control variables.

Question 1: Do I have to expect that urbanity/war is both correlated with an independent variable AND affects the dependent variable, or is it enough to assume that urbanity/war affects the dependent variable WITHOUT correlation to any independent variable? In case correlation with an independent variable is needed: my model includes one lagged value. If I assume that a control variable affects the dependent variable, I simultaneously assume it affects one of the "independent" variables, the lagged value, too. However, I have a feeling this would not force me to include such a control variable in the model, does it? I hope this makes sense.

Question 2: I have thought about including the level of income in 1990, but would this be purposeful? Economic models usually explain income as a function of savings, human capital, and labour supply, so explaining these variables with income would be misleading, although it sounds logical that poor nations save less or can only afford little education. Should I exclude variables where reverse causality could occur?

Question 3: Controlling for life expectancy would seem reasonable, because, expecting a longer life, people could tend to save more, work more, and get more education. Now, this is tricky for me: the demographic transition (DT) is initiated by falling mortality rates and, by the same token, by rising life expectancy. That means life expectancy and DT must be correlated. Is it still okay to include life expectancy in the model? I fear that, because DT is a result of life expectancy, I could erase potential effects of DT on the dependent variable.

Off-topic Question 4: Suppose I receive the estimation results and find significant effects of DT on the dependent variable. What phrases am I "allowed" to state? I surely can't say, 'The result is that DT causes the dependent variable to rise/fall.' Is this really all I can say: 'We cannot reject that the effect of DT on the dependent variable is non-existent'?


I apologize as this is no direct question about Stata, but the help received on this forum is very valuable and I don't know where else to ask.
As always, thank you!!

Generate a new variable by deleting everything after certain character ("/")

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str28 country
"UK"                         
"France"                     
"France / Singapore / UAE"   
"Switzerland"                
"Spain / US"                 
"Italy"                      
"Switzerland"                
"France"                     
"Netherlands"                
"UK"                         
"FR / GB / DE / ES / IT / PL"
end
Code:
    +-----------------------------+
     |                     country |
     |-----------------------------|
  1. |                          UK |
  2. |                      France |
  3. |    France / Singapore / UAE |
  4. |                 Switzerland |
  5. |                  Spain / US |
     |-----------------------------|
  6. |                       Italy |
  7. |                 Switzerland |
  8. |                      France |
  9. |                 Netherlands |
 10. |                          UK |
     |-----------------------------|
 11. | FR / GB / DE / ES / IT / PL |
     +-----------------------------+
I want to generate a new variable, say home_country, by keeping only the first country in the country variable, i.e., deleting everything from the first "/" onward. For example, for observations 1, 2, and 3, home_country will be UK, France, and France, respectively.
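One possible sketch using split, which breaks the string at each "/" (strtrim removes the blank left before the separator):
Code:
split country, parse("/") gen(part)
gen home_country = strtrim(part1)
drop part*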

Non-linear hypotheses testing with a GSEM

Hello, I'm having trouble with the testnl command after a GSEM model. Specifically, my problem is that I can't refer to the covariance coefficient of my GSEM estimation within the testnl command.

Here is a minimal working example of my problem:

Code:
sysuse auto
gsem (price <- mpg rep78) (trunk <- length turn), cov(e.price*e.trunk)
gsem, coeflegend
testnl _b[trunk:length]=0
testnl _b[/var(e.price)]=0
testnl  _b[/cov(e.price,e.trunk)]=0
In the previous example, everything works fine until I test the hypothesis that the covariance between the error terms is 0, where I get an "option e.trunk not allowed" error.

I guess that the comma inside the covariance term is being interpreted as the start of an options list, but I don't know how else I can refer to this covariance. I inspected the e(b) matrix, but the coefficient has the same name there.

Any help would be appreciated. Thanks in advance for your responses.