Thursday, January 31, 2019

System GMM and AB test for AR(2)

Hi All:

I am using the system GMM method to estimate my panel data (T=6). I am new to this method. I know that the assumption of system GMM is that changes in the instrumenting variables are uncorrelated with the fixed effects. But I do not know whether system GMM should rely on the AB test for AR(2). My result for AR(2) is significant. If it does rely on AR(2), how should I correct it?

The following is my code:
Code:
xtabond2 mcs_abs L.mcs_abs i.RWL1 i.RWL1#i.b i.housect age i.Rmarital i.Rchild i.cfreq i.Ghealth1 i.Xoccup i.healthstress i.moneystress i.disease i.Rstate yeardum1-yeardum6, gmmstyle(mcs_abs i.RWL1, laglimits(2 4) collapse eq(level)) gmmstyle(i.RWL1#i.b i.Ghealth1 i.housect i.disease,lag(1 2) eq(level)) ivstyle(age i.Rchild i.Rmarital i.cfreq i.Xoccup i.healthstress i.moneystress i.Rstate yeardum1-yeardum6, equation(level)) twostep robust small orthogonal
The test is following:
Code:
Arellano-Bond test for AR(1) in first differences: z =  -9.51  Pr > z =  0.000
Arellano-Bond test for AR(2) in first differences: z =   3.54  Pr > z =  0.000
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(121)  = 145.59  Prob > chi2 =  0.063
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(121)  = 124.96  Prob > chi2 =  0.384
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  gmm(lfstfy 1b.RWL1 2.RWL1 3.RWL1 4.RWL1 5.RWL1 6.RWL1 7.RWL1 8.RWL1, collapse eq(level) 
> lag(2 4))
    Hansen test excluding group:     chi2(97)   = 107.38  Prob > chi2 =  0.221
    Difference (null H = exogenous): chi2(24)   =  17.58  Prob > chi2 =  0.823
  iv(age 0b.Rchild 1.Rchild 0b.Rmarital 1.Rmarital 0b.disease 1.disease 2.disease 0b.cfreq
>  1.cfreq 2.cfreq 3.cfreq 1b.Xoccup 2.Xoccup 3.Xoccup 1b.healthstress 2.healthstress 3.he
> althstress 4.healthstress 5.healthstress 1b.moneystress 2.moneystress 3.moneystress 4.mo
> neystress 5.moneystress 1b.Rstate 2.Rstate 3.Rstate 4.Rstate 5.Rstate 6.Rstate yeardum1 
> yeardum2 yeardum3 yeardum4 yeardum5 yeardum6, eq(level))
    Hansen test excluding group:     chi2(93)   = 109.94  Prob > chi2 =  0.111
    Difference (null H = exogenous): chi2(28)   =  15.03  Prob > chi2 =  0.978
Thank you,
Connie

Documentation for a non-existent function

I was learning string functions in Stata. After obtaining help with the help string functions command, one thing that particularly caught my attention was the strcat(s1,s2) function, which according to the description does not exist.
This is what the help documentation says.
strcat(s1,s2)
Domain s1: strings
Domain s2: strings
Range: strings
Description: There is no strcat() function. ...............
  • Why would Stata include help for a function that doesn't exist? I believe there are many other operations for which Stata does not have a command or function. What exactly was the motivation behind this?
  • Should I also expect similar documentation for other Stata commands and functions? (I am still a newbie in Stata.)
Regards.
MJ
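
For what it is worth, and if I recall the full help entry correctly, the strcat() entry exists to point users to the + operator, which is how strings are concatenated in Stata. A minimal illustration (firstname and lastname are hypothetical variables):
Code:
* concatenation uses +, which is what the strcat() help entry redirects to
display "alpha" + "bet"
generate fullname = firstname + " " + lastname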

How to declare weekly data as time series data in Stata 15

Dear All
I want to use time series data on a weekly basis. When I try, I get the error "repeated time values in sample"

I have used the following commands

replace Date = wofd(Date)
(522 real changes made)

. format %tw Date

. tset Date
repeated time values in sample
r(451);


Could someone please help me resolve this? I want to run panel data/time series analysis for VAR or ARCH models.
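
A hedged sketch of the usual diagnosis: r(451) means at least two observations share the same week, which happens either because several daily dates collapse into one week or because the data are a panel. Assuming a hypothetical panel identifier such as firm_id:
Code:
duplicates report Date              // how many weeks occur more than once?
* if several units are observed in each week, declare the panel dimension as well
xtset firm_id Date, weekly
* if the data really are a single series, the duplicate weeks must be dropped or collapsed first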

Creating and storing residuals in a loop

Hi all

I am trying to run the following code. The purpose is to capture industry and year wise residuals from the stated model.
It works fine. However, when I try to replace i = "twodigcode" (i.e., two-digit codes) with i = "onedigcode" (one-digit codes), Stata still gives the same results. I have also changed the loop to i = 1/6, but the results do not change. Can someone please guide me on where I am going wrong or what I am missing? Thanks in anticipation.

Code:
forvalues y = 2015/2017 {   // Define a for/next loop spanning years in sample

               forvalues i = 1/45 { // Define a for/next loop spanning the industries in the sample

               capture: reg   INVESTMENT SALESGROWTH    if y=='Years' & i=='twodigcode' , noconstant    // estimate Jones-type regression silently, within ind-year samples

              

               capture: predict resid`++n' if e(sample), residuals   // save residuals in temporary variable named 'residXXX', and increment the local counter

               capture: replace `rname'=resid`n' if e(sample) // update values of permanent variable with residuals estimated in line above

               capture: drop resid`n'  //drop temporary variable

                              } // NEXT INDUSTRY

               } // NEXT YEAR

Creating and storing residuals in a loop

Code:
forvalues y = 2015/2017 {   // Define a for/next loop spanning years in sample

               forvalues i = 1/45 { // Define a for/next loop spanning the industries in the sample

               capture: reg   INVESTMENT SALESGROWTH    if `y'==Years & `i'==INDUSTRYGROUP , noconstant    // estimate Jones-type regression silently, within ind-year samples

              

               capture: predict resid`++n' if e(sample), residuals   // save residuals in temporary variable named 'residXXX', and increment the local counter

               capture: replace `rname'=resid`n' if e(sample) // update values of permanent variable with residuals estimated in line above

               capture: drop resid`n'  //drop temporary variable

                              } // NEXT INDUSTRY

               } // NEXT YEAR
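
One likely reason the results never change in the first version is that the condition if y=='Years' & i=='twodigcode' refers to y and i as if they were variables and wraps the variable names in quotes, instead of comparing the variables against the loop locals (`y' and `i'), so every regression errors out and is silently swallowed by capture. For reference, a self-contained variant of the corrected loop (a sketch only, assuming variables named Years, INDUSTRYGROUP, INVESTMENT and SALESGROWTH, with a permanent variable resid_all standing in for the `rname' local of the original post):
Code:
generate double resid_all = .                    // permanent variable collecting all residuals
local n = 0                                      // counter for the temporary residual variables
forvalues y = 2015/2017 {
    forvalues i = 1/45 {
        capture noisily regress INVESTMENT SALESGROWTH if Years == `y' & INDUSTRYGROUP == `i', noconstant
        if _rc continue                          // skip year-industry cells with no usable data
        local ++n
        capture predict double resid`n' if e(sample), residuals
        capture replace resid_all = resid`n' if e(sample)
        capture drop resid`n'
    }
}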

Question about reghdfe

Could you please answer my questions about the reghdfe command?

1. When I conducted an estimation using reghdfe, the following warning messages appeared after estimation.

Warning: VCV matrix was non-positive semi-definite; adjustment from Cameron, Gelbach & Miller applied.
WARNING: Missing F statistic (dropped variables due to collinearity or too few clusters).

1) Could you please let me know why these messages appeared?

2) Even though the adjustment from Cameron, Gelbach & Miller is applied and the F statistic is missing, are the estimates of the coefficients and standard errors still valid?

2. I tried to estimate standard errors clustered by individual (pubid) and state (state), because some respondents moved across states between survey waves. There are 8,635 clusters for individuals and 49 clusters for states. But the estimation results show that the standard errors are adjusted only for 49 clusters in individuals and states. Maybe this happened because most individuals did not move across states during the targeted periods; however, there certainly are individuals who moved between states. Are the estimates of the standard errors still valid?

Can I use tssmooth for a fixed number of periods like a rolling forecast, save the forecast for t+5, then start in t+1, save the forecast for t+6...

Hello everybody,

I'm trying to forecast values for the strategy of companies. I have a panel dataset for companies from 2000 - 2015, but I only need the forecasted values for 2005 - 2015. What I would like to do is take the periods 2000 - 2004, apply exponential smoothing and receive the forecasted value for 2005, then start over for the periods of 2001 - 2005 and forecast 2006, and so on. In the end I only need the last forecasted value for each of the time windows one after the other as one variable. Does anybody have any suggestions of how to do that?
I thought about using

tssmooth exponential double smooth_`var'_2005=`var' if inrange(year,2000,2005)

and do this for all periods, but I'm not sure whether it would do what I expect, and whether there is an option requiring less manual effort. The rolling command wouldn't really help in this case because it would always "overwrite" the end value I'm interested in, no?

Thank you so much for your help in advance
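
A rough sketch of the loop idea, assuming the panel is declared with a hypothetical identifier firm_id, the variable of interest is called strategy, and that tssmooth's forecast() option fills the one period following the if sample (worth verifying on a small subset first):
Code:
xtset firm_id year
generate double strategy_fcast = .
forvalues y = 2005/2015 {
    capture drop sm_tmp
    * smooth the five preceding years and forecast one step ahead into year `y'
    tssmooth exponential double sm_tmp = strategy if inrange(year, `y'-5, `y'-1), forecast(1)
    replace strategy_fcast = sm_tmp if year == `y'
}
capture drop sm_tmp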

Calibration of logistic regression on large dataset.

Evaluating goodness-of-fit for a logistic regression model using the Hosmer-Lemeshow test is not reliable in large datasets.
Which method would then work best? What are the alternatives?
Also, while there are many user-written programs for calibration plots, I wonder how to manually construct a calibration plot (predicted vs. observed frequencies).
Regards
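
On the last point, a minimal sketch of a hand-made calibration plot after logit, assuming the model has already been fit and the outcome variable is named y (hypothetical):
Code:
predict double phat, pr                                   // predicted probabilities
xtile risk_decile = phat, nq(10)                          // deciles of predicted risk
preserve
collapse (mean) observed = y expected = phat, by(risk_decile)
twoway (scatter observed expected) (function y = x, range(0 1)), ///
    xtitle("Mean predicted probability") ytitle("Observed proportion") legend(off)
restore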

Weak IV test postestimation test when using ivreghdfe command?

Dear Stata community,

I am hoping you can help me find a command I am looking for. My coauthor and I ran the following regression (I dropped all the control variables to make it more clear):

ivreghdfe employed (distance=ruggedness), absorb(district year#district month) cluster(cluster)

Distance to the treatment ("distance") is instrumented by the ruggedness of the terrain at the location ("ruggedness"). Since we are using more than one fixed-effects variable, we need to use the ivreghdfe command.

We want to run a weak IV test using a postestimation command, but the command "weakivtest" only works with ivreg2 and ivregress.

Do you have any suggestions about how we can address this problem?

Thank you for considering this request, and for any tips or leads you can offer!
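
One workaround sometimes suggested (a sketch, not a definitive answer): run the first stage manually with reghdfe, using the same fixed effects and clustering, and inspect the F statistic on the excluded instrument.
Code:
reghdfe distance ruggedness, absorb(district year#district month) vce(cluster cluster)
test ruggedness            // cluster-robust first-stage F on the excluded instrument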

Generate a moving window average

Is there an easy way to generate a moving window average? For instance, for every 5 minutes of Ta_NOAA, I want to generate the average Ta_NOAA from the prior 30 and prior 60 minutes. I can do simple lags and add them up, but this becomes tedious when the lag covers a large time frame (e.g., the prior 24 hours would require creating 288 lag variables at 5-minute intervals).

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double AKST_datetime float(Ta_NOAA Ta_mean30_p Ta_mean60_p)
1.7304192e+12 -2.6          .          .
1.7304195e+12 -2.6          .          .
1.7304198e+12 -2.5          .          .
1.7304201e+12 -2.5          .          .
1.7304204e+12 -2.2          .          .
1.7304207e+12 -2.1          .          .
 1.730421e+12   -2  -2.416667          .
1.7304213e+12   -2 -2.3166666          .
1.7304216e+12 -1.9 -2.2166667          .
1.7304219e+12 -1.8 -2.1166666          .
1.7304222e+12 -1.7         -2          .
1.7304225e+12 -1.7 -1.9166666          .
1.7304228e+12 -1.7      -1.85 -2.1333334
1.7304231e+12 -1.6       -1.8 -2.0583334
1.7304234e+12 -1.6 -1.7333333     -1.975
1.7304237e+12 -1.5 -1.6833333       -1.9
 1.730424e+12 -1.5 -1.6333333 -1.8166667
1.7304243e+12 -1.5       -1.6 -1.7583333
1.7304246e+12 -1.4 -1.5666667 -1.7083334
1.7304249e+12 -1.4 -1.5166667 -1.6583333
1.7304252e+12 -1.3 -1.4833333 -1.6083333
1.7304255e+12 -1.3 -1.4333333 -1.5583333
1.7304258e+12 -1.3       -1.4 -1.5166667
1.7304261e+12 -1.3 -1.3666667 -1.4833333
1.7304264e+12 -1.2 -1.3333334      -1.45
1.7304267e+12 -1.2       -1.3 -1.4083333
 1.730427e+12 -1.2 -1.2666667     -1.375
1.7304273e+12 -1.2      -1.25 -1.3416667
1.7304276e+12 -1.2 -1.2333333 -1.3166667
1.7304279e+12 -1.3 -1.2166667 -1.2916666
1.7304282e+12 -1.3 -1.2166667     -1.275
1.7304285e+12 -1.4 -1.2333333 -1.2666667
1.7304288e+12 -1.4 -1.2666667 -1.2666667
1.7304291e+12 -1.5       -1.3     -1.275
1.7304294e+12 -1.4      -1.35 -1.2916666
1.7304297e+12 -1.5 -1.3833333       -1.3
  1.73043e+12 -1.4 -1.4166666 -1.3166667
end
format %tc AKST_datetime

gen L5=Ta_NOAA[_n-1]
gen L10=Ta_NOAA[_n-2]
gen L15=Ta_NOAA[_n-3]
gen L20=Ta_NOAA[_n-4]
gen L25=Ta_NOAA[_n-5]
gen L30=Ta_NOAA[_n-6]
gen Ta_mean30_pp = (L5 + L10 + L15 + L20 + L25 + L30)/6

Thanks!
Dan
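
One approach that avoids the long chains of lags is the community-contributed rangestat command (ssc install rangestat), which averages over a window defined on the clock time itself. A sketch, recalling that %tc datetimes are stored in milliseconds (5 minutes = 300,000 ms):
Code:
* mean over the prior 30 minutes (t-30min through t-5min, excluding the current observation)
rangestat (mean) Ta_mean30_chk = Ta_NOAA, interval(AKST_datetime -1800000 -300000)
* mean over the prior 60 minutes
rangestat (mean) Ta_mean60_chk = Ta_NOAA, interval(AKST_datetime -3600000 -300000)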

Guarantee 3 consecutive observations before and after the event

Dear Stata Users,

I need to keep just those firms (gvkey) that have 3 consecutive yearly observations before and after the event. The event-year observation is supposed to count toward the "after" window, but not toward the consecutive "before" years. For example, in the data below, firms (gvkey) 1173 and 1266 satisfy this requirement. Please help me with this issue.


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long gvkey_destr int fyear float event
1121 2006 .
1121 2007 .
1121 2008 1
1121 2009 .
1121 2010 .
1121 2011 .
1121 2012 .
1121 2013 .
1173 2006 .
1173 2007 .
1173 2008 .
1173 2009 1
1173 2010 .
1173 2011 .
1173 2012 .
1210 2006 .
1210 2007 .
1210 2008 .
1210 2009 .
1210 2010 .
1210 2011 .
1210 2012 1
1210 2013 .
1266 2006 .
1266 2007 .
1266 2008 .
1266 2009 .
1266 2010 .
1266 2011 1
1266 2012 .
1266 2013 .
end
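
A sketch of one way to encode the rule (assuming at most one event year per gvkey and no duplicate gvkey-fyear rows): with the event year e counted as part of the "after" window, the firm must have all six years e-3 ... e+2 present.
Code:
bysort gvkey_destr: egen evyear = min(cond(event == 1, fyear, .))
generate byte in_window = !missing(evyear) & inrange(fyear, evyear - 3, evyear + 2)
bysort gvkey_destr: egen n_window = total(in_window)
keep if n_window == 6          // all of e-3 ... e+2 are present
drop evyear in_window n_window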

Simple help with global macros

Hello,

I believe I incorrectly posted this to Statalist earlier today. I recognize this question is simple, but I need help. I try to debug my .do files by examining datasets created within them. So, I want to understand how to create a temporary dataset that exists after completion of a do-file but is erased upon Stata exit. I have been unsuccessful at accomplishing this.

I wrote the following simple program as a way to test possibilities.

input firm year
10 2019
10 2018
11 2017
11 2016
end

describe
tempfile nextcit
save "`nextcit'"

global macrot "`nextcit'"
save $macrot, replace

describe

use $macrot

list

[output after describe command]
describe

Contains data from C:\Users\Owner\AppData\Local\Temp\ST_459c_000001.tmp
obs: 4
vars: 2 31 Jan 2019 12:45
size: 32
----------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------
firm float %9.0g
year float %9.0g
----------------------------------------------------------------------------------------------------------------
Sorted by:

.
. use $macrot

.
. list


+-------------+
| firm year |
|-------------|
1. | 10 2019 |
2. | 10 2018 |
3. | 11 2017 |
4. | 11 2016 |
+-------------+

-----
However, after my .do file terminates and I type in the command window :

. use $macrot

and receive the following error:


file C:\Users\Owner\AppData\Local\Temp\ST_459c_000001.tmp not found
r(601);

How can I access the data that I intended to store within $macrot?

Thanks in advance for your time and consideration,

Ed
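
The file is gone because a tempfile is deleted as soon as the do-file that declared it concludes; the global still holds the old path, but nothing is there any more. A sketch of one workaround is to save under a fixed name in Stata's temporary directory, c(tmpdir); note that, unlike a tempfile, this file is not automatically erased when Stata exits, it simply sits in the temp folder until it is deleted.
Code:
save "`c(tmpdir)'/debug_snapshot.dta", replace
* ... later, from the Command window ...
use "`c(tmpdir)'/debug_snapshot.dta", clear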

LPOLY: How can I change the scale of the axis on the lpoly graph?

Dear Stata users:

When I draw an lpoly graph, it seems that no option allows me to restrict the scale of the axis to a certain range. What should I do if I want to narrow down the scale shown on the plot?

For example, if the scale of my x-axis is between -4 and 4 and I want the scale on the final plot only between -2 and 2.

I have tried lpoly y x, xlab(-2 2) and lpoly y x, xscale(range(-2 2)), but it doesn't work.

Best,
Jeffery
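
One possibility (a sketch, assuming the aim is simply to estimate and display the smoother over the narrower range) is to restrict the estimation sample rather than the axis:
Code:
lpoly y x if inrange(x, -2, 2), xlabel(-2(1)2)
xscale(range()) can only extend an axis beyond the data, not shrink it, which is why it had no visible effect here.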

How can I make a line graph for data from a certain date range?

Hello! I'm working with daily time series data with the date originally in the format "01/29/19" (string) which I changed to 29jan2019.

I want to use the tsline command to plot my 2 series- insys_growth and spy_growth for the date ranges 02mar2015 to 31may2017
I used the command
Code:
tsline insys_close spy_close if 02mar2015<=edate<=30may2017
but it keeps giving me an error 30may2017 invalid name (even though the date is totally there in my data).
I also tried using the inrange() function
Code:
tsline spy_growth insys_growth if inrange(date, 02mar2015, 31may2017)
but I get the error "31may2017 invalid name r(198);"

A preview of my data:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 date float(edate insys_growth spy_growth)
"1/30/09" 17927   .1988509 4.4167905
"2/2/09"  17930 .016529638 4.4137673
"2/3/09"  17931 .016529638 4.4277167
"2/4/09"  17932   .1988509 4.4228086
"2/5/09"  17933   .3530013 4.4375796
"2/6/09"  17934   .3530013  4.465678
"2/9/09"  17937   .3530013 4.4670568
"2/10/09" 17938   .1988509  4.420165
"2/11/09" 17939   .3530013 4.4260435
"2/12/09" 17940   .3530013  4.426761
"2/13/09" 17941    .604316  4.415945
"2/16/09" 17944   .4865331         .
"2/17/09" 17945 .016529638 4.3722286
"2/18/09" 17946   .4865331 4.3698277
"2/19/09" 17947   .1988509  4.359014
"2/20/09" 17948   .1988509  4.349245
"2/23/09" 17951   .1988509 4.3128104
"2/24/09" 17952   .1988509   4.35002
"2/25/09" 17953   .1988509  4.342116
"2/26/09" 17954   .1988509  4.325721
"2/27/09" 17955 .016529638 4.3031187
"3/2/09"  17958  -.2066147   4.25703
"3/3/09"  17959   .1988509 4.2494946
"3/4/09"  17960 .016529638  4.272909
end
format %td edate
How can I plot the tsline graph only for a specific date range?
(Not sure why it's showing the date as 17927 and such, when it's actually in the 30may2017 format.)
I also want to round the insys_growth and spy_growth values and format them to have two digits after the decimal point. For instance, 4.272909 would become 4.27. How can I do that?

Thank you!!
Shruti
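
A sketch of both pieces: date literals have to be wrapped in td() to become numeric daily dates (a bare 31may2017 is read as a variable name, hence the "invalid name" error), and rounding to two decimals can be done with round() plus a display format. (The 17927-style numbers in the dataex output are simply the underlying daily-date values, days since 01jan1960, which the %td format displays as dates.)
Code:
tsline insys_growth spy_growth if inrange(edate, td(02mar2015), td(31may2017))

* round to two decimals (this changes the stored values); for display only, the format line alone is enough
replace insys_growth = round(insys_growth, 0.01)
replace spy_growth   = round(spy_growth, 0.01)
format insys_growth spy_growth %9.2f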

Missing R-squared from IV regression

Question: How can I display/find the missing "within R-squared" from an IV regression?

Example of the problem:

Suppose I use the following dataset provided by Stata:

Code:
use http://www.stata-press.com/data/r13/nlswork

And I run the following IV regression:

Code:
xtivreg ln_w age c.age#c.age not_smsa (tenure = union south), fe

I get the following result:
Code:
. xtivreg ln_w age c.age#c.age not_smsa (tenure = union south), fe

Fixed-effects (within) IV regression            Number of obs     =     19,007
Group variable: idcode                          Number of groups  =      4,134

R-sq:                                           Obs per group:
     within  =      .                                         min =          1
     between = 0.1304                                         avg =        4.6
     overall = 0.0897                                         max =         12

                                                Wald chi2(4)      =  147926.58
corr(u_i, Xb)  = -0.6843                        Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      tenure |   .2403531   .0373419     6.44   0.000     .1671643    .3135419
         age |   .0118437   .0090032     1.32   0.188    -.0058023    .0294897
             |
 c.age#c.age |  -.0012145   .0001968    -6.17   0.000    -.0016003   -.0008286
             |
    not_smsa |  -.0167178   .0339236    -0.49   0.622    -.0832069    .0497713
       _cons |   1.678287   .1626657    10.32   0.000     1.359468    1.997106
-------------+----------------------------------------------------------------
     sigma_u |  .70661941
     sigma_e |  .63029359
         rho |  .55690561   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F  test that all u_i=0:     F(4133,14869) =     1.44      Prob > F    = 0.0000
------------------------------------------------------------------------------
Instrumented:   tenure
Instruments:    age c.age#c.age not_smsa union south
------------------------------------------------------------------------------

Thanks!

Longitudinal data - generating variables dependent on observations within each subject

Hi everyone,

I have longitudinal data (see dataex below). I need to censor each id according to a few conditions.
Condition 1: if within the same id, treatment = 2 occurs on the same date as treatment = 1, I need to use that treat_date as the censoring date.

Condition 2: if within the same id there is a delay of >= x days between consecutive treat_dates for the same treatment, I need to censor at that date plus a specified add-on duration (say 10 days)

How do I look within each id to determine if these conditions occur?


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte treatment float treat_date
11 1 20054
11 1 20102
11 1 20176
11 1 20209
11 1 20247
11 1 20332
11 2 20391
11 2 20519
11 2 20576
11 3 20434
11 3 20450
11 5 20585
11 5 20618
11 5 20630
11 5 20675
11 5 20746
12 1 19401
12 1 19403
12 1 19460
12 1 19797
12 2 19686
12 2 19716
12 3 19529
12 3 19539
12 3 19567
12 3 19627
12 3 19787
12 4 19849
12 4 19915
12 4 19922
12 4 19954
12 4 19973
12 4 20025
12 4 20112
12 4 20197
12 4 20225
12 4 20235
12 4 20278
12 4 20332
12 4 20352
12 4 20365
12 4 20401
12 4 20474
12 4 20546
12 4 20636
12 4 20709
12 4 20759
12 4 20761
12 5 19683
12 5 19749
12 5 19754
13 1 20420
13 1 20432
13 1 20496
13 1 20547
13 1 20577
13 1 20664
13 1 20695
13 1 20752
14 2 17348
14 2 17368
14 2 17828
14 2 17902
14 2 17916
14 2 18228
14 2 18270
14 2 18318
14 2 18377
14 2 18440
14 2 18467
14 2 18490
14 2 18542
14 3 17262
14 3 17319
14 3 17448
14 3 17453
14 3 17461
14 3 17494
14 3 17521
14 3 17598
14 3 17602
14 3 17663
14 3 17694
14 3 17732
14 3 17759
14 3 17918
14 3 18091
14 3 18169
14 4 18598
14 4 18688
14 4 18701
14 4 18746
14 4 18820
14 4 18899
14 4 18977
14 4 19064
14 4 19126
14 5 17990
14 5 18031
15 1 18241
end
format %td treat_date
---------
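
A sketch of how the two conditions can be built per id (x and the add-on are illustrative values, and condition 2 is read here as censoring at the last date before the long gap plus the add-on):
Code:
local x      30        // minimum delay, in days, that triggers censoring
local addon  10        // add-on duration

* condition 1: a date on which both treatment 1 and treatment 2 occur for the same id
bysort id treat_date: egen byte has1 = max(treatment == 1)
bysort id treat_date: egen byte has2 = max(treatment == 2)
generate double cens1 = treat_date if has1 & has2

* condition 2: a gap of `x' days or more between consecutive dates of the same treatment
bysort id treatment (treat_date): generate double gap = treat_date - treat_date[_n-1]
bysort id treatment (treat_date): generate double cens2 = treat_date[_n-1] + `addon' if gap >= `x' & !missing(gap)

* earliest censoring date for each id
bysort id: egen double censor_date = min(min(cens1, cens2))
format censor_date %td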

Residuals in a panel data model

Hi
I am running a regression (using panel data) looking at the effect of income on food consumption, controlling for age. I am trying to determine whether I need a squared term for age, so I wish to create a scatter plot of residuals against my independent variable, as I have read this will show whether there is a non-linear relationship. Is this correct?
Therefore do I run:
xtreg fruvege age_dv, fe robust
predict fit, xbu
gen fit_2=fit^2
xtreg fruvege age_dv fit_2, fe robust
test fit_2=0

twoway (scatter fit_2 lnm)

in which case I get this:

[attached scatter plot omitted]

What does this mean? What am I doing wrong?
Thank you for any help.
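
For the plot itself, a minimal sketch (this only draws residuals against age; whether that is the best way to decide on a squared term is a separate question):
Code:
xtreg fruvege age_dv, fe robust
predict double resid_u, ue                      // combined residual u_i + e_it
twoway (scatter resid_u age_dv) (lowess resid_u age_dv)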

Adding an interaction term to a model or stratifying the data: which method is preferable for analysing interaction terms?

Hi Statalists,

Hope this post finds you well. May I know why stratification seems to be less preferable than adding an interaction term to the model straight away? Is it because the p-value derived from each subgroup tends to be less meaningful once we stratify the data into groups, as the power may be impeded by the sample size of each stratum itself? In that case, based on the reason given above, am I right to say that an interaction term is still preferable to stratification, probably because there is a need to retain the sample size? For example, an interaction was found between A and B on C: the interaction between A and B3 on C was observed to be significant in a regression analysis which took the interaction term into account. But when I just looked at the association between A and C while stratifying the data into three B groups - B1 (n=100), B2 (n=75) and B3 (n=45) - the effect between A and B3 on C became insignificant. Why is this so? Is it probably due to the change in sample size?

Any input and comments are much appreciated.

Thank you for the clarification in advance.

Em

Marginal effects Tobit (mfx vs margins)

Hello
I'm trying to calculate the marginal effects of a Tobit model using the margins command instead of mfx, because margins is faster and mfx is a discontinued command.

Tobit models have three marginal effects: the marginal effect on the probability at the truncation point, the conditional marginal effect, and the unconditional marginal effect. For each one I used mfx with the following options:

Code:
mfx            compute, predict (p(0,.))
mfx            compute, predict (e(0,.))
mfx            compute, predict (ys(0,.))
I am trying to replicate the results using margins, following the same order of marginal effects my syntax is as follows:

Code:
margins            , dydx(*) predict(p(0,.))
margins            , dydx(*) predict(e(0,.))
margins            , dydx(*) predict(ystar(0,.))
The marginal effects I get with margins are similar to those I get with mfx, but not the same. I'm not sure if I'm using the wrong option.
Any help is welcome, thank you
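
If I remember correctly (worth checking against the old mfx documentation), mfx evaluates the marginal effects at the means of the covariates, whereas margins, dydx() averages them over the sample by default; adding atmeans usually reconciles the two:
Code:
margins, dydx(*) predict(p(0,.))     atmeans
margins, dydx(*) predict(e(0,.))     atmeans
margins, dydx(*) predict(ystar(0,.)) atmeans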

Help with using expand>2 while replacing values in duplicates generated

Hi,

I am trying to use the expand command to create duplicates and replace one of the variables in the new rows.

For example,
expand 2 if state=="S" & district=="D" & year=="2009", generate (new)
Once the duplicate is created, I apply:
replace district="D1" if district=="D" & state=="S" & year==2009 & new==1

This works perfectly only if I want to use expand 2.
Now that I want to expand a row 9 times, the replace command will not work, as all the new duplicates are assigned the value 1.

To elucidate:
expand 9 if state=="S" & district=="D" & year=="2009", generate (new)
This created the necessary duplicate rows but I can not do the following:

replace district="D1" if district=="D" & state=="S" & year==2009 & new==1
replace district="D2" if district=="D" & state=="S" & year==2009 & new==1
and so on.

I tried generating a case id and replacing it but it requires me to manually check the id created which is not feasible as I need to do this for various states and have a million rows of data.

I am sure there is a better way of doing this which I am missing.

Any help would be appreciated.

Thank you
Regards,
Purnima
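
A sketch of one way around the manual renaming: number the copies within each expanded group and build the new district codes from that counter (assuming year is numeric and the group is uniquely identified by state, district and year):
Code:
expand 9 if state == "S" & district == "D" & year == 2009, generate(new)
bysort state district year (new): generate copy = sum(new)        // 0 for the original, 1-8 for the copies
replace district = "D" + string(copy) if new == 1
drop copy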





Individual Caliper for Variables Nearest-Neighbour Matching (psmatch2)

Dear Community,

I aim to apply nearest-neighbour matching using the mahalanobis option of the psmatch2 package in Stata 15. Given the syntax of this package, I can use the option 'caliper' to define the maximum distance between controls (I assume by setting it to a certain standard deviation?). My question relates to the situation where the importance of deviations varies across the different covariate-matching variables. In other words, I'd like to assign a smaller tolerance to certain covariates. Is this possible in psmatch2 or any comparable environment? From my point of view, it would have to be combined with the possibility of including a list of variables. I googled it and checked the forum search, but could not find any corresponding information.

To provide a minimal working example, here is the Stata code I am currently working with:
Code:
psmatch2 var_treatment, mahalanobis (covariat1 covariat2 covariat3) outcome(dep_var) neighbor(5) caliper(0.3) kernel kerneltype(epan)
Any help is highly appreciated!

Best,
Rob

A question on macro expression.

If we want to write i = i + 1, we could use ++i.

I wonder if there is also a short expression for i = i + 2 (or any number > 2)?
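
As far as I know there is no compound shorthand beyond ++ and -- (which step by exactly one); stepping by two is written out, or handled by the loop itself:
Code:
local i = `i' + 2          // increment a local by 2
forvalues i = 1(2)9 {      // or let forvalues do the stepping: 1, 3, 5, 7, 9
    display `i'
}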

Launching a Free Online Course: Introduction to GIS in R


We’re so excited to announce the launch of our second online course about geospatial data in R. Sign up here.

When you hear “geospatial data”, what comes to your mind? For many people, it’s ordinary maps. These are an important output of geospatial data, but it can actually be used for so much more. Geospatial data is at the heart of the big data revolution, and it’s the foundation of some of today’s biggest industries and innovations.

For a long time, serious work with geospatial data required a proprietary desktop GIS (Geographic Information System), such as ArcGIS. But today R and its data visualization libraries have become powerful enough to tackle even the toughest geospatial data.

Our latest online course, Introduction to GIS: Manipulating and Mapping Geospatial Data in R, will teach you the ins and outs of how to extract, process, analyze and map geospatial data in R. Sign up now.

Who should take this course?

This course is perfect for anyone who’s comfortable with R and wants to expand their skills, learn some of the latest packages, and start working with one of today’s most common forms of data.

Know someone who would find this ebook useful? Share it with them!


What will you learn in the course?

With the help of code snippets, exhaustive resources, and dozens of sample maps and web applications, this online course will help you learn everything from the basic “Hello World” geospatial code to in-depth analysis of satellite images with 6 in-depth lessons.


LESSON 1

Use Cases of Geospatial Data

What is geospatial data, where does it come from, and why is it worth your attention? In this lesson, you’ll get an overview of how geospatial data is being used today across sectors to segment markets, detect and prevent fraud, improve delivery routes, identify vulnerable populations, and more.

Key Topics: 

  • Basic information about geospatial data
  • Business use cases for geospatial data
  • Public use cases for geospatial data


LESSON 2

Manipulating Geospatial Data in R

This lesson starts with the basics — why we recommend R as a GIS, and a comparison of two common R packages for geospatial analysis. Next, you’ll walk through fundamental geospatial operations, illustrated with state-level population and economic data for India.

Key Topics: 

  • Importing spatial data into R with the sf package
  • Storing geospatial & attribute data in a spatial dataframe
  • Simplifying sf geospatial objects  before plotting


LESSON 3

Creating Static Maps in R

The next step after analysis is visualization. This lesson introduces some of the most well-known R packages for creating static geospatial maps. It covers traditional visualizations like choropleth maps, as well as ones that aren’t true geographic visualizations but still convey geospatial data.

Key Topics: 

  • sf, tmap, and ggplot2 packages in R
  • Choropleth, inset, faceted, geofaceted, cartogram, dot density, proportional symbols, and hexbin maps


LESSON 4

Creating Animated & Interactive Maps in R

Animation and interactivity are especially well-suited to geospatial data since they can show change over time. This lesson walks you through 7 different R packages for building animated and interactive maps, plus an overview of how to build geospatial web applications with Shiny.

Key Topics: 

  • Animated maps with tmap and gganimate
  • Interactive maps with tmap, ggiraph, geogrid, geofacet, mapview, plotly and leaflet
  • Interactive web applications with Shiny


LESSON 5

Performing Spatial Subsetting in R

Spatial subsetting helps you tap into the actual geometry of geospatial data. This lesson explains how to filter the regions in your data based on their relation to other regions (such as a common border, distance from a certain point, intersection, and more).

Key Topics: 

  • What spatial subsetting is and when it may be useful
  • Different topological relations
  • 3 methods for spatially subsetting data


LESSON 6

Exploring Raster Images in R

Raster data — or the images and data captured by satellites — is an even more complex form of geospatial data. This lesson explains what raster images are, where to get them, how to extract and process them, and what basic operations and analysis you can do on them.

Key Topics: 

  • Raster attributes and features
  • Downloading and reading Landsat 8 data with rLandsat
  • Plotting, cropping, and building indices on raster images

How much does the course cost?

Absolutely nothing! You can enroll in this course for free, learn at your own pace, and access all the lessons and course materials anytime, anywhere.

Ready to turn R into a lean, mean geospatial data machine? Sign up now!


The post Launching a Free Online Course: Introduction to GIS in R appeared first on SocialCops.


Panel data - dropping cross section based on missing values

In a panel dataset, there are companies with revenues for multiple years.

I would like to drop all companies if their revenue information is missing for any year.

Is there a simple way to achieve this?
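
A minimal sketch, assuming a panel identifier company_id and a revenue variable revenue (both names hypothetical):
Code:
bysort company_id: egen byte any_missing = max(missing(revenue))
drop if any_missing
drop any_missing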

Problems when running optimal k-means cluster solution program

Dear Team,

After reading the excellent "Stata tip 110: How to get the optimal k-means cluster solution", Stata Journal (2012) 12, Number 2, pp. 347-351, by Anna Makles, I copied and pasted the code written in the paper. The Stata do-file is:

PHP Code:
use physed, clear
local list1 "flexibility speed strength"
foreach v of varlist `list1' {
    egen z_`v' = std(`v')
}
local list2 "z_flexibility z_speed z_strength"
forvalues k = 1(1)20 {
    cluster kmeans `list2', k(`k') start(random(123)) name(cs`k')
}
* WSS matrix
matrix WSS = J(20, 5, .)
matrix colnames WSS = k WSS log(WSS) eta-squared PRE
* WSS for each clustering
forvalues k = 1(1)20 {
    scalar ws`k' = 0
    foreach v of varlist `list2' {
        quietly anova `v' cs`k'
        scalar ws`k' = ws`k' + e(rss)
    }
    matrix WSS[`k', 1] = `k'
    matrix WSS[`k', 2] = ws`k'
    matrix WSS[`k', 3] = log(ws`k')
    matrix WSS[`k', 4] = 1 - ws`k'/WSS[1, 2]
    if `k' > 1 {
        matrix WSS[`k', 5] = (WSS[`k'-1, 2] - ws`k')/WSS[`k'-1, 2]   // PRE is undefined for k = 1
    }
}
matrix list WSS
local squared = char(178)
_matplot WSS, columns(2 1) connect(l) xlabel(#10) name(plot1, replace) nodraw noname
_matplot WSS, columns(3 1) connect(l) xlabel(#10) name(plot2, replace) nodraw noname
_matplot WSS, columns(4 1) connect(l) xlabel(#10) name(plot3, replace) nodraw noname ytitle({&eta}`squared')
_matplot WSS, columns(5 1) connect(l) xlabel(#10) name(plot4, replace) nodraw noname
graph combine plot1 plot2 plot3 plot4, name(plot1to4, replace)
But I obtain no table or graph, and the system crashes so that I have to restart the computer. I repeated the run changing " ` " and " ´ " to " ' " and using another dataset, with the same results. I am a user of Stata 15.0.

Any idea about what is happening?

Thank you very much.
Jorge

Studentized deleted residuals and DFfits after logistic regression in Stata. How to calculate?

How can I calculate studentized deleted (externally studentized, jackknifed) residuals and DFFITS after performing logistic regression in Stata? The rstudent and dfits predict options are available only after regress but not after logit.
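
For what it is worth, logit postestimation does offer closely related (though not identical) diagnostics through predict, which may be enough depending on the goal (y, x1, x2 are hypothetical names):
Code:
logit y x1 x2
predict double rstd,  rstandard        // standardized Pearson residuals
predict double infl,  dbeta            // Pregibon's influence statistic
predict double lever, hat              // leverage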

Results Interpretation

Hi everyone,

Can anyone help me interpret these results, specifically the F-test below. What does this mean?

Thank you in advance!



Why do I have large z test statistics when I run the translog model?

First I ran the frontier model, and then I used the translog method. I obtained a very large Wald chi-square and large z test statistics, and my lnv2sig2v is not significant. Is this a problem? What should I do?



Replacing missing variable with other observations that satisfy certain conditions

Hi all,
I am working on a cross-country dataset. The dataset is created when I merge bilateral trade data with country characteristics (such as GDP and bilateral trade agreements). When I merge, I use the exporter ISO code, the importer ISO code and time as the identifiers. My master data is the trade data and my using data is the characteristics data. In the using data I rename the country of origin the exporter and the country of destination the importer. For instance, in the merged data I will have the trade value exported by country A to country B, the GDP of country A and of country B, and whether A and B have a trade agreement.

However, what the data misses is that when B exports to A I don't have the characteristics, because the using data already considers A as the origin and B as the destination (due to the way I rename the variables in the using data). I cannot merge the data one more time with A as the destination and B as the origin because of duplication (the free trade agreement variable would appear twice).

I was then thinking of replacing the missing observations when B exports to A with the values of the observation when A exports to B, thanks to symmetry. I checked on the forum, and what I can find requires knowing the ID of the observation (i.e. replace GDP = GDP[6]). In my case, I only know that the observation satisfies certain conditions: replace GDP = GDP of the observation whose Exporter equals my Importer and whose Importer equals my Exporter.

To illustrate my point, here is an example:

Before:

Exporter Importer Exporter_GDP Importer_GDP Trade value
A B 100 200 50
B A . . 70

After

Exporter Importer Exporter_GDP Importer_GDP Trade value
A B 100 200 50
B A 200 100 70

Could you please give me some suggestions?

Many thanks,

Gia Cat Luong
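
A sketch of the symmetry idea via a self-merge, assuming the time variable is named year (hypothetical) and each Exporter-Importer-year combination appears only once:
Code:
preserve
keep Exporter Importer year Exporter_GDP Importer_GDP
rename (Exporter Importer Exporter_GDP Importer_GDP) (Importer Exporter Importer_GDP_m Exporter_GDP_m)
tempfile mirror
save "`mirror'"
restore
merge 1:1 Exporter Importer year using "`mirror'", keep(master match) nogenerate
replace Exporter_GDP = Exporter_GDP_m if missing(Exporter_GDP)
replace Importer_GDP = Importer_GDP_m if missing(Importer_GDP)
drop Exporter_GDP_m Importer_GDP_m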

Collapse different columns differently

In a panel database of companies (bisnode), the revenue information is not always for the full year...sometimes it's quarterly, for example...

So I want to collapse rows such that rows belonging to the same company in the same year are merged. Not just that: I want most columns to be averaged, but 2 columns to be aggregated (summed).

Can I even run collapse such that it runs differently (average v sum) on different columns?
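
Yes: collapse accepts a different statistic for each group of variables. A sketch with hypothetical variable names:
Code:
collapse (mean) assets employees costs (sum) revenue profit, by(company_id year)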

Machine Learning setup

How can I break down my loaded dataset into training set and test set, and develop random forest on the training set, calculating fit for both the training set and test set?
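
A sketch of the split step, with the model fitting left to a community-contributed package (for example rforest from SSC), since Stata has no built-in random forest:
Code:
set seed 12345
generate double u = runiform()
generate byte train = u < 0.70          // 1 = training set (70%), 0 = test set (30%)
drop u
* fit the forest on the rows with train == 1 (e.g. with -rforest-, ssc install rforest),
* predict for the full dataset, and compare fit separately for train == 1 and train == 0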

Wrangling panel data - calculating growth rates and cagr

I am working with bisnode panel data https://www.bisnodegroup.com/solutio.../company-data/ which has ID, year, and company revenue info.

I would like to calculate Year-on-year revenue growth (i.e. growth compared to previous year) and CAGR (Compounded Annual Growth Rate) over a multiple year period https://www.investopedia.com/terms/c/cagr.asp

I have never worked with panel data before, so I am not sure how to do it; I've only ever done it for time-series data (i.e. a single company).

The other thing I want to do is remove all companies that at some point had revenues but had zero revenues in a later year. (If they had zero revenues in the beginning but later had positive revenues, then I don't want to remove them.)

Any suggestions/ideas/advice?
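
A sketch of all three pieces, assuming a panel identifier id, a year variable year and a revenue variable revenue (hypothetical names), and defining CAGR from each company's first observed year:
Code:
xtset id year
generate double yoy_growth = (revenue - L.revenue) / L.revenue        // year-on-year growth

bysort id (year): generate double first_rev = revenue[1]
bysort id (year): generate nyears = year - year[1]
generate double cagr = (revenue/first_rev)^(1/nyears) - 1 if nyears > 0 & first_rev > 0

* drop companies that had positive revenue at some point but zero revenue in a later year
bysort id (year): generate had_pos = sum(revenue > 0 & !missing(revenue))
generate byte bad = revenue == 0 & had_pos > 0
bysort id: egen byte drop_firm = max(bad)
drop if drop_firm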

Wednesday, January 30, 2019

Comparing predictions and regression fitted values between two regression models with an additional explanatory variable

Dear Statalisters,

I am struggling with a task in which I want to investigate how an additional variable (i.e., human rights) affects my regression results.

My regression model is as follows:

Code:
regress DV IV HR
where DV represents the dependent variable and IV represents the independent variables. Human rights (HR) is my key explanatory variable, and I am interested in looking at how human rights affects the explanation of my results.

Task:

1. Generate a set of predictions for a model in which I use all the explanatory variables except human rights with the coefficients estimated over the whole sample.

Code:
regress DV IV
Code:
predict foreignaid
2. Compare these predictions with the regression's fitted values (which take into account all the predictors, including human rights)

Code:
regress DV IV HR


3. Calculate the % difference between the first two measures to see how much of the dependent variable is explained by human rights.

Now, I am struggling with the second and third step in Stata. I am also wondering whether my codes for the first two steps are okay.

Any help will be appreciated.

Best regards,
Shazmeen Maroof.
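
A sketch of steps 2 and 3, keeping the placeholder names DV, IV and HR from the post (the "% difference" is expressed here relative to the restricted model's prediction, which is only one of several possible definitions):
Code:
regress DV IV
predict double yhat_noHR                 // predictions without human rights
regress DV IV HR
predict double yhat_full, xb             // fitted values from the full model
generate double pct_diff = 100 * (yhat_full - yhat_noHR) / yhat_noHR
summarize pct_diff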





Importing previously imputed data using mi import

Hello Stata Users,
I have been trying to import previously imputed data (m = 0, 1, 2, ..., 20) using the mi import command. The data set includes the original un-imputed data (m=0), which has missing values, and 20 imputed datasets (m = 1, 2, 3, ..., 20). Below is my code:

mi import flong, m(imp_number) id(record_id) imputed(v1 v2 v3 v4 v5 v6)

When I run the above code it returns with an error message.

“variable record_id has invalid values
record_id takes on at least one value if imp_number>0 that it does not if imp_number==0”


However, the same code works if I delete the original data and import only the imputed datasets. But the problem is that it then treats the first imputed dataset as the original (m=1 is read as m=0) and keeps only the remaining 19 imputed datasets, which is not correct.

Can anyone please help me in importing the data with the original included?

Thanks in advance.
Baker

Microsoft Organizational Structure: Divisional Structure with Focus on Innovation

Microsoft organizational structure can be classified as divisional. In June 2015, the senior management announced a change in Microsoft organizational structure to align to its strategic direction as a productivity and platform company. This restructuring initiative resulted in elimination of … Continue reading

The post Microsoft Organizational Structure: Divisional Structure with Focus on Innovation appeared first on Research-Methodology.


Microsoft Leadership: A New Era for Multinational Technology Company

Co-founder of the company, Bill Gates was at the helm of Microsoft leadership since its inception in 1972 until 2000, when Steve Ballmer succeeded him as CEO. While Steve Job’s leadership was rightly regarded as successful, Steve Ballmer was pointed … Continue reading

The post Microsoft Leadership: A New Era for Multinational Technology Company appeared first on Research-Methodology.


Scoring measures using STATA

I have a dataset that includes items measuring diagnostic criteria for personality disorders. For each criterion there are multiple items. For some of the criteria, respondents must respond yes to more than two items in order to meet that criterion. A criterion may have 5 items, and thus multiple combinations of items would satisfy that criterion. Does anyone have experience scoring measures using Stata? Is there a way to include multiple combinations of items in an "if" statement without listing all combinations? Please let me know if any of this is unclear.
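
One way to avoid enumerating combinations is to count the endorsed items and compare the count to the threshold. A sketch, assuming five yes/no items item1-item5 coded 1/0 (hypothetical names) and the "more than two" rule described above:
Code:
egen n_yes = rowtotal(item1-item5)
generate byte criterion_met = n_yes > 2        // met when more than two items are endorsed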

Problem with nlsur command

Dear Stata users,

I am currently running the nlsur command to estimate a system of 2 equations using the following syntax:

Code:
nlsur (c_fsinc = {a0} + {a1}*c_stacc + {a2}*c_stocf)(c_sizeret = {beta1}*(c_fsinc -{a0s}-{a1s}*c_stacc - {a2s}*c_stocf)), vce(cluster time)
The equation (1) is the forecasting equation: c_fsinc = {a0} + {a1}*c_stacc + {a2}*c_stocf
The equation (2) is the pricing equation: c_sizeret = {beta1}*(c_fsinc - {a0s} - {a1s}*c_stacc - {a2s}*c_stocf)

and this is what I got:

Code:
Calculating NLS estimates...
could not evaluate equation 1
starting values invalid or some RHS variables have missing values
r(480);
I am new to Stata and I am not sure how to fix this error. As I know that Stata excludes missing values when regressing, I guess it might have something to do with the initial values.

I am not sure about how to proceed from here. Any help and suggestion will be very much appreciated.

Sincerely yours,

Khanh

This is the data generated by -dataex-:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(unit_id time) float(c_fsinc c_stacc c_stocf c_sizeret)
1 2000   11.095           .   44.008   3.044987
1 2001    5.583   -.0874317    5.224  -.1652283
1 2002    3.861   .29353163  -42.801  -.4231709
1 2003   -4.064  -.15650004  10.2003 -.54456735
1 2004  -13.577   .02961883   -8.937 -.29095754
1 2005    .6989  -.27193496    7.805   3.436494
1 2006  -9.2673  -.14769301   4.6517   3.818903
1 2007 -11.1224  -.14354037  -8.7062   5.257833
1 2008   2.4688   -.0722658  -1.1474  3.1859944
1 2009  11.6241   -.1986534  45.4037   1.705382
1 2010   2.9899   .11853852 -11.8675   .1003933
1 2011   2.0053  -.05195754  19.3657  -.4316798
1 2012  -1.0334   .16627726  17.5627 -.09248185
1 2013   3.8014   .02308112 -21.0423   .3205987
1 2014   1.2471   .09067812  -5.5341  .23287398
1 2015  39.2993  .003675696   9.4619    1.84437
1 2016   8.5667   -.2839655  140.423 -.12539318
1 2017        .  -.12699409  14.4395  -.6051657
2 2000 -124.622   .20785487   9.1339  1.8036478
2 2001   -10.71   -.3467585  -5.3689  -.8658409
2 2002     .987    .0822746     .317  -.3857885
2 2003   38.588     .019928   15.924 -.29408288
2 2004   -5.266    .4997058   -9.562  .10540456
2 2005  -1.7561  -.24746884     14.5  3.3120015
2 2006   1.6822   .10150523   6.1509   3.854261
2 2007     .748  -.15123634  12.9634     5.8056
2 2008    .3135   .08393575   1.4254   3.184806
2 2009    .8889  .034479022   4.1066   5.606124
2 2010  33.9895  -.14203873    -.385  -.1284118
2 2011   39.129  -.25330418  103.386 -.04299143
2 2012   22.758  -.04573207  86.7555   3.895035
2 2013   7.7678  -.02250768  66.8851    .845217
2 2014  185.671  -.08318695  69.7514   5.885801
2 2015  521.866    .3290931 -62.7312   7.409277
2 2016  873.781   .18091014 -39.1415   4.887175
2 2017        .    .1497909 -15.7663   4.142799
3 2000  26.6242  -.20521878  -33.397   5.215007
3 2001  13.5809  -.05240782  240.861   4.782817
3 2002    12.91   .04881822   14.756   5.120404
3 2003   16.844   .04537622  -39.653   4.996433
3 2004  71.8039  -.05788913   43.242   4.856194
3 2005  135.634   -.0755362  429.297  4.0577707
3 2006  230.389  -.06104685  129.678    5.03461
3 2007  200.381   .05176217  -232.24   7.855485
3 2008   253.46   .04361181  554.182  3.6344376
3 2009  325.476  -.04361504  515.166   6.992318
3 2010  269.923    .1736856 -654.929    5.62332
3 2011  160.461    .1279638 -1120.65  4.7459702
3 2012    292.4  .014459985 -72.1783   4.903387
3 2013  303.325 -.005289631 -299.214   5.330762
3 2014  750.495 .0012534356  170.046   6.734075
3 2015  233.384    .1027608 -383.128   5.482848
3 2016  133.204   .12325507 -388.045   4.874717
3 2017        .   .01924204 -160.627   4.081945
4 2000    1.755           .  -28.226   .7562572
4 2001   12.143           .   22.629  -.3611049
4 2002    7.031  .021934465   50.347   3.599008
4 2003  -14.997    .1994297 -157.781   3.511429
4 2004 -81.2912   -.1859179  -14.998  3.6600254
4 2005   4.1603   -.2982836   3.5313     3.4637
4 2006 -28.8689   .16106696   3.8836  1.0714513
4 2007  15.8154  -.15661116  22.3851  2.0610507
4 2008   7.3919  .070336886   3.6043  -.7715604
4 2009  -1.0431  .001318937  12.0977   1.194843
4 2010   5.0148  .011088284 -10.3319  .12995148
4 2011  37.8202  .016581232   5.0777  .16719593
4 2012   3.1401   .22159834  -8.2186   1.892415
4 2013 -74.5532  .031706214    .7513   3.841636
4 2014  33.4436   -.0662543 -118.064  1.5790174
4 2015  39.8771    .3191679 -410.522  2.3134086
4 2016 -1061.43   .05887472 -182.278    1.70993
4 2017        .   .04516191   513.08   3.060521
5 2000  153.207   -.4568579  292.581   7.921634
5 2001  162.802  -.04799785  308.812   3.682178
5 2002  202.632  -.04271837  285.899   2.334574
5 2003  337.806  -.04614428   395.42   2.427877
5 2004  316.411  -.05078634  532.241  4.3948708
5 2005  332.112  -.04960303  532.241   3.995865
5 2006  431.485  -.18311357  665.837   6.477026
5 2007   420.08   .04067595  906.812  3.5393586
5 2008  831.944  -.07130431  800.488  2.0022154
5 2009  1455.21   -.0417543  1502.44   5.509697
5 2010  1178.23  -.02768715  2229.47   4.916965
5 2011  274.746  -.05705725  1461.88   2.061875
5 2012  1535.93  -.04438705  1444.83  4.1144814
5 2013  873.653  -.02075426  1456.53  2.5898175
5 2014  532.653  -.06572313  1196.18   .9744867
5 2015  797.722  -.04347397  1096.23  3.1049085
5 2016  825.388  -.03354574  1956.05  2.4525535
5 2017        .  -.07137262  2115.06   2.308459
6 2000 -707.018     .358442  229.842   .9246945
6 2001   43.123  -.29042542  1805.72  .42337635
6 2002  101.071  -.06911801  963.539    .701834
6 2003  142.549   .18138306  345.782   .8659249
6 2004   19.552   .08003017 -354.695   .6883366
6 2005  96.7749  -.05598192  -81.535   .5126982
6 2006  207.092   .06734498  178.829  1.0426645
6 2007   267.68  -.05472161 -25.5691  2.0497851
6 2008  151.671  -.08230114  277.008   .5703058
6 2009  83.9479 -.017399611  263.371  2.3021162
end
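
One thing that may be worth trying (only a guess from the example data, where c_fsinc is missing in the last year of every unit and also appears on the right-hand side of the pricing equation) is to make the estimation sample explicit:
Code:
nlsur (c_fsinc   = {a0}  + {a1}*c_stacc  + {a2}*c_stocf)                       ///
      (c_sizeret = {beta1}*(c_fsinc - {a0s} - {a1s}*c_stacc - {a2s}*c_stocf))  ///
      if !missing(c_fsinc, c_stacc, c_stocf, c_sizeret), vce(cluster time)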

How is the significance in Stata's pwcorr calculated?

When I run pwcorr with the sig option, how does Stata test the significance of the correlation coefficient? How is it calculated?
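
As far as I know, the p-value reported by pwcorr, sig comes from the usual t test of a correlation coefficient: t = r*sqrt(n-2)/sqrt(1-r^2) on n-2 degrees of freedom. A small illustration with the auto data:
Code:
sysuse auto, clear
quietly correlate price weight
local r = r(rho)
local n = r(N)
local t = `r' * sqrt(`n' - 2) / sqrt(1 - `r'^2)
display "two-sided p-value = " 2*ttail(`n' - 2, abs(`t'))
pwcorr price weight, sig          // for comparison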

Propensity score weighting on samples

I want to weight my sample with propensity scores first, then run OLS with weighted sample. I know
PHP Code:
teffect 
does this job by combining the two steps. However, I want to get a descriptive-statistics comparison between the non-weighted sample and the weighted sample. Does anyone know the command for the first step, which does the weighting of the sample? Thank you very much.
S
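
A sketch of the "manual" first step using inverse-probability weights (variable names hypothetical), after which both the descriptive comparison and the weighted OLS can use the same weight:
Code:
logit treat x1 x2 x3
predict double ps, pr
generate double ipw = cond(treat == 1, 1/ps, 1/(1 - ps))     // ATE-style weights

summarize x1 x2 x3 [aweight = ipw] if treat == 1
summarize x1 x2 x3 [aweight = ipw] if treat == 0
regress y treat x1 x2 x3 [pweight = ipw]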

Statistical comparison between 6 groups with unequal variance and 1 observation.

Dear Statalists,

I am analyzing a dataset which includes two variables, "tech changing rate" and "Group". "Group" is a categorical variable from 1 to 6, which means that there are 6 groups. I am trying to see whether different types of groups have different rates of change by performing a statistical comparison test. My issues are:
1) Variances between the groups are not equal
2) Samples are not independent; the way I categorize groups made each group dependent.
3) Group 4 only has one observation. Group 1 has 170, Group 2 has 250, Group 3 has 700, Group 5 has 30, and Group 6 has 90.

What I have concluded so far is that I cannot use ANOVA, since the variances and group sizes vary. Also, since the groups are not independent, I am guessing that I have to use the Friedman test, but I am not sure of this. Can anyone share an idea of how I should perform statistical difference tests between these 6 groups?

I appreciate your advice!!

Thanks!

Looking for a US database for tuition fees

Hi,

I'm currently working on research on higher education in the United States. I have searched for a database with the average tuition cost per institution (higher education), hopefully for the 1980-2015 period, but I have found nothing, so if anyone could point me in the right direction it would be much appreciated.

Thanks, and have a beautiful day.

Error "initial values not feasible" for multiple imputation.

Dear experts,

I'd like to ask for your help with syntax.
I am doing a multiple imputation as below, but I get an error "initial values not feasible error occurred during imputation of choice on m = 1"
Would you let me know how to fix this?

Thank you
Sandy


mi set flong

mi register imputed choice

mi register regular term mexempt rexempt wexempt alast hlast ///
casphase cascaa sat_total sat_math sat_verb sat_write ///
cjmajor gender

mi set M=10

mi impute truncreg choice = term mexempt rexempt wexempt alast hlast ///
casphase sat_total cascaa sat_verb sat_math sat_write ///
gender cjmajor, replace ll(1) ul(3)






Geographic Regression discontinuity

Hello guys,
I am comparing the economic well-being of groups of people residing on opposite sides of a certain border. I believe there is a jump at the border, and I want to represent this graphically using a regression discontinuity design. Is there anyone who has experience with this and would like to help? Also, I want to see how living conditions decline as one moves closer to the border.

Weighting without knowing psu

Dear Stata users,

I want to conduct some cross-sectional analysis with data from the South African NIDS (National Income Dynamics Study)/household survey Wave 5.
I struggle with weighting and would be very grateful about help!

It is a two-stage sampling with stratification at the district council level. The dataset provides a design weight (correcting for nonresponse) and a post stratified weight (calibrating for sex, age, race).
My first idea was to process the weighting by

svyset psu [pw=poststatWeight], strata(districtVariable)

Now, I have two problems. First, the psu variable is not included in the dataset but in a secured dataset which I cannot access (I tried to get in touch, but that does not seem possible). Second, it appears to me that I have to use another command, since I am using a post-stratification weight. I considered the following command:

svyset [pweight=wt], poststrata(groupVariable) postweight(poptots)

Nevertheless, I am not sure whether I understand the command correctly (also because the command considers just one calibrating variable): Do I just insert my sex, age and race variables as the "groupVariable", and how can I deal with the population-totals variable, since such variables are not provided by the dataset? I studied all the manuals and weight descriptions of NIDS (in former waves a psu variable is given) but could not figure out how I am supposed to weight. In a documentation about cluster correction in the dataset, it says "we should at minimum svyset households as our “cluster” variable", but I assume this refers to cluster as a Stata option rather than as the psu. I would be thankful for any hint!

Way to identify first successful loop iteration?

Hi Stata,

I have a loop:

forv x = 1/100{
cap{
[stata_commands]
[if first successful iteration, execute command x]

}
}

How can I run a command only during the first successful iteration of the loop? Any ideas?

Thanks in advance.

-Reese

p.s. I'm using v 14.2
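
A sketch using a local flag that is only set once the protected block has run through successfully:
Code:
local done = 0
forvalues x = 1/100 {
    capture {
        * [stata_commands]
        if !`done' {
            * [command to run only on the first successful iteration]
            local done = 1
        }
    }
}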

-mimrgns- and -marginsplot-

The help file for mimrgns states that while "[i]n principle, marginsplot works after mimrgns […], the plotted confidence intervals are based on inappropriate degrees of freedom". However, it also suggests that "the differences should be small for large sample sizes". In this post, I demonstrate how to calculate these differences.

I borrow an example from Richard Williams’ excellent paper on Multiple Imputation & Maximum Likelihood (https://www3.nd.edu/~rwilliam/stats3/MD02.pdf). We start with an example dataset where we impute missing values in one variable. Then, we run a logistic regression model and obtain predictive margins.*

Code:
version 12.1
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi register regular attack smokes age hsgrad female
mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2232)
mi estimate, dots: logit attack i.smokes age bmi i.hsgrad i.female
mimrgns smokes, at(bmi = (20(5)35)) predict(pr) cmdmargins
Omitting most of the output, the results for the mimrgns command are

Code:
(output omitted)
. mimrgns smokes, at(bmi = (20(5)35)) predict(pr) cmdmargins

Multiple-imputation estimates                     Imputations     =         20
Predictive margins                                Number of obs   =        154
                                                  Average RVI     =     0.0419
                                                  Largest FMI     =     0.1500
DF adjustment:   Large sample                     DF:     min     =     866.55
                                                          avg     =   41276.27
Within VCE type: Delta-method                             max     =  191492.08

Expression   : Pr(attack), predict(pr)

1._at        : bmi             =          20

2._at        : bmi             =          25

3._at        : bmi             =          30

4._at        : bmi             =          35

------------------------------------------------------------------------------
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _at#smokes |
        1 0  |   .2132581   .0594192     3.59   0.000     .0967207    .3297954
        1 1  |   .4716674   .0832583     5.67   0.000     .3084512    .6348835
        2 0  |   .3239331   .0498932     6.49   0.000     .2261437    .4217226
        2 1  |   .6118189   .0617438     9.91   0.000     .4908022    .7328357
        3 0  |   .4585168   .0750399     6.11   0.000     .3113833    .6056503
        3 1  |   .7356222    .071637    10.27   0.000     .5951187    .8761257
        4 0  |   .5985056   .1179053     5.08   0.000     .3671509    .8298602
        4 1  |    .830356   .0798759    10.40   0.000      .673583    .9871289
------------------------------------------------------------------------------
Now, if we were to call marginsplot, the plotted confidence intervals would not match what mimrgns reports. How large would the differences be? To find out, we first store mimrgns' results in a matrix. More precisely, we store the coefficients, standard errors, and confidence limits. Those results are stored in rows 1, 2, 5, and 6 of r(table). Since we will later store these results as variables, we transpose the matrix.

Code:
matrix rtable = r(table)
matrix rtable  = (rtable[1..2, 1...]\ rtable[5..6, 1...])'
Now, that we have stored mimrgns' results, we will replicate the file that marginsplot would use. To do so, we need to dig a little deeper into marginsplot's internals. marginsplot internally calls margins with its undocumented saving() option. Undocumented means that there is an online help file but no corresponding entry in the manual. More importantly, it means that this option might behave differently in the future, even under version control. Anyway, margins, with the saving() option, calls a utility routine, _marg_save, which is not documented. Not documented means that there is neither an online help file nor an entry in the manual. The code is implemented as an ado-file, so we can learn how it works. Obviously, it might work differently in the future.

Code:
_marg_save , saving(mimrgns_results , double)
We can now load the created file and have a look at some of the variables

Code:
use mimrgns_results, clear
list _margin _se _ci_lb _ci_ub , noobs separator(0)
The output is

Code:
. list _margin _se _ci_lb _ci_ub , noobs separator(0)

  +-----------------------------------------------+
  |   _margin         _se      _ci_lb      _ci_ub |
  |-----------------------------------------------|
  | .21325808   .05941922   .09679856    .3297176 |
  | .47166739   .08325828   .30848416   .63485062 |
  | .32393311   .04989317   .22614429   .42172193 |
  | .61181892   .06174378   .49080334    .7328345 |
  | .45851679   .07503986   .31144138   .60559221 |
  | .73562219   .07163704   .59521618    .8760282 |
  | .59850555   .11790533   .36741535   .82959576 |
  | .83035595   .07987593   .67380201   .98690989 |
  +-----------------------------------------------+
At first glance, it appears as if the results match those from mimrgns. To be sure, we will save the mimrgns results in the same dataset and calculate the respective differences.

Code:
svmat double rtable , names(col)
local vars _margin b _se se _ci_lb ll _ci_ub ul
forvalues i = 1(2)8 {
    local var1 : word   `i' of `vars'
    local var2 : word `++i' of `vars'
    generate double diff_`var2' = abs(`var1'-`var2')
}
Finally, we can look at the differences.

Code:
list _margin b diff_b _se se diff_se _ci_lb ll diff_ll _ci_ub ul diff_ul , noobs separator(0)
which shows

Code:
. list _margin b diff_b _se se diff_se _ci_lb ll diff_ll _ci_ub ul diff_ul , noobs separator(0)

  +------------------------------------------------------------------------------------------------------------------------------------------+
  |   _margin           b   diff_b         _se          se   diff_se      _ci_lb          ll     diff_ll      _ci_ub          ul     diff_ul |
  |------------------------------------------------------------------------------------------------------------------------------------------|
  | .21325808   .21325808        0   .05941922   .05941922         0   .09679856   .09672075   .00007781    .3297176   .32979541   .00007781 |
  | .47166739   .47166739        0   .08325828   .08325828         0   .30848416   .30845124   .00003292   .63485062   .63488354   .00003292 |
  | .32393311   .32393311        0   .04989317   .04989317         0   .22614429   .22614367   6.181e-07   .42172193   .42172255   6.181e-07 |
  | .61181892   .61181892        0   .06174378   .06174378         0   .49080334   .49080216   1.180e-06    .7328345   .73283568   1.180e-06 |
  | .45851679   .45851679        0   .07503986   .07503986         0   .31144138    .3113833   .00005808   .60559221   .60565029   .00005808 |
  | .73562219   .73562219        0   .07163704   .07163704         0   .59521618   .59511871   .00009746    .8760282   .87612566   .00009746 |
  | .59850555   .59850555        0   .11790533   .11790533         0   .36741535   .36715088   .00026447   .82959576   .82986022   .00026447 |
  | .83035595   .83035595        0   .07987593   .07987593         0   .67380201   .67358305   .00021897   .98690989   .98712886   .00021897 |
  +------------------------------------------------------------------------------------------------------------------------------------------+
We find that the point estimates and the standard errors match exactly; the differences are exactly 0. As stated in the help file, the confidence intervals do not match exactly. In this example, the differences are in the 4th decimal place or later; we would not be able to spot such differences in a graph.

Note that there is no guarantee that the differences will always be that small. However, perhaps the warning in the help file overstates the problem. Anyway, you now know how to check when in doubt.

Best
Daniel


* I have modified the mimrgns call to include the imputed variable in the at() option. The observed differences for the confidence intervals are even smaller (virtually 0) in the original example.

Confusion about how to keep one row of data for each student with the most number of classes taken in a program

I am using Stata 15.1 for Windows.

I would like to assign each student to the program in which he concentrated the most. If a student took 1 class in the English program, 2 classes in Government, 4 classes in Health Science, and 1 class in Finance, I would want to consider him a Health Science concentrator only, since he took the most classes in that program (example below). I have millions of rows of data, and this happens to a lot of students: they took courses in different programs, or in programs that overlap, and I want to keep each of them in the program in which they took the most classes. Note also that some students take a level 2 and a level 3 course but no level 1 course, and some take more than 4 courses.

Example:
This is just an example of 1 program (Finance):

gen level_finplan = 1 if inlist(course_code, 5905, 3709, 3638, 3721, 5891)
replace level_finplan = 2 if inlist(course_code, 3496, 3767, 5901, 3749, 3751, 5898)
replace level_finplan = 3 if inlist(course_code, 3701, 5910)
bysort studentid : egen mx_finplan = max(level_finplan)
replace level_finplan = 4 if inlist(course_code, 3713) & level_finplan == . & mx_finplan !=4
drop mx_finplan
bysort studentid : egen mx_finplan = max(level_finplan)
replace level_finplan = 4 if inlist(course_code, 5890) & (mx_finplan == 3 | mx_finplan == 2 | mx_finplan == 1 | mx_finplan == .) & level_finplan == .
replace level_finplan = 5 if inlist(course_code, 5890) & mx_finplan == 4

//Note: level 1 course = 5905, 3709, 3638, 3721, 5891
//level 2 course = 3496 ...
//and so on..

Code:
            Course1   Course2   Course3   Course4   Program
Student A   2050      3590      1309      2549      Health Science
Student A   2040                                    English
Student A   2890      4030                          Government
Student A   3767                                    Finance

How would I start? I only want 1 row per student.
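For concreteness, here is a rough sketch of the direction I have been considering, based on the wide example above (it assumes numeric course codes in Course1-Course4 and the studentid variable from my code; ties between programs would still need a rule):

Code:
* rough sketch: count non-missing courses in each student-program row,
* then keep the row with the largest count per student
* (add the strok option to rownonmiss() if the course codes are strings)
egen nclasses = rownonmiss(Course1 Course2 Course3 Course4)
bysort studentid (nclasses): keep if _n == _N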

Use past quarter average of data to regress with next quarter data.

Hello all,

I am a newbie to Stata and struggling with a peculiar problem; hoping for some insight from you all.

I am looking to see the effect of the past 3-quarter average of Profit on Share price.

I want to do a simple regression of:

Y = a + b x + e

Y is Profit and X is share price; both are values we have in the data.

But I want to take the simple average of Y over the past 3 quarters and relate it to the current-period value of X: if we are in quarter t, Y is the average profit over the past 3 quarters, (Y(t-3) + Y(t-2) + Y(t-1))/3, and X is the stock price in quarter t, i.e. the current quarter.

I want to do this for all the stocks in my datasheet. The data is in rows for each quarter (kindly see below) for each particular stock.

My data is in the following format: the stock ID and the quarterly data run sequentially. So, for example, I want to regress the average profit over quarters Q1-Q3, (Q1+Q2+Q3)/3, on the share price in Q4.
Code:
Stock_ID      Date  Fiscal_Qtr  Profit_Y  Price_X
1091      20170228      2017Q1     14.4     34.42
1091      20170531      2017Q2     16.3     34.94
1091      20170831      2017Q3     11       36.06
1091      20171130      2017Q4     13.3     41.58
1256      20170228      2017Q1     480      36.6
1256      20170531      2017Q2     863      42.96
1256      20170831      2017Q3     942      35.48
1256      20171130      2017Q4     597      53.63
2891      20170228      2017Q1     3.021    16.59
2891      20170531      2017Q2     4.493    15.59
2891      20170831      2017Q3     3.703    13.4
2891      20171130      2017Q4     1.86     15.54
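To make the question concrete, here is a sketch of what I have in mind (it assumes Fiscal_Qtr is a string such as "2017Q1"; I have written the share price as the outcome and the lagged profit average as the regressor, but the roles can be swapped if the specification should run the other way):

Code:
* sketch: set up the quarterly panel
gen qdate = quarterly(Fiscal_Qtr, "YQ")
format qdate %tq
xtset Stock_ID qdate

* average profit over the previous three quarters
gen avg_profit = (L1.Profit_Y + L2.Profit_Y + L3.Profit_Y) / 3

* relate the current-quarter share price to that average
regress Price_X avg_profit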
Thanks for your attention,
Jonathan

problems with moving average - panel data

Hi everyone,
I am using Italian administrative longitudinal data covering 10 years. Since I would like to understand how household income has evolved over time in my dataset, I thought of resorting to a moving average. My idea was therefore to take the average household income in each year and then generate the moving average. However, I am having a lot of problems: I am not able to use the ac command, even though I have installed the user-written command "PANELAUTO", and when I use tsgraph, nothing appears in the graph. Is there anyone who could help me?
I post here an example of the dataset at my disposal:

Code:
input long ID str8 ID_hh float(year hh_income)
1 "01" 2006 16590.01
1 "01" 2007 17380.10
1 "01" 2008 17600.91
1 "01" 2009 18111.36
1 "01" 2010 20217.31
1 "01" 2011 20300.63
1 "01" 2012 23567.38
1 "01" 2013 22698.79
1 "01" 2014 23009.00
1 "01" 2015 23090.34
1 "01" 2016 23250.55
2 "01" 2006 16590.01
2 "01" 2007 17380.10
2 "01" 2008 17600.91
2 "01" 2009 18111.36
2 "01" 2010 20217.31
2 "01" 2011 20300.63
2 "01" 2012 23567.38
3 "02" 2008 28656.31
3 "02" 2009 29360.20
3 "02" 2010 26003.27
3 "02" 2011 25322.36
3 "02" 2012 25210.16
3 "02" 2013 24200.56
3 "02" 2014 25300.36

end
where ID is the individual identifier, ID_hh is the household identifier, and hh_income is the household income (per year).
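In case it clarifies what I am trying to achieve, here is a sketch of the steps I have in mind (household-year duplicates are kept only once before averaging, and the smoothing window is just an example):

Code:
* sketch: one observation per household-year, yearly mean income,
* then a simple centred moving average and a time-series plot
preserve
bysort ID_hh year: keep if _n == 1
collapse (mean) hh_income, by(year)
tsset year
tssmooth ma hh_income_ma = hh_income, window(1 1 1)
tsline hh_income hh_income_ma
restore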
Thank you a lot in advance,
Andrea

Dropping variables in the batch structure

My dataset contains ~4000 variables from a survey that collected data on up to 10 members of the same household (AGE_01, AGE_02, ..., AGE_10; SEX_01, SEX_02, ..., SEX_10; BMI_01, BMI_02, ..., BMI_10).

Is it possible to keep only the variables for the FIRST member (AGE_01, SEX_01, BMI_01) by dropping the rest?
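If it helps, this is the kind of one-line solution I was hoping for (a sketch; it assumes the first member's variables all carry the suffix _01, and any household-level identifiers without that suffix would need to be listed explicitly as well):

Code:
* sketch, assuming the first member's variables end in _01
keep *_01
* or drop the later members instead (this keeps any unsuffixed variables)
* drop *_02 *_03 *_04 *_05 *_06 *_07 *_08 *_09 *_10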

Principal component regression using the multinomial logit model

Hello everybody.

I have a few questions on the principal component regression. The latter consists of three steps which are well summarized here: https://en.wikipedia.org/wiki/Princi...ent_regression


1 - I was wondering how to compute back the coefficient estimates for the original regressors (third step). To my understanding, the coefficients of the original explanatory variables should be derived from the eigenvectors (loadings) that link each original variable to the retained principal components; however, I am struggling to perform this transformation in Stata (see the sketch after these three questions).

2 - The second question is more methodological: in step 2, do you think it is feasible to use the multinomial logit estimator rather than OLS? Or would this make step 3 impossible, since the non-linear nature of the mlogit estimator would complicate things too much?

3 - Last, I wanted to ask whether, in the second step (OLS or mlogit estimation), one could also use additional regressors besides the principal component scores obtained from the PCA. I am afraid, though, that this would prevent deriving the coefficient estimates in the third step, since step 2 specifies that the estimated regression coefficients should have "dimension equal to the number of selected principal components".
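To make question 1 concrete, here is a rough sketch of the back-transformation I am trying to perform (the names y and x1-x5 and the choice of two components are hypothetical; it assumes the default PCA on the correlation matrix, so the recovered coefficients refer to the standardized regressors):

Code:
* rough sketch of principal component regression with back-transformation
pca x1 x2 x3 x4 x5, components(2)
matrix V = e(L)                    // eigenvectors (loadings) of the retained components
predict pc1 pc2, score             // component scores
regress y pc1 pc2                  // step 2 with OLS
matrix gamma = e(b)
matrix beta = V * gamma[1, 1..2]'  // coefficients on the standardized original variables
matrix list beta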

Thank you very much to whoever would like to help me clarify these three doubts.

Kodi

Reshaping multiple variables in one dataset using STATA

Good morning,

I would like to reshape my data from wide to long. I have multiple variables within the dataset that I would like to do this for. Each variable has a different number of iterations, that is, one variable <breed> has four iterations, the variable <caontrol> has 7 iterations and so on. Please see the dataset provided below.

I have used the following commands to create an id variable and reshape one variable, where "seq" is a sequence number:

Code:
gen id = _n
reshape long caontrol, i(id) j(seq)

But this also changes the <id> variable such that it cannot be used to reshape the other variables.

My main question is: is there a way to reshape multiple variables with different numbers of iterations using a single command? And if not, what would you recommend?

Thank you in advance.




Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str2(breed1 breed3 breed4) byte(caontrol1 caontrol2 caontrol3 caontrol4 caontrol5 caontrol6 caontrol7 wildcon1 wildcon2 wildcon3)
"PB" "MB" "MB" 1 0 1 1 1 1 1 1 0 0
"MB" "MB" "MB" 0 1 0 0 0 0 0 0 1 1
"PB" "PB" "PB" 1 1 1 1 1 1 1 0 1 1
"PB" "MB" "MB" 1 1 0 0 0 0 0 0 1 1
"PB" "MB" "MB" 1 1 1 1 1 1 1 0 1 1
end
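Using the example data above, this is the kind of single reshape I was hoping would work (a sketch; my understanding is that stubs with fewer iterations, such as breed, would simply get missing values for the unused values of seq, but I may be wrong):

Code:
* sketch using the example data above
gen id = _n
reshape long breed caontrol wildcon, i(id) j(seq)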

Preparing household data

Dear Stata list

My data has the following format:

Code:
clear
input householdID    personID    personinHHID    personsfatherID
1001    5001    1    .
1001    5002    2    .
1002    5003    1    .
1002    5004    2    .
1002    5005    3    1
1003    5006    1    .
1003    5007    2    .
end

list, sepby(householdID) abbrev(20)

     +---------------------------------------------------------+
     | householdID   personID   personinHHID   personsfatherID |
     |---------------------------------------------------------|
  1. |        1001       5001              1                 . |
  2. |        1001       5002              2                 . |
     |---------------------------------------------------------|
  3. |        1002       5003              1                 . |
  4. |        1002       5004              2                 . |
  5. |        1002       5005              3                 1 |
     |---------------------------------------------------------|
  6. |        1003       5006              1                 . |
  7. |        1003       5007              2                 . |
     +---------------------------------------------------------+
I.e. the data set consists of households (householdID) with people in them (personID). People within a household are numbered consecutively (personinHHID), and a variable (personsfatherID) tells me the within-household ID of their father (if known). How can I create a data set that gives me a person's father's person ID, i.e. make the data set look like this:


Code:
     +-----------------------------------+
     | householdID   personID   fatherID |
     |-----------------------------------|
  1. |        1001       5001          . |
  2. |        1001       5002          . |
     |-----------------------------------|
  3. |        1002       5003          . |
  4. |        1002       5004          . |
  5. |        1002       5005       5003 |
     |-----------------------------------|
  6. |        1003       5006          . |
  7. |        1003       5007          . |
     +-----------------------------------+
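One direction I have been considering is a within-household merge of the data on itself (a sketch, untested beyond the example above):

Code:
* sketch: build a lookup of (householdID, personinHHID) -> personID,
* then match it against each person's personsfatherID
preserve
keep householdID personinHHID personID
rename (personinHHID personID) (personsfatherID fatherID)
tempfile fathers
save `fathers'
restore
merge m:1 householdID personsfatherID using `fathers', keep(master match) nogenerate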
Thanks for your consideration
KS

Implement Multinomial Logit Model using ml command

I would like to implement the multinomial logit model using the ml command, generating the same results as mlogit.
Pr(y=1) = exp(beta1*x)/(1 + exp(beta1*x) + exp(beta2*x))
Pr(y=2) = exp(beta2*x)/(1 + exp(beta1*x) + exp(beta2*x))
Pr(y=3) = 1/(1 + exp(beta1*x) + exp(beta2*x))

The following code did not work (not generating anything), but I have no idea how to fix it.
Code:
program define mymlogit
    args lnf Xb1 Xb2
    quietly replace `lnf' = -`Xb1' - ln(1+exp(-`Xb1')+exp(-`Xb2')) if $ML_y1==1
    quietly replace `lnf' = -`Xb2' - ln(1+exp(-`Xb1')+exp(-`Xb2')) if $ML_y1==2
    quietly replace `lnf' = -ln(1+exp(-`Xb1')+exp(-`Xb2')) if $ML_y1==3
end

ml model lf mymlogit (y= x1 x2)
It would be awesome if anyone could also explain how ml model passes values to the program.
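In case it clarifies what I am aiming for, here is a sketch of how I think the evaluator and the ml calls should look (it assumes y is coded 1, 2, 3 with 3 as the base outcome, so the result should match mlogit y x1 x2, baseoutcome(3); I am not sure this is right):

Code:
capture program drop mymlogit
program define mymlogit
    args lnf xb1 xb2
    quietly replace `lnf' = `xb1' - ln(1 + exp(`xb1') + exp(`xb2')) if $ML_y1 == 1
    quietly replace `lnf' = `xb2' - ln(1 + exp(`xb1') + exp(`xb2')) if $ML_y1 == 2
    quietly replace `lnf' =       - ln(1 + exp(`xb1') + exp(`xb2')) if $ML_y1 == 3
end

* two linear-index equations, one per non-base outcome, then maximize
ml model lf mymlogit (eq1: y = x1 x2) (eq2: y = x1 x2)
ml maximize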

panelsubmatrix(): 3301 subscript invalid

I am using Stata 15, and I run
Code:
xsmle INDEX CREDDEV CHEXCHRATE UNINF, wmat(Z) model(sdm)
and I get this error

Code:
panelsubmatrix():  3301  subscript invalid
            _xsmle_est():     -  function returned error
                 <istmt>:     -  function returned error
r(3301);
Any help please?
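For what it is worth, here are the checks I have run so far (a sketch; my understanding is that xsmle expects a balanced panel whose cross-sectional dimension matches the spatial weight matrix, but I may be wrong):

Code:
* sketch of diagnostics: is the panel xtset and balanced, and are there
* missing values in the variables passed to xsmle?
xtset
xtdescribe
misstable summarize INDEX CREDDEV CHEXCHRATE UNINF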

How to use "/" in Stata

Hi everyone,

I would like to know how to write code that lets me display the following: the heights of those individuals whose weight is between 20 and 50 kg.

I did try to type: list height if weight in (20/50).

I know I could use relational operators such as >= and <=, but I am trying to understand how to use "/" properly in Stata.
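From reading the help files, this is my current understanding (a sketch, assuming weight and height are in the same dataset): the slash builds a numlist or an observation range, not a value condition, so for values I would write

Code:
* condition on values: inrange() or explicit inequalities
list height if inrange(weight, 20, 50)
list height if weight >= 20 & weight <= 50

* the slash in the -in- qualifier refers to observation numbers instead,
* e.g. observations 20 through 50 of the dataset in its current sort order
list height in 20/50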

Thank you in advance.

WEIGHT

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float weight
19.2
20.4
20.5
20.5
21.9
22.4
24.4
25.7
25.9
26.9
27.5
28.1
28.2
  29
  29
29.3
29.4
  30
30.1
30.2
30.5
30.5
30.9
  31
31.6
32.2
32.2
32.4
32.7
32.8
32.8
32.9
33.1
33.2
33.4
33.4
33.6
33.6
33.7
33.8
  34
  34
  34
34.1
34.3
34.4
34.5
34.8
34.9
34.9
  35
  35
  35
35.1
35.1
35.2
35.3
35.3
35.4
35.5
35.5
35.5
35.5
35.6
35.6
35.8
35.8
  36
36.1
36.1
36.3
36.3
36.4
36.5
36.5
36.6
36.6
36.7
36.9
  37
  37
  37
37.1
37.1
37.1
37.2
37.2
37.3
37.3
37.3
37.3
37.5
37.5
37.6
37.6
37.6
37.7
37.7
37.7
37.7
end



HEIGHT

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float height
1.685
1.538
 1.68
1.702
1.487
1.638
  1.3
1.505
1.326
1.485
  1.6
1.595
1.323
 1.67
1.526
1.405
1.537
1.514
 1.59
  1.5
1.695
 1.44
1.499
 1.64
1.504
1.542
1.519
1.567
1.481
 1.54
1.528
1.637
1.634
1.501
  .79
 1.47
 1.46
1.663
1.454
    .
 1.56
 1.41
 1.49
1.518
 1.46
1.523
 1.52
1.686
1.534
 1.69
1.475
1.362
1.589
1.405
1.739
1.442
1.457
1.575
1.472
1.614
1.396
1.625
 1.58
1.496
  1.5
1.312
 1.52
 1.62
1.505
1.456
1.482
1.404
1.484
 1.58
1.608
1.435
 1.71
 1.64
1.533
 1.49
1.535
 1.52
1.593
  1.5
1.465
1.205
1.528
 1.55
1.502
1.536
1.589
 1.55
1.524
1.525
1.477
1.735
1.582
 1.52
1.474
    .
end