I have a variable consisting of characters and numbers in string format. How can I separate the character part of the variable from the number part? The position of the number part is not the same for all observations, so I couldn't use a substring. Is there a general way to identify the number part and separate it?
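A general approach, since it does not depend on where the digits sit, is to use regular expressions. A minimal sketch (the variable name myvar is a placeholder):
Code:
* a sketch -- "myvar" is a placeholder name
gen str_part = ustrregexra(myvar, "[0-9]", "")     // drop all digits
gen num_part = ustrregexra(myvar, "[^0-9]", "")    // drop all non-digits
destring num_part, replace                         // make the number part numeric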
Friday, March 31, 2023
specifying stored coefficients in test
I am running two multilevel models and storing the coefficients. I then want to test whether the product of two coefficients, one from each model, is zero, as a simple way to get at indirect effects/moderation. But I'm not sure how to refer to the stored coefficients. Here's my syntax:
mixed lifesatis airsatis wtrsatis presenv LNhhincomeC female ageC tertedu eparttime unemployed nolabor marrdom divsepwid hlthprob LNcintenseC pctprotectC LNgdppcC meaneduC incgini19C wealthgini19C || WP5: airsatis wtrsatis presenv LNhhincomeC female ageC tertedu eparttime unemployed nolabor marrdom divsepwid hlthprob, reml
estimates store full_model
mixed airsatis LNhhincomeC female ageC tertedu eparttime unemployed nolabor marrdom divsepwid hlthprob LNcintenseC pctprotectC LNgdppcC meaneduC incgini19C wealthgini19C || WP5: LNhhincomeC female ageC tertedu eparttime unemployed nolabor marrdom divsepwid hlthprob, reml
estimates store airsat_model
Then when I try to test
test (([full_model]airsatis)*([airsat_model]LNcintenseC))=0
equation [full_model] not found
r(303);
Help appreciated!
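A note on why this fails: -test- can only reference equations in the current e(b), so coefficients from two separately stored models cannot be combined this way. A minimal sketch for at least retrieving each stored coefficient as a scalar (this gives the product's point estimate; an actual test of the product would need joint estimation, e.g. via gsem or bootstrapping, since separate models carry no joint covariance):
Code:
estimates restore full_model
scalar b_air = _b[airsatis]
estimates restore airsat_model
scalar b_lnc = _b[LNcintenseC]
display b_air * b_lnc    // point estimate of the product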
How to use an if condition to restrict a test to only two groups when a categorical variable has 3 groups
Hi Family,
I have 1 categorical variable with 3 groups (blue, yellow, green), and I want to check whether the variance differs between only two of these groups (blue and green). How do I use an if condition to restrict the command to only 2 of the 3 groups?
Given that:
the value labels are blue=1, yellow=2, and green=3
wait = weight of individuals
Below is the command I used, but Stata still generates an error.
sdtest wait , by( grps ) if grps==1 & grps==3
option if not allowed
r(198);
I need your help.
Thank you.
Amidu
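For what it's worth, the if qualifier goes before the comma, and the two groups are combined with | (or), not & (and). A sketch:
Code:
* a sketch: keep only groups 1 (blue) and 3 (green)
sdtest wait if grps == 1 | grps == 3, by(grps)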
Thursday, March 30, 2023
Make a list table in Stata
Dear Stata users,
Thank you in advance. I want to make a table that lists all items in cells, grouped by a categorical variable. Let me illustrate with a data example and show you what I want to get.
You can use this data example:
Code:
sysuse auto
replace foreign=3 in 1/15
label define origin 3 Whatever, modify
sample 30
keep make foreign
tabulate foreign
Or you can use the following data directly:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str18 make byte foreign
"VW Diesel"        1
"Olds Cutlass"     0
"Cad. Eldorado"    3
"Audi 5000"        1
"Olds Cutl Supr"   0
"Datsun 210"       1
"Olds Omega"       0
"Fiat Strada"      1
"AMC Spirit"       3
"Toyota Corolla"   1
"Chev. Malibu"     0
"Pont. Firebird"   0
"Mazda GLC"        1
"Merc. Monarch"    0
"Honda Accord"     1
"Pont. Grand Prix" 0
"Olds 98"          0
"Volvo 260"        1
"Buick Opel"       3
"Toyota Celica"    1
"Subaru"           1
"Plym. Champ"      0
end
label values foreign origin
label def origin 0 "Domestic", modify
label def origin 1 "Foreign", modify
label def origin 3 "Whatever", modify
And what I want to get is a table like this:
[attached image: the desired table]
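One possible sketch of such a table, built with reshape on the example data (the rename targets assume the three category values 0/1/3 shown above):
Code:
* a sketch: one column of makes per category, one row per position
bysort foreign (make): gen seq = _n
reshape wide make, i(seq) j(foreign)
rename (make0 make1 make3) (Domestic Foreign Whatever)
list Domestic Foreign Whatever, noobs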
OLS vs. Ordered Logistic Regression
Hello,
I have five employee-level variables as below, and I was wondering whether I could use OLS with fixed effects or an ordered logistic regression.
Z1 = employee wage
Z2 = employee tenure
Z3 = employee age
Y = employee perception of fairness (ordinal survey item with 5 = very fair... 1 = very unfair)
X = employee job satisfaction (ordinal survey item with 5 = very satisfied...1 = very unsatisfied).
My understanding is that with ordinal independent and dependent variables, I perhaps need to use an ordered logistic model. However, would using OLS be incorrect? Furthermore, would it be acceptable to incorporate firm-level fixed effects in the ordered logistic model?
Does Stata allow fixed effects with an ordered logistic model?
Thank you!
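A sketch of both options (firm_id is a placeholder for your firm identifier; note that the dummy-variable approach to fixed effects in nonlinear models can suffer from incidental-parameters bias when firms have few employees):
Code:
* ordered logit with firm dummies as fixed effects
ologit Y X Z1 Z2 Z3 i.firm_id, vce(cluster firm_id)
* OLS with firm fixed effects, for comparison
areg Y X Z1 Z2 Z3, absorb(firm_id) vce(cluster firm_id)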
How to perform data reshaping operations
How can I perform the following data operations?
(1) Place data scattered across multiple rows onto one row. Here is the data; my goal is to transform it into the layout shown in the attached image after the code.
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str2(a b c d)
"a"  ""   ""   ""
""   "b"  ""   ""
""   ""   "c"  ""
""   ""   ""   "d"
"fd" ""   ""   ""
""   "fe" ""   ""
""   ""   "fe" ""
""   ""   ""   "ef"
"de" ""   ""   ""
""   "ss" ""   ""
""   ""   ""   ""
""   ""   ""   "fw"
"cc" ""   ""   ""
""   "fd" ""   ""
""   ""   "ef" ""
""   ""   ""   "ge"
end
[attached image: the target one-row-per-group layout]
(2) How can I store the row number of a specific value of variable A, or the value of another variable B corresponding to that specific value of A, for subsequent operations?
The data are the same as above.
[attached image: the obs values highlighted in red]
We want to store the row number where variable A is not empty, that is, the value of the corresponding variable obs (the values shown in red in the figure), in a temporary element for subsequent data operations.
Looking forward to your reply and your help. Thank you very much.
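For part (2), a minimal sketch that collects the row numbers where a is non-empty into a local macro for later use:
Code:
gen obs = _n
levelsof obs if !missing(a), local(rows)
display "`rows'"    // the row numbers where a is filled in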
Clarifying interpretation of marginal effects after reghdfe
Hello, I am looking for clarification on post-estimation margins command after reghdfe. I am running a pretty standard Diff-in-Diff (effect of a state-level insurance policy change on cancer screening rates in younger vs older individuals) using reghdfe, and want to confirm that I am using/interpreting the margins command correctly.
When I run the margins command (albeit with the 'noestimcheck' option, which I think is kosher for calculating marginal effects), it omits one category of individual. This is because it is collinear with the fixed effects, yes?
When I do it a different way, it tells me that the predicted change for the older group is 0. This "0" predicted change is garbage, similarly because it is collinear with the fixed effects, yes?
So, do I understand correctly that there is no way to get the marginal effect of pre vs post for the older group? Thanks
Code:
reghdfe screen i.age##i.post i.RACE i.EDUC, absorb(state_num year month) vce(cluster state_num)
(MWFE estimator converged in 4 iterations)
note: 1bn.post is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)

HDFE Linear regression                            Number of obs   =  2,056,819
Absorbing 3 HDFE groups                           F(  10,     14) =     776.20
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.0049
                                                  Adj R-squared   =     0.0049
                                                  Within R-sq.    =     0.0042
Number of clusters (state_num) = 15               Root MSE        =     0.2780

                              (Std. Err. adjusted for 15 clusters in state_num)
--------------------------------------------------------------------------------------
                     |               Robust
              screen |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
                 age |
              older  |  -.0260963   .0022391   -11.65   0.000    -.0308987   -.0212939
              1.post |          0  (omitted)
                     |
            age#post |
            older#1  |   .0032939     .00066     4.99   0.000     .0018784    .0047094
                     |
                RACE |
          Black, NH  |   .0078715   .0007183    10.96   0.000     .0063309    .0094121
           Hispanic  |   .0222411    .001537    14.47   0.000     .0189446    .0255376
              Asian  |    .043212     .00239    18.08   0.000     .0380859    .0483381
                     |
                EDUC |
                 HS  |  -.0067054   .0008325    -8.05   0.000    -.0084909     -.00492
       SOME COLLEGE  |  -.0154122   .0009276   -16.61   0.000    -.0174018   -.0134226
  BACHELOR/GRADUATE  |  -.0364558   .0016695   -21.84   0.000    -.0400365   -.0328751
                     |
               _cons |   .0947413   .0005795   163.48   0.000     .0934983    .0959843
--------------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
   state_num |        15          15           0    *|
        year |         9           1           8     |
       month |        12           1          11     |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation
Code:
. margins age, dydx(post) noestimcheck

Conditional marginal effects                    Number of obs     =  2,056,819
Model VCE    : Robust

--------------------------------------------------------------------------------
               |            Delta-method
               |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
1.post         |
           age |
        older  |          0  (omitted)
      younger  |   .0032939     .00066     4.99   0.000     .0020004    .0045875
--------------------------------------------------------------------------------
Code:
. margins age#post, noestimcheck

Predictive margins                              Number of obs     =  2,056,819
Model VCE    : Robust
Expression   : Linear prediction, predict()

--------------------------------------------------------------------------------------
                     |            Delta-method
                     |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
            age#post |
            older#0  |   .0945361    .000978    96.66   0.000     .0926192     .096453
            older#1  |   .0945361    .000978    96.66   0.000     .0926192     .096453
          younger#0  |   .0684398   .0012712    53.84   0.000     .0659483    .0709314
          younger#1  |   .0717337   .0017257    41.57   0.000     .0683514    .0751161
--------------------------------------------------------------------------------------
Performing Regressions on Panel and Time Series Data
Good afternoon, first time poster here. I'm an undergraduate student trying to run some regressions on data I have been compiling. I thought this method of compiling would make it simple to regress, but it's proving to be more complicated than I thought.
I am in the process of trying to use daily market data to calculate the ANNUAL Betas by Firm then calculate the error term for each firm_year combination. Ultimately, I'm trying to use daily stock return data (regressed on/controlled by the Fama-French 3-factors**) to calculate the annual systemic (Error Term) and systematic (Beta) risk for each firm.
In order to do this, I have ~252 daily data points over 11 years (2011 - 2021), for 100 firms (this totals 277,200 data points). In the end, I hope to have one data point for every firm & year combination for each of the 100 firms (100 firms * 11 years = 1100 data points).
I have been able to run the regression using "sort" and "by" to perform one regression per year-firm combination and receive the appropriate annual beta (displayed below). However, calculating the error term for each of these points is proving to be much more complicated. It seems as if using the "predict y_hat" and "predict residuals" doesn't break the results into each individual firm_year subset.
Finally, if anybody has any advice on exporting the results into an excel or CSV file that contains the relevant Firm, Year, Systemic Risk, Systematic Risk compiled, that would be greatly appreciated (currently I'm planning on manually combining the datapoints).
Below I have listed an example of my variables, as well as the code I'm currently using:
[attached images: variable list and code]
At this point it should give me the squared difference, (predicted - actual)^2. If I'm not mistaken, I would then need to sum that variable within each year_firm to get the error term (Annual Systemic Risk) for that firm.
I hope I was clear enough in explaining my goals, and explain what I've been doing so far. I really appreciate your time and any support you may have to offer.
Best!
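Here is a rough sketch of one way to do the whole pipeline: run one regression per firm-year, store the beta and the residual sum of squares, then export one row per firm-year. The variable names ret, mktrf, smb, hml, firm_id, and year are assumptions:
Code:
* a sketch -- ret, mktrf, smb, hml, firm_id, year are assumed names
egen fy = group(firm_id year)                // one id per firm-year cell
gen double beta = .
gen double rss  = .
quietly summarize fy, meanonly
local G = r(max)
forvalues i = 1/`G' {
    capture regress ret mktrf smb hml if fy == `i'
    if _rc continue
    tempvar res res2
    predict double `res' if e(sample), residuals
    gen double `res2' = `res'^2
    quietly summarize `res2' if fy == `i', meanonly
    replace rss  = r(sum)    if fy == `i'    // annual systemic risk (RSS)
    replace beta = _b[mktrf] if fy == `i'    // annual systematic risk
    drop `res' `res2'
}
* collapse to one row per firm-year and export
collapse (first) beta rss, by(firm_id year)
export delimited using "firm_year_risk.csv", replace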
Automatically run a do file when opened
Hello all,
Is there a feature that will have Stata automatically run designated DO files when they are opened?
Thanks!
Deguza
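Not within Stata itself, as far as I know, but you can have the operating system launch Stata and run a do-file in one step. A sketch for the Windows command prompt (the paths are assumptions; /e runs Stata in batch mode and exits when done):
Code:
"C:\Program Files\Stata17\StataMP-64.exe" /e do "C:\projects\myfile.do"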
Wednesday, March 29, 2023
Model fitness
Dear all,
Good morning!
How can I measure the model fitness of teffects ipwra (inverse-probability-weighted regression adjustment for impact analysis)?
Could you suggest a command to run a log-likelihood test of the model, or other fit measures?
Thank you in advance for sharing your busy schedule.
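teffects ipwra is not fit by a single likelihood you can test, but Stata does offer specification diagnostics after it. A sketch with placeholder variable names:
Code:
* a sketch -- y, x1, x2, treat, z1, z2 are placeholder names
teffects ipwra (y x1 x2) (treat z1 z2)
tebalance summarize    // covariate balance after weighting
teffects overid        // test of overidentifying restrictions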
Entity fixed effects model - repeated time values within panel
Hello, I want to run a regression in which the explained variable is the bond coupon and the explanatory variables include the size of the issuance, tenor, etc. I want to know whether one specific characteristic has a significant impact on the coupon rate; this characteristic will be identified with a dummy variable. My database is not complete yet, so this variable is not there yet.
I thought the best way to run this regression was through an entity fixed-effects model, controlling both for issuer id and month of issuance, something like: 𝑌𝑖𝑡 = 𝛼𝑖 + 𝛽𝑋𝑖𝑡 + 𝑢𝑖 + 𝑒𝑖𝑡. However, as you can see below, I have more than one observation per time period for each issuer (one issuer issued more than one bond in the same month). So, when I tried to use 'xtset id month', I got the error message 'repeated time values within panel'. I know this is probably more a lack of knowledge regarding which model is most appropriate, but is there any advice you could give me?
Thanks a lot!
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input str6 ticker double(coupon issuedate size) byte tenor float(mnth id logsize) "ACCES" 15.5 21625.874675925923 33998957 5 21625 1 17.34184 "ADB" 4.4 18366.874675925923 367149939 2 18366 2 19.72128 "ADB" 7.18 18366.874675925923 195806943 3 18366 2 19.09264 "ADB" .5 18533.874675925923 56103825 7 18533 2 17.842714 "ADB" .5 18533.874675925923 30614167 7 18533 2 17.236973 "ADB" 4.35 18533.874675925923 104088416 4 18533 2 18.46075 "ADB" 7.02 18533.874675925923 15312514 4 18533 2 16.544182 "ADB" 3.14 19133.874675925923 14366736 4 19133 2 16.480427 "ADB" 2.81 19387.916342592594 28223921 4 19387 2 17.15568 "ADB" 3.41 19584.874675925923 30717117 4 19584 2 17.24033 "ADB" 2.125 20165.874675925923 5.000e+08 10 20165 2 20.03012 "ADB" 6.71 20221.874675925923 2338831 3 20221 2 14.665162 "ADB" 1 20681.874675925923 8.000e+08 3 20681 2 20.50012 "ADB" 1.75 20681.874675925923 5.000e+08 10 20681 2 20.03012 "ADB" 2.625 20907.874675925923 50000000 10 20907 2 17.727533 "ADB" 6 20962.874675925923 83070277 4 20962 2 18.235197 "ADB" 1.57 20963.874675925923 10200000 5 20963 2 16.137897 "ADB" 2 20963.874675925923 40564903 5 20963 2 17.518414 "ADB" 9.55 20963.874675925923 49547852 4 20963 2 17.718449 "ADB" 1.875 21040.874675925923 7.500e+08 5 21040 2 20.435583 "ADB" 2.375 21040.874675925923 5.000e+08 10 21040 2 20.03012 "ADB" 3.18 21137.916342592594 16889951 10 21137 2 16.64223 "ADB" 1.47 21209.916342592594 23864070 10 21209 2 16.987885 "ADB" 1.95 21264.874675925923 12903558 2 21264 2 16.373014 "ADB" 2.25 21303.874675925923 12903558 2 21303 2 16.373014 "ADB" .37 21360.874675925923 334096983 5 21360 2 19.62694 "ADB" .35 21380.874675925923 620880201 7 21380 2 20.24665 "ADB" 3.125 21452.874675925923 7.500e+08 10 21452 2 20.435583 "ADB" 2.45 21565.916342592594 749913861 5 21565 2 20.43547 "ADB" .295 21737.874675925923 467735777 7 21737 2 19.963413 "ADB" 1.6 21809.874675925923 148631576 11 21809 2 18.81698 "ADB" 1.5 21825.874675925923 98006546 3 21825 2 18.400545 "ADB" .625 21839.874675925923 296450889 7 21839 2 19.507393 "ADB" 0 21845.874675925923 776100251 10 21845 2 20.46979 "ADB" .8 22117.874675925923 50000000 10 22117 2 17.727533 "ADB" .29 22200.874675925923 47728140 8 22200 2 17.681032 "ADB" 10.12 22236.874675925923 8336391 2 22236 2 15.93614 "ADB" 10.1 22236.874675925923 21945745 2 22236 2 16.904083 "ADB" .253 22300.874675925923 143184421 7 22300 2 18.779644 "ADB" .603 22312.874675925923 36218011 30 22312 2 17.405067 "ADB" 6 22315.874675925923 75011720 5 22315 2 18.133156 "ADB" .75 22320.874675925923 941477743 5 22320 2 20.66296 "ADB" .715 22333.874675925923 41392013 30 22333 2 17.538599 "ADB" .92 22334.874675925923 51740016 40 22334 2 17.761742 "ADB" .8325 22342.874675925923 36218011 30 22342 2 17.405067 "ADB" 4.7 22350.874675925923 53843392 3 22350 2 17.80159 "ADB" 2.15 22532.874675925923 133606703 10 22532 2 18.710411 "ADB" 1.8 22532.874675925923 140524399 15 22532 2 18.760891 "ADB" .6 22627.874675925923 38367096 2 22627 2 17.462711 "ADB" 14.5 22936.874675925923 7337761 2 22936 2 15.808544 "ADNA" 4.375 22530.874675925923 7.500e+08 3 22530 3 20.435583 "ADNA" 4.375 22530.874675925923 7.500e+08 3 22530 3 20.435583 "ADNARE" 4.625 21836.874675925923 3.625e+08 20 21836 4 19.708534 "ADNARE" 4.625 21836.874675925923 3.625e+08 20 21836 4 19.708534 "ADNGN" 6.25 21709.874675925923 5.000e+08 6 21709 5 20.03012 "ADNGN" 6.25 21709.874675925923 5.000e+08 6 21709 5 20.03012 "AESOC" 4.7133 21655.874675925923 120223159 10 21655 6 18.60486 "AESOC" 4.7133 
21655.874675925923 33550867 10 21655 6 17.328573 "AFDB" 4.52 18349.874675925923 61384460 4 18349 7 17.932667 "AFDB" 4.68 18408.874675925923 56611583 4 18408 7 17.851725 "AFDB" 4.35 18408.874675925923 26820632 4 18408 7 17.104683 "AFDB" .5 18470.874675925923 7135164 7 18470 7 15.780546 "AFDB" 4.42 18499.874675925923 11462412 3 18499 7 16.254583 "AFDB" .5 18533.874675925923 7514248 10 18533 7 15.832312 "AFDB" .5 18562.916342592594 6154319 6 18562 7 15.632665 "AFDB" 3.71 18562.916342592594 28560046 4 18562 7 17.167519 "AFDB" 4.8 18562.916342592594 45594931 4 18562 7 17.635307 "AFDB" .75 19648.874675925923 5.000e+08 3 19648 7 20.03012 "AFDB" 1.75 19793.874675925923 89266585 5 19793 7 18.307138 "AFDB" 1.375 20438.916342592594 5.000e+08 3 20438 7 20.03012 "AFDB" .375 20788.916342592594 114729422 6 20788 7 18.558086 "AFDB" 3.5 20802.916342592594 94583730 15 20802 7 18.364996 "AFDB" 3 21523.916342592594 5.000e+08 3 21523 7 20.03012 "AFDB" .375 21649.874675925923 190912562 5 21649 7 19.067326 "AFDB" .25 22391.874675925923 95456281 5 22391 7 18.37418 "AFDB" .823 22685.874675925923 143184421 5 22685 7 18.779644 "AFFNN" 1.205 22187.874675925923 158881474 5 22187 8 18.883669 "AGBKC" 4.15 20380.916342592594 85178875 2 20380 9 18.260263 "AGBKC" 2.75 20380.916342592594 5.000e+08 5 20380 9 20.03012 "AGBKC" 2.125 20380.916342592594 4.000e+08 3 20380 9 19.806974 "AGBKC" 2.4 22937.874675925923 2071851821 3 22937 9 21.45171 "AGBKC" 2.4 22937.874675925923 2129471890 3 22937 9 21.47914 "AGBKC" 2.8 22937.874675925923 690617273 5 22937 9 20.353096 "AGBKC" 2.8 22937.874675925923 709823963 5 22937 9 20.38053 "AGBKCK" 2.25 22704.874675925923 3.000e+08 5 22704 10 19.51929 "AGBKCK" 2 22704.874675925923 6.000e+08 3 22704 10 20.21244 "AGBKFH" 3.68 21702.874675925923 483691533 3 21702 11 19.99696 "AGDBC" 3.79 20808.916342592594 1612305112 3 20808 12 21.20093 "AGDBC" 4.48 21138.916342592594 1612305112 2 21138 12 21.20093 "AGDBC" 3.18 21858.874675925923 348063375 3 21858 12 19.667894 "AGDBC" 2 22545.874675925923 1419647927 2 22545 12 21.073675 "AGSPA" .5 22572.874675925923 103480033 5 22572 13 18.45489 "AGUA" 8.65 20997.874675925923 206675622 10 20997 14 19.14666 "AGUAS" 1.8 21257.874675925923 57039000 7 21257 15 17.859245 "AGUAS" 2.5 21622.874675925923 76052000 25 21622 15 18.146929 "AHWLIA" 3.99 21739.874675925923 85178875 3 21739 16 18.260263 "AHWLIA" 3.8 22822.874675925923 88018171 3 22822 16 18.293055 "AIAUT" 1.75 22656.874675925923 1.000e+09 5 22656 17 20.723267 "AIAUT" 1.75 22656.874675925923 1.000e+09 5 22656 17 20.723267 "AJBAN" 4.6 21502.916342592594 41437036 3 21502 18 17.539686 end format %tdnn/dd/CCYY issuedate
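One workaround sketch: since several bonds share an issuer-month, declare only the panel dimension and put the month effects in as indicators (variable names taken from the dataex above; whether month dummies are the right specification is a modeling choice, not a Stata requirement):
Code:
* a sketch: no time index needed when periods repeat within panel
xtset id
xtreg coupon logsize tenor i.mnth, fe vce(cluster id)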
Plot survival curve of the interaction in Stratified Cox Model
Dear all,
I ran a Cox model with strata(year),
stcox X1 X2 c.X1#i.X2, strata(year) vce(robust)
X1 is a continuous variable and X2 is a binary categorical variable (=1 or 0). Now I want to plot the survival and hazard curves to show the effect of the interaction; may I know the commands? stcurve is invalid after strata().
Thank you.
Best
Josh
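One possible workaround (a sketch, not a verified recipe): predict the stratum-specific baseline survivor function and raise it to exp(xb) yourself, then plot by group:
Code:
stcox X1 X2 c.X1#i.X2, strata(year) vce(robust)
predict double s0, basesurv     // baseline survival, within stratum
predict double xb, xb
gen double surv = s0^exp(xb)    // survival at each observation's covariates
sort _t
line surv _t if X2 == 1         // plot by stratum/group as needed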
estout test results
I am trying to obtain the test results using the estout command; however, I am unable to put them there.
I want to see the poolability test result (xtreg reports it automatically) and also the Hausman test result at the bottom of the table where r2, N, etc. appear. Can someone please help me with that?
Code:
reg y x1 x2 x3 x4
eststo 1
xtset id time
xtreg y x1 x2 x3 x4, be
eststo 2
xtreg y x1 x2 x3 x4, fe
eststo 3
xtreg y x1 x2 x3 x4
eststo 4
hausman 3 4
hausman 3 4, sigmamore
estout 1 2 3 4, cells(b(star fmt(%9.4f)) se(par) t(par) p(par)) stats(N r2 r2_b r2_w r2_o, fmt(%9.4f %9.0g) labels(N))
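estout only shows what is attached to the stored estimates, so one approach (a sketch) is to attach the test results to a stored model with estadd before calling estout:
Code:
* a sketch: attach Hausman results to a stored model
xtreg y x1 x2 x3 x4, fe
eststo fe_model
xtreg y x1 x2 x3 x4, re
eststo re_model
hausman fe_model re_model
estadd scalar hausman_chi2 = r(chi2) : fe_model
estadd scalar hausman_p    = r(p)    : fe_model
estout fe_model re_model, cells(b(star fmt(%9.4f)) se(par)) stats(N r2 hausman_chi2 hausman_p)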
Loop to fill missing values
Hello,
I am using a panel data set that has a unique firm id called 'firm_id'. However, this firm_id has some missing values. These missing values are all unique firms, so among the missing values no firm occurs twice. I want to fill those missing values by continuing to count from the current highest available firm_id.
(max_firm_id = the max of firm_id)
I wrote this loop:
global i = max_firm_id + 1
foreach x of varlist firm_id {
replace `x' = $i if (`x' == .)
global i = $i + 1
}
However, when I run this, all the missing values are replaced by the value of max_firm_id + 1. What I want is that only the first missing value gets the value of max_firm_id + 1, the second missing value is one higher, the third missing value is again one higher, and so on.
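The loop runs once per variable, not once per observation, so every missing value receives the same $i. A loop-free sketch that numbers the missing ids sequentially above the current maximum:
Code:
summarize firm_id, meanonly
gen long newid = r(max) + sum(missing(firm_id))   // running count of missings
replace firm_id = newid if missing(firm_id)
drop newid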
test "dose-response" with categorical variable
Hi! I am performing Cox regressions with a categorical independent variable. In the survey, we have a Likert scale with 7 responses from "never", "once", ... to "daily". We coded 1) never, 2) once, 3) reoccurring. I want to know if (3) reoccurring exposure has a stronger association with the outcome than (2) "once". My supervisor suggested treating the independent variable as continuous (putting it in Stata without the i.) and claims that a statistically significant p-value would prove that there is a trend. I find this hard to believe; it has been questioned by a reviewer and I can't find it described anywhere. I suggested instead comparing the two coefficients with a Wald test of "not being the same". I would be grateful for a) an explanation that supports my supervisor's suggestion (possibly literature I can reference), b) support for the Wald test, or c) an alternative way to test this. Just to say it in advance: including the variable with finer categories is off the table due to low power.
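For option (b), the Wald test is straightforward after fitting the model with indicators. A sketch (exposure is a placeholder name for the 3-level variable, and x1 x2 for covariates):
Code:
stcox i.exposure x1 x2
test 2.exposure = 3.exposure    // H0: "once" and "reoccurring" have equal log-HRs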
Tuesday, March 28, 2023
Using nlcom after postestimation margins
Dear forum,
Thanks in advance for any guidance you may give. I would like to estimate the ratio (95% CI) of two predicted probabilities that were computed using the postestimation margins command after logistic regression. I thought I could use nlcom, but I'm getting stuck.
Code:
logit dm i.sex
margins sex, post
PHP Code:
Logistic regression Number of obs = 4,830
LR chi2(1) = 7.04
Prob > chi2 = 0.0080
Log likelihood = -2596.4629 Pseudo R2 = 0.0014
------------------------------------------------------------------------------
dm | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Male | -.1912397 .0725017 -2.64 0.008 -.3333405 -.049139
_cons | -1.14623 .0420583 -27.25 0.000 -1.228663 -1.063797
------------------------------------------------------------------------------
Code:
margins sex, post
PHP Code:
Expression: Pr(dm), predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex |
Female | 0.146 0.012 12.62 0.000 0.124 0.169
Male | 0.117 0.014 8.17 0.000 0.089 0.146
------------------------------------------------------------------------------
Code:
logit dm i.lclass
PHP Code:
Logistic regression Number of obs = 3,338
LR chi2(4) = 89.35
Prob > chi2 = 0.0000
Log likelihood = -1431.0752 Pseudo R2 = 0.0303
------------------------------------------------------------------------------
dm | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
lclass |
2 | .1144973 .1324727 0.86 0.387 -.1451444 .374139
3 | 1.012298 .181102 5.59 0.000 .6573448 1.367252
4 | -.6119494 .1525491 -4.01 0.000 -.9109402 -.3129587
5 | .3836895 .1778637 2.16 0.031 .0350831 .7322959
|
_cons | -1.662886 .1079938 -15.40 0.000 -1.87455 -1.451222
------------------------------------------------------------------------------
Code:
. margins lclass, post
PHP Code:
Expression: Pr(dm), predict()
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
lclass |
1 | 0.078 0.023 3.47 0.001 0.034 0.123
2 | 0.044 0.010 4.46 0.000 0.025 0.064
3 | 0.213 0.052 4.08 0.000 0.111 0.316
4 | 0.051 0.039 1.30 0.196 -0.026 0.128
5 | 0.296 0.104 2.84 0.005 0.091 0.501
------------------------------------------------------------------------------
Code:
nlcom _b[Male.sex]\_b[1.lclass]
PHP Code:
[Male.sex]\_b[1.lclass] not found
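A note on the error: nlcom can only see coefficients in the current e(b). After -margins, post- the margins themselves become e(b), so a ratio within one model works as in the sketch below (the coefficient names can be confirmed with matrix list e(b)); a ratio across two separately fitted models has no joint covariance and cannot be computed this way:
Code:
* a sketch: ratio of posted margins from a single model
logit dm i.sex
margins sex, post
matrix list e(b)               // confirm names, e.g. 1.sex and 2.sex
nlcom _b[2.sex] / _b[1.sex]    // Male / Female predicted-probability ratio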
Thanks,
Itai
Generating tables using the eststo and esttab commands in Stata and exporting the tex. file into Latex
Hi, I am currently trying to generate a table in Stata to import into LaTeX using the following code:
esttab using OLS_Tables.tex, replace keep(Leverage logTotalAssets Tangibility RevenueGrowth FirmAge _cons) order(Leverage logTotalAssets Tangibility RevenueGrowth FirmAge _cons) r2 b(%9.3f) se(%9.3f)label star(* 0.10 ** 0.05 *** 0.01) nonotes addnotes("$^{\ast} \text{ } p<0.10,\text{ } ^{\ast \ast}p<0.05, \text{ } ^{\ast \ast}p<0.01$" "Robust standard errors in parentheses") compress
It generates the following table, as seen below:
[attached image: generated table]
How would I edit my code further to replicate the format of the following table, including a title, a row below the equation number giving the type of regression (OLS) above the dependent variable (ROA), and a separate section showing Year FE?
[attached image: target table]
Thanks!
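A sketch of the relevant esttab options (this assumes year indicators named *.year in the model; adjust pattern() and mtitles() to your number of columns):
Code:
esttab using OLS_Tables.tex, replace title(Regression Results) ///
    mgroups("OLS", pattern(1)) mtitles("ROA") ///
    indicate("Year FE = *.year") ///
    r2 b(%9.3f) se(%9.3f) label star(* 0.10 ** 0.05 *** 0.01)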
Callaway and Sant'Anna Diff in Diff
Hi:
I am trying to estimate the effect of a policy using the dynamic DID model of Callaway and Sant'Anna via the csdid package. I have individual-level cross-sectional data. The policy I am investigating affects individuals living in urban areas (I have an indicator for urban) who were born after 1990 (the sample contains individuals born between 1975 and 2000). I have tried the following code:
Code:
csdid years_schooling, ivar(urban_rural) time(YOB) gvar(first_treat) method(ipw)
where first_treat=1 if living in an urban area and born after 1990. However, I have realized that this code only works with a proper panel setup, with one observation per ID per period. Does anybody know how I can modify the above code for individual-level data with many observations per urban/rural id per year of birth?
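If I read the csdid help correctly, omitting ivar() makes csdid treat the data as repeated cross-sections, which matches this setup. A sketch:
Code:
* a sketch: no ivar() -> repeated cross-sections
csdid years_schooling, time(YOB) gvar(first_treat) method(ipw)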
Difference-in-Differences (DiD) Methods with Jeff Wooldridge -- livestream workshop
Hi everyone
Instats presents Difference-in-Differences (DiD) Methods with Panel Data by Jeff Wooldridge, running April 4-5 (starting 11am EST). This workshop covers recent developments in difference-in-differences (DiD) estimation for causal inference with longitudinal panel data using Stata, covering the foundational theory for DiD approaches along with flexible regression-based methods and staggered interventions. More recent methods based on rolling controls, long differencing, and more will also be covered. You can participate in the live sessions or asynchronously through the Instats platform, where all Zoom sessions will be recorded and available for 30 days. Jeff will also be available for Q&A during the workshop and afterwards for 30 days through the Instats platform.
Please consider coming along, and feel free to tell your friends and colleagues about this unique opportunity to get taught DiD methods by one of the world's best econometric educators.
The workshop fees are ~$400 for researchers and $250 for PhD and other students, and for our institutional partners these come down to ~$275 and $185 USD. It's free to become an Instats partner if you'd like to receive similar discounts on any Instats workshop -- to inquire contact info@instats.org
Happy modeling!
Michael Zyphur
Director
Institute for Statistical and Data Science
instats.org
Drawing the yearly production of 16 region in one graph
Hello Stata users;
I have the data below. What I want to do with it is draw a graph that shows, for each region (NABEUL, DAR CHAABANE, ...), the evolution of the yearly production of oil.
I thought about a 16-line graph, but I guess it would be crowded; at the same time, I would love to have all 16 regions in one graph, just for comparison reasons. I also thought about drawing a bar graph, but I don't think that's the best solution.
Any ideas and code suggestions, please?
Thanks!
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str15 year int(year2012 year2013 year2014 year2015 year2016 year2017 year2018 year2019 year2020)
"NABEUL"          2500  2500  2500  2500  2500  2500  2500  2200  2200
"DAR CHAABANE"    2500  2500  2500  2500  2500  2500  2800  2200  2200
"BENI KHIAR"      2250  2250  2250  2250  2240  2240  2500  2500  2500
"KORBA"           8730  8730  8730  7730  7200  7200  7500  7500  7500
"MENZEL TEMIME"   9400  9400  9400  9400 14900 15500 15650 16500 19000
"EL MIDA"        14900 13900 13900 14900  8200  8500  8650  9200  9200
"KELIBIA"         9500  9500  9500  9500  9200 10000 10000 10000 10000
"HAMMAM GUEZAZE"  6500  6500  6500  6500  5500  6000  6500  6500  6500
"EL HAOUARIA"     9550  9550  9550  9550  8590  8590  8590  9500 12000
"TAKELSA"         3800  3800  3800  3800  3800  3800  3900  3900  3900
"SOLIMAN"         5050  5050  5050  5050  4050  4850  4850  4850  4850
"MENZEL BOUZELFA" 6950  6950  6950  6950  6950  6950  6950  6950  6950
"BENI KHALLED"    4600  4600  4600  4600  4600  4600  4800  4800  4800
"GROMBALIA"       5950  5950  5950  5950  5950  5950  5950  5000  5000
"BOU ARGOUB"      2280  2280  2280  2280  2280  2280  2320  2000  2000
"HAMMAMET"        1540  1540  1540  1540  1540  1540  1540  1400  1400
end
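One sketch: reshape to long (the region names are currently in the variable called year) and let xtline draw one line per region, either overlaid or as a 16-panel grid:
Code:
* a sketch on the data above
rename year region
reshape long year, i(region) j(yr)
rename year production
encode region, gen(rid)
xtset rid yr
xtline production, overlay    // all 16 regions in one graph
* or: xtline production       // one small panel per region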
Long data formatting and Clogit model for discrete choice experiments with multiple choice questions
I have survey data from a discrete choice experiment where each survey respondent was asked to complete 8 discrete choice questions (choose one) with each choice question having 4 unlabeled alternatives, where one alternative was a "pick none" option. There were 3 different survey versions and the 8 choice questions differed on each version, meaning there are 24 total choice situations across the 3 surveys.
I know choice data requires long formatting, but I want to make sure I am setting it up to correctly incorporate that each respondent completed 8 choice questions and that there were 3 different sets of choice questions a respondent might have seen.
This is an example of how my data is currently formatted (see the table below). I have the survey version denoted, but label choice sets 1-8 on all survey versions and alternatives 1-4 within all choice sets. Should the choice sets and alternatives be numbered differently? For example, choice sets numbered 1-24 (1-8 on survey 1, 9-16 on survey 2, and 17-24 on survey 3)? What about the alternatives: is 1-4 okay, or do they need to be numbered differently to indicate that the alternatives being chosen between differ across choice questions?
After correctly formatting the data, how can I make sure the clogit model correctly treats this as one respondent making 8 different choices? If I group by ID, it says there are multiple positive outcomes within a group, since there are 8 ones in the "choice" column.
SurveyVersion | ID | ChoiceSet | Alternative | Choice | Price | Outages |
1 | 1 | 1 | 1 | 0 | 20 | 0 |
1 | 1 | 1 | 2 | 1 | 50 | 1 |
1 | 1 | 1 | 3 | 0 | 75 | 0 |
1 | 1 | 1 | 4 | 0 | 0 | 0 |
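On the clogit question: the group() variable must identify each choice situation (respondent x choice set), not the respondent, so that each group contains exactly one chosen alternative. A sketch using the column names above:
Code:
egen qid = group(SurveyVersion ID ChoiceSet)
clogit Choice Price Outages, group(qid)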
Monday, March 27, 2023
Exporting psmatch2 results on the matched sample: a solution with esttab
I am writing my thesis right now and have had many troubles trying to export the results of the user-written command psmatch2. Eststo and esttab generally only report the results from the unmatched sample instead of the average treatment effect on the treated (ATT) and the associated standard error. I have seen at least 5 different forum posts, all unsolved, with this issue. After much experimentation, I have found a solution.
Let's examine the problem. Run the psmatch2 command with eststo, and then run -return list- to see the scalars stored as a result of psmatch2:
Code:
eststo: psmatch2 dependent_var independent_var1 independent_var2, out(outcome_var)
return list
You will see that the ATT is stored in r(att), and the standard error of the ATT is stored in r(seatt).
However, eststo does not store r(att) and r(seatt) automatically as an active estimation result. Therefore, running esttab afterwards simply gives you the results from the unmatched sample.
The solution is to use estadd to store r(att) and r(seatt), and then run esttab. estadd stores them as estimation results, which esttab can then access.
Here is my solution, in which I also calculate the t statistic and the two-tailed p-value. Obviously, substitute your own dependent, independent, and outcome variables:
Code:
eststo clear
eststo: psmatch2 dependent_var independent_var1 independent_var2, out(outcome_var)
estadd scalar r(att)
estadd scalar r(seatt)
estadd scalar t_stat = (r(att)/r(seatt))
estadd scalar p_value = (2 * ttail(e(df_r), abs(e(t_stat))))
esttab, scalars(att seatt t_stat p_value) drop(_treated _cons) obslast
Under the esttab options, I drop _treated and _cons because these are the results on the unmatched sample. I use obslast so that the number of observations appears at the bottom of the table instead of the top. Of course, you can modify the code to export the results to HTML, PDF, or whatever other format is possible with esttab.
This is still an incomplete solution, as I cannot find a way to automatically star the p-values, but I believe it is helpful nonetheless.
Creating filename as columns in multiple files dta files before appending
Hi,
I am trying to include the name of the file as a variable before appending. Please see code below, which works well.
Ideally, the variable should be created after importing each delimited file, but before saving it as a .dta file, for all files to be appended.
Thanks
clear
cd "C:\Users\aaaa"
local files: dir "C:\Users\aaaa" files "*.txt"
foreach file in `files' {
clear
import delimited `file'
save `file'.dta, replace
}
local files: dir "C:\Users\aaaa " files "*.dta"
foreach file in `files' {
append using `file'
}
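A sketch of the requested change: generate the filename variable right after the import and before saving (the variable name source is an assumption):
Code:
clear
local files : dir "C:\Users\aaaa" files "*.txt"
foreach file of local files {
    clear
    import delimited "`file'"
    gen source = "`file'"        // record which file each row came from
    save "`file'.dta", replace
}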
Converting "None" (Python) to "" (Stata)
Hi,
Manipulating a list in Python, I ended up with the following tuple, which I need to send to Stata's default frame:
('47313362', None), where None means missing in Python parlance.
So the sfi library is not correctly converting None (Python) to "" (Stata).
Code:
. clear all
. gen a=""
. gen b=""
. set obs 1
Number of observations (_N) was 0, now 1.
. python
----------------------------------------------- python (type end to exit) -----------------------------------------------
>>> from sfi import Data
>>> Data.store(['a','b'],None,('47313362', None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Stata17\ado\base\py\sfi.py", line 1432, in store
    return _stp._st_storedata(vars, obss, val, svar)
TypeError: failed to store the string value in the current Stata dataset
r(7102);
>>> end
--------------------------------------------------------------------------------------------------------------------------
. list

     +--------------+
     |        a   b |
     |--------------|
  1. | 47313362     |
     +--------------+
Does anyone know how to handle it?
thanks
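One workaround sketch: convert None to "" on the Python side before calling Data.store, mirroring the call pattern above:
Code:
python
from sfi import Data
row = ('47313362', None)
clean = ['' if v is None else v for v in row]   # None -> "" for string vars
Data.store(['a', 'b'], None, clean)
end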
Interpretation of a Regression Output
Dear experts,
I would like to kindly ask for your help in interpreting a regression output: compare the statistical significance and the economic importance of my various regressors. Is my key explanatory variable important? Which are the most important variables? Fully and carefully interpret at least one parameter estimate in words (using B, the marginal effect, or the elasticity; elasticity is easiest as long as it's a continuous variable).
here is my regression:
use "C:\Users\hartm\Downloads\cosub_place_panel_property2_9018 (6).dta"
. regress incherfindahl pctbachelors pctnevermarr pct18to64 unemprate
Source | SS df MS Number of obs = 65,143
-------------+---------------------------------- F(4, 65138) = 811.53
Model | 23.8730515 4 5.96826289 Prob > F = 0.0000
Residual | 479.046789 65,138 .007354337 R-squared = 0.0475
-------------+---------------------------------- Adj R-squared = 0.0474
Total | 502.91984 65,142 .007720362 Root MSE = .08576
------------------------------------------------------------------------------
incherfind~l | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
pctbachelors | -.1945729 .0048227 -40.34 0.000 -.2040255 -.1851204
pctnevermarr | -.003355 .005335 -0.63 0.529 -.0138116 .0071016
pct18to64 | .1157761 .0071806 16.12 0.000 .1017021 .1298501
unemprate | -.3474631 .008519 -40.79 0.000 -.3641603 -.330766
_cons | .0709203 .0042465 16.70 0.000 .0625972 .0792435
------------------------------------------------------------------------------
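For the elasticity part of the interpretation, margins can compute it directly at the means; a sketch:
Code:
regress incherfindahl pctbachelors pctnevermarr pct18to64 unemprate
margins, eyex(unemprate) atmeans    // elasticity of the outcome w.r.t. unemprate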
Sunday, March 26, 2023
Count unique appearance of values across multiple variables in wide format
Hello!
It has been a while since I have used Stata and I need a bit of help.
I have a wide dataset with variables activ1-activ27. Across these variables, the values I am interested in are 3001-3008 (these values represent specific categories of activities a participant has engaged in).
I would like to count:
1) the unique appearance of values 3001-3008 across activ1-activ27 for each idpriv
2) the total count of any value within the range 3001-3008 across activ1-activ27 for each idpriv
I have tried a number of loops without success. While I would love to try -egenmore-, I cannot add packages because I am working on a disconnected server.
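A sketch without extra packages, assuming one row per idpriv:
Code:
gen total_30xx = 0
forvalues k = 3001/3008 {
    gen seen`k' = 0
    foreach v of varlist activ1-activ27 {
        replace seen`k' = 1 if `v' == `k'
        replace total_30xx = total_30xx + 1 if `v' == `k'
    }
}
egen unique_30xx = rowtotal(seen3001-seen3008)   // distinct codes used
drop seen3001-seen3008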
Generating IDs
Dear all,
Suppose I have the following dataset:
Code:
clear
input var1 var2
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
3 1
3 2
3 3
3 4
4 1
4 2
4 3
4 4
end
I'd like to create a list of IDs based on var1 and var2. In particular, I'd like to get the same ID when the observations in var1 and var2 are swapped. For example, var1=3 and var2=4 should have the same ID as var1=4 and var2=3.
I have tried
Code:
egen id = group(var1 var2)
but it does not give me exactly what I'm looking for.
Could anyone please help me with this?
Thank you very much for your help,
Vinh
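A sketch: order each pair before grouping, so swapped pairs collapse to the same key:
Code:
gen lo = min(var1, var2)
gen hi = max(var1, var2)
egen id = group(lo hi)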
Bysort and egen code not carrying over correct data
Hello,
I have a long dataset containing college enrollment data by term, with multiple rows per student ID. If a student enrolled for 10 terms in college, there will be 10 date rows for that ID. The date variables include high school graduation date, enrollment begin (college), enrollment end (college), and graduation date (college). Each date is in a separate column; my issue is occurring specifically in graduation_date.
The graduated variable is marked N or Y if a student graduated college. I created a variable (degree) to note N = 0 and Y = 1. Then ran code
Code:
by randomid, sort: egen college_grad = max(degree)
to create college_grad and carry that 1 over to each observation within a student ID.
I now want to carry over graduation_date to each observation where college_grad is 1. I did this with
Code:
bys randomid: egen degree_date=max(graduation_date)
graduation_date carries over to the degree_date variable, but graduation_date does not always match degree_date: graduation_date will be June 08, 2020, while degree_date is noted as May 20, 2022. This happens in multiple instances; I noticed the first one in row 221, which is beyond what dataex shows. Can someone help troubleshoot my code?
I also formatted the dates using
Code:
foreach var of varlist enrollment_begin enrollment_end graduation_date high_school_grad_date degree_date {
    replace `var' = date(string(`var', "%08.0f"), "YMD")
    format %td `var'
}
Essentially, what I am trying to do is convert the different dates to terms, in order to measure the length of time between each date over time. Here is my data:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input double randomid long(high_school_grad_date enrollment_begin enrollment_end) str1 graduated long graduation_date float(degree college_grad degree_date) 2103300231 22804 22886 22997 "N" . 0 0 . 2103300237 22804 . . "N" . 0 0 . 2103300243 22804 . . "N" . 0 0 . 2103300255 22069 22886 22997 "N" . 0 0 . 2103300255 22069 22312 22408 "N" . 0 0 . 2103300255 22069 22522 22633 "N" . 0 0 . 2103300255 22069 22432 22497 "N" . 0 0 . 2103300255 22069 22432 22505 "N" . 0 0 . 2103300255 22069 22662 22773 "N" . 0 0 . 2103300255 22069 22158 22262 "N" . 0 0 . 2103300264 22069 22508 22631 "N" . 0 0 . 2103300264 22069 22144 22267 "N" . 0 0 . 2103300264 22069 22872 22995 "N" . 0 0 . 2103300264 22069 22872 22996 "N" . 0 0 . 2103300264 22069 22663 22789 "N" . 0 0 . 2103300264 22069 22872 22995 "N" . 0 0 . 2103300264 22069 22872 22996 "N" . 0 0 . 2103300267 21705 22817 22836 "N" . 0 0 . 2103300267 21705 22067 22074 "N" . 0 0 . 2103300267 21705 22515 22630 "N" . 0 0 . 2103300267 21705 21780 21903 "N" . 0 0 . 2103300267 21705 22879 22996 "N" . 0 0 . 2103300267 21705 22144 22210 "N" . 0 0 . 2103300267 21705 21927 22056 "N" . 0 0 . 2103300267 21705 22690 22806 "N" . 0 0 . 2103300282 21341 . . "N" . 0 0 . 2103300291 20977 . . "N" . 0 0 . 2103300321 22804 . . "N" . 0 0 . 2103300336 21340 22676 22792 "N" . 0 0 . 2103300336 21340 22883 22995 "N" . 0 0 . 2103300426 22440 . . "N" . 0 0 . 2103300477 21341 21409 21532 "N" . 0 0 . 2103300477 21341 21927 22056 "N" . 0 0 . 2103300477 21341 21563 21692 "N" . 0 0 . 2103300477 21341 22872 22995 "N" . 0 0 . 2103300477 21341 22299 22425 "N" . 0 0 . 2103300477 21341 22144 22267 "N" . 0 0 . 2103300477 21341 21780 21903 "N" . 0 0 . 2103300477 21341 22879 22995 "N" . 0 0 . 2103300477 21341 22872 22996 "N" . 0 0 . 2103300477 21341 22515 22631 "N" . 0 0 . 2103300477 21341 22797 22873 "N" . 0 0 . 2103300477 21341 22067 22141 "N" . 0 0 . 2103300477 21341 22669 22785 "N" . 0 0 . 2103300477 21341 22872 22996 "N" . 0 0 . 2103300477 21341 22872 22995 "N" . 0 0 . 2103300495 20977 21409 21431 "N" . 0 0 . 2103300495 20977 21200 21328 "N" . 0 0 . 2103300552 21705 22515 22626 "N" . 0 0 . 2103300552 21705 21927 22056 "N" . 0 0 . 2103300552 21705 22879 22990 "N" . 0 0 . 2103300552 21705 22319 22437 "N" . 0 0 . 2103300552 21705 21780 21903 "N" . 0 0 . 2103300552 21705 22151 22262 "N" . 0 0 . 2103300552 21705 22683 22801 "N" . 0 0 . 2103300552 21705 22144 22267 "N" . 0 0 . 2103300552 21705 22067 22141 "N" . 0 0 . 2103300552 21705 22879 22994 "N" . 0 0 . 2103300561 22069 22144 22267 "N" . 0 0 . 2103300561 22069 22879 22995 "N" . 0 0 . 2103300561 22069 22648 22778 "N" . 0 0 . 2103300561 22069 22284 22407 "N" . 0 0 . 2103300561 22069 22781 22868 "N" . 0 0 . 2103300561 22069 22515 22631 "N" . 0 0 . 2103300585 22440 22522 22632 "N" . 0 0 . 2103300585 22440 22669 22784 "N" . 0 0 . 2103300585 22440 22796 22861 "N" . 0 0 . 2103300585 22440 22796 22869 "N" . 0 0 . 2103300591 22440 . . "N" . 0 0 . 2103300594 22804 . . "N" . 0 0 . 2103300600 22804 . . "N" . 0 0 . 2103300603 22656 . . "N" . 0 0 . 2103300855 22804 22900 22991 "N" . 0 0 . 2103300861 22440 22872 22996 "N" . 0 0 . 2103300861 22440 22872 22995 "N" . 0 0 . 2103300861 22440 22796 22861 "N" . 0 0 . 2103300861 22440 22872 22995 "N" . 0 0 . 2103300861 22440 22796 22869 "N" . 0 0 . 2103300861 22440 22508 22631 "N" . 0 0 . 2103300861 22440 22663 22789 "N" . 0 0 . 2103300861 22440 22872 22996 "N" . 0 0 . 3333004638 20248 . . "N" . 0 0 . 3333004845 20248 20472 20600 "N" . 0 0 . 
3333004845 20248 20318 20440 "N" . 0 0 . 3333004845 20248 22524 22704 "N" . 0 0 . 3333004845 20248 22705 22888 "N" . 0 0 . 3333004845 20248 22889 23069 "N" . 0 0 . 3333005328 20248 21409 21456 "N" . 0 0 . 3333005328 20248 20682 20804 "N" . 0 0 . 3333005340 20248 22515 22631 "N" . 0 1 22078 3333005340 20248 21818 21889 "N" . 0 1 22078 3333005340 20248 . . "Y" 22078 1 1 22078 3333005340 20248 21556 21623 "N" . 0 1 22078 3333005340 20248 21818 21896 "N" . 0 1 22078 3333005340 20248 22305 22420 "N" . 0 1 22078 3333005340 20248 22151 22267 "N" . 0 1 22078 3333005340 20248 22879 22995 "N" . 0 1 22078 3333005340 20248 . . "Y" 22078 1 1 22078 3333005340 20248 21916 21994 "N" . 0 1 22078 3333005340 20248 21999 22078 "N" . 0 1 22078 end format %td high_school_grad_date format %td enrollment_begin format %td enrollment_end format %td graduation_date format %td degree_date
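A minimal sketch of one possible first step, assuming the dataex example above is in memory (the %td dates subtract directly to day counts):
Code:
sort randomid enrollment_begin
* days from high-school graduation to the start of each enrollment spell
gen days_hs_to_enroll = enrollment_begin - high_school_grad_date
* days between the end of one enrollment spell and the start of the next
by randomid: gen days_between_spells = enrollment_begin - enrollment_end[_n-1] if _n > 1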
Thank you.
Panel Var and Fan Charts
Hi all,
It's a bit urgent, please.
I'm using a panel VAR in my project and need a little help.
Could anyone please share the command to generate stochastic simulations after a panel VAR?
And most importantly, how do I generate fan charts in Stata?
I have three variables in my model.
xtvar y1 y2 y3 , mc lags(2)
I look forward to kind suggestions
Thanks
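Stata has no built-in fan chart command that I know of, but once forecast quantiles have been computed from the stochastic-simulation draws, a fan chart can be assembled from overlaid -twoway rarea- bands. A minimal sketch, assuming a dataset holding the forecast horizon h and hypothetical quantile variables q5 q20 q35 q50 q65 q80 q95 (the %-opacity colors need Stata 15 or later):
Code:
twoway (rarea q5  q95 h, color(navy%15)) ///
       (rarea q20 q80 h, color(navy%30)) ///
       (rarea q35 q65 h, color(navy%50)) ///
       (line  q50 h, lcolor(navy)), ///
       legend(off) xtitle("Forecast horizon") ytitle("y1 forecast")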
Fix a coefficient to the value it had in previous mlogit
Hi everyone,
this may be a very simple problem and I am just not getting it, but here's what I am trying to do. I want to use -mlogit- on a 4-category dependent variable. The plan is to estimate the model, save the coefficients, fix the value of some of the b's to the value obtained in the previous regression, then run a new model. My problem is I cannot seem to find any syntax that will fix the value. I tried the following:
Method 1:
Code:
mlogit depvar age c.age#c.age . . .
constraint 1 [#2]age = [#2]_b[age]
mlogit depvar age c.age##c.age . . . , constraints(1)
The constraint is ignored: the second mlogit reports standard errors for the coefficient instead of just showing the value followed by "(constrained)".
Method 2:
Code:
mlogit depvar age c.age##c.age . . .
gen c01 = [#2]_b[age]
constraint 1 [#2]_cons = c01
mlogit depvar age c.age##c.age . . . , constraints(1)
This produces error r(111).
Method 3 (I knew this would not work, but it was nice confirmation):
Code:
mlogit depvar age c.age#c.age . . .
constraint 1 [#2]age = .0296174
mlogit depvar age c.age##c.age . . . , constraints(1)
This constrains the coefficient, but other coefficients change their values, which just shows I don't have "all" (or enough) of the digits.
I think I could display "enough" of the digits and copy them one-by-one into the code, but 1) I am likely to make mistakes; 2) I want to do this for too many variables in a model for it to be feasible; 3) I want to do this for too many models for it to be feasible; and 4) I find it very hard to believe Stata has no way for me to use some elements of _b to constrain coefficients in the next regression. I just can't seem to find my way to the write (no pun intended) syntax. I've looked at the manual entries for a) mlogit, b) constraints, c) matrix operations, and a few others. I googled various terms, and I searched the forum as well (Google turned up some listserv-era discussions, but I could find nothing on my question). It may be a very simple thing I am just missing, but I'm nowhere near converging to a solution. I appreciate any help anyone can offer.
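For what it's worth, here is a sketch of one way around the copy-by-hand problem. -constraint- stores the expression as text and re-evaluates it when the constrained model runs, at which point [#2]_b[age] refers to the new model, which is why Method 1 looks ignored. Substituting the numeric value at definition time through a macro avoids that, and the %21x hex format round-trips a double exactly, so Method 3's lost-digits problem disappears:
Code:
mlogit depvar age c.age##c.age . . .
local b_age : display %21x [#2]_b[age]   // exact hex image of the coefficient
constraint 1 [#2]age = `b_age'
mlogit depvar age c.age##c.age . . . , constraints(1)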
Thanks!
Sam
Saturday, March 25, 2023
How to change the format of numbers stored in a local macro?
Dear Statalist,
I am trying to format the number stored in a local macro using the string(n, s) function. Here is my code:
Code:
version 17.0
quietly: count // Store total no. of pts to scalar
scalar SC_total = r(N)
local ML_total = string(SC_total, "%9.0fc")
display `ML_total'
3 274
The expected output of the display command is 3,274; however, it appears without the comma, like this: 3 274. Please help me solve this issue.
Thank you
Abdullah
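The local almost certainly does contain 3,274; the problem appears to be the unquoted display. -display `ML_total'- expands to -display 3,274-, which display parses as two separate expressions (3 and 274) and prints side by side. Quoting the macro makes it a single string:
Code:
display "`ML_total'"   // prints 3,274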
Unexpected error with -bayes:mixed-
The following code aborts with the error message shown after it:
Code:
sysuse bplong
bayes: mixed bp i.when || patient: , noconstant residuals(exchangeable)
random effects level patient is empty
an error occurred when bayes executed mixed
r(198);
Does anyone see what I'm doing wrong?
(The code is excerpted; the complete do-file and SMCL log file are attached below.)
The helpfile, help bayes_mixed, specifically mentions allowing the noconstant option in the random effects equation as does the corresponding user's manual entry (Page 593), although the latter doesn't contain a worked example that exercises that option.
I'd rather not trouble Technical Support with this if someone can readily spot an error.
R-squared and -didregress-
Good evening all,
Is there a way to have -didregress- show a value for R-squared, as you would see if you ran a difference-in-differences regression the old way (just using -reg-)?
Thanks!
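Not within -didregress- itself, as far as I know, but its point estimate comes from a two-way fixed-effects regression, so you can re-fit that with -regress- to see an R-squared. A sketch using what I believe is the hospdd example from the Stata DID documentation:
Code:
webuse hospdd, clear
didregress (satis) (procedure), group(hospital) time(month)
* the corresponding fixed-effects regression, which reports R-squared
regress satis i.procedure i.hospital i.month, vce(cluster hospital)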
Mixed-Effect Zero Inflated Negative Binomial model ??
Hello,
I have a study where I am evaluating the relationship between the total number of health care visits and geographic access (measured by travel time categories).
Initially I thought this would be a mixed-effects Poisson regression - total visits + individual level fixed effects & random intercept at geographic area level -- and using robust SE.
However, a substantial share of my dependent variable consists of zero visits (25.7%).
Therefore, I'm looking at a mixed-effects zero-inflated negative binomial model. I am wondering whether this is logical, or whether others have recommendations.
Thanks,
Devin
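One wrinkle: as far as I know, Stata ships no mixed-effects zero-inflated negative binomial estimator, so it may be worth first checking whether -menbreg- (mixed-effects negative binomial) already accommodates the excess zeros through its overdispersion parameter. A hedged sketch with hypothetical names (visits, traveltime, area, controls x1 x2):
Code:
menbreg visits i.traveltime x1 x2 || area:, vce(robust)
* compare against the mixed-effects Poisson fit; a large dispersion
* parameter favors the negative binomial
mepoisson visits i.traveltime x1 x2 || area:, vce(robust)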
Why does the asreg command give me values for the first observations of my sample even though I specified a min?
Hello everyone!
I am using the asreg command to calculate betas using one year of daily data, but I want the year to have at least 150 observations. (I am using CRSP daily data)
To that end I applied the following:
gen day = day(date)
bys PERMNO : asreg exc_ret MktRF, window( day 252 ) min(151)
It works; however, I get values even for the very first observations of my sample.
Does anyone have any idea why that happens?
Thank you for taking the time to read my question.
P.S: In case it is relevant, here is a preview of my data (before I added the MktRF and before I created the exc_ret) using dataex.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input double PERMNO long date double(SHRCD EXCHCD) str8 CUSIP double(PRC RET SHROUT) 10001 12057 11 3 "36720410" 14.5 .0357142873108387 1080 10001 12058 11 3 "36720410" -14.25 -.017241379246115685 1080 10001 12059 11 3 "36720410" -14.25 0 1080 10001 12060 11 3 "36720410" 14.5 .017543859779834747 1080 10001 12061 11 3 "36720410" 14.5 0 1080 10001 12064 11 3 "36720410" 14.5 0 1080 10001 12065 11 3 "36720410" -14.25 -.017241379246115685 1080 10001 12066 11 3 "36720410" 13.75 -.035087719559669495 1080 10001 12067 11 3 "36720410" 13.75 0 1080 10001 12068 11 3 "36720410" -14 .0181818176060915 1080 10001 12071 11 3 "36720410" -14 0 1080 10001 12072 11 3 "36720410" 13.75 -.01785714365541935 1080 10001 12073 11 3 "36720410" 14.25 .036363635212183 1080 10001 12074 11 3 "36720410" 14.25 0 1080 10001 12075 11 3 "36720410" 13.75 -.035087719559669495 1080 10001 12078 11 3 "36720410" -14 .0181818176060915 1080 10001 12079 11 3 "36720410" -14 0 1080 10001 12080 11 3 "36720410" 13.75 -.01785714365541935 1080 10001 12081 11 3 "36720410" -14 .0181818176060915 1080 10001 12082 11 3 "36720410" -14 0 1080 10001 12085 11 3 "36720410" 13.75 -.01785714365541935 1080 10001 12086 11 3 "36720410" 13.75 0 1080 10001 12087 11 3 "36720410" -14 .0181818176060915 1080 10001 12088 11 3 "36720410" -14 0 1080 10001 12089 11 3 "36720410" -14 0 1080 10001 12092 11 3 "36720410" 13.75 -.01785714365541935 1080 10001 12093 11 3 "36720410" 14.25 .036363635212183 1080 10001 12094 11 3 "36720410" 13.75 -.035087719559669495 1080 10001 12095 11 3 "36720410" -14 .0181818176060915 1080 10001 12096 11 3 "36720410" 14.25 .01785714365541935 1080 10001 12100 11 3 "36720410" 14 -.017543859779834747 1080 10001 12101 11 3 "36720410" -14 0 1080 10001 12102 11 3 "36720410" 14.25 .01785714365541935 1080 10001 12103 11 3 "36720410" -14 -.017543859779834747 1080 10001 12106 11 3 "36720410" -14 0 1080 10001 12107 11 3 "36720410" -14 0 1080 10001 12108 11 3 "36720410" 14.25 .01785714365541935 1080 10001 12109 11 3 "36720410" -14 -.017543859779834747 1080 10001 12110 11 3 "36720410" 14.25 .01785714365541935 1080 10001 12113 11 3 "36720410" 14.5 .017543859779834747 1080 10001 12114 11 3 "36720410" -14.125 -.0258620698004961 1080 10001 12115 11 3 "36720410" -14.125 0 1080 10001 12116 11 3 "36720410" -14.125 0 1080 10001 12117 11 3 "36720410" -14.125 0 1080 10001 12120 11 3 "36720410" 13.75 -.015398230403661728 1080 10001 12121 11 3 "36720410" 13.75 0 1080 10001 12122 11 3 "36720410" 14.25 .036363635212183 1080 10001 12123 11 3 "36720410" 14 -.017543859779834747 1080 10001 12124 11 3 "36720410" 14.25 .01785714365541935 1080 10001 12127 11 3 "36720410" 14.125 -.008771929889917374 1080 10001 12128 11 3 "36720410" 14.25 .008849557489156723 1080 10001 12129 11 3 "36720410" -14.125 -.008771929889917374 1080 10001 12130 11 3 "36720410" 13.75 -.02654867246747017 1080 10001 12131 11 3 "36720410" -14.125 .027272727340459824 1080 10001 12134 11 3 "36720410" 14.5 .02654867246747017 1080 10001 12135 11 3 "36720410" 13.75 -.0517241396009922 1080 10001 12136 11 3 "36720410" 13.75 0 1080 10001 12137 11 3 "36720410" 13.75 0 1080 10001 12138 11 3 "36720410" -14.125 .027272727340459824 1080 10001 12141 11 3 "36720410" 13.75 -.02654867246747017 1080 10001 12142 11 3 "36720410" -14.125 .027272727340459824 1080 10001 12143 11 3 "36720410" 14.25 .008849557489156723 1075 10001 12144 11 3 "36720410" -14.125 -.008771929889917374 1075 10001 12145 11 3 "36720410" -14.125 0 1075 10001 12148 11 3 "36720410" 14.5 
.02654867246747017 1075 10001 12149 11 3 "36720410" -14.125 -.0258620698004961 1075 10001 12150 11 3 "36720410" -14.125 0 1075 10001 12151 11 3 "36720410" 13.75 -.02654867246747017 1075 10001 12155 11 3 "36720410" -14.125 .027272727340459824 1075 10001 12156 11 3 "36720410" 14.5 .02654867246747017 1075 10001 12157 11 3 "36720410" 14.5 0 1075 10001 12158 11 3 "36720410" 14.5 0 1075 10001 12159 11 3 "36720410" 14.75 .017241379246115685 1075 10001 12162 11 3 "36720410" 14.75 0 1075 10001 12163 11 3 "36720410" 14.75 0 1075 10001 12164 11 3 "36720410" 14.75 0 1075 10001 12165 11 3 "36720410" 14.5 -.016949152573943138 1075 10001 12166 11 3 "36720410" 14.75 .017241379246115685 1075 10001 12169 11 3 "36720410" 14.75 0 1075 10001 12170 11 3 "36720410" 14.75 0 1075 10001 12171 11 3 "36720410" -15.25 .033898305147886276 1075 10001 12172 11 3 "36720410" 15.75 .032786883413791656 1075 10001 12173 11 3 "36720410" -15.25 -.0317460335791111 1075 10001 12176 11 3 "36720410" -15.25 0 1075 10001 12177 11 3 "36720410" 15.5 .016393441706895828 1075 10001 12178 11 3 "36720410" 15.75 .016129031777381897 1075 10001 12179 11 3 "36720410" -15.25 -.0317460335791111 1075 10001 12180 11 3 "36720410" 15.75 .032786883413791656 1075 10001 12183 11 3 "36720410" -15.25 -.0317460335791111 1075 10001 12184 11 3 "36720410" -15.25 0 1075 10001 12185 11 3 "36720410" 15.5 .016393441706895828 1075 10001 12186 11 3 "36720410" 15.5 0 1075 10001 12187 11 3 "36720410" 14.75 -.04838709533214569 1075 10001 12190 11 3 "36720410" 15 .016949152573943138 1075 10001 12191 11 3 "36720410" 15.5 .03333333507180214 1075 10001 12192 11 3 "36720410" 15.25 -.016129031777381897 1075 10001 12193 11 3 "36720410" 15.5 .016393441706895828 1075 10001 12194 11 3 "36720410" 15.5 0 1075 10001 12197 11 3 "36720410" -15.125 -.024193547666072845 1075 10001 12198 11 3 "36720410" -15.125 0 1075 end format %td date
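A likely culprit: day(date) returns the day of the month (1-31), not a running day count, so window(day 252) is not the one-year rolling window intended, and min() is evaluated against that malformed window. A sketch using a sequential trading-day counter per firm instead (tradingday is a made-up name):
Code:
bysort PERMNO (date): gen tradingday = _n
bysort PERMNO: asreg exc_ret MktRF, window(tradingday 252) min(150)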
Appending excel files - Error r(601)
Hi. I am trying to append all Excel files in a single folder into one big file, but the code below throws the r(601) error. I've made sure that the files have no leading or trailing spaces, so I am not sure what is happening. Any help would be great. Thanks.
Code:
local filenames: dir "C:\Users\katja\Dropbox\Raw Data" files "*.xlsx*"
foreach f of local filenames {
    import excel using `"`f'"'
    gen source = `"`f'"'
    display `"Appending `f'"'
    append using `building'
    save `"`building'"', replace
}
export excel using 1_large_excel_file.xlsx, replace
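Two things here can trigger r(601), "file not found": the dir extended macro function returns bare filenames relative to the Dropbox folder rather than the current directory, and the local `building' is never created before the first -append-. A sketch of one repair, assuming the workbooks share a layout (forward slashes avoid the backslash-versus-macro-quote clash):
Code:
local path "C:/Users/katja/Dropbox/Raw Data"
local filenames : dir "`path'" files "*.xlsx"
tempfile building
clear
save `"`building'"', emptyok             // start from an empty master file
foreach f of local filenames {
    import excel using `"`path'/`f'"', clear
    gen source = `"`f'"'
    display `"Appending `f'"'
    append using `"`building'"'
    save `"`building'"', replace
}
export excel using "1_large_excel_file.xlsx", replace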
Counts in a long dataset taking into account other variables including time
Hello,
I have a very large dataset that includes people ("personid") who could have been seen at any of 4 sites (distinguished by "siteid"). The variable "monthyear" is the month and year in which the visit occurred. I want to determine:
1) the counts of people who were seen at more than one site and what the combination of these sites were (for example, in the below dataex example of 4 sites, how many people were seen at sites 19 and 32, 19 and 24, 19 and 47, 32 and 47, etc.)
2) how many people were seen at site 24 AFTER being seen at site 19 (essentially the same table as in my first question but incorporating time)
3) how many people were seen specifically at site 24 AFTER site 19 within 3 months
Any advice much appreciated!
Sarah
Code:
* Example generated by -dataex-. For more info, type help dataex clear input long personid byte siteid str7 monthyear 2 19 "2021_03" 2 19 "2020_07" 3 19 "2019_11" 51 19 "2020_03" 51 19 "2019_01" 51 19 "2019_07" 51 19 "2019_01" 51 19 "2020_12" 52 32 "21-Jan" 52 19 "2020_03" 52 32 "21-Feb" 52 19 "2020_01" 52 19 "2020_03" 53 19 "2021_09" 53 19 "2020_02" 82 19 "2019_10" 82 19 "2021_04" 82 19 "2020_02" 83 19 "2022_02" 83 19 "2019_07" 83 19 "2022_02" 83 19 "2021_02" 83 19 "2022_02" 84 19 "2019_01" 84 47 "19-Oct" 84 19 "2019_04" 84 19 "2019_01" 84 19 "2019_05" 84 19 "2020_08" 84 19 "2019_01" 145 32 "21-Apr" 145 32 "21-Mar" 145 32 "21-Feb" 214 32 "21-Jan" 217 47 "20-Jan" 246 47 "20-Sep" 257 32 "21-Jan" 300 47 "20-May" 306 47 "18-Apr" 335 47 "18-Dec" 347 32 "21-Sep" 347 32 "21-Feb" 375 47 "18-Oct" 379 32 "21-Jan" 379 32 "20-Nov" 399 47 "19-Feb" 432 32 "20-Oct" 432 32 "21-Mar" 432 32 "21-Jan" 432 32 "21-Apr" 432 32 "20-Sep" 432 32 "21-Jan" 432 32 "20-Mar" 432 32 "20-Dec" 432 32 "22-Jan" 432 32 "21-Jan" 432 32 "20-Dec" 451 47 "18-Mar" 451 47 "18-Mar" 451 47 "20-Oct" 451 47 "19-Jul" 451 47 "18-Jul" 451 47 "18-Mar" 452 32 "19-Oct" 452 32 "20-Dec" 457 47 "19-Mar" 463 47 "18-Jun" 539 47 "19-Feb" 545 47 "20-Jan" 545 47 "20-Sep" 545 47 "20-Jan" 552 47 "20-Mar" 570 47 "18-Jan" 570 47 "18-Mar" 570 47 "18-Mar" 8775 47 "20-Jan" 8779 24 "17-Feb" 8779 24 "17-Sep" 8781 47 "19-Aug" 8781 47 "18-Oct" 8795 47 "20-Oct" 8795 47 "18-Oct" 8798 32 "21-Nov" 8806 47 "22-Feb" 8806 47 "21-Aug" 8806 47 "21-Oct" 8806 47 "21-Sep" 8806 47 "21-Mar" 8814 47 "18-Nov" 8815 32 "19-May" 8815 32 "19-Nov" 8815 32 "19-Dec" 8815 32 "20-Mar" 8815 32 "19-Nov" 8815 32 "19-Nov" 8815 32 "20-Mar" 8828 47 "19-Sep" 8828 47 "19-Sep" 8838 47 "18-Oct" 8838 47 "18-Oct" end
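For question 1, a sketch that builds each person's combination of sites and tabulates it; questions 2 and 3 need monthyear converted to a genuine monthly date first, and note that it currently mixes two formats ("2021_03" and "21-Jan") that would have to be harmonized:
Code:
preserve
contract personid siteid                      // one row per person-site pair
bysort personid (siteid): gen combo = string(siteid) if _n == 1
bysort personid (siteid): replace combo = combo[_n-1] + "+" + string(siteid) if _n > 1
bysort personid (siteid): keep if _n == _N    // row holding the full combination
tab combo if strpos(combo, "+")               // people seen at more than one site
restore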
Friday, March 24, 2023
Perturb with Mlogit
Hi All,
I estimate a multinomial logit model with 10 independent variables. I suspect there might be a multicollinearity issue, so I am using the perturb command. However, I always get an error message. If I use the perturb command with logit or linear models, I do not get any errors. I have added sample code that demonstrates what I am doing. I would appreciate any help.
clear all
set obs 90
gen y = 0 if _n<=30
replace y = 1 if _n>30 & _n<=60
replace y = 2 if _n>60
gen x1 = rnormal(0,1)
gen x2 = rnormal(5,2)
gen x3 = rnormal(10,3)
perturb: mlogit y x1 x2 x3, poptions(pvars(x1 x2) prange(1 1))
variable x1 not found
r(111);
end of do-file
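One thing worth testing, offered strictly as a guess: poptions() is an option of perturb, not of mlogit, so written after mlogit's comma it gets handed to mlogit. As with other prefix commands, perturb's own options may belong before the colon; help perturb has the authoritative syntax:
Code:
perturb, poptions(pvars(x1 x2) prange(1 1)): mlogit y x1 x2 x3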
Endogeneity concerns
I am working with a dataset of about 28,000 observations over a period of 10 years. I ran a probit model for my baseline results; both my dependent and independent variables are binary. Now I suspect there might be a reverse-causality issue in my model, so I ran an IV probit in Stata, but the result turned out to be insignificant (suggesting no endogeneity concern). Then I realized I should not use IV probit when both variables are binary. I now need to address the reverse-causality issue with some other method, and I need help with the Stata commands.
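With a binary outcome and a binary endogenous regressor, one standard route is a recursive bivariate probit, which -biprobit- can fit. A hedged sketch with hypothetical names: y the outcome, d the endogenous binary regressor, z an instrument excluded from the outcome equation, x1 x2 controls:
Code:
biprobit (y = d x1 x2) (d = z x1 x2)
* the reported test of rho = 0 speaks to whether d is endogenous:
* rejecting it suggests the two equations' errors are correlated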
Control variables for intraday data.
Hi everyone,
I would like to add control variables (BookToMarket, Size...) to my data to regress Abnormal return on text sentiment. For each firm and specific date, I have several thousands of row of data.
Each row is a unique sentence and variables associated with it.
I have included a sample of my data.
If I add size (log of market cap), BookToMarket, and so on, then I would be adding the same values for thousands of rows, which raises some questions in my mind. If my data were based on days rather than intraday observations, I would simply have included the values of those variables.
I can also tag each of those values according to quintiles (1...5) and then use it in my regression as categorical variable.
In the sample data, I categorized variables (ln_mktcap bm roa) according to quintiles.
Should I just use the raw data rather than quintiles in my regression? What would be the best way to add those variables to my regression?
reghdfe AbnRet i.Sentiment i.ln_mktcap i.bm i.roa, absorb(year SIC) cluster(month_year)
Thank you.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input byte Sentiment double AbnRet float(year month_year) byte(SIC ln_mktcap bm roa) 0 .000042760626015603265 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 -.00008552490912983046 2017 685 38 5 3 5 0 .0000427551413056948 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 -.0010592237732415244 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 -.00004275514130580582 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 -.00004275331338177146 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 -.000042751485614056506 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 .00004274965800266095 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 2 0 2017 685 38 5 3 5 1 .000042751485614056506 2017 685 38 5 3 5 1 .000042753313381882485 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 2 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 2 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 2 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 .0000427551413056948 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 -.00012828187804658775 2017 685 38 5 3 5 0 .0001282654239170844 2017 685 38 5 3 5 0 .0021446675765255385 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 .000042760626015603265 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 -.0001282873636947457 2017 685 38 5 3 5 0 .0000855139387719861 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 2 0 2017 685 38 5 3 5 2 0 2017 685 38 5 3 5 1 -.00008552125203098448 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 1 .0000855139387719861 2017 685 38 5 3 5 1 1.1102230246251565e-16 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 .00008552125203120653 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 0 0 2017 685 38 5 3 5 2 0 2017 685 38 5 3 5 1 -.0000427642832705466 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 1 0 2017 685 38 5 3 5 0 .00004276245456480421 2017 685 38 5 3 5 end format %tm month_year label values Sentiment sentimentlabel label def sentimentlabel 0 "Neu", modify label def sentimentlabel 1 "Pos", modify label def sentimentlabel 2 "Neg", modify
Different colors for graph bars
Hello,
I have a basic doubt. I am trying to plot a graph on GDP per capita which has 2 bars- one for OECD and another for non OECD countries. I want to have two different colors for each bar group. But, the following command gives me the same color bar for OECD and non OECD countries.
graph bar (mean) wb_gdppercap, over(OECD,relabel(1 "Non OECD" 2 "OECD")) title("GDP per Capita")ytitle("GDP per capita") bar(1, color(maroon) fintensity(inten80)) bar(2, color(navy) fintensity(inten60))
Can someone please explain what I am doing wrong here? I have inserted the graph here for reference.
Thank you very much!
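A sketch of what I believe is happening: with a single yvar, bar(1, ...) styles that one series across every over() group, so the second bar() is never used. Adding asyvars turns each over() group into its own yvar, letting bar(1) and bar(2) act separately:
Code:
graph bar (mean) wb_gdppercap, over(OECD, relabel(1 "Non OECD" 2 "OECD")) ///
    asyvars title("GDP per Capita") ytitle("GDP per capita") ///
    bar(1, color(maroon) fintensity(inten80)) bar(2, color(navy) fintensity(inten60))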
Compute a difference rate between two variables
Hello everyone,
I have to use the variables lonely and ca_couple to calculate the proportions of individuals who feel lonely some of the time or often, separately by whether they are living with a partner or not.
Then I need to calculate the difference in the rate of "loneliness" between those two groups, where:
- lonely takes 1 if the individual feels lonely sometimes or often, and 0 if the individual hardly ever or never feels lonely.
- ca_couple takes 1 if living in a couple, and 0 otherwise.
Is there a better way to do this? Thank you in advance. Here is what I obtain in Stata:
Code:
tabulate lonely ca_couple, row col
local p_lonely_couple = 100*r(row1col1)/r(row1col2)
local p_lonely_single = 100*r(row2col1)/r(row2col2)
local diff_lonely = `p_lonely_single' - `p_lonely_couple'
display "Difference in the rate of loneliness: " %5.2f `diff_lonely'
Code:
. local p_lonely_couple = 100*r(row1col1)/r(row1col2)
. local p_lonely_single = 100*r(row2col1)/r(row2col2)
. local diff_lonely = `p_lonely_single' - `p_lonely_couple'
. display "Difference in the rate of loneliness: " %5.2f `diff_lonely'
Difference in the rate of loneliness: .
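The missing result arises because a two-way -tabulate- leaves only saved results such as r(N), r(r), and r(c); there are no r(row1col1)-style results (cell counts require matcell()). Since lonely is 0/1, each group's mean is its loneliness rate, so a sketch:
Code:
summarize lonely if ca_couple == 1, meanonly
local p_lonely_couple = 100*r(mean)
summarize lonely if ca_couple == 0, meanonly
local p_lonely_single = 100*r(mean)
display "Difference in the rate of loneliness: " %5.2f `p_lonely_single' - `p_lonely_couple'
* -prtest lonely, by(ca_couple)- would additionally test whether the rates differ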
Best,
Michael
Descriptive Statistics at Baseline
Hi,
Suppose I have two waves of a survey. I make use of a DID.
The treatment is whether a particular HH_member passed away by wave 2.
I restricted the sample to those respondents living with that hh member in wave 1.
The treatment dummy takes the value 0 if HH-member is alive and 1 if the HH member died.
Post takes value 0 if the observation is in wave 1 and takes value 1 if in wave 2.
The regression I run to get DID output is
Code:
reg outcome i.HH_member_died##i.post $controls $FE , cl(var)
I try to look at the descriptive statistics of two groups at the baseline to see if any statistical difference exists at the baseline between them.
1) respondents for whom the hh_member was alive in wave1 but passed away by wave2
2) respondents for whom the hh_member was alive in wave 1 and remained alive even in wave 2.
I run the following command
Code:
keep post==0
ttest outcome, by(HH_memeber_died)
But this shows the difference between the groups for whom the HH_member was dead versus alive in wave 1, not the wave-1 difference between respondents whose HH_member passed away by wave 2 and those whose member did not.
How can I get that?
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input double BMI_survey float(HH_member_died post) 25.4386425018 0 0 16.9948692322 0 0 30.0425415039 0 0 29.6986141205 1 1 26.7642993927 1 1 20.6010570526 0 1 25.5052757263 1 1 23.8579883576 0 1 22.4769897461 0 0 21.7111110687 0 0 20.8466835022 0 0 25.7420845032 0 0 17.8782272339 0 0 25.3662319183 1 1 22.0370044708 0 1 19.8885383606 0 0 21.9642086029 0 1 29.1248645783 1 1 . 1 1 20.7804164887 1 1 19.079946518 0 0 29.4948539734 0 0 21.731502533 0 0 24.1133422852 0 0 14.3230552673 1 1 23.4682006836 0 1 21.426651001 1 1 19.5180568695 0 1 22.80884552 1 1 27.1408443451 1 1 33.8613395691 0 0 21.102804184 0 0 21.5114250183 0 0 21.3299102783 1 1 23.688167572 1 0 31.5484447479 1 1 26.3711357117 1 1 28.5294895172 0 0 22.6051673889 1 1 22.8827972412 1 0 18.0570392609 1 0 21.54296875 0 1 17.9458408356 0 1 29.5692863465 0 1 20.625705719 1 1 22.8537845612 1 0 20.0530490875 1 1 25.8380050659 1 1 24.6478939057 0 1 21.3943805695 0 0 28.6160259247 0 0 25.1688022614 0 0 17.9838981629 0 0 26.0739994049 1 1 21.6328411102 1 1 21.1121730805 1 1 26.4681358338 0 1 24.112241745 0 1 24.8491153717 0 0 20.7447032929 0 0 21.5888004303 1 0 25.0364208222 0 0 24.9698143005 0 1 24.6674957276 0 1 18.8573570252 1 1 17.0916538239 1 1 18.9992809296 1 1 13.7701473236 1 1 30.8436298371 0 0 23.0513725281 1 0 24.9562129974 1 1 23.4150829315 1 1 23.6802539826 1 1 39.5971870423 0 0 24.2142925263 1 0 28.9832057953 1 1 25.234375 1 0 34.2484893799 1 1 21.8441505432 0 1 34.4006996155 0 0 21.1394233704 0 0 32.6331367493 0 0 26.7796020508 1 1 25.7466068268 1 1 21.3556995392 0 1 29.2932777405 1 1 23.8346385956 0 0 30.2977848053 0 0 24.3925933838 0 1 30.3410663605 1 0 22.2304401398 1 0 20.5079307556 0 0 23.6710567474 1 0 24.0117797852 1 0 20.9045200348 1 1 26.642944336 0 1 21.9777774811 0 0 17.3138198853 0 0 25.1688022614 1 0 23.3634643555 0 0 end
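A sketch of the baseline comparison: if HH_member_died is constant within respondent (as the dataex example suggests, flagging eventual treatment), then restricting to post == 0 already splits wave-1 observations into those whose member later died and those whose member did not, which is the comparison wanted. Note that -keep- needs an -if- (keep if post==0) and that HH_memeber_died is misspelled:
Code:
ttest BMI_survey if post == 0, by(HH_member_died)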
Thursday, March 23, 2023
guidance on minimum sample size for 'medsem' package
My understanding was that SEM typically requires a larger sample size, ideally over 200 observations, for reliable results, but that the 'medsem' package could be used on smaller samples (please correct me if I am wrong). Now that I've finished my analyses under this assumption (82 observations), I can't seem to find any official source that mentions an appropriate sample size for 'medsem'. Would anyone be able to direct me? Thank you!
Nesting foreach loops within a while loop
I am trying to create a foreach loop within a while loop, and the issue is that the foreach loop executes completely, even if the condition to break the while loop has been met using one of the variables. I'm including source code to explain what is happening.
I want the while loop to stop the first time `loop_e' > `total_e' fails (that is, when the foreach loop processes trunk). However, the while loop only stops executing after all the variables in the foreach loop have been looped over.
Code:
sysuse auto, clear
local total_e = 1e-2
local loop_e = 1
while `loop_e' > `total_e' {
    foreach var in trunk weight length turn {
        di "`var'"
        local loop_e = 1e-6
    }
}
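The while condition is evaluated only at the top of each pass, never inside the foreach body, which is why the inner loop always finishes its variable list. To leave the foreach as soon as the tolerance is met, test inside the body and use continue, break:
Code:
sysuse auto, clear
local total_e = 1e-2
local loop_e = 1
while `loop_e' > `total_e' {
    foreach var in trunk weight length turn {
        di "`var'"
        local loop_e = 1e-6
        if `loop_e' <= `total_e' continue, break   // exits the foreach at once
    }
}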