Specialized in data processing, data management implementation plans, data collection tools (electronic and paper based), data cleaning specifications, data extraction, data transformation, data loading, analytical datasets, and data analysis. BJ Data Tech Solutions teaches the design and development of electronic data collection tools using CSPro, and Stata commands for data manipulation, as well as setting up data management systems using modern data technologies such as relational databases, C#, PHP, and Android.
Monday, May 31, 2021
Is it correct to interpret the sign of a statistically non-significant coefficient?
Dear Members
This question may seem somewhat foolish or trivial to many in this forum, but I think a clear-cut answer, if one is at all possible, from members of this forum can remove the doubt in my mind.
Question 1: Can we interpret the "SIGN" of a non-significant coefficient?
Answer 1:
"If a coefficient's t-statistic is not significant, don't interpret it at all. You can't be sure that the value of the corresponding parameter in the underlying regression model isn't really zero."
DeVeaux, Velleman, and Bock (2012), Stats: Data and Models, 3rd edition, Addison-Wesley
p. 801 (in Chapter 10: Multiple Regression, under the heading "What Can Go Wrong?")
Source: https://web.ma.utexas.edu/users/mks/...ioncoeffs.html
Answer 2: Some authors interpret at least the sign, stating that the nature of the relationship is negative (or positive) but not statistically significant.
Which one of the above is more correct? Faced with an insignificant coefficient, should we ignore it completely and move on, or should we stop and interpret the sign?
I have read a similar post on this forum but could not find the answer to my doubt:
https://www.statalist.org/forums/for...t-coefficients
Bar graph of proportion over the years
I have two questions to clarify.
My first question is how to show a proportion using graph bar, if the only statistics available in the stat() option are mean, median, p1, p2, ..., p99, sum, count, percent, min, and max. Do I use percent or count to show the proportion?
My next question: the years on the x-axis are clumped together in my graph; is there a solution to remedy this?
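A minimal sketch of one possible approach, not from the original thread (myindicator and year are placeholder names): for a 0/1 indicator the mean is the proportion, so graph bar (mean) works, and the label() suboption of over() can unclutter the year labels on the x-axis.
Code:
graph bar (mean) myindicator, over(year, label(angle(45) labsize(small))) ytitle("Proportion")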
Skewed-Normal GARCH Estimation Using the ml Command
Dear Statalist,
I'm trying to generate conditional skewness and kurtosis by using Gram-Charlier Expansion Series. Essentially, I'm trying to estimate the following system:
h_t = b0 + b1*eta_{t-1}^2 + b2*h_{t-1}
s_t = c0 + c1*eta_{t-1}^3 + c2*s_{t-1}
k_t = g0 + g1*eta_{t-1}^4 + g2*k_{t-1}
where eta_t = (ret_t - mu)/sqrt(h_t), as implemented in the code below.
where my likelihood function (the per-observation log likelihood, as implemented in the code below) is:
l_t = -0.5*[ln(h_t) + (ret_t - mu)^2/h_t] + ln(psi_t^2) - ln(Gamma_t)
psi_t = 1 + (s_t*eta_t^3 - 3*eta_t)/6 + k_t*(eta_t^4 - 6*eta_t^2 + 3)/24
Gamma_t = 1 + s_t^2/6 + k_t^2/24
The following is my code:
//////////////////////////////////////////
clear
set more off
set fredkey 0c79529a0ad2485bee772c948e68374e, permanently
import fred DEXJPUS, daterange(1990-01-02 2002-05-03) aggregate(daily,eop) clear
drop if missing(DEXJPUS)
tsset daten
format daten %td
gen ret = ln(DEXJPUS)-ln(DEXJPUS[_n-1])
drop if _n==1
export excel using "jpnusfx9102daily", firstrow(variables) replace
sum ret, detail
global h0=r(Var)
global s0=r(skewness)
global k0=r(kurtosis)
* Own maximum likelihood program *
capture program drop garchsktry
program garchsktry
args lnf mu b0 b1 b2 c0 c1 c2 g0 g1 g2
tempvar et ht st kt nt psi gamma
qui gen double `et'=$ML_y1-`mu'
qui gen double `ht'=1
qui gen double `nt'=`et'/sqrt(`ht')
qui gen double `st'=`nt'^3
qui gen double `kt'=`nt'^4
//qui gen double `ht'=$h0
//qui gen double `st'=$s0
//qui gen double `kt'=$k0
//qui gen double `nt'=`et'/sqrt(`ht')
qui replace `ht'=`b0'+`b1'*`nt'[_n-1]^2 + `b2'*`ht'[_n-1] if _n>1
qui replace `st'=`c0'+`c1'*`nt'[_n-1]^3 + `c2'*`st'[_n-1] if _n>1
qui replace `kt'=`g0'+`g1'*`nt'[_n-1]^4 + `g2'*`kt'[_n-1] if _n>1
qui gen double `psi'= log((1 + (`st'*(`nt')^3 - 3*`nt')/6 + (`kt'*(((`nt')^4-6*(`nt')^2+3))/24))^2)
qui gen double `gamma'= log(1+ (`st'^2)/6 + (`kt'^2)/24)
qui replace `lnf'= -0.5*(log(`ht')+(`et'^2)/`ht') + `psi'-`gamma'
end
ml model lf garchsktry (mu: ret=) /lnf /b0 /b1 /b2 /c0 /c1 /c2 /g0 /g1 /g2
ml init /b0=0.0061 /b1=0.0309 /b2=0.9537 /c0=-0.0494 /c1=0.0018 /c2=0.3414 /g0=1.2365 /g1=0.0014 /g2=0.6464
ml search
ml max, difficult
ml report
///////////////////////////////////////////////////////////
This keeps giving me the following message:
"could not calculate numerical derivatives -- flat or discontinuous region encountered
r(430);"
I'm thinking that it might be because the starting values are wrong. I tried using the unconditional moments, but I keep getting the same error message. Is there a way to tell Stata not to rely too much on the starting values? Or can anyone tell me what is wrong with my program? I have run out of good ideas.
I have been stuck on this for a really long time, and any help is really appreciated.
Sarah
P.S.: I forgot to mention that I'm using Stata 17, if that makes any difference.
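A hedged diagnostic, not from the original post, that sometimes helps localize r(430) before touching the starting values: ml check asks ml to verify that the evaluator returns valid results.
Code:
* run after the ml model and ml init statements above, before ml max
ml check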
Split a string variable into character and numeric parts
Hi,
I have a string variable (profile) in my dataset whose values consist of a prefix of H, W, or B and a suffix number ranging from 1 to 1000 (H1, H2, H3, ..., H1000, W1, W2, W3, ..., W1000, etc.). How can I split the prefix characters and the suffix numbers into two different variables? I have tried the code below, which generates the suffix numbers but not the character part.
Thanks,
NM
gen iteration = regexs(0) if regexm(profile, "[0-9]*$")
destring iteration, replace
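A minimal sketch extending the posted code, not from the original thread: capture groups pull out the leading letters and the trailing digits separately.
Code:
gen prefix = regexs(1) if regexm(profile, "^([A-Za-z]+)")
gen iteration = regexs(1) if regexm(profile, "([0-9]+)$")
destring iteration, replace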
Change reference category of dependent variable
Hi,
I am running the following code
mlogit cq25 i.group [pweight = cq25pp], rrr
The dependent variable has three levels. It all works fine except Stata is choosing level 2 as the reference level for the dependent variable when I want level 1 to be the reference level.
I think Stata is doing this as level 1 has n = 80 and level 2 has n = 201 - a default for the larger n I guess.
I know independent variable reference levels can be changed with ib# but how can the dependent variable's reference level be chosen?
Don
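A hedged sketch, not from the original thread: mlogit's baseoutcome() option sets the base (reference) category of the dependent variable.
Code:
mlogit cq25 i.group [pweight = cq25pp], rrr baseoutcome(1)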
How to identify the oldest child with the highest education?
Dear all,
I have household data and I want to generate a variable indicating the year of birth of the child with the highest educational level (condition 1) and, if two or more children have the same educational level, choose the oldest one (condition 2).
For example, take a look at pid=183 (pid is the id of the parents); this individual has three children. Here, I want to create a binary variable identifying the one born in 1953 (e.g., =1, and 0 or missing otherwise) because she is older, even though her educational level is the same as that of the one born in 1962.
Note: I think this question deserves a new thread so I created this one. My previous post can be found at: https://www.statalist.org/forums/for...r-parents-data
Thank you.
Code:
clear input long qid byte(csex relationship) int cyob byte cedu 11 2 4 1970 7 13 1 3 1975 7 16 1 3 1971 5 110 1 3 1959 7 111 1 3 1977 7 112 1 3 1977 6 123 1 3 1988 7 125 1 3 1982 7 129 2 4 1985 6 134 1 3 1975 5 136 1 3 1963 4 136 2 4 1967 4 137 1 3 1976 6 137 2 3 1980 7 137 1 3 1983 6 138 2 4 1969 5 139 1 3 1955 5 140 1 3 1978 4 141 1 3 1970 5 142 2 4 1959 4 146 1 3 1976 5 146 1 3 1978 5 147 1 3 1957 4 148 1 3 1982 5 151 1 3 1975 3 152 1 3 1986 3 152 1 3 1992 3 153 1 3 1955 4 154 1 3 1977 4 156 1 3 1957 4 159 1 3 1962 4 161 1 4 1968 5 161 1 3 1963 6 163 1 3 1956 4 164 1 3 1973 7 164 1 3 1979 7 167 2 4 1978 9 168 1 3 1973 7 168 2 4 1975 7 169 1 3 1959 4 171 1 3 1962 5 173 1 3 1970 5 175 1 3 1980 8 176 1 3 1979 8 178 1 3 1974 6 180 1 3 1990 5 181 1 3 1985 7 182 1 3 1957 4 182 2 4 1962 7 183 2 4 1953 4 183 2 4 1949 1 183 1 3 1962 4 186 1 3 1955 4 188 1 3 1964 4 190 1 3 1971 5 191 1 3 1984 7 192 1 3 1971 7 192 1 3 1977 5 193 1 3 1964 7 196 2 4 1956 4 197 1 3 1981 7 331 2 4 1993 4 332 1 3 1968 3 333 1 3 1973 3 336 1 3 1975 7 337 1 3 1965 3 338 2 4 1967 4 362 1 3 1963 5 363 2 4 1977 4 366 1 3 1960 5 369 1 3 1949 6 384 1 3 1975 7 387 1 3 1977 4 389 1 3 1975 7 463 1 3 1979 8 464 1 3 1973 7 465 1 3 1966 7 469 1 3 1981 7 491 2 4 1983 6 491 2 4 1991 5 493 2 4 1958 7 494 1 3 1982 7 496 1 3 1973 8 497 2 4 1983 6 end label values csex LABEL_B25 label def LABEL_B25 1 "Male", modify label def LABEL_B25 2 "Female", modify label values relationship relationship label def relationship 3 "Son", modify label def relationship 4 "Daughter", modify label values cedu cedu label def cedu 1 "No schooling", modify label def cedu 3 "Primary", modify label def cedu 4 "Lower secondary", modify label def cedu 5 " Upper secondary", modify label def cedu 6 "Prof secondary education", modify label def cedu 7 "Junior college/University", modify label def cedu 8 "Master", modify label def cedu 9 "PhD", modify
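A minimal sketch of one approach, not from the original thread (it assumes no missing cedu or cyob and that only children appear, as in the example data): sort within qid so the last observation in each group has the highest cedu and, among ties, the earliest cyob, then flag it.
Code:
gen int neg_cyob = -cyob
bysort qid (cedu neg_cyob): gen byte oldest_highest = (_n == _N)   // 1 for the oldest child with the highest education
drop neg_cyob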
Identifying partial name overlap in string variables
Hi. I have two datasets with the variable “Product”, the name of different pharma drugs. I am trying to merge the two datasets by name. However, the names are not standardized, as shown in the example below. After merging, I want to identify observations whose names partially overlap but which end up in “master only (1)” or “using only (2)”. In the example below, this would identify observations 3 to 7. I don’t have much familiarity working with strings and would appreciate any guidance. Thanks.
Code:
input str50 Product byte _merge
"A&D" 1
"A/B OTIC" 1
"ALLERX" 2
"ALLERX (AM/PM DOSE PACK 30)" 1
"ALLERX (AM/PM DOSE PACK)" 1
"ALLERX DF" 2
"ALLERX PE" 1
"ABILIFY" 1
"ACARBOSE" 1
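A minimal sketch of one simple heuristic, not from the original thread: compare on the first word of Product and flag first words that appear in both the master-only and using-only groups; in the example above this picks out observations 3 to 7.
Code:
gen firstword = word(Product, 1)
bysort firstword: egen byte in_master = max(_merge == 1)
bysort firstword: egen byte in_using = max(_merge == 2)
gen byte partial_overlap = in_master & in_using & inlist(_merge, 1, 2)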
Graphing a multilevel model with interaction effect
Hi all!
I'm looking for suggestions on how to graph a multilevel model I'm working on.
I'm working with some covariates, which I nest within one group: country. Specifically, I'm looking for suggestions on how to graph the interaction effect (i.soldicontatti##c.individualism) by country. My model looks like this:
Code:
melogit trust_d i.soldicontatti##c.individualism c.tightz i.health_d c.difficulties_o c.supportclosec i.vitasociale c.edu_c sex i.diffecon c.agepul c.agepul_q c.familymember || country:
I've looked around, but what I have found doesn't help much.
I'm using Stata 16.
Thanks in advance and let me know if you need me to be more specific about my dataset.
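One hedged possibility, not from the original thread: after fitting the melogit model above, margins and marginsplot can plot the predicted outcome for each soldicontatti level across chosen values of individualism (the at() values below are placeholders to adjust to the observed range; this shows the overall interaction rather than country-specific curves).
Code:
margins soldicontatti, at(individualism = (-1 0 1))
marginsplot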
Using the new table/collect commands for row percentages with multiple factor variables
Hi all,
I recently attended the webinar on customizable tables in Stata 17. One of the sample tables used in the webinar, shown below in slightly modified form, demonstrates the power of the new table/collect commands.
Code:
-------------------------------------------------------
Hypertension
No Yes Total
-------------------------------------------------------
Sex
Male 2,611 43.7% 2,304 52.7% 4,915 47.5%
Female 3,364 56.3% 2,072 47.3% 5,436 52.5%
Race
White 5,317 89.0% 3,748 85.6% 9,065 87.6%
Black 545 9.1% 541 12.4% 1,086 10.5%
Other 113 1.9% 87 2.0% 200 1.9%
-------------------------------------------------------
Equally common is the need to show row percentages across the column variable, in this instance the yes/no values of hypertension. Alternatively, in a slightly different arrangement, a table that combines the results from multiple tabstat runs, showing the total case count, the count of cases with the yes value, and the percent with yes, all properly formatted, would be very helpful as well. I searched without success for any postings related to this type of table using the new Stata 17 table/collect commands. Shown below is the log from tabstat using the same dataset as the webinar.
Code:
. webuse nhanes2, clear
. tabstat highbp, by(sex) s(count sum mean)
Summary for variables: highbp
Group variable: sex (Sex)
sex | N Sum Mean
-------+------------------------------
Male | 4915 2304 .4687691
Female | 5436 2072 .3811626
-------+------------------------------
Total | 10351 4376 .4227611
--------------------------------------
. tabstat highbp, by(race) s(count sum mean)
Summary for variables: highbp
Group variable: race (Race)
race | N Sum Mean
-------+------------------------------
White | 9065 3748 .4134584
Black | 1086 541 .4981584
Other | 200 87 .435
-------+------------------------------
Total | 10351 4376 .4227611
--------------------------------------
Any help will be greatly appreciated.
Best,
Ron
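Not the table/collect layout asked about, just a hedged way to compute and check the row percentages themselves while the layout question stands: tabulate with the row option.
Code:
webuse nhanes2, clear
tabulate sex highbp, row nofreq
tabulate race highbp, row nofreq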
Average log size in panel data
Hello,
I have a random effects model with clustered errors and I would like to add the log of size to the model. When I generate the log and run the regression, it remains insignificant. Is this because size is an average for each fund and so does not change over time?
Dropping string observations starting with a number
Hi. I have the following data and want to delete all observations where the Product starts with a number. The Product variable is a string variable. I’d appreciate any help. Thanks.
Code:
"1 ANTI-INFECTIVES" "1 ANTI-INFECTIVES" "1 ANTI-INFECTIVES" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "105 MISCELLANEOUS AGENTS" "115 NUTRITIONAL PRODUCTS" "115 NUTRITIONAL PRODUCTS" "122 RESPIRATORY AGENTS" "122 RESPIRATORY AGENTS" "122 RESPIRATORY AGENTS" "122 RESPIRATORY AGENTS" "20 ANTINEOPLASTICS" "20 ANTINEOPLASTICS" "ACCOLATE" "ABILITY" "ACCU-CHECK"
I got different results from xtreg and eventdd
Hello,
I have unbalanced panel data; the DV is shortage, and the independent variable is civica_status, which takes the value 0 before the event and 1 after the event happened. Not all the observations were treated.
I used the following two commands:
Code:
xi: xtreg shortage civica_status i.date lead4 lead3 lead2 lead1 lag1-lag3, fe vce(cluster NDCid)
Code:
eventdd shortage civica_status i.date, fe timevar(dif) ci(rcap) level(95) cluster(NDCid) accum lags(3) leads(4) graph_op(ytitle("Civica"))
but I got different results from each of them! Shouldn't the results be the same?
Thanks,
Using omega squared after regression
Dear Forum,
Can I use omega squared after regression to determine effect size if my independent variables of interest are quantitative?
Thank you!
Regression insignificant after adding industry fixed effects
Hello,
I am new to this forum and I am currently writing an MSc thesis in finance.
I am researching the effect of CEO overconfidence on seasoned equity offering (SEO) announcements and short-term stock returns.
My regression model looks like this.

Dependent variable: Cumulative abnormal returns (CAR) post- SEO announcement
Independent variable: Overconfidence dummy
Control variables: size, book-to-market, leverage, return on assets, issue size, underpricing, CEO age, CEO gender (all continuous except gender; size is expressed in logarithms).
I have computed the CARs for firms that announced SEOs during 2010-2020, so my data is cross-sectional.
My thesis supervisor suggested I could perhaps add firm/industry and/or year fixed effects.
Therefore, I have the following regression command: reg CAR Overconfidence MktCap BtM Lev Roa IssSize UndPr Age Male i.Industry i.Year, vce(robust)
However, this command returns very small coefficients (at most 0.05), all of the variables including the constant are statistically insignificant, and the overall regression significance (Prob > F) is returned as a dot.
I have tried experimenting with the regression, removing year fixed effects and/or industry fixed effects and removing the robust standard errors.
The most significance I get is when I add year fixed effects and robust standard errors: 3 of 8 variables become significant (Overconfidence, IssSize, UndPr), and the overall regression significance (Prob > F) becomes 0.000.
How can this be explained or interpreted?
Looking forward to your answers
Kind regards,
Darya
How to convert date variable to year group?
I am given a list of date values in dd/mm/yy format and I would like to group them by year. For example, I want all dates from 2010 to be grouped under a 2010 year value and all dates from 2011 under a 2011 year value.
How do I go about doing this?
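A minimal sketch, not from the original thread (datevar is a placeholder name; if the variable is already a Stata daily date, year(datevar) alone is enough): convert the dd/mm/yy string to a Stata date, using the third argument of daily() to resolve the two-digit years, then extract the year.
Code:
gen double date_num = daily(datevar, "DMY", 2050)   // e.g. "25/03/10" becomes 25mar2010
format date_num %td
gen int year = year(date_num)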
Need help with constraint dropped when testing for equality of coefficients and joint significance of two variables
Hi all,
Currently, I am running a multiple regression with totalcases as the dependent variable and six other variables (border, PM, health, popu, totalphy, GDP) as independent variables. GDP, totalphy (total physicians in a country), and health (health expenditure of a country) are calculated by generating new variables as population times the per-capita or per-1000 indicators (GDP per capita, health expenditure per capita, physicians per 1000).
The problem is that when I tried to test the equality of the coefficients of health and GDP, the result was "constraint 1 dropped" and there was no F value nor Prob > F. I also tested joint significance for 15 pairs of variables, and many pairs suffered from the same dropped constraint.
Can anyone provide me with a detailed explanation and solution? I appreciate every answer given to me.
Reshaping and keeping labels
Hello!
As context, I'm estimating several regressions with around 900 parameters. I need to save around 400 of them (the ones that are an interaction, so I can then merge them back into the data set and do a scatter plot). I'm using statsby to save the parameters to a dataset, however, statsby gives us something like:
. des _stat_619
Variable Storage Display Value
name type format label Variable label
----------------------------------------------------------------------------------------------------------------------------
_stat_619 float %9.0g _b[4672.cae4#c.educ]
When reshaping, the 619 means nothing to me. I actually only care about the 4672 in the label.
Practically, my problem can be explained in the following manner:
Code:
clear
input id x2007 x2008 x2009
1 12 16 18
end
foreach v of varlist x* {
label variable `v' "`=substr("`v'",1,1)' factor(`=substr("`v'",length("`v'")-3,4)')"
}
desc *
//This is what I'm getting (see list):
reshape long x, i(id) j(year)
list
//This is what I want (see list):
label define yearlbl 2007 "x factor(2007)" 2008 "x factor(2008)" 2009 "x factor(2009)" , replace
label values year yearlbl
list
Thanks for your time!
Hélder
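A minimal sketch of an alternative to typing the value label by hand, not from the original post: using the same toy data as above, build yearlbl from the variable labels before reshaping.
Code:
foreach v of varlist x* {
    local yr = substr("`v'", 2, .)
    label define yearlbl `yr' "`: variable label `v''", add
}
reshape long x, i(id) j(year)
label values year yearlbl
list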
Help needed: Pearson's r table for categorical independent variables
Hi All,
I am trying to create a Pearson's correlation table in Stata.
The experiment has an independent variable (respondents are randomised between 3 manipulations): none, negative non-verbal cues, and positive non-verbal cues.
The dependent variables cover investing decisions and attitudes.
At the moment my Stata sheet is cleaned, with a string variable Cues in the leftmost column containing the negative, positive, and none responses (see the attached photo).
Now I want to compare quite a few variables within the Pearson's table. How can I do this, though, given the categorical IV situation?
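A hedged sketch, not from the original thread (invest_decision and attitude are placeholder outcome names): one common workaround is to create indicator variables for the cue conditions and correlate those with the outcomes.
Code:
encode Cues, gen(cues_n)
tabulate cues_n, gen(cue_)        // creates cue_1, cue_2, cue_3 indicators
pwcorr cue_1 cue_2 cue_3 invest_decision attitude, sig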
Line graph (mean) with panel data
Hey Stata community,
I tried to plot something using a panel data set and the "line" command (sales over time from 20 different companies).
Unfortunately, Stata shows me a separate line for each company. I just want one line over time (the mean of sales across all companies).
To demonstrate, I added a graph.
How can I see the mean line instead of 20 lines in the plot? Is there a mean-command that works?
Many thanks and best regards!
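A minimal sketch, not from the original thread (sales and year are placeholder names): collapse to the mean across companies for each period and plot the single resulting series.
Code:
preserve
collapse (mean) sales, by(year)
line sales year
restore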
Non-parametric correlation test between a dichotomous variable and a continuous one
Dear Statalist.
I have a small sample of countries. I have two variables: X, which is continuous, and Y, which is dichotomous. The problem is that X is not normally distributed, so I cannot use the point-biserial test, and I am therefore looking for a non-parametric test. Reading the literature, I came across the "eta correlation test"; however, I do not know how to compute it in Stata. Any advice?
Thanks in advance,
Ibai
How to create a variable from values of other members of a group?
Dear all,
I am analysing an unbalanced panel dataset that is similar to the following example:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte(family id) str1 partnerid int(year inc) 1 1 "2" 2000 1250 1 1 "2" 2001 1250 1 1 "2" 2002 1300 1 1 "2" 2003 1300 1 1 "2" 2004 1380 1 1 "2" 2005 1400 1 2 "1" 2000 2000 1 2 "1" 2001 2120 1 2 "1" 2002 2120 1 2 "1" 2003 2120 1 2 "1" 2004 2250 1 2 "1" 2005 2250 2 3 "4" 2000 1300 2 3 "4" 2001 0 2 4 "3" 2000 1500 2 4 "3" 2001 1600 2 4 "." 2002 1600 2 4 "." 2003 1800 2 4 "5" 2004 1800 2 5 "4" 2004 1400 end
I want to create a variable, partner income (partnerinc), that shows the income of the partner in the corresponding year, if a partner was in the family. The solution should look like this:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte(family id) str1 partnerid int(year inc) str4 partnerinc 1 1 "2" 2000 1250 "2000" 1 1 "2" 2001 1250 "2120" 1 1 "2" 2002 1300 "2120" 1 1 "2" 2003 1300 "2120" 1 1 "2" 2004 1380 "2250" 1 1 "2" 2005 1400 "2250" 1 2 "1" 2000 2000 "1250" 1 2 "1" 2001 2120 "1250" 1 2 "1" 2002 2120 "1300" 1 2 "1" 2003 2120 "1300" 1 2 "1" 2004 2250 "1380" 1 2 "1" 2005 2250 "1400" 2 3 "4" 2000 1300 "1500" 2 3 "4" 2001 0 "1600" 2 4 "3" 2000 1500 "1300" 2 4 "3" 2001 1600 "0" 2 4 "." 2002 1600 "." 2 4 "." 2003 1800 "." 2 4 "5" 2004 1800 "1400" 2 5 "4" 2004 1400 "1800" end
I did it manually for these very few observations, but how do I create such a variable in general? I looked through other posts on the forum but could not find a related one or get the hang of it when I did.
I am using Stata 16.0 on Windows.
Thank you very much for your help in advance.
Best regards,
Marvin
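A minimal sketch of one approach, not from the original thread: merge the file onto itself so each person picks up the partner's income in the same year (this assumes partnerid identifies at most one partner per person-year).
Code:
gen long pid = real(partnerid)            // "." becomes numeric missing
preserve
keep id year inc
rename (id inc) (pid partnerinc)
tempfile partners
save `partners'
restore
merge m:1 pid year using `partners', keep(master match) nogenerate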
Predicted probability meaning
Hi All,
I apologize for asking this very basic question. As part of my PhD research, I calculated predicted probabilities by calling the mtable command in Stata and presented the results, following the interpretation in Regression Models for Categorical Dependent Variables Using Stata (Stata Press, 2014) by J. Scott Long and Jeremy Freese. Now I have been asked by my supervisor to define what a predicted probability is. I tried to find an answer to this question myself but ended up with nothing. So, kindly help me find the meaning of a predicted probability. I would be really grateful. Thanks in advance.
Regards
Karthick Veerapandian.
Kakwani index Standard errors
Hi,
I want to know how to obtain Kakwani index Standard errors, where Kakwani index is the concentration of health care payments minus the Gini coefficient.
Thanks,
Aarushi
Reshaping from wide to long and keeping the label
Hello everyone,
I want to reshape my dataset from wide to long format and keep the label information as a new variable.
This is what my data looks like in wide format:
Code:
CompanyName _v1 _v2 _v3 _v4 "GSW Immobilien AG" 0 5894403840 -1.88679245283019 5910893287.23354 "Deutsche Rohstoff AG" 4.28571428571428 74193506.2 9.80392156862745 71144458 "Draegerwerk AG & Co KGaA" -.134408602150549 1392782583.91451 -.626666666666659 1407410211.36748 "Tonkens Agrar AG" .442477876106205 7521354.29451648 0 7519657.36964375 "BHB Brauholding Bayern Mitte AG" -13.9240506329114 8432000 16.1764705882353 9796000 "Deutsche Konsum REIT AG" -.324675324675329 538890934.198159 -.323624595469249 542916002.194247 "BRAIN Biotech AG" 2.40700218818379 185902329.6 -1.72043010752688 181532830.4 "Senvion SA" 10.4972375690608 14623376.6 .555555555555556 13234155.823 "Lion E Mobility AG" -6.63265306122449 36719436.78 -3.20987654320987 39327921.36 "va Q tec AG" -6.35179153094462 376323182.5 -.486223662884934 401847711.4 "Bitcoin Group SE" -8.73180873180874 2.195e+08 -1.83673469387755 2.405e+08 "Shop Apotheke Europe NV" -.846354166666656 2731518928.3 -4.77371357718538 2754834585.6 "Medios AG" -2.77777777777778 709274685 0 729539676 end
_v1 is the return on day one (the label of the variable is the date, here 5/11/2021) and _v2 is the market cap of the same day (again, the label is 5/11/2021). I reshaped the data using the following command:
Code:
gen long obs_no = _n
reshape long _v, i(obs_no) j(_j)
gen which_var = "return" if mod(_j, 2) == 1, before(_v)
replace which_var = "market_cap" if missing(which_var)
gen firm_num = ceil(_j/2)
drop _j
reshape wide _v, i(firm_num obs_no) j(which_var) string
rename _v* *
drop obs_no
Unfortunately, I have not yet been able to create a new variable with the date. What I would like is a new variable holding the label information: for firm_num==1 it would be 5/11/2021, the same date carried by _v1 and _v2. This is what my data looks like after reshaping:
Code:
* Example generated by -dataex-. For more info, type help dataex clear firm_num market_cap return CompanyName 1 5894403840 0 "GSW Immobilien AG" 1 74193506.2 4.28571428571428 "Deutsche Rohstoff AG" 1 1392782583.91451 -.134408602150549 "Draegerwerk AG & Co KGaA" 1 7521354.29451648 .442477876106205 "Tonkens Agrar AG" 1 8432000 -13.9240506329114 "BHB Brauholding Bayern Mitte AG" 1 3931445305.56908 -2.66106442577032 "Stroeer SE & Co KGaA" 1 11003685830.4026 -1.27868852459017 "Uniper SE" 1 36719436.78 -6.63265306122449 "Lion E Mobility AG" 1 376323182.5 -6.35179153094462 "va Q tec AG" 1 2.195e+08 -8.73180873180874 "Bitcoin Group SE" 1 2731518928.3 -.846354166666656 "Shop Apotheke Europe NV" 1 709274685 -2.77777777777778 "Medios AG" end
Sunday, May 30, 2021
ITSA command - stationarity
Hello all,
I am using the -itsa- command for a time series regression with covariates. My time series are a mixture of I(0) and I(1); that is, some are stationary in levels and others are stationary in first differences. I have not been able to find anywhere how I should include both types of time series in the -itsa- command. The Stata Journal article on this command does not explain how to deal with such a situation either.
So should I supply only the original variables, and will the -itsa- command automatically do the calculations based on the level at which each time series is stationary? Or should I specify a mixed model with some variables in levels and others in first differences?
Any help would be appreciated.
Regards,
Mujahid
Time invariant variables in hybrid model
Dear all,
I am studying the determinants of labour productivity and I am using the hybrid model.
First of all, I chose the growth rate of log_LP as the (continuous) dependent variable, and the growth rate of log_LP of frontier firms, the lagged gap between the LP of frontier and laggard firms, and firm-level control variables (age, size, intangible assets, digital training, digital adoption, STRI indicators) as explanatory variables.
They are all time-variant variables except for digital adoption (Broadband, ERP, CRM, Cloud computing), digital training (ICT training, ICT specialists, etc.), and the STRI indicators, i.e. the time-invariant and sectoral-level variables.
Since those variables do not cover all years in the dataset, I have replaced the missing values with their sectoral-level averages. Therefore, the values vary across sectors, not over years.
And when I apply this model in Stata, it says that the variables below are omitted.
I am sure there are mistakes in my model as well as in my code, but I could not really identify the problem.
Could you please tell me what the problems are here?
Code:
note: d_Broadband omitted because of collinearity note: d_ERPusing omitted because of collinearity note: d_CRMusing omitted because of collinearity note: d_Cloudcomputing omitted because of collinearity note: d_ICTspecialist omitted because of collinearity note: d_ICTtraining omitted because of collinearity note: d_training_ICTspecialist omitted because of collinearity note: d_regulatory_transparency omitted because of collinearity note: m_CRMusing omitted because of collinearity note: m_Cloudcomputing omitted because of collinearity note: m_ICTspecialist omitted because of collinearity note: m_ICTtraining omitted because of collinearity note: m_training_ICTspecialist omitted because of collinearity Random-effects GLS regression Number of obs = 340 Group variable: id Number of groups = 192 R-sq: Obs per group: within = 0.3428 min = 1 between = 0.2131 avg = 1.8 overall = 0.2208 max = 6 Wald chi2(20) = . corr(u_i, X) = 0 (assumed) Prob > chi2 = . -------------------------------------------------------------------------------- dlog_lp_VA | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------------+---------------------------------------------------------------- Frontier_gro~h | .1618926 .0592357 2.73 0.006 .0457927 .2779926 Lagged_gap | .3425023 .0419372 8.17 0.000 .260307 .4246976 d_g_intangible | -.4667624 .3021783 -1.54 0.122 -1.059021 .1254962 d_share_inta~l | .0111534 .0900579 0.12 0.901 -.1653567 .1876636 d_age | -.0009252 .002784 -0.33 0.740 -.0063816 .0045313 d_Broadband | 0 (omitted) d_ERPusing | 0 (omitted) d_CRMusing | 0 (omitted) d_Cloudcompu~g | 0 (omitted) d_ICTspecial~t | 0 (omitted) d_ICTtraining | 0 (omitted) d_training_I~t | 0 (omitted) d_ind_STRI | -189.4641 733.7918 -0.26 0.796 -1627.67 1248.741 d_res_for | 196.2257 736.4029 0.27 0.790 -1247.097 1639.549 d_res_monvem~t | 203.6151 775.5217 0.26 0.793 -1316.379 1723.61 d_discrimina~e | 141.2516 550.3346 0.26 0.797 -937.3844 1219.888 d_barriers_c~n | 149.8324 604.2113 0.25 0.804 -1034.4 1334.065 d_regulatory~y | 0 (omitted) m_g_intangible | 3.318489 1.839872 1.80 0.071 -.287593 6.924571 m_share_inta~l | -1.182859 .6844581 -1.73 0.084 -2.524372 .1586545 m_age | -.162613 .0905383 -1.80 0.072 -.3400648 .0148387 m_Broadband | -4.683323 4.759522 -0.98 0.325 -14.01181 4.645168 m_ERPusing | -.1093101 .1632869 -0.67 0.503 -.4293465 .2107263 m_CRMusing | 0 (omitted) m_Cloudcompu~g | 0 (omitted) m_ICTspecial~t | 0 (omitted) m_ICTtraining | 0 (omitted) m_training_I~t | 0 (omitted) m_ind_STRI | 142.9161 138.9487 1.03 0.304 -129.4183 415.2505 m_res_for | 0 (omitted) m_res_monvem~t | -261.7917 266.8595 -0.98 0.327 -784.8267 261.2434 m_discrimina~e | -141.1574 215.5238 -0.65 0.512 -563.5762 281.2614 m_barriers_c~n | -371.0285 383.4101 -0.97 0.333 -1122.498 380.4414 m_regulatory~y | -579.394 576.4592 -1.01 0.315 -1709.233 550.4452 _cons | 479.3276 488.1428 0.98 0.326 -477.4147 1436.07 ---------------+---------------------------------------------------------------- sigma_u | 0 sigma_e | .45474182 rho | 0 (fraction of variance due to u_i) --------------------------------------------------------------------------------
Thanks,
Anne-Claire
Renaming every three variables
Hi,
I am trying to rename every three variables in my dataset such that the first three variables X1 X2 X3 become HH-W-X1 HH-B-X2 HH-H-X3. Also, I would like to rename the three subsequent variables as follows: X4 X5 X6 -> UH-X4 UX-X5 UX-X6. My data includes 42 variables, X1 X2 ... X42, and I would like to loop this over all 42 variables. Thanks for your help.
Best,
NM
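A minimal sketch, not from the original thread, with two loudly labeled assumptions: Stata variable names cannot contain hyphens, so underscores stand in for them, and the prefix pattern is assumed to repeat every six variables; adjust the prefixes list to the real scheme.
Code:
local prefixes HH_W HH_B HH_H UH UX UX
forvalues i = 1/42 {
    local p : word `= mod(`i' - 1, 6) + 1' of `prefixes'
    rename X`i' `p'_X`i'       // X1 -> HH_W_X1, X4 -> UH_X4, ...
}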
Creating an IF or Conditional command
Hi All,
It is my first time in this forum and I'm hoping you will be able to assist me.
My study is on adolescent obesity and I have a huge dataset that includes everyone, adults included. I would like to generate a variable called Mother_IBM. The condition is that the person should be female (coded 2) and have No_of_kids>0.
Thanks
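A minimal sketch, not from the original thread (sex and BMI are placeholder variable names to adjust to the dataset): copy BMI into Mother_IBM only for females with at least one child.
Code:
gen Mother_IBM = BMI if sex == 2 & No_of_kids > 0 & !missing(No_of_kids)   // the !missing() guard keeps missing No_of_kids out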
Lee bounds with multiple treatment group
Hi,
I'm working with a dataset with high attrition rate and am considering using Lee bounds to estimate the treatment effect. Below is the code I am using to determine the share of respondents I need to trim above/below to compute the Lee bounded treatment effect for a binary treatment variable.
Code:
quietly count if intervention==0 & time==0
local tot_control=`r(N)'
quietly count if intervention==1 & time==0
local tot_treatment=`r(N)'
quietly count if intervention==0 & time==1 & consent2==1
local found_control=`r(N)'
quietly count if intervention==1 & time==1 & consent2==1
local found_treatment=`r(N)'
local q_control=`found_control'/`tot_control'
local q_treatment=`found_treatment'/`tot_treatment'
if `q_treatment'>`q_control' {
local q1=(`q_treatment'-`q_control')/`q_treatment'
}
if `q_treatment'<`q_control' {
local q1=(`q_control'-`q_treatment')/`q_control'
}
I was wondering how I would proceed if I have three treatment groups (Treatment 1, Treatment 2, and Control).
Thanks
Merge panel data with multiple same-year observations
Hi All,
I am trying to merge two datasets. One is a firm ID-year panel (the master data). The other dataset (hereafter the ECHO dataset) contains the US Government's information on penalties charged to firms for environmental violations. So it can happen that a firm has years with no violations, but it can also happen that a firm has two or more violations in one year.
My objective is to merge some variables like "total yearly sanction" (Variable 9 below) into my master data.
I have generated some variables on the ECHO dataset that I want to merge with my panel master data. In particular, my question is to understand how I could merge the panel data with variable 9.
Code:
* 7) sum of fed penalty, compliance cost and SEP cost
gen tot_sanction = fed_pen + sep_cost + tot_compl_amt
* 8) total sanction per year
bysort f_id settle_year: gen tot_sanction_y = sum(tot_sanction)
* 9) keep one observation per firm-year storing the total sanction per year
bysort f_id settle_year: keep if _n == _N
Here is a copy of the ECHO dataset that I am using:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input int f_id str10 settle_date long settle_year double(fed_pen sep_cost tot_compl_amt) float(tot_sanction tot_sanction_y) 2 "07/28/1978" 4 . . . . 0 2 "01/29/1987" 13 8500 0 0 8500 8500 2 "04/07/1988" 14 0 0 0 0 0 2 "09/19/1989" 15 0 0 0 0 0 2 "10/18/1990" 16 112500 0 0 112500 112500 2 "09/25/1990" 16 . . . . 112500 2 "08/30/1990" 16 0 0 0 0 112500 2 "08/26/1991" 17 4920 0 0 4920 4920 2 "02/15/1991" 17 7267 0 0 7267 12187 2 "05/06/1992" 18 0 0 0 0 0 2 "09/24/1992" 18 25000 0 0 25000 25000 2 "09/24/1992" 18 25000 0 0 25000 50000 2 "06/09/1992" 18 25000 0 0 25000 75000 2 "01/29/1992" 18 0 0 0 0 75000 2 "08/07/1992" 18 9750 0 0 9750 84750 2 "12/07/1993" 19 18381 0 0 18381 18381 2 "06/30/1993" 19 0 0 0 0 18381 2 "12/07/1993" 19 18381 0 0 18381 36762 2 "06/30/1993" 19 0 0 0 0 36762 2 "12/07/1993" 19 18381 0 0 18381 55143 2 "06/30/1993" 19 0 0 0 0 55143 2 "02/25/1993" 19 . . . . 55143 2 "10/29/1993" 19 30000 0 0 30000 85143 2 "05/04/1994" 20 0 0 0 0 0 2 "10/19/1995" 21 0 0 0 0 0 2 "02/14/1995" 21 4200 0 0 4200 4200 2 "05/12/1995" 21 50000 0 0 50000 54200 2 "01/04/1995" 21 4950 0 0 4950 59150 2 "06/12/1995" 21 0 0 0 0 59150 2 "10/03/1996" 22 . . . . 0 2 "04/18/1996" 22 0 0 70000000 7.00e+07 7.00e+07 2 "10/03/1996" 22 84000 0 0 84000 70084000 2 "12/02/1997" 23 17510 0 0 17510 17510 2 "06/18/1997" 23 22500 0 0 22500 40010 2 "09/30/1997" 23 0 0 0 0 40010 2 "07/07/1997" 23 213000 0 0 213000 253010 2 "09/29/1997" 23 238000 0 0 238000 491010 2 "02/20/1997" 23 75515 0 0 75515 566525 2 "09/29/1997" 23 238000 0 0 238000 804525 2 "07/07/1997" 23 213000 0 0 213000 1017525 2 "12/01/1997" 23 0 0 0 0 1017525 2 "10/02/1997" 23 0 0 1590000 1590000 2607525 2 "05/17/1998" 24 0 0 0 0 0 2 "06/30/1998" 24 46450 0 0 46450 46450 2 "11/18/1998" 24 0 0 1000000 1000000 1046450 2 "06/30/1998" 24 46450 0 0 46450 1092900 2 "06/30/1998" 24 46450 0 0 46450 1139350 2 "07/22/1998" 24 0 0 2000000 2000000 3139350 2 "06/30/1998" 24 46450 0 0 46450 3185800 2 "02/04/1999" 25 1000 0 0 1000 1000 2 "06/14/1999" 25 0 0 0 0 1000 2 "03/24/1999" 25 0 0 0 0 1000 2 "01/15/1999" 25 143800 0 0 143800 144800 2 "09/22/2000" 26 38596 0 0 38596 38596 2 "09/22/2000" 26 7150 0 0 7150 45746 2 "09/22/2000" 26 52340 0 0 52340 98086 2 "12/20/2000" 26 5000 0 0 5000 103086 2 "04/17/2000" 26 0 0 5000000 5000000 5103086 2 "09/27/2001" 27 0 0 850000 850000 850000 2 "07/05/2002" 28 0 0 300000 300000 300000 2 "09/26/2002" 28 2810 0 0 2810 302810 2 "06/24/2002" 28 0 0 43000000 4.30e+07 43302808 2 "08/01/2002" 28 0 0 0 0 43302808 2 "09/30/2003" 29 0 0 10000 10000 10000 2 "10/06/2003" 29 0 0 800000 800000 810000 2 "11/14/2003" 29 0 0 0 0 810000 2 "09/30/2003" 29 0 0 0 0 810000 2 "02/20/2003" 29 16170 62225 5000 83395 893395 2 "08/09/2004" 30 0 0 10000 10000 10000 2 "09/30/2004" 30 0 0 7500000 7500000 7510000 2 "09/28/2004" 30 27500 165000 10000 202500 7712500 2 "05/04/2004" 30 0 0 963000 963000 8675500 2 "05/26/2004" 30 0 0 0 0 8675500 2 "06/29/2005" 31 0 0 24500000 2.45e+07 2.45e+07 2 "08/31/2005" 31 0 0 0 0 2.45e+07 2 "01/14/2005" 31 0 0 0 0 2.45e+07 2 "04/25/2005" 31 1 0 0 1 2.45e+07 2 "09/29/2005" 31 0 0 0 0 2.45e+07 2 "09/07/2006" 32 3107 11673 100 14880 14880 2 "05/01/2006" 32 1521983 0 0 1521983 1536863 2 "04/12/2007" 33 0 0 100 100 100 2 "04/12/2007" 33 0 0 100 100 200 2 "02/21/2008" 34 0 0 0 0 0 2 "08/07/2008" 34 0 0 9382412 9382412 9382412 2 "05/07/2008" 34 0 0 18000000 1.80e+07 27382412 2 "06/02/2008" 34 30000 0 0 30000 27412412 2 "09/24/2008" 34 0 0 0 0 27412412 2 "08/21/2008" 34 0 
0 0 0 27412412 2 "10/27/2009" 35 0 0 27000000 2.70e+07 2.70e+07 2 "04/14/2009" 35 0 0 59000000 5.90e+07 8.60e+07 2 "12/24/2009" 35 1310 4913 100 6323 86006320 2 "09/01/2009" 35 0 0 100 100 86006424 2 "11/09/2010" 36 0 0 29980000 29980000 29980000 2 "10/07/2010" 36 0 0 5600000 5600000 35580000 2 "12/06/2011" 37 0 0 1800000 1800000 1800000 2 "02/09/2012" 38 0 0 7300000 7300000 7300000 2 "02/26/2013" 39 0 0 16390000 16390000 16390000 2 "09/25/2014" 40 65000 0 32400 97400 97400 2 "03/17/2014" 40 0 0 7830000 7830000 7927400 2 "05/05/2014" 40 0 0 2000000 2000000 9927400 2 "04/21/2014" 40 0 0 3400000 3400000 13327400 2 "11/22/2016" 42 40000 0 5500000 5540000 5540000 2 "09/26/2017" 43 0 0 7000 7000 7000 2 "01/10/2017" 43 0 0 0 0 7000 2 "04/10/2017" 43 8251 0 0 8251 15251 2 "09/29/2017" 43 50000 0 0 50000 65251 2 "09/30/2019" 45 0 0 11000000 1.10e+07 1.10e+07 2 "02/05/2020" 46 74360 0 6000 80360 80360 2 "" . . . . . 0 2 "" . . . . . 0 2 "" . 4550 0 0 4550 4550 2 "" . . . . . 4550 3 "10/09/1985" 11 4000000 0 0 4000000 4000000 3 "09/30/1990" 16 1 0 0 1 1 3 "07/16/1990" 16 . . . . 1 3 "12/20/1991" 17 0 0 0 0 0 3 "09/28/1992" 18 0 0 0 0 0 3 "08/30/1993" 19 0 0 0 0 0 3 "12/19/1994" 20 28000 0 0 28000 28000 3 "10/31/1994" 20 28000 0 0 28000 56000 3 "09/29/1995" 21 0 0 0 0 0 3 "10/02/1995" 21 182654 0 0 182654 182654 3 "06/12/1995" 21 0 0 0 0 182654 3 "04/18/1996" 22 0 0 0 0 0 3 "09/03/1997" 23 7200 0 0 7200 7200 3 "03/31/1998" 24 0 0 10000005 10000005 10000005 3 "06/08/1999" 25 0 0 0 0 0 3 "09/29/2000" 26 5000 0 0 5000 5000 3 "06/28/2002" 28 0 0 100000 100000 100000 3 "11/01/2002" 28 0 0 35000 35000 135000 3 "10/17/2002" 28 500 0 18750 19250 154250 3 "08/12/2002" 28 3500 0 0 3500 157750 3 "03/31/2003" 29 0 0 0 0 0 3 "04/10/2007" 33 0 0 0 0 0 3 "09/18/2008" 34 25033 0 0 25033 25033 3 "09/30/2009" 35 0 0 0 0 0 3 "02/09/2012" 38 0 0 7300000 7300000 7300000 3 "05/16/2014" 40 0 0 1500 1500 1500 3 "" . . . . . 
0 4 "08/12/2004" 30 0 0 430000 430000 430000 4 "08/16/2004" 30 17903 144692 0 162595 592595 4 "03/30/2006" 32 57372 418300 714000 1189672 1189672 4 "09/30/2011" 37 0 0 0 0 0 4 "12/14/2011" 37 10155 125601 0 135756 135756 4 "06/03/2013" 39 0 0 1500000 1500000 1500000 4 "09/08/2014" 40 825 0 2000 2825 2825 5 "08/03/2009" 35 30000 0 0 30000 30000 5 "08/03/2009" 35 30000 0 0 30000 60000 5 "01/11/2010" 36 30000 0 110000 140000 140000 6 "09/21/2009" 35 0 0 0 0 0 7 "03/02/1994" 20 0 0 0 0 0 7 "08/25/1998" 24 400 0 0 400 400 7 "05/15/1998" 24 0 0 0 0 400 7 "06/25/1999" 25 0 0 0 0 0 7 "04/13/2000" 26 0 0 0 0 0 7 "11/02/2001" 27 0 0 0 0 0 7 "09/28/2001" 27 2640 0 0 2640 2640 7 "09/30/2003" 29 0 0 0 0 0 7 "02/28/2003" 29 5500 0 0 5500 5500 7 "10/14/2003" 29 0 0 0 0 5500 7 "09/30/2004" 30 0 0 7500000 7500000 7500000 7 "04/13/2004" 30 11880 0 0 11880 7511880 7 "01/23/2006" 32 0 0 15 15 15 7 "07/07/2006" 32 0 0 0 0 15 7 "01/23/2006" 32 0 0 15 15 30 7 "11/09/2006" 32 7000 0 0 7000 7030 7 "08/22/2006" 32 0 0 0 0 7030 7 "01/23/2006" 32 0 0 15 15 7045 7 "02/27/2007" 33 3000 0 100 3100 3100 7 "05/08/2007" 33 0 0 37000000 3.70e+07 37003100 7 "07/02/2007" 33 0 0 1400000 1400000 38403100 7 "05/07/2008" 34 0 0 18000000 1.80e+07 1.80e+07 7 "01/30/2009" 35 800 0 100 900 900 7 "05/04/2010" 36 2790 0 100 2890 2890 7 "06/27/2011" 37 0 0 0 0 0 7 "07/15/2015" 41 4080 0 0 4080 4080 7 "08/21/2015" 41 0 0 0 0 4080 7 "11/10/2016" 42 0 0 0 0 0 7 "11/10/2016" 42 0 0 0 0 0 7 "10/28/2016" 42 0 0 0 0 0 7 "06/14/2017" 43 1785 0 2500 4285 4285 7 "05/25/2017" 43 1785 0 2500 4285 8570 7 "04/12/2018" 44 800 0 0 800 800 7 "04/11/2019" 45 0 0 0 0 0 8 "09/28/2015" 41 30800 0 3000 33800 33800 10 "06/13/1985" 11 50000 0 0 50000 50000 11 "03/19/1999" 25 0 0 0 0 0 12 "07/24/2006" 32 142500 0 500000 642500 642500 14 "10/28/1991" 17 15000 0 0 15000 15000 14 "10/04/2004" 30 0 0 1000000 1000000 1000000 15 "07/25/1994" 20 . . . . 0 15 "09/25/2002" 28 0 0 0 0 0 15 "09/24/2002" 28 750 0 100 850 850 15 "08/05/2004" 30 20619 0 51513 72132 72132 15 "06/22/2010" 36 3600 0 5000 8600 8600 16 "09/24/2004" 30 0 0 0 0 0 16 "10/17/2018" 44 0 0 0 0 0 16 "01/18/2018" 44 0 0 0 0 0 20 "04/15/1992" 18 45000 0 0 45000 45000 20 "11/15/1995" 21 0 0 0 0 0 end
Note: I am not sure why settle_year comes out as a 2-digit output here, because in Stata it shows as years like 1980, 1981, ...
Thank you.
Deb
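A minimal sketch of one way to do the merge, not from the original thread (echo_data and panel_master are placeholder file names, and the panel is assumed to identify observations by f_id and a year variable renamed to settle_year): collapse ECHO to one record per firm-year, then merge.
Code:
use echo_data, clear
collapse (sum) tot_sanction_y = tot_sanction, by(f_id settle_year)
tempfile echo_yearly
save `echo_yearly'
use panel_master, clear
merge 1:1 f_id settle_year using `echo_yearly', keep(master match) nogenerate
replace tot_sanction_y = 0 if missing(tot_sanction_y)   // firm-years with no recorded violations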
Interpretation of a linear (percentage) - log regression model
To provide context, I am running a fixed effects regression model assessing the relationship between the percentage an organization spends on overhead and how much it produces, in terms of the number of houses built as well as how much revenue it generates.
My independent variable is the overhead ratio, which is between 0 and 1. My dependent variable is the log of the total number of houses. The coefficient is .88 and significant. So, I take the exponent of .88, which I believe is 2.41, subtract 1 and multiply times 100, which gives me 141%. For the interpretation, I'm saying that a one-percentage-point increase in the overhead ratio equates to a 141% increase in the total number of houses built. However, this seems way too high.
I also use a second dependent variable - the log of total revenue. For this, I get a significant coefficient of .32, which again, I take the exponent of .32 and get 1.38. I then subtract 1 and multiply times 100, which gives me 37.17%. I interpret this as a one-percentage-point increase in the overhead ratio equates to a 37.17% increase in total houses built. Again, this seems way too high to me.
Am I doing the calculations and interpretations correctly? Many thanks in advance!
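For reference, a minimal sketch of the two conversions involved when the outcome is in logs and the regressor is a 0-1 ratio, using the .88 coefficient from the post (arithmetic only; which one applies depends on whether the ratio changes by a full unit or by 0.01):
Code:
* Arithmetic sketch only, using the coefficient b = .88 reported above.
display 100*(exp(.88) - 1)       // % change for a one-unit (0 to 1) change in the ratio
display 100*(exp(.88*0.01) - 1)  // % change for a one-percentage-point (0.01) change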
Problem with Survey data analysis, non-response, selection bias, use of paradata
Hi Statalisters, Greetings from India
I am analysing survey data and I am relatively new to this commonly used study design. I am seeking help on the topics of survey weighting, selection bias, and paradata.
Survey data setup
The survey was of doctors (registered under a particular program in the country) on the impact of the pandemic on health services. The sampling frame consists of 23,900 doctors covered by 3 agencies (Agency A, B and C). Under A there are 13,400 doctors, under B 6,000 doctors, and under C 4,500 doctors. Among these, 700, 1,000 and 1,100 doctors were randomly sampled from Agency A, B and C respectively (total sample = 2,800). From the survey conducted on these 2,800 doctors across the 3 agencies, responses were received from 400 doctors from Agency A, 800 from Agency B and 700 from Agency C.
As per the above, I have assumed that this survey used a stratified random sampling at the agency level. The dataset I have (Data respondents) is on these 1900 doctors. Data is available on about 200 variables from the 1900 responders.
Data available on non-respondents and paradata
The central concern is non-responders and how to account for the ensuing bias, as described below. The challenge I am facing with non-response analysis is that the data I have on the 900 non-responders are minimal. In the data set with the full 2,800 doctors (Data full), the data I have in common across responders and non-responders are only their (1) agency (A, B or C), (2) qualification (3-category variable: bachelors, specialization, super specialization), and (3) province (5-category variable). Additionally, I also have paradata on the ‘number of attempts’ to contact the doctors (Attempt 1, 2 and 3) – Var 4. The reason for non-response among the 900 doctors is also recorded (reasons fall under 10 categories).
Analysis will involve estimating frequencies and proportions, and a few regression models giving crude odds ratio estimates. What is the practical way to analyse this survey data accounting for selection bias?
I give below a sample data set with 30 observations and a few variables produced by -dataex-. The data structure below is that of respondents only.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte id int dateofsurvey str1 agency str2 province str19 qualification byte numberofattemptstocontact str23 age byte opdload_ct str14 opdload_hilo str64 servicesb4c19 byte(services_tests_b4c19 services_meds_b4c20 services_ehealth_b4c21) 1 22494 "B" "P2" "Superspecialization" 1 "30 - 45 years old" 10 "Same as before" "Testing, Providing medication" 1 1 0 2 22494 "B" "P2" "Superspecialization" 1 "30 - 45 years old" 10 "Higher" "Testing, Providing medication" 1 1 0 3 22494 "C" "P1" "Superspecialization" 1 "Older than 61 years old" 20 "Lower" "Testing, Providing medication" 1 1 0 4 22494 "C" "P1" "Superspecialization" 1 "30 - 45 years old" 20 "Lower" "Testing, Providing medication" 1 1 0 5 22494 "C" "P1" "Superspecialization" 1 "30 - 45 years old" 40 "Same as before" "Testing, Providing medication" 1 1 0 6 22494 "B" "P2" "Superspecialization" 2 "46 - 60 years old" 30 "Same as before" "Testing, Providing medication, Home consultation" 1 1 0 7 22494 "B" "P3" "Superspecialization" 1 "46 - 60 years old" 15 "Same as before" "Testing " 1 0 0 8 22494 "B" "P3" "Specialization" 1 "30 - 45 years old" 20 "Higher" "Testing " 1 0 0 9 22494 "B" "P3" "Superspecialization" 2 "46 - 60 years old" 25 "Lower" "Other, please specify" 0 0 0 10 22494 "B" "P3" "Superspecialization" 1 "30 - 45 years old" 60 "Lower" "Testing, Other, please specify" 1 0 0 11 22494 "B" "P3" "Superspecialization" 2 "30 - 45 years old" 25 "Lower" "Testing, Providing medication" 1 1 0 12 22525 "C" "P1" "Superspecialization" 1 "30 - 45 years old" 25 "Same as before" "Providing medication, testing" 1 1 0 13 22525 "C" "P1" "Superspecialization" 1 "46 - 60 years old" 30 "Lower" "Testing, Providing medication" 1 1 0 14 22525 "C" "P1" "Superspecialization" 2 "30 - 45 years old" 3 "Lower" "Providing medication, testing, Other, please specify" 1 1 0 15 22525 "C" "P1" "Superspecialization" 1 "30 - 45 years old" 10 "Lower" "Providing medication, testing" 1 1 0 16 22525 "C" "P1" "Superspecialization" 1 "46 - 60 years old" 40 "Lower" "Providing medication, testing" 1 1 0 17 22525 "C" "P1" "Superspecialization" 2 "46 - 60 years old" 10 "Lower" "Providing medication " 0 1 0 18 22525 "C" "P1" "Superspecialization" 1 "30 - 45 years old" 10 "Lower" "Testing, Providing medication" 1 1 0 19 22525 "C" "P1" "Superspecialization" 1 "46 - 60 years old" 50 "Lower" "Testing, Providing medication" 1 1 0 20 22555 "B" "P2" "Superspecialization" 3 "30 - 45 years old" 30 "Same as before" "Testing, Providing medication, Home consultation" 1 1 0 21 22555 "B" "P2" "Superspecialization" 1 "30 - 45 years old" 15 "Lower" "Testing, Providing medication" 1 1 0 22 22555 "B" "P2" "Superspecialization" 3 "30 - 45 years old" 20 "Lower" "Testing, Other, please specify Providing medication, E-health " 1 1 1 23 22555 "B" "P2" "Superspecialization" 1 "Less than 30 years old" 20 "Lower" "Testing, Providing medication" 1 1 0 24 22555 "A" "P5" "Superspecialization" 1 "Less than 30 years old" 20 "Same as before" "Providing medication, testing, Other, please specify" 1 1 0 25 22555 "A" "P4" "Bachelors" 1 "30 - 45 years old" 20 "Higher" "Testing " 1 0 0 26 22555 "A" "P4" "Superspecialization" 3 "Less than 30 years old" 20 "Lower" "Testing " 1 0 0 27 22555 "A" "P4" "Specialization" 1 "Less than 30 years old" 100 "Lower" "E-health " 0 0 1 28 22555 "A" "P4" "Superspecialization" 3 "30 - 45 years old" 5 "Lower" "Providing medication " 0 1 0 29 22494 "B" "P2" "Superspecialization" 1 "30 - 45 years old" 60 "Lower" 
"Testing, Other, please specify" 1 0 0 30 22555 "A" "P5" "Superspecialization" 1 "30 - 45 years old" 30 "Same as before" "Testing, Providing medication, Home consultation" 1 1 0 end format %tdnn/dd/CCYY dateofsurvey
The following are the codes I have started with (I am using StataMP 13 on Windows 10):
Code:
gen wt_strat=13400/400
replace wt_strat=6000/800 if agency=="B"
replace wt_strat=4500/700 if agency=="C"
gen fpc_strat=1/wt_strat
Following this I ran the survey set command, where ‘id’ is a variable specific to each doctor in the list:
Code:
svyset id [pweight=wt_strat], strata(agency) fpc(fpc_strat)
Code:
pweight: wt_strat
VCE: linearized
Single unit: missing
Strata 1: agency
SU 1: id
FPC 1: fpc_strat
Please correct me if I have gone wrong in the above steps assuming stratified random sampling. Or should strategies to account for non-response be incorporated in the above command lines?
Accounting for non-response
To account for non-response, I read about post stratification (in previous threads in Statalist, and literature) but I have data on only 3 variables across non-responders and responders. I also read that paradata can be used to account for non-response analysis (Kreuter F, Olson K. Paradata for Nonresponse Error Investigation. 2013). I have 1 paradata variable, "number of attempts to contact" specifying the number of times (maximum 3 attempts) a particular doctor was contacted to get a successful interview. But I do not know how to use this variable and in Stata or whether this variable is enough to account for bias.
Requesting your insights on the aforementioned.
Thank you!
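A minimal sketch of one common way non-response is folded into the weights (a weighting-class adjustment within strata). This assumes the full file of 2,800 sampled doctors is available with a hypothetical 0/1 indicator responded, and starts from design weights based on the sampled counts:
Code:
* Sketch only; 'responded' is a hypothetical 0/1 indicator in the full sample file.
gen double wt_base = 13400/700 if agency=="A"
replace    wt_base = 6000/1000 if agency=="B"
replace    wt_base = 4500/1100 if agency=="C"
bysort agency: egen double resp_rate = mean(responded)   // within-stratum response rate
gen double wt_nr = wt_base/resp_rate                     // non-response-adjusted weight
svyset id [pweight=wt_nr], strata(agency)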
Cross section Panel data for exam - elections across 23 countries
Hi all. I appear to be in deep, so hoping anyone can help.
I'm doing an exam in quantitative methods. For this exam I want to study the effect of the number of veto actors on election turnout, with data from 23 different countries between 1950 and 2017. The id variable is then country and the time variable is year. My dependent variable (election turnout) and independent variable (veto actors) are observed in different years across the different countries, as elections take place irregularly from country to country. Some of my control variables are measured yearly, while others are measured at regular intervals and others have missing values in some years, depending on the country. So when my x and y observations don't take place in the same year in every country, and when my control variables are measured with such variation, can I still run a meaningful regression? And what should I be aware of when I do so?
Kind regards from a rather stressed student.
Assigning Values of Variable for All Panels
Hi all,
I have tried finding a post already discussing this issue but have not found any; apologies in advance if one exists
Here is my data:
input float(SecId1 dm) double WeeklyReturn float tb3ms
4 705 .5516592499999999 2.25
4 706 .606598 2.33
4 707 -.4112093999999999 2.37
4 708 1.64722825 2.37
4 709 .33371999999999996 2.39
4 710 1.1331326000000002 2.4
4 711 .09498449999999997 2.38
4 712 -1.9586029999999999 2.35
4 713 1.67149 2.17
4 714 -1.3865355 2.1
4 715 -.34597675000000006 1.95
4 716 .21807700000000008 1.89
4 717 1.6438857500000001 1.65
4 718 -.02667074999999998 1.54
4 719 1.3116976 1.54
4 720 -1.0276205 1.52
4 721 -.9711949999999999 1.52
4 722 -3.6885015999999995 .29
4 723 2.11304575 .14
4 724 2.6366712 .13
4 725 .945435 .16
4 726 1.3224075 .13
4 727 .5792539999999999 .1
4 728 -.39374025 .11
4 729 .061644750000000026 .1
4 730 2.6141734000000003 .09
4 731 .96156325 .09
4 732 1.7286028000000002 .08
4 733 -.7715675000000002 .04
4 734 -.0966455 .03
4 735 .10510433333333336 .02
5 705 -.95295275 .
5 706 -.019057249999999915 .
5 707 -1.358661 .
5 708 1.3026425 .
5 709 .6011622499999999 .
5 710 .17128120000000008 .
5 711 -.054412 .
5 712 -2.04645475 .
5 713 1.3629257999999997 .
5 714 -.81613775 .
5 715 -.5691437500000001 .
5 716 .5598118000000001 .
5 717 1.3654247499999999 .
5 718 .3868075 .
5 719 .4214298 .
5 720 -.84145575 .
5 721 -1.54410075 .
5 722 -2.7440836 .
5 723 1.583352 .
5 724 1.9272106 .
5 725 -.7555624999999999 .
5 726 .09120900000000004 .
5 727 .8842686000000001 .
5 728 -.53119275 .
5 729 -.513367 .
5 730 2.970973 .
5 731 .19980999999999996 .
5 732 .5054336 .
5 733 .5529535 .
5 734 .6751975 .
5 735 .5922396666666666 .
SecId1 is the panel variable and dm the monthly variable.
tb3ms is the three month Treasury bill rate that I downloaded separately from the dataset and then pasted onto it. I would like to compute the excess return of each security i.e. WeeklyReturn - tb3ms.
For this, I would need to assign the same 31 values for tb3ms (one for each month) to each panel. How could this be achieved? I have tried using expand, which failed. I also tried using David Kantor's "carryforward" command, combined with tsfill, but this also failed.
I would greatly appreciate any help!
Best regards,
Maxence
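Not necessarily the only approach, but one sketch of spreading the monthly tb3ms values across panels: within each month dm, copy the non-missing value (observed here for SecId1 == 4) to every security, then compute the excess return.
Code:
* Sketch: within each month, missing values sort last, so tb3ms[1] is the
* non-missing monthly value; copy it to every panel in that month.
bysort dm (tb3ms): replace tb3ms = tb3ms[1] if missing(tb3ms)
gen double excess_ret = WeeklyReturn - tb3ms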
Putexcel with row and column name labels
I am trying to export tables from Stata to Excel using putexcel, where the row and column variables have value labels. However, after using the matlist command, the frequencies are not reported correctly.
I have used the following commands to proceed towards using putexcel to transfer the results to Excel.
As shown in the output below, the tabulate frequencies are not rendered correctly after I use the matlist code. The string values for sect and edulevel are ordered alphabetically rather than in the order of the initial table, so the frequencies are jumbled up. For instance, in the first row, for Agriculture, tabulate shows that the frequency for the primary educated is 1,383, while after the matlist command it shows the frequency for the higher educated as 1,383 instead.
I require some help regarding correct alignment of the data in the tables. Any help would be really appreciated.
Code:
. tab sect edulevel, matcell(cellcounts)
| edulevel
sect | Below Pri Primary middle | Total
----------------------+---------------------------------+----------
Agriculture | 4,213 1,383 1,545 | 8,555
Forestry and Fishing | 84 77 104 | 403
Mining | 118 64 131 | 551
Food manuf | 207 144 292 | 1,125
Textile and leather m | 350 386 618 | 2,359
Wood Manuf | 107 71 178 | 523
Media Printing and Re | 6 19 58 | 199
Chemicals Manuf | 91 82 203 | 1,078
Non-metal Manuf | 193 77 168 | 693
Basic Metal Manuf | 65 64 119 | 661
Machinery manuf | 87 91 171 | 640
Electronics Manuf | 5 7 19 | 148
Equipt Manuf | 68 102 235 | 1,197
Furniture Manuf | 59 75 150 | 432
Manuf Others | 47 77 149 | 458
Repair & install mach | 25 38 103 | 354
Elec & Gas Supply | 38 29 115 | 594
Sanitation Services | 61 44 88 | 338
Construction | 2,877 1,612 2,544 | 9,253
Civil Engineering | 219 148 199 | 940
Special Construction | 246 268 570 | 1,676
Wholesale | 180 177 420 | 1,877
Retail | 253 329 1,014 | 3,689
Land & Pipeline Trans | 456 470 1,109 | 3,495
Transportation & post | 62 46 144 | 742
Tourism | 240 208 378 | 1,418
Media, Telecom & IT | 15 10 73 | 1,290
Finance Legal & Mktg | 33 59 277 | 3,081
Real Estate | 7 3 12 | 69
Architect. & Engineer | 0 2 1 | 95
R&D | 3 2 4 | 49
Veterinary | 0 2 5 | 53
Employment | 6 3 15 | 63
Security and Building | 92 72 201 | 677
Public Admin | 181 167 617 | 4,111
Education | 188 180 421 | 6,591
Health svs | 69 55 251 | 1,843
Resid & social Worker | 48 37 75 | 444
Art & entertain. | 32 26 58 | 226
Organizations | 33 32 82 | 335
Repair and personal s | 221 114 159 | 802
Domestic Personnel | 816 335 460 | 1,966
----------------------+---------------------------------+----------
Total | 12,101 7,187 13,535 | 65,093
| edulevel
sect | secondary Higher Se Higher | Total
----------------------+---------------------------------+----------
Agriculture | 875 391 148 | 8,555
Forestry and Fishing | 62 47 29 | 403
Mining | 81 64 93 | 551
Food manuf | 195 140 147 | 1,125
Textile and leather m | 499 281 225 | 2,359
Wood Manuf | 74 46 47 | 523
Media Printing and Re | 48 17 51 | 199
Chemicals Manuf | 182 179 341 | 1,078
Non-metal Manuf | 99 68 88 | 693
Basic Metal Manuf | 135 110 168 | 661
Machinery manuf | 121 78 92 | 640
Electronics Manuf | 16 19 82 | 148
Equipt Manuf | 191 184 417 | 1,197
Furniture Manuf | 94 41 13 | 432
Manuf Others | 89 56 40 | 458
Repair & install mach | 80 45 63 | 354
Elec & Gas Supply | 106 104 202 | 594
Sanitation Services | 61 39 45 | 338
Construction | 1,395 571 254 | 9,253
Civil Engineering | 137 66 171 | 940
Special Construction | 315 159 118 | 1,676
Wholesale | 317 284 499 | 1,877
Retail | 772 658 663 | 3,689
Land & Pipeline Trans | 737 387 336 | 3,495
Transportation & post | 135 156 199 | 742
Tourism | 225 167 200 | 1,418
Media, Telecom & IT | 89 120 983 | 1,290
Finance Legal & Mktg | 303 425 1,984 | 3,081
Real Estate | 8 11 28 | 69
Architect. & Engineer | 2 2 88 | 95
R&D | 4 2 34 | 49
Veterinary | 10 7 29 | 53
Employment | 5 8 26 | 63
Security and Building | 138 98 76 | 677
Public Admin | 716 890 1,540 | 4,111
Education | 537 844 4,421 | 6,591
Health svs | 241 328 899 | 1,843
Resid & social Worker | 97 78 109 | 444
Art & entertain. | 29 27 54 | 226
Organizations | 60 46 82 | 335
Repair and personal s | 133 97 78 | 802
Domestic Personnel | 184 86 85 | 1,966
----------------------+---------------------------------+----------
Total | 9,597 7,426 15,247 | 65,093
Code:
decode sect, g(sect_s)
levelsof sect_s, local(namesec)
"Agriculture"' `"Architect. & Engineer"' `"Art & entertain."' `"Basic Metal Manuf"' `"Chemicals Manuf"' `"Civil Engineering"' `"Construction"' `"Domestic Personnel"' `"Education"' `"Elec & Gas Supply"' `"Electronics Manuf"' `"Employment"' `"Equipt Manuf"' `"Finance Legal & Mktg Corp Svs"' `"Food manuf"' `"Forestry and Fishing"' `"Furniture Manuf"' `"Health svs"' `"Land & Pipeline Transport"' `"Machinery manuf"' `"Manuf Others"' `"Media Printing and Records"' `"Media, Telecom & IT"' `"Mining"' `"Non-metal Manuf"' `"Organizations "' `"Public Admin"' `"R&D"' `"Real Estate"' `"Repair & install machinery"' "Repair and personal svs"' `"Resid & social Workers"' `"Retail"' `"Sanitation Services"' `"Security and Building svs"' `"Special Construction"' `"Textile and leather manuf"' `"Tourism"' `"Transportation & postal"' `"Veterinary"' `"Wholesale"' `"Wood Manuf"'
matrix rownames cellcounts = `namesec'
. decode edulevel, g(edu_s)
. levelsof edu_s, local(edu)
`"Below Primary"' `"Higher"' `"Higher Secondary"' `"Primary"' `"middle"' `"sec
> ondary"'
. matrix colnames cellcounts = `edu'
. matlist cellcounts
| Below P~y Higher Higher ~y Primary middle
-------------+-------------------------------------------------------
Agriculture | 4213 1383 1545 875 391
Architect.~r | 84 77 104 62 47
Art & ente~. | 118 64 131 81 64
Basic Meta~f | 207 144 292 195 140
Chemicals ~f | 350 386 618 499 281
Civil Engi~g | 107 71 178 74 46
Construction | 6 19 58 48 17
Domestic P~l | 91 82 203 182 179
Education | 193 77 168 99 68
Elec & Gas~y | 65 64 119 135 110
Electronic~f | 87 91 171 121 78
Employment | 5 7 19 16 19
Equipt Manuf | 68 102 235 191 184
Finance Le~s | 59 75 150 94 41
Food manuf | 47 77 149 89 56
Forestry a~g | 25 38 103 80 45
Furniture ~f | 38 29 115 106 104
Health svs | 61 44 88 61 39
Land & Pip~t | 2877 1612 2544 1395 571
Machinery ~f | 219 148 199 137 66
Manuf Others | 246 268 570 315 159
Media Prin~s | 180 177 420 317 284
Media, Tel~T | 253 329 1014 772 658
Mining | 456 470 1109 737 387
Non-metal ~f | 62 46 144 135 156
Organizati~s | 240 208 378 225 167
Public Admin | 15 10 73 89 120
R&D | 33 59 277 303 425
Real Estate | 7 3 12 8 11
Repair & i~y | 0 2 1 2 2
Repair and~s | 3 2 4 4 2
Resid & so~s | 0 2 5 10 7
Retail | 6 3 15 5 8
Sanitation~s | 92 72 201 138 98
Security a~s | 181 167 617 716 890
Special Co~n | 188 180 421 537 844
Textile an~f | 69 55 251 241 328
Tourism | 48 37 75 97 78
Transporta~l | 32 26 58 29 27
Veterinary | 33 32 82 60 46
Wholesale | 221 114 159 133 97
Wood Manuf | 816 335 460 184 86
| secondary
-------------+-----------
Agriculture | 148
Architect.~r | 29
Art & ente~. | 93
Basic Meta~f | 147
Chemicals ~f | 225
Civil Engi~g | 47
Construction | 51
Domestic P~l | 341
Education | 88
Elec & Gas~y | 168
Electronic~f | 92
Employment | 82
Equipt Manuf | 417
Finance Le~s | 13
Food manuf | 40
Forestry a~g | 63
Furniture ~f | 202
Health svs | 45
Land & Pip~t | 254
Machinery ~f | 171
Manuf Others | 118
Media Prin~s | 499
Media, Tel~T | 663
Mining | 336
Non-metal ~f | 199
Organizati~s | 200
Public Admin | 983
R&D | 1984
Real Estate | 28
Repair & i~y | 88
Repair and~s | 34
Resid & so~s | 29
Retail | 26
Sanitation~s | 76
Security a~s | 1540
Special Co~n | 4421
Textile an~f | 899
Tourism | 109
Transporta~l | 54
Veterinary | 82
Wholesale | 78
Wood Manuf | 85
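One possible way to keep the names aligned with the matcell() ordering (a sketch, assuming sect and edulevel are value-labeled numeric variables): build the name lists from the numeric codes, whose order matches the tabulate rows, rather than from the alphabetically sorted decoded strings.
Code:
* Sketch: loop over the numeric codes (tabulate's row order) and pull each label.
levelsof sect, local(sectcodes)
local rnames
foreach c of local sectcodes {
    local lab : label (sect) `c'
    local rnames `"`rnames' `"`lab'"'"'
}
matrix rownames cellcounts = `rnames'

levelsof edulevel, local(educodes)
local cnames
foreach c of local educodes {
    local lab : label (edulevel) `c'
    local cnames `"`cnames' `"`lab'"'"'
}
matrix colnames cellcounts = `cnames'
matlist cellcounts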
"No observations" problem
Hello,
I have a problem when generating new variables out of existing one. I have a variable "pclass" having 1310 observations where some have value 1, some value 2 and some value 3.
My task was to generate pclass1, pclass2 and pclass3. But when I did, the output was "no observations"
Does anybody know how they would solve this?
Thanks
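For reference, a minimal sketch of two standard ways such indicators are usually created (assuming pclass is numeric with values 1, 2 and 3, and that any earlier attempts at pclass1-pclass3 have been dropped):
Code:
* Option 1: let tabulate create the indicators pclass1, pclass2, pclass3.
tabulate pclass, generate(pclass)
* Option 2 (an alternative to option 1): create an indicator by hand,
* keeping observations with missing pclass as missing.
generate byte pclass1 = (pclass == 1) if !missing(pclass)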
Analysis of companies' accounts
I am trying to analyse the accounts of all Danish companies for a number of years. Has anyone made and published an accounting analysis for their own country, and do you use open-source do-files for the accounting work? As the IFRS 9 standards are broadly the same, it is probably the same kind of system you are gathering information from, and you may have some ideas too. Just contact me if you have suggestions or links to articles and open sources. Thanks in advance.
Defining varlists for temporary variables
I'm stumped on how to create a varlist vt that contains the temporary variables that are defined in the following code.
The foreach loop I use at the bottom accomplishes this but I'm assuming there must be a one-line local definition using a * wildcard. However I can't seem to get the single/double quote nesting correct to accomplish this. (The real application has lots of temporary variables, thus the desire to use a wildcard.)
Can anyone advise? I'm happy to use the foreach loop but this seems unnecessarily clunky. Also in the real example the variables aren't indexed by a simple `j' thus the desire to use y* as the varlist in my current foreach loop.
Thanks in advance.
Code:
`_ty1' `_ty2' `_ty3'
Code:
cap preserve
cap drop _all
set obs 100
forval j=1/3 {
tempvar _ty`j'
gen y`j'=10*uniform()
gen `_ty`j''=floor(y`j')
}
local vt=" "
foreach y of varlist y* {
local vt="`vt'"+"`_t`y'' "
}
sum `vt'
cap restore
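A wildcard is awkward here because temporary variables receive internal names like __000000 rather than _ty1. One sketch that avoids the wildcard altogether is to accumulate the tempvar names in the local at the moment they are created:
Code:
* Sketch: build the varlist while creating the tempvars, so no wildcard is needed.
clear
set obs 100
local vt
forvalues j = 1/3 {
    tempvar ty`j'
    generate double `ty`j'' = floor(10*runiform())
    local vt `vt' `ty`j''
}
summarize `vt'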
ivreghdfe runs a long time after convergence
Hello everyone.
I am performing an IV-Regression with multiple fixed effects. My dataset contains about 20 million observations, with a size of around 3gb.
For this I am running the code:
Code:
ivreghdfe dlunit_price (dlquantity = dlduty_percent), absorb(ct ht cs) cluster(hs6 iso2)
In general my code works and the regression converges after a short amount of time (5-10 minutes). But after it shows that the regression converged, it takes a long time to display the result. If I use a bigger dataset, it doesn't display the result even after 12 hours.
So my question is, why does it take so long to show the results? Or which processes happen after the convergence?
I also checked my computer's performance while running the regression, but it does not seem to reach its limit.
Power calculation Cox regression one cohort multiple vars
Hi there!
I am doing a cohort study (1 group) with cancer patients for which I made a Cox model with 6 significant predictor variables (a continuous variable of interest and 5 category variables that are known predictors of death) to predict survival (death/alive).
In total I have n=795 patients of which only 35% had the event (death) and follow-up time differed for patients between 1 and 4 years.
I have been asked to explain whether the study is adequately powered, given N=795, relatively few events, and 6 predictor variables, some of which have multiple levels (e.g. the variable "stage" has values of 1, 2 and 3).
At the moment, I can only find examples with binary variables of interest or comparisons of survival between 2 groups. Given that I have only 1 group, I am struggling with what analysis to do. I have tried the power cox command, but am not sure what values to put in (e.g. how to interpret the SD of the tested covariate properties).
Could someone help me in the right direction?
Kind regards Jessy
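For what it's worth, a rough sketch of how a power cox call for the continuous covariate might look, assuming the standard power cox options: hratio() gives the effect size for a one-SD change, sd() the covariate's standard deviation, failprob() the overall event probability (roughly 0.35 here), and r2() the squared correlation of the covariate with the other five predictors. All numbers below are placeholders, not values from the study.
Code:
* Placeholder values only; replace with estimates from the data or the literature.
power cox, hratio(1.3) sd(1) failprob(0.35) r2(0.2) n(795)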
Variable gets omitted
Dear Forum,
I have the following command:
Code:
reg birthrate ibn.incomegroup, noconstant
Now I want to add an interaction to the regression:
Code:
reg birthrate c.GDP##ibn.incomegroup, noconstant
However, for one group of the independent variable "income group" I always get the result omitted.
Am I doing something wrong? Is there a way to show all the results?
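For reference, the omission is likely a collinearity issue: c.GDP##ibn.incomegroup expands to c.GDP plus the group indicators plus the group-specific GDP terms, and the common c.GDP term is the sum of the group-specific ones, so with noconstant Stata drops one of them. A sketch of an alternative parameterization that reports a separate intercept and GDP slope for every income group:
Code:
* Sketch: cell-means style parameterization, one intercept and one GDP slope per group.
reg birthrate ibn.incomegroup ibn.incomegroup#c.GDP, noconstant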
Identifying the predictors of cataract progression
Dear Researchers,
I have asked this question before, but I did not get an answer, perhaps because the way I asked the question was not clear, so I will ask it in another way.
I have three questions, and I need you kindly to answer the ones that you know, please.
I have cross-sectional data for the Diabetes group. I am trying to test a model that predicts a specific complication (i.e., dependent variable) by other health complications (i.e., independent variables).
Where:
The dependent variable is called Cataract and it is an ordinal variable.
The independent variables are nominal, ordinal, and scale variables.
So, I think the code at first will be:
Code:
ge id= _n
encode patient, gen(PATIENT)
The first question:
I have an independent variable that consists of two groups in the same column, where the first group is named type 1 and the second group is named type 2. In this case, I need to test whether there are significant differences between the two groups or not, so I think I should create another dummy variable coded 1 for the first group and 2 for the second group, and then use the independent-samples t-test. Am I correct?
The second question:
Let us assume that the model I want to use is the following, and please remember that my main aim is to predict a specific complication, "Cataract". So, the model is:
Cataract = Gender + Age + Diabetes duration
where Cataract is an ordinal variable,
Gender is a nominal variable,
Age is ordinal, and
Diabetes duration is scale.
So, I think that I should use ordinal logistic regression, and the code for this will be:
Code:
ologit Cataract Gender Age Diabetes_duration, r
But I need to see the relationship between each of the independent variables and the dependent variable, for instance the effect of Gender on all levels of Cataract, so I think the code will be:
Code:
margins, dydx(Gender)
marginsplot
So could you please correct me if any of the above codes are not correct?
The third question:
I am very interested in finding out whether the components of gender, I mean male and female, have different predictions for the dependent variable or not, so could you please tell me what the code would be to do that?
Thanks very much in advance.
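On the third question, a rough sketch of one way such a comparison is often made (assuming Gender is a value-labeled numeric variable, so it can be used as a factor variable): after the ordered logit, compare predicted probabilities of each Cataract level between males and females.
Code:
* Sketch only; outcome(#1) refers to the first level of Cataract, repeat for the others.
ologit Cataract i.Gender Age Diabetes_duration, vce(robust)
margins Gender, predict(outcome(#1))
marginsplot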
Drop embedded observations
Hello,
I have a household panel dataset with some observations being fully captured by other observations. Households can split off over time and are then reassigned their original household ID. Thus, a household ID can be assigned to several observations (where household ID is only the wave specific ID). I reshaped the panel and kept only the relevant variables which I hope illustrates the case.
For instance, obs 3, 6 or 10 are redundant. Ideally I would like to keep observations that are included in others but only as split-offs, though. For instance, obs 53 is captured by obs 51 and 52, but both of them are split-offs in the third period, for which obs 53 does not have information. In such cases it might make more sense to keep it as an "original" household.
The "rule" derives from the examples: Observations should be dropped if 1) their sequence of hhids appears in another observation, 2) (ideally but I do not know how feasible it is) this other observation is an original household in the period following the one for which the to be dropped observation does not contain a value anymore.
I tried to use collapse, to create duplicates or to count missing values, but none really works. I would appreciate any suggestion. And I already contacted the data provider about the panel composition, but did not receive any answer.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input double obs str16(hhid1 hhid2) double split_off2 str16 hhid3 double split_off3 str16 hhid4 double split_off4 1 "01010140020171" "0101014002017101" 1 "0001-001" 1 "0001-001" 1 2 "01010140020171" "0101014002017101" 1 "0001-001" 1 "0001-004" 2 3 "01010140020171" "0101014002017101" 1 "" . "" . 4 "01010140020284" "0101014002028401" 1 "0002-001" 1 "0002-001" 1 5 "01010140020297" "0101014002029701" 1 "0003-001" 1 "0003-001" 1 6 "01010140020297" "0101014002029701" 1 "" . "" . 7 "01010140020297" "0101014002029704" 2 "" . "" . 8 "01010140020409" "0101014002040901" 1 "0005-001" 1 "0005-001" 1 9 "01010140020471" "0101014002047101" 1 "0006-001" 1 "" . 10 "01010140020471" "" . "" . "" . 11 "01010140020551" "0101014002055101" 1 "0007-001" 1 "0007-001" 1 12 "01010140020761" "0101014002076101" 1 "0008-001" 1 "0008-001" 1 13 "01010140020762" "0101014002076201" 1 "0009-001" 1 "0009-001" 1 14 "01020030030004" "0102003003000401" 1 "0010-001" 1 "0010-001" 1 15 "01020030030022" "0102003003002201" 1 "0011-001" 1 "0012-001" 1 16 "01020030030022" "0102003003002201" 1 "0011-001" 1 "0012-003" 2 17 "01020030030022" "0102003003002201" 1 "0011-004" 2 "" . 18 "01020030030140" "0102003003014001" 1 "0012-001" 1 "0013-001" 1 19 "01020030030161" "0102003003016101" 1 "0013-001" 1 "0014-001" 1 20 "01020030030174" "0102003003017401" 1 "0014-001" 1 "0015-001" 1 21 "01020030030174" "0102003003017407" 2 "0015-001" 1 "0017-001" 1 22 "01020030030200" "0102003003020001" 1 "0016-001" 1 "0018-001" 1 23 "01020030030430" "0102003003043001" 1 "0017-001" 1 "0019-001" 1 24 "01020030030430" "0102003003043001" 1 "" . "" . 25 "01020030030479" "0102003003047901" 1 "0018-001" 1 "0020-001" 1 26 "01020170030001" "0102017003000101" 1 "0019-001" 1 "" . 27 "01020170030001" "0102017003000101" 1 "0019-003" 2 "" . 28 "01020170030001" "0102017003000104" 2 "0020-001" 1 "" . 29 "01020170030017" "0102017003001701" 1 "0021-001" 1 "" . 30 "01020170030022" "0102017003002201" 1 "0022-001" 1 "" . 31 "01020170030022" "0102017003002201" 1 "" . "" . 32 "01020170030048" "0102017003004801" 1 "0023-001" 1 "" . 33 "01020170030100" "0102017003010001" 1 "0024-001" 1 "" . 34 "01020170030209" "0102017003020901" 2 "0025-001" 1 "" . 35 "01020170030209" "" . "0025-001" 1 "" . 36 "01020170030241" "0102017003024101" 1 "0026-001" 1 "" . 37 "01020170030241" "0102017003024101" 1 "" . "" . 38 "01020170030246" "0102017003024601" 1 "0027-001" 1 "" . 39 "01030130040161" "0103013004016101" 1 "0028-001" 1 "" . 40 "01030130040219" "0103013004021901" 1 "0029-001" 1 "" . 41 "01030130040259" "0103013004025901" 1 "0030-001" 1 "" . 42 "01030130040346" "0103013004034601" 1 "0031-001" 1 "" . 43 "01030130040468" "0103013004046801" 1 "0032-001" 1 "" . 44 "01030130040685" "0103013004068501" 1 "0033-001" 1 "" . 45 "01030130040739" "0103013004073901" 1 "0034-001" 1 "" . 46 "01030130040739" "0103013004073901" 1 "0034-003" 2 "" . 47 "01030130040739" "0103013004073901" 1 "" . "" . 48 "01030130040745" "0103013004074501" 1 "0035-001" 1 "" . 49 "01030133010068" "0103013301006801" 1 "0036-001" 1 "" . 50 "01030133010092" "0103013301009201" 1 "0037-001" 1 "" . 51 "01030133010175" "0103013301017501" 1 "0038-001" 2 "" . 52 "01030133010175" "0103013301017501" 1 "0038-002" 2 "" . 53 "01030133010175" "0103013301017501" 1 "" . "" . 54 "01030133010188" "0103013301018801" 1 "0039-001" 1 "" . 55 "01030133010188" "0103013301018801" 1 "0039-004" 2 "" . 56 "01030133010188" "0103013301018801" 1 "" . "" . 
57 "01030133010188" "0103013301018803" 2 "0040-001" 1 "" . 58 "01030133010300" "0103013301030001" 1 "0041-002" 1 "" . 59 "01030133010300" "0103013301030001" 1 "0041-006" 2 "" . 60 "01030133010300" "0103013301030001" 1 "" . "" . 61 "01030133010322" "0103013301032201" 1 "0042-001" 1 "" . 62 "01030133010411" "0103013301041101" 1 "0043-001" 1 "" . 63 "01030133010411" "0103013301041101" 1 "0043-002" 2 "" . 64 "01030133010411" "0103013301041102" 2 "0044-001" 1 "" . 65 "01030133010652" "0103013301065201" 1 "0045-001" 1 "" . 66 "01040173040004" "0104017304000401" 1 "0046-001" 1 "" . 67 "01040173040004" "0104017304000401" 1 "0046-002" 2 "" . 68 "01040173040004" "0104017304000401" 1 "" . "" . 69 "01040173040017" "0104017304001701" 1 "0047-001" 1 "" . 70 "01040173040022" "0104017304002201" 1 "0048-001" 1 "" . 71 "01040173040022" "0104017304002201" 1 "0048-002" 2 "" . 72 "01040173040022" "" . "0048-001" 1 "" . 73 "01040173040034" "0104017304003401" 1 "0049-001" 1 "" . 74 "01040173040034" "0104017304003406" 2 "" . "" . 75 "01040173040034" "0104017304003407" 2 "0051-002" 2 "" . 76 "01040173040041" "0104017304004102" 2 "0052-001" 1 "" . 77 "01040173040041" "" . "" . "" . 78 "01040173040086" "0104017304008601" 1 "0053-001" 1 "" . 79 "01040173040086" "" . "0053-001" 1 "" . 80 "01040173040092" "0104017304009201" 1 "0054-001" 1 "" . 81 "01040173040094" "0104017304009401" 1 "0055-001" 1 "" . 82 "01040173040094" "0104017304009402" 2 "0056-001" 1 "" . 83 "01040310010030" "0104031001003001" 1 "0057-001" 1 "" . 84 "01040310010102" "0104031001010201" 1 "0058-001" 1 "" . 85 "01040310010174" "0104031001017402" 1 "0059-001" 1 "" . 86 "01040310010174" "0104031001017402" 1 "0059-002" 2 "" . 87 "01040310010174" "0104031001017403" 2 "0060-001" 1 "" . 88 "01040310010174" "" . "" . "" . 89 "01040310010180" "0104031001018001" 1 "0061-001" 1 "" . 90 "01040310010462" "0104031001046201" 1 "0062-001" 1 "" . 91 "01040310010482" "0104031001048201" 1 "0063-001" 1 "" . 92 "01040310010482" "" . "" . "" . 93 "01040310010745" "0104031001074501" 1 "0064-001" 1 "" . 94 "01040310010745" "0104031001074502" 2 "0065-001" 1 "" . 95 "01040310010745" "" . "0064-001" 1 "" . 96 "01040310011128" "0104031001112801" 1 "0066-001" 1 "" . 97 "01040310011128" "0104031001112801" 1 "" . "" . 98 "01040310011128" "0104031001112804" 2 "0067-001" 1 "" . 99 "01040380030347" "0104038003034701" 1 "0068-001" 1 "" . 100 "01040380030396" "0104038003039601" 1 "0069-001" 1 "" . end label values split_off2 ha_10 label values split_off3 ha_10 label values split_off4 ha_10 label def ha_10 1 "ORIGINAL HOUSEHOLD", modify label def ha_10 2 "SPLIT-OFF HOUSEHOLD", modify
Clarification on merge command
I am trying to merge two datasets: one dta file consists of 8,576 unique observations, and the other consists of 1,187 unique ids and 1,220 observations. When I merge using the merge 1:m command, the result is 1,189 matched, with 7,415 from the master and 31 from the using file not matched. What I make out is that 1,189 and 31 make up the 1,220 observations, so the remaining observations should be 8,576 - 1,220 = 7,356. But the unmatched count from the master is 7,415, which is 59 more. How should I read this? It makes my observations more than the actual number of cases surveyed. Please guide. A dataex example of the merged file is attached herewith.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input double id long u_id float umpce_class byte(sector district) float(gender age_group est_pop blk_id age weight) byte(blk6_q4 blk6_q5 blk6_q6) long(blk7_q11 blk7_q14) float(eblk7_q11 eblk7_q14) byte blk7_q16 float ail_cat byte(x _merge) 57100110101 571001101 1 2 3 1 4 15.225 . . . . . . . . . . . . . 1 57100110102 571001101 1 2 3 2 3 15.225 1 27 15.225 87 1 1 1070 1500 16290.75 22837.5 1 12 1 3 57100110103 571001101 1 2 3 1 1 15.225 . . . . . . . . . . . . . 1 57100110104 571001101 1 2 3 2 1 15.225 . . . . . . . . . . . . . 1 57100110201 571001102 1 2 3 1 3 15.225 . . . . . . . . . . . . . 1 end label values _merge _merge label def _merge 1 "master only (1)", modify label def _merge 3 "matched (3)", modify
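For what it's worth, a sketch of checks that are often used to reconcile such counts (file names below are placeholders): with merge 1:m, the matched count refers to observations in the result, not to unique master ids, so it helps to count how many distinct ids actually matched.
Code:
* Placeholder file names; the key is assumed to be id.
use master_file, clear
duplicates report id                 // id should be unique in the master for a 1:m merge
merge 1:m id using using_file
tab _merge
bysort id: gen byte first = (_n == 1)
count if _merge == 3 & first         // distinct master ids that found a match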
How to analyze data for repeated measures?
Hi everyone
Could you please share your recommendations, regarding the questions at the end of this post?
Research objectives:
1. What is the association between tourists' and residents' aesthetic experiences and destination aesthetic features?
Note: Aesthetic experiences are categorized into 6 types of experiences. For example, the experience of the beautiful and experience of the ugly and 4 more types of experiences.
Destination aesthetic features: a distinguished 7-point Likert semantic differential scale with 18 items. For example item 1 reads like:
I would say that the place was............ not crowded 1 2 3 4 5 6 7 crowded.
2. How often the six types of aesthetic experiences occur? (7 points Likert scale for frequency)
Variables:
Dependent variables:
1. Comprehensive descriptions of six types of aesthetic experiences. For example, the experience of the beautiful reads like "You feel you are lucky that you have the chance to enjoy and acknowledge the appealing moment of experiencing the beauty. You feel thankful, fascinated, happy, and very pleased......"
2. The frequency of occurrence of the experiences
Independent variables:
1. Residents' district of living in the destination (5 districts) / Tourists' city of residence (9 cities)
2. Tourists' and residents' demographic profile (Age, Gender, Education)
3. Tourists' length of stay at the destination during their current travel and residents' length of residency at the same destination
4. Travel frequency of tourists and residents during last year
5. Tourists' purpose of the trip (leisure, business, visiting friends and family)
6. The individual's evaluation of the Destination Aesthetic Features
I have the following research design.
Multilevel analysis: experiences are nested in individuals
Repeated measure mixed model design
Level 1: repeated measurement of the association between aesthetic experiences and destination aesthetic features
Level 2: tourists and residents
Study setting: A specific city (tourism destination)
Sample: Two groups of people (300 Tourists who travel to that specific city and 300 Residents who live in that city)
Repeated Measurement: A semantic differential scale of 18 items (18 features of a city that may make the city to be perceived as beautiful or ugly)
At the occurrence of 6 types of a specific kind of experience (e.g., the experience of beauty, the experience of ugliness, ...)
Time: Note 1: A cross-sectional survey is currently being distributed among the target population (at a single point in time).
Note 2: Participants will answer the survey considering the occurrence of the 6 types of experiences during a specific period of time. This means the residents will consider the occurrence of those experiences during the time they have been residing in that city (e.g., some years). By the same token, tourists will consider it during the time they have been staying in the city on their current travel (e.g., some days).
Subject factor: Both within-subject factor and between-subject factor
Note A: As no study has been conducted to link the above-mentioned features of a city to those specific types of experiences, our study is exploratory and does not pose hypotheses.
Note B: The repeated measures are to be compared among the mentioned six experiences, following the principles of a within-subject design.
Note C: A cross-level interaction term (group × features of the city) will also be entered and estimated in order to compare the evaluations between tourists and residents. Please see an example of my dataset here:
Note D: Descriptions of 6 experiences and the frequency of occurrence of those experiences are dependent variables and other variables are independent.
May I sincerely ask your recommendations on:
1) How to analyze the data?
2) How should I conduct power analysis for calculating the sample size? For now, I considered collecting data from 300 tourists and 300 residents but I am not sure whether it is necessary to recruit overall 600 people or not. (Note: Some people may experience all 6 types of experiences, some may have 5 to 2 experiences and few people may have only 1 experience)
3) How should I treat missing data?
Many thanks in advance,
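Purely as an illustration of the nesting described above (all variable names are hypothetical), a two-level model of this kind is often written along these lines in Stata, with the 18 feature ratings nested in persons and a group-by-feature cross-level interaction:
Code:
* Hypothetical variable names: rating (1-7 item response), feature (1-18 item id),
* group (tourist vs resident), person_id (respondent identifier).
mixed rating i.group##i.feature age i.gender i.education || person_id:, reml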