I am working on counts data and my study is to compare performance of several models that deal with such data (e.g., Poisson, NB, Hurdle, Zero-inflated models). The study's dependent variable is out-patient visits (visit_op) which is overdispersed and the independent vars are age, age square, sex, area, marital status, employment status, education, social health insurance, and log of household income.
Based on in-sample tests (mpe, rmse, and mape) and information criteria tests (log likelihood, AIC, and BIC), hurdle NB2 is the best model. Next, I utilize K-fold cross-validation to test those models' performance (out of sample test). In this out of sample test, the data is split into 10 parts (almost identical one another), among which 9 parts are used as training part and the last one is reserved as validation one. My coding procedures are as:
Data example
Code:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input float(visit_op biny pid06 age agesqr) byte sex float area byte(mstt insur work) float(edu lnhhinc) 1 1 8 69 4761 0 1 0 1 0 9 10.09823 12 1 133 61 3721 1 1 0 1 0 12 10.740648 1 1 150 75 5625 0 0 0 1 0 2 9.877502 4 1 164 62 3844 0 0 0 0 1 0 10.462332 10 1 166 67 4489 0 0 0 0 0 5 8.547916 1 1 174 63 3969 0 1 0 1 0 12 10.165852 3 1 180 77 5929 1 0 1 0 0 0 8.318742 10 1 188 64 4096 0 0 0 1 0 12 11.261897 2 1 27 86 7396 1 1 1 0 0 0 10.515966 3 1 31 74 5476 0 1 0 1 0 12 10.217276 3 1 21 66 4356 0 1 0 1 1 12 10.959888 1 1 51 75 5625 0 1 0 1 0 9 9.517825 1 1 57 68 4624 1 1 1 0 1 5 10.910277 13 1 49 63 3969 1 1 0 1 0 9 10.299171 15 1 61 63 3969 0 1 0 1 0 12 11.48329 0 0 64 64 4096 1 1 0 1 0 5 11.15545 1 1 69 68 4624 0 1 0 1 0 12 11.34687 1 1 59 69 4761 0 1 0 1 0 12 11.715866 2 1 92 76 5776 1 1 0 1 0 10 10.09823 20 1 96 64 4096 1 1 1 1 0 4 11.252858 2 1 85 66 4356 0 1 0 1 0 12 11.46772 1 1 88 80 6400 1 1 1 1 0 3 10.113749 1 1 114 61 3721 1 1 1 0 0 7 11.626254 2 1 117 82 6724 1 1 0 1 0 12 10.27505 1 1 102 84 7056 1 1 1 1 0 7 11.357675 5 1 118 71 5041 0 1 1 1 1 12 12.115185 1 1 122 83 6889 1 1 0 1 0 5 10.599132 1 1 106 62 3844 1 1 0 1 0 12 10.767643 3 1 288 68 4624 0 0 1 1 0 7 9.700943 10 1 309 78 6084 1 0 0 0 0 1 10.234624 end label values sex sex label def sex 0 "Male", modify label def sex 1 "Female", modify label values area area label def area 0 "Rural", modify label def area 1 "urban", modify label values mstt mstt label def mstt 0 "Married", modify label def mstt 1 "Single", modify label values insur insur label def insur 0 "No", modify label def insur 1 "Yes", modify label values work work label def work 0 "No", modify label def work 1 "Yes", modify
Code:
global v=10
xtile idq = pid06, nq($v)
*** Hurdle NB2
gen p1 = 0
gen yfv = 0
forvalues i=1/$v {
qui {
logit biny $X if idq~=`i', vce(robust)
predict plogit
replace p1 = plogit
drop plogit
}
qui {
tnbreg visit_op $X if visit_op>0 & idq~=`i', vce(robust)
predict p2
gen hnb2 = p1*p2
replace yfv = hnb2 if idq==`i'
gen pe = y - yfv
sum pe
sca hnb2mpe = r(mean)
gen spe = (y-yfv)^2
sum spe
sca hnb2rmse = sqrt(r(mean))
gen ape = abs(y-yfv)
sum ape
sca hnb2mape = r(mean)
drop p2 hnb2 pe spe ape
}
}
drop p1 yfv
Code:
gen yfv = 0
forvalues i=1/$v {
qui {
nbreg visit_op $X if idq~=`i', vce(robust)
predict yhatnb2
replace yfv = yhatnb2 if idq==`i'
gen pe = y - yfv
sum pe
sca nb2mpe = r(mean)
gen spe = (y-yfv)^2
sum spe
sca nb2rmse = sqrt(r(mean))
gen ape = abs(y-yfv)
sum ape
sca nb2mape = r(mean)
drop yhatnb2 pe spe ape
}
}
drop yfv
Code:
*** Poisson
gen yfv = 0
forvalues i=1/$v {
qui {
poisson visit_op $X if idq~=`i', vce(robust)
predict pyhat
replace yfv = pyhat if idq==`i'
gen pe = y - yfv
sum pe
sca pmpe = r(mean)
gen spe = (y-yfv)^2
sum spe
sca prmse = sqrt(r(mean))
gen ape = abs(y-yfv)
sum ape
sca pmape = r(mean)
drop pyhat pe spe ape
}
}
drop yfv
Thank you.
Best regards,
DL
0 Response to Out of sample cross-validation for counts data, am I correct?
Post a Comment