I am working on counts data and my study is to compare performance of several models that deal with such data (e.g., Poisson, NB, Hurdle, Zero-inflated models). The study's dependent variable is out-patient visits (visit_op) which is overdispersed and the independent vars are age, age square, sex, area, marital status, employment status, education, social health insurance, and log of household income.
Based on in-sample tests (mpe, rmse, and mape) and information criteria tests (log likelihood, AIC, and BIC), hurdle NB2 is the best model. Next, I utilize K-fold cross-validation to test those models' performance (out of sample test). In this out of sample test, the data is split into 10 parts (almost identical one another), among which 9 parts are used as training part and the last one is reserved as validation one. My coding procedures are as:
Data example
Code:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input float(visit_op biny pid06 age agesqr) byte sex float area byte(mstt insur work) float(edu lnhhinc) 1 1 8 69 4761 0 1 0 1 0 9 10.09823 12 1 133 61 3721 1 1 0 1 0 12 10.740648 1 1 150 75 5625 0 0 0 1 0 2 9.877502 4 1 164 62 3844 0 0 0 0 1 0 10.462332 10 1 166 67 4489 0 0 0 0 0 5 8.547916 1 1 174 63 3969 0 1 0 1 0 12 10.165852 3 1 180 77 5929 1 0 1 0 0 0 8.318742 10 1 188 64 4096 0 0 0 1 0 12 11.261897 2 1 27 86 7396 1 1 1 0 0 0 10.515966 3 1 31 74 5476 0 1 0 1 0 12 10.217276 3 1 21 66 4356 0 1 0 1 1 12 10.959888 1 1 51 75 5625 0 1 0 1 0 9 9.517825 1 1 57 68 4624 1 1 1 0 1 5 10.910277 13 1 49 63 3969 1 1 0 1 0 9 10.299171 15 1 61 63 3969 0 1 0 1 0 12 11.48329 0 0 64 64 4096 1 1 0 1 0 5 11.15545 1 1 69 68 4624 0 1 0 1 0 12 11.34687 1 1 59 69 4761 0 1 0 1 0 12 11.715866 2 1 92 76 5776 1 1 0 1 0 10 10.09823 20 1 96 64 4096 1 1 1 1 0 4 11.252858 2 1 85 66 4356 0 1 0 1 0 12 11.46772 1 1 88 80 6400 1 1 1 1 0 3 10.113749 1 1 114 61 3721 1 1 1 0 0 7 11.626254 2 1 117 82 6724 1 1 0 1 0 12 10.27505 1 1 102 84 7056 1 1 1 1 0 7 11.357675 5 1 118 71 5041 0 1 1 1 1 12 12.115185 1 1 122 83 6889 1 1 0 1 0 5 10.599132 1 1 106 62 3844 1 1 0 1 0 12 10.767643 3 1 288 68 4624 0 0 1 1 0 7 9.700943 10 1 309 78 6084 1 0 0 0 0 1 10.234624 end label values sex sex label def sex 0 "Male", modify label def sex 1 "Female", modify label values area area label def area 0 "Rural", modify label def area 1 "urban", modify label values mstt mstt label def mstt 0 "Married", modify label def mstt 1 "Single", modify label values insur insur label def insur 0 "No", modify label def insur 1 "Yes", modify label values work work label def work 0 "No", modify label def work 1 "Yes", modify
Code:
global v=10 xtile idq = pid06, nq($v) *** Hurdle NB2 gen p1 = 0 gen yfv = 0 forvalues i=1/$v { qui { logit biny $X if idq~=`i', vce(robust) predict plogit replace p1 = plogit drop plogit } qui { tnbreg visit_op $X if visit_op>0 & idq~=`i', vce(robust) predict p2 gen hnb2 = p1*p2 replace yfv = hnb2 if idq==`i' gen pe = y - yfv sum pe sca hnb2mpe = r(mean) gen spe = (y-yfv)^2 sum spe sca hnb2rmse = sqrt(r(mean)) gen ape = abs(y-yfv) sum ape sca hnb2mape = r(mean) drop p2 hnb2 pe spe ape } } drop p1 yfv
Code:
gen yfv = 0 forvalues i=1/$v { qui { nbreg visit_op $X if idq~=`i', vce(robust) predict yhatnb2 replace yfv = yhatnb2 if idq==`i' gen pe = y - yfv sum pe sca nb2mpe = r(mean) gen spe = (y-yfv)^2 sum spe sca nb2rmse = sqrt(r(mean)) gen ape = abs(y-yfv) sum ape sca nb2mape = r(mean) drop yhatnb2 pe spe ape } } drop yfv
Code:
*** Poisson gen yfv = 0 forvalues i=1/$v { qui { poisson visit_op $X if idq~=`i', vce(robust) predict pyhat replace yfv = pyhat if idq==`i' gen pe = y - yfv sum pe sca pmpe = r(mean) gen spe = (y-yfv)^2 sum spe sca prmse = sqrt(r(mean)) gen ape = abs(y-yfv) sum ape sca pmape = r(mean) drop pyhat pe spe ape } } drop yfv
Thank you.
Best regards,
DL
0 Response to Out of sample cross-validation for counts data, am I correct?
Post a Comment