I am trying to teach myself survival analysis in Stata, and at the moment I am following Prof. Jenkins's materials, available at https://www.iser.essex.ac.uk/resourc...sis-with-stata. Apologies if my questions are obvious. I am focusing on discrete-time models for now. Consider the code from Lesson 6 (ex6_1.do). The data are in person-month format, but I am confused about what role id plays in the estimation, since the id variable does not appear in the estimation commands themselves:
(I am skipping a few lines of code relating to alternative functional forms for the baseline hazard function)
Code:
************** Estimation: (i) discrete time models *********
************* Cancer data **********
use cancer, clear
ge id = _n
lab var id "subject identifier"
* drug = 1 (placebo); drug =2,3 (receives drug)
ta drug died
recode drug 1=0 2/3=1
lab var drug "receives drug?"
lab def drug 0 "placebo" 1 "drug"
lab val drug drug
ta drug

************************************
* Episode-splitting --> data in person-month format
expand studytim
bysort id: ge j = _n
* spell month identifier, by subject
lab var j "spell month"
bysort id: ge dead = died==1 & _n==_N
lab var dead "binary depvar for discrete hazard model"

* We don't have to -stset- the data for estimation, but might as
* well -- it emphasises parallels with continuous time models
* esp. when there are TVCs.
stset j, failure(dead) id(id)

**********************************************************
****************** CLOGLOG HAZARD MODELS *****************
* Compare model estimated with different baseline hazard specifications.
* Use -predict- to derive estimate of predicted hazard and survivor function
* and thence median duration. First use within-sample info.

* log(j) baseline [and 'or' option; logit versus logistic]
* lnj is created in the lines skipped above (assumed definition: log of j)
ge lnj = ln(j)
* cloglog = glm, f(b) l(c). Can also use -glm- .
* See help -glm- and note glm Deviance = -2*LogL from cloglog
glm dead drug age lnj, f(b) l(c)
glm, eform
cloglog dead drug age lnj, nolog
predict h, p
cloglog, eform // replay results, but this time with hazard ratio
Code:
Complementary log-log regression                Number of obs     =        744
                                                Zero outcomes     =        713
                                                Nonzero outcomes  =         31

                                                LR chi2(3)        =      35.20
Log likelihood = -111.26371                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
        dead |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |   .1120209   .0460504    -5.33   0.000     .0500473    .2507365
         age |   1.126762   .0418758     3.21   0.001     1.047605      1.2119
         lnj |   1.896999    .465617     2.61   0.009     1.172574    3.068979
       _cons |   .0000488   .0001108    -4.37   0.000     5.67e-07    .0041954
------------------------------------------------------------------------------

. end of do-file

. count
  744
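To spell out what I mean about id: as far as I can tell, it only determines how the person-month rows are built (j and dead are generated within id), and it is never a regressor. A minimal sketch of the checks I would run on that structure is below. It assumes that the cancer.dta shipped with Stata (loaded with -sysuse cancer-) is the same file as the one in the course materials, which I have not verified.

```stata
* Rebuild the person-month data as in the excerpt above.
* Assumption: -sysuse cancer- matches the course's cancer.dta.
sysuse cancer, clear
generate id = _n
expand studytime
bysort id: generate j = _n
bysort id: generate dead = died==1 & _n==_N

* id defines which rows belong to the same subject:
bysort id (j): assert _N == studytime       // one row per month at risk
bysort id (j): assert j == _n               // months numbered 1, 2, ... within subject
assert dead == 0 if died == 0               // censored subjects never fail
bysort id (j): assert dead == 0 if _n < _N  // a death can only sit on the last row

* Number of failure rows = number of subjects who died
count if dead == 1
```

If the data match, the final count should agree with the "Nonzero outcomes" line in the output above.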
Second, for this specific example, what would be the disadvantages of using a probit model on the same data structure instead, i.e. with each person-month treated as a separate observation and time entered as an independent variable, just as in the example above?
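To make the comparison concrete, this is the kind of alternative I have in mind: the same person-month rows and the same covariates, with only the link function changed. It is a sketch under the assumptions that -sysuse cancer- matches the course file and that lnj is simply ln(j); the stored-estimate names (m_cloglog, m_logit, m_probit) are my own labels.

```stata
* Rebuild the person-month data (assumption: shipped cancer.dta = course file)
sysuse cancer, clear
generate id = _n
recode drug 1=0 2/3=1
expand studytime
bysort id: generate j = _n
bysort id: generate dead = died==1 & _n==_N
generate lnj = ln(j)                 // assumed definition of the log(j) baseline

* Same binary outcome and regressors, three different links
cloglog dead drug age lnj, nolog
estimates store m_cloglog
logit dead drug age lnj, nolog
estimates store m_logit
probit dead drug age lnj, nolog
estimates store m_probit

* Log likelihoods, AIC and BIC side by side
estimates stats m_cloglog m_logit m_probit
```

This only compares fit. My understanding is that the exponentiated cloglog coefficients can be read as hazard ratios and the exponentiated logit coefficients as odds ratios, whereas probit coefficients have no comparable interpretation, but I would appreciate confirmation.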
My final question is more straightforward: how can I test the goodness of fit of the model in Prof. Jenkins's code, specifically the cloglog model?
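For reference, these are the checks I know how to run so far. This is a rough sketch, not something taken from the course materials: it assumes -sysuse cancer- matches the course file and that lnj is ln(j), and the name "full" is just my label for the stored estimates.

```stata
* Rebuild the person-month data (assumption: shipped cancer.dta = course file)
sysuse cancer, clear
generate id = _n
recode drug 1=0 2/3=1
expand studytime
bysort id: generate j = _n
bysort id: generate dead = died==1 & _n==_N
generate lnj = ln(j)                 // assumed definition of the log(j) baseline

cloglog dead drug age lnj, nolog
estimates store full
estat ic                             // AIC and BIC for the fitted model

* Specification check: refits on the linear prediction and its square;
* a significant _hatsq would suggest misspecification
linktest

* LR test of the log(j) baseline against a constant baseline hazard
cloglog dead drug age, nolog
lrtest full .

* The same model via -glm- also reports deviance and Pearson statistics
glm dead drug age lnj, family(binomial) link(cloglog) nolog
display "deviance = " e(deviance) "   -2*LogL = " -2*e(ll)
```

I am not sure how much weight the deviance deserves with binary person-month data, which is part of why I am asking.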