Hello Statalisters,

I am trying to teach myself survival analysis in Stata, and at the moment I am working through Prof. Jenkins's materials available at https://www.iser.essex.ac.uk/resourc...sis-with-stata. Apologies if my questions are obvious. I am focusing on discrete-time models for now. Consider the code from Lesson 6 (ex6_1.do). Although the data are in person-month format, I am confused about what difference the id variable makes for the estimation, since it does not appear in the estimation command itself:

(I am skipping a few lines of code relating to alternative functional forms for the baseline hazard function)

Code:
************** Estimation: (i) discrete time models *********


************* Cancer data **********
use cancer, clear

ge id = _n  
lab var id "subject identifier"

* drug = 1 (placebo); drug =2,3 (receives drug)
ta drug died
recode drug 1=0 2/3=1
lab var drug "receives drug?"
lab def drug 0 "placebo" 1 "drug"
lab val drug drug

ta drug

************************************
* Episode-splitting --> data in person-month format

expand studytim  
bysort id: ge j = _n  
* spell month identifier, by subject
lab var j "spell month"
bysort id: ge dead = died==1 & _n==_N
lab var dead "binary depvar for discrete hazard model"

* We don't have to -stset- the data for estimation, but might as
*    well -- it emphasises parallels with continuous time models
*    esp. when there are TVCs.
stset j, failure(dead) id(id)

**********************************************************
****************** CLOGLOG HAZARD MODELS *****************
* Compare model estimated with different baseline hazard specifications.
* Use -predict- to derive estimate of predicted hazard and survivor function
* and thence median duration.  First use within-sample info.


* log(j) baseline [and 'or' option; logit versus logistic]
* (re-creating lnj here, since I skipped the lines above that defined it)
ge lnj = ln(j)
lab var lnj "log of spell month"

* cloglog = glm, f(b) l(c). Can also use -glm- .
* See help -glm- and note glm Deviance = -2*LogL from cloglog

glm dead drug age lnj, f(b) l(c)  
glm, eform
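* (My own check, not in the original do-file: the comment above says the
*  glm deviance equals -2 times the cloglog log likelihood. I assume the
*  stored results e(deviance) and e(ll) are the right ones to compare.)
di "glm deviance       = " e(deviance)
quietly cloglog dead drug age lnj
di "-2 * cloglog log L = " -2*e(ll)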

cloglog dead drug age lnj, nolog
predict h, p

cloglog, eform    // replay results, but this time as hazard ratios
Output of the last command:

Code:
Complementary log-log regression                Number of obs     =        744
                                                Zero outcomes     =        713
                                                Nonzero outcomes  =         31

                                                LR chi2(3)        =      35.20
Log likelihood = -111.26371                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
        dead |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |   .1120209   .0460504    -5.33   0.000     .0500473    .2507365
         age |   1.126762   .0418758     3.21   0.001     1.047605      1.2119
         lnj |   1.896999    .465617     2.61   0.009     1.172574    3.068979
       _cons |   .0000488   .0001108    -4.37   0.000     5.67e-07    .0041954
------------------------------------------------------------------------------

. 
end of do-file

. count
  744
Further, if I completely drop the id variable and rerun the cloglog commands, it produces (as expected) exactly the same output. Doesn't this mean that the model treats the data as a set of 744 unrelated observations, with the length of a spell controlled only by having time as one of the explanatory variables? What am I missing?
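(To make the question concrete, here is the kind of check I have in mind; I am not certain this is the right way to think about it. My assumption is that id only starts to matter once I ask for standard errors clustered on the person, or fit a frailty / random-effects version of the model:)

Code:
* same model, but with standard errors clustered on the person identifier
cloglog dead drug age lnj, vce(cluster id) nolog

* random-effects ("frailty") version, where id defines the panels
xtset id
xtcloglog dead drug age lnj, nolog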

Second, for this specific example, what would be the disadvantages of instead fitting a probit model to this same data structure, i.e. where each time period is treated as a separate observation and time enters as an independent variable, just as in the example above?
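(Again, just to be concrete about what I mean: fitting the alternative link functions to the same person-month data and comparing them, along these lines; the variable names are those from the do-file above:)

Code:
cloglog dead drug age lnj, nolog
estimates store cll
logit   dead drug age lnj, nolog
estimates store lgt
probit  dead drug age lnj, nolog
estimates store prb

* coefficients side by side, and AIC/BIC for the three link functions
estimates table cll lgt prb, b(%9.4f) se
estimates stats cll lgt prb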

My final question is more straightforward: how can I test the goodness of fit of the model from Prof. Jenkins's code, specifically the cloglog one?
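(The only things I could come up with myself are comparing information criteria across baseline specifications, and testing the log(j) baseline against a fully flexible one built from month dummies, since the dummies nest any function of j. I am not sure these count as proper goodness-of-fit tests for a discrete-time hazard model, which is really what I am asking:)

Code:
* information criteria for the log(j) baseline
quietly cloglog dead drug age lnj
estimates store m_lnj
estat ic

* fully flexible baseline: one dummy per spell month, which nests ln(j)
* (months in which nobody dies may be dropped for perfect prediction,
*  in which case the two models are no longer fitted to the same sample
*  and the LR test below would not be valid)
quietly cloglog dead drug age i.j
estimates store m_dum
estat ic

* LR test of the restriction implied by the log(j) baseline
lrtest m_dum m_lnj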