Dear statalist community,

Following the advice from the community given in my last post, I set up a survival analysis of my data. To recap briefly, I am doing a study about risk factors for stroke in a subset of patients. For doing so, I gathered retrospective data from a cohort of patients and organised it into" assessments", which every assessment meaning each time they came to the clinic and we have different clinical information from them. As in every assessment not all patients have the same clinical information, I have a problem with missing data in some variables which I will come to later.

My idea is to:

1. Encounter covariates (glomerular filtration rate, presence of white matter lesions at the brain MRI, heart status, treatment...) associated with an increasing number of events (stroke) or events that happen sooner
2. With such variables, fit a Cox model for risk prediction.

For doing so, I have stset my data, I have done the Kaplan-Meyer analysis of the different variables (graphically and logrank analysis), run an univariate Cox model with all the important covariates and then a multivariate one. As I several doubts have arisen, I wanted to share them with you to seek help. I will try to maintain one post per doubt, so I will start with the first one, in case I had to redo everything from the beginning.

To give you a glimpse of how my stset data looks, let me share here the stset output and a dataex example of some covariates and the stset variables.

Code:
 stset
-> stset meanageass_, id(id) failure(stroketotallong_==1) enter(time==.)
                      exit(stroketotallong_==2)

                id:  id
     failure event:  stroketotallong_ == 1
obs. time interval:  (meanageass_[_n-1], meanageass_]
 enter on or after:  time==.
 exit on or before:  stroketotallong_==2

------------------------------------------------------------------------------
      2,258  total observations
        409  observations end on or before enter()
------------------------------------------------------------------------------
      1,849  observations remaining, representing
        380  subjects
         73  failures in multiple-failure-per-subject data
      2,093  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =        15
                                          last observed exit t =        87

1-My subject identification variable is id.
2-My failure variables is stroketotallong_ which records 0/1 the status of stroke yes/no. As I wanted to include the subjects with more than one stroke, I set up the exit condition with stroketotallong_==2 which is not met by any subject, and let me include as many as 1 as possible.
3-My time variable was initially the number of assessment, as each assessment was 1 year apart from the other. However, from reading the recommended book "An introduction to survival analysis using stata" and after reading this article , I decided to change it to the age of the participants. It makes sense, as in the same assessment patients with different ages, and thus, with different risk for stroke, might be considered to have equal risk if the time variable is the assessment time.

The data with some covariates looks like:


Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int id float meanageass_ int gfr_ float(wml_ treatment_ LAecho_ stroketotallong_) byte(_st _d _t _t0)
 1 34   . . . . 0 0 .  .  .
 1 35  97 1 0 . 0 1 0 35 34
 1 36   . . 0 . 0 1 0 36 35
 1 37   . . 0 . 0 1 0 37 36
 1 38   . . 0 . 0 1 0 38 37
 1 39   . . 0 . 0 1 0 39 38
 2 27   . . . . 0 0 .  .  .
 2 28  98 0 1 1 0 1 0 28 27
 2 29  96 0 2 . 0 1 0 29 28
 2 30  98 . 1 . 0 1 0 30 29
 2 31 104 . 1 . 0 1 0 31 30
 2 32 103 . 1 . 0 1 0 32 31
 2 33 105 . . . 0 1 0 33 32
 4 40   . . . . 0 0 .  .  .
 4 41 112 0 1 0 0 1 0 41 40
 4 42 102 . 3 1 0 1 0 42 41
 4 43 103 . 3 . 0 1 0 43 42
 4 44   . . 3 . 0 1 0 44 43
 4 45   . . 3 . 0 1 0 45 44
 5 41   . . . . 0 0 .  .  .
 6 32   . . . . 0 0 .  .  .
 6 33  94 0 1 2 0 1 0 33 32
 6 34  79 . 2 . 0 1 0 34 33
 6 35  93 0 2 . 0 1 0 35 34
 6 36 103 . 2 . 0 1 0 36 35
 6 37  63 . 2 . 0 1 0 37 36
 6 41  48 . . . 0 1 0 41 37
 7 58   . . . . 0 0 .  .  .
 7 59  98 . 3 . 0 1 0 59 58
 7 61 100 . 3 . 0 1 0 61 59
 7 62 106 . 1 0 0 1 0 62 61
 7 63 102 . 3 . 0 1 0 63 62
 7 64 111 . 3 . 0 1 0 64 63
 7 67  93 . . . 0 1 0 67 64
 8 20   . . . . 0 0 .  .  .
 8 21 115 0 3 1 0 1 0 21 20
 9 40   . . . . 0 0 .  .  .
 9 42  92 1 1 0 0 1 0 42 40
 9 44  80 1 1 . 0 1 0 44 42
 9 45  84 . 1 . 0 1 0 45 44
 9 46   . . 1 . 0 1 0 46 45
 9 47   . . 1 . 0 1 0 47 46
10 21   . . . . 0 0 .  .  .
10 23 117 0 0 0 0 1 0 23 21
10 25 121 0 0 0 0 1 0 25 23
10 26 115 . 0 . 0 1 0 26 25
10 27 129 . 0 . 0 1 0 27 26
10 28 121 . 0 . 0 1 0 28 27
10 29 106 . . . 0 1 0 29 28
11 25   . . . . 0 0 .  .  .
11 26 131 0 0 0 0 1 0 26 25
11 28 120 0 0 1 0 1 0 28 26
11 29 132 . 0 . 0 1 0 29 28
11 30 124 . 0 . 0 1 0 30 29
11 31  97 . 0 . 0 1 0 31 30
11 32 125 . . . 0 1 0 32 31
13 38   . . . . 0 0 .  .  .
13 39 121 0 0 . 0 1 0 39 38
13 41 124 . 0 . 0 1 0 41 39
13 42 106 . 0 . 0 1 0 42 41
13 43 105 . 0 . 0 1 0 43 42
13 44 114 . 0 . 0 1 0 44 43
14 41   . . . . 1 0 .  .  .
14 42 109 . 3 . 0 1 0 42 41
14 43  95 0 1 0 0 1 0 43 42
14 44  89 0 3 0 0 1 0 44 43
14 45 107 . 3 0 0 1 0 45 44
14 46 105 . 3 . 0 1 0 46 45
14 47 103 . . . 0 1 0 47 46
15 35   . . . . 0 0 .  .  .
15 36 108 0 1 0 0 1 0 36 35
15 37 108 0 1 0 0 1 0 37 36
15 38 102 . 1 0 0 1 0 38 37
15 39 113 . 1 0 0 1 0 39 38
15 40 105 . 1 . 0 1 0 40 39
15 41 104 . . . 0 1 0 41 40
16 66   . . . . 0 0 .  .  .
16 67 110 0 1 0 0 1 0 67 66
16 69  97 0 3 0 0 1 0 69 67
16 70 103 . 3 . 0 1 0 70 69
16 71  93 . 3 . 0 1 0 71 70
16 72   . . 3 . 0 1 0 72 71
17 60   . . . . 1 0 .  .  .
17 61  76 1 1 0 0 1 0 61 60
17 62  77 1 1 0 0 1 0 62 61
17 63  72 . 1 0 0 1 0 63 62
17 64  64 . 1 . 0 1 0 64 63
17 65  61 . 1 . 0 1 0 65 64
17 66  63 . . . 0 1 0 66 65
18 21   . . . . 0 0 .  .  .
18 22 113 0 1 0 0 1 0 22 21
18 23 119 0 1 . 0 1 0 23 22
18 24 113 . 1 . 0 1 0 24 23
18 25 122 . 1 . 0 1 0 25 24
18 26 123 . 1 . 0 1 0 26 25
18 27 116 . . . 0 1 0 27 26
19 33   . . . . 1 0 .  .  .
19 34 115 2 1 0 0 1 0 34 33
19 35  99 2 1 0 0 1 0 35 34
19 36 110 . 1 . 0 1 0 36 35
end
label values gfr_ gfrgoods
label values wml_ wml
label def wml 0 "No WML", modify
label def wml 1 "Fazekas 1", modify
label def wml 2 "Fazekas 2", modify
label values treatment_ Treatment
label def Treatment 0 "No treatment", modify
label def Treatment 1 "Agalsidase alpha", modify
label def Treatment 2 "Agalsidase beta", modify
label def Treatment 3 "Migalastat", modify
label values LAecho_ LA
label def LA 0 "No dilated", modify
label def LA 1 "Mildly Dilated", modify
label def LA 2 "Modeately dilated", modify
The overall survival function in this data is the following:

Code:
     failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

             At           Net    Survivor      Std.
  Time     Risk   Fail   Lost    Function     Error     [95% Conf. Int.]
------------------------------------------------------------------------
    15        0      0     -2      1.0000         .          .         .
    16        2      0     -1      1.0000         .          .         .
    17        3      0    -10      1.0000         .          .         .
    18       13      0     -7      1.0000         .          .         .
    19       20      0     -5      1.0000         .          .         .
    20       25      0     -4      1.0000         .          .         .
    21       29      0     -7      1.0000         .          .         .
    22       36      0     -5      1.0000         .          .         .
    23       41      0      6      1.0000         .          .         .
    24       35      0      1      1.0000         .          .         .
    25       34      0     -4      1.0000         .          .         .
    26       38      1      5      0.9737    0.0260     0.8275    0.9963
    27       32      0     -3      0.9737    0.0260     0.8275    0.9963
    29       35      0      4      0.9737    0.0260     0.8275    0.9963
    31       31      1     -1      0.9423    0.0398     0.7870    0.9853
    32       31      0     -4      0.9423    0.0398     0.7870    0.9853
    33       35      1      4      0.9154    0.0469     0.7593    0.9720
    34       30      0      5      0.9154    0.0469     0.7593    0.9720
    35       25      0    -11      0.9154    0.0469     0.7593    0.9720
    36       36      2     -7      0.8645    0.0564     0.7043    0.9413
    37       41      0      2      0.8645    0.0564     0.7043    0.9413
    38       39      2     -5      0.8202    0.0616     0.6592    0.9100
    39       42      1      3      0.8006    0.0632     0.6403    0.8950
    40       38      1     -4      0.7796    0.0649     0.6190    0.8787
    41       41      3     -1      0.7225    0.0680     0.5636    0.8318
    42       39      2     -5      0.6855    0.0694     0.5280    0.7999
    43       42      2      1      0.6528    0.0698     0.4980    0.7704
    44       39      0     -2      0.6528    0.0698     0.4980    0.7704
    45       41      1     -1      0.6369    0.0699     0.4834    0.7558
    46       41      3    -10      0.5903    0.0698     0.4413    0.7120
    47       48      1      5      0.5780    0.0694     0.4308    0.6999
    48       42      1      2      0.5642    0.0691     0.4187    0.6865
    49       39      1     -5      0.5498    0.0688     0.4059    0.6724
    50       43      2     -3      0.5242    0.0680     0.3840    0.6468
    51       44      1      0      0.5123    0.0675     0.3739    0.6346
    52       43      4     -6      0.4646    0.0653     0.3337    0.5855
    53       45      4     -6      0.4233    0.0626     0.2998    0.5415
    54       47      1      1      0.4143    0.0619     0.2927    0.5316
    55       45      1      1      0.4051    0.0613     0.2853    0.5216
    56       43      1      0      0.3957    0.0605     0.2777    0.5113
    57       42      0      3      0.3957    0.0605     0.2777    0.5113
    58       39      2     -5      0.3754    0.0591     0.2612    0.4892
    59       42      4      1      0.3397    0.0561     0.2328    0.4493
    60       37      0      2      0.3397    0.0561     0.2328    0.4493
    61       35      2     -7      0.3202    0.0546     0.2172    0.4278
    62       40      1      0      0.3122    0.0538     0.2110    0.4186
    63       39      2      0      0.2962    0.0522     0.1987    0.4002
    64       37      2      3      0.2802    0.0506     0.1863    0.3817
    65       32      0     -1      0.2802    0.0506     0.1863    0.3817
    66       33      1      5      0.2717    0.0498     0.1797    0.3719
    67       27      1      3      0.2617    0.0489     0.1717    0.3606
    68       23      1      0      0.2503    0.0481     0.1623    0.3481
    69       22      2     -1      0.2275    0.0464     0.1439    0.3229
    70       21      3     -2      0.1950    0.0434     0.1184    0.2859
    71       20      1     -1      0.1853    0.0423     0.1110    0.2744
    72       20      1      0      0.1760    0.0412     0.1042    0.2633
    73       19      1     -1      0.1667    0.0400     0.0974    0.2522
    74       19      1      1      0.1580    0.0389     0.0910    0.2415
    75       17      2     -3      0.1394    0.0365     0.0776    0.2189
    76       18      2      0      0.1239    0.0340     0.0671    0.1991
    77       16      3      0      0.1007    0.0302     0.0516    0.1690
    78       13      2      3      0.0852    0.0274     0.0415    0.1487
    79        8      0      3      0.0852    0.0274     0.0415    0.1487
    81        5      0     -1      0.0852    0.0274     0.0415    0.1487
    83        6      2     -1      0.0568    0.0246     0.0212    0.1183
    84        5      0      2      0.0568    0.0246     0.0212    0.1183
    86        3      0      1      0.0568    0.0246     0.0212    0.1183
    87        2      0      2      0.0568    0.0246     0.0212    0.1183
------------------------------------------------------------------------
Note: Net Lost equals the number lost minus the number who entered.
Which looks like

Array

I can show you what it looked like with the assessment time (time since enrolment more or less) as time variable:

Code:
            At           Net    Survivor      Std.
  Time     Risk   Fail   Lost    Function     Error     [95% Conf. Int.]
------------------------------------------------------------------------
     1      380     28    -10      0.9263    0.0134     0.8951    0.9485
     2      362      6      5      0.9110    0.0146     0.8776    0.9355
     3      351     10      0      0.8850    0.0163     0.8486    0.9131
     4      341     10     14      0.8591    0.0178     0.8200    0.8902
     5      317      7    103      0.8401    0.0188     0.7993    0.8733
     6      207      9    124      0.8036    0.0215     0.7572    0.8420
     7       74      0     45      0.8036    0.0215     0.7572    0.8420
     8       29      0     14      0.8036    0.0215     0.7572    0.8420
     9       15      1      7      0.7500    0.0555     0.6210    0.8405
    10        7      1      2      0.6428    0.1100     0.3891    0.8132
    11        4      0      2      0.6428    0.1100     0.3891    0.8132
    12        2      0      1      0.6428    0.1100     0.3891    0.8132
    16        1      1      0      0.0000         .          .         .
------------------------------------------------------------------------
Note: Net Lost equals the number lost minus the number who entered.
Array

So, continuing from the point where I have my data stset with age as the time variable, let me show you what I did with one of the variables to find if it is a risk factor for stroke or not. I am going to choose gender as the example variable.

1. Kaplan Meyer analysis of gender and stroke

Code:
. sts list, by(sex) compare

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

                 Survivor Function
sex                Male     Female
----------------------------------
time      15     1.0000     1.0000
          24     1.0000     1.0000
          33     0.9412     0.9143
          42     0.5686     0.7514
          51     0.3162     0.6667
          60     0.2163     0.4321
          69     0.1471     0.2807
          78     0.0525     0.1026
          87     0.0525     0.0616
----------------------------------

. sts test sex, logrank

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id


Log-rank test for equality of survivor functions

       |   Events         Events
sex    |  observed       expected
-------+-------------------------
Male   |        34          28.93
Female |        39          44.07
-------+-------------------------
Total  |        73          73.00

             chi2(1) =       1.59
             Pr>chi2 =     0.2072
Array

The results show that it is not significant, so the number of events is not different between groups. My first question is: Can I say this is somehow adjusted by age? I mean, that despite that the curves seems horizontally separated, there is no difference in age? Because I can understand that the number of events is more or less similar, but in this graph, the median survival time age is different between males and females, and this also happens with other variables despite not being significant in the logrank test, and I do not know how to demonstrate that the median age of survival is different, as age is the time variable. Or perhaps it is no different and the logrank test accounts for that as age is the time variable.

2-Univariate Cox analysis

Then I proceed with each variable that was significative in the kaplan meyer + those that despite not being significant I consider clinically important to the Cox analysis. I first do a univariate cox analysis alone and then I "adjust" each variable by gender and a genetic status, called N215S (0/1, yes/no). Let me show you the process for sex and for another variable.

Code:
 stcox sex

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -250.79081
Iteration 1:   log likelihood = -250.04182
Iteration 2:   log likelihood = -250.04144
Refining estimates:
Iteration 0:   log likelihood = -250.04144

Cox regression -- Breslow method for ties

No. of subjects =          380                  Number of obs    =       1,849
No. of failures =           73
Time at risk    =         2093
                                                LR chi2(1)       =        1.50
Log likelihood  =   -250.04144                  Prob > chi2      =      0.2209

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .7456718   .1779156    -1.23   0.219     .4671464    1.190262
------------------------------------------------------------------------------


. stcox sex N215S

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -250.79081
Iteration 1:   log likelihood = -241.75783
Iteration 2:   log likelihood = -241.52702
Iteration 3:   log likelihood = -241.52647
Refining estimates:
Iteration 0:   log likelihood = -241.52647

Cox regression -- Breslow method for ties

No. of subjects =          380                  Number of obs    =       1,849
No. of failures =           73
Time at risk    =         2093
                                                LR chi2(2)       =       18.53
Log likelihood  =   -241.52647                  Prob > chi2      =      0.0001

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .6068005   .1492187    -2.03   0.042     .3747369    .9825745
       N215S |   .3001275   .0959839    -3.76   0.000      .160355    .5617316
------------------------------------------------------------------------------

My second question is: I am not adding age in the "adjusted cox model with N215S" because I think it is iplied adjusted by the time variable. Is that correct?
My third question is: As you can see, the univariate cox analysis with gender alone has a loglikehood test not significant. Should I proceed despite that result to the adjust model with the variable sex despite that result? Because then, when I adjust by N215S, the loglikelihood ratio is significant and even the HR is significant, so I do not know if consider sex, preliminary, a risk factor for stroke or not.


Let me show another examples with another variables: dilatation at the echography (LAechodilated 0/1 yes/no) and white matter lesions at the mri (0/1 yes/no). Here I know that I have a problem with missing data which I am thinking to resolve with mi, at least to compare a model with multiple imputation and one without.

White matter lesions:

Code:
. stcox wmlyesno

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -108.62454
Iteration 1:   log likelihood = -106.17538
Iteration 2:   log likelihood = -106.15317
Iteration 3:   log likelihood = -106.15316
Refining estimates:
Iteration 0:   log likelihood = -106.15316

Cox regression -- Breslow method for ties

No. of subjects =          373                  Number of obs    =         868
No. of failures =           41
Time at risk    =         1057
                                                LR chi2(1)       =        4.94
Log likelihood  =   -106.15316                  Prob > chi2      =      0.0262

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    wmlyesno |   2.440238   1.028713     2.12   0.034     1.068064    5.575284
------------------------------------------------------------------------------

. stcox wmlyesno sex N215S

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -108.62454
Iteration 1:   log likelihood =  -103.1311
Iteration 2:   log likelihood = -103.07765
Iteration 3:   log likelihood = -103.07762
Refining estimates:
Iteration 0:   log likelihood = -103.07762

Cox regression -- Breslow method for ties

No. of subjects =          373                  Number of obs    =         868
No. of failures =           41
Time at risk    =         1057
                                                LR chi2(3)       =       11.09
Log likelihood  =   -103.07762                  Prob > chi2      =      0.0112

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    wmlyesno |   2.206474   .9194987     1.90   0.058     .9749442    4.993649
         sex |   .5305371   .1806634    -1.86   0.063     .2721803    1.034129
       N215S |   .4525625   .1840767    -1.95   0.051     .2039193    1.004382
------------------------------------------------------------------------------

.
With wml yes/no I can see that the loglikehood pvalue are significant in both cases and that the HR do not reach significance probably because of the missing data. However I have the same questions: Should I add age?

Dilatation at the echo:

Code:
. stcox LAechodilated

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -84.677888
Iteration 1:   log likelihood = -84.665632
Iteration 2:   log likelihood = -84.665632
Refining estimates:
Iteration 0:   log likelihood = -84.665632

Cox regression -- Breslow method for ties

No. of subjects =          344                  Number of obs    =         687
No. of failures =           34
Time at risk    =          833
                                                LR chi2(1)       =        0.02
Log likelihood  =   -84.665632                  Prob > chi2      =      0.8756

-------------------------------------------------------------------------------
           _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
LAechodilated |   .9440953   .3467623    -0.16   0.876     .4595926    1.939361
-------------------------------------------------------------------------------

. stcox sex N215S LAechodilated

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -84.677888
Iteration 1:   log likelihood = -83.489152
Iteration 2:   log likelihood = -83.483187
Iteration 3:   log likelihood = -83.483187
Refining estimates:
Iteration 0:   log likelihood = -83.483187

Cox regression -- Breslow method for ties

No. of subjects =          344                  Number of obs    =         687
No. of failures =           34
Time at risk    =          833
                                                LR chi2(3)       =        2.39
Log likelihood  =   -83.483187                  Prob > chi2      =      0.4956

-------------------------------------------------------------------------------
           _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
          sex |   .6688925   .2716924    -0.99   0.322     .3017263    1.482858
        N215S |   .5543998   .2375908    -1.38   0.169     .2393516    1.284132
LAechodilated |   .7876342   .3085109    -0.61   0.542     .3655231    1.697205
-------------------------------------------------------------------------------
Here the loglikehood ratio is never significant. Should I drop that variable for further risk models?

I know that this post is very long, I have not shown the PH assumption. However, I wanted to start asking by the issue of the age as the time variable, in case I had to modify it. Besides, I took the opportunity to ask about the loglikehood ratio of the stcox and if the way I am "adjusting" the HR in the second stcox analysis is correct or not.

After this, with the variables that are significant plus sex and N215S I plan to define the risk model via stcox, check the PH assumptions and then see If it can be internally validated.

Thank you very much for all your help and I hope this post make sense somehow.

Best,

David.