Following the advice from the community given in my last post, I set up a survival analysis of my data. To recap briefly, I am doing a study about risk factors for stroke in a subset of patients. For doing so, I gathered retrospective data from a cohort of patients and organised it into" assessments", which every assessment meaning each time they came to the clinic and we have different clinical information from them. As in every assessment not all patients have the same clinical information, I have a problem with missing data in some variables which I will come to later.
My idea is to:
1. Encounter covariates (glomerular filtration rate, presence of white matter lesions at the brain MRI, heart status, treatment...) associated with an increasing number of events (stroke) or events that happen sooner
2. With such variables, fit a Cox model for risk prediction.
For doing so, I have stset my data, I have done the Kaplan-Meyer analysis of the different variables (graphically and logrank analysis), run an univariate Cox model with all the important covariates and then a multivariate one. As I several doubts have arisen, I wanted to share them with you to seek help. I will try to maintain one post per doubt, so I will start with the first one, in case I had to redo everything from the beginning.
To give you a glimpse of how my stset data looks, let me share here the stset output and a dataex example of some covariates and the stset variables.
Code:
stset -> stset meanageass_, id(id) failure(stroketotallong_==1) enter(time==.) exit(stroketotallong_==2) id: id failure event: stroketotallong_ == 1 obs. time interval: (meanageass_[_n-1], meanageass_] enter on or after: time==. exit on or before: stroketotallong_==2 ------------------------------------------------------------------------------ 2,258 total observations 409 observations end on or before enter() ------------------------------------------------------------------------------ 1,849 observations remaining, representing 380 subjects 73 failures in multiple-failure-per-subject data 2,093 total analysis time at risk and under observation at risk from t = 0 earliest observed entry t = 15 last observed exit t = 87
1-My subject identification variable is id.
2-My failure variables is stroketotallong_ which records 0/1 the status of stroke yes/no. As I wanted to include the subjects with more than one stroke, I set up the exit condition with stroketotallong_==2 which is not met by any subject, and let me include as many as 1 as possible.
3-My time variable was initially the number of assessment, as each assessment was 1 year apart from the other. However, from reading the recommended book "An introduction to survival analysis using stata" and after reading this article , I decided to change it to the age of the participants. It makes sense, as in the same assessment patients with different ages, and thus, with different risk for stroke, might be considered to have equal risk if the time variable is the assessment time.
The data with some covariates looks like:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input int id float meanageass_ int gfr_ float(wml_ treatment_ LAecho_ stroketotallong_) byte(_st _d _t _t0) 1 34 . . . . 0 0 . . . 1 35 97 1 0 . 0 1 0 35 34 1 36 . . 0 . 0 1 0 36 35 1 37 . . 0 . 0 1 0 37 36 1 38 . . 0 . 0 1 0 38 37 1 39 . . 0 . 0 1 0 39 38 2 27 . . . . 0 0 . . . 2 28 98 0 1 1 0 1 0 28 27 2 29 96 0 2 . 0 1 0 29 28 2 30 98 . 1 . 0 1 0 30 29 2 31 104 . 1 . 0 1 0 31 30 2 32 103 . 1 . 0 1 0 32 31 2 33 105 . . . 0 1 0 33 32 4 40 . . . . 0 0 . . . 4 41 112 0 1 0 0 1 0 41 40 4 42 102 . 3 1 0 1 0 42 41 4 43 103 . 3 . 0 1 0 43 42 4 44 . . 3 . 0 1 0 44 43 4 45 . . 3 . 0 1 0 45 44 5 41 . . . . 0 0 . . . 6 32 . . . . 0 0 . . . 6 33 94 0 1 2 0 1 0 33 32 6 34 79 . 2 . 0 1 0 34 33 6 35 93 0 2 . 0 1 0 35 34 6 36 103 . 2 . 0 1 0 36 35 6 37 63 . 2 . 0 1 0 37 36 6 41 48 . . . 0 1 0 41 37 7 58 . . . . 0 0 . . . 7 59 98 . 3 . 0 1 0 59 58 7 61 100 . 3 . 0 1 0 61 59 7 62 106 . 1 0 0 1 0 62 61 7 63 102 . 3 . 0 1 0 63 62 7 64 111 . 3 . 0 1 0 64 63 7 67 93 . . . 0 1 0 67 64 8 20 . . . . 0 0 . . . 8 21 115 0 3 1 0 1 0 21 20 9 40 . . . . 0 0 . . . 9 42 92 1 1 0 0 1 0 42 40 9 44 80 1 1 . 0 1 0 44 42 9 45 84 . 1 . 0 1 0 45 44 9 46 . . 1 . 0 1 0 46 45 9 47 . . 1 . 0 1 0 47 46 10 21 . . . . 0 0 . . . 10 23 117 0 0 0 0 1 0 23 21 10 25 121 0 0 0 0 1 0 25 23 10 26 115 . 0 . 0 1 0 26 25 10 27 129 . 0 . 0 1 0 27 26 10 28 121 . 0 . 0 1 0 28 27 10 29 106 . . . 0 1 0 29 28 11 25 . . . . 0 0 . . . 11 26 131 0 0 0 0 1 0 26 25 11 28 120 0 0 1 0 1 0 28 26 11 29 132 . 0 . 0 1 0 29 28 11 30 124 . 0 . 0 1 0 30 29 11 31 97 . 0 . 0 1 0 31 30 11 32 125 . . . 0 1 0 32 31 13 38 . . . . 0 0 . . . 13 39 121 0 0 . 0 1 0 39 38 13 41 124 . 0 . 0 1 0 41 39 13 42 106 . 0 . 0 1 0 42 41 13 43 105 . 0 . 0 1 0 43 42 13 44 114 . 0 . 0 1 0 44 43 14 41 . . . . 1 0 . . . 14 42 109 . 3 . 0 1 0 42 41 14 43 95 0 1 0 0 1 0 43 42 14 44 89 0 3 0 0 1 0 44 43 14 45 107 . 3 0 0 1 0 45 44 14 46 105 . 3 . 0 1 0 46 45 14 47 103 . . . 0 1 0 47 46 15 35 . . . . 0 0 . . . 15 36 108 0 1 0 0 1 0 36 35 15 37 108 0 1 0 0 1 0 37 36 15 38 102 . 1 0 0 1 0 38 37 15 39 113 . 1 0 0 1 0 39 38 15 40 105 . 1 . 0 1 0 40 39 15 41 104 . . . 0 1 0 41 40 16 66 . . . . 0 0 . . . 16 67 110 0 1 0 0 1 0 67 66 16 69 97 0 3 0 0 1 0 69 67 16 70 103 . 3 . 0 1 0 70 69 16 71 93 . 3 . 0 1 0 71 70 16 72 . . 3 . 0 1 0 72 71 17 60 . . . . 1 0 . . . 17 61 76 1 1 0 0 1 0 61 60 17 62 77 1 1 0 0 1 0 62 61 17 63 72 . 1 0 0 1 0 63 62 17 64 64 . 1 . 0 1 0 64 63 17 65 61 . 1 . 0 1 0 65 64 17 66 63 . . . 0 1 0 66 65 18 21 . . . . 0 0 . . . 18 22 113 0 1 0 0 1 0 22 21 18 23 119 0 1 . 0 1 0 23 22 18 24 113 . 1 . 0 1 0 24 23 18 25 122 . 1 . 0 1 0 25 24 18 26 123 . 1 . 0 1 0 26 25 18 27 116 . . . 0 1 0 27 26 19 33 . . . . 1 0 . . . 19 34 115 2 1 0 0 1 0 34 33 19 35 99 2 1 0 0 1 0 35 34 19 36 110 . 1 . 0 1 0 36 35 end label values gfr_ gfrgoods label values wml_ wml label def wml 0 "No WML", modify label def wml 1 "Fazekas 1", modify label def wml 2 "Fazekas 2", modify label values treatment_ Treatment label def Treatment 0 "No treatment", modify label def Treatment 1 "Agalsidase alpha", modify label def Treatment 2 "Agalsidase beta", modify label def Treatment 3 "Migalastat", modify label values LAecho_ LA label def LA 0 "No dilated", modify label def LA 1 "Mildly Dilated", modify label def LA 2 "Modeately dilated", modify
Code:
failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id At Net Survivor Std. Time Risk Fail Lost Function Error [95% Conf. Int.] ------------------------------------------------------------------------ 15 0 0 -2 1.0000 . . . 16 2 0 -1 1.0000 . . . 17 3 0 -10 1.0000 . . . 18 13 0 -7 1.0000 . . . 19 20 0 -5 1.0000 . . . 20 25 0 -4 1.0000 . . . 21 29 0 -7 1.0000 . . . 22 36 0 -5 1.0000 . . . 23 41 0 6 1.0000 . . . 24 35 0 1 1.0000 . . . 25 34 0 -4 1.0000 . . . 26 38 1 5 0.9737 0.0260 0.8275 0.9963 27 32 0 -3 0.9737 0.0260 0.8275 0.9963 29 35 0 4 0.9737 0.0260 0.8275 0.9963 31 31 1 -1 0.9423 0.0398 0.7870 0.9853 32 31 0 -4 0.9423 0.0398 0.7870 0.9853 33 35 1 4 0.9154 0.0469 0.7593 0.9720 34 30 0 5 0.9154 0.0469 0.7593 0.9720 35 25 0 -11 0.9154 0.0469 0.7593 0.9720 36 36 2 -7 0.8645 0.0564 0.7043 0.9413 37 41 0 2 0.8645 0.0564 0.7043 0.9413 38 39 2 -5 0.8202 0.0616 0.6592 0.9100 39 42 1 3 0.8006 0.0632 0.6403 0.8950 40 38 1 -4 0.7796 0.0649 0.6190 0.8787 41 41 3 -1 0.7225 0.0680 0.5636 0.8318 42 39 2 -5 0.6855 0.0694 0.5280 0.7999 43 42 2 1 0.6528 0.0698 0.4980 0.7704 44 39 0 -2 0.6528 0.0698 0.4980 0.7704 45 41 1 -1 0.6369 0.0699 0.4834 0.7558 46 41 3 -10 0.5903 0.0698 0.4413 0.7120 47 48 1 5 0.5780 0.0694 0.4308 0.6999 48 42 1 2 0.5642 0.0691 0.4187 0.6865 49 39 1 -5 0.5498 0.0688 0.4059 0.6724 50 43 2 -3 0.5242 0.0680 0.3840 0.6468 51 44 1 0 0.5123 0.0675 0.3739 0.6346 52 43 4 -6 0.4646 0.0653 0.3337 0.5855 53 45 4 -6 0.4233 0.0626 0.2998 0.5415 54 47 1 1 0.4143 0.0619 0.2927 0.5316 55 45 1 1 0.4051 0.0613 0.2853 0.5216 56 43 1 0 0.3957 0.0605 0.2777 0.5113 57 42 0 3 0.3957 0.0605 0.2777 0.5113 58 39 2 -5 0.3754 0.0591 0.2612 0.4892 59 42 4 1 0.3397 0.0561 0.2328 0.4493 60 37 0 2 0.3397 0.0561 0.2328 0.4493 61 35 2 -7 0.3202 0.0546 0.2172 0.4278 62 40 1 0 0.3122 0.0538 0.2110 0.4186 63 39 2 0 0.2962 0.0522 0.1987 0.4002 64 37 2 3 0.2802 0.0506 0.1863 0.3817 65 32 0 -1 0.2802 0.0506 0.1863 0.3817 66 33 1 5 0.2717 0.0498 0.1797 0.3719 67 27 1 3 0.2617 0.0489 0.1717 0.3606 68 23 1 0 0.2503 0.0481 0.1623 0.3481 69 22 2 -1 0.2275 0.0464 0.1439 0.3229 70 21 3 -2 0.1950 0.0434 0.1184 0.2859 71 20 1 -1 0.1853 0.0423 0.1110 0.2744 72 20 1 0 0.1760 0.0412 0.1042 0.2633 73 19 1 -1 0.1667 0.0400 0.0974 0.2522 74 19 1 1 0.1580 0.0389 0.0910 0.2415 75 17 2 -3 0.1394 0.0365 0.0776 0.2189 76 18 2 0 0.1239 0.0340 0.0671 0.1991 77 16 3 0 0.1007 0.0302 0.0516 0.1690 78 13 2 3 0.0852 0.0274 0.0415 0.1487 79 8 0 3 0.0852 0.0274 0.0415 0.1487 81 5 0 -1 0.0852 0.0274 0.0415 0.1487 83 6 2 -1 0.0568 0.0246 0.0212 0.1183 84 5 0 2 0.0568 0.0246 0.0212 0.1183 86 3 0 1 0.0568 0.0246 0.0212 0.1183 87 2 0 2 0.0568 0.0246 0.0212 0.1183 ------------------------------------------------------------------------ Note: Net Lost equals the number lost minus the number who entered.
Array
I can show you what it looked like with the assessment time (time since enrolment more or less) as time variable:
Code:
At Net Survivor Std. Time Risk Fail Lost Function Error [95% Conf. Int.] ------------------------------------------------------------------------ 1 380 28 -10 0.9263 0.0134 0.8951 0.9485 2 362 6 5 0.9110 0.0146 0.8776 0.9355 3 351 10 0 0.8850 0.0163 0.8486 0.9131 4 341 10 14 0.8591 0.0178 0.8200 0.8902 5 317 7 103 0.8401 0.0188 0.7993 0.8733 6 207 9 124 0.8036 0.0215 0.7572 0.8420 7 74 0 45 0.8036 0.0215 0.7572 0.8420 8 29 0 14 0.8036 0.0215 0.7572 0.8420 9 15 1 7 0.7500 0.0555 0.6210 0.8405 10 7 1 2 0.6428 0.1100 0.3891 0.8132 11 4 0 2 0.6428 0.1100 0.3891 0.8132 12 2 0 1 0.6428 0.1100 0.3891 0.8132 16 1 1 0 0.0000 . . . ------------------------------------------------------------------------ Note: Net Lost equals the number lost minus the number who entered.
So, continuing from the point where I have my data stset with age as the time variable, let me show you what I did with one of the variables to find if it is a risk factor for stroke or not. I am going to choose gender as the example variable.
1. Kaplan Meyer analysis of gender and stroke
Code:
. sts list, by(sex) compare failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Survivor Function sex Male Female ---------------------------------- time 15 1.0000 1.0000 24 1.0000 1.0000 33 0.9412 0.9143 42 0.5686 0.7514 51 0.3162 0.6667 60 0.2163 0.4321 69 0.1471 0.2807 78 0.0525 0.1026 87 0.0525 0.0616 ---------------------------------- . sts test sex, logrank failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Log-rank test for equality of survivor functions | Events Events sex | observed expected -------+------------------------- Male | 34 28.93 Female | 39 44.07 -------+------------------------- Total | 73 73.00 chi2(1) = 1.59 Pr>chi2 = 0.2072
The results show that it is not significant, so the number of events is not different between groups. My first question is: Can I say this is somehow adjusted by age? I mean, that despite that the curves seems horizontally separated, there is no difference in age? Because I can understand that the number of events is more or less similar, but in this graph, the median survival time age is different between males and females, and this also happens with other variables despite not being significant in the logrank test, and I do not know how to demonstrate that the median age of survival is different, as age is the time variable. Or perhaps it is no different and the logrank test accounts for that as age is the time variable.
2-Univariate Cox analysis
Then I proceed with each variable that was significative in the kaplan meyer + those that despite not being significant I consider clinically important to the Cox analysis. I first do a univariate cox analysis alone and then I "adjust" each variable by gender and a genetic status, called N215S (0/1, yes/no). Let me show you the process for sex and for another variable.
Code:
stcox sex failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Iteration 0: log likelihood = -250.79081 Iteration 1: log likelihood = -250.04182 Iteration 2: log likelihood = -250.04144 Refining estimates: Iteration 0: log likelihood = -250.04144 Cox regression -- Breslow method for ties No. of subjects = 380 Number of obs = 1,849 No. of failures = 73 Time at risk = 2093 LR chi2(1) = 1.50 Log likelihood = -250.04144 Prob > chi2 = 0.2209 ------------------------------------------------------------------------------ _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- sex | .7456718 .1779156 -1.23 0.219 .4671464 1.190262 ------------------------------------------------------------------------------ . stcox sex N215S failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Iteration 0: log likelihood = -250.79081 Iteration 1: log likelihood = -241.75783 Iteration 2: log likelihood = -241.52702 Iteration 3: log likelihood = -241.52647 Refining estimates: Iteration 0: log likelihood = -241.52647 Cox regression -- Breslow method for ties No. of subjects = 380 Number of obs = 1,849 No. of failures = 73 Time at risk = 2093 LR chi2(2) = 18.53 Log likelihood = -241.52647 Prob > chi2 = 0.0001 ------------------------------------------------------------------------------ _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- sex | .6068005 .1492187 -2.03 0.042 .3747369 .9825745 N215S | .3001275 .0959839 -3.76 0.000 .160355 .5617316 ------------------------------------------------------------------------------
My second question is: I am not adding age in the "adjusted cox model with N215S" because I think it is iplied adjusted by the time variable. Is that correct?
My third question is: As you can see, the univariate cox analysis with gender alone has a loglikehood test not significant. Should I proceed despite that result to the adjust model with the variable sex despite that result? Because then, when I adjust by N215S, the loglikelihood ratio is significant and even the HR is significant, so I do not know if consider sex, preliminary, a risk factor for stroke or not.
Let me show another examples with another variables: dilatation at the echography (LAechodilated 0/1 yes/no) and white matter lesions at the mri (0/1 yes/no). Here I know that I have a problem with missing data which I am thinking to resolve with mi, at least to compare a model with multiple imputation and one without.
White matter lesions:
Code:
. stcox wmlyesno failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Iteration 0: log likelihood = -108.62454 Iteration 1: log likelihood = -106.17538 Iteration 2: log likelihood = -106.15317 Iteration 3: log likelihood = -106.15316 Refining estimates: Iteration 0: log likelihood = -106.15316 Cox regression -- Breslow method for ties No. of subjects = 373 Number of obs = 868 No. of failures = 41 Time at risk = 1057 LR chi2(1) = 4.94 Log likelihood = -106.15316 Prob > chi2 = 0.0262 ------------------------------------------------------------------------------ _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- wmlyesno | 2.440238 1.028713 2.12 0.034 1.068064 5.575284 ------------------------------------------------------------------------------ . stcox wmlyesno sex N215S failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Iteration 0: log likelihood = -108.62454 Iteration 1: log likelihood = -103.1311 Iteration 2: log likelihood = -103.07765 Iteration 3: log likelihood = -103.07762 Refining estimates: Iteration 0: log likelihood = -103.07762 Cox regression -- Breslow method for ties No. of subjects = 373 Number of obs = 868 No. of failures = 41 Time at risk = 1057 LR chi2(3) = 11.09 Log likelihood = -103.07762 Prob > chi2 = 0.0112 ------------------------------------------------------------------------------ _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- wmlyesno | 2.206474 .9194987 1.90 0.058 .9749442 4.993649 sex | .5305371 .1806634 -1.86 0.063 .2721803 1.034129 N215S | .4525625 .1840767 -1.95 0.051 .2039193 1.004382 ------------------------------------------------------------------------------ .
Dilatation at the echo:
Code:
. stcox LAechodilated failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Iteration 0: log likelihood = -84.677888 Iteration 1: log likelihood = -84.665632 Iteration 2: log likelihood = -84.665632 Refining estimates: Iteration 0: log likelihood = -84.665632 Cox regression -- Breslow method for ties No. of subjects = 344 Number of obs = 687 No. of failures = 34 Time at risk = 833 LR chi2(1) = 0.02 Log likelihood = -84.665632 Prob > chi2 = 0.8756 ------------------------------------------------------------------------------- _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] --------------+---------------------------------------------------------------- LAechodilated | .9440953 .3467623 -0.16 0.876 .4595926 1.939361 ------------------------------------------------------------------------------- . stcox sex N215S LAechodilated failure _d: stroketotallong_ == 1 analysis time _t: meanageass_ enter on or after: time==. exit on or before: stroketotallong_==2 id: id Iteration 0: log likelihood = -84.677888 Iteration 1: log likelihood = -83.489152 Iteration 2: log likelihood = -83.483187 Iteration 3: log likelihood = -83.483187 Refining estimates: Iteration 0: log likelihood = -83.483187 Cox regression -- Breslow method for ties No. of subjects = 344 Number of obs = 687 No. of failures = 34 Time at risk = 833 LR chi2(3) = 2.39 Log likelihood = -83.483187 Prob > chi2 = 0.4956 ------------------------------------------------------------------------------- _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] --------------+---------------------------------------------------------------- sex | .6688925 .2716924 -0.99 0.322 .3017263 1.482858 N215S | .5543998 .2375908 -1.38 0.169 .2393516 1.284132 LAechodilated | .7876342 .3085109 -0.61 0.542 .3655231 1.697205 -------------------------------------------------------------------------------
I know that this post is very long, I have not shown the PH assumption. However, I wanted to start asking by the issue of the age as the time variable, in case I had to modify it. Besides, I took the opportunity to ask about the loglikehood ratio of the stcox and if the way I am "adjusting" the HR in the second stcox analysis is correct or not.
After this, with the variables that are significant plus sex and N215S I plan to define the risk model via stcox, check the PH assumptions and then see If it can be internally validated.
Thank you very much for all your help and I hope this post make sense somehow.
Best,
David.
0 Response to Survival analysis. Age as time variable. Need to include age in posterior analysis.
Post a Comment