Dear all,
I am trying to restructure/compress my dataset because it is currently too big to work with. In its current form I have around 200,000 individuals, each observed in 1,080 time periods (days) – giving me a dataset with more than 200 million observations.
I am using it for a survival analysis and its current form looks like this:
id t0 t1 y var1 var2 var3
1 0 1 0 0 0 4
1 1 2 0 2 1 4
1 2 3 0 2 1 4
1 3 4 0 3 1 4
1 4 5 1 5 0 4
I.e. individual 1’s failure time is t1==5.
var1 and var2 are time-varying variables and var3 is constant.
I am mainly interested in the effect of the time-varying variables var1 and var2.
For instance, I want to run the following cox regression model
stset t1, failure(y==1) time0(t0) id(id)
stcox var1 var2 var3
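As a quick sanity check after -stset-, something like the following should confirm that the data are declared as intended and which covariates actually vary within subjects (a sketch, run on the long-form data):

stset t1, failure(y==1) time0(t0) id(id)
stdescribe                  // per-subject record counts, entry/exit times, failures
stvary var1 var2 var3       // reports which variables are constant vs. varying within id

With the example data above, -stvary- should report var1 and var2 as varying within id and var3 as constant.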
Here, the exponentiated coefficient of var1 is the hazard ratio associated with a one-unit increase in var1 on a given day (equivalently, 100*(HR-1) is the percentage change in the hazard).
However, because of the size of the dataset I am thinking about restructuring to something like:
id t0 t1 y var1 var2 var3
1 0 1 0 0 0 4
1 1 3 0 2 1 4
1 3 4 0 3 1 4
1 4 5 1 5 0 4
I.e. I want to collapse consecutive rows where var1 and var2 don't change, so I end up with fewer observations. A plain collapse by(id var1 var2 y) would also merge non-adjacent rows that happen to share the same covariate values, so I first tag runs of constant covariates and collapse within them:
bysort id (t0): gen spell = sum(var1 != var1[_n-1] | var2 != var2[_n-1])
collapse (first) t0 (last) t1 (max) y (first) var3, by(id spell var1 var2)
My question is:
Will the interpretation of var1 remain the same? I.e. will Stata still know that individual 1 had var1=2 at both t1=2 and t1=3, even though those two days are now a single row spanning t0=1 to t1=3?
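For what it's worth, one way to check this empirically before committing to the restructured data would be to fit the model on both versions and compare the estimates – something along these lines (a sketch; the filenames are placeholders for wherever the two versions are saved):

* fit on the original long-form data
use long_data, clear        // placeholder filename
stset t1, failure(y==1) time0(t0) id(id)
stcox var1 var2 var3
estimates store long_form

* fit on the collapsed data
use collapsed_data, clear   // placeholder filename
stset t1, failure(y==1) time0(t0) id(id)
stcox var1 var2 var3
estimates store collapsed

* coefficients and standard errors should match if the collapse preserved the risk sets
estimates table long_form collapsed, b se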