Dear Statalist community,

Thank you all in advance for taking the time to read this post and help me.

I am doing a research project focused on risk factors for stroke in a subset of patients. To do so, I gathered retrospective data from a cohort of patients and organised it into "assessments". The first assessment is the first time a patient started medical follow-up, and in each assessment different tests and explorations were carried out. The first problem is that not all patients have the same number of assessments, and not all patients have the same number of explorations per assessment; as a result, there is a lot of longitudinal missing data in some areas.

Once the data was collected, we organised the next steps in this way:

1. Descriptive analysis of the cohort to identify important variables that might be considered risk factors.
2. Set up panel data from the longitudinal evolution of patients with those variables and run univariate logistic regressions to select variables for a multivariate logistic regression.
3. Fit a multivariate regression model with the variables that were significant in the previous step.

When we did the descriptive analysis, we picked out some variables that changed across assessments and others that did not. In the end, I assembled the data into a panel with the variable id identifying each individual, time identifying the assessment number (1, 2, ...) and the outcome variable stroketotal (0/1). In addition, there are the following variables: sex for gender; renal function (gfr); mean age at the assessment (it is a mean because some tests within an assessment were done at different times, so I averaged the age for each assessment); presence of an autoimmune disease (autoinmunity 0/1); presence of a specific mutation (N215S 0/1); presence of white matter lesions on MRI (wml 0/1/2/3); and the degree of enlargement of the left atrium (LAecho 0/1/2/3).
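
To make the structure concrete, the panel declaration and a quick look at coverage and missingness could be sketched as follows. This is only an illustration: it uses the variable names from the -dataex- listing further down and assumes that time is the assessment number.

```stata
* Sketch only: declare the panel (id = patient, time = assessment number)
xtset id time

* Pattern of assessments available per patient
xtdescribe

* Missing values in the time-varying covariates
misstable summarize meanageass_ gfr_ wml_ LAecho_
```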

The panel contains 414 patients. About 40 of them had a stroke before the first assessment, and about 40 more had a stroke by the end of the study period. I limited the panel to 7 assessments and therefore ended up with around 2800 data rows.
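
A rough way to check these counts per patient (rather than per row) is sketched below. The helper variables everstroke, strokeatfirst and tag are hypothetical names introduced only for this illustration; they are not part of the dataset.

```stata
* Sketch only: one line per patient, classified by stroke status
* everstroke    = 1 if the outcome is 1 at any assessment
* strokeatfirst = 1 if the outcome is already 1 at the first assessment
bysort id (time): egen byte everstroke = max(stroketotallong_)
bysort id (time): gen byte strokeatfirst = stroketotallong_[1] == 1

* Flag one row per patient so the table counts patients, not rows
egen byte tag = tag(id)
tabulate everstroke strokeatfirst if tag
```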

It looks like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte time float(stroketotallong_ meanageass_) int gfr_ float(wml_ LAecho_ autoinmunitynum N215S) byte sex
 1 1 0 34  97 1 . 0 0 1
 1 2 0 35   . . . 0 0 1
 1 3 0 36   . . . 0 0 1
 1 4 0 37   . . . 0 0 1
 1 5 0 38   . . . 0 0 1
 1 6 0 39   . . . 0 0 1
 1 7 0  .   . . . 0 0 1
 2 1 0 27  98 0 1 0 0 1
 2 2 0 28  96 0 . 0 0 1
 2 3 0 29  98 . . 0 0 1
 2 4 0 30 104 . . 0 0 1
 2 5 0 31 103 . . 0 0 1
 2 6 0 32 105 . . 0 0 1
 2 7 0 33 102 . . 0 0 1
 4 1 0 40 112 0 0 0 0 1
 4 2 0 41 102 . 1 0 0 1
 4 3 0 42 103 . . 0 0 1
 4 4 0 43   . . . 0 0 1
 4 5 0 44   . . . 0 0 1
 4 6 0 45   . . . 0 0 1
 4 7 0  .   . . . 0 0 1
 5 1 0 41  83 1 . 0 0 1
 5 2 1  .   . . . 0 0 1
 5 3 1  .   . . . 0 0 1
 5 4 1  .   . . . 0 0 1
 5 5 1  .   . . . 0 0 1
 5 6 1  .   . . . 0 0 1
 5 7 1  .   . . . 0 0 1
 6 1 0 32  94 0 2 0 1 0
 6 2 0 34  93 0 . 0 1 0
 6 3 0 35 103 . . 0 1 0
 6 4 0 33  79 . . 0 1 0
 6 5 0 36  63 . . 0 1 0
 6 6 0 37  48 . . 0 1 0
 6 7 0 41  46 . . 0 1 0
 7 1 0 61 106 . 0 0 1 1
 7 2 0 58  98 . . 0 1 1
 7 3 0 59 100 . . 0 1 1
 7 4 0 62 102 . . 0 1 1
 7 5 0 63 111 . . 0 1 1
 7 6 0 64  93 . . 0 1 1
 7 7 0 67  99 . . 0 1 1
 8 1 0 20 115 0 1 0 0 0
 8 2 0 21 110 . . 0 0 0
 8 3 0  .   . . . 0 0 0
 8 4 0  .   . . . 0 0 0
 8 5 0  .   . . . 0 0 0
 8 6 0  .   . . . 0 0 0
 8 7 0  .   . . . 0 0 0
 9 1 0 40  92 1 0 0 0 1
 9 2 0 42  80 1 . 0 0 1
 9 3 0 44  84 . . 0 0 1
 9 4 0 45   . . . 0 0 1
 9 5 0 46   . . . 0 0 1
 9 6 0 47   . . . 0 0 1
 9 7 0  .   . . . 0 0 1
10 1 0 21 117 0 0 0 0 1
10 2 0 23 121 0 0 0 0 1
10 3 0 25 115 . . 0 0 1
10 4 0 26 129 . . 0 0 1
10 5 0 27 121 . . 0 0 1
10 6 0 28 106 . . 0 0 1
10 7 0 29  98 . . 0 0 1
11 1 0 25 131 0 0 0 0 1
11 2 0 26 120 0 1 0 0 1
11 3 0 28 132 . . 0 0 1
11 4 0 29 124 . . 0 0 1
11 5 0 30  97 . . 0 0 1
11 6 0 31 125 . . 0 0 1
11 7 0 32 106 . . 0 0 1
13 1 0 38 121 0 . 0 1 1
13 2 0 39 124 . . 0 1 1
13 3 0 41 106 . . 0 1 1
13 4 0 42 105 . . 0 1 1
13 5 0 43 114 . . 0 1 1
13 6 0 44 108 . . 0 1 1
13 7 0  .   . . . 0 1 1
14 1 1 42  95 0 0 0 0 1
14 2 1 43  89 0 0 0 0 1
14 3 1 44 107 . 0 0 0 1
14 4 1 45 105 . . 0 0 1
14 5 1 41 109 . . 0 0 1
14 6 1 46 103 . . 0 0 1
14 7 1 47 109 . . 0 0 1
15 1 0 35 108 0 0 0 0 1
15 2 0 36 108 0 0 0 0 1
15 3 0 37 102 . 0 0 0 1
15 4 0 38 113 . 0 0 0 1
15 5 0 39 105 . . 0 0 1
15 6 0 40 104 . . 0 0 1
15 7 0 41 108 . . 0 0 1
16 1 0 66 110 0 0 0 1 0
16 2 0 67  97 0 0 0 1 0
16 3 0 69 103 . . 0 1 0
16 4 0 70  93 . . 0 1 0
16 5 0 71   . . . 0 1 0
16 6 0 72   . . . 0 1 0
16 7 0  .   . . . 0 1 0
17 1 1 60  76 1 0 0 0 1
17 2 1 61  77 1 0 0 0 1
end
label values autoinmunitynum Zero
label values N215S Zero
label def Zero 0 "No", modify
label def Zero 1 "Yes", modify
label values sex Sex
label def Sex 0 "Male", modify
label def Sex 1 "Female", modify

For the univariate regression analysis I chose the random-effects (re) model, as I had time-varying variables and patients in whom the outcome did not vary. (Perhaps this assumption is wrong.) With it, for some variables I obtain reasonable odds ratios (OR), whereas for others I get an implausible value, probably driven by the small number of events in that category or vice versa.

For example, here is the xtlogit output for GFR and stroke in a model that also includes age and gender, followed by the same model with N215S in place of GFR:

Code:
xtset id

xtlogit stroketotallong_ i.sex c.meanageass_ c.gfr_, or

Random-effects logistic regression              Number of obs     =      1,838
Group variable: id                              Number of groups  =        390

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =        4.7
                                                              max =          7

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(3)      =      38.56
Log likelihood  = -297.21479                    Prob > chi2       =     0.0000

----------------------------------------------------------------------------------
stroketotallong_ | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
             sex |
         Female  |   .3959993   .4896565    -0.75   0.454     .0350895     4.46902
     meanageass_ |   1.416943   .0838234     5.89   0.000     1.261819    1.591137
            gfr_ |   .9731659   .0225645    -1.17   0.241     .9299302    1.018412
           _cons |   5.24e-13   1.87e-12    -7.90   0.000     4.72e-16    5.81e-10
-----------------+----------------------------------------------------------------
        /lnsig2u |   5.151202   .1560547                      4.845341    5.457064
-----------------+----------------------------------------------------------------
         sigma_u |   13.13921   1.025218                      11.27593     15.3104
             rho |      .9813   .0028637                       .974778    .9861595
----------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
Note: _cons estimates baseline odds (conditional on zero random effects).
LR test of rho=0: chibar2(01) = 877.78                 Prob >= chibar2 = 0.000


 xtlogit stroketotallong_ i.sex c.meanageass_ i.N215S, or nolog

Random-effects logistic regression              Number of obs     =      2,256
Group variable: id                              Number of groups  =        409

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =        5.5
                                                              max =          7

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(3)      =     715.32
Log likelihood  = -282.26214                    Prob > chi2       =     0.0000

----------------------------------------------------------------------------------
stroketotallong_ | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
             sex |
         Female  |   .0002462   .0002721    -7.52   0.000     .0000282    .0021489
     meanageass_ |   2.382608   .0859723    24.06   0.000     2.219926    2.557212
                 |
           N215S |
            Yes  |   4.65e-11   8.67e-11   -12.76   0.000     1.20e-12    1.80e-09
           _cons |   8.09e-24   1.68e-23   -25.64   0.000     1.39e-25    4.71e-22
-----------------+----------------------------------------------------------------
        /lnsig2u |   6.104662   .1448829                      5.820697    6.388627
-----------------+----------------------------------------------------------------
         sigma_u |   21.16462   1.533196                      18.36319    24.39342
             rho |   .9927091   .0010486                       .990338    .9945016
----------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
Note: _cons estimates baseline odds (conditional on zero random effects).
LR test of rho=0: chibar2(01) = 1205.64                Prob >= chibar2 = 0.000

As you can see, the OR for N215S makes no sense.

If I instead calculate the OR without a panel data analysis:

Code:
 tab stroketotallong_ N215S

stroketota |         N215S
    llong_ |        No        Yes |     Total
-----------+----------------------+----------
         0 |     1,586        875 |     2,461 
         1 |       367         70 |       437 
-----------+----------------------+----------
     Total |     1,953        945 |     2,898 

logit stroketotallong_ i.sex c.meanageass_ i.N215S, or nolog

Logistic regression                             Number of obs     =      2,256
                                                LR chi2(3)        =     180.27
                                                Prob > chi2       =     0.0000
Log likelihood = -885.08092                     Pseudo R2         =     0.0924

----------------------------------------------------------------------------------
stroketotallong_ | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
             sex |
         Female  |   .6185883   .0776087    -3.83   0.000     .4837367    .7910324
     meanageass_ |   1.043193   .0042768    10.31   0.000     1.034845    1.051609
                 |
           N215S |
            Yes  |   .2324415   .0380368    -8.92   0.000     .1686641    .3203353
           _cons |   .0425143   .0095265   -14.09   0.000     .0274031    .0659583
----------------------------------------------------------------------------------

I have tried to work out why this happens, but I have not been able to understand why the OR differs so much when running xtlogit. In the end, the important question is: can I do a panel data analysis with this kind of data, or am I attempting the impossible? If it is possible, should I try other models (melogit, or xtcloglog because the event is rare)?
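
For concreteness, the alternatives I have in mind would be specified roughly as follows. This is only a sketch of the syntax with the same covariates as above; I have not run these on the data, and I do not know whether they are appropriate here.

```stata
* Sketch only: same covariates as the xtlogit model above
xtset id

* Mixed-effects logistic regression with a random intercept per patient
melogit stroketotallong_ i.sex c.meanageass_ i.N215S || id:, or

* Random-effects complementary log-log model, exponentiated coefficients
xtcloglog stroketotallong_ i.sex c.meanageass_ i.N215S, re eform
```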

Thank you all very much for your help.

David.