Dear Statalisters,

I'm using Stata 17.0.

I have a composite outcome; its three components are: 1) status (positive vs. negative) at T1; 2) status at T2; 3) status at T3. My outcome is "being positive in at least one timepoint vs. being always negative". If I ignored observations with missing values, I would also lose the ones I am sure have outcome = 1, because they are positive in at least one timepoint. If I instead included observations that are positive in at least one timepoint but have at least one missing value, I would overestimate the outcome probability: an observation with 1 or 2 missing values can enter the sample as positive or be dropped as missing, but it can never enter as negative. Thus, I am forced to do imputation. In particular, I

A) used the "mi impute chained (logit)" command with one binary predictor ("treatment", which is also my variable of interest in the final model) and 10 imputations;

B) built my composite outcome, using the "mi passive" command;

C) performed my logistic regression of the composite outcome on treatment, through "mi estimate";

D) after observing that the Largest FMI was equal to 0.4082, repeated steps A-C using 50 imputations (to respect the rule of thumb of having at least 100*LFMI imputations);

E) after observing that the Largest FMI with 50 imputations was equal to 0.4379, accepted the results and moved on to the post-estimation phase (a code sketch of steps A-E follows).
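In code, the steps above look roughly like this (a minimal sketch; T1, T2, T3, treatment, and outcome stand for my actual variable names, and the seed is arbitrary):

    mi set wide
    mi register imputed T1 T2 T3
    mi register regular treatment

    * A) chained logit imputation, treatment as predictor, 10 imputations
    * (step D: rerun with add(50) after seeing Largest FMI = 0.4082)
    mi impute chained (logit) T1 T2 T3 = i.treatment, add(10) rseed(12345)

    * B) composite outcome: positive in at least one timepoint
    mi passive: generate byte outcome = (T1 == 1 | T2 == 1 | T3 == 1)

    * C) logistic regression of the composite outcome on treatment
    mi estimate: logit outcome i.treatment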

I have, however, several doubts about this approach.

1) Does it work better than simply estimating the probabilities for each group separately (e.g., P(T3=1 | T1=0, T2=0, treatment=1)), replacing the missing values with the resulting probabilities of having Outcome=1, and then fitting a fractional logit regression? At the end of the day, there would be 7*2 = 14 probabilities to estimate (7 combinations of T1, T2, and T3 each being missing or 0, excluding the all-zeros case where no imputation is required, times 2 treatment statuses). I understand that probabilities estimated this way would not be observed values, so I guess I would somehow underestimate the standard errors, but it seems to me a much more intuitive approach. A sketch of what I mean follows.
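To make this concrete, here is roughly what I have in mind (a sketch for one of the 14 cells; p_out is a hypothetical variable holding the known outcome or its estimated probability):

    * known outcomes: positive in any observed timepoint, or observed 0 throughout
    generate double p_out = 1 if T1 == 1 | T2 == 1 | T3 == 1
    replace p_out = 0 if T1 == 0 & T2 == 0 & T3 == 0

    * one of the 7*2 cells: P(T3=1 | T1=0, T2=0, treatment=1),
    * estimated as the observed mean of T3 in that cell
    quietly summarize T3 if T1 == 0 & T2 == 0 & treatment == 1
    replace p_out = r(mean) if T1 == 0 & T2 == 0 & missing(T3) & treatment == 1
    * ... and similarly for the remaining cells ...

    * fractional logit on the mixture of 0/1 values and estimated probabilities
    fracreg logit p_out i.treatment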

2) Is the rule of thumb of 100*LFMI imputations still valid when the outcome is binary and the imputed values are outcome components, or should I increase the number of imputations? (A sketch of the check I have in mind follows.)
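For what it is worth, the way I would check whether the number of imputations is adequate is via the per-coefficient FMIs and the Monte Carlo errors of the MI estimates (vartable and mcerror are options of mi estimate, if I read the manual correctly):

    * per-coefficient FMIs and Monte Carlo errors of the MI estimates
    mi estimate, vartable mcerror: logit outcome i.treatment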

3) The imputation model estimates how the probability of being positive at a given timepoint differs depending on whether the subject is positive or negative at the other timepoints. I am not actually interested in that: a positive status at any timepoint makes the status at the other timepoints irrelevant. Shouldn't I use a somehow more direct approach, meant to estimate the probabilities I talk about at point 1, i.e., P(Outcome=1 | available information), disregarding all situations where I already know that the outcome is positive? Or, put another way, shouldn't I base my estimates only on P(T1=1 | t2!=1, t3!=1), P(T2=1 | t1!=1, t3!=1), and P(T3=1 | t1!=1, t2!=1)? (I use capital letters to mean "the real value", thus either 0 or 1, and lower-case letters to mean "the observed value", thus possibly missing.) A sketch follows.
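In Stata terms, the quantities above are directly estimable on subsamples; note that a missing value satisfies T2 != 1, so the condition matches my lower-case notation:

    * P(T1 = 1 | t2 != 1, t3 != 1), by treatment arm
    logit T1 i.treatment if T2 != 1 & T3 != 1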

4) Does the order in which I list the variables to impute matter? I noticed that Stata first imputes values at T1 using treatment, then at T2 using T1 and treatment, then at T3 using T2, T1, and treatment, and then re-estimates everything using everything. Does convergence guarantee that the starting point is irrelevant? Otherwise, how could I get rid of this arbitrariness? (A sketch of the convergence check I have in mind follows.)
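For reference, this is how I would check convergence of the chained equations (a sketch; burnin() and savetrace() are documented options of mi impute chained, and I am assuming the trace file stores per-iteration means under names like T1_mean, plus iter and m variables):

    * longer burn-in, saving per-iteration summaries of the imputed values
    mi impute chained (logit) T1 T2 T3 = i.treatment, add(50) ///
        burnin(100) savetrace(trace, replace) rseed(12345)

    * inspect the trace of the imputed-variable means across iterations
    preserve
    use trace, clear
    keep if m == 1          // first imputation's chain (assumed variable name)
    tsset iter
    tsline T1_mean T2_mean T3_mean
    restore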