I am using Stata/MP 14.2. My installation does not have internet access, so I cannot copy code or output to this forum.

My data are longitudinal: 128 "zones", each with around 700 daily observations of whether or not a pipe break occurred that day. My covariates are time-varying factors specific to each zone, such as water demand and pressure measurements. One issue is that pressure measurements (a key variable) are available for only 42 zones and are severely unbalanced. Time-invariant factors are ignored, since there are too many for us to quantify.

Initially the idea was to do a regression on breaks per mile of pipe, but it has since come to light that the miles used in that calculation are unreliable estimates. So, a binary outcome of whether or not a break happened seems reasonable.

xtlogit offers random-effects, conditional fixed-effects, and population-averaged estimators, but I am not sure which would be best.

As I understand it, random effects is only valid when the panels are a random sample from a larger population. Since the population-averaged approach is similar, is it ruled out for the same reason? Also, since we don't quantify the time-invariant variables, isn't random effects invalid? Does that objection apply to the population-averaged approach too?

That leaves conditional fixed effects, but xtlogit with the fe option doesn't offer cluster-robust standard errors. I could use the bootstrap option (the docs don't explicitly say that this would be sufficient, though threads on this forum suggest it is), or I could use clogit with robust standard errors. But according to the docs, clogit is for matched case-control data...
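
For concreteness, here is a hand-typed sketch of the candidate commands I am weighing. The variable names (zone, day, anybreak, demand, pressure) are placeholders rather than my real names, and the number of bootstrap replications and the seed are arbitrary assumptions.

```stata
* Declare the panel structure: zone is the panel id, day is the time variable
* (all variable names here are hypothetical placeholders)
xtset zone day

* Random-effects logit
xtlogit anybreak demand pressure, re

* Population-averaged (GEE) logit with robust standard errors
xtlogit anybreak demand pressure, pa vce(robust)

* Conditional fixed-effects logit; vce(robust) is not allowed here, so
* bootstrap instead (as I understand it, whole zones are resampled)
xtlogit anybreak demand pressure, fe vce(bootstrap, reps(200) seed(12345))

* Same conditional likelihood via clogit, which does accept clustered SEs
clogit anybreak demand pressure, group(zone) vce(cluster zone)
```

In every case, observations with missing pressure are dropped, so only the zones with pressure data would contribute.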

What is the most appropriate approach? Hosmer and Lemeshow (2013) mention a "cluster-specific" model, but I don't see that language anywhere in the Stata documentation.


Hosmer Jr., D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression. Wiley series in probability and statistics. Hoboken, NJ, USA: John Wiley & Sons, Inc.