Dear Statalisters,

I am seeking guidance on how to approach inference with multilevel mixed-effects linear and logistic regression models when the number of units at the highest level is small, six in my case.

Background: I am analyzing data from an RCT in India. The trial was stratified across six states in various parts of the country. Respondents were randomly assigned to one of five arms (one control and four interventions) and were asked to complete a questionnaire by an interviewer. Some outcomes are binary, some were measured on a four-point Likert scale. All outcomes have repeated measures at the respondent level.

There is a great deal of heterogeneity across states, both in the control arm's mean and in the intervention arms' means relative to the control. That steered me towards a multilevel mixed-effects framework with random intercepts at the state and respondent level, and random slopes at the state level. However, I believe that inference with multilevel mixed-effects models assumes a large number of units at the highest level, so given the small number of units at the highest level (six states), I doubt I can rely on this framework for inference.

I considered adding state dummies and arm-state interactions in the fixed-effects part of the model instead of a random intercept and random slopes, but I would like to allow for intra-state correlation, and again clustering the standard errors at the state level does not sound like a good option.

I came across a paper on the OLS (I think) properties of the wild bootstrap when the number of clusters is small (Cameron et al., 2008), and the user-written command boottest (Roodman et al., 2019). However, boottest's only option after mixed or melogit is the score bootstrap because these models are estimated by ML, and Roodman et al. say that this method 'does not always provide a good approximation' and 'suggest that the score bootstrap not be relied upon without evidence that it works well in the case of interest' (pp. 32-33).

Does anyone have insights on the properties of the score bootstrap in multilevel mixed-effects modeling applications when the number of units at the highest level is small/the clustering variable has few levels? Or more generally, on how to get reliable inference in my case?

Any help would be greatly appreciated.

References:

Cameron, A.C., Gelbach, J.B. and Miller, D.L., 2008. Bootstrap-based improvements for inference with clustered errors. The review of economics and statistics, 90(3), pp.414-427.
Roodman, D., Nielsen, M.Ø., MacKinnon, J.G. and Webb, M.D., 2019. Fast and wild: Bootstrap inference in Stata using boottest. The Stata Journal, 19(1), pp.4-60.