Hi Statalist members,

This is my first post here so pardon me if I deviate from the established etiquette for the forum. I shall cut right to the chase.

I am trying to understand and replicate the analysis of Abadie, Athey, Imbens, and Wooldridge (2017) (https://arxiv.org/abs/1710.02926), particularly what was presented at the Chamberlain Seminar last year (https://www.google.com/url?q=https%3...xpGV5v9dU8jDBi). I am running into issues setting up the Monte Carlo simulation.

The regression is of an outcome on only a constant and the treatment assignment variable (W). The outcome is drawn from a normal distribution with mean alpha for control units and alpha + tau for treated units. Alpha and tau vary across clusters, with means 9.9 and 0.4 and variances 0.15 and 0.12, respectively.
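To make the setup concrete, here is a minimal Python sketch of the outcome-generating process as described above. Several pieces are assumptions and not taken from the paper: alpha and tau are drawn from normal distributions across clusters, the unit-level noise has variance 1, and the function name simulate_outcomes and its arguments are hypothetical.

```python
import numpy as np

def simulate_outcomes(w, cluster, n_clusters=52, seed=0):
    """Draw outcomes given a 0/1 treatment vector and integer cluster ids.

    w       : array of 0/1 treatment indicators, one per unit
    cluster : array of cluster ids in [0, n_clusters), one per unit
    """
    rng = np.random.default_rng(seed)
    # Cluster-level intercepts and treatment effects.
    # Assumption: normal across clusters; the post gives only means and variances.
    alpha = rng.normal(9.9, np.sqrt(0.15), size=n_clusters)
    tau = rng.normal(0.4, np.sqrt(0.12), size=n_clusters)
    # Assumption: independent unit-level noise with variance 1.
    eps = rng.normal(0.0, 1.0, size=len(w))
    # Mean is alpha for controls and alpha + tau for treated units.
    y = alpha[cluster] + tau[cluster] * w + eps
    return y, alpha, tau
```

The variances 0.15 and 0.12 enter as standard deviations via the square root, since NumPy's normal sampler takes a scale (standard deviation) argument.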

Firstly, the treatment assignment variable (W). W is binary with mean 0.55, so it should be drawn from a binomial distribution with a single trial. My understanding is that W1 should be a 52x1 vector holding the mean of W in each cluster, i.e. the cluster-level assignment probabilities. W in cluster i is then drawn from a binomial distribution with probability W1i, for i = 1, ..., 52. The sigmaK that Abadie et al. vary should be the variance of W1. In short, the assignment probabilities across clusters should have mean 0.55 and variance sigmaK.

My problem is that I am drawing W1 from a normal distribution with mean 0.55 and standard deviation sigmaK. When sigmaK is less than approximately 0.23, the draws are all within (0,1), which is required because they serve as the probabilities for the binomial distribution. However, Abadie et al. have a case of highly correlated assignment probabilities where sigmaK = 0.6, and this pushes some of the normal draws outside (0,1). So my question is: what should I be doing so that I get the correct form of W?
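One point worth noting: any variable confined to [0,1] with mean 0.55 has variance at most 0.55 x 0.45 = 0.2475 (standard deviation about 0.497), so 0.6 cannot be the standard deviation or the variance of the probabilities themselves. One possible reading, which is an assumption and not confirmed by the paper, is that sigmaK applies on a latent scale such as the log-odds. A minimal Python sketch of that idea follows; the function name draw_assignment is hypothetical.

```python
import numpy as np

def draw_assignment(cluster, n_clusters=52, mean_p=0.55, sigma=0.6, seed=0):
    """Draw cluster-level probabilities on the log-odds scale, then binary W.

    Assumption: sigma is the standard deviation of the log-odds across clusters,
    not of the probabilities. This is a guess, not the paper's stated design.
    """
    rng = np.random.default_rng(seed)
    # Centre the latent draws at logit(mean_p).
    center = np.log(mean_p / (1.0 - mean_p))
    latent = rng.normal(center, sigma, size=n_clusters)
    # The logistic transform maps every latent draw into (0, 1).
    p = 1.0 / (1.0 + np.exp(-latent))
    # One Bernoulli draw per unit, using the probability of the unit's cluster.
    w = rng.binomial(1, p[cluster])
    return w, p
```

With this construction the average probability is only approximately mean_p, because the logistic transform is nonlinear; the intercept would need to be calibrated if an exact mean of 0.55 is required.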

Secondly, the simulation results report the true standard deviation. I would naively assume this to be the standard error from an OLS regression that uses the entire population as the sample. But this does not sit right, since the standard error (variance) is a function of q (the proportion of observed clusters) and should vary accordingly. What would they be treating as the true standard error?
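One candidate interpretation, offered as an assumption and not as what the paper does, is that the "true" standard deviation is the standard deviation of the OLS estimate across Monte Carlo replications, with clusters resampled with probability q and assignment redrawn each time. The Python sketch below illustrates that reading. The names ols_slope and monte_carlo_sd are hypothetical, the noise variance of 1 is assumed, and assignment is simplified to independent draws with probability 0.55.

```python
import numpy as np

def ols_slope(y, w):
    # OLS of y on a constant and a binary w: the slope is the difference in means.
    return y[w == 1].mean() - y[w == 0].mean()

def monte_carlo_sd(q=0.5, n_clusters=52, n_per=200, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    cluster = np.repeat(np.arange(n_clusters), n_per)
    # Population-level cluster parameters, held fixed across replications.
    alpha = rng.normal(9.9, np.sqrt(0.15), size=n_clusters)
    tau = rng.normal(0.4, np.sqrt(0.12), size=n_clusters)
    estimates = []
    for _ in range(reps):
        # Each cluster is observed with probability q.
        keep = rng.random(n_clusters) < q
        c = cluster[keep[cluster]]
        if c.size == 0:
            continue
        # Simplification: independent assignment with probability 0.55.
        w = rng.binomial(1, 0.55, size=c.size)
        if w.sum() == 0 or w.sum() == w.size:
            continue
        # Assumption: unit-level noise with variance 1.
        y = alpha[c] + tau[c] * w + rng.normal(size=c.size)
        estimates.append(ols_slope(y, w))
    # Spread of the estimator over repeated sampling and assignment.
    return np.std(estimates, ddof=1)
```

Under this reading the benchmark changes with q by construction, because fewer sampled clusters produce more dispersed estimates.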

Thirdly, in generating the outcome variable, they mention that it is drawn from a distribution with variance estimated on the original data. This might be a long shot, but I don’t have (or know) the exact data they use. Could there be a workaround? For now I draw the outcome variable from a multivariate normal with variances 1 and covariances 0.5.
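For reference, draws with variance 1 and covariance 0.5 can be generated without building a large covariance matrix by using a shared cluster component. The Python sketch below assumes the covariance of 0.5 applies only between units in the same cluster and is zero across clusters; that assumption and the function name equicorrelated_noise are mine, not the paper's.

```python
import numpy as np

def equicorrelated_noise(cluster, n_clusters=52, rho=0.5, seed=0):
    """Noise with unit variance and covariance rho within each cluster.

    Assumption: units in different clusters are uncorrelated.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n_clusters)    # component shared within a cluster
    v = rng.normal(size=cluster.size)  # unit-specific component
    # Var = rho + (1 - rho) = 1; within-cluster Cov = rho.
    return np.sqrt(rho) * u[cluster] + np.sqrt(1.0 - rho) * v
```

This is equivalent in distribution to a multivariate normal draw with an equicorrelated covariance block for each cluster.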

Any help in checking and correcting my understanding would be highly appreciated.

Regards,
Abbas