Hi, I've been trying to set up a simple bootstrap around a small program I wrote, and I noticed something odd: values that should not vary across random samples were coming out with standard errors. Puzzled, I wrote a mock program to get to the bottom of this and realized that the cluster() option of bsample was producing something strange in the output.
The raw data has about 400 obs in 4 groups ("forms") and 5 obs per caseid, with all 5 obs for the same caseid being assigned to the same "form." My test program looks like the following:
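program sim1
    preserve
    * draw 10 caseid clusters, with replacement, within each form
    bsample 10, cluster(caseid) strata(form) idcluster(s_id)
    * forms A vs. B: keep the two group sizes and the p-value
    ttest correct if form=="A" | form=="B", by(form)
    scalar n1 = r(N_1)
    scalar n2 = r(N_2)
    scalar p1 = r(p)
    * forms C vs. D: keep the two group sizes and the p-value
    ttest correct if form=="C" | form=="D", by(form)
    scalar n3 = r(N_1)
    scalar n4 = r(N_2)
    scalar p2 = r(p)
    restore
end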
So, each call should produce a random sample of 200 obs: 10 clusters for each of the 4 forms, with 5 obs per cluster. When the program is not run as part of the bootstrap command, nothing unexpected happens. I've used forval loops to generate up to 100 random samples with this very program and found that the samples are, in fact, balanced as expected: n1-n4 are all 50.
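The manual check described above can be sketched roughly as follows. This is an illustration, not the exact loop that was used. It assumes the data are in memory, sim1 is defined as shown, and correct has no missing values, so that each ttest group should contain exactly 50 obs.

```stata
* Sketch: call sim1 directly, without the bootstrap prefix,
* and stop with an error if any group size differs from 50.
forvalues i = 1/100 {
    quietly sim1
    * sim1 leaves n1-n4 behind as global scalars
    assert scalar(n1) == 50 & scalar(n2) == 50
    assert scalar(n3) == 50 & scalar(n4) == 50
}
```

If the loop finishes without an "assertion is false" error, every one of the 100 samples had the expected balance.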
But once this is incorporated as part of the bootstrap command, as follows, something odd happens:
bootstrap n1=n1 n2=n2 n3=n3 n4=n4 p1=p1 p2=p2, ///
    reps(1000) saving(testset, replace): sim1
Once the bootstrap is done and I open the testset.dta file, the simulated n1-n4 are not uniformly 50. For instance:
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          n1 |      1,000       50.56    8.058867         24         80
          n2 |      1,000      50.161    8.800371         25         92
          n3 |      1,000      50.476    8.260973         27         79
          n4 |      1,000      50.095    7.672995         29         81
This seems to happen only when bootstrap is combined with bsample and its cluster() option. When I use only the strata() option, no strange sample sizes are reported. And, as noted above, I don't think the random samples themselves have unusual sizes: when I ran this very program in a forval loop to generate 100 random samples manually, nothing of the sort showed up in any of them. So bootstrap seems to be reporting statistics that are not grounded in the random samples my program actually draws.
My questions: Where are these numbers coming from, and why is Stata doing this? What does it mean for the other statistics it reports? And what can I do to get Stata to report proper numbers, short of generating thousands of random samples manually? I suppose I could use simulate instead, but I am curious what exactly the cluster() option does to produce these numbers in this context. (Using Stata 15, in case it is relevant.)
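A minimal, untested sketch of the simulate route mentioned above, assuming sim1 and the data in memory are as described. simulate calls sim1 repeatedly without drawing a resample of its own, so the bsample call inside sim1 would be the only resampling step. The bootstrap prefix, by contrast, resamples the data in memory before each call to the program, which is one plausible source of the varying counts. The seed value is arbitrary, and referring to the scalars as scalar(n1) and so on is a choice made here to avoid any clash with variable names.

```stata
* Sketch only: run sim1 1,000 times on the data in memory and
* collect the scalars it leaves behind into testset.dta.
set seed 12345    // arbitrary seed, for reproducibility
simulate n1=scalar(n1) n2=scalar(n2) n3=scalar(n3) n4=scalar(n4) ///
    p1=scalar(p1) p2=scalar(p2),                                 ///
    reps(1000) saving(testset, replace): sim1
```

Note that simulate replaces the data in memory with the collected results, so the original dataset has to be reloaded afterwards.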
Thank you so much in advance!