Hello,
I am analysing data for a large number of school students (over 1 million).
Data on parental background (comprising 3 variables) are missing for about 5 percent of students. Few students are missing all 3 variables.
I am using the mi impute chained command. Two of the variables are binary and the other is continuous.
The heavy lifting is being done using Stata MP on an HPC, in batch mode. I'm running 16 CPUs, but I could increase to 32. (I did try testing my code on my 6-core desktop, which didn't go well.)
I am generating 20 imputations (m = 20), with the data in flong style.
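For reference, the setup is roughly as follows (variable names are placeholders for my actual ones):

    mi set flong
    mi register imputed parent_bin1 parent_bin2 parent_cont
    mi impute chained (logit) parent_bin1 parent_bin2 (regress) parent_cont ///
        = age i.school, add(20) rseed(12345) burnin(10)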
The problem is that it has taken 13 hours to produce just one imputation (with 10 iterations), so at this rate it would take about 11 days to generate all 20 imputations.
I can run multiple jobs in parallel. So is it an option to run, say, 10 jobs in parallel, each generating 2 imputations and each using a different seed, and then append the resulting observations together?
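Concretely, each job would run a small do-file along these lines, with the seed and a job number passed in on the command line (all names below are placeholders, and I haven't tested this yet):

    * impute_job.do -- launched as, e.g.:  stata-mp -b do impute_job.do 2001 1
    args seed jobnum
    use students, clear
    mi set flong
    mi register imputed parent_bin1 parent_bin2 parent_cont
    mi impute chained (logit) parent_bin1 parent_bin2 (regress) parent_cont ///
        = age i.school, add(2) rseed(`seed') burnin(10)
    save imputed_job`jobnum', replace

The 10 jobs would then just be 10 submissions of that do-file with different seeds and job numbers.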
I believe this is theoretically sound: imputing with m = 20 essentially draws 20 independent points from the same distribution, and each imputation is independent of the others. That should be no different from imputing with m = 2 ten times over, so long as a different seed is used for each run.
Does anyone disagree?
If the theory is sound, there is then the question of how to append the 10 files together so they look like one file, and how to get the mi settings etc. to work.
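From a quick look at the manuals, mi add looks like it could be the relevant command, since it is described as appending the imputations of a using dataset to those of the master, matched on key variables, but I haven't tried it. Something like this, with student_id as a placeholder key and the job files named as above:

    use imputed_job1, clear
    forvalues j = 2/10 {
        mi add student_id using imputed_job`j'
    }
    mi describe        // hoping this now reports M = 20
    save imputed_all, replace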
Has anyone got any experience doing this? With the data on the HPC in batch/command-line mode, my usual trick of going in to inspect the data interactively isn't possible.
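Presumably any checking would have to be done through the log, with something like:

    mi query            // reports the mi style and number of imputations M
    mi describe         // registered variables and missing-value counts
    tabulate _mi_m      // in flong style, observations per imputation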
Regards,
Andrew