Hello,
I am analysing data for a large number of school students (over 1 million).
Data on parental background (3 variables) are missing for about 5 percent of students. Few students are missing all 3 variables.
I am using the mi impute chained command.
Two variables are binary. The other is continuous.
The heavy lifting is being done with Stata/MP on an HPC cluster, in batch mode. I'm running on 16 CPUs, but I could increase to 32.
(I did try testing my code on my 6-core desktop, which didn't go well.)
I am generating 20 imputations (m=20), with the data mi set as flong.
The problem is that it has taken 13 hours to produce just one imputation (with 10 iterations).
At this rate it would take about 11 days (20 x 13 = 260 hours) to generate all 20 imputations.
I can run multiple jobs in parallel.
So, is it an option to run, say, 10 jobs in parallel, where each job generates 2 imputations and uses a different seed,
and then append the resulting observations together?
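To make the idea concrete, here is a minimal sketch of what each parallel job might run. The file name, variable names, predictors (x1, x2), seed values, and the way the job number is passed in are all hypothetical; the 10 iterations are assumed to correspond to burnin(10).

```stata
* impute_job.do: one parallel job producing 2 of the 20 imputations
* Hypothetical call from the scheduler:  stata-mp -b do impute_job.do 3
args job                          // job number, 1 to 10

use students, clear               // hypothetical file name
mi set flong
mi register imputed par_bin1 par_bin2 par_cont   // hypothetical names

* A different seed for each job, so the draws differ across jobs
local seed = 20000 + `job'

* Two binary variables by logit, the continuous one by linear regression
mi impute chained (logit) par_bin1 par_bin2 (regress) par_cont ///
    = x1 x2, add(2) burnin(10) rseed(`seed')

save imp_job`job', replace
```

Each job starts from the same original data and differs only in its seed and output file name.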
I believe it is theoretically sound, since imputing with m = 20 is essentially taking 20 random draws from the distribution, and each imputation is independent of the others.
I believe that is no different from imputing with m = 2 ten times, so long as a different seed is used each time.
Does anyone disagree?
If the theory is sound, there is then the question of how to append the 10 files together so that they look like one file, and so that the mi settings etc. still work.
Has anyone got any experience doing this? With the data on the HPC in command line mode, my usual trick of going in to inspect the data isn't possible.
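For reference, a sketch of how I imagine the combining step could look, assuming mi add is the right tool for this. The file names and the identifier student_id are hypothetical, and I have not been able to test this at scale.

```stata
* combine_jobs.do: stack the 10 job files into one mi dataset with M = 20
use imp_job1, clear               // hypothetical file, already mi set flong, M = 2

forvalues k = 2/10 {
    * student_id is a hypothetical variable that uniquely identifies students
    mi add student_id using imp_job`k'
}

* Checks that work in batch mode; the output goes to the log file
mi query                          // reports the mi style and M
assert r(M) == 20                 // stop with an error if imputations are missing
mi describe
save imputed_all, replace
```

Since I cannot browse the data on the HPC, the assert and the mi describe output in the log would stand in for my usual visual inspection.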
Regards,
Andrew