Hi, I’m working with very large datasets: each contains around 1 million observations and 20+ variables.

Let’s call these datasets the hospital dataset and the procedure dataset.

Question: in your experience, would you recommend keeping the number of merges to a minimum and merging once, then dealing with replicates from that single merge? Or do you think the following process is equally effective? I would think the process below is more manageable, since at each stage one knows exactly which values are unique and there shouldn’t be any issues when merging 1:1.

Step 1:
Identify the unique HES values in the hospital dataset and merge them 1:1 onto my procedure dataset.
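
To make Step 1 concrete, here is roughly what I have in mind in Stata (hesid, hospitaldataset.dta and proceduredataset.dta are placeholder names, not necessarily my real variable and file names):

* keep only the hospital records whose HES ID appears exactly once
use hospitaldataset, clear
duplicates tag hesid, generate(dup)
keep if dup == 0
drop dup
save hospital_unique, replace

* merge those uniquely identified records 1:1 onto the procedure dataset
use proceduredataset, clear
merge 1:1 hesid using hospital_unique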

Step 2:
Identify the replicate values in the hospital dataset, keep the first of each, and merge them 1:1 onto my procedure dataset.
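
A rough sketch of Step 2, with the same placeholder names; "first" here is simply whatever sorts first within each HES ID, which may not be the right record to keep:

* from the hospital records with a duplicated HES ID, keep the first occurrence
use hospitaldataset, clear
duplicates tag hesid, generate(dup)
keep if dup > 0
bysort hesid: keep if _n == 1
drop dup
save hospital_dups_first, replace

* merge these onto the procedure dataset
* (in practice onto the result of Step 1, after dropping its _merge variable)
use proceduredataset, clear
capture drop _merge
merge 1:1 hesid using hospital_dups_first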

Step 3:
Look at the records with the same procedureno but different hospitalid, work out what differs between them, and then do the final merge onto the procedure dataset.
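
And the kind of check I have in mind for Step 3, assuming the variables are called procedureno and hospitalid and both sit in the hospital dataset:

* flag procedure numbers that appear under more than one hospital ID
use hospitaldataset, clear
bysort procedureno (hospitalid): generate conflict = hospitalid[1] != hospitalid[_N]
list procedureno hospitalid if conflict, sepby(procedureno)
* once the conflicts are resolved, the surviving records would be merged onto the
* procedure dataset as the final 1:1 merge (resolution rule still to be decided)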

That’s a total of 3 merges. Would you recommend this, or would you recommend just doing one merge from the hospital dataset onto the procedure dataset and then dealing with the replicates within the merged dataset, given that the process above requires 3 merges?
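
For comparison, the single-merge alternative I’m weighing would look roughly like this (again with placeholder names), bringing everything across in one go and resolving replicates inside the merged file:

* one merge: procedure dataset is unique on the key, hospital dataset may repeat it
use proceduredataset, clear
merge 1:m hesid using hospitaldataset

* see how many procedure records picked up more than one hospital record
duplicates report procedureno

* resolve the replicates afterwards, e.g. keep the first per procedure
* (the actual rule would need more care than this)
bysort procedureno: keep if _n == 1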