Dear Statalist

I have a dataset with 32 million observations and around 30 variables. I want to perform an operation on one variable (diag) to label ICD10 codes. To make this run more quickly, I created an id variable (gen id = _n) and saved the dataset as "original_file".

I then dropped all variables except id and diag and ran the operation on diag_01, which took several hours.

I then wanted to merge the result back with the original file using:

merge 1:1 id using original_file

But my id variable does not uniquely identify observations in either file. When I look at the data in the Data Editor, I see that at large values the id variable repeats itself.

Does anyone know why this happens, and how to get around it? Should I be specifying the format of the id variable, to make sure it is long enough not to round the large numbers?
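
One possible explanation, offered as a sketch rather than a confirmed diagnosis: when no storage type is given, generate creates a float unless the default has been changed with set type, and a float holds integers exactly only up to 16,777,216. With 32 million observations, larger values of _n would be rounded to the nearest representable number, which would produce repeated ids. If that is the cause, it is the storage type rather than the display format that matters. The example below uses a freshly created dataset, and the variable names id_float and id_long are made up for illustration.

```stata
* Sketch only: assumes id was created with the default storage type (float)
clear
set obs 32000000

* No type given, so this is a float under the default settings
generate id_float = _n

* Requesting long stores integers of this size exactly
generate long id_long = _n

* Compare storage types, then look for repeated values in each variable
describe id_float id_long
duplicates report id_float
duplicates report id_long
```

If id_long shows no surplus copies while id_float does, recreating the id with generate long id = _n (or generate double id = _n) before saving original_file should let merge 1:1 id using original_file run as intended. Since the stored id values would already be rounded in both files, the id would probably need to be regenerated rather than reformatted or recast.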

Any help would be much appreciated.

Best wishes,

Joe