I am currently working on matching two databases, that entail observations on about 600'000 individuals. Some numbers of individuals are the same accross the two datasets, and the goal is to identify which ones. I use Stata 14.
There are three common variables that are used to identify the individuals: birthdates, city of birth and first name, which are string variables.
One individual will for example have the observation "NAME" "xx.yy.zzzz" "CITY" in both datasets.
Please note no other variable can be used to identify the individuals, due to data limitations.
In order to conduct the merge, I encoded the three variables to get numerical values (using encode, gen as usual)
I merged 1:1 the three variables simultaneously.
Code:
use "xx1" keep firstname1 birthdate city_birthplace_c encode firstname1, gen(firstname1_num) encode birthdate, gen(birthdate_num) encode city_birthplace_c, gen(city_birthplace_c_num) keep firstname1_num birthdate_num city_birthplace_c_num save use "xx2" keep firstname1_c birthdate city_birthplace_c encode birthdate, gen(birthdate_num) encode city_birthplace_c, gen(city_birthplace_c_num) encode firstname1_c, gen(firstname1_num) keep birthdate_num city_birthplace_c_num firstname1_num use "xx1" merge 1:1 birthdate_num firstname1_num city_birthplace_c using "xx2"
Which I understand means more than one individual has the same first name, the same birthdate and was born in the same city, which is actually quite likely due to the high number of observations.
My question if the following: is there any comprehensive way to identify which individuals have those 3 identical observations? I've looked around on Statlist, but cannot find anything directly related to this issue.
Thank a lot for you help,
Best regards,
Simon
0 Response to Merging Data - Isolate non-uniquely identified observations
Post a Comment