Dear all,

I am currently working on matching two databases, that entail observations on about 600'000 individuals. Some numbers of individuals are the same accross the two datasets, and the goal is to identify which ones. I use Stata 14.

There are three common variables that are used to identify the individuals: birthdates, city of birth and first name, which are string variables.
One individual will for example have the observation "NAME" "xx.yy.zzzz" "CITY" in both datasets.
Please note no other variable can be used to identify the individuals, due to data limitations.

In order to conduct the merge, I encoded the three variables to get numerical values (using encode, gen as usual)
I merged 1:1 the three variables simultaneously.

Code:
use "xx1"
keep firstname1 birthdate city_birthplace_c
encode firstname1, gen(firstname1_num)
encode birthdate, gen(birthdate_num)
encode city_birthplace_c, gen(city_birthplace_c_num)
keep firstname1_num birthdate_num city_birthplace_c_num
save

use "xx2"
keep firstname1_c birthdate city_birthplace_c
encode birthdate, gen(birthdate_num)
encode city_birthplace_c, gen(city_birthplace_c_num)
encode firstname1_c, gen(firstname1_num)
keep birthdate_num city_birthplace_c_num firstname1_num


use "xx1"
merge 1:1 birthdate_num firstname1_num city_birthplace_c using "xx2"
I get the r(459) error message: variable birthdate_num firstname1_num city_birthplace_c_num does not uniquely identify observations in the master data
Which I understand means more than one individual has the same first name, the same birthdate and was born in the same city, which is actually quite likely due to the high number of observations.

My question if the following: is there any comprehensive way to identify which individuals have those 3 identical observations? I've looked around on Statlist, but cannot find anything directly related to this issue.


Thank a lot for you help,

Best regards,

Simon