I have a dataset with several variables that I'm being asked to match on in order to come up with some kind of "highly probable duplicate" score. For example, I have name, age, and address variables. If the name and address match perfectly but the age does not, I would suspect two different people. However, if the ages are within a year of each other, or match exactly, then I would assume they are the same person and flag one observation as a duplicate.
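To make that rule concrete, here is a minimal sketch in Python (used only for illustration, not Stata). The function name `likely_duplicate`, the dictionary keys, and the one-year tolerance are all assumptions chosen for the example, not part of any existing tool.

```python
def likely_duplicate(a, b, max_age_gap=1):
    """Return True when two records look like the same person.

    Assumed rule: name and address must match exactly, and the ages
    must differ by at most max_age_gap years.
    """
    same_name = a["name"] == b["name"]
    same_address = a["address"] == b["address"]
    close_age = abs(a["age"] - b["age"]) <= max_age_gap
    return same_name and same_address and close_age


# Hypothetical records, made up for this example only.
rec1 = {"name": "Abe Lincoln", "age": 56, "address": "1600 Pennsylvania Ave"}
rec2 = {"name": "Abe Lincoln", "age": 57, "address": "1600 Pennsylvania Ave"}
rec3 = {"name": "Abe Lincoln", "age": 30, "address": "1600 Pennsylvania Ave"}

print(likely_duplicate(rec1, rec2))  # ages one year apart
print(likely_duplicate(rec1, rec3))  # ages far apart
```

A real score would presumably be graded rather than yes/no, but the same three comparisons would feed into it.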

One difficulty I'm having is that the name variable is an agglutinated string with FIRST LAST MIDDLE SALUTATION, etc. all possibly crammed into one value. Some are as short as one "word", others as long as seven. It just depends on the data collector and the respondent.

To add to this, the order is not set. So some subjects list FIRST LAST while others list LAST FIRST. It's this problem that I want to try to tackle first.

Does anyone have any suggestions on how to go about this? My instinct is to create two new variables for the first and second words in the name field, but then I'm unclear how to run any kind of -duplicates- function on the variables such that it would criss-cross them and flag the people who appear to be the same.

Name1 Name2 Duplicate
Abe Lincoln Yes
Ada Lovelace No
Lincoln Abe Yes
Earheart Amelia Yes
Hamilton Alexander No
Earheart Amelia Yes
Amelia Earheart Yes
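One possible way to get the criss-cross behaviour is to build an order-insensitive key from the two name words (lowercase them, sort them alphabetically, and join them), then flag any key that occurs more than once. The sketch below is in Python purely for illustration; `name_key` and `flag_duplicates` are hypothetical helper names. The same idea might carry over to Stata by generating a sorted key variable and running -duplicates tag- on it, though that is untested here.

```python
from collections import Counter


def name_key(name1, name2):
    """Build a key that ignores word order and letter case."""
    words = [name1.strip().lower(), name2.strip().lower()]
    return " ".join(sorted(words))


def flag_duplicates(rows):
    """Return one True/False flag per row: True if the key occurs more than once."""
    keys = [name_key(n1, n2) for n1, n2 in rows]
    counts = Counter(keys)
    return [counts[k] > 1 for k in keys]


# A few rows taken from the example table above.
rows = [
    ("Abe", "Lincoln"),
    ("Lincoln", "Abe"),
    ("Hamilton", "Alexander"),
    ("Earheart", "Amelia"),
    ("Amelia", "Earheart"),
]

for (n1, n2), dup in zip(rows, flag_duplicates(rows)):
    print(n1, n2, "Yes" if dup else "No")
```

This only handles the two-word case. Names with middle names or salutations would need extra cleaning before the key is built.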

Secondly, when I get to the addresses, I'd like to develop some kind of fuzzy match so that misspellings and other data-entry variations can be sorted out.
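As a rough illustration of what such a fuzzy match could look like, the sketch below uses `difflib.SequenceMatcher` from Python's standard library to score how similar two address strings are. The helper names, the sample addresses, and the 0.85 cutoff are assumptions made for the example. A sensible threshold would have to be tuned against real data.

```python
from difflib import SequenceMatcher


def address_similarity(addr1, addr2):
    """Return a similarity ratio between 0 and 1, ignoring case and extra spaces."""
    a = " ".join(addr1.lower().split())
    b = " ".join(addr2.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_address_match(addr1, addr2, threshold=0.85):
    """Treat two addresses as matching when their similarity reaches the threshold."""
    return address_similarity(addr1, addr2) >= threshold


# Hypothetical addresses: one misspelling, one unrelated address.
print(fuzzy_address_match("123 Main Street", "123 Main Stret"))
print(fuzzy_address_match("123 Main Street", "45 Oak Avenue"))
```

Standardising abbreviations first (for example "St" versus "Street") would likely improve the scores, since the ratio compares only the characters.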

Perhaps this kind of thing already exists, either in Stata or in another program? Any guidance or suggestions are most graciously welcome.