Matching subjects across multiple variables and within variable

I have a dataset with several variables that I'm being asked to match on in order to come up with some kind of "highly probable duplicate" score. For example, I have name, age, and address variables. If the name and address match perfectly but the age does not, I would suspect two different people. However, if the ages are within a year of each other, or match exactly, then I would assume they are the same person and flag one observation as a duplicate.
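
To make that rule concrete, here is a minimal sketch of it. It is written in Python rather than Stata, and the `Person` record, its field names, and the one-year `age_tolerance` default are all assumptions made for illustration, not part of the actual dataset.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record layout; the real variable names may differ.
    name: str
    age: int
    address: str


def probable_duplicate(a: Person, b: Person, age_tolerance: int = 1) -> bool:
    """Flag a pair when name and address match exactly (ignoring case and
    surrounding spaces) and the ages differ by at most age_tolerance years."""
    same_name = a.name.strip().lower() == b.name.strip().lower()
    same_address = a.address.strip().lower() == b.address.strip().lower()
    close_age = abs(a.age - b.age) <= age_tolerance
    return same_name and same_address and close_age
```

A pair with identical name and address but ages of 40 and 65 would not be flagged, while ages of 40 and 41 would be.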

One difficulty I'm having is that the name variable is a single concatenated string with FIRST LAST MIDDLE SALUTATION, etc., all possibly crammed into one value. Some are as short as one "word", others as long as seven. It just depends on the data collector and the respondent.

To add to this, the order is not set. So some subjects list FIRST LAST while others list LAST FIRST. It's this problem that I want to try to tackle first.

Does anyone have any suggestions on how to go about this? My instinct is to create two new variables for the first and second words in the name blank, but then I'm unclear how to run any kind of -duplicates- function on the variables such that it would criss-cross them and flag the people who appear to be the same.

Name1	Name2	Duplicate
Abe	Lincoln	Yes
Ada	Lovelace	Yes
Lincoln	Abe	Yes
Earheart	Amelia	Yes
Hamilton	Alexander	No
Earheart	Amelia	Yes
Amelia	Earheart	Yes
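
One way to think about the criss-cross problem is to build a key that ignores word order, so that "Abe Lincoln" and "Lincoln Abe" compare as equal. The following is only a sketch of that idea in Python, not a Stata solution; the helper names `name_key` and `flag_duplicates` are made up for illustration, and the rows are a subset of the names in the table above.

```python
from collections import Counter


def name_key(*parts: str) -> tuple:
    """Build an order-insensitive key: split every part into words,
    lowercase them, and sort them alphabetically."""
    words = []
    for part in parts:
        words.extend(part.lower().split())
    return tuple(sorted(words))


def flag_duplicates(rows):
    """Return one boolean per row: True when another row shares its key."""
    keys = [name_key(*row) for row in rows]
    counts = Counter(keys)
    return [counts[key] > 1 for key in keys]


rows = [
    ("Abe", "Lincoln"),
    ("Lincoln", "Abe"),
    ("Hamilton", "Alexander"),
    ("Earheart", "Amelia"),
    ("Amelia", "Earheart"),
]
flags = flag_duplicates(rows)
```

Because the key is built from sorted words, it also works when the whole name sits in a single string of varying length. It treats two different people who share the same set of name words as a match, so it is best combined with age and address checks.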

Secondly, when I get to the addresses, I'd like to develop some kind of fuzzy match so that misspellings and other data-entry variations can be sorted out.
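
As a rough illustration of what such a fuzzy match might look like, the sketch below normalizes two addresses and compares them with `difflib.SequenceMatcher` from the Python standard library. The helper names and the 0.85 threshold are assumptions chosen for the example; a real cutoff would need tuning against the actual data.

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in address.lower()
    )
    return " ".join(cleaned.split())


def address_similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 for two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # The threshold is an assumed starting point, not a recommended value.
    return address_similarity(a, b) >= threshold
```

With this sketch, "123 Main Street" and "123 Main Stret" would count as a match, while two unrelated addresses would fall well below the threshold.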

Perhaps this kind of thing already exists, either in Stata or in another program? Any guidance or suggestions would be most welcome.

BJ Data Tech Solution
