One difficulty I'm having is that the name variable is an agglutinated string, with FIRST, LAST, MIDDLE, SALUTATION, etc. all possibly crammed into one value. Some are as short as one "word", others as long as seven. It just depends on the data collector and the respondent.
To add to this, the order is not fixed: some subjects are listed FIRST LAST while others are listed LAST FIRST. This is the problem I want to tackle first.
Does anyone have any suggestions on how to go about this? My instinct is to create two new variables for the first and second words in the name blank, but then I'm unclear how to run any kind of -duplicates- command on those variables such that it would criss-cross the two and flag the people who appear to be the same, regardless of word order. For example:
Name1 | Name2 | Duplicate |
Abe | Lincoln | Yes |
Ada | Lovelace | No |
Lincoln | Abe | Yes |
Earhart | Amelia | Yes |
Hamilton | Alexander | No |
Earhart | Amelia | Yes |
Amelia | Earhart | Yes |
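To make the criss-cross idea concrete, here is a rough sketch of what I have in mind, written in Python rather than Stata only because it is quick to try out. It assumes that only the first two words of the name matter and that case can be ignored; the helper `name_key` and the sample names are made up for illustration. The two words are sorted alphabetically so that FIRST LAST and LAST FIRST collapse to the same key, and any key that occurs more than once is flagged.

```python
from collections import Counter

def name_key(raw_name):
    """Order-insensitive key built from the first two words of a name.

    Assumption: only the first two words are used, and case is ignored.
    """
    words = raw_name.lower().split()[:2]
    return tuple(sorted(words))

# Illustrative names only, not real data.
names = [
    "Abe Lincoln",
    "Ada Lovelace",
    "Lincoln Abe",
    "Earhart Amelia",
    "Hamilton Alexander",
    "Earhart Amelia",
    "Amelia Earhart",
]

# Count how often each order-insensitive key occurs.
counts = Counter(name_key(n) for n in names)

# A record is a suspected duplicate if its key occurs more than once.
flags = [counts[name_key(n)] > 1 for n in names]

for name, is_dup in zip(names, flags):
    print(f"{name:20s} {'Yes' if is_dup else 'No'}")
```

Presumably the same trick would work in Stata by storing the sorted pair in a new variable and running -duplicates tag- on it, but I have not worked that out.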
Secondly, when I get to the addresses, I'd like to develop some kind of fuzzy match so that misspellings and other data-entry variations can be sorted out.
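As a rough illustration of the kind of fuzzy match I mean, the sketch below uses `difflib.SequenceMatcher` from the Python standard library to score every pair of addresses and keep the pairs above a cutoff. The addresses, the `normalize` and `similarity` helpers, and the 0.9 threshold are all assumptions for the sake of the example; a real cutoff would need to be tuned against the data, and comparing every pair would get slow on a large file.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(address):
    """Lowercase, drop periods and commas, and collapse repeated spaces."""
    cleaned = address.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())

def similarity(a, b):
    """Similarity ratio between two normalized addresses, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Illustrative addresses only, not real data.
addresses = [
    "123 Main Street",
    "123 Main Stret",
    "45 Oak Avenue",
]

THRESHOLD = 0.9  # arbitrary cutoff chosen for this example

# Compare every pair once and keep those that look like the same address.
likely_matches = [
    (a, b)
    for a, b in combinations(addresses, 2)
    if similarity(a, b) >= THRESHOLD
]

for a, b in likely_matches:
    print(f"{a!r} ~ {b!r} ({similarity(a, b):.2f})")
```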
Perhaps this kind of thing already exists, either in Stata or in another program? Any guidance or suggestions are most graciously welcome.