Fuzzy match across multiple variables within a dataset

I have a dataset of about 15,000 observations on patients, many of which are duplicates. A lot of information is missing, however, and the duplicates are not exact, so I would like to fuzzy match on (ideally) three string variables. I have experimented with matchit and reclink, but there are obvious problems if I try to merge the dataset with itself (every observation matches itself perfectly), and I haven't worked out how to overcome this without knowing in advance which observations are duplicates. Apologies in advance if the answer is posted somewhere; I'm sure I'm not the first person to come across this problem, but I haven't found it anywhere.

The data are confidential, but I've created a toy example below. Ideally, a solution would identify four unique patients (Remy, Eleanor, Jo and Josh). I'm using Stata 15.

Thanks.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str18(name dob_y dob_m)
"Jo Bloggs"     "1997" "March"    
"Remi W"        "1984" "September"
"Jo"            "1997" "March"    
"Remy W"        "1984" "Sept"    
"Eleanor Jones" "1989" ""        
"Eleanor J"     "1989" "January"  
"Remy W"        "1984" "September"
"Jo Blogg"      "1997" ""        
"Eleanor J"     "1989" "January"  
"Josh Hastings" "2001" "January"  
end
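
One possible way around the self-match problem described above is sketched here; it is not a tested solution for the real data. The idea is to match the dataset against a renamed copy of itself with matchit, then drop every pair in which the first identifier is not smaller than the second, which removes both the perfect self-matches and the mirror-image pairs. The variables id, mon and key, the file name matchit_copy.dta and the 0.5 threshold are all assumptions made for illustration. Truncating the month to three letters is only a guess at how the real month values vary.

```stata
* Assumes the toy data above are in memory and that -matchit- is installed
* from SSC (ssc install matchit), together with its companion -freqindex-,
* which matchit may ask for.

* Hypothetical helper variables: an observation id and one combined string.
gen long id = _n
gen str3 mon = lower(substr(dob_m, 1, 3))   // "Sept" and "September" both become "sep"
gen str60 key = lower(name) + " " + dob_y + " " + mon

* Save a copy under different variable names to act as the "using" file.
* The file name is an arbitrary choice for this sketch.
preserve
keep id key
rename (id key) (id2 key2)
save "matchit_copy.dta", replace
restore

* Fuzzy match the dataset against its own copy.
* threshold() is a similarity cut-off that will need tuning on real data.
matchit id key using "matchit_copy.dta", idusing(id2) txtusing(key2) threshold(0.5)

* Drop self-matches (id == id2) and keep each remaining pair only once.
drop if id >= id2

* Inspect the candidate duplicate pairs, best matches first.
gsort -similscore
list id key id2 key2 similscore, noobs
```

The result is a list of candidate pairs, not yet a set of unique patients: pairs that share an observation would still need to be chained into groups, and very short names such as "Jo" may score poorly against "Jo Bloggs" and call for a lower threshold or a separate comparison of first names.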

BJ Data Tech Solution
