I have a dataset of about 15,000 observations on patients, many of which are duplicate records of the same patient. The duplicates are not exact, and a lot of information is missing, so I would like to do fuzzy matching based on (ideally) three string variables. I have experimented with matchit and reclink, but merging the dataset with itself causes an obvious problem: every observation has a perfect match, namely itself. I haven't worked out how to get around this without knowing in advance which observations are duplicates. My apologies if the answer is already posted somewhere; I'm sure I'm not the first person to come across this problem, but I haven't found it anywhere. The data are confidential, so I've created a toy example below. Ideally, a solution would identify four unique patients (Remy, Eleanor, Jo and Josh). I'm using Stata 15.
Thanks.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str18(name dob_y dob_m)
"Jo Bloggs" "1997" "March"
"Remi W" "1984" "September"
"Jo" "1997" "March"
"Remy W" "1984" "Sept"
"Eleanor Jones" "1989" ""
"Eleanor J" "1989" "January"
"Remy W" "1984" "September"
"Jo Blogg" "1997" ""
"Eleanor J" "1989" "January"
"Josh Hastings" "2001" "January"
end
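One possible way around the self-match problem, offered only as a rough sketch: give every observation a numeric id, match the dataset against a renamed copy of itself with matchit, and then drop the rows where the two ids are equal or in reverse order. The variables id, key, id2 and key2 and the 0.6 threshold are illustrative assumptions, not part of the original data, and the sketch assumes matchit and freqindex are installed from SSC.

```stata
* Sketch only: assumes the toy data above are in memory and that
* matchit (plus freqindex, which it relies on) is installed from SSC:
*   ssc install matchit
*   ssc install freqindex

* Hypothetical helper variables: a numeric id and a combined text key
gen long id = _n
gen str60 key = lower(strtrim(name + " " + dob_y + " " + dob_m))

* Save a renamed copy so the dataset can be matched against itself
preserve
keep id key
rename (id key) (id2 key2)
tempfile self
save `self'
restore

* Bigram similarity is matchit's default; the 0.6 cutoff is illustrative only.
* override lets matchit replace the unsaved data in memory.
matchit id key using `self', idusing(id2) txtusing(key2) threshold(0.6) override

* Drop each observation's match with itself (id == id2) and the mirrored
* copy of every pair (id > id2), keeping one row per candidate pair
drop if id >= id2

gsort -similscore
list id key id2 key2 similscore, noobs
```

What remains is a list of candidate duplicate pairs, not yet a patient identifier: pairs that chain together (A matches B, B matches C) still need to be grouped into one patient. The threshold would need tuning against the real data, since very short names such as "Jo" may score low against longer spellings.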