Fuzzy match inconsistent identifiers in a panel

Dear Statalist,

I've been working on what feels like a common problem, but I can't find any threads on it. I'm trying to create consistent identifiers for a panel dataset of cities whose names have slight spelling differences over time, using some kind of fuzzy matching algorithm. A very simple example is below. In my real data, I have many hundreds of place names spelled differently over 30 years, so manual fixes are infeasible. I know that no place is observed more than once in the same year. The end result I'm looking for is a third variable, a numeric place_id, that would consistently identify Chicago and New York in the example below. If anyone has run into this problem before and could give a general overview of their workflow, it would be much appreciated; I don't necessarily need specific code. I've been using reclink from SSC, but I'm open to anything. Let me know if I've left out important details or can give more background. Thanks!

Code:

clear
input str20 place_name year
"chicago" 1990
"chicag" 1991
"chicago" 1992
"new york" 1990
"new york city" 1991
"new york c" 1992
end
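
For concreteness, one possible shape for such a workflow is sketched below. It is not a tested solution: it assumes the user-written strgroup command from SSC (not part of official Stata) is installed, and the threshold(0.35) and normalize(longer) settings are illustrative guesses that would need tuning on the real data. The variable clean_name is a hypothetical helper introduced only for this sketch.

```stata
* Sketch only. strgroup is user-written (SSC), not built-in:
*   ssc install strgroup

clear
input str20 place_name year
"chicago" 1990
"chicag" 1991
"chicago" 1992
"new york" 1990
"new york city" 1991
"new york c" 1992
end

* Normalize case and whitespace so trivial differences do not count as edits
generate clean_name = lower(strtrim(stritrim(place_name)))

* Group names by normalized Levenshtein distance.
* The threshold and normalization here are assumptions to be tuned.
strgroup clean_name, generate(place_id) threshold(0.35) normalize(longer)

* Inspect the proposed groups by eye before trusting them
sort place_id year
list place_id place_name year, sepby(place_id)

* A place should appear at most once per year, so surplus copies
* reported here suggest that distinct places were merged
duplicates report place_id year
```

The final check uses the fact that no place is observed twice in the same year: any place_id with repeated years signals a threshold that is too loose, which gives a way to tune it without reviewing every name by hand.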
