Dear Statalist,

I've been working on what feels like it should be a common problem, but I can't find any threads on it. I'm trying to create consistent identifiers for a panel dataset of cities whose names have slight spelling differences over time, using some kind of fuzzy matching algorithm.

A very simple example is below. In my real data, I have many hundreds of place names spelled differently over 30 years, so manual fixes are infeasible. I know that no place is observed more than once in the same year. The end result I'm looking for is a third variable, a numeric place_id, that would consistently identify Chicago and New York in the example below.

If anyone has run into this problem before and could give a general overview of their workflow, it would be much appreciated. I don't necessarily need specific code. I've been using reclink from SSC, but I'm open to anything. Let me know if I've left out important details or can give more background. Thanks!


Code:
clear
input str20 place_name year
"chicago" 1990
"chicag" 1991
"chicago" 1992
"new york" 1990
"new york city" 1991
"new york c" 1992
end