I am trying to match data on cities in two data sets using only the names of the cities. Unfortunately, there are many variants for how names can be presented.
For example,
in one data set the name of a city is TERREBONNE PARISH CONSOLIDATED GOVERNMENT but
in the other data set the name of the city is TERREBONNE CONSOLIDATED GOVERNMENT.
These are almost certainly the same city, but when I use the matchit command the two names do not come back as a perfect match.
I cannot just rely on matchit's similarity scores, because the data also contain cases like APPLEGATE VILLAGE, which is matched to THE VILLAGE OF DOUGLAS. These two are almost certainly not the same city, but I get a matchit score of .5 because both contain the word VILLAGE.
I thought about creating a variable for each word in the city name and then dropping words like "Village", "Town", "of", etc. But even if I programmed that, I can't figure out any reasonably efficient strategy to find that "TERREBONNE" appears in both names, since the word order can differ across data sets.
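To make the idea concrete, here is a rough sketch of what I have in mind, written in Python rather than Stata purely for illustration. It treats each name as an unordered set of words, so word order stops mattering, and it scores a pair by the share of words the two names have in common after the generic words are dropped. The stop-word list and the names key_words and word_overlap are my own placeholders, not anything that matchit provides.

```python
import re

# Hypothetical stop-word list (an assumption); extend it to suit the data.
STOP_WORDS = frozenset({"VILLAGE", "TOWN", "CITY", "OF", "THE"})


def key_words(name):
    """Return the set of upper-cased words in a name, minus the stop words."""
    tokens = re.findall(r"[A-Z0-9]+", name.upper())
    return frozenset(t for t in tokens if t not in STOP_WORDS)


def word_overlap(name_a, name_b):
    """Jaccard overlap of the two word sets: shared words / all distinct words."""
    a, b = key_words(name_a), key_words(name_b)
    if not a or not b:
        # A name made up only of stop words carries no usable information.
        return 0.0
    return len(a & b) / len(a | b)


print(word_overlap("TERREBONNE PARISH CONSOLIDATED GOVERNMENT",
                   "TERREBONNE CONSOLIDATED GOVERNMENT"))
print(word_overlap("APPLEGATE VILLAGE", "THE VILLAGE OF DOUGLAS"))
```

With this stop-word list the two TERREBONNE names share three of four distinct words, while the APPLEGATE and DOUGLAS names share none once VILLAGE is removed. Comparing every name in one data set against every name in the other this way is still quadratic in the number of cities, so it does not solve the efficiency problem by itself.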
Suggestions for how to proceed would be appreciated.