I am currently working on a dataset of government contracts and have run into an issue while cleaning it. Many of the contractor names are spelled in multiple ways, as in the example below.
Code:
"2 fast entertaiment"
"2 fast entertainment"
"2 fast entertainment copr"
"2 fast entertainment corp"
"2 fast entertainment corp."
I have about 269,000 observations and about 97,000 unique values of contractor name. Assuming the average number of different spellings per contractor is somewhere between 1.5 and 2 (I guesstimated this by eye, going over a collapsed list of names), I should have roughly 48,000 to 64,000 distinct contractors. Using -strtrim-, -stritrim-, -strlower-, -usubinstr-, and -ustrnormalize- to address the most obvious and common problems, I have been able to reduce the number of unique names to about 90,000, but this is far from ideal. To illustrate, this is the reduced list for the example name.
Code:
"2 fast entertaiment"
"2 fast entertainment"
"2 fast entertainment corp"
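For reference, the cleaning step was roughly along the lines of the sketch below. The variable names contractor and clean_name are placeholders, and the two substitutions shown (dropping periods, fixing "copr") are only illustrative of the kind of fixes applied, not the full list.

```stata
* Sketch only: assumes the raw name is in a string variable
* called contractor (placeholder name).

* Normalize Unicode to a single composed form, then lowercase.
generate clean_name = ustrnormalize(contractor, "nfkc")
replace clean_name = strlower(clean_name)

* Illustrative substitutions: drop periods, fix one known typo.
* The fourth argument "." means replace all occurrences.
replace clean_name = usubinstr(clean_name, ".", "", .)
replace clean_name = usubinstr(clean_name, " copr", " corp", .)

* Trim leading/trailing blanks and collapse internal runs of blanks.
replace clean_name = stritrim(strtrim(clean_name))

* Count how many distinct cleaned names remain.
egen byte first_obs = tag(clean_name)
count if first_obs
```

Rules like these only catch patterns I already know about, which is why the count stalls at about 90,000.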
The problems I am facing are: 1) I don't necessarily want to remove the corp/llc/llp/inc components of company names, because I fear that something like ACME llc and ACME corp could be two different entities and I would wrongly group them into one. In a case like the example above, though, it is clearly the same entity, and I could safely remove or add the corp. 2) I don't know how to fix misspellings such as entertaiment for entertainment without going mistake by mistake and using -replace-. If this dataset weren't so large I wouldn't mind doing this by hand, but it seems like a poor use of time to spend at the very least a whole day scouring through names to identify and correct these mistakes. Please let me know if you have any ideas or suggestions on how to tackle this problem.
Your help and guidance are greatly appreciated. Have a great [insert appropriate time of day here] and I hope you are staying safe and healthy!