I am trying to compare a string variable with several others for similarity.
The goal is to compare the variable "investor_name" to the company names listed in variables firm1 – firm3. If "investor_name" matches any one of them, then the investor name is correct.
As you can see, one difficulty is that the strings are not always identical, e.g. Blue Ocean Partners LLC vs. Blue Ocean.
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte person_id str32 investor_name str13 firm1 str19 firm2 str6 firm3
1 "Blue Ocean Partners LLC" "Blue Ocean" "Goldman Sachs" ""
2 "Goldman Sachs" "Goldman" "Breakthrough Energy" ""
3 "JP Morgan" "Deutsche Bank" "" ""
4 "Kleiner Perkins Caufield & Byers" "" "Kleiner Perkins" "Google"
end
One approach I thought of was to run matchit three times (once for each of firm1 – firm3) and then keep the highest similarity score per observation. Do you have any other suggestions? Many thanks in advance!
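For reference, a minimal sketch of that idea, assuming matchit (from SSC) is installed and used in its two-column form. The variable names sim1 – sim3, best_sim and name_ok are hypothetical, and the 0.5 cutoff is an arbitrary placeholder rather than a recommended value; how matchit scores the empty firm strings should be checked before relying on the result.

```stata
* Sketch only: requires matchit (ssc install matchit)
* sim1-sim3, best_sim and name_ok are hypothetical names chosen for illustration

* One similarity score per firm variable, comparing it with investor_name
forvalues i = 1/3 {
    matchit investor_name firm`i', generate(sim`i')
}

* Highest score across the three comparisons (rowmax ignores missing values)
egen best_sim = rowmax(sim1 sim2 sim3)

* Flag the name as correct above an assumed, arbitrary cutoff of 0.5
generate byte name_ok = best_sim >= 0.5 & !missing(best_sim)

list person_id investor_name best_sim name_ok
```

Inspecting best_sim in the list output before fixing a cutoff seems sensible, since a pair like "Goldman Sachs" vs. "Goldman" will score well below 1 even though it is a genuine match.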