I have a project that requires me to use fuzzy match to identify firm names in different forms.
Let's see examples of two datasets. I want to fuzzy match the variable "cnms" between two datasets that contain many firm names.
Example 1 (master file) has over 291,000 obs.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input float fyear str58 conm str6 gvkey str10 cusip str4 sic str6 naics str50 cnms str8 ctype double salecs float idmaster 2003 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported" "COMPANY" 4.838 1 2004 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported" "COMPANY" 80.438 2 2005 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported" "COMPANY" 6.268000000000001 3 2006 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported" "COMPANY" 154.179 4 2000 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Online" "MARKET" 119.019 5 2000 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Telephonic" "MARKET" 230.221 6 2000 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET" 36.01 7 2001 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET" 28.592 8 2001 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Online" "MARKET" 182.924 9 2001 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Telephonic" "MARKET" 230.723 10 2002 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Telephonic" "MARKET" 248.931 11 2002 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Online" "MARKET" 218.179 12 2002 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET" 30.095 13 2003 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Online" "MARKET" 265.278 14 2003 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET" 29.269 15 2003 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Telephonic" "MARKET" 271.071 16 2004 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Online" "MARKET" 307.47 17 2004 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Telephonic" "MARKET" 263.039 18 2004 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET" 33.469 19 2005 "1-800-FLOWERS.COM" "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET" 49.848 20 end
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing "1DRI" "#N/A" "" 1 "1INSURER LTD" "#N/A" "" 2 "1MC-AGRICO" "#N/A" "" 3 "1ST FL NA" "#N/A" "" 4 "1ST SOLUTIONS" "#N/A" "" 5 "1ST UNION FL" "#N/A" "" 6 "2 WIRELESS CARRIERS" "#N/A" "" 7 "20/20 SPORT" "#N/A" "" 8 "20TH CENTRY" "012886" "90130A101" 9 "20TH CENTURY FOX" "012886" "90130A101" 10 "20TH CENTY" "012886" "90130A101" 11 "TWENTY-FIRST CENTURY FOX INC" "012886" "90130A101" 12 "2122UNITED NATURAL FOODS INC" "#N/A" "" 13 "21ST CENTY TELECOM GROUP INC" "#N/A" "" 14 "238 TELECOM LIMITED" "#N/A" "" 15 "24 HOUR FITNESS" "#N/A" "" 16 "24 HOUR FITNESS USA, INC." "#N/A" "" 17 "24 HOUR FITNESS WORLD, INC." "#N/A" "" 18 "24/7" "#N/A" "" 19 "24/7 MEDIA INC" "#N/A" "" 20 end
Code:
matchit idmaster cnms using final, idusing(idusing) txtusing(cnms)
Code:
reclink cnms using final, idmaster(idmaster) idusing(idusing) gen(matchscore) _merge(_merge) minscore(.9)
Thanks in advance.
0 Response to Fuzzy match via -matchit- and -reclink-
Post a Comment