Hi Statalisters,

I have a project that requires me to use fuzzy match to identify firm names in different forms.

Let's see examples of two datasets. I want to fuzzy match the variable "cnms" between two datasets that contain many firm names.

Example 1 (master file) has over 291,000 obs.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float fyear str58 conm str6 gvkey str10 cusip str4 sic str6 naics str50 cnms str8 ctype double salecs float idmaster
2003 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported"       "COMPANY"             4.838  1
2004 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported"       "COMPANY"            80.438  2
2005 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported"       "COMPANY" 6.268000000000001  3
2006 "1-800 CONTACTS INC" "066425" "681977104" "5961" "454111" "Not Reported"       "COMPANY"           154.179  4
2000 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Online"             "MARKET"            119.019  5
2000 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Telephonic"         "MARKET"            230.221  6
2000 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET"              36.01  7
2001 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET"             28.592  8
2001 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Online"             "MARKET"            182.924  9
2001 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Telephonic"         "MARKET"            230.723 10
2002 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Telephonic"         "MARKET"            248.931 11
2002 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Online"             "MARKET"            218.179 12
2002 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET"             30.095 13
2003 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Online"             "MARKET"            265.278 14
2003 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET"             29.269 15
2003 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Telephonic"         "MARKET"            271.071 16
2004 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Online"             "MARKET"             307.47 17
2004 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Telephonic"         "MARKET"            263.039 18
2004 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET"             33.469 19
2005 "1-800-FLOWERS.COM"  "122519" "68243Q106" "5961" "454110" "Retail/Fulfillment" "MARKET"             49.848 20
end
Example 2 (using file) has over 37,000 obs.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing
"1DRI"                         "#N/A"   ""           1
"1INSURER LTD"                 "#N/A"   ""           2
"1MC-AGRICO"                   "#N/A"   ""           3
"1ST FL NA"                    "#N/A"   ""           4
"1ST SOLUTIONS"                "#N/A"   ""           5
"1ST UNION FL"                 "#N/A"   ""           6
"2 WIRELESS CARRIERS"          "#N/A"   ""           7
"20/20 SPORT"                  "#N/A"   ""           8
"20TH CENTRY"                  "012886" "90130A101"  9
"20TH CENTURY FOX"             "012886" "90130A101" 10
"20TH CENTY"                   "012886" "90130A101" 11
"TWENTY-FIRST CENTURY FOX INC" "012886" "90130A101" 12
"2122UNITED NATURAL FOODS INC" "#N/A"   ""          13
"21ST CENTY TELECOM GROUP INC" "#N/A"   ""          14
"238 TELECOM LIMITED"          "#N/A"   ""          15
"24 HOUR FITNESS"              "#N/A"   ""          16
"24 HOUR FITNESS USA, INC."    "#N/A"   ""          17
"24 HOUR FITNESS WORLD, INC."  "#N/A"   ""          18
"24/7"                         "#N/A"   ""          19
"24/7 MEDIA INC"               "#N/A"   ""          20
end
I tried both -matchit- and -reclink- commands to match.

Code:
matchit idmaster cnms using final, idusing(idusing) txtusing(cnms)
Code:
reclink cnms using final, idmaster(idmaster) idusing(idusing) gen(matchscore) _merge(_merge) minscore(.9)
However, both commands took more than 5 hours processing in Stata and still did not finish. I had to break the processing. I admitted these two fuzzy match commands took much time in processing but did not expect such a long time. Anyone has a better solution so shorten processing time when fuzzy match with two large datasets/

Thanks in advance.