Hi Statalisters,

I am trying to use the fuzzy-matching commands matchit and reclink to merge two datasets.

Here is an example of the master file. I am matching on the third column, cnms (company name).

* Example generated by -dataex-. To install: ssc install dataex
input float fyear str58 conm str50 cnms
2004 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
2005 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
2006 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
2007 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
2000 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
2001 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
2002 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
2003 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
2012 "2U INC"             "Georgetown University School of Nursing and Health"
2012 "2U INC"             "University of Southern California"                 
end

Here is an example of the using file. Again, cnms is the variable I match on.

* Example generated by -dataex-. To install: ssc install dataex
input str50 cnms str6 gvkey_cus str9 cusip_cus
"20TH CENTRY"                  "012886" "90130A101"
"20TH CENTURY FOX"             "012886" "90130A101"
"20TH CENTY"                   "012886" "90130A101"
"TWENTY-FIRST CENTURY FOX INC" "012886" "90130A101"
"2122UNITED NATURAL FOODS INC" "#N/A"   ""         
"21ST CENTY TELECOM GROUP INC" "#N/A"   ""         
"238 TELECOM LIMITED"          "#N/A"   ""         
"24 HOUR FITNESS"              "#N/A"   ""         
"24 HOUR FITNESS USA, INC."    "#N/A"   ""         
"24 HOUR FITNESS WORLD, INC."  "#N/A"   ""         
"24/7"                         "#N/A"   ""         
end

Here is my reclink and matchit code.

reclink cnms using final1000, idmaster(idmaster) idusing(idusing) gen(matchscore) _merge(_merge) minscore(.9)
matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms)
The problem is that both commands run into a similar issue: they seem to be confused by generic words that many firm names share, such as CORP, INC, LTD, and INTERNATIONAL. For example, in the dataex output at the end of this post, "ARROW INTERNATIONAL" and "ADS INTERNATIONAL" receive a high match score (about 0.94) because of the shared word "INTERNATIONAL", even though they are two distinct firms. Does anyone know how to overcome this problem in fuzzy matching? Is it possible to assign different weights to different words within a name?
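
One workaround I have considered is stripping the generic words from cnms before matching, so that the score depends only on the distinctive part of each name. Below is a minimal sketch of that idea, not a tested solution: the token list and the variable name cnms_clean are my own assumptions, it needs Stata 14 or later for ustrregexra(), and the same cleaning would have to be applied to cnms in the using file as well.

```stata
* Sketch only: remove generic tokens before fuzzy matching.
* The token list and the variable name cnms_clean are assumptions.
clear
input str50 cnms
"ARROW INTERNATIONAL"
"ADS INTERNATIONAL"
"WATSON PHARMACEUTICALS INC"
end

* Work on an upper-case copy so the original names are preserved
generate str50 cnms_clean = upper(cnms)

* Drop whole-word generic tokens (\b marks a word boundary)
replace cnms_clean = ustrregexra(cnms_clean, "\b(INC|CORP|LTD|CO|INTERNATIONAL)\b", "")

* Collapse repeated blanks and trim leading/trailing blanks
replace cnms_clean = stritrim(strtrim(cnms_clean))

list cnms cnms_clean
```

This is blunt, since removing INTERNATIONAL entirely may throw away information for some firms, which is why I would prefer a way to downweight such words rather than delete them. If I read the matchit help file correctly, its weights() option (simple, log, or root) is meant to downweight frequent grams, but I am not sure whether it handles this case or whether reclink has an equivalent.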

* Example generated by -dataex-. To install: ssc install dataex
input float fyear str58 conm str6 gvkey str10 cusip str4 sic str6 naics str50(cnms Ucnms) str8 ctype double salecs float(idmaster matchscore idusing) str6 gvkey_cus str9 cusip_cus byte _merge
2001 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"  14.559 3359 .9310636 827 "#N/A"   ""          3
2002 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"   6.974 3361 .9310636 827 "#N/A"   ""          3
2003 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"   3.335 3362 .9310636 827 "#N/A"   ""          3
2009 "ADTRAN INC"                  "030576" "00738A106" "3661" "334210" "AT&T INC"                   "AT&T INC"                  "COMPANY" 106.521 4057        1 116 "009899" "00206R102" 3
2010 "ADTRAN INC"                  "030576" "00738A106" "3661" "334210" "AT&T INC"                   "AT&T INC"                  "COMPANY" 109.021 4064        1 116 "009899" "00206R102" 3
2001 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"     1.8 4100 .9397588 530 "#N/A"   ""          3
2002 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"    2.78 4102 .9397588 530 "#N/A"   ""          3
2003 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"    1.44 4106 .9397588 530 "#N/A"   ""          3
end

Thanks in advance.