My idea is to first get the exact 'cod' matches and then fuzzy-match on 'name' only among observations that share the same value of 'cod'.
My example datasets are copied below. In the master dataset, both the observation with cod == "530461" and name "WAGNER OLIVEIRA" and the observation with the same cod but name "VAGNER OLIVEIRA" should be matched to the observation with the same cod and name "WAGNER OLIVEIRA" in the using dataset, since the two names differ only by a tiny variation.
I have tried different options in reclink, including orblock, wmatch and wnomatch, without success. It only returns exact cod-name matches; I cannot get it to fuzzy-match those small variations in names.
Here's the reproducible example:
Code:
/*
ssc install reclink
ssc install dataex
*/

clear
input byte id_using str6 cod str24 name byte var_using
 1 "530461" "WAGNER OLIVEIRA"          0
 2 "675232" "MARIANA COUTINHO"         1
 3 "675232" "JOANA DA SILVA"           0
 4 "513372" "ROMEU DE SOUZA"           0
 5 "808747" "JULIETA CORREA DOS ANJOS" 1
 6 "650334" "JULIETA CORREA APARECIDA" 1
 7 "351475" "ROSANGELA DIRCKSCHNEIDER" 0
 8 "970505" "TOMIKI SHIOKI"            0
 9 "351475" "ANA MARIA MELO FRANCO"    0
10 "773263" "PROTOGENES HERMENEGILDO"  0
11 "530461" "ABADIO DOS SANTOS"        1
end
sort cod name
tempfile using
save `using'

clear
input byte id_master str6 cod str24 name float var_master
 1 "530461" "WAGNER OLIVEIRA"          0.900256205
 2 ""       ""                         0.244029951
 3 "675232" "MARIANA COUTINHO"         0.797757411
 4 ""       ""                         0.20090559
 5 ""       ""                         0.23264436
 6 "530461" "VAGNER OLIVEIRA"          0.534601937
 7 "675232" "JOANA DA SILVA"           0.138305611
 8 "513372" "ROMEU DE SOUZA"           0.605197148
 9 "808747" "JULIETA CORREA DOS ANJOS" 0.769864143
10 "650334" "JULIETA CORREA APARECIDA" 0.115634447
11 "351475" "ROSANGELA DIRCKSCHNEIDER" 0.983849107
12 ""       ""                         0.794636935
13 "970505" "TOMIKI SHIOKI"            0.129721648
14 "351475" "ANA MARIA MELO FRANCO"    0.253776459
15 "351475" "ANA MARIA MELO FRANCO"    0.612023696
16 ""       ""                         0.649227503
17 "773263" "PROTOGENES HERMENEGILDO"  0.154941878
18 "675232" "MARIANA S COUTINHO"       0.642434356
19 ""       ""                         0.566767856
20 "530461" "ABADIO DOS SANTOS"        0.279123444
end
sort cod name

reclink cod name using `using', ///
    idmaster(id_master) idusing(id_using) ///
    gen(match_score)
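To make the intent concrete, the call I am after might look something like the variant below. It is an untested sketch, not a confirmed solution: it assumes the installed version of reclink offers a required() option that forces exact agreement on the listed variables (here cod) while the remaining variable (name) is still compared with the fuzzy string score. It is meant to replace the final reclink call in the example above, with the master data in memory and the `using' tempfile already saved.

```stata
* Untested sketch: assumes reclink's required() option behaves as described
* in its help file, i.e. cod must agree exactly for a pair to be a candidate.
* name is left to reclink's fuzzy comparison, so "VAGNER OLIVEIRA" and
* "WAGNER OLIVEIRA" can only be paired when they share the same cod.
reclink cod name using `using', ///
    idmaster(id_master) idusing(id_using) ///
    required(cod) gen(match_score)

* Inspect the result for the cod of interest.
list id_master id_using cod name match_score if cod == "530461"
```

Whether the "VAGNER"/"WAGNER" pair is actually retained would still depend on the score threshold reclink applies, which I have not verified for this data.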