Hi all,

I am trying to match administrative names across two datasets in which the names vary in length and spelling. The master dataset has 400 unique names and the using dataset has 379. I run the match with the following:
Code:
matchit id municipality_name using `audit_nodup', idusing(id) txtusing(Municipio) override
where id is the numeric code for the municipality in each data set.

A portion of the results:
Code:
input int id str49 municipality_name int id1 str49 Municipio double similscore
1 "Acambaro" 292 "Tacámbaro" .5345224838248488
1 "Acambaro" 10 "Acámbaro" .7142857142857143
1 "Acambaro" 2 "Acambay" .7715167498104595
2 "Acaponeta" 3 "Acaponeta" 1
3 "Acapulco De Juarez" 4 "Acapulco de Juárez" .7647058823529411
4 "Acatlan" 264 "San Luis Acatlán" .5270462766947299
4 "Acatlan" 180 "Matlapa" .5
4 "Acatlan" 8 "Acayucan" .5443310539518174
5 "Acayucan" 298 "Tantoyuca" .5892556509887896
5 "Acayucan" 2 "Acambay" .5443310539518174
5 "Acayucan" 8 "Acayucan" 1
5 "Acayucan" 337 "Tzucacab" .5555555555555556
6 "Actopan" 370 "Zapopan" .5
7 "Acuna" 9 "Acuña" .5
9 "Aguascalientes" 11 "Aguascalientes" 1
10 "Ahome" 12 "Ahome" 1
11 "Alamo Temapache" 377 "Álamo Temapache" .9285714285714286
12 "Alamos" 378 "Álamos" .8
13 "Allende" 15 "Allende" 1
13 "Allende" 162 "La Independencia" .5345224838248488
13 "Allende" 353 "Villa de Allende" .7492686492653552
13 "Allende" 270 "San Miguel de Allende" .6092717958449424
14 "Altamira" 16 "Altamira" 1
14 "Altamira" 17 "Altamirano" .8819171036881969
15 "Ameca" 21 "Ameca" 1
16 "Amecameca" 21 "Ameca" .9354143466934853
17 "Anahuac" 150 "Ixtlahuaca" .5443310539518174
17 "Anahuac" 69 "Chihuahua" .5892556509887896
18 "Apatzingan" 22 "Apatzingán" .7777777777777778
20 "Apodaca" 23 "Apodaca" 1
21 "Arandas" 27 "Arandas" 1
22 "Arcelia" 190 "Morelia" .5
23 "Arizpe" 234 "Ramos Arizpe" .674199862463242
24 "Arriaga" 28 "Arteaga" .5
25 "Arteaga" 28 "Arteaga" 1
27 "Atlixco" 306 "Temixco" .5
27 "Atlixco" 30 "Atlixco" 1
27 "Atlixco" 31 "Atlixtac" .6172133998483676
28 "Atotonilco El Alto" 18 "Altotonga" .659380473395787
29 "Atoyac De Alvarez" 32 "Atoyac de Álvarez" .75
33 "Banderilla" 52 "Candela" .5443310539518174
33 "Banderilla" 353 "Villa de Allende" .5353033790313108
33 "Banderilla" 53 "Candelaria" .5555555555555556
34 "Benito Juarez" 323 "Tlacotepec de Benito Juárez" .5661385170722978
34 "Benito Juarez" 41 "Benito Juárez" .8333333333333334
36 "Boca Del Rio" 42 "Boca del Río" .6363636363636364
38 "Cajeme" 46 "Cajeme" 1
39 "Calkini" 48 "Calkiní" .8333333333333334
40 "Calpulalpan" 14 "Ajalpan" .6546536707079772
40 "Calpulalpan" 47 "Calakmul" .5050762722761054
41 "Calvillo" 49 "Calvillo" 1
41 "Calvillo" 245 "Saltillo" .5714285714285714
43 "Campeche" 50 "Campeche" 1
44 "Cananea" 115 "Galeana" .5773502691896258
44 "Cananea" 259 "San Juan Cancuc" .5276448530110863
44 "Cananea" 338 "Técpan de Galeana" .5
44 "Cananea" 51 "Canatlán" .5345224838248488
45 "Candela" 52 "Candela" 1
45 "Candela" 53 "Candelaria" .816496580927726
47 "Cardenas" 96 "Cárdenas" .7142857142857143
47 "Cardenas" 54 "Carmen" .50709255283711
48 "Cardonal" 330 "Tonalá" .50709255283711
49 "Carmen" 54 "Carmen" 1
50 "Castanos" 55 "Castaños" .7142857142857143
51 "Celaya" 56 "Celaya" 1
52 "Centro" 57 "Centla" .6
52 "Centro" 58 "Centro" 1
54 "Chalcatongo de Hidalgo" 132 "Hidalgo" .6531972647421809
56 "Champoton" 61 "Champotón" .75
58 "Chignahuapan" 68 "Chignahuapan" 1
58 "Chignahuapan" 69 "Chihuahua" .6092717958449424
59 "Chignautla" 68 "Chignahuapan" .502518907629606
59 "Chignautla" 91 "Cuautla" .5443310539518174
59 "Chignautla" 138 "Huautla" .5443310539518174
60 "Chihuahua" 68 "Chignahuapan" .6092717958449424
60 "Chihuahua" 73 "Chimalhuacán" .5222329678670935
60 "Chihuahua" 364 "Yahualica" .5103103630798288
60 "Chihuahua" 317 "Tihuatlán" .5103103630798288
60 "Chihuahua" 69 "Chihuahua" 1
62 "Chilpancingo De Los Bravo" 71 "Chilpancingo de los Bravo" .8333333333333334
63 "Cihuatlan" 317 "Tihuatlán" .625
63 "Cihuatlan" 69 "Chihuahua" .5103103630798288
63 "Cihuatlan" 135 "Huamantla" .5
64 "Cintalapa" 358 "Xalapa" .6324555320336759
64 "Cintalapa" 212 "Papantla" .5345224838248488
64 "Cintalapa" 74 "Cintalapa" 1
65 "Ciudad Valles" 76 "Ciudad Valles" 1
65 "Ciudad Valles" 75 "Ciudad Madero" .5400617248673216
66 "Coalcoman De Vazquez Pallares" 79 "Comalcalco" .5063696835418333
67 "Coatzacoalcos" 77 "Coatzacoalcos" 1
70 "Comala" 79 "Comalcalco" .6201736729460423
70 "Comala" 151 "Jala" .5163977794943222
70 "Comala" 114 "Frontera Comalapa" .5590169943749475
71 "Comalcalco" 79 "Comalcalco" 1
71 "Comalcalco" 60 "Chalco" .6201736729460423
72 "Comitan De Dominguez" 80 "Comitán de Domínguez" .7419408268023742
73 "Comondu" 81 "Comondú" .8333333333333334
73 "Comondu" 82 "Comonfort" .5773502691896258
74 "Compostela" 83 "Compostela" 1
75 "Cordoba" 97 "Córdoba" .6666666666666666
end


I first want to isolate the observations that have a perfect match, but
Code:
by id: keep if similscore==1
also deletes every id that has multiple candidate matches but none with a similscore equal to 1. Second, is there a good way to proceed (other than manual inspection) for ids that have multiple candidates, none with similscore==1, where the candidate with the highest score is not actually the best match? For example, for id 1 the second candidate ("Acámbaro", .714) is correct, but it scores lower than the third candidate ("Acambay", .772).
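
To make the goal concrete, here is a minimal sketch of the kind of result I am after. It assumes the matchit output shown above is in memory and Stata 14 or later (for the Unicode functions). The variables clean_*, exact and has_exact are hypothetical names used only for illustration. Treating two names as a perfect match when they differ only in accents or capitalization is my own assumption, and it may be too loose if different municipalities share a name.

```stata
* Strip accents and case from both name variables:
* decompose to NFD, drop the non-ASCII combining marks, then lowercase
foreach v of varlist municipality_name Municipio {
    generate clean_`v' = lower(ustrto(ustrnormalize(`v', "nfd"), "ascii", 2))
}

* Assumption: a pair counts as a perfect match if similscore is 1
* or the cleaned names coincide
generate byte exact = (similscore == 1) | (clean_municipality_name == clean_Municipio)

* Flag every id that has at least one perfect match
bysort id: egen byte has_exact = max(exact)

* ids with a perfect match: keep only the matching row(s)
* ids without one: keep all candidates for later review
keep if exact | !has_exact
```

With the example data, id 1 would keep only the "Acámbaro" row, while id 4 ("Acatlan") would keep all three candidates because none of them qualifies.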