I've been matching entire electoral rolls in Stata 16. My sixth joinby matches on first forename, surname, flat number (flat number was dropped in an earlier match) and the first eight letters of the street name. I then compute a number of Levenshtein scores, including on the standardised names and on all the names combined into a single string. For this sixth joinby the scores need to allow for people adding or dropping forenames while probably still being the same person.

Code:
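* Compare every fore_17 forename (j) with every fore_20 forename (i)
forvalues j = 1/4 {
    forvalues i = 1/4 {
        levenshtein fore_17_std_f`j' fore_20_std_f`i', gen(fore_sim_1_20_`j'`i')
        * A comparison involving an empty name is not informative, so set it to missing
        replace fore_sim_1_20_`j'`i' = . if fore_17_std_f`j' == "" | fore_20_std_f`i' == ""
    }
}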

egen zero_count = anycount(fore_sim_1_20_11-fore_sim_1_20_44), value(0)
tab zero_count
I've been using the number of zeros to flag cases where there is strong evidence that I have the right person, given that their address has stayed the same, even though they may have added or dropped names. Curation can then focus on those who have switched to, or away from, English versions of their names. However, I wonder whether there is an alternative way of measuring the similarity. I've been experimenting with ngram_distance (see below), but here the actual distance is largely irrelevant. What matters is how many of the names (or parts of names, if the spelling is inconsistent) are the same in both periods. Does anyone have any suggestions? The first case is a particular challenge.


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str18 fore_17_std_f1 str16 fore_17_std_f2 str15 fore_17_std_f3 str12 fore_17_std_f4 str18 fore_20_std_f1 str15(fore_20_std_f2 fore_20_std_f3) str1 fore_20_std_f4
"Sara"     "Raewyn"   "Abdul"      "Magdeline" "Sara"     "Abdul"    "Magdeline" ""
"Isabelle" "Lucy"     ""           ""          "Isabelle" ""         ""          ""
"Margaret" "Mary"     "Lily"       ""          "Margaret" "Mary"     "Lily"      ""
"Allan"    "Laurance" "Richard"    ""          "Allan"    "Laurance" "Richard"   ""
"James"    "Jan"      "Kui"        "Kanara"    "James"    "Jan"      "Kui"       ""
"Reece"    "Jermaine" ""           ""          "Reece"    "Jermaine" "Clarke"    ""
"Raymond"  "Simon"    ""           ""          "Raymond"  "Simon"    "Haika"     ""
"Marlene"  "Te"       "Tirakohine" ""          "Marlene"  "Joanne"   "Te"        ""
"Russell"  "Anderson" ""           ""          "Russell"  "Anderson" "Ngataki"   ""
"Jane"     "Aroha"    "Edith"      ""          "Jane"     "Aroha"    "Edith"     ""
end
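
To make "how many of the names are the same in both periods" concrete, here is a minimal sketch of the kind of position-independent count I have in mind. It is only an illustration, not a settled method: the variables n_17, n_20, n_shared, share_shared and the temporary hit are hypothetical names, the data are three rows taken from the example above, and only exact matches on the standardised names are counted.

```stata
* Sketch only: count fore_17 names that reappear in any fore_20 position
clear
input str18 fore_17_std_f1 str16 fore_17_std_f2 str15 fore_17_std_f3 str12 fore_17_std_f4 str18 fore_20_std_f1 str15(fore_20_std_f2 fore_20_std_f3) str1 fore_20_std_f4
"Sara"     "Raewyn"   "Abdul"      "Magdeline" "Sara"     "Abdul"    "Magdeline" ""
"Isabelle" "Lucy"     ""           ""          "Isabelle" ""         ""          ""
"Marlene"  "Te"       "Tirakohine" ""          "Marlene"  "Joanne"   "Te"        ""
end

gen byte n_17     = 0   // non-empty forenames in the fore_17 set (hypothetical name)
gen byte n_20     = 0   // non-empty forenames in the fore_20 set (hypothetical name)
gen byte n_shared = 0   // fore_17 names found anywhere in fore_20 (hypothetical name)

forvalues j = 1/4 {
    replace n_17 = n_17 + (fore_17_std_f`j' != "")
    replace n_20 = n_20 + (fore_20_std_f`j' != "")
    gen byte hit = 0    // 1 if name j of fore_17 matches any fore_20 name
    forvalues i = 1/4 {
        replace hit = 1 if fore_17_std_f`j' != "" & fore_17_std_f`j' == fore_20_std_f`i'
    }
    replace n_shared = n_shared + hit
    drop hit
}

* Share of the shorter name list that is accounted for by shared names
gen share_shared = n_shared / min(n_17, n_20)
list n_17 n_20 n_shared share_shared
```

A share of 1 would mean that every name in the shorter list also appears in the longer one, which covers the "added or dropped a forename" cases regardless of position. Repeated names within one person would be counted more than once, and the exact-equality test could be swapped for a threshold on the Levenshtein scores to tolerate spelling differences.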
Code:
strdist(forename17f forename20f),ngramdist(ngram_distance)
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str30 forename17f str24 forename20f
"Sara Raewyn Adbul Magdeline" "Sara Abdul Magdeline"    
"Isabelle Lucy"               "Isabelle"                
"Margaret Mary Lily"          "Margaret Mary Lily"      
"Allan Laurance Richard"      "Allan Laurance Richard"  
"James Jan Kui Kanara"        "James Jan Kui"           
"Reece Jermaine"              "Reece Jermaine Clarke"   
"Raymond Simon"               "Raymond Simon Haika"     
"Marlene Te Tirakohine"       "Marlene Joanne Te"       
"Russell Anderson"            "Russell Anderson Ngataki"
"Jane Aroha Edith"            "Jane Aroha Edith"        
end
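
The same idea can be sketched on the combined strings, using the built-in wordcount(), word() and strpos() functions. Again this is a rough illustration under my own assumptions: n_words17, n_words20 and n_common are hypothetical names, the three rows are taken from the example above, and only whole-word exact matches are counted.

```stata
* Sketch only: count words of forename17f that appear as whole words in forename20f
clear
input str30 forename17f str24 forename20f
"Sara Raewyn Adbul Magdeline" "Sara Abdul Magdeline"
"Isabelle Lucy"               "Isabelle"
"Marlene Te Tirakohine"       "Marlene Joanne Te"
end

gen byte n_words17 = wordcount(forename17f)   // hypothetical name
gen byte n_words20 = wordcount(forename20f)   // hypothetical name
gen byte n_common  = 0                        // hypothetical name

* Loop up to the largest number of words observed in forename17f
summarize n_words17, meanonly
local maxw = r(max)

forvalues k = 1/`maxw' {
    * Pad both sides with spaces so that only whole words match
    replace n_common = n_common + 1 ///
        if word(forename17f, `k') != "" ///
        & strpos(" " + forename20f + " ", " " + word(forename17f, `k') + " ") > 0
}

list forename17f forename20f n_words17 n_words20 n_common
```

On the first row this counts only two common words, because "Adbul" and "Abdul" are not identical, so a purely exact word match would still need a fuzzy comparison for inconsistently spelled names.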