Hi folks,

I have a dataset with more than 70,000 observations. I am looking at principal turnover in more than 5,000 schools from 2008 to 2019. The only information I have is the year, the name of the school, and the name of the principal. However, because the principal's name was entered again every year, there are a lot of typos, and I need to clean the data before checking, for each school, whether the principal changed from one year to the next.

This is an example of what I am talking about:

Year School Principal
2008 A Jeff Ready
2009 A Jeffrey Ready
2010 A Maria Santos
2011 A Maria Santos
2012 A María Santos
2008 B John Luther Schneider
2009 B John Luther Schneider
2010 B John Luter Schneider
2011 B John Luther Schneider
2012 B Johnn Luther Schneider
2008 C Robert D. King
2009 C Robert King
2010 C Robert King
2011 C Robert Douglas King
2012 C Robert King

My first thought was the user-written command “strdist”, but my understanding is that it compares two variables with each other, not one observation with another within the same variable.

Ideally, I'd like to compare the string in each row with the string in the row directly above it, within the same school. If the two are similar enough (say, within some threshold of Levenshtein distance), the value would be replaced with a copy of the row above; otherwise, it would stay as it is.
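
To make the idea concrete, here is a rough Python sketch of the logic I have in mind. It is only an illustration and has not been run on the real data: the function names (levenshtein, carry_forward_names), the (year, school, name) tuple layout, and the threshold of 3 are all assumptions made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def carry_forward_names(rows, max_distance=3):
    """rows: (year, school, name) tuples, sorted by school and then year.

    If a name is within max_distance edits of the (already cleaned) name in
    the row above for the same school, it is replaced by that name.
    max_distance=3 is an arbitrary illustrative threshold.
    """
    cleaned = []
    prev_school, prev_name = None, None
    for year, school, name in rows:
        if school == prev_school and levenshtein(name, prev_name) <= max_distance:
            name = prev_name  # close enough: treat as the same principal
        cleaned.append((year, school, name))
        prev_school, prev_name = school, name
    return cleaned


# A few rows from the example above
sample = [
    (2008, "A", "Jeff Ready"),
    (2009, "A", "Jeffrey Ready"),
    (2010, "A", "Maria Santos"),
    (2009, "B", "John Luther Schneider"),
    (2010, "B", "John Luter Schneider"),
]
for row in carry_forward_names(sample):
    print(row)
```

With a threshold of 3, “Jeffrey Ready” and “John Luter Schneider” would be pulled back to the spelling in the row above. A spelled-out middle name like “Robert Douglas King” is 6 edits away from “Robert D. King”, so it would still be treated as a different person; cases like that would probably need a looser threshold or a separate rule.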

Does anyone know whether something like this already exists in Stata (or Excel, or some other language)?