I have a dataset with more than 70,000 observations. I am looking at school principal turnover across more than 5,000 schools from 2008 to 2019. The only information I have is the year, the name of the school, and the name of the principal. However, because the principal's name was entered again every year, there are a lot of typos, and I need to clean the data before checking, for each school, whether the principal changed from one year to the next.
This is an example of what I am talking about:
Year | School | Name |
2008 | A | Jeff Ready |
2009 | A | Jeffrey Ready |
2010 | A | Maria Santos |
2011 | A | Maria Santos |
2012 | A | María Santos |
2008 | B | John Luther Schneider |
2009 | B | John Luther Schneider |
2010 | B | John Luter Schneider |
2011 | B | John Luther Schneider |
2012 | B | Johnn Luther Schneider |
2008 | C | Robert D. King |
2009 | C | Robert King |
2010 | C | Robert King |
2011 | C | Robert Douglas King |
2012 | C | Robert King |
Initially, I thought about using the user-written command “strdist”, but my understanding is that it does not compare values within the same variable.
Ideally I’d have a way of comparing the string in one row with the string in the row right above it. If they were similar enough (that is, within some threshold of Levenshtein distance), I’d want the value to become a copy of the row above; otherwise, it would stay as it is.
Does anyone know if this function exists in Stata (or Excel, or some other language)?
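To make the idea concrete, here is a rough Python sketch of the logic I have in mind, using only the standard library. The function names `levenshtein` and `carry_forward_names` and the threshold of 3 are placeholders I made up for illustration, not an existing command, and it assumes the rows can be sorted by school and year.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def carry_forward_names(rows, max_distance=3):
    """rows: iterable of (year, school, name) tuples.

    Returns the rows sorted by school and year, where a name is replaced
    by the (already cleaned) name in the row above whenever both rows
    belong to the same school and the names are within max_distance edits.
    max_distance=3 is an arbitrary illustrative threshold.
    """
    cleaned = []
    prev_school, prev_name = None, None
    for year, school, name in sorted(rows, key=lambda r: (r[1], r[0])):
        if school == prev_school and levenshtein(name, prev_name) <= max_distance:
            name = prev_name  # close enough: copy the row above
        cleaned.append((year, school, name))
        prev_school, prev_name = school, name
    return cleaned


if __name__ == "__main__":
    sample = [
        (2008, "A", "Jeff Ready"),
        (2009, "A", "Jeffrey Ready"),
        (2010, "A", "Maria Santos"),
        (2011, "A", "Maria Santos"),
        (2012, "A", "Mar\u00eda Santos"),
    ]
    for row in carry_forward_names(sample):
        print(row)
```

With a threshold this small, a variant such as "Robert Douglas King" versus "Robert D. King" is further apart than 3 edits, so it would probably still be treated as a different person; the threshold would need tuning or a different similarity measure for cases like that.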