Hi Stata users,

I have a problem looping over string values. My data set contains duplicate observations, and I would like to keep one observation from each group of duplicates. However, the other variables make it unsafe to keep an arbitrary one. I have two other variables, gender and province. For some groups of duplicates, one of the duplicates contains the true province value and the other contains a false value. I would like to keep the one with the true province value. Here is how the data look:

stid    gender2    province2
E05528498    Male    دیپارتمنتاداریودیپلوماسیپوهنحیحقوقپوهنتونالبیرونی
E05528498    Male    کاپيسا
E05528502    Male    کاپيسا
E05528502    Male    دیپارتمنتفقهوقانونپوهنحیشرعیاتپوهنتونالبیرونی
where stid is the identifier, gender2 is gender, and province2 is the province. For stid = E05528498, the second row holds the true province value and the first row holds a false value. For stid = E05528502, the first row holds the true value and the second is false. The true value is not کاپيسا for every observation: I have 34 province names, and the true value could be any of those 34. The variable values are strings in Persian.

My question is: how can I write a program that, for each group of duplicates, keeps the observation with the true province value?

Another complication with this data is that two variables are messed up. The province value has the same issue as above. In addition, within a group of duplicates, gender2 contains the true gender indicator in one row and an empty cell in the other. For each group of duplicates, I want to keep the gender indicator. The issue here is that the true province value and the true gender value are not in the same row. Here is an example from the data:


stid    gender2    province2
F01722690    Male    دیپارتمنتتعلیماتاسلامیپوهنحیشرعیاتپوهنتونکابلبرایذکورواناث
F01722690        کابل
F01722815        کابل
F01722815    Female    دیپارتمنتگرافیکپوهنحیهنرهایزیباپوهنتونکابل
where for stid = F01722690, gender2 = Male is the true gender indicator while province2 = دیپارتمنتتعلیماتاسلامیپوهنحیشرعیاتپوهنتونکابلبرایذکورواناث is the wrong province value. In the next row for the same stid, gender2 is an empty cell while province2 = کابل is the true province value. For stid = F01722815 the problem is flipped: the first row contains an empty cell for gender2 and the true province2, and the second row contains the true gender2 and a wrong province2.

My question in this case is: for each group, how can I fill the empty gender2 cell with the gender indicator and then keep the observation with the true province value?

With numeric values, it is easy to generate another variable using functions such as max, min, and group. But I am struggling to do the same with strings.
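
To illustrate what I mean by the numeric case, here is a minimal sketch. It assumes a hypothetical numeric variable called score, which is not in my data, and uses the egen max() function within each stid group:

```stata
* Hypothetical example: score is a made-up numeric variable, not in my data
clear
input str9 stid score
"E05528498" 1
"E05528498" 3
"E05528502" 2
"E05528502" 5
end

* Largest score within each stid, copied to every row of that group
bysort stid: egen max_score = max(score)

* Keep the row that holds the group maximum
* (if two rows tied for the maximum, both would be kept)
keep if score == max_score
```

What I cannot work out is the equivalent of this when the variable to pick is a string such as province2.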

You will save me a lot of time if you can help me with this.

Thanks!