Dear all,

I am working with string data and I would like to retrieve the most important word from a string variable that reflects a text-input field. I have already split the original variable into five different variables.

The dataset looks like this:
obs  word1    word2   word3     word4     word5
1    2.45X90                    VASSOURA
2    N9020X   PIVO    SUPERIOR
3    (S       1063T)  LANTERNA
4    15W4020  L                 OLEO
5    E V A    GLITER
I would like to create a new variable that retrieves the most important word out of these five different string variables.

In particular, I would like the dataset to look like the following:
obs  word1    word2   word3     word4     word5  final_word
1    2.45X90                    VASSOURA         VASSOURA
2    N9020X   PIVO    SUPERIOR                   PIVO
3    (S       1063T)  LANTERNA                   LANTERNA
4    15W4020  L                 OLEO             OLEO
5    E V A    GLITER                             GLITER
where 'final_word' is the variable that holds the most important word out of 'word1', 'word2', 'word3', 'word4' and 'word5'.

The criteria are as follows: if 'word?' has

(i) no special characters ("." "(" "*" and others)
(ii) no numbers
(iii) no whitespace among letters (see obs == 5, where word1 is "E V A", for a case of whitespace among letters)
(iv) length > 1 (considering the length of 'word?')

then 'final_word' == 'word?'.

I would like to check word1 first, then word2, then word3, and so forth, so that 'final_word' takes the first word that meets all four criteria.

Could you help me to find a solution for that?

Thank you very much!


Below I provide the code for importing the example dataset into Stata:

clear
input byte obs str20 word1 str20 word2 str20 word3 str20 word4 str20 word5
1 "2.45X90" "" "" "VASSOURA" ""
2 "N9020X" "PIVO" "SUPERIOR" "" ""
3 "(S" "1063T)" "LANTERNA" "" ""
4 "15W4020" "L" "" "OLEO" ""
5 "E V A" "GLITER" "" "" ""
end

Note: I tried to use 'dataex', but in this case I found it easier to provide the input code directly.
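
To make the rule concrete, here is a minimal sketch of how criteria (i) to (iv) and the word1-to-word5 checking order might be expressed in Stata. It is only an illustration under stated assumptions, not a definitive solution: it assumes every valid word consists solely of unaccented upper-case letters A-Z, as in the example data, and it relies on regexm() with a single pattern standing in for all four criteria.

```stata
* Sketch: keep the first word that is made only of letters A-Z and is
* at least two characters long. Assumes upper-case, unaccented words.
generate str20 final_word = ""

foreach v of varlist word1-word5 {
    * Fill final_word only while it is still empty, so that earlier
    * variables (word1, then word2, ...) take priority.
    * "^[A-Z][A-Z]+$" rejects digits, punctuation, embedded spaces,
    * empty strings, and single-letter words in one test.
    replace final_word = `v' if final_word == "" & regexm(`v', "^[A-Z][A-Z]+$")
}

list obs word1-word5 final_word, noobs
```

If the real data contain lower-case or accented letters, the pattern above would reject them; a Unicode-aware function such as ustrregexm() (Stata 14 or later) would probably be needed instead, and that part is untested here.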