Hi all,

RE: split using Stata version 16.0.

As part of cleaning my data sets, I need to generate a new variable that holds the final result for an existing variable. The variable it is derived from (sero) includes preliminary results. A given data set has multiple "sero" variables (sero1, sero2, sero3), so I would like an automated way to do this.

sero is a string variable of varying length (Example A) that may be solely numeric (Example B) or may contain symbols (Example C). I would like to generate a new variable that holds the value as listed, except when there are square brackets ([ ]), in which case the final value is whatever is listed inside the brackets. For example, 13/14/12B[14] should become 14. The "split" command works well for this when the data set is large, as the variable will typically include a mix of all types of values (e.g. Example D). However, when there are few observations I run into problems, because the variable does not always contain square brackets (e.g. Examples A and B).

Example A
sero
3A
3
1-like

Example B (note: this is still a string variable)
sero
1
3
5

Example C
sero
13/14/12B[14]
15A/F[15F]
2-like


Example D
sero
13/14/12B[14]
15A/F[15F]
2-like
3A
1
2
4
4-like
32A/12F/19A[32A]

For a single variable, a simple example is listed below. This works perfectly if there are square brackets, but does not work when there are none (the first split then creates only sero11, so there is no sero12 for the second split to use).
split sero1, p("[")
split sero12, p("]")
gen sero1_new = sero11 if sero121==""
replace sero1_new = sero121 if sero121!=""
drop sero11 sero12 sero121
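To make the failure case concrete, here is a minimal sketch of the kind of guard the split approach seems to need. It uses Example A as toy data; the check with capture confirm variable is an assumption about one way to detect that the first split produced no second piece, not necessarily the cleanest fix.

```stata
* Toy data: Example A, which contains no square brackets
clear
input str20 sero1
"3A"
"3"
"1-like"
end

split sero1, parse("[")                   // only sero11 is created when no "[" is present
capture confirm variable sero12, exact    // _rc is nonzero if sero12 does not exist
if _rc == 0 {
    * brackets present somewhere: proceed as in the example above
    split sero12, parse("]")
    generate sero1_new = cond(sero121 != "", sero121, sero11)
    drop sero11 sero12 sero121
}
else {
    * no brackets anywhere: the value as listed is already final
    generate sero1_new = sero11
    drop sero11
}
list, clean
```

This runs on Examples A and B, but it needs a separate branch for each case, which is what makes the split-based route messy to automate.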


For the data cleaning, I need this automated so that it works for several sero variables (i.e. sero1, sero2, sero3 etc.), regardless of whether they contain square brackets. The code is shown below.
Note: max_sero is the maximum number of sero variables in the data set (i.e. global max_sero 3, if there are 3 variables). As with the simplified example, this works if there are square brackets but not if there aren't.


Is there a way I can use the same code regardless of whether there are brackets or not? I have a workaround, but it is a bit messy (e.g. only adjusting the variables that contain brackets).


forval x=1/$max_sero {
split sero`x', p("[")
split sero`x'2, p("]")
gen sero`x'_clean = sero`x'1 if sero`x'21=="" //generate sero*_clean variable, use the value present if no brackets
replace sero`x'_clean = sero`x'21 if sero`x'21!="" //updating to final serotype call given inside the bracket for closely related calls
generate sero`x'_clean_test = 1 if sero`x'_clean!= sero`x'
}


I also tried to approach this using ustrregexra(), but could only work out how to remove the text inside the brackets, not how to keep it (e.g. replace sero1=ustrregexra(sero1, "\[.*?\]" , "" ) if strpos(sero1,"[") ).
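For illustration, a minimal sketch of what "keeping" the bracketed text might look like with ustrregexm() and ustrregexs() instead of ustrregexra(). The toy data put Examples A, B and C into sero1, sero2 and sero3, which is an assumption made only so the block runs on its own. The pattern adds a capture group to the one above, and the variable names follow the loop above.

```stata
* Toy data: Example A in sero1, Example B in sero2, Example C in sero3 (assumed layout)
clear
input str20 sero1 str20 sero2 str20 sero3
"3A"     "1" "13/14/12B[14]"
"3"      "3" "15A/F[15F]"
"1-like" "5" "2-like"
end

global max_sero 3

forvalues x = 1/$max_sero {
    * start from the value as listed
    generate sero`x'_clean = sero`x'
    * where a [...] group is present, keep only the text inside it
    replace sero`x'_clean = ustrregexs(1) if ustrregexm(sero`x', "\[(.*?)\]")
    * flag observations whose value was changed
    generate sero`x'_clean_test = 1 if sero`x'_clean != sero`x'
}

list, clean
```

Because the replace only touches observations where the pattern matches, the same loop runs whether or not a variable contains any brackets. With these toy data, 13/14/12B[14] becomes 14, while values such as 3A and 2-like are left as listed.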

Thank you for your help!