Hi all,
RE: split using Stata version 16.0.
As part of cleaning my datasets, I need to generate a new variable that gives the final result for a variable. The variable it is derived from (sero) includes preliminary results. A given dataset has multiple "sero" variables (sero1, sero2, sero3), so I would like an automated way to do this.
sero is a string variable of varying length (Example A) that may be solely numeric (Example B) or may contain symbols (Example C). I would like to generate a new variable that keeps the value as listed, except when there are square brackets ([ ]). When there are square brackets, the final value is what is listed inside the brackets. For example, 13/14/12B[14] should become 14. The "split" command works well for this when the dataset is large, as the variable will typically include a mix of all types of values (e.g. Example D). However, when there are few observations I run into issues with the command, as the variable does not always contain square brackets (e.g. Examples A and B).
Example A
sero
3A
3
1-like
Example B (note: this is still a string variable)
sero
1
3
5
Example C
sero
13/14/12B[14]
15A/F[15F]
2-like
Example D
sero
13/14/12B[14]
15A/F[15F]
2-like
3A
1
2
4
4-like
32A/12F/19A[32A]
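For a single variable, a simple example is listed below. This works perfectly if at least one observation contains square brackets, but fails if none do, because split then creates only sero11 and the sero12 used in the second line does not exist.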
split sero1, p("[")
split sero12, p("]")
gen sero1_new = sero11 if sero121==""
replace sero1_new = sero121 if sero121!=""
drop sero11 sero12 sero121
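One possible way around the missing sero12 is to skip split and locate the brackets directly with strpos() and substr(). The following is only a sketch under assumptions: the input data are Example A plus one bracketed value from Example C, and pos_open and pos_close are hypothetical helper variables that do not appear in the original code.

```stata
* Sketch only: data are Example A plus one bracketed value; pos_open/pos_close are hypothetical helpers
clear
input str20 sero1
"3A"
"3"
"1-like"
"13/14/12B[14]"
end

* Start from the value as listed
generate sero1_new = sero1

* Positions of the brackets; strpos() returns 0 when the character is absent
generate pos_open  = strpos(sero1, "[")
generate pos_close = strpos(sero1, "]")

* Keep only the text between the brackets when both are present
replace sero1_new = substr(sero1, pos_open + 1, pos_close - pos_open - 1) if pos_open > 0 & pos_close > pos_open

drop pos_open pos_close
list
```

Because no variable is created conditionally, the same lines should run whether or not any observation contains brackets.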
For the data cleaning I need this automated so that it will work for several sero (i.e. sero1, sero2, sero3 etc) variables, regardless of whether they contain the square brackets. The coding is shown below:
Note: max_sero is the maximum number of sero variables in the dataset (i.e. global max_sero 3 if there are 3 variables). As with the simplified example, this works if there are square brackets but not if there aren't.
Is there a way I can use the same code regardless of whether there are brackets or not? I have a workaround, but it is a bit messy (e.g. only adjusting the variables with brackets).
forval x = 1/$max_sero {
    split sero`x', p("[")
    split sero`x'2, p("]")
    // generate sero*_clean variable, use the value present if no brackets
    gen sero`x'_clean = sero`x'1 if sero`x'21 == ""
    // update to the final serotype call given inside the brackets for closely related calls
    replace sero`x'_clean = sero`x'21 if sero`x'21 != ""
    generate sero`x'_clean_test = 1 if sero`x'_clean != sero`x'
}
I also tried to approach this issue using ustrregexra, but could only work out how to remove the text inside the brackets, not keep it (e.g. replace sero1 = ustrregexra(sero1, "\[.*?\]", "") if strpos(sero1, "[")).
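For the regex route, keeping the bracketed text rather than deleting it might be done with ustrregexm() and a capture group, read back with ustrregexs(1). This is a minimal sketch, not a tested solution for the real data: the example values come from Examples A and C, max_sero is set by hand to 2, and it assumes at most one [ ] pair per value.

```stata
* Sketch only: values are taken from Examples A and C; max_sero is set by hand
clear
input str20 sero1 str20 sero2
"13/14/12B[14]" "3A"
"15A/F[15F]"    "3"
"2-like"        "1-like"
end
global max_sero 2

forvalues x = 1/$max_sero {
    * Default: keep the value as listed
    generate sero`x'_clean = sero`x'
    * If there is a [...] group, keep only what is inside it
    replace sero`x'_clean = ustrregexs(1) if ustrregexm(sero`x', "\[(.*?)\]")
    * Flag observations where the cleaned value differs from the original
    generate sero`x'_clean_test = 1 if sero`x'_clean != sero`x'
}
list
```

Here sero2 has no brackets at all, so the replace simply changes nothing for it and the loop does not depend on which variables split would have created.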
Thank you for your help!