Hi all,
RE: split using Stata version 16.0.
As part of cleaning my data sets, I need to generate a new variable that holds the final result for an existing variable. The variable it is derived from (sero) includes preliminary results. A given data set has multiple "sero" variables (sero1, sero2, sero3), so I would like an automated way to do this.
sero is a string variable of varying length (Example A); it may be solely numeric (Example B) or may contain symbols (Example C). I would like to generate a new variable that keeps the value as listed, except when there are square brackets ([ ]). When there are square brackets, the final value is whatever is listed inside the brackets. For example, 13/14/12B[14] should become 14. The "split" command works well for this when the data set is large, as the variable will then typically include a mix of all types of values (e.g. Example D). However, when there are few observations I run into problems, because the variable does not always contain square brackets (e.g. Examples A and B).
Example A
sero
3A
3
1-like
Example B (note: this is still a string variable)
sero
1
3
5
Example C
sero
13/14/12B[14]
15A/F[15F]
2-like
Example D
sero
13/14/12B[14]
15A/F[15F]
2-like
3A
1
2
4
4-like
32A/12F/19A[32A]
For a single variable, a simple example is listed below. It works perfectly if at least one observation contains square brackets, but fails if none does.
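split sero1, p("[")    // sero11 = text before "[", sero12 = text after it
split sero12, p("]")   // sero121 = text inside the brackets
gen sero1_new = sero11 if sero121==""   // no brackets: keep the value as listed
replace sero1_new = sero121 if sero121!=""   // brackets: use the value inside them
drop sero11 sero12 sero121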
For the data cleaning I need this automated so that it works for several sero variables (sero1, sero2, sero3, etc.), regardless of whether they contain square brackets. The code is shown below.
Note: max_sero is the maximum number of sero variables in the data set (e.g. global max_sero 3 if there are 3 variables). As with the simplified example, this works if there are square brackets but not if there aren't.
Is there a way I can use the same code regardless of whether there are brackets or not? I have a workaround, but it is a bit messy (e.g. adjusting only the variables that contain brackets).
forval x=1/$max_sero {
    split sero`x', p("[")
    split sero`x'2, p("]")
    gen sero`x'_clean = sero`x'1 if sero`x'21=="" //generate sero*_clean variable, use the value present if no brackets
    replace sero`x'_clean = sero`x'21 if sero`x'21!="" //updating to final serotype call given inside the bracket for closely related calls
    generate sero`x'_clean_test = 1 if sero`x'_clean!= sero`x'
}
I also tried to approach this using ustrregexra, but could only work out how to remove the text inside the brackets, not keep it (e.g. replace sero1=ustrregexra(sero1, "\[.*?\]" , "" ) if strpos(sero1,"[") ).
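For reference, one possible direction is sketched below. It is not a confirmed solution: it assumes each value contains at most one [...] group and that Unicode regex functions are available (Stata 14 or later). It swaps split for ustrregexm() with a capture group and ustrregexs(1) to pull out the captured text. As far as I can tell, the split approach breaks because split creates only one new variable when the separator never appears, so sero`x'2 does not exist for the second split. The toy data and the value of max_sero here are made up for illustration; the variable names follow the post.

```stata
* Toy data (made up for illustration), mixing bracketed and plain values
clear
input str20 sero1 str20 sero2
"13/14/12B[14]" "3A"
"15A/F[15F]"    "3"
"2-like"        "1-like"
end

* Number of sero variables, as described in the post
global max_sero 2

forvalues x = 1/$max_sero {
    * Start from the value as listed
    generate sero`x'_clean = sero`x'
    * Where a [...] group exists, keep only the text inside it
    replace sero`x'_clean = ustrregexs(1) if ustrregexm(sero`x', "\[(.*?)\]")
    * Flag observations whose value changed
    generate sero`x'_clean_test = 1 if sero`x'_clean != sero`x'
}

list sero1 sero1_clean sero2 sero2_clean
```

Because the replace only touches observations where the pattern matches, the same loop should run whether or not a given sero variable contains any brackets, and no intermediate variables need to be dropped afterwards.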
Thank you for your help!