How to create binary variables from a composite string variable

Dear Community,
I am relatively new to Stata and am still learning how to use macros and loops, which I suspect may be part of the required solution.

I need to transform a string variable in a large dataset (~40,000 records) into a form that I can analyse. The string variable (e.g. procedure) is a composite in which each response is separated by a line break, char(10). Each response is a phrase (e.g. cholecystectomy). For each patient record the variable may be missing, hold one response (cholecystectomy) or hold multiple responses (cholecystectomy, char(10), appendectomy, etc.). I would like to create a new binary variable for each of the procedures present. There are approximately 100 different procedure values. I don't have access to a reference list of all possible values this variable can take, so I would like the new variable names to be generated from the values already within this variable.

About 20 other string variables in this dataset are coded the same way, so the method I use needs to be easily reproducible.

I can use the split command to parse by char(10), but this creates many new string variables (e.g. procedure1, procedure2, ...) rather than the binary variables I need (e.g. cholecystectomy, appendectomy).

Code:

split procedure, parse(`=char(10)')

A solution to a similar problem, "How to create binary variables from words in phrases?", was posted here, but it is designed around the new variables of interest being words within phrases. I haven't been able to adapt the code to phrases delimited by line breaks within a long string. Here is the solution posted by Robert Pickard, which seems to be on the right track:

Here's another approach

Code:

clear
input byte id str17 phrase
1 "new video post"
2 "newer tweet"
3 "add 12 new photos"
4 "removed 1 photo"
end
format %-17s phrase

split phrase
local nwords = r(nvars)
forvalues i = 1/`nwords' {
    levelsof phrase`i', clean
    foreach word in `r(levels)' {
        // in case a word is not a valid Stata name
        local vname = strtoname("`word'")
        // ignore error if var already exists
        cap gen byte `vname' = 0
        // look for space delimited words
        qui replace `vname' = 1 if strpos(" " + phrase`i' + " "," `word' ")
    }
}

Thank you for sharing your expertise.

Shamil
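
For illustration, here is a minimal sketch of one way the quoted approach might be adapted to whole phrases separated by char(10). It is not a tested solution for the real dataset. The toy data, the "|" placeholder used to type the line breaks, and the _part stub are assumptions made for this example. It relies on split with parse(`=char(10)') behaving as in the command shown earlier, and it matches each split piece exactly rather than searching for a word inside it.

```stata
clear
input byte id str40 procedure
1 "cholecystectomy"
2 "cholecystectomy|appendectomy"
3 ""
4 "hernia repair"
end
// toy data only: turn the "|" placeholder into real line breaks, char(10)
replace procedure = subinstr(procedure, "|", char(10), .)

// one string variable per response; _part is a hypothetical stub name
split procedure, parse(`=char(10)') generate(_part)
local nparts = r(nvars)
local parts `r(varlist)'

forvalues i = 1/`nparts' {
    // distinct phrases found in this position (empty strings are skipped)
    quietly levelsof _part`i', local(phrases)
    foreach p of local phrases {
        // turn the phrase into a legal name, e.g. "hernia repair" -> hernia_repair
        // note: long or similar phrases could collide after conversion
        local vname = strtoname(`"`p'"')
        // ignore the error if the indicator already exists
        capture generate byte `vname' = 0
        // exact match on the whole phrase, not a word within it
        quietly replace `vname' = 1 if _part`i' == `"`p'"'
    }
}

// remove only the helper variables created by split
drop `parts'
list, noobs
```

In this sketch a record with a missing procedure gets 0 on every indicator; if missing should stay missing, the indicators would need to be reset for those records afterwards. To repeat this for the other composite variables, the block could be wrapped in a foreach loop over their names, probably with a per-variable prefix on the new names to avoid clashes.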

BJ Data Tech Solution
