Hello,
I want to generate a new variable that displays the words found in a string variable that match my local list of words.
The context is that I want to create a profanity filter. However, not all words are considered profane in every context.
Therefore, I want to see which words my profanity filter is classifying as profane.
The approach so far:
gen profanitydummy = 0
gen profanitycount = 0
local badwords "badword1 badword2 badword3 badword4"
foreach b in `badwords' {
replace profanitydummy = 1 if strpos(varstring, " `b' ") != 0
replace profanitycount = profanitycount + 1 if strpos(varstring, " `b' ") != 0
}
This results in a dummy equal to 1 if a word in varstring matches a word in the local badwords.
In addition, it counts the number of unique bad words used in the string.
The local badwords list contains approx. 1100 words, which I took from a researcher who gathered "offensive" words.
I now want to know for which words the profanity dummy indicates that there is a bad word in varstring.
My approach:
gen badwordinstring = ""
foreach b in `badwords'{
replace badwordinstring = " `b' " if strpos(varstring), " `b' ")
}
However, I get the error message "invalid syntax" and can't figure out where the problem is.
My desired result would be: badwordinstring: "badword5 badword7"
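A possible cause of the error, offered as a sketch rather than a definitive diagnosis: in the replace line, the closing parenthesis after varstring ends the strpos() call before its second argument, so Stata cannot parse the if condition. Beyond that, assigning " `b' " directly would overwrite earlier matches, so only the last matching word would survive the loop. The version below moves the parenthesis and appends each match to what is already stored. It assumes the local badwords and the variable varstring are defined as above and are run in the same do-file, and that strtrim() is available (Stata 14 or newer; trim() is the older name).

```stata
* Sketch: collect every matching bad word per observation.
* Assumes local badwords and string variable varstring exist as above.
gen badwordinstring = ""
foreach b in `badwords' {
    * append the word instead of overwriting earlier matches
    replace badwordinstring = badwordinstring + " `b'" if strpos(varstring, " `b' ") > 0
}
* drop the leading blank left by the concatenation
replace badwordinstring = strtrim(badwordinstring)
```

Note that searching for " `b' " with a blank on both sides will miss a bad word at the very start or end of varstring, or one next to punctuation. Searching in " " + varstring + " " instead would catch the first and last word.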
In addition, as of right now my profanitycount only counts the unique bad words used in the string.
Do you have a hint on how to change it to the absolute number of bad words in the string?
For example, if badword1 is used 2 times and badword2 is used 5 times, the variable should indicate 7; as of right now I am only able to get the number of unique bad words.
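One way to get the absolute count, sketched under assumptions: remove every occurrence of a word with subinword(), compare the string length before and after, and divide the difference by the length of the word. The variable name profanitytotal is hypothetical, and the sketch assumes that words in varstring are separated by blanks, since subinword() treats space-separated tokens as words.

```stata
* Sketch: total number of bad-word occurrences per observation.
* Assumes local badwords and string variable varstring exist as above.
* profanitytotal is a hypothetical name chosen for this example.
gen profanitytotal = 0
foreach b in `badwords' {
    * characters removed, divided by word length = occurrences of this word
    replace profanitytotal = profanitytotal + (strlen(varstring) - strlen(subinword(varstring, "`b'", "", .))) / strlen("`b'")
}
```

With this approach, a string containing badword1 twice and badword2 five times should yield 7. Words attached to punctuation (for example "badword1,") would not be counted unless the punctuation is stripped first.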
Thank you in advance.