I am working with a dataset of research articles in which the full text of each article is stored in a single cell. I would like to split this string variable into a column vector containing every individual word in the string. I have a working solution, but it feels inefficient because it needs one loop iteration per word. My current process is described below.
For this description, I will be working with three different types of variables:
- Stringvar - The variable containing the text. For the sake of this example, it will contain the text "A small cat"
- Word_var - A column vector in which to compile all individual words of "Stringvar"
- Var`i' - A set of placeholder variables, each containing one word from "Stringvar"
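As a small illustration of the two built-in string functions this setup relies on, `wordcount()` and `word()`, here is how they behave on the example text. This is only a sketch for orientation and is meant to be run from a do-file:

```stata
* Illustration only: the two built-in string functions on the example text
display wordcount("A small cat")      // number of words: 3
display word("A small cat", 2)        // second word: small
```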
My approach so far is shown in the code below:
----------------------------------------------------------------------------------------------------------------------------------
clear all
set obs 1
gen Stringvar="A small cat"
// First I need to generate the same number of rows as number of words, which is done using
// the wordcount() function and a macro
gen wordcount=wordcount(Stringvar)
egen s=max(wordcount)
replace wordcount=s
drop s
// Store the word count of the first observation in a local macro (no global needed)
local wordcount=wordcount[1]
set obs `wordcount'
// Next, I create the placeholder, "Word_var"
gen Word_var=""
// Next, I loop over all words of Stringvar and create Var`i' that contains each word:
// In this example, this means that I will create variables Var1, Var2 and Var3.
forvalues i=1/`wordcount' {
    // Variable names are case-sensitive, so this must be Var`i', not var`i'
    gen Var`i'=word(Stringvar,`i')
    // Next, I extend them so that each row of Var`i' contains the same word:
    replace Var`i'=Var`i'[_n-1] if Var`i'==""
    // Lastly, I compile them into "Word_var":
    replace Word_var=Var`i' if `i'==_n
}
// (In reality, I also drop each Var`i' at the end of its iteration to reduce the number of variables)
----------------------------------------------------------------------------------------------------------------------------------
In the end, I have created three variables (one for each word) using the loop and compiled their values into "Word_var":
Stringvar   | Word_var | Var1 | Var2  | Var3
A small cat | A        | A    | small | cat
            | small    | A    | small | cat
            | cat      | A    | small | cat
The problem is that a typical document of 10,000 words requires 10,000 loop iterations, each creating a new variable, to get through the text.
My question is now:
Is there a smarter way of doing this?
Sincerely
Johan Karlsson
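
One possible loop-free direction for the question above, offered as a sketch rather than a definitive answer: `expand` creates one row per word, and `word()` is then evaluated once across all rows instead of once per loop iteration. The names `doc_id`, `n_words` and `word_pos` are hypothetical helpers that do not appear in the original post. The sketch assumes the text fits in a Stata string variable (a 10,000-word article would presumably need a `strL`), and its speed on long texts has not been checked.

```stata
clear all
set obs 1
gen Stringvar = "A small cat"

* Hypothetical helper: identifies each article (one per original row)
gen long doc_id = _n

* Hypothetical helper: number of words in each article
gen long n_words = wordcount(Stringvar)

* Duplicate each row so that there is one row per word
expand n_words

* Hypothetical helper: position of the word within its article
bysort doc_id: gen long word_pos = _n

* Pick out the word at that position, for all rows at once
gen Word_var = word(Stringvar, word_pos)

list doc_id word_pos Word_var
```

With several articles in the dataset, `doc_id` keeps the words of each article together and `word_pos` records their order within the article.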