Dear Statalist,

I am currently working with a dataset of research articles, where the full text of each article is stored in a single cell of a string variable. I would like to break this string into a column vector containing each individual word of the text. I have worked out a solution, but it feels inefficient because it involves a large number of loop iterations. My current process is described below.

For this description, I will be working with three different types of variables:
  • Stringvar - The variable containing the text. For the sake of this example, it will contain the text "A small cat"
  • Word_var - A column vector in which to compile all individual words of "Stringvar"
  • Var`i' - A set of placeholder variables, each containing one word from "Stringvar"
Initially, each dataset contains only "Stringvar" and a single row.

My current approach is shown in the code below:

----------------------------------------------------------------------------------------------------------------------------------

clear all
set obs 1
gen Stringvar="A small cat"

// First I need as many rows as there are words, which I get from the
// wordcount() function and a local macro

// The text sits in the first (and only) row, so the count is read from there
local wordcount=wordcount(Stringvar[1])
set obs `wordcount'

// Next, I create the placeholder, "Word_var"

// Created directly as an empty string variable (no tostring step needed)
gen Word_var=""

// Next, I loop over all words of Stringvar and create Var`i' that contains each word:
// In this example, this means that I will create variables Var1, Var2 and Var3.

forvalues i=1/`wordcount' {
    // Variable names are case sensitive, so this must be Var`i', not var`i'
    gen Var`i'=word(Stringvar,`i')

    // Next, I extend them so that each row of Var`i' contains the same word:

    replace Var`i'=Var`i'[_n-1] if Var`i'==""

    // Lastly, I compile them into "Word_var":

    replace Word_var=Var`i' if `i'==_n
}
// (In reality, I also drop Var`i' at the end of each iteration to reduce the number of variables)

----------------------------------------------------------------------------------------------------------------------------------
In the end, I have created three variables (one for each word) using the loop and compiled their values into "Word_var":
Stringvar      Word_var   Var1   Var2    Var3
A small cat    A          A      small   cat
               small      A      small   cat
               cat        A      small   cat





The problem is that a typical document of 10,000 words requires 10,000 loop iterations, each of which creates and fills a new variable.

My question is now:

Is there a smarter way of doing this?
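
For reference, here is a loop-free direction that might work, written as a rough sketch on the same toy example. It relies on expand to create one copy of the row per word and on word() to pick the k-th word in the k-th copy. The names doc_id and n_words are hypothetical and introduced only for this illustration. I have not checked how it behaves on long texts, in particular memory use, since every copy of the row carries the full text.

```stata
* Sketch only: expand + word() instead of one loop iteration per word.
* doc_id and n_words are hypothetical names used for this illustration.
clear all
set obs 1
gen Stringvar = "A small cat"

* One identifier per article, and the number of words in each article
gen long doc_id = _n
gen long n_words = wordcount(Stringvar)

* Create one copy of each row per word in that row's text
expand n_words

* Within each article, the k-th copy keeps the k-th word
bysort doc_id: gen Word_var = word(Stringvar, _n)

list doc_id Word_var, noobs
```

On the toy example this should leave three rows whose Word_var values are the three words of "A small cat", without creating Var1, Var2 and Var3.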

Sincerely
Johan Karlsson