Hello all,

I am trying to import and clean a number of documents (imported into a dataset as a single variable) for later analysis.

Each document consists of long dialogs between speakers, where each speaker is identified by a label in parentheses. Some of the documents are very long and include thousands of statements (more than the total number of variables I can add in my flavor of Stata).

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str13 filename str235 text
"document1.txt" "(speaker a): Lorem ipsum dolor sit amet. (speaker b): Ut enim ad minim veniam. (speaker c): quis nostrud exercitation ullamco laboris nisi. (speaker x): Tincidunt vitae."
"document2.txt" "(speaker f): Tortor consequat id porta nibh venenatis. (speaker g:) Enim sed."                                                                                            
""              "(speaker h:) Tincidunt vitae semper. (speaker i): quis lectus nulla at volutpat diam. (speaker j): Quis varius quam quisque."                                             
end


In order to process the documents, I use split text, p("):"). Once I have split the document, I then reshape it to long (so that each individual statement is a separate observation).
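
For reference, this is roughly what that workflow looks like on the example data above. It is only a minimal sketch: stmt, doc_id, and seq are placeholder names I made up for illustration, and I use doc_id rather than filename as the identifier because filename is empty in the third example row.

```stata
* Sketch of the split-then-reshape workflow on the -dataex- example
* (stmt, doc_id, and seq are hypothetical names)
split text, parse("):") generate(stmt)   // creates stmt1, stmt2, ... one per piece
generate long doc_id = _n                // one row per document, so _n identifies it
reshape long stmt, i(doc_id) j(seq)      // one observation per piece
drop if missing(stmt)                    // shorter documents leave empty pieces
```

The split line is the step that fails on the long documents, because it needs one new variable per piece.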

However, as noted, some of the documents are so long that the split will generate too many variables. (The longest document has around 3,000 statements).

I have several options I am thinking about conceptually, but I'm not sure how to properly execute them.

-The easiest would be to simply cut the string in half, and then run split in two different datasets. The problem, however, is that a) the string length varies dramatically (300,000 to 900,000 characters), and b) it is not well correlated with the number of statements (some of the statements are very short interjections).

-Use moss text, match("):") (moss is from SSC) to identify all the instances of "):" in each string. Using the string positions that moss returns, I could then cut each string in two at roughly the 1,500th instance of "):" and run split on the two halves separately (avoiding the variable limit).

However, I'm not sure how to use the value of _pos1500 to cut each string individually into two parts.

It would look something like this (not using code that works):

Code:
gen text1 = text

replace text= strpos(text, 1, [value of _pos1500])
replace text1= strpos(text1, [value of _pos1500], [end of the string])
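
One way the cut might be done is with substr() rather than strpos(), since strpos() only returns a position while substr() extracts a piece of the string. The sketch below is an assumption-laden illustration, not tested on the real documents: it finds the end of the n-th "):" by stepping through each string, so it does not rely on _pos1500 existing. The names cutpos, rest, hit, text_a, and text_b are hypothetical, and n is set to 2 only so that it does something on the toy example (it would be 1500 for the real data).

```stata
* Sketch: cut each string just after its n-th "):" (all new names are hypothetical)
local n 2                            // 1500 for the real documents
generate long cutpos = 0             // byte position where the n-th "):" ends
generate strL rest = text            // part of the string not yet searched
generate long hit = .
quietly forvalues k = 1/`n' {
    replace hit = strpos(rest, "):")                     // 0 if no further match
    * move past this match; cutpos turns missing if a string has fewer than n matches
    replace cutpos = cond(hit > 0, cutpos + hit + 1, .)
    replace rest = substr(rest, hit + 2, .) if hit > 0
}
drop hit rest

* strings with fewer than n matches stay whole in text_a
generate strL text_a = text
generate strL text_b = ""
replace text_a = substr(text, 1, cutpos) if !missing(cutpos)
replace text_b = substr(text, cutpos + 1, .) if !missing(cutpos)
```

Both strpos() and substr() count in bytes, so the positions should stay consistent with each other even if the text contains non-ASCII characters. With this cut, text_a ends in "):" and text_b begins with the statement that followed it, so the speaker label for that statement is the last piece of text_a.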

I'd appreciate any advice anyone has on this problem, or another way to think about it entirely!

Thanks.