Hello! I am a new user of Stata and I have a problem with a task I have to solve.
As a part of my thesis, I have to do a text analysis. Afterwards, I will perform the statistical analysis with Stata and I am not sure whether I can use Stata for the text analysis as well.

I have tried it and researched a lot. But I am not sure whether my approach is correct.
There are three aspects I want to address in this post:

As the most relevant aspect, I would like to generate a new variable that shows the frequency of a key word (substring) in a text (string variable), for instance "machine learning". What is the best way? Would you recommend using Stata, the WordStat integration, or Python? I have tried three options, but so far none of them has been successful.
  1. I have installed Wordstat in Stata and used it to import PDF files of annual reports. These reports are stored as text / string in the variable DOCUMENT in Stata. I would like to use Wordstat for the text analysis as well. So, I used User>Wordstat>content analysis and I generated a dictionary with key words. However, when using frequencies I get the error "No valid cases".
  2. So far, I have found functions in Stata that show whether a particular substring is inside a string (strpos, regexm, substr, subinstr), but I need to know the frequency. As far as I can tell, noccur is the only command that offers this. However, when I use it, the calculated frequency for some key words is lower than the actual frequency I find with Command+F in the PDF file or the text file of the same annual report.
  3. Python can be integrated in Stata, but I have not figured out how the two interact: how to use a variable from the Stata dataset in a Python command, and how to export the Python results back to the Stata dataset as a new variable. In Python I have found the regex function re.findall(pattern, string, flags=0). Is it recommended to use Python instead of Stata for the text analysis and to do the statistics in Stata afterwards? Should I install Python and load the files there? Then I would need to save the Python table as an Excel file and create a Stata file from it. With one variable that is the same in both Stata datasets, merging should be possible afterwards. Is that correct?
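
To make option 3 concrete, here is a minimal sketch of the counting step in plain Python. It assumes the report texts are already available as Python strings. The function name count_keyword and the sample texts are made up for illustration. The match is case-insensitive and tolerates line breaks between the words of the key word, which might be one reason a count comes out lower than a manual search in the PDF, although that is only a guess.

```python
import re

def count_keyword(text, keyword):
    """Count non-overlapping, case-insensitive occurrences of keyword in text."""
    # Allow any run of whitespace (spaces, tabs, line breaks) between the
    # words of a multi-word key word such as "machine learning".
    parts = [re.escape(part) for part in keyword.split()]
    pattern = r"\s+".join(parts)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical stand-ins for the contents of the DOCUMENT variable
documents = [
    "Machine learning is used widely. We invest in machine\nlearning.",
    "No relevant terms here.",
]

counts = [count_keyword(doc, "machine learning") for doc in documents]
print(counts)  # [2, 0]
```

Inside Stata (version 16 or later), the same function could presumably be run in a python: block, using the Data class of the sfi module to read DOCUMENT and store the counts in a new variable, which would avoid the Excel round trip.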
In a second step, I would like to remove stop words in order to obtain the total number of relevant words in a document. The variable DOCUMENT stores the text files as strings that contain both relevant words and stop words. I would like to generate a new variable that contains the strings without stop words. I think it is possible to use the command txttool, but so far I have not been successful.
For the stop word removal I would use the lists from WordStat, which provides stop word lists in the four languages I need: English, French, German and Spanish. How can I use Stata and/or WordStat for this? Is it recommended to use Stata and/or WordStat, or Python?
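
As a rough illustration of the stop word step in Python, the sketch below lowercases a text, splits it into word tokens and drops every token found in a stop word set. The tiny STOP_WORDS set and the function name remove_stop_words are placeholders. In practice the set would be loaded from the stop word list of the relevant language.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased text with all stop words removed."""
    # \w+ matches Unicode word characters, so accented letters and umlauts
    # in French, German or Spanish text stay inside their words.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)

cleaned = remove_stop_words("The use of machine learning is central to the strategy.")
print(cleaned)               # use machine learning central strategy
print(len(cleaned.split()))  # 5 relevant words
```

Note that this simple tokenizer also drops punctuation, so the result is a bag of words rather than the original sentence structure.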

Moreover, I am not sure if a sentiment analysis is possible in Stata itself or with the integration of Python. The sentiment analysis could show which documents use rather positive or negative words in the same sentences that contain a particular key word.
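
To show what I mean by sentiment in the same sentence as a key word, here is a minimal lexicon-based sketch in Python. The POSITIVE and NEGATIVE word sets, the function name keyword_sentiment and the sample text are all invented for illustration. A real analysis would need a proper sentiment dictionary and a more robust sentence splitter.

```python
import re

# Hypothetical mini-lexicons; a real dictionary would be much larger
POSITIVE = {"growth", "improve", "improved", "strong", "success"}
NEGATIVE = {"risk", "loss", "decline", "weak", "failure"}

def keyword_sentiment(text, keyword):
    """Positive minus negative word count over sentences containing keyword."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    score = 0
    for sentence in sentences:
        lowered = sentence.lower()
        if keyword.lower() not in lowered:
            continue  # only sentences that mention the key word are scored
        tokens = re.findall(r"\w+", lowered)
        score += sum(t in POSITIVE for t in tokens)
        score -= sum(t in NEGATIVE for t in tokens)
    return score

doc = ("Machine learning drove strong growth this year. "
       "Currency risk caused a loss. "
       "Machine learning also carries some risk.")
print(keyword_sentiment(doc, "machine learning"))  # 1
```

Here the first sentence contributes +2, the second is ignored because it does not mention the key word, and the third contributes -1.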

Thank you and best regards