Dear Stata list,

I have thought this over many times, as I did not expect to find anything surprising in such a basic string function as trim(), but hear me out on this one.

While not necessary to understand what I find to be unexpected behavior, here is some background on what I am working on. I am working on a command that is used to test ODK-based questionnaire forms. Read more about it here as well if you want. The command is still in development, so the documentation is not yet at the level it will be when we publish it, but I always want to give some context when I am asking a question.

Since the questionnaires are written in Excel files, I use import excel to import several columns from an Excel sheet, many of which consist of string values. Some of the tests my command runs are sensitive to leading and trailing spaces, so I must remove those spaces so that "ABC " becomes "ABC". I use the function trim() for that. However, in the data set that can be accessed here, I have one string variable with 10 observations, 3 of which have values on which trim() does not work as I expected.

When I run
Code:
replace name = trim(name)
Stata reports that no real changes were made, but when browsing or tabbing the variable it is obvious that three values have multiple trailing spaces. I used Nick Cox's command charlist to investigate and saw that some of the spaces were regular spaces with char code 32, but some of the spaces have char code 160. I have found on the internet that those are so-called non-breaking spaces. See for example here. Since UTF-8 support has been introduced in later versions of Stata, I would have expected trim() to work on those alternative spaces too.
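
For anyone who cannot open the data set, here is a minimal sketch of what I believe is the same behavior. It is an illustration only, not my actual data: the value "ABC" and the use of char() to build the string are assumptions made up for this example. char(160) reflects the char code I see in Stata 13.1; in Stata 14 and later a non-breaking space is stored as a multibyte UTF-8 character, so uchar(160) would probably be the closer analogue there. The last replace is only a possible workaround, not something I claim is the recommended approach.

Code:
*Hypothetical example value: "ABC" + regular space (32) + non-breaking space (160)
clear
set obs 1
generate name = "ABC" + char(32) + char(160)

*trim() only strips blanks (char code 32) from the ends of the string,
*and the last character here is 160, so this should leave the value as it is
replace name = trim(name)

*Possible workaround: turn every char(160) into a regular space, then trim
replace name = trim(subinstr(name, char(160), " ", .))

*Show the value between brackets so any remaining trailing spaces are visible
display "[" name "]"
In this sketch the first replace leaves the value untouched, which matches what I see in my data, while the second one removes both kinds of trailing spaces.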

If you load the data set linked to above and run this code you can see what I see:

Code:
*Open data and show string values
use trim_example.dta, clear
tab name

*Replacing leading and trailing spaces
replace name = trim(name)
tab name //check that spaces were not removed

*Keep only one of these values for less cluttered charlist result
keep in 7

*Install charlist and return all char codes used in the string value.
ssc install charlist
charlist name
return list //See that regular space 32 and non-breaking space 160 are both used.
I am using Stata 13.1, but my colleague, who first reported this bug during testing, was using Stata 15, and I am pretty sure that what I reproduced is the same bug.