Hi Statalist.

I am trying to use regexs() and regexm() to extract information from strings. A sample is below. The strings contain town names, which always come first, and firm names, which always come second. The town and firm names are usually separated by a comma, but could be separated by any punctuation character. Infrequently, punctuation appears before the town name. The command that I wrote extracts only part of the town name:

gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

Please let me know if you have an idea about how I could extract the entire town name.

Sample data, code, and output appear below.

Thanks

Gary


input str60 LocationFirm
"Albertville,First*................"
"Albertville,Albertville-"
"Anniston.Anniston—--"
"Anniston;Commercial-."
"Anniston^Blender-."
"Decatur,MorganCounty"
"..Decatur,Jupiter"
end
gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")
list city


. clear

. do "C:\Users\garyr\AppData\Local\Temp\STD462c_000000.tmp"

. input str60 LocationFirm

LocationFirm
1. "Albertville,First*................"
2. "Albertville,Albertville-"
3. "Anniston.Anniston—--"
4. "Anniston;Commercial-."
5. "Anniston^Blender-."
6. "Decatur,MorganCounty"
7. "..Decatur,Jupiter"
8. end

. gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

. list city

     +---------+
     |    city |
     |---------|
  1. |   Alber |
  2. |   Alber |
  3. | Annisto |
  4. | Annisto |
  5. | Annisto |
     |---------|
  6. |   Decat |
  7. |   Decat |
     +---------+

.
end of do-file
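
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the output above is consistent with [:punct:] being read as an ordinary bracket expression that matches any one of the characters :, p, u, n, c, t, because a POSIX class name only has its special meaning inside an outer pair of brackets. Under that reading the capture stops just before the last t, n, or u that the pattern can reach, which gives Alber, Annisto, and Decat. Whether the doubled form [[:punct:]] is accepted depends on the regular-expression engine behind regexm() in the installed Stata release, so the variant below avoids character classes altogether: it skips any leading non-letters and captures the first run of letters. It assumes town names consist only of the letters a-z and A-Z (no spaces, hyphens, or apostrophes), and city2 is a hypothetical variable name.

```stata
* Sketch only: rebuild the sample data from the post.
clear
input str60 LocationFirm
"Albertville,First*................"
"Albertville,Albertville-"
"Anniston.Anniston—--"
"Anniston;Commercial-."
"Anniston^Blender-."
"Decatur,MorganCounty"
"..Decatur,Jupiter"
end

* ^[^a-zA-Z]*  skips any punctuation (or other non-letters) before the town.
* ([a-zA-Z]+)  captures the first unbroken run of letters as group 1.
* Assumption: town names contain letters only; city2 is a hypothetical name.
gen city2 = regexs(1) if regexm(LocationFirm, "^[^a-zA-Z]*([a-zA-Z]+)")
list LocationFirm city2
```

On the seven sample rows this should give Albertville, Albertville, Anniston, Anniston, Anniston, Decatur, and Decatur. A town name containing a space or hyphen would be cut at that character, so the letter set in both bracket expressions would need to be widened for such data.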