Hi Statalist.
I am trying to use regexs and regexm to extract information from strings. A sample is below. The strings contain town names, which always come first, and firm names, which always come second. The town and firm names are usually separated by a comma, but could be separated by any punctuation character. Infrequently, punctuation appears before the town name. The command that I wrote extracts only part of the town name:
gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")
Please let me know if you have an idea about how I could extract the entire town name.
Sample data, code, and output appear below.
Thanks
Gary
input str60 LocationFirm
"Albertville,First*................"
"Albertville,Albertville-"
"Anniston.Anniston—--"
"Anniston;Commercial-."
"Anniston^Blender-."
"Decatur,MorganCounty"
"..Decatur,Jupiter"
end
gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")
list city
. clear
. do "C:\Users\garyr\AppData\Local\Temp\STD462c_000000.tmp"
. input str60 LocationFirm
LocationFirm
1. "Albertville,First*................"
2. "Albertville,Albertville-"
3. "Anniston.Anniston—--"
4. "Anniston;Commercial-."
5. "Anniston^Blender-."
6. "Decatur,MorganCounty"
7. "..Decatur,Jupiter"
8. end
. gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")
. list city
+---------+
| city |
|---------|
1. | Alber |
2. | Alber |
3. | Annisto |
4. | Annisto |
5. | Annisto |
|---------|
6. | Decat |
7. | Decat |
+---------+
.
end of do-file
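A possible explanation for the truncated names, offered as a sketch rather than a confirmed answer: written with single brackets, [:punct:] appears to be read as an ordinary character set containing ":", "p", "u", "n", "c", and "t", not as the punctuation class. That would account for the output above, since each match stops just before the last "t", "n", or "u" in the town name. One way to sidestep punctuation classes entirely is to capture the first run of letters, skipping any leading non-letters. The variable name city2 is hypothetical, and the pattern assumes town names consist only of unaccented letters, with no spaces or hyphens.

```stata
* Sketch, assuming town names contain only the letters a-z and A-Z.
* ^[^a-zA-Z]*  skips any leading punctuation (e.g. the ".." in row 7)
* ([a-zA-Z]+)  captures the first unbroken run of letters
gen city2 = regexs(1) if regexm(LocationFirm, "^[^a-zA-Z]*([a-zA-Z]+)")
list LocationFirm city2
```

On the sample data this should return the full town names, including for the row that begins with "..". It also avoids depending on whether a separator such as "^" counts as punctuation in a given regular expression engine.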