I am trying to obtain the distance between two geographic locations by public transportation. I used OpenTripPlanner to do so, and it allowed me to generate a string variable that contains this information. However, the string contains not only distances but also other information on the trip between the two locations, for example the route taken and the transportation modes.
The issue is that the string variable does not contain the total distance traveled. It contains a distance for each of the "legs" traveled between the two points. In other words, for each transportation mode taken between the two points, I have a distance. Therefore, I need to add up all those "sub-distances" to obtain the total distance traveled.
The string variable contains around 40,000 characters in each cell. However, it contains a repeated piece of text that identifies each "sub-distance." Whenever the string shows the text "realTime":false,"distance": the number that follows is the distance traveled using one transportation mode.
The first thing I tried was the regexs() function, as follows:
Code:
gen dist=regexs(0) if regexm(store_5am, `""realTime":false,"distance":[0-9]*"')
What I tried next was to split the string variable into several variables, with a split every time Stata runs into the text "realTime":false, in the string. I would then have a new set of string variables, and I would extract the number after "distance": from each of them.
The following are the commands I use to split the string variable and then extract the "sub-distance" from each new variable generated by the split:
Code:
split store_5am, parse(""realTime":false,") gen(d_)
forvalues n=1(1)15{
capture gen dist_`n'=regexs(0) if regexm(d_`n', `""distance":[0-9]*"')
}
The issue with the commands above is that I expected split to generate 15 or fewer variables, since the text "realTime":false,"distance": appears at most 15 times in each cell. However, when I execute the split command, I run into the error:
no room to add more variables
Up to 32,767 variables are currently allowed, which is the maximum you can set; see help memory.
The following is an example of my data. As I mentioned before, the string variable contains around 40,000 characters, so I cannot reproduce the data in full with dataex. Instead, this is a shortened version of my data for one observation, with information on two "sub-distances."
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str243 store_5am
1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
end
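For reference, here is a minimal sketch of one direction that avoids creating new variables with split: find the first occurrence of the marker text, add its number to a running total, remove that occurrence from a working copy of the string, and repeat until no occurrence is left. The names total_dist, rest, and hit are hypothetical. The sketch also assumes that regexm() handles strings of this length in the Stata version at hand (ustrregexm() and ustrregexs() are the Unicode counterparts in Stata 14 and later), and it uses [0-9.]+ rather than [0-9]* so that the decimal part of each distance is captured.

```stata
* Sketch only: sums every number that follows "realTime":false,"distance":
* total_dist, rest, and hit are hypothetical names, not part of the original data
gen double total_dist = 0
gen strL rest = store_5am      // working copy that shrinks as matches are removed
gen strL hit = ""              // holds the current match, marker text included
local more = 1
while `more' {
    replace hit = ""
    * regexs() must be evaluated in the same command as regexm()
    replace hit = regexs(0) if regexm(rest, `""realTime":false,"distance":[0-9.]+"')
    count if hit != ""
    local more = r(N) > 0
    if `more' {
        * strip the marker text, convert the remainder to a number, and accumulate
        replace total_dist = total_dist + real(subinstr(hit, `""realTime":false,"distance":"', "", 1)) if hit != ""
        * remove this occurrence so the next pass finds the following one
        replace rest = subinstr(rest, hit, "", 1) if hit != ""
    }
}
drop rest hit
```

The loop runs once per occurrence in the observation with the most matches, plus one final pass that finds nothing, so with at most 15 "sub-distances" per cell it should stop after at most 16 passes.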
Thank you!