Extracting the numbers after repeated characters in a string variable

Hello,

I am trying to obtain the distance between two geographic locations using public transportation. I used OpenTripPlanner to do so, and it allowed me to generate a string variable that contains this information. However, the string contains not only the distance but also other information on the trip between the two locations, such as the route taken and the transportation modes.

The issue is that the string variable does not contain the total distance traveled. Instead, it contains a distance for each of the "legs" traveled between the two points. In other words, I have one distance for each transportation mode taken between the two points. Therefore, I need to add up all those "sub-distances" to obtain the total distance traveled.

The string variable contains around 40,000 characters in each cell. However, it contains a repeated piece of text that identifies each "sub-distance." Whenever the string shows the text "realTime":false,"distance": the number that follows is the distance traveled using one transportation mode.

The first thing I tried was to use the functions regexm() and regexs() in the following way:

Code:

gen dist=regexs(0) if regexm(store_5am, `""realTime":false,"distance":[0-9]*"')

where store_5am is the string variable. Although that instruction does work, it only returns the first "sub-distance" and does not help me obtain the others.

So, what I tried next was to split the string variable into several variables, assuming the split would happen every time Stata runs into the text "realTime":false,"distance": in the string. Then, I would have a new set of string variables, and I would extract the number after that text in each one of them.

The following are the instructions I used to split the string variable and then extract the "sub-distance" from each new variable generated by the split:

Code:

split store_5am, parse(""realTime":false,") gen(d_)

forvalues n = 1/15 {
    capture gen dist_`n' = regexs(0) if regexm(d_`n', `""distance":[0-9]*"')
}

The code above assumes that the text "realTime":false,"distance": appears at most 15 times in the string, but there can be fewer occurrences than that.

I expected the split command to generate 15 or fewer variables, since the text appears at most 15 times. However, when I execute the split command, I run into the error:

no room to add more variables
Up to 32,767 variables are currently allowed, which is the maximum you can set; see help memory.

So, it seems the split command is not generating at most 15 variables, as I thought it would.

The following is an example of my data. As I mentioned before, the actual string variable contains around 40,000 characters, so I cannot post it in full using dataex. Instead, this is a shortened version of my data for one observation, with information on two "sub-distances."

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str243 store_5am
1 `"{"requestParameters":{"date":"04-18-2018","legs":[{"startTime":1524045600000,"endTime":1524046887000," "realTime":false,"distance":1596.257,"}][{startTime":1524047233000,"endTime":1524048303000,"realTime":false,"distance":11730.58983892094,"}]"'
end

So, could you help me come up with a better approach to obtain all the "sub-distances," so that I can calculate the total distance traveled between the two locations?
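
For illustration, here is a minimal sketch of one possible direction, not a tested solution for the real data. It assumes that every "sub-distance" directly follows the text "realTime":false,"distance": and is a plain decimal number. It uses a made-up, shortened toy string in place of the real store_5am, and the names key, keylen, rest, pos, remaining, and total_dist are hypothetical. The idea is to find the marker with strpos(), cut the string just after it, read the number at the start of what remains, and repeat until no marker is left. Note that the pattern allows a decimal point, whereas [0-9]* alone would stop at the integer part.

```stata
* Toy data: a hypothetical, shortened stand-in for the real store_5am
clear
input byte id str120 store_5am
1 `"{"legs":[{"realTime":false,"distance":1596.257},{"realTime":false,"distance":11730.5}]}"'
2 `"{"legs":[]}"'
end

* Marker that precedes every leg distance; it contains double quotes,
* so compound double quotes are used wherever it is written out
local key `""realTime":false,"distance":"'
local keylen = strlen(`"`key'"')

* Working copy that is consumed one leg at a time
gen strL rest = store_5am
gen double total_dist = 0
gen long pos = strpos(rest, `"`key'"')

count if pos > 0
local remaining = r(N)
while `remaining' > 0 {
    * Drop everything up to and including the marker
    replace rest = substr(rest, pos + `keylen', .) if pos > 0
    * The number now sits at the start of rest; add it to the running total
    replace total_dist = total_dist + real(regexs(1)) if pos > 0 & regexm(rest, "^([0-9]+\.?[0-9]*)")
    * Look for the next marker in what is left
    replace pos = strpos(rest, `"`key'"')
    count if pos > 0
    local remaining = r(N)
}

drop rest pos
list id total_dist
```

Observations without any marker keep a total of 0. Because the loop runs once per marker in the longest string rather than creating one variable per piece, it does not depend on knowing the maximum number of legs in advance.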

Thank you!

