Hello everybody,

I apologize in advance for the lengthy post, but I want to provide some context for my question.

I am trying to investigate the factors and circumstances that influence transit users' trip information-seeking behaviour, using data from a transit trip planning app. Currently, I have a dataset of hourly transit trip look-ups within different zones over a 6-month period. It is set up as panel data, with N = 1062 zones and T = 4321 hourly time periods. Below is an example of what the dataset looks like, although it shows only a few of the roughly 20 independent variables I plan on testing. Some of these are continuous (e.g., average temperature, average income of a zone, distance to the nearest rapid transit station), and others are dummy variables indicating time of day or day of week (e.g., whether it is a weekend).

Zone ID  Timestamp            # of trip searches  APC  Avg. Temp. (°C)  Weekend  ...
101      2018-09-03 00:00:00  2                   3    11.4             0        ...
101      2018-09-03 01:00:00  0                   1    10.6             0        ...
101      2018-09-03 02:00:00  0                   0    8.55             0        ...
101      2018-09-03 03:00:00  1                   0    9.7              0        ...
...      ...                  ...                 ...  ...              ...      ...
101      2018-09-03 23:00:00  1                   0    9.8              0        ...
102      2018-09-03 00:00:00  0                   1    11.4             0        ...
102      2018-09-03 01:00:00  0                   1    10.6             0        ...
...      ...                  ...                 ...  ...              ...      ...

During my initial investigation of the data, I thought I should use either a Poisson or negative binomial regression because my dependent variable (number of trip searches) is a count. However, the number of trip searches is also influenced by the actual demand for public transit: areas or times with more people using public transit (e.g., morning commuting hours) are also very likely to have more trip searches. I am more interested in how these factors affect the number of trip searches irrespective of actual demand. My supervisor suggested dividing the number of searches by the observed demand, which is represented by the automatic passenger counts (APC) variable. That way, I get the rate of trip searches relative to actual demand and can see when trip searches are over- or underrepresented relative to that demand. But as you can see, there are many instances where trip searches or APC equal zero, and the rate is undefined whenever APC is zero. The zeros are meaningful: they simply mean that no trip searches or no actual trips were made in a zone within that hour.
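
To make the zero problem concrete, here is a minimal Python sketch of the searches-per-APC rate applied to the first four zone 101 rows from the table above. The function name `search_rate` and the choice to return `None` when APC is zero are my own assumptions for illustration, not part of any established method.

```python
from typing import Optional

# (trip searches, APC) for zone 101, hours 00:00 to 03:00, copied from the table.
rows = [(2, 3), (0, 1), (0, 0), (1, 0)]


def search_rate(searches: int, apc: int) -> Optional[float]:
    """Trip searches per observed passenger count.

    Returns None when APC is zero, because the ratio is undefined there.
    (Hypothetical helper; the None convention is an assumption.)
    """
    if apc == 0:
        return None
    return searches / apc


rates = [search_rate(s, a) for s, a in rows]

for (s, a), r in zip(rows, rates):
    label = "undefined" if r is None else f"{r:.3f}"
    print(f"searches={s} apc={a} rate={label}")
```

Only the first two rows give a usable rate. The third row (0 searches, 0 APC) and the fourth row (1 search, 0 APC) are both undefined, even though they describe different situations.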

I suppose one solution could be to aggregate the data to avoid this issue, but I'm hoping not to do that, as I would lose some important information. Does anyone have other suggestions for how I can address this problem? Please let me know if anything needs to be clarified, and thank you very much for taking the time to help me!

Best,
Lisa