Greetings all,

I have single line per observation survival data (4 million lines). Here is a simplified example
default zip code date_start date_end date_default
1 12345 2000q2 2016q1 2005q3
0 54321 1993q4 2016q1
1 13467 2003q1 2016q1 2010q1
One thing I'd like to with my data is to understand the default rate per quarter, by department. I'd ultimately like to construct a second panel (or, a first one, since this isn't per se a panel as is) where I have the different zip codes as the subjects to be followed through time, and in the end the rate of default per time. I have unemployment data that is already organized in this fashion, and naturally I want to combine it with a default rate (# defaults / # "alive" or "at risk" loans) per zip code:
zip code date unemployment default rate
11111 1990q1 4.2 x
11111 1990q2 4.1 x
11111 1990q3 4.6 x

One guess was to create some new variable that uniquely identifies zipcode/quarter combinations, and then to do a statsby on this. But that would imply ~12,000 groups (100 zip codes * 30 years *4 quarters), and that just doesn't seem right/efficient.

It shouldn't be hard for me to find a way to count the defaults per quarter/department (although I can't do tab default department zipcode, as this is too many variables :/), but I must confess I have no idea where to start on counting (and organizing in a new panel, without Excel) the at-risk loans per quarter.

Thank you so much for even some rough intuitions about how to go about this in STATA.

Have a great day,
John