Hello,

I am working with a data set that stores monthly labour force status over the survey reference period as a 24-character string variable, one digit per month.

For simplicity, let’s say 1 means employed, 2 means unemployed, and 0 means the month is not covered by the survey.

For example, the data look like this:
pid dv
1001 111111111111222221111000
1002 222111111111111221111111
1003 111111112111111111111000
1004 111122221111111122221111
...

I’d like to restructure this so that I can run a hazard analysis. That means deriving, for each person, the initial state, the durations of the employment and unemployment spells, the month in which each spell started, etc.
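
To make the target concrete, here is a rough Python sketch of the one-row-per-spell layout I have in mind. The function name `spells`, the 1-based month index, and the choice to drop runs of 0 are my own assumptions for illustration, not something the survey documentation prescribes.

```python
from itertools import groupby


def spells(dv):
    """Split a monthly status string into (status, start_month, duration) tuples.

    Assumes one character per month, with '0' meaning "not covered",
    so runs of '0' are skipped rather than treated as spells.
    """
    out = []
    month = 1  # 1-based month index within the reference period
    for status, run in groupby(dv):
        length = len(list(run))  # number of consecutive months in this state
        if status != "0":
            out.append((status, month, length))
        month += length
    return out


# The four example records from above
records = {
    "1001": "111111111111222221111000",
    "1002": "222111111111111221111111",
    "1003": "111111112111111111111000",
    "1004": "111122221111111122221111",
}

for pid, dv in records.items():
    result = spells(dv)
    initial_state = result[0][0] if result else None
    print(pid, "initial state:", initial_state, "spells:", result)
```

For pid 1001 this gives an employment spell starting in month 1 lasting 12 months, an unemployment spell starting in month 13 lasting 5 months, and an employment spell starting in month 18 lasting 4 months. I assume the last observed spell for each person would need to be flagged as right-censored, since it is still ongoing when coverage ends.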

What would be the best approach/reference to start?