Dear Statalisters,


I want to study the impact of a migrant's language proficiency and best friend's origin on how long it takes to get the first job after migration. I'm using Stata 14.2. I have panel data from two waves and want to do survival analysis using Cox regression.

My variables (among others) are:
ACT – employment status
WK_RC – if respondents ever worked in receiving country (RC) after migration
IMDATE_op – date of migration, only asked in wave 1
JBSTART_RC_op – date of job start in RC, only asked in wave 1
CURRJBSTART_op – date of job start in RC, only asked in wave 2
SAMEJB – if the job reported in CURRJBSTART_op is the same as in JBSTART_RC_op, only asked in wave 2
FR1CB – background of best friend
LRCSPK – RC language proficiency

What I did so far:
use datawave1
append using datawave2
sort ID wave


Then I excluded persons who dropped out in wave 2:
bysort ID (wave): generate occurrences = _N
drop if occurrences < 2


Because ID was a string, I created a new identifier variable:
egen id= group(ID)
list id



Now the data looks like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id byte(wave ACT WK_RC) str23(IMDATE_op JBSTART_RC_op CURRJBSTART_op) byte(SAMEJB FR1CB LRCSPK)
2 1 2 2 "10/2009" "-99 (filtered)" "" . 1 3
3 1 1 -99 "10/2009" "05/2010" "" . 1 2
3 2 1 -99 "" "" "04/2011" 2 1 2
4 1 2 2 "05/2010" "-99 (filtered)" "" . 1 3
8 1 1 -99 "10/2009" "10/2009" "" . 1 4
8 2 1 -99 "" "" "11/2009" 1 1 4
9 1 1 -99 "10/2009" "10/2009" "" . 1 1
9 2 1 -99 "" "" "11/2011" 2 1 1
11 1 1 -99 "07/2010" "06/2010" "" . -99 3
11 2 1 -99 "" "" "07/2010" 1 2 3
12 1 1 -99 "03/2010" "04/2010" "" . 1 4
12 2 1 -99 "" "" "04/2009" 1 3 3
13 1 1 -99 "06/2010" "06/2010" "" . -99 3
14 1 1 -99 "01/2010" "01/2010" "" . 2 2
14 2 1 -99 "" "" "10/2010" 1 1 2
16 1 1 -99 "04/2010" "04/2010" "" . 1 3
16 2 1 -99 "" "" "12/2010" 1 2 3
17 1 1 -99 "10/2009" "12/2009" "" . -99 3
17 2 1 -99 "" "" "-52/2010" 1 1 2
18 2 1 -99 "" "" "-99 (filtered)" -99 -99 -99
20 1 1 -99 "12/2006" "02/2007" "" . 2 2
20 2 1 -99 "" "" "02/2009" 1 1 2
21 1 1 -99 "08/2010" "08/2010" "" . 2 2
21 2 1 -99 "" "" "09/2010" 1 1 3
22 1 1 -99 "08/2010" "08/2010" "" . 2 1
22 2 1 -99 "" "" "08/2007" 1 2 1
23 1 1 -99 "10/2009" "-99 (filtered)" "" . -99 3
23 2 1 -99 "" "" "-99 (filtered)" -99 -99 -99
24 1 1 -99 "09/2009" "02/2010" "" . 1 2
25 1 1 -99 "10/2009" "04/2010" "" . 1 3
25 2 1 -99 "" "" "10/2012" 1 1 3
26 2 1 -99 "" "" "03/2012" 2 -99 1
28 1 2 1 "04/2010" "09/2010" "" . -99 2
28 2 1 -99 "" "" "07/2011" 2 1 2
29 1 1 -99 "01/2010" "02/2010" "" . 1 3
29 2 1 -99 "" "" "02/2010" 1 1 3
30 1 2 2 "09/2010" "-99 (filtered)" "" . -99 2
30 2 1 -99 "" "" "10/2011" 1 1 1
32 1 2 2 "08/2010" "-99 (filtered)" "" . 1 2
32 2 2 1 "" "" "-52/2010" 1 1 2
end
label values ACT Con38
label def Con38 1 "working", modify
label def Con38 2 "unemployed", modify
label values WK_RC Con3
label def Con3 -99 "filtered", modify
label def Con3 1 "yes", modify
label def Con3 2 "no", modify
label values SAMEJB Con3_7
label def Con3_7 -99 "filtered", modify
label def Con3_7 1 "yes", modify
label def Con3_7 2 "no", modify
label values FR1CB Con4
label def Con4 -99 "filtered", modify
label def Con4 1 "[in CO]", modify
label def Con4 2 "[RC]", modify
label def Con4 3 "other", modify
label values LRCSPK Con19
label def Con19 -99 "filtered", modify
label def Con19 1 "very well", modify
label def Con19 2 "well", modify
label def Con19 3 "not well", modify
label def Con19 4 "not at all", modify


So far I have not recoded answers like "don't know" or "refused" as missing (.). Whenever a question was asked in only one wave, the values in the other wave are already coded as missing (.), so I was afraid I would mix things up if I also recoded the true missing answers to (.).
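
One option I have been considering, as an untested sketch on toy data, is to recode the filter code to an extended missing value so it stays distinguishable from the plain (.) of a question that was not asked in that wave. The toy values below are made up for illustration, and I am assuming that -99 is the only special code in these numeric variables:

```stata
* Toy data in the same layout as my file (values are illustrative only)
clear
input byte(id wave SAMEJB LRCSPK)
1 1 .   3
1 2 2   2
2 1 .   3
2 2 -99 -99
end

* ASSUMPTION: -99 is the only special code; map it to extended missing .a
mvdecode SAMEJB LRCSPK, mv(-99 = .a)

* Both . and .a are treated as missing, but they can still be told apart
count if SAMEJB == .     // not asked in this wave
count if SAMEJB == .a    // filtered
```

With this, missing(SAMEJB) would be true for both kinds of rows, while a comparison with .a would pick out only the filtered answers.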


Now my questions are:

1. How do I create a time variable for survival analysis, that is, the time from the date of migration to the start of the first job in RC?
I know I somehow have to combine JBSTART_RC_op and CURRJBSTART_op (and maybe even SAMEJB) before subtracting IMDATE_op from the result, but I don't know how to do it (especially since I have so many false "missings" in these variables because they were only asked in one wave).
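
My rough idea so far, which may well be wrong, looks like the sketch below on a two-person toy extract. The new variable names (mig_m, firstjob_m, dur_months and the helpers) are my own, and the rule "use the wave 1 job start if there is one, otherwise the wave 2 current job start" is only an assumption that ignores SAMEJB:

```stata
* Toy extract in the same layout as my file (ids 3 and 30 from the listing)
clear
input float id byte wave str23(IMDATE_op JBSTART_RC_op CURRJBSTART_op)
3 1 "10/2009" "05/2010" ""
3 2 "" "" "04/2011"
30 1 "09/2010" "-99 (filtered)" ""
30 2 "" "" "10/2011"
end

* Convert "MM/YYYY" strings to monthly dates; non-date codes become missing
generate mig_w1 = monthly(IMDATE_op, "MY")
generate job_w1 = monthly(JBSTART_RC_op, "MY")
generate job_w2 = monthly(CURRJBSTART_op, "MY")

* Copy each person's value to both of their rows
* (the mean over the single nonmissing row just spreads that value)
egen mig_m  = mean(mig_w1), by(id)
egen job1_m = mean(job_w1), by(id)
egen job2_m = mean(job_w2), by(id)

* ASSUMPTION: wave 1 job start if present, otherwise wave 2 current job start
generate firstjob_m = cond(!missing(job1_m), job1_m, job2_m)
format mig_m firstjob_m %tm

* Months from migration to first job (missing if no job start is known)
generate dur_months = firstjob_m - mig_m
```

This would still leave open what to do with negative durations (for id 11 the job start is a month before the migration date) and how to set the censoring time for people who never worked, which would presumably need an interview date.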

2. How do I create the failure indicator (employed: yes/no) while correctly taking into account respondents who are on maternity/paternity leave?


Kind regards,
Anna