Hi all,

I am having some issues merging two datasets that are both in long format. It's a bit complicated so I will try my best to explain things clearly, I have summarised things at the end which will hopefully help to clarify what I need help with. I am using Stata 16.1 in case it is relevant.


Datasets

Dataset 1 is organised by person (ID) and the year in which a given interview took place (year):
input str7 ID int year
"ID1" 1996
"ID1" 2000
"ID1" 2004
"ID1" 2008
"ID1" 2012
"ID1" 2016
"ID2" 1996
"ID2" 2000
"ID2" 2004
"ID2" 2008
"ID2" 2012
"ID2" 2016
"ID3" 1996
"ID3" 2000
"ID3" 2004
"ID3" 2008
"ID3" 2012
"ID3" 2016
"ID4" 1996
"ID4" 2000
"ID4" 2004
"ID4" 2008
"ID4" 2012
"ID4" 2016

Dataset 2 is organised by person (ID) and economic activity (ACTNO). It also contains information on the start and end month (JSTMTH JENDMTH) and year (JSTYR JENDYR) of each economic activity. These variables are also available in century month format (JSTCMC JENDCMC). Additionally, I have information on the century month in which a given interview took place: INTCMC05 for the 1996 sweep, INTCMC06 for 2000, INTCMC07 for 2004, INTCMC08 for 2008, INTCMC09 for 2012 and INTCMC10 for 2016.

To make it easier to understand this dataset, I have first shown only the person ID, economic activity number, start and end month and year below, and the century month variables further down in this post:
input str7 ID float ACTNO byte JSTMTH int JSTYR byte JENDMTH int JENDYR
"ID1" 1 4 1986 10 1988
"ID1" 2 10 1988 3 1994
"ID1" 3 3 1994 6 1998
"ID1" 4 6 1998 9 2003
"ID1" 5 9 2003 6 2005
"ID1" 6 6 2005 . .
"ID1" . . . . .
"ID2" . -1 -1 -1 -1
"ID3" 1 4 1986 7 1986
"ID3" 2 7 1986 7 1988
"ID3" 3 7 1988 10 1993
"ID3" 4 10 1993 . .
"ID3" . . . . .
"ID4" 1 4 1986 7 1986
"ID4" 2 7 1986 7 1988
"ID4" 3 7 1988 8 1990
"ID4" 4 8 1990 1 1996
"ID4" 5 1 1996 7 2000
"ID4" 6 7 2000 4 2002
"ID4" 7 4 2002 3 2003
"ID4" 8 3 2003 9 2003
"ID4" 9 9 2003 7 2004
"ID4" 10 7 2004 12 2004
"ID4" 11 12 2004 10 2008
"ID4" 12 10 2008 -6 -6
"ID4" . . . . .

The Problem

The issue I have is that one economic activity (ACTNO) in dataset 2 can span several years, but I only have one row per year in dataset 1. At the moment, I'm thinking the best way to merge them would be to create a person-year file, with one row per year, and merge the datasets that way. I don't know how to do so in Stata, so any advice would be really welcome!

To further complicate things, some people had several economic activities in a given year (e.g. ID4 did 3 economic activities in 2004), but I would only want one row per year per person to be able to merge dataset 2 with dataset 1.

To get around this, I want to implement two rules:



Rule 1

If the year-row in question concerns one of the sweep years (1996/ 2000/ 2004/ 2008/ 2012/ 2016), I want to assign observations to these years if the end of the activity (JENDCMC in century months) is after the interview date (INTCMC05-10) (so if JENDCMC>= INTCMC05-10). If there are multiple observations of this sort, I want to keep the one closest to the interview date.

This can be determined using the INTCMC variables which show when an interview took place. To recap, it's INTCMC05 for the 1996 sweep, INTCMC06 for 2000, INTCMC07 for 2004, INTCMC08 for 2008, INTCMC09 for 2012 and INTCMC10 for 2016. Below I am also showing person ID (ID), start and end year in "normal" format (JSTYR JENDYR), economic activity (ACTNO) and start and end century months (JSTCMC JENDCMC).

input str7 ID int(JSTYR JENDYR) float ACTNO int(JSTCMC JENDCMC INTCMC05) float(INTCMC06 INTCMC07) int INTCMC08 float(INTCMC09 INTCMC10)
"ID1" 1986 1988 1 1036 1066 1161 1201 1257 1311 1355 .
"ID1" 1988 1994 2 1066 1131 1161 1201 1257 1311 1355 .
"ID1" 1994 1998 3 1131 1182 1161 1201 1257 1311 1355 .
"ID1" 1998 2003 4 1182 1245 1161 1201 1257 1311 1355 .
"ID1" 2003 2005 5 1245 1266 1161 1201 1257 1311 1355 .
"ID1" 2005 . 6 1266 -6 1161 1201 1257 1311 1355 .
"ID1" . . . . . . 1201 1257 . 1355 1400
"ID2" -1 -1 . -1 -1 -1 -1 -1 -1 -1 .
"ID3" 1986 1986 1 1036 1039 1161 1204 1255 1309 1350 .
"ID3" 1986 1988 2 1039 1063 1161 1204 1255 1309 1350 .
"ID3" 1988 1993 3 1063 1126 1161 1204 1255 1309 1350 .
"ID3" 1993 . 4 1126 -6 1161 1204 1255 1309 1350 .
"ID3" . . . . . . 1204 1255 . 1350 1411
"ID4" 1986 1986 1 1036 1039 1161 1207 1263 1306 -1 .
"ID4" 1986 1988 2 1039 1063 1161 1207 1263 1306 -1 .
"ID4" 1988 1990 3 1063 1088 1161 1207 1263 1306 -1 .
"ID4" 1990 1996 4 1088 1153 1161 1207 1263 1306 -1 .
"ID4" 1996 2000 5 1153 1207 1161 1207 1263 1306 -1 .
"ID4" 2000 2002 6 1207 1228 1161 1207 1263 1306 -1 .
"ID4" 2002 2003 7 1228 1239 1161 1207 1263 1306 -1 .
"ID4" 2003 2003 8 1239 1245 1161 1207 1263 1306 -1 .
"ID4" 2003 2004 9 1245 1255 1161 1207 1263 1306 -1 .
"ID4" 2004 2004 10 1255 1260 1161 1207 1263 1306 -1 .
"ID4" 2004 2008 11 1260 1306 1161 1207 1263 1306 -1 .
"ID4" 2008 -6 12 1306 -6 1161 1207 1263 1306 -1 .
"ID4" . . . . . . 1207 1263 . . .

If we use ID4' 2004 jobs as an example, INTCMC07 would be the variable to look at (because 2004 is the year of interest), and I would want the third row (ACTNO==11) to be coded as 2004 in the "year" variable because JENDCMC(==1306) > INTCMC07(==1263).
ID JSTYR JENDYR ACTNO JSTCMC JENDCMC INTCMC07
"ID4" 2003 2004 9 1245 1255 1263
"ID4" 2004 2004 10 1255 1260 1263
"ID4" 2004 2008 11 1260 1306 1263 <---- 2004
ID4's 2000 jobs are a good example of a case in which JENDCMC>= INTCMC06 (INTCMC06 because 2000 is the year of interest) applies to several economic activities, so the case in which JENDCMC is closest to INTCMC06 would be chosen.

ID JSTYR JENDYR ACTNO JSTCMC JENDCMC INTCMC07
"ID4" 1996 2000 5 1153 1207 1207 <---- 2000
"ID4" 2000 2002 6 1207 1228 1207

Rule 2

In cases where the year of concern is not a sweep year, OR in cases where the relevant INTCMC variable is missing, I want to assign the year to jobs that started before (JSTMTH < 7) and ended in or after July (JENDMTH>=7) of a given year. Should this apply to several activities, I want to pick the activity that ended closest to July of that year. I am showing this in "normal" date format here, but it might be easier to work it out in century month format once I get to it.

If we take ID4 as an example again, I would want ACTNO==4 to be assigned as year==1986 because it started before and ended in July 1986 so took place in the first half of the year.
input str7 ID float ACTNO byte JSTMTH int JSTYR byte JENDMTH int JENDYR
"ID4" 1 4 1986 7 1986 <---- 1986
"ID4" 2 7 1986 7 1988
Another example:
input str7 ID float ACTNO byte JSTMTH int JSTYR byte JENDMTH int JENDYR
"ID5" 1 1 1996 7 1996<--- 1996 (JENDMTH>=7 applies to some rows below, but this is the only row for which JSTMTH < 7 applies)
"ID5" 2 7 1996 9 1996
"ID5" 3 9 1996 11 1996
"ID5" 4 11 1996 8 1997

Summary

I need help with:

(1) Creating a person-year dataset (which shows which economic activity a given person was doing in a given year). This dataset can have multiple rows per person ID (ID) and economic activity (ACTNO), but only one row per year (a variable I will have to create). The year would be determined based on whether it is contained within the start and end year of each activity (JSTYR JENDYR).

(2) Telling Stata which activity to assign to (or drop from) a given year should more than one activity have taken place in that year. I want to do so using two rules:
  1. If the year in question is 1996/ 2000/ 2004/ 2008/ 2012/ 2016, then the activity for which JENDCMC >= INTCMC05-10 is true will be assigned. Should there be several activities for which this is true, the one closest to the interview date (INTCMC05-10) will be picked.
  2. For all other cases, I will assign the year to the activity that started before (JSTMTH < 7) and ended in or after July of a given year (JENDMTH>=7). Should this apply to several activities, I want to pick the activity that ended closest to July of that year.
To illustrate the result I would like to end up with, I manually assigned the "year" variable to all observations for person ID4 (with the reason for my choices shown in italics and underlined in the data):

input str7 ID float ACTNO byte JSTMTH int JSTYR byte JENDMTH int(JENDYR JENDCMC INTCMC05) float(INTCMC06 INTCMC07) int INTCMC08 float(INTCMC09 INTCMC10) year
"ID4" 1 4 1986 7 1986 1039 1161 1207 1263 1306 -1 . 1986 <--- this activity rather than the next one was coded as 1986 because of rul
e 2 (started before and ended in/after July 1986)
"ID4" 2 7 1986 7 1988 1063 1161 1207 1263 1306 -1 . 1987
"ID4" 2 7 1986 7 1988 1063 1161 1207 1263 1306 -1 . 1988 <--- this activity rather than the next one was coded as 1988 because of rule 2
(started before and ended in/after July 1988)
"ID4" 3 7 1988 8 1990 1088 1161 1207 1263 1306 -1 . 1989
"ID4" 3 7 1988 8 1990 1088 1161 1207 1263 1306 -1 . 1990 <--- this activity rather than the next one was coded as 1990 because of rule 2
(started before and ended in/after July 1990)
"ID4" 4 8 1990 1 1996 1153 1161 1207 1263 1306 -1 . 1991
"ID4" 4 8 1990 1 1996 1153 1161 1207 1263 1306 -1 . 1992
"ID4" 4 8 1990 1 1996 1153 1161 1207 1263 1306 -1 . 1993
"ID4" 4 8 1990 1 1996 1153 1161 1207 1263 1306 -1 . 1994
"ID4" 4 8 1990 1 1996 1153 1161 1207 1263 1306 -1 . 1995
"ID4" 5 1 1996 7 2000 1207 1161 1207 1263 1306 -1 . 1996 <--- this activity rather than the previous one was coded as 1996 because of rule 1 (JENDCMC >= INTCMC05)
"ID4" 5 1 1996 7 2000 1207 1161 1207 1263 1306 -1 . 1997
"ID4" 5 1 1996 7 2000 1207 1161 1207 1263 1306 -1 . 1998
"ID4" 5 1 1996 7 2000 1207 1161 1207 1263 1306 -1 . 1999
"ID4" 5 1 1996 7 2000 1207 1161 1207 1263 1306 -1 . 2000 <--- this activity rather than the next one was coded as 2000 because of rule 1 (JENDCMC >= INTCMC06, and its JENDCMC is closest to INTCMC06)
"ID4" 6 7 2000 4 2002 1228 1161 1207 1263 1306 -1 . 2001
"ID4" 7 4 2002 3 2003 1239 1161 1207 1263 1306 -1 . 2002 <--- this activity rather than the previous one was coded as 2002 because of rule 2
((started before and ended in/after July 2002)
"ID4" 8 3 2003 9 2003 1245 1161 1207 1263 1306 -1 . 2003 <--- this activity rather than the next one was coded as 2003 because of rule 2 (started before and ended in/after July 2003)
"ID4" 9 9 2003 7 2004 1255 1161 1207 1263 1306 -1 . .
"ID4" 10 7 2004 12 2004 1260 1161 1207 1263 1306 -1 . .
"ID4" 11 12 2004 10 2008 1306 1161 1207 1263 1306 -1 . 2004 <--- this activity rather than the previous two was coded as 2004 because of rule 1 (JENDCMC >= INTCMC06)
"ID4" 11 12 2004 10 2008 1306 1161 1207 1263 1306 -1 . 2005
"ID4" 11 12 2004 10 2008 1306 1161 1207 1263 1306 -1 . 2006
"ID4" 11 12 2004 10 2008 1306 1161 1207 1263 1306 -1 . 2007
"ID4" 11 12 2004 10 2008 1306 1161 1207 1263 1306 -1 . 2008 <--- this activity rather than the next one was coded as 2012 because of rule 2
(started before and ended in/after July 2008)
"ID4" 12 10 2008 -6 -6 -6 1161 1207 1263 1306 -1 . .
"ID4" . . . . . . . 1207 1263 . . . .

Absolutely any advice regarding any of the steps involved in doing this in Stata would be highly appreciated! Thank you!