I have a panel from Compustat like the below in Table 1 (table 1 is a subset of the data to show 3 different example issues), where gvkey is the firm-specific identifier, fyear is the reporting year, emp is employment, and dlrsn is the reason the firm dropped out of the dataset.
Table 1
gvkey | fyear | emp | dlrsn | |||
001 | 1996 | 2 | 02 | |||
001 | 1997 | 3 | 02 | |||
001 | 1998 | 2 | 02 | |||
001 | 1999 | 1 | 02 | |||
002 | 1996 | 4 | 06 | |||
002 | 1997 | 5 | 06 | |||
002 | 1998 | 4 | 06 | |||
002 | 1999 | 3 | 06 | |||
002 | 2000 | 3 | 06 | |||
002 | 2001 | 3 | 06 | |||
002 | 2002 | 3 | 06 | |||
002 | 2003 | 3 | 06 | |||
003 | 1996 | 7 | . | |||
003 | 1997 | 8 | . | |||
. | . | . | . | |||
. | . | . | . | |||
. | . | . | . | |||
003 | 2016 | 14 | . |
I need employment data for each firm all the way up to and including 2016 (as shown in Table 2). However, many firms drop out of the dataset (e.g., because of bankruptcy). For such firms, I want to generate employment numbers for all years from the last date they reported, going up to and including 2016 using the following methodoloy:
- If dlrsn is 02 or 03, set employment number to zero from the first year after the last reporting year going up to and including 2016. For example, in Table 2, firm 001 reports up to 1999, I would like to generate data that has fyears 2000-2016 and employment set at 0 because dlrsn is 02.
- If dlrsn is 01,04,05,06,07,09,10,20, use the last reported employment number for all years after the last reported year. For example, in Table 2, firm 002 reports up 2003, I would like to generate data that has fyear 2004-2016 and employment is set equal to the last available employment number (i.e.,4) because dlrsn is 06
- If the firm does not drop out of the dataset, nothing should change.
gvkey | fyear | emp | dlrsn | |||
001 | 1996 | 2 | 02 | |||
001 | 1997 | 3 | 02 | |||
001 | 1998 | 2 | 02 | |||
001 | 1999 | 1 | 02 | |||
001* | 2000* | 0* | 02* | |||
001* | 2001* | 0* | 02* | |||
001* | 2002* | 0* | 02* | |||
. | . | . | . | |||
. | . | . | . | |||
. | . | . | . | |||
001 | 2016 | 0* | 02 | |||
002 | 1996 | 4 | 06 | |||
002 | 1997 | 5 | 06 | |||
002 | 1998 | 4 | 06 | |||
002 | 1999 | 3 | 06 | |||
002 | 2000 | 3 | 06 | |||
002 | 2001 | 2 | 06 | |||
002 | 2002 | 3 | 06 | |||
002 | 2003 | 4 | 06 | |||
002* | 2004* | 4* | 06* | |||
002* | 2005* | 4* | 06* | |||
. | . | . | . | |||
. | . | . | . | |||
. | . | . | . | |||
002* | 2016* | 4* | 06* | |||
003 | 1996 | 7 | . | |||
003 | 1997 | 8 | . | |||
. | . | . | . | |||
. | . | . | . | |||
. | . | . | . | |||
003 | 2016 | 14 | . |
gen dldteyear=year(dldte)
bysort gvkey: egen lastdate=max(fyear)
expand yeardiff if fyear==lastdate
The code was able to duplicate the last reported date in Table 1 the correct number of times but then I was a bit stuck with how to do the next step, I was thinking replace years because they would have to be made consecutive and then replace employment but this started to get a bit messy. I am sure there is probably a better approach then the one I am taking which seems rather mechanical.
Thanks in advance for all support.
Best,
Ali
0 Response to Generating data to balance a panel dataset
Post a Comment