Dear all,

I am currently writing a code to capture the differences in earnings management between US firms and cross-listed firms (foreign firms on an American stock exchange). Because the cross-listed firms are self-selected, the data might be biased. Therefore, I have to match the cross-listed firms with US firms based on:

- fyear (fiscal year)
- sic_2 (industry)
- at (total assets)

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long gvkey double fyear byte sic_2 double at
1004 1996 50  529.584
1004 1997 50  670.559
1004 1998 50   726.63
1004 1999 50  740.998
1004 2000 50  701.854
1004 2001 50  710.199
1004 2002 50  686.621
1004 2003 50  709.292
1004 2004 50   732.23
1004 2005 50  978.819
1004 2006 50 1067.633
1004 2007 50  1362.01
1004 2008 50 1377.511
1004 2009 50 1501.042
1004 2010 50 1703.727
1004 2011 50 2195.653
1004 2012 50   2136.9
1004 2013 50   2199.5
1004 2014 50     1515
1004 2015 50   1442.1
1010 1997 37   3181.3
1010 1998 37   3257.3
1010 1999 37   3563.4
1010 2000 37   3794.5
1010 2001 37   3723.1
1010 2002 37   3702.5
1010 2003 37   4832.1
1013 1997 36  936.303
1013 1998 36 1300.587
1013 1999 36 1672.529
1013 2000 36   3970.5
1013 2001 36   2499.7
1013 2002 36   1144.2
1013 2003 36   1296.9
1013 2004 36   1428.1
1013 2005 36     1535
1013 2006 36   1611.4
1013 2007 36   1764.8
1013 2008 36     1921
1013 2009 36   1343.6
1013 2010 36   1474.5
1019 1997 73    26.71
1019 1998 73   29.283
1019 1999 73   29.341
1019 2000 73   28.638
1019 2001 73   30.836
1021 1997 38   20.516
1021 1998 38   18.661
1021 1999 38   13.986
1021 2000 38   11.608
1021 2001 38    8.635
1021 2002 38     7.85
1021 2003 38    6.044
1021 2004 38    6.245
1021 2005 38    8.153
1021 2006 38   14.341
1021 2007 38   27.171
1021 2008 38   21.401
1034 1997 28  631.866
1034 1998 28  908.936
1034 1999 28 1160.266
1034 2000 28 1610.435
1034 2001 28 2390.008
1034 2002 28 2296.924
1034 2003 28 2329.268
1034 2004 28 2003.842
1034 2005 28 1623.383
1034 2006 28  927.239
1034 2007 28 1288.165
1036 1997 34 1778.547
1036 1998 34  2113.32
1036 1999 34 2241.575
1036 2000 34 2325.377
1037 1996 36    4.969
1037 1997 36     5.45
1037 1998 36    3.228
1037 1999 36    4.575
1037 2000 36    6.373
1037 2001 36   17.867
1038 1996 78  718.213
1038 1997 78   795.78
1038 1998 78   975.73
1038 1999 78 1188.805
1038 2000 78 1047.264
1038 2001 78  1279.17
1038 2002 78 1491.698
1038 2003 78 1506.534
1043 1997 50     44.9
1043 1998 50   45.639
1043 1999 50    42.21
1045 1997 45    20915
1045 1998 45    22303
1045 1999 45    24374
1045 2000 45    26213
1045 2001 45    32841
1045 2002 45    30267
1045 2003 45    29330
1045 2004 45    28773
1045 2005 45    29495
1045 2006 45    29145
end
I currently have 5,208 cross-listed firms, and I want to reduce the amount of American firms (16,663) to the same amount (total amount of observations is 186,587).

Perhaps what I mean is not exactly called 'matching'. Anyway, I am looking to keep one American firm for each cross-listed firm, that is in the same year and industry and with the closest asset size.

If I use -joinby- I get exactly the American firms that I need, however in that case they are in the same observation line as the cross-listed firms. If you can help me in 'splitting up' these observations, that would be equally helpful. The code I used for that is:

Code:
preserve
keep if dummy_foreign == 0
ds sic_2 fyear, not
rename (`r(varlist)') for0_=
tempfile for0
save `for0'

restore
keep if dummy_foreign == 1
ds sic_2 fyear, not
rename (`r(varlist)') for1_=
tempfile for1
save `for1'

//    FIND ALL ADMISSIBLE PAIRINGS
joinby sic_2 fyear using `for0'

//    KEEP ONE WITH CLOSEST SIZE
gen size_diff = abs(for0_at - for1_at)
by for1_gvkey fyear sic_2 (size_diff), sort: keep if _n == 1
Thank you in advance.