Dear all,

I am currently writing code to capture the differences in earnings management between US firms and cross-listed firms (foreign firms listed on an American stock exchange). Because cross-listed firms self-select into listing, the comparison may suffer from selection bias. Therefore, I have to match the cross-listed firms with US firms based on:

- mtb (market-to-book ratio)
- roa (return on assets)
- at (total assets)


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long gvkey double fyear float(dummy_usa dummy_foreign) double    at    float(roa    mtb)
1004 1996 1 0  529.584    .04347752          .
1004 1997 1 0  670.559    .05317504          .
1004 1998 1 0   726.63    .05734831  1.6586404
1004 1999 1 0  740.998    .04745357  1.0978953
1004 2000 1 0  701.854    .02640293  1.1084794
1004 2001 1 0  710.199   -.08298942  1.1752149
1004 2002 1 0  686.621  -.018074017   .4858825
1004 2003 1 0  709.292   .004940137  1.0239426
1004 2004 1 0   732.23   .021104025  1.6606493
1004 2005 1 0  978.819   .035923906  2.0879886
1004 2006 1 0 1067.633    .05494398  2.4809506
1004 2007 1 0  1362.01     .0551714  1.2772952
1004 2008 1 0 1377.511    .05709646   .8701464
1004 2009 1 0 1501.042   .029731346  1.0421851
1004 2010 1 0 1703.727    .04098427  1.2568352
1004 2011 1 0 2195.653    .03084413  .56036645
1004 2012 1 0   2136.9    .02573822   .8591657
1004 2013 1 0   2199.5   .033143897   .9606355
1004 2014 1 0     1515   .006732673  1.2381912
1004 2015 1 0   1442.1   .033076763   .9731014
1010 1997 1 0   3181.3   .068022504          .
1010 1998 1 0   3257.3   .020630583          .
1010 1999 1 0   3563.4   .021692766          .
1010 2000 1 0   3794.5   .021056794          .
1010 2001 1 0   3723.1    .03913406          .
1010 2002 1 0   3702.5   .021525996          .
1010 2003 1 0   4832.1    .07294965          .
1013 1997 1 0  936.303    .11624122          .
1013 1998 1 0 1300.587    .11281598   3.393175
1013 1999 1 0 1672.529     .0523967   5.735749
1013 2000 1 0   3970.5    .21863745   5.652886
1013 2001 1 0   2499.7   -.51514184   1.903243
1013 2002 1 0   1144.2   -1.0006992   1.725441
1013 2003 1 0   1296.9   -.05914103  3.3024726
1013 2004 1 0   1428.1    .01148379   2.715488
1013 2005 1 0     1535    .07211726  2.6268575
1013 2006 1 0   1611.4      .040772  1.9200138
1013 2007 1 0   1764.8    .06023346  2.1825328
1013 2008 1 0     1921  -.021811556   .7718683
1013 2009 1 0   1343.6    -.3530068  2.2617743
1013 2010 1 0   1474.5    .04204815      2.835
1019 1997 1 0    26.71     .0426432          .
1019 1998 1 0   29.283    .05624424   2.919797
1019 1999 1 0   29.341    .03489997   2.787749
1019 2000 1 0   28.638    .06152664  2.3041565
1019 2001 1 0   30.836    .04183422   3.302927
1021 1997 1 0   20.516    .07550205          .
1021 1998 1 0   18.661   -.17833985  1.0966128
1021 1999 1 0   13.986   -.15780066   .6230607
1021 2000 1 0   11.608   -.06960717  1.0197082
1021 2001 1 0    8.635    -.2012739  1.0919029
1021 2002 1 0     7.85   .010700637  .55830675
1021 2003 1 0    6.044   -.25066182   1.194015
1021 2004 1 0    6.245     .2153723   5.218199
1021 2005 1 0    8.153    .23304304   4.236929
1021 2006 1 0   14.341     .0700788   2.719383
1021 2007 1 0   27.171   -.17198484  2.0490286
1021 2008 1 0   21.401    -.5162843  1.2018434
1034 1997 1 0  631.866   .027550146          .
1034 1998 1 0  908.936    .02663664   3.564293
1034 1999 1 0 1160.266   .031865105   2.587424
1034 2000 1 0 1610.435   .034467705   2.080925
1034 2001 1 0 2390.008  -.015863545   1.314704
1034 2002 1 0 2296.924    -.0433889   .6115829
1034 2003 1 0 2329.268   .005938776   .9239342
1034 2004 1 0 2003.842   -.15706678  1.0131919
1034 2005 1 0 1623.383    .08240138  1.6793386
1034 2006 1 0  927.239    .08902128   1.434651
1034 2007 1 0 1288.165  -.010542904   1.206971
1036 1997 0 1 1778.547    .07757906          .
1036 1998 0 1  2113.32    .04717128   .9201303
1036 1999 0 1 2241.575    .03966407   .8469118
1036 2000 0 1 2325.377    .02431864  .51730186
1037 1996 1 0    4.969     -.555645          .
1037 1997 1 0     5.45     .1719266          .
1037 1998 1 0    3.228    -1.078067  15.407714
1037 1999 1 0    4.575    -.2450273  72.605804
1037 2000 1 0    6.373    .18264553   6.106415
1037 2001 1 0   17.867     .0374993  3.4009595
1038 1996 1 0  718.213   .026447587          .
1038 1997 1 0   795.78   -.03078615          .
1038 1998 1 0   975.73  -.016414378   3.127864
1038 1999 1 0 1188.805   -.04642225  2.0251205
1038 2000 1 0 1047.264   -.10110727 -2.8141334
1038 2001 1 0  1279.17  -.008965189  1.7854867
1038 2002 1 0 1491.698  -.013609993  1.0782552
1038 2003 1 0 1506.534  -.007111688  2.0165327
1043 1997 1 0     44.9  .0022939867          .
1043 1998 1 0   45.639    .02346677  -1.848265
1043 1999 1 0    42.21 -.0032693674  -.6050181
1045 1997 1 0    20915    .04709538          .
1045 1998 1 0    22303    .05891584    1.43031
1045 1999 1 0    24374    .04041192  1.4482962
1045 2000 1 0    26213   .031015145   .8304026
1045 2001 1 0    32841   -.05365245   .6411717
1045 2002 1 0    30267   -.11600093  1.0764759
1045 2003 1 0    29330   -.04186839    44.9258
1045 2004 1 0    28773  -.026448406 -3.0372775
1045 2005 1 0    29495   -.02919139  -2.748398
1045 2006 1 0    29145   .007925888  -11.08553
end

I currently have 5,208 cross-listed firms, and I want to reduce the number of US firms (16,663) to the same number (the dataset has 186,587 observations in total).

My question looks similar to the problem discussed here: https://www.statalist.org/forums/for...with-firm-size
However, when I follow the commands in that post, I end up with only 326 observations. They also place each US firm in the same row as its matched cross-listed firm. What I want instead is to drop all unmatched US observations and keep a dataset in which the matched US and cross-listed firms remain separate observations, so that I can still run regressions on them.

Perhaps what I mean is not exactly called 'matching'. In any case, for each cross-listed firm I want to keep the one US firm that is most similar in terms of mtb, roa and at.

Also, I might have to match within year first (that is, the main criterion for two observations to be matched is that they are from the same fiscal year). If that is necessary, how can I adapt the code?
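
To make the goal concrete, here is a rough sketch of the kind of result I am after. It is not a tested solution, and it rests on several assumptions of my own: matching is done per firm-year rather than per firm, "most similar" means the smallest squared Euclidean distance on standardised mtb, roa and at, a US firm-year may serve as the match for more than one cross-listed firm-year (matching with replacement), and firm-years with a missing matching variable are dropped. The names z_*, *_us, gvkey_us, dist, us and matched are made up for illustration.

```stata
* Sketch only. Assumes the data example above is in memory and that
* gvkey fyear uniquely identify observations.

* Matching needs all three variables
drop if missing(mtb, roa, at)

* Standardise so that at (much larger scale) does not dominate the distance
foreach v of varlist mtb roa at {
    egen double z_`v' = std(`v')
}

* Candidate controls: US firm-years, renamed so they can sit beside a cross-listed firm
preserve
keep if dummy_usa == 1
keep gvkey fyear z_mtb z_roa z_at
rename (gvkey z_mtb z_roa z_at) (gvkey_us z_mtb_us z_roa_us z_at_us)
tempfile us
save `us'
restore

* Pair each cross-listed firm-year with every US firm-year of the same fiscal year,
* then keep only the closest US firm-year
preserve
keep if dummy_foreign == 1
keep gvkey fyear z_mtb z_roa z_at
joinby fyear using `us'
generate double dist = (z_mtb - z_mtb_us)^2 + (z_roa - z_roa_us)^2 + (z_at - z_at_us)^2
bysort gvkey fyear (dist gvkey_us): keep if _n == 1

* Reduce to a list of the matched US firm-years
keep gvkey_us fyear
rename gvkey_us gvkey
duplicates drop
tempfile matched
save `matched'
restore

* Back in the long data: keep cross-listed firm-years plus the matched US firm-years
merge 1:1 gvkey fyear using `matched', keep(master match)
keep if dummy_foreign == 1 | _merge == 3
drop _merge z_mtb z_roa z_at
```

This keeps matched US and cross-listed firm-years as separate rows, which is the layout I need for the regressions. I realise that joinby forms every within-year pair, so on the full sample it may need to be run one year at a time, and that it does not give exactly one US firm per cross-listed firm, which is part of what I am asking about.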

Thank you in advance.