I am new to Stata and just encountered this issue in data management.
Suppose I have two datasets with different numbers of observations.
First, conditioning on these two datasets contain some common variables but different observations, I want combine these two datasets, remaining all the observations in these two datasets and leaving the unmatched observations as missing values. For example, both these two datasets have variables srcdate, gvkey, and cusip. These two datasets also contain their own but potentially different salecs (namely, salecs1 and salecs2). If these three variables (srcdate, gvkey, and cusip) can match, then we compare their salecs (salecs1 and salecs2) by listing them next to each other. This procedure can be done by generating a dummy variable, if matched, presented by 1, otherwise, presented by 0.
Second, if initially these three variables
(srcdate, gvkey, and cusip)
cannot match, just create a new row of observations and leave whoever's cell as missing values.The result should contain both datasets entire information as well as comparing their observations respectively.
Here is the example for these two datasets.
Dataset 1
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input long srcdate str6 gvkey str10 cusip str58 conm str4 sic str50 cnms str8 ctype double salecs 19904 "122519" "68243Q106" "1-800-FLOWERS.COM" "5961" "International" "GEOREG" 15.127 20269 "122519" "68243Q106" "1-800-FLOWERS.COM" "5961" "International" "GEOREG" 11.215 20635 "122519" "68243Q106" "1-800-FLOWERS.COM" "5961" "International" "GEOREG" 11.73 21000 "122519" "68243Q106" "1-800-FLOWERS.COM" "5961" "International" "GEOREG" 11.936 21365 "122519" "68243Q106" "1-800-FLOWERS.COM" "5961" "International" "GEOREG" 11.519 20819 "034066" "68247Q102" "111 INC -ADR" "5960" "Not Reported" "COMPANY" 6.997 21184 "034066" "68247Q102" "111 INC -ADR" "5960" "3 Customers" "COMPANY" 50.479 18627 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicare" "GOVDOM" 239.017 18627 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Commercial" "MARKET" 272.78 18627 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicaid" "GOVDOM" 16.077 18627 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Self Pay" "MARKET" 8.039 18992 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Commercial" "MARKET" 325.093 18992 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Self Pay" "MARKET" 8.942 18992 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Argentina" "GEOREG" 43.5 18992 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicaid" "GOVDOM" 17.883 18992 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicare" "GOVDOM" 286.772 19358 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Commercial" "MARKET" 367.812 19358 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicare" "GOVDOM" 292.328 19358 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Self Pay" "MARKET" 7.548 19358 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Argentina" "GEOREG" 62.7 19358 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicaid" "GOVDOM" 18.528 19723 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Self Pay" "MARKET" 7.876 19723 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicaid" "GOVDOM" 19.332 19723 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Argentina" "GEOREG" 72.5 19723 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Commercial" "MARKET" 388.787 19723 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicare" "GOVDOM" 300.004 20088 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicare" "GOVDOM" 376.865 20088 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Self Pay" "MARKET" 8.522 20088 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Commercial" "MARKET" 540.678 20088 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicaid" "GOVDOM" 20.832 20088 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Argentina" "GEOREG" 76.8 20453 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicare" "GOVDOM" 388.049 20453 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Commercial" "MARKET" 587.074 20453 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Self Pay" "MARKET" 10.001 20453 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Argentina" "GEOREG" 90.3 20453 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "8090" "Medicaid" "GOVDOM" 15.002 end format %d srcdate
Dataset 2
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input long srcdate str6 gvkey str10 cusip str58 conm str100 conml str4 sic double salecs 18627 "179586" "33740N105" "1ST UNITED BANCORP INC" "1st United Bancorp Inc" "6020" 50.174 18992 "179586" "33740N105" "1ST UNITED BANCORP INC" "1st United Bancorp Inc" "6020" 62.148 19358 "179586" "33740N105" "1ST UNITED BANCORP INC" "1st United Bancorp Inc" "6020" 70.183 19723 "179586" "33740N105" "1ST UNITED BANCORP INC" "1st United Bancorp Inc" "6020" 71.5 18566 "166222" "90137E106" "20-20 TECHNOLOGIES INC" "20-20 Technologies Inc" "7372" 66.531 18931 "166222" "90137E106" "20-20 TECHNOLOGIES INC" "20-20 Technologies Inc" "7372" 68.25 20088 "020547" "90214L106" "2050 MOTORS INC" "2050 Motors Inc" "5500" 0 20453 "020547" "90214L106" "2050 MOTORS INC" "2050 Motors Inc" "5500" 0 20819 "020547" "90214L106" "2050 MOTORS INC" "2050 Motors Inc" "5500" 0 21184 "020547" "90214L106" "2050 MOTORS INC" "2050 Motors Inc" "5500" 0 21549 "020547" "90214L106" "2050 MOTORS INC" "2050 Motors Inc" "5500" 0 18627 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "21st Century Oncology Holdings Inc" "8090" 543.963 18992 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "21st Century Oncology Holdings Inc" "8090" 644.717 19358 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "21st Century Oncology Holdings Inc" "8090" 693.951 19723 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "21st Century Oncology Holdings Inc" "8090" 736.516 20088 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "21st Century Oncology Holdings Inc" "8090" 1018.182 20453 "265008" "90131G107" "21ST CENTURY ONCOLOGY HLDGS" "21st Century Oncology Holdings Inc" "8090" 1079.227 18627 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 79.576 18992 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 162.209 19358 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 244.644 19723 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 324.879 20088 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 463.599 20453 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 561.05 20819 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 524.525 21184 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 521.449 21549 "186876" "90138A103" "21VIANET GROUP INC" "21Vianet Group Inc" "7370" 494.492 18627 "160938" "90214M104" "2242749 ONT LTD" "2242749 Ont Ltd" "3080" 462.797 18992 "160938" "90214M104" "2242749 ONT LTD" "2242749 Ont Ltd" "3080" 453.605 19358 "160938" "90214M104" "2242749 ONT LTD" "2242749 Ont Ltd" "3080" 457.415 19723 "160938" "90214M104" "2242749 ONT LTD" "2242749 Ont Ltd" "3080" 455.522 20088 "160938" "90214M104" "2242749 ONT LTD" "2242749 Ont Ltd" "3080" 473.866 end format %d srcdate
0 Response to How to combine two observation-varying datasets based on several variables
Post a Comment