Code:
clear set obs 20000000 gen indid=floor(uniform()*10000) gen firmid=floor(uniform()*10000) sort indid firmid egen matchid=group(indid firmid) drawnorm x y preserve
timer on 1 sort matchid collapse x y indid firmid, by(matchid) summarize timer off 1
restore preserve
timer on 2 sort matchid indid firmid collapse x y, by(matchid indid firmid) summarize timer off 2
restore timer list 1 timer list 2
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
matchid | 17,451,332 8738687 5060250 1 1.81e+07
x | 17,451,332 .0003232 .9687208 -5.402949 5.76085
y | 17,451,332 .000109 .9687222 -5.663941 5.742223
indid | 17,451,332 4820.521 2791.404 0 9999
firmid | 17,451,332 4998.954 2886.053 0 9999
and the second gives
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
indid | 18,125,449 4999.293 2886.385 0 9999
firmid | 18,125,449 4999.083 2886.161 0 9999
matchid | 18,125,449 9062725 5232367 1 1.81e+07
x | 18,125,449 .0002974 .9751242 -5.402949 5.76085
y | 18,125,449 .0001436 .9751012 -5.663941 5.742223
Note the different number of observations; since matchid was defined by combinations of indid and firmid, there should be no variation in these variables within matchid and adding them to the by() option should not change the number of observations or anything else in the descriptive statistics (this holds, by the way, when I choose 1000 different values for indid and firmid, instead of 10 000).
To follow up on this, I checked to see that matchid was indeed not getting unique combinations with this code:
Code:
format matchid %9.0f tempfile hold sort matchid save `hold' collapse (sd) sdi=indid sdf=firmid, by (matchid) merge matchid using `hold' gsort -sdi -sdf matchid order matchid indid firmid sdi sdf list in 1/10
+----------------------------------------------------------------------------------+
| matchid indid firmid sdi sdf x y _merge |
|----------------------------------------------------------------------------------|
1. | 17283644 9534 9999 .5773503 5772.637 .1314288 .5337735 3 |
2. | 17283644 9535 0 .5773503 5772.637 -1.074206 1.676847 3 |
3. | 17283644 9535 1 .5773503 5772.637 -.2784413 -2.624709 3 |
4. | 17998432 9929 9999 .5773503 5772.637 .654556 .3150313 3 |
5. | 17998432 9930 1 .5773503 5772.637 -1.504835 .3309812 3 |
|----------------------------------------------------------------------------------|
6. | 17998432 9930 0 .5773503 5772.637 -.0689836 -1.354481 3 |
7. | 16986156 9370 9998 .5773503 5772.06 -.1597866 .414003 3 |
8. | 16986156 9371 0 .5773503 5772.06 .0601182 -.0195616 3 |
9. | 16986156 9370 9998 .5773503 5772.06 -.1587963 1.086275 3 |
10. | 16986156 9371 1 .5773503 5772.06 .5464685 .4928921 3 |
+----------------------------------------------------------------------------------+
where you can see that the same matchid has multiple indid-firmid combinations associated with it.
Clearly, I seem to have misunderstood what egen group does. Is this a commonly known issue? If so, is there a better way to generate a variable that refers to unique combinations of variables in the varlist?
Thank you all for your time.
P.S. Here are the timers, if you are curious:
. timer list 1
1: 128.52 / 4 = 32.1290
. timer list 2
2: 217.01 / 3 = 72.3383
0 Response to egen newvar=group(varlist) does not generate a variable based on all unique combinations in the varlist
Post a Comment