I have been using egen newvar=group(varname1 varname2) for a while, with the understanding that newvar would be a new variable containing all unique combinations of varname1 and varname2. Since I want to collapse my data to 1 observation per unique combination of varname1 and varname2, but I also need to have these variables (as well as newvar) in the final dataset, I decided to test the following code (a somewhat pathological example, I agree, but I regularly work with datasets with tens of millions of observations) to see if it was faster to put varname1 and varname2 in the by() option or in the variable list. Remember, there should be no variation in varname1 and varname2 within values of newvar, so the means of varname1 and varname2 that come out of collapse are just their values for each specific value of newvar, and both approaches should give the same thing. However, I got an unfortunate surprise: egen group does not appear to generate its new variable based on all unique combinations in the varlist. Here is the code I tested:

Code:
clear
set obs 20000000
gen indid=floor(uniform()*10000)
gen firmid=floor(uniform()*10000)
sort indid firmid
egen matchid=group(indid firmid)
drawnorm x y
preserve
timer on 1 sort matchid collapse x y indid firmid, by(matchid) summarize timer off 1
 restore
preserve
timer on 2 sort matchid indid firmid collapse x y, by(matchid indid firmid) summarize timer off 2
 restore
timer list 1
timer list 2
The first summarize gives
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
matchid | 17,451,332 8738687 5060250 1 1.81e+07
x | 17,451,332 .0003232 .9687208 -5.402949 5.76085
y | 17,451,332 .000109 .9687222 -5.663941 5.742223
indid | 17,451,332 4820.521 2791.404 0 9999
firmid | 17,451,332 4998.954 2886.053 0 9999


and the second gives
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
indid | 18,125,449 4999.293 2886.385 0 9999
firmid | 18,125,449 4999.083 2886.161 0 9999
matchid | 18,125,449 9062725 5232367 1 1.81e+07
x | 18,125,449 .0002974 .9751242 -5.402949 5.76085
y | 18,125,449 .0001436 .9751012 -5.663941 5.742223


Note the different number of observations; since matchid was defined by combinations of indid and firmid, there should be no variation in these variables within matchid and adding them to the by() option should not change the number of observations or anything else in the descriptive statistics (this holds, by the way, when I choose 1000 different values for indid and firmid, instead of 10 000).

To follow up on this, I checked to see that matchid was indeed not getting unique combinations with this code:

Code:
format matchid %9.0f
tempfile hold
sort matchid
save `hold'
collapse (sd) sdi=indid sdf=firmid, by (matchid)
merge matchid using `hold'
gsort -sdi -sdf matchid
order matchid indid firmid sdi sdf
list in 1/10
and Stata returns:
+----------------------------------------------------------------------------------+
| matchid indid firmid sdi sdf x y _merge |
|----------------------------------------------------------------------------------|
1. | 17283644 9534 9999 .5773503 5772.637 .1314288 .5337735 3 |
2. | 17283644 9535 0 .5773503 5772.637 -1.074206 1.676847 3 |
3. | 17283644 9535 1 .5773503 5772.637 -.2784413 -2.624709 3 |
4. | 17998432 9929 9999 .5773503 5772.637 .654556 .3150313 3 |
5. | 17998432 9930 1 .5773503 5772.637 -1.504835 .3309812 3 |
|----------------------------------------------------------------------------------|
6. | 17998432 9930 0 .5773503 5772.637 -.0689836 -1.354481 3 |
7. | 16986156 9370 9998 .5773503 5772.06 -.1597866 .414003 3 |
8. | 16986156 9371 0 .5773503 5772.06 .0601182 -.0195616 3 |
9. | 16986156 9370 9998 .5773503 5772.06 -.1587963 1.086275 3 |
10. | 16986156 9371 1 .5773503 5772.06 .5464685 .4928921 3 |
+----------------------------------------------------------------------------------+


where you can see that the same matchid has multiple indid-firmid combinations associated with it.

Clearly, I seem to have misunderstood what egen group does. Is this a commonly known issue? If so, is there a better way to generate a variable that refers to unique combinations of variables in the varlist?

Thank you all for your time.

P.S. Here are the timers, if you are curious:
. timer list 1
1: 128.52 / 4 = 32.1290

. timer list 2
2: 217.01 / 3 = 72.3383