Has anyone noticed that the results of the Hosmer-Lemeshow test (for evaluating the calibration of prediction rules) differ between the commands "hl" and "estat gof", even when the same number of groups is specified? Any ideas on why, and which is the better one to use?

As an example:

. estat gof, group(10) table

Logistic model for Outcome_num, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)
+--------------------------------------------------------+
| Group | Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+-------+-------+-------+-------+-------|
| 1 | 0.2447 | 16 | 18.9 | 68 | 65.1 | 84 |
| 2 | 0.2759 | 21 | 17.9 | 47 | 50.1 | 68 |
| 3 | 0.2970 | 20 | 24.5 | 65 | 60.5 | 85 |
| 4 | 0.3247 | 18 | 20.7 | 49 | 46.3 | 67 |
| 5 | 0.3543 | 26 | 25.7 | 50 | 50.3 | 76 |
|-------+--------+-------+-------+-------+-------+-------|
| 6 | 0.3976 | 28 | 29.9 | 51 | 49.1 | 79 |
| 7 | 0.4351 | 36 | 30.1 | 37 | 42.9 | 73 |
| 8 | 0.4949 | 46 | 35.3 | 30 | 40.7 | 76 |
| 9 | 0.5746 | 43 | 40.3 | 33 | 35.7 | 76 |
| 10 | 0.6970 | 36 | 46.7 | 40 | 29.3 | 76 |
+--------------------------------------------------------+

number of observations = 760
number of groups = 10
Hosmer-Lemeshow chi2(8) = 17.99
Prob > chi2 = 0.0213

. hl Outcome_num PredProbRESP_MORT, group(10)

Group N Obs (%) Exp (%) Min % Max % HL
1 84 16 (19.0) 14.4 (17.1) 3.2 21.1 0.23
2 68 21 (30.9) 16.8 (24.7) 21.9 27.0 1.42
3 85 20 (23.5) 24.8 (29.2) 27.1 30.8 1.32
4 67 18 (26.9) 22.0 (32.8) 30.9 35.5 1.07
5 76 26 (34.2) 28.6 (37.6) 35.5 40.2 0.37
6 79 28 (35.4) 34.8 (44.1) 40.4 47.0 2.41
7 73 36 (49.3) 36.0 (49.3) 47.3 52.3 0.00
8 76 46 (60.5) 43.3 (56.9) 53.0 61.2 0.40
9 76 43 (56.6) 50.6 (66.5) 61.6 73.0 3.37
10 76 36 (47.4) 60.2 (79.2) 73.2 92.5 46.81
Total 760 290 (38.2) 331.3 (43.6) 3.2 92.5 57.40

number of observations = 760
number of groups = 10
Hosmer-Lemeshow chi2(10) = 57.40
Prob > chi2 = 0.0000


In some cases, the two commands even lead to different conclusions (p>0.05 with one and p<0.05 with the other).
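
For what it's worth, both chi2 values can be reproduced by hand from the tables above, so the gap does not look like an arithmetic quirk in either command. Below is a minimal Python sketch that applies the usual per-group formula, (Obs - Exp)^2/Exp for events plus the same term for non-events, to the rows as printed. The tuples are copied from the output above and are rounded, so the totals only match to about one decimal place. The function and variable names are my own and are not part of either command.

```python
# Rows copied from the "estat gof" table: (Obs_1, Exp_1, Obs_0, Exp_0)
gof_rows = [
    (16, 18.9, 68, 65.1), (21, 17.9, 47, 50.1), (20, 24.5, 65, 60.5),
    (18, 20.7, 49, 46.3), (26, 25.7, 50, 50.3), (28, 29.9, 51, 49.1),
    (36, 30.1, 37, 42.9), (46, 35.3, 30, 40.7), (43, 40.3, 33, 35.7),
    (36, 46.7, 40, 29.3),
]

# Rows copied from the "hl" table: (N, Obs, Exp)
hl_rows = [
    (84, 16, 14.4), (68, 21, 16.8), (85, 20, 24.8), (67, 18, 22.0),
    (76, 26, 28.6), (79, 28, 34.8), (73, 36, 36.0), (76, 46, 43.3),
    (76, 43, 50.6), (76, 36, 60.2),
]


def chi2_from_gof(rows):
    """Sum the event and non-event contributions for each group."""
    return sum((o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0
               for o1, e1, o0, e0 in rows)


def chi2_from_hl(rows):
    """Same formula; non-events are derived as N - Obs and N - Exp."""
    return sum((o - e) ** 2 / e + ((n - o) - (n - e)) ** 2 / (n - e)
               for n, o, e in rows)


gof_chi2 = chi2_from_gof(gof_rows)
hl_chi2 = chi2_from_hl(hl_rows)

# Total expected events in each table, for comparison with 290 observed.
gof_expected = sum(e1 for _, e1, _, _ in gof_rows)
hl_expected = sum(e for _, _, e in hl_rows)

print(f"estat gof: chi2 ~ {gof_chi2:.2f}, expected events ~ {gof_expected:.1f}")
print(f"hl:        chi2 ~ {hl_chi2:.2f}, expected events ~ {hl_expected:.1f}")
```

Up to rounding, this should land close to the reported 17.99 and 57.40. What stands out is that the two tables do not contain the same expected counts. The group sizes and observed events are identical, but the expected events add up to about 290 in the "estat gof" table (equal to the 290 observed) and to about 331 in the "hl" table. The reference distributions also differ, chi2(8) versus chi2(10). So the two commands seem to be evaluating different sets of predicted probabilities rather than computing the same test differently.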

Any comments/experiences would be appreciated!