Hello everyone,

I have run into unexpected behavior when computing weighted medians in Stata/SE. As far as I can tell, it has not been reported or explained on the forum.

Consider a dataset with 176 observations and two variables: a monetary variable (y) and a weight (w), normalized so that the weights sum to the number of observations. The data are sorted by y. Example:

Code:
. list in 1/5

     +-----------------------+
     |         y           w |
     |-----------------------|
  1. |     153.3   1.2242037 |
  2. |   75753.3   1.2242037 |
  3. | 92089.306   1.2255392 |
  4. |  113553.3   .80866169 |
  5. | 119325.52   1.2849769 |
     +-----------------------+

. 
. list in 171/176

     +-----------------------+
     |         y           w |
     |-----------------------|
171. | 1008153.3   1.9178511 |
172. | 1050153.3    .8489436 |
173. | 1191875.5   1.2725638 |
174. | 1428153.3    1.039806 |
175. |   1671717   1.3656695 |
     |-----------------------|
176. | 1932153.3   .74033083 |
     +-----------------------+
Obtaining the weighted median with -summarize, detail- gives:

Code:
. sum y [aw=w], d

                              y
-------------------------------------------------------------
      Percentiles      Smallest
 1%      75753.3          153.3
 5%     133878.4        75753.3
10%     188204.2       92089.31       Obs                 176
25%     221473.1       113553.3       Sum of Wgt.         176

50%     338399.3                      Mean           405967.1
                        Largest       Std. Dev.      271224.9
75%     504507.2        1191876
90%     714153.3        1428153       Variance       7.36e+10
95%     840153.3        1671717       Skewness       2.291167
99%      1671717        1932153       Kurtosis       10.72572

. di r(p50)
338399.33
Now, if I calculate the weighted median manually, following the methodology for percentiles described in the Stata Base Reference Manual: Release 15, pages 2673-2674, I get this:

Code:
.* Following reference manual

. preserve

.         gen P = (0.5*_N)    // defining the cutting point for the 50th percentile

.         gen W = w if _n == 1  // Defining the cumulative sum of weights
(175 missing values generated)

.         replace W = w[_n] + W[_n-1] if _n > 1
(175 real changes made)

.         gen index = ( W > P )  // Index for finding "center" of weighted distribution

.         replace index = index[_n] + index[_n-1] if _n > 1
(88 real changes made)

. * Calculating median 

.         gen aux_median = ( y[_n-1] + y[_n] )/2 if index == 1 & W[_n-1] == P
(176 missing values generated)

.         replace aux_median = y if index == 1 & W[_n-1] != P
(1 real change made)

.         replace aux_median = 0 if aux_median == .
(175 real changes made)

.         egen median = max(aux_median)

.         di median
336153.3

. restore
This is a different result. What could be happening here?
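
For reference, this is the percentile rule the code above tries to implement, as I read it from the manual. The notation is my own transcription, so please treat it as a sketch rather than a quotation. Here x_(i) are the sorted values, w_(i) the corresponding weights, and i is the first index with W_(i) > P. With p = 50 and weights summing to 176, P = 88.

```latex
W_{(i)} = \sum_{j=1}^{i} w_{(j)}, \qquad
P = \frac{p}{100} \sum_{j=1}^{n} w_{(j)}, \qquad
x_{[p]} =
\begin{cases}
  \dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } W_{(i-1)} = P \\[6pt]
  x_{(i)} & \text{otherwise}
\end{cases}
```

In my manual calculation the "otherwise" branch is the one that fires (the averaging condition generates 176 missing values), so the question comes down to whether W_(i-1) should count as exactly equal to P.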

One hypothesis (which I can't confirm, as -summarize- is a built-in command) is that this is related to the numerical precision that -summarize- uses when working with weights. In fact, arbitrarily rounding the cumulative weights to three decimals reproduces the -summarize- result.


Code:
. * Following reference manual
. preserve

.         gen P = (0.5*_N)    // defining the cutting point for the 50th percentile

.         gen W = w if _n == 1  // Defining the cumulative sum of weights
(175 missing values generated)

.         replace W = w[_n] + W[_n-1] if _n > 1
(175 real changes made)

.         replace W = round(W,0.001)  // Cutting decimals to 3
(176 real changes made)

.         gen index = ( W > P )  // Index for finding "center" of weighted distribution

.         replace index = index[_n] + index[_n-1] if _n > 1
(87 real changes made)

. * Calculating median 

.         gen aux_median = ( y[_n-1] + y[_n] )/2 if index == 1 & W[_n-1] == P
(175 missing values generated)

.         replace aux_median = y if index == 1 & W[_n-1] != P
(0 real changes made)

.         replace aux_median = 0 if aux_median == .
(175 real changes made)

.         egen median = max(aux_median)

.         di median
338399.33

. restore
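
A less arbitrary check than rounding to three decimals might be the following sketch. It rests on an assumption I have not confirmed: that the discrepancy comes from storage precision, since -generate- creates float variables by default and my P and W above are therefore floats. The sketch holds the cumulative weights in double and tests the equality with a tolerance. The names cutoff, above, rank_above, tie and the tolerance value 1e-6 are hypothetical choices of mine, and I have not verified that this reproduces the -summarize- result.

```stata
* Sketch only: assumes the data are sorted by y and contain just y and w.
preserve
    local tol = 1e-6                           // hypothetical tolerance for the equality test
    quietly summarize w, meanonly
    scalar cutoff = 0.5 * r(sum)               // P: half of the total weight
    generate double W = sum(w)                 // running sum of weights, stored as double
    generate byte above = (W - scalar(cutoff) > `tol')
    generate long rank_above = sum(above)      // equals 1 only at the first obs with W > P
    * tie flags the case W[_n-1] == P, up to the tolerance
    generate byte tie = abs(W[_n-1] - scalar(cutoff)) <= `tol'
    generate double aux_median = cond(tie, (y[_n-1] + y)/2, y) if rank_above == 1
    quietly summarize aux_median, meanonly     // only one nonmissing value remains
    display %12.2f r(mean)
restore
scalar drop cutoff
```

If w itself is stored as float, the weights already carry rounding error before they are summed, which is why the sketch compares against P with a tolerance instead of using ==.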
Thanks in advance for any help.

Kind regards,
David