I have come across a discrepancy in how weighted medians are computed in Stata/SE. As far as I can tell, it has not been reported or explained on the forum.
Consider a dataset with 176 observations and two variables: a monetary variable (y) and a weight (w), normalized so that the weights sum to the number of observations. The data are sorted by y. Example:
Code:
. list in 1/5

     +-----------------------+
     |         y           w |
     |-----------------------|
  1. |     153.3   1.2242037 |
  2. |   75753.3   1.2242037 |
  3. | 92089.306   1.2255392 |
  4. |  113553.3   .80866169 |
  5. | 119325.52   1.2849769 |
     +-----------------------+

. list in 171/176

     +-----------------------+
     |         y           w |
     |-----------------------|
171. | 1008153.3   1.9178511 |
172. | 1050153.3    .8489436 |
173. | 1191875.5   1.2725638 |
174. | 1428153.3    1.039806 |
175. |   1671717   1.3656695 |
     |-----------------------|
176. | 1932153.3   .74033083 |
     +-----------------------+
Code:
. sum y [aw=w], d

                              y
-------------------------------------------------------------
      Percentiles      Smallest
 1%      75753.3          153.3
 5%     133878.4        75753.3
10%     188204.2       92089.31       Obs                 176
25%     221473.1       113553.3       Sum of Wgt.         176

50%     338399.3                      Mean           405967.1
                        Largest       Std. Dev.      271224.9
75%     504507.2        1191876
90%     714153.3        1428153       Variance       7.36e+10
95%     840153.3        1671717       Skewness       2.291167
99%      1671717        1932153       Kurtosis       10.72572

. di r(p50)
338399.33
However, replicating the calculation by hand, following the formula in the reference manual, gives a different median (336153.3):
Code:
. * Following reference manual
. preserve
. gen P = (0.5*_N) // defining the cutting point for the 50th percentile
. gen W = w if _n == 1 // Defining the cumulative sum of weights
(175 missing values generated)
. replace W = w[_n] + W[_n-1] if _n > 1
(175 real changes made)
. gen index = ( W > P ) // Index for finding "center" of weighted distribution
. replace index = index[_n] + index[_n-1] if _n > 1
(88 real changes made)
. * Calculating median
. gen aux_median = ( y[_n-1] + y[_n] )/2 if index == 1 & W[_n-1] == P
(176 missing values generated)
. replace aux_median = y if index == 1 & W[_n-1] != P
(1 real change made)
. replace aux_median = 0 if aux_median == .
(175 real changes made)
. egen median = max(aux_median)
. di median
336153.3
. restore
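For anyone who wants to try the same rule on their own data, the manual steps above can be condensed with the running-sum function sum(). This is only a minimal sketch on a made-up four-observation dataset, not the actual -summarize- code. The names cumw, hit and med are hypothetical, and the running sum is stored as double here, which is a choice of mine and differs from the default storage type used above. It is meant to be run from a do-file.

```stata
* Toy data (hypothetical): four observations with equal weights
clear
input double(y w)
10 1
20 1
30 1
40 1
end

sort y
gen double cumw = sum(w)                 // running sum of the weights
scalar P = 0.5 * cumw[_N]                // cutting point for the 50th percentile
gen long hit = sum(cumw > scalar(P))     // equals 1 at the first obs whose running sum exceeds P

* Average the two neighbouring values if the previous running sum equals P exactly,
* otherwise take y at the first observation past the cutting point
gen double med = cond(cumw[_n-1] == scalar(P), (y[_n-1] + y)/2, y) if hit == 1

summarize med, meanonly
display r(mean)                          // 25 for the toy data
```

With the toy data the running sums are 1, 2, 3, 4 and P is 2, so the third observation is the first one past the cutting point and the previous running sum equals P exactly. The result is therefore the average of 20 and 30.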
One hypothesis (which I cannot confirm, since -summarize- is a built-in command) is that this is related to the number of decimal places that -summarize- uses when working with weights. In fact, arbitrarily rounding the cumulative weights to three decimal places reproduces the result from -summarize-.
Code:
. * Following reference manual
. preserve
. gen P = (0.5*_N) // defining the cutting point for the 50th percentile
. gen W = w if _n == 1 // Defining the cumulative sum of weights
(175 missing values generated)
. replace W = w[_n] + W[_n-1] if _n > 1
(175 real changes made)
. replace W = round(W,0.001) // Cutting decimals to 3
(176 real changes made)
. gen index = ( W > P ) // Index for finding "center" of weighted distribution
. replace index = index[_n] + index[_n-1] if _n > 1
(87 real changes made)
. * Calculating median
. gen aux_median = ( y[_n-1] + y[_n] )/2 if index == 1 & W[_n-1] == P
(175 missing values generated)
. replace aux_median = y if index == 1 & W[_n-1] != P
(0 real changes made)
. replace aux_median = 0 if aux_median == .
(175 real changes made)
. egen median = max(aux_median)
. di median
338399.33
. restore
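A related check that might help probe the decimals hypothesis: -generate- stores new variables as float unless a type is specified, so, assuming the default type had not been changed with -set type-, W above is held in float precision. The sketch below uses made-up data (ten weights of 0.1) to compare a float running sum with a double one. The names W_float, W_double and gap are hypothetical, and I am not claiming this is what explains the difference with -summarize-. It only shows how one could see whether storage precision moves the cumulative weights.

```stata
* Toy data (hypothetical): ten observations, each with weight 0.1 stored as float
clear
set obs 10
gen float w = 0.1

gen float  W_float  = sum(w)             // running sum stored in float precision
gen double W_double = sum(w)             // running sum stored in double precision
gen double gap      = W_float - W_double // difference introduced by the storage type

list w W_float W_double gap, noobs
count if W_float != W_double             // how many running sums differ between the two types
```

On the real data, the same comparison could be made around the observations where the cumulative weight is close to 0.5*_N, since that is where the test W[_n-1] == P decides whether two values of y are averaged.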
Kind regards,
David