Hi,

I'm struggling with a geometric mean computation in the following case.
I need to create composite indexes based on the geometric (row) mean of multiple variables. The indexes are composed of different numbers of variables, and the variables have different distributions.
I wrote syntax that follows these steps:
1) standardization: generate "modified z-scores" based on the median absolute deviation (to minimize the impact of extreme values);
2) log transformation: store the sign of each value, then log transform abs(`var') + 1, so that the result is zero when `var' == 0;
3) back-transformation: take the arithmetic row mean of the log-transformed variables, store its sign, exponentiate its absolute value, subtract 1, and restore the sign.
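
Written out, this is what I understand the three steps to compute for observation i and variables j = 1, ..., k. The symbols x, z, t, m and G are my own notation, not from any Stata command, and sgn(0) is taken as 1, which does not matter because that term is zero anyway:

```latex
\begin{align*}
z_{ij} &= 0.6745\,\frac{x_{ij} - \operatorname{med}_i(x_{ij})}{\operatorname{MAD}_j},
  \qquad \operatorname{MAD}_j = \operatorname{med}_i\bigl(\lvert x_{ij} - \operatorname{med}_i(x_{ij})\rvert\bigr) \\
t_{ij} &= \operatorname{sgn}(z_{ij})\,\ln\bigl(1 + \lvert z_{ij}\rvert\bigr) \\
m_i &= \frac{1}{k}\sum_{j=1}^{k} t_{ij} \\
G_i &= \operatorname{sgn}(m_i)\,\bigl(e^{\lvert m_i\rvert} - 1\bigr)
\end{align*}
```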

This syntax is:

//Step 1 - standardization: compute "modified z-scores" (based on the median absolute deviation, to minimize the impact of extreme values)
Code:
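foreach var of varlist v* {
    qui su `var', det
    local med = r(p50)
    tempvar absdev
    gen double `absdev' = abs(`var' - `med')   // absolute deviation from the median
    qui su `absdev', det                        // r(p50) is now the median absolute deviation
    gen double `var'_zsco = 0.6745 * (`var' - `med') / r(p50)
}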
//Step 2 - logarithmic transformation
Code:
foreach var of varlist *zsco {
  //store the sign of the values before the logarithmic transformation
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0 & `var' != .
  replace s_`var' =   1 if `var' > 0 & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .  /*to avoid missing values when zsco == 0*/

  //logarithmic transformation of `var', adding 1 so it returns zeros when `var' == 0
  gen double i_`var' = ln(1+(abs(`var')))*s_`var'
}

//Step 3 - compute the arithmetic rowmean of the ln transformed variables and back-transform it (exponentiate, subtract 1, restore the sign)
Code:
egen double i_Mean = rowmean(i_*)

foreach var of varlist i_Mean {
//store the sign of the values of var
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0 & `var' != .
  replace s_`var' =   1 if `var' > 0 & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .
// exponentiate the arithmetic mean
  gen double exp_`var' = (exp(abs(`var')))-1
//restore the sign of var values
  replace exp_`var' = s_`var'*exp_`var'
}

I created an independent check for rows with positive z-scores only (as the gmean() function for egen in egenmore (SSC) ignores zeros and negatives).
Taking for granted that step 1 is irrelevant to the actual problem, I ran steps 2 and 3 on an earlier example provided by Nick (https://www.statalist.org/forums/for...62#post1360962)

The expected values are very close to what my syntax generates, but they do not match exactly (the correlation is .9948), and I just can't find where my mistake is.

All the values I get from my own steps 2 and 3 are slightly higher than the expected values.


//Generating example data
Code:
clear
set obs 10
set seed 2803
forval j = 1/5 {
      gen y`j' = ceil(100 * (runiform()^2))
}

list
     +-------------------------+
     | y1   y2   y3    y4   y5 |
     |-------------------------|
  1. | 86   63   45     8    1 |
  2. | 12   40   73   100    4 |
  3. | 60    1   74    61    4 |
  4. |  2    1    4     2   54 |
  5. | 12    1   22    22    4 |
     |-------------------------|
  6. |  1    7   15    84   14 |
  7. |  4    1   12    94    7 |
  8. | 40    2   15     2   89 |
  9. | 16   34   25     7    6 |
 10. | 15    6    3    44    6 |
     +-------------------------+
//Generating expected gmean values
Code:
gen double M1 = y1

quietly forval j = 2/5 {
    replace M1 = M1 * y`j'
}

replace M1 = exp(log(M1)/5)

list


//independent check 2 proposed by Nick
Code:
matrix test = (86, 63, 45, 8, 1)
gen test = test[1, _n]
means test

egen gmean = mean(ln(test))
replace gmean = exp(gmean)


means test

    Variable |       Type        Obs        Mean       [95% Conf. Interval]
-------------+---------------------------------------------------------------
        test | Arithmetic          5        40.6      -4.225618   85.42562
             |  Geometric          5    18.11458       1.794746   182.8326
             |   Harmonic          5    4.256322              .          .
-----------------------------------------------------------------------------
Missing values in confidence intervals for harmonic mean indicate
that confidence interval is undefined for corresponding variables.
Consult Reference Manual for details.

//Applying my syntax
//Step 2 - log transformation
Code:
foreach var of varlist y* {
  //store the sign of the values before the log transformation
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0 & `var' != .
  replace s_`var' =   1 if `var' > 0 & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .  /*to avoid missing values when var == 0*/

  //log transformation of `var', adding 1 so it returns zeros when `var' == 0
  gen double i_`var' = ln(1+(abs(`var')))*s_`var'
}

//Step 3 - compute the arithmetic rowmean of the ln transformed variables and back-transform it (exponentiate, subtract 1, restore the sign)
Code:
egen double i_Mean = rowmean(i_*)

foreach var of varlist i_Mean {
//store the sign of the values of var
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0  & `var' != .
  replace s_`var' =   1 if `var' > 0  & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .   /*to avoid missing values when var == 0*/
// exponentiate the arithmetic mean
  gen double exp_`var' = exp(abs(`var'))-1
//restore the sign of var values
  replace exp_`var' = s_`var'*exp_`var'
}


list y1 y2 y3 y4 y5 M1 exp_i_Mean

     +-------------------------------------------------+
     | y1   y2   y3    y4   y5          M1   exp_i_M~n |
     |-------------------------------------------------|
  1. | 86   63   45     8    1   18.114581   20.515226 |
  2. | 12   40   73   100    4   26.873536    27.83036 |
  3. | 60    1   74    61    4   16.104771    18.52345 |
  4. |  2    1    4     2   54   3.8663641   4.4817729 |
  5. | 12    1   22    22    4   7.4682237   8.2785434 |
     |-------------------------------------------------|
  6. |  1    7   15    84   14   10.430841   11.669224 |
  7. |  4    1   12    94    7   7.9413333    8.975884 |
  8. | 40    2   15     2   89   11.639123   12.966184 |
  9. | 16   34   25     7    6   14.169602    14.40053 |
 10. | 15    6    3    44    6   9.3453063    9.713163 |
     +-------------------------------------------------+
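
To compare the two calculations on a single row, here is a minimal sketch using row 1 of the listing above (the names g_plain and g_shift are made up for illustration). For a row where every value is positive, the sign handling in steps 2 and 3 drops out, so the second gen line is what I believe my syntax reduces to in that case:

```stata
clear
input y1 y2 y3 y4 y5
86 63 45 8 1
end

// plain geometric mean: exponentiated row mean of ln(y)
gen double g_plain = exp((ln(y1) + ln(y2) + ln(y3) + ln(y4) + ln(y5)) / 5)

// steps 2 and 3 for an all-positive row (assumption: signs are all +1):
// exponentiated row mean of ln(1 + y), minus 1
gen double g_shift = exp((ln(1 + y1) + ln(1 + y2) + ln(1 + y3) + ln(1 + y4) + ln(1 + y5)) / 5) - 1

list g_plain g_shift
```

Here g_plain should correspond to M1 and g_shift to exp_i_Mean for row 1 (18.114581 and 20.515226 in the listing).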
Any help figuring out where my mistake is would be much appreciated!
Best,
Martin