Hello everyone,


Suppose I plot a histogram:
Code:
clear
set obs 10

g z = _n
replace z = 5 if _n > 5

hist z
Given the plotted histogram, I would like to generate two new variables:
  1. bin, giving the bin which a given observation belongs to.
  2. density, giving the density of the bin which the observation belongs to.
Generating these two variables manually for the histogram plotted above would give:
Code:
g correct_bin = 1 if inrange(_n, 1, 2)
replace correct_bin = 2 if _n == 3
replace correct_bin = 3 if _n >= 4

g correct_density = 0.15 if inrange(_n, 1, 2)
replace correct_density = 0.075 if _n == 3
replace correct_density = 0.525 if _n >= 4
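For reference, the manual densities above are consistent with density = (count in bin) / (N × width). The lines below are only a sketch of that arithmetic. They assume hist used three equal-width bins over [1, 5], which is what the documented default rule of min(sqrt(N), 10*ln(N)/ln(10)) bins suggests for N = 10. The bin counts 2, 1, and 7 follow from correct_bin, and the scalar name binwidth is made up for illustration.

```stata
* Assumed: 3 equal-width bins over [1, 5] and N = 10 observations
scalar binwidth = (5 - 1) / 3

* density = count in bin / (N * width)
* bin 1 holds 2 observations: about 0.15
display 2 / (10 * scalar(binwidth))
* bin 2 holds 1 observation: about 0.075
display 1 / (10 * scalar(binwidth))
* bin 3 holds 7 observations: about 0.525
display 7 / (10 * scalar(binwidth))
```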
I have tried using the command twoway__histogram_gen to create a solution. However, while my solution works in the above case, it does not seem to work in other cases, for example when:
  • Bins are not adjacent, for example bin 1 = [1,2) and bin 2 = [5,6).
  • The sample size grows and the values of z are continuous, in which case numerical issues quickly arise.
I suspect a combination of twoway__histogram_gen and egen cut could be used to generate a correct solution. Below is my attempt, which works for my toy example. I first outline the idea and then provide the code:
  1. Use twoway__histogram_gen to find the midpoint of each bin.
  2. Adjust the midpoints to be the start points of the bins.
  3. Create new variables x_v, each constant and equal to the start point of bin v.
  4. Check which interval [x_v, x_{v+1}) each observation belongs to. [a]
  5. Find the corresponding density of that bin.
Here is the code with each step labelled as in the description above:
Code:
* 1, finding midpoints
twoway__histogram_gen z, gen(y x)

* 2, adjusting midpoints to start points
local adjust = (x[2] - x[1]) / 2
replace x = x - `adjust'

* 3, generating variables constant to startpoints
count if x != .
local N = r(N)
forvalues v = 1/`=`N'+1' {
    g x_`v' = x[`v']
}

* 4, finding bin of each observation
g new_bin = .
forvalues v = 1/`N' {
    replace new_bin = `v' if x_`v' <= z & z < x_`=`v'+1'
}

* 5, finding density of the bin
g new_density = .
forvalues v = 1/`N' {
    replace new_density = y[`v'] if new_bin == `v'
}
Checking that this gives the correct solution:
Code:
assert correct_bin == new_bin
assert correct_density == new_density
Finally, note that browsing the data shows rounding is already becoming a slight problem: we have x_1 == .99999994 instead of x_1 == 1.
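One possible way to sidestep the midpoint adjustment, sketched under assumptions rather than offered as a definitive fix: with equal-width bins, the bin index can be computed directly from the bin start and width in double precision. The sketch assumes twoway__histogram_gen leaves r(start), r(width), and r(bin) behind, which is worth confirming with return list. The names h2, mid2, bin2, n_in_bin, density2, and the h_* scalars are hypothetical.

```stata
* Sketch only: assumes r(start), r(width), and r(bin) are returned (check with -return list-)
twoway__histogram_gen z, gen(h2 mid2)
scalar h_start = r(start)
scalar h_width = r(width)
scalar h_bins  = r(bin)

* Bin index from start and width; a value on the top edge is folded into the last bin
gen long bin2 = min(floor((z - scalar(h_start)) / scalar(h_width)) + 1, scalar(h_bins)) if !missing(z)

* Density = (count in bin) / (N * width); egen with by() leaves the sort order alone
quietly count if !missing(z)
scalar h_N = r(N)
egen long n_in_bin = count(z), by(bin2)
gen double density2 = n_in_bin / (scalar(h_N) * scalar(h_width)) if !missing(bin2)

* Compare with the manual values, allowing for float versus double storage
assert bin2 == correct_bin
assert reldif(density2, correct_density) < 1e-6
```

Because only bins that contain observations ever receive an index, gaps between occupied bins need no special handling here. Values lying exactly on an interior bin edge can still land on either side of it through floating-point error, so this reduces the rounding issue rather than removing it.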


Thanks in advance for help!
Simon


[a] I'm not sure that Stata's histogram bins are indeed of the form [x_v, x_{v+1}) (i.e., closed-open), but my investigations indicate this is true (with the final bin having endpoint infinity).