I am replicating an imputation procedure for a household survey, and I'm struggling to find a elegant coding solution for bracketed responses. Some respondents do not provide an exact dollar amount for their income, but they do provide a bracketed response.

In the stylized example below, the wages_bkt and retirement_bkt are the bracketed versions of wages and retirement, respectively. A bracket code of 1 corresponds to a $0 to $1,000 bracket range and 2 to over $1,000. I've already calculated bracket codes for the nonmissing responses.

For the observations missing values, I need to take a random draw from the valid values that fall within the bracket range specified for that observation and income type, e.g. for the household with $1,500 in wages and missing retirement income, I need to randomly draw a retirment income value from the valid retirement income reports that fall in bracket 2.

In my scratch code below, I:
  1. Rank the nonmissing values within each bracket range (I eventually plan to generate a random value and calculate within bracket rank by that number)
  2. Generate variables with the total number of valid responses from which we can draw for wages and retirement
  3. Draw a random number between 1 and the (bracket-level) number of valid reports, for observations with missing reports
I get the results at the bottom of the code. For the observation missing both wages and retirement, I want to pull the 2nd ranked wage value from the 2nd wage bracket, or $1,500. Ideally, I would do this without any file I/O (i.e. I'm hoping to avoid building a reference of valid values I merge back onto the observations missing data). I need to take 5 separate draws for each variable across ~20 variables (and store the data in a wide format). Further, I need to use a rolling 2-year sample across many quarters. I'm concerned that a solution relying on a lot of file I/O might prove quite inefficient.

Does anyone have any ideas how I might pull the valid values from the observations identified by the *_pull_from and *_bkt variables into the observations with invalid values in the example below? Or does anyone have an altogether different solution? I'm replicating an existing imputation procedure, so I'm constrained on the methodology. Thank you!

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int(wages retirement wages_bkt retirement_bkt)
 500 1100 1 2
 900  300 1 1
3000  600 2 1
2600  800 2 1
4600 1100 2 2
 700  400 1 1
 400 1100 1 2
1200 1700 2 2
1500    . 2 2
   .    . 2 1
   . 1200 1 2
end

bysort wages_bkt: egen w_rank = rank(wages) if !missing(wages), unique
bysort retirement_bkt: egen r_rank = rank(retirement) if !missing(retirement), unique

bysort wages_bkt: egen w_num_valid = max(w_rank)
bysort retirement_bkt: egen r_num_valid = max(r_rank)

set seed 123
gen w_pull_from = floor(runiform()*w_num_valid) + 1 if missing(wages)
gen r_pull_from = floor(runiform()*r_num_valid) + 1 if missing(retirement)

*Results in:
clear
input int(wages retirement wages_bkt retirement_bkt w_rank r_rank w_num_valid r_num_valid w_pull_from r_pull_from)
3000  600 2 1 4 3 5 4 . .
2600  800 2 1 3 4 5 4 . .
   .    . 2 1 . . 5 4 2 4
 900  300 1 1 4 1 4 4 . .
 700  400 1 1 3 2 4 4 . .
1200 1700 2 2 1 5 5 5 . .
1500    . 2 2 2 . 5 5 . 4
 400 1100 1 2 1 1 4 5 . .
 500 1100 1 2 2 3 4 5 . .
   . 1200 1 2 . 4 4 5 3 .
4600 1100 2 2 5 2 5 5 . .
end