This started as a question, but I came up with a solution that I couldn't find documented elsewhere. The method uses a matrix. I'm very new to matrices and had no confidence that the method would work - but it did! I'm sure this method is far from novel, but it was fun to try it out and discover that it works! I share it here for discussion - is this a good solution? Are there better solutions?

Here is the problem:
I have a large dataset of blood values for children. I want to create a binary variable anaemia that is 0 if the child is not anaemic and 1 if the child is anaemic. The normal values for haemoglobin vary by age and sex. In the past I have used published mean (SD) values to generate Z-scores, and identified anaemia as children with a haemoglobin Z-score less than -2. There are some issues with the parametric assumptions underlying that approach. Here I will be using lower bound cut-offs as published in: Staffa et al, Pediatric hematology normal ranges derived from pediatric primary care patients. Am J Hematol [Internet]. 2020 Oct [cited 2022 Dec 17];95(10).

The relevant variables in my dataset for achieving this are:
  • haemoglobin (path_hb)
  • sex (sex)
  • date of birth (dob)
  • date of blood test (path_date)
From these I generated a categorical variable for age at blood test (path_age), with each category corresponding to the age categories used in the reference.

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int path_hb byte sex int(dob path_date) float path_age
128 1 20853 21570  7
112 1 19185 21625  9
117 2 21397 22484  8
102 2 21642 22486  8
 92 1 22337 22465  6
123 1 20591 22451  8
108 2 22363 22468  6
126 1 19719 22486  9
 81 1 21210 22540  8
 99 2 21917 22563  7
111 2 20887 22564  8
133 1 20385 21624  8
 99 1 22084 22575  7
 76 1 22257 22584  7
 99 2 21312 22580  8
106 1 22279 22635  7
105 1 22325 22636  7
 91 2 22481 22644  6
120 1 19505 21627  8
103 2 20331 21629  8
130 2 19890 21631  8
114 1 20886 21633  8
119 1 20965 21644  7
 89 1 21510 21646  6
105 2 20648 21650  8
101 2 19456 21556  8
108 1 20751 21678  8
117 1 21525 21680  6
100 2 19458 21693  9
 89 1 21452 21715  7
139 1 18245 21720  9
113 1 21175 21726  7
121 1 19665 21735  8
119 1 19253 21755  9
 80 1 21279 21749  7
 93 1 20515 21756  8
113 2 21448 21557  6
126 1 20900 21760  8
132 1 20744 21764  8
 91 1 20860 21788  8
end
format %dM_d,_CY dob
format %dM_d,_CY path_date
label values sex sex_lbl
label def sex_lbl 1 "Male", modify
label def sex_lbl 2 "Female", modify
label values path_age path_age_lbl
label def path_age_lbl 5 "31d to 60d", modify
label def path_age_lbl 6 "61d to 180d", modify
label def path_age_lbl 7 "6m to <2y", modify
label def path_age_lbl 8 "2y to <6y", modify
label def path_age_lbl 9 "6y to <12y", modify
label def path_age_lbl 10 "12y to <18y", modify
The brief is to create a binary variable anaemia that is 0 if the haemoglobin is ≥ cut-off for age and sex, and 1 if the haemoglobin is < cut-off for age and sex. My first solution was simply many iterations of:

Code:
replace anaemia = 1 if sex == X & path_age = Y & path_hb < Z
I thought that looked messy, so I tried another solution. I created a matrix for the reference lower cut-offs as follows:

Code:
matrix input hbRef = ( 128, 128 \ /// 1: 1-3 days
133, 130 \ /// 2: 4-7 days
110, 120 \ /// 3: 8-14 days
98, 102 \ /// 4: 15-30 days
90, 89 \ /// 5: 31-60 days
94, 96 \ /// 6: 61-180 days
102, 103 \ /// 7: 181 days to <2 years
107, 107 \ /// 8: 2 years to <6 years
113, 112 \ /// 9: 6 years to <12 years
124, 114 ) // 10: 12 years to <18 years)
matrix colnames hbRef = "Males" "Females"
matrix rownames hbRef = "1d to 3d" "4d to 7d" "8d to 14d" "15d to 30d" "31d to 60d" "61d to 180d" "6m to <2y" "2y to <6y" "6y to <12y" "12y to <18y"
Which gives the following:

Code:
. mat list hbRef

hbRef[10,2]
               Males  Females
   1d to 3d      128      128
   4d to 7d      133      130
  8d to 14d      110      120
 15d to 30d       98      102
 31d to 60d       90       89
61d to 180d       94       96
  6m to <2y      102      103
  2y to <6y      107      107
 6y to <12y      113      112
12y to <18y      124      114
Note that the columns correspond to my sex codes and the rows correspond to my path_age codes. Now I can create my anaemia variable as follows:

Code:
gen anaemia = .
replace anaemia = 0 if !missing(path_hb) & !missing(path_age)
replace anaemia = 1 if path_hb < hbRef[path_age, sex]
label variable anaemia "Anaemic for age"
label values anaemia yesno_lbl
How do these methods compare? Six in one, half a dozen in the other?