Hi all,

I have a dataset where every row represents a study. Each row contains a column describing the region and country where the study was conducted, and several columns containing data on 5-year age-groups (from 0-4 to 99+). For each study, age-group variables are coded as 0 (the study sample does not contain that age group), 1 (the study sample contains that age group), or . (missing). The number of studies per country varies in the dataset (eg There are 8 studies done in China, 22 in the US, etc.) I am interested in determining what is the count of valid data (ie coded=1) for each age-group by country and region. This is an example of my dataset:

Code:
clear
input str71 sid_final str28 regionsimplified str22 country byte(age_20_24 age_25_29 age_30_34)
"HBSC 2009-2010" "Asia" "Armenia"    0 0 0
"World Health Survey" "Asia" "Bangladesh" 1 1 1
"Haq et al., 2005" "Asia" "Bangladesh" 1 1 1
"Cambodia Elderly survey 2004" "Asia" "Cambodia" 0 0 0
"Koyanagi et al., 2018" "Asia" "China" 0 0 0
"Liao et al., 2009" "Asia" "China" 1 1 1
"World Health Survey" "Asia" "China" 1 1 1
end
In the example given, I would expect to end up with
Bangladesh: 20-24: 2 studies; 25-29: 2 studies; 30-34: 2 studies
China: 20-24: 2 studies; 25-29: 2 studies; 30-34: 2 studies

I have tried to use the following command:
Code:
 preserve
collapse (first) regionsimplified (count) age_20_24 age_25_29 age_30_34, by(country)
but this gives me the total number of studies per country instead and the same number across all age groups, not necessarily the number of studies that have data for a particular age group per country. Any suggestions?
Thanks!