Hello,

I apologise for what I'm sure is a trivial question but I've been searching through the forums for an answer to no avail.

I have data from a cluster randomised trial in 3 arms, with 10 clusters in each arm. Each cluster contains 20 households and there is data on how many windows a house has and whether that window has a blind (a binary outcome, coded 0 or 1). I have missing data for 5 households out of the 600. Each entry in the dataset is for a window, identified by its household number (1-20), cluster number (1-30), arm number (1-3) and outcome (0 or 1), giving a total of 3162 entries, but there is no unique household identifier.

I'm trying to work out the proportion of households in each arm in which at least one of the windows has a blind.

As far as I can tell, I have to do two things:

(1) Generate a unique identifier for each household, which will usually have multiple entries because most houses have multiple windows.

The code I've come up with (which seems to work) is:
Code:
egen household_id = group(cluster_number household_number)
(2) Write some code which effectively says "if any of the entries under a particular household_id has a value of 1 for the outcome, count that household. If not, don't." And separate them by the three arms.

But I'm really stuck on how to do part (2).


Any assistance would be greatly appreciated.

Jacob