Hi,

I have a dataset that describes a hierarchy of up to 4 levels, where each row is one entry in the hierarchy and the variables l1 to l4 hold its levels (e.g., l1 = fruit, l2 = citrus, l3 = lemon). Below is an example of how the data are structured.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 l1 str8(l2 l3) str14 l4 float keep
"fruit"     ""         ""         ""               0
"fruit"     "citrus"   ""         ""               0
"fruit"     "citrus"   "lemon"    ""               1
"fruit"     "apple"    ""         ""               1
"fruit"     "bannana"  ""         ""               1
"vegetable" "root veg" ""         ""               0
"vegetable" "root veg" "carrot"   ""               1
"vegetable" "root veg" "parsnip"  ""               1
"vegetable" "root veg" "turnip"   ""               1
"vegetable" "root veg" "beetroot" ""               0
"vegetable" "root veg" "beetroot" "white beetroot" 1
"vegetable" "root veg" "beetroot" "red beetroot"   1
"vegetable" "tomato"   ""         ""               1
end

I want to keep only the fullest row for each unit within a level; that is, a row should be dropped whenever another row repeats all of its non-empty levels and also fills in a deeper level. For example, I would not keep row 2, as "citrus" also appears in row 3. Similarly, I would drop row 10, as "beetroot" appears in rows 11 and 12, but rows 11 and 12 would both be retained because they have different l4 entries. In this example I have manually added the variable keep to show which observations should be kept.

Is there a way to either automate the creation of the keep variable or collapse the data so that only the fullest rows are retained?
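
For concreteness, the rule could be sketched in Stata roughly as follows. This is only a minimal sketch, not a tested solution: it assumes levels are always filled from left to right (no gaps such as an empty l2 with a non-empty l3), and the variable names depth, order and keep2 are made up for illustration. It relies on empty strings sorting before non-empty ones, so that after sorting on l1 to l4 a less complete row sits directly above the first row that extends it.

```stata
* depth = number of non-empty levels in the row (assumes no gaps between levels)
gen byte depth = (l1 != "") + (l2 != "") + (l3 != "") + (l4 != "")

* remember the original order, then sort so that "" comes first within each group
gen long order = _n
sort l1 l2 l3 l4

* flag a row for dropping if the next row shares all of its filled levels
* and goes at least one level deeper
gen byte keep2 = 1
replace keep2 = 0 if depth == 1 & l1 == l1[_n+1] & depth[_n+1] > 1
replace keep2 = 0 if depth == 2 & l1 == l1[_n+1] & l2 == l2[_n+1] & depth[_n+1] > 2
replace keep2 = 0 if depth == 3 & l1 == l1[_n+1] & l2 == l2[_n+1] & l3 == l3[_n+1] & depth[_n+1] > 3

* restore the original order and compare with the hand-coded variable
sort order
assert keep2 == keep
```

If the flag looks right, something like keep if keep2 would then retain only the fullest rows.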

Thank you for any help,
Bryony