Dear Stata list,

I am agonising over a problem and would very much appreciate your support. I have a dataset with some 27,000 observations and 260 variables. The data is structured in group-years, i.e. I observe more than 2,000 actors over varying periods of time. I run Stata version 14.2.
I drew the data from some 20 datasets. This means for example that information for the conflict behaviour of an actor is based on data from dataset A (“conflict_A”), dataset BC (“conflict_BC”), dataset DEF (“conflict_DEF”) and dataset G (“conflict_G”). For other variables I use partly the same, but also other datasets, here e.g. the number of fatalities caused per year (datasets A, BC, DEF, Z).

My plan is the following:
  • For each category of variables (conflict, fatalities, etc.) create a master variable (master_CONF, master_FATA, etc.) that combines the individual observations according to some rules:
    • Rule 1: Datasets A and BC are superior to datasets DEF, G and Z.
    • Rule 2: Within each group of datasets (A and BC vs. DEF, G and Z), find the variable with the least missing observations by id. Use all of the observations of this variable for this id and fill the master variable. If there are still missings in the master variable, find the variable with the second least missings per id in this dataset group. If all variables by dataset group are exhausted and there are still missings, move on to the next group of datasets and do the same. For example, even if variable A has more missings than variable G, it is preferred as long as it has a value by id. Observations from DEF, G and Z are only included when A and BC still show a missing for that particular observation.
    • Rule 3: For the case when variables have the same number of missings per id within the same group of datasets: I would like to choose the variable with the highest sum total (here highest overall sum for conflict behaviour or highest overall sum of fatalities).
  • For each master variable, create a source variable (string) that records the name of the contributing variable.

Rules in the order of appliance:
Ranking of datasets (superior > inferior)
Number of missing obs by id within group of datasets (lesser missings by id > more missings by id)
If same number of missings by id in same group of datasets: Overall score by id (Highest sum total > less sum total)

I wrote a code that achieves almost all rules – except for the repeated loops within the same group of datasets. Is there a way to tell Stata that some variables (identified by their name ending) are to be treated preferential for the entire operation? Also, I wonder whether there is a code that is more parsimonious, since I have so many variables in my real dataset.

Code:
clear
input id year conf_A conf_BC conf_DEF conf_G fata_A fata_BC fata_DEF fata_Z
1 1990 . 0 0 . . 0 0 .
1 1991 1 2 0 . 23 111 0 .
1 1992 2 2 . . 114    . .    .
1 1993 1 2 1 . 19 . 27    .
2 2003 0 . 0 0 0 . 0 0
2 2004 . . 0 1 . . 0 44
13 1992 2 1    1 0    143    . 55 0
13 1993 2 1    1 .    197    17 . .
13 1994 . 1    1 .    . 10 .    .
89 2001 1 .    . 2    . .    . 101
89 2002 2 .    2 2    . .    198    165
89 2003 . .    2 2    . .    153    153
89 2004 1 2    . 1    12 200 . 22
89 2005 1 2    . .    32 176 . .
89 2006 . .    1 .    . .    62 .
89 2007 . .    1 0    . .    29 0
end

local conf conf_A conf_BC conf_DEF conf_G
local fata fata_A fata_BC fata_DEF fata_Z


* Create variables that count the missing obs by id:

foreach x of varlist `conf' `fata' {
    gen lack_`x' = missing(`x')
    bysort id: egen mis_`x' = total(lack_`x')
    drop lack_*
    }
           

* Create source variables that record the origin of the obs:

// rule 2 - least missings:

local mis_CONF mis_conf_A mis_conf_BC mis_conf_DEF mis_conf_G
local mis_FATA mis_fata_A mis_fata_BC mis_fata_DEF mis_fata_Z

egen min_mis_CONF = rowmin(`mis_CONF')
egen min_mis_FATA = rowmin(`mis_FATA')

foreach x of varlist `mis_CONF' {
    gen name_`x' = ""
    replace name_`x' = "`x'" if `x' == min_mis_CONF
    }
    
foreach x of varlist `mis_FATA' {
    gen name_`x' = ""
    replace name_`x' = "`x'" if `x' == min_mis_FATA
    }

// is there a shorter way that combines both loops above?


// rule 1 - ranking by datasets (A and BC are superior to DEF, G and Z)
// the only ranking I can come up with is ex post:

replace name_mis_conf_DEF = "" if name_mis_conf_A != "" | ///
name_mis_conf_BC != ""
replace name_mis_conf_G = "" if name_mis_conf_A != "" | ///
name_mis_conf_BC != ""

replace name_mis_fata_DEF = "" if name_mis_fata_A != "" | ///
name_mis_fata_BC != ""
replace name_mis_fata_Z = "" if name_mis_fata_A != "" | ///
name_mis_fata_BC != ""

// this procedure selects the variable with the least missings. but it does not
// consider that A is better to DEF, G and Z even if A has more missings.
// I would like to run the loop by dataset-group (ranking) as long as there are
// missings in the source variable, but values in the other variables.
// Is there an alternative to rowmin, that selects the minimum value across obs
// in each loop?


// rule 3: if equal number of missings within groups, the highest sum by id prevails

gen overlap_CONF = 0
replace overlap_CONF = 1 if name_mis_conf_A != "" & name_mis_conf_BC ///
!= ""
replace overlap_CONF = 1 if name_mis_conf_DEF != "" & name_mis_conf_G ///
!= ""

gen overlap_FATA = 0
replace overlap_FATA = 1 if name_mis_fata_A != "" & name_mis_fata_BC ///
!= ""
replace overlap_FATA = 1 if name_mis_fata_DEF != "" & name_mis_fata_Z ///
!= ""

foreach x of varlist `conf' `fata' {
    bysort id: egen sum_`x' = total(`x')
    }


// rule 1 - ranking by datasets (A and B are superior to C, D and Z)
// the only ranking I can come up with is ex post:

replace name_mis_conf_A = "" if overlap_CONF == 1 & (sum_conf_A < ///
sum_conf_BC)
replace name_mis_conf_BC = "" if overlap_CONF == 1 & (sum_conf_BC < ///
sum_conf_A)
replace name_mis_conf_DEF = "" if overlap_CONF == 1 & (sum_conf_DEF < ///
sum_conf_G)
replace name_mis_conf_G = "" if overlap_CONF == 1 & (sum_conf_G < ///
sum_conf_DEF)

replace name_mis_fata_A = "" if overlap_FATA == 1 & (sum_fata_A < ///
sum_fata_BC)
replace name_mis_fata_BC = "" if overlap_FATA == 1 & (sum_fata_BC < ///
sum_fata_A)
replace name_mis_fata_DEF = "" if overlap_FATA == 1 & (sum_fata_DEF < ///
sum_fata_Z)
replace name_mis_fata_Z = "" if overlap_FATA == 1 & (sum_fata_Z < ///
sum_fata_DEF)


// one source variable only

gen name_combi_conf = ""
foreach x in name_mis_conf_A name_mis_conf_BC name_mis_conf_DEF name_mis_conf_G {
    replace name_combi_conf = `x' if `x' != ""
}


gen name_combi_fata = ""
foreach x in name_mis_fata_A name_mis_fata_BC name_mis_fata_DEF name_mis_fata_Z {
    replace name_combi_fata = `x' if `x' != ""
}


// keep only the dataset indicator (which has varying length in dataset)

gen split_name_combi_conf = strrpos(name_combi_conf, "_")
gen source_CONF = ""
replace source_CONF = substr(name_combi_conf, split_name_combi_conf +1, .)

gen split_name_combi_fata = strrpos(name_combi_fata, "_")
gen source_FATA = ""
replace source_FATA = substr(name_combi_fata, split_name_combi_fata +1, .)


    
* Create master variables based on the source variable:     

gen master_CONF = .
foreach x in A BC DEF G {
    replace master_CONF = conf_`x' if source_CONF == "`x'"
}


gen master_FATA = .
foreach x in A BC DEF Z {
    replace master_FATA = fata_`x' if source_FATA == "`x'"
    }

drop mis_* min_* name_* sum_* overlap_* split_*
sort id year
br
My code is close, but not quite how it should look like, which is like this:

id year conflict_A conflict_BC conflict_DEF conflict_G master_CONF source_CONF fatalities_A fatalities_BC fatalities_DEF fatalities_Z master_FATA source_FATA
1 1990 . 0 0 . 0 conflict_BC . 0 0 . 0 fatalities_BC
1 1991 1 2 0 . 2 conflict_BC 23 111 0 . 23 fatalities_A
1 1992 2 2 . . 2 conflict_BC 114 . . . 114 fatalities_A
1 1993 1 2 1 . 2 conflict_BC 19 . 27 . 19 fatalities_A
2 2003 0 . 0 0 0 conflict_A 0 . 0 0 0 fatalities_A
2 2004 . . 0 1 1 conflict_G . . 0 44 44 fatalities_Z
13 1992 2 1 1 0 1 conflict_BC 143 . 55 0 143 fatalities_A
13 1993 2 1 1 . 1 conflict_BC 197 17 . . 12 fatalities_BC
13 1994 . 1 1 . 1 conflict_BC . 10 . . 10 fatalities_BC
89 2001 1 . . 2 1 conflict_A . . . 101 101 fatalities_Z
89 2002 2 . 2 2 2 conflict_A . . 198 165 165 fatalities_Z
89 2003 . . 2 2 2 conflict_G . . 153 153 153 fatalities_Z
89 2004 1 2 . 1 1 conflict_A 12 200 . 22 200 fatalities_BC
89 2005 1 2 . . 1 conflict_A 32 176 . . 176 fatalities_BC
89 2006 . . 1 . 1 conflict_DEF . . 62 . 62 fatalities_DEF
89 2007 . . 1 0 0 conflict_G . . 29 0 0 fatalities_Z

Any hint is very much welcome.

Thank you so much. Best, Tom.