Hello everyone,

This is my first post on the forum, so I hope that I will be able to provide all the information necessary to answer my question. Please bear with me if this is not the case.

I am currently trying to merge a cross-section of a panel data set called the Indonesian Family Life Survey (https://www.rand.org/well-being/soci.../FLS/IFLS.html). The original data set consists of separate .dta files, each covering different questions of the survey. The files have different levels of observation, e.g. the household, the individual, an individual's physical activity, individual interview time, etc. Accordingly, each level of observation has its own identifiers, and the number of identifiers varies between levels. For example, the household has only one identifier (hhid14), whereas the individual needs two identifiers for a unique observation (hhid14, pid14). Most other files need three identifiers for a unique observation, e.g. for individual interview time, they are hhid14, pid14, and time_occ. However, the third identifier may not be the same across levels of observation. For an individual's physical activity, the identifiers would be hhid14, pid14, and kktype. All files have hhid14 as ONE identifier.

My goal is to have a fully merged data set that contains all questions of the cross-section. I have tried to use the merge command, but I am not quite sure which .dta file I should use as the master file or which option (1:1, 1:m, m:1, m:m) is correct for the different levels of observation. I am wary of the m:m option, but I think it may actually be the appropriate choice here.

If I use a dataset as a master file where the household is the level of observation (and hence hhid14 the unique ID), I am not quite sure what my code should look like.

Here is an example of my code:

merge 1:1 hhid14 using b3a_cov, nogen (household-level data, only one unique identifier)

merge 1:m hhid14 using b3a_dl1, nogen (individual-level data, not quite sure how to tell Stata that pid14 is the other ID)

merge 1:m pid14 hhid14 using b3a_dl2, nogen (three unique identifiers)

merge m:m pid14 hhid14 using b3a_dl3, nogen (three unique identifiers, but a different one from the line before. Since I already have three different IDs in the data set, I use m:m)
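
In case it helps, here is a rough sketch of how I could check which variables actually identify unique observations in each file before merging. It assumes that the four files above are in the current working directory and that I have described their identifiers correctly, which I am not certain of. As far as I understand, isid reports an error when the listed variables do not uniquely identify the observations, and duplicates report shows how many observations share the same values.

```stata
* Sketch only: check which variables uniquely identify rows in each file.
* Assumes the .dta files are in the current working directory.
foreach f in b3a_cov b3a_dl1 b3a_dl2 b3a_dl3 {
    use `f', clear
    display as text "File: `f'"
    * capture noisily prints any error message without stopping the loop
    capture noisily isid hhid14
    capture noisily isid hhid14 pid14
    * Show how many rows share the same household-person pair
    capture noisily duplicates report hhid14 pid14
}
```

If isid fails for a given set of variables in a file, those variables alone cannot serve as the key on that file's side of a 1:1, 1:m, or m:1 merge.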

I am not quite sure if this is the correct approach as I am pretty new to merging data sets. Would anyone be willing to help me out here?

Thank you for your time.

Best,

Jérôme