Hi all,

I am working on a demographic analysis of who applies for jobs and who is selected, using Stata to analyze the data. Gender and ethnicity are missing for some applicants (some people preferred not to give this information); the share missing ranges from 10% to 20%. The conclusions from the analysis depend on who is in the unknown category, so there is interest in knowing more about the unknowns.

I’ve been asked which groups (e.g. white women) are more likely to be in the unknown category. I have a hard time answering this question, since by its very nature the unknown category is UNKNOWN, so how can we know this? I’m wondering what the best way to resolve this is. Here are some possibilities:

1. Impute the missing data. This seems like the best approach, though I am having a hard time convincing others of it.
  • For gender: Impute gender based on first name (I have seen some studies that do this).
  • For ethnicity: More of a problem. I’m not sure how reliable it is to impute ethnicity from last name. Multiple imputation is another possibility, although our data doesn’t have many additional fields to build a good model.
2. Review research on which groups are more or less likely to withhold their information on job applications. For example, maybe African American applicants are more likely to withhold their race. However, I’m not sure what to do with this information even if I could find such a study. If the finding comes from a different context, why would it necessarily hold in my data set?

3. Apply the same male/female proportion to the missing that is observed among the non-missing. For example, if 80% of those who answered the question are men and 20% are women, we would say that the missing group is also 80% men and 20% women. This seems flawed, since it assumes that men and women withhold their information at the same rate.

4. Apply the same male/female proportion to the missing that is found in the general population. Again, this seems flawed to me.
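
To make the assumption behind option 3 concrete, here is a rough sketch of the arithmetic (in Python rather than Stata, only to keep it self-contained). The counts and the `odds_ratio` parameter are made up for illustration and are not from my data. `odds_ratio` stands for how much higher women's odds of withholding are than men's; setting it to 1 reproduces option 3, and other values show how much the answer moves if that assumption is wrong.

```python
def share_women_among_missing(known_men, known_women, odds_ratio=1.0):
    """Implied share of women among the unknowns.

    odds_ratio is a hypothetical quantity: women's odds of withholding
    divided by men's odds of withholding. With odds_ratio == 1 the
    unknowns get the same split as the respondents (option 3).
    """
    # Expected unknowns in each group are proportional to
    # (respondents in the group) * (that group's odds of withholding).
    weighted_women = known_women * odds_ratio
    return weighted_women / (weighted_women + known_men)


# Hypothetical counts: 80% men / 20% women among respondents, 150 unknowns.
known_men, known_women, n_missing = 800, 200, 150

for odds_ratio in (0.5, 1.0, 2.0):
    share = share_women_among_missing(known_men, known_women, odds_ratio)
    print(
        f"odds ratio {odds_ratio}: {share:.1%} women, "
        f"about {share * n_missing:.0f} of {n_missing} unknowns"
    )
```

The point is only that the split of the unknowns is driven entirely by a number we cannot observe, which is why I am uneasy about treating option 3 as an answer rather than as one scenario among several.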

Any comments appreciated.