I’m analyzing grant applicants over several years. Within a given year, duplicates have been dropped (people who submitted more than one application that year), so each person appears at most once per year. However, it is quite common for the same person to appear in multiple years, and the number of years in the data varies across people.

The question I am trying to answer: is there a "statistically significant" linear trend over time in the percentage of females, the percentage of males, and the percentage unknown? I’d also like to show the fitted trends in a graph with confidence intervals. I realize that, because there is an unknown category, increases in the % females and % males over time need to be interpreted with caution.

Proposed setup: 3 separate logistic regression models. The outcomes are 1) female (vs not female), 2) male (vs not male), 3) unknown (vs not unknown).

The explanatory variable is year, coded as 1, 2, 3, etc. (used to estimate the linear trend).
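To make the setup concrete, here is a minimal sketch of what I have in mind, written in plain Python so it runs without any packages. The yearly counts in `COUNTS` are made up for illustration (not my real data), and the names `fit_logit_trend` and `predicted_with_ci` are just my own helpers. It fits each of the three models by Newton-Raphson and computes a fitted percentage with a 95% interval per year, which is what I would plot. The standard errors here treat every row as independent.

```python
import math

# Hypothetical counts per year as (female, male, unknown); NOT real data.
COUNTS = {1: (30, 60, 10), 2: (34, 58, 8), 3: (38, 55, 7),
          4: (42, 52, 6), 5: (46, 50, 4)}
CATEGORIES = ("female", "male", "unknown")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit_trend(years, y, iters=25):
    """Fit logit(P(y=1)) = b0 + b1*year by Newton-Raphson.

    Returns (b0, b1) and the 2x2 model-based covariance matrix,
    which assumes independent observations.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, yi in zip(years, y):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * x      # score for the slope
            h00 += w                # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta += information^{-1} * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Information from the last pass; at convergence it matches the estimate.
    cov = [[h11 / det, -h01 / det], [-h01 / det, h00 / det]]
    return (b0, b1), cov


def predicted_with_ci(beta, cov, year, z=1.96):
    """Fitted proportion and Wald interval, built on the logit scale."""
    eta = beta[0] + beta[1] * year
    var = cov[0][0] + 2 * year * cov[0][1] + year * year * cov[1][1]
    se = math.sqrt(var)
    return sigmoid(eta), sigmoid(eta - z * se), sigmoid(eta + z * se)


results = {}
for k, name in enumerate(CATEGORIES):
    years, y = [], []
    for year, counts in COUNTS.items():
        for j, n in enumerate(counts):
            years += [year] * n
            y += [1 if j == k else 0] * n   # 1 = this category, 0 = any other
    beta, cov = fit_logit_trend(years, y)
    results[name] = (beta, cov)
    se_slope = math.sqrt(cov[1][1])
    print(f"{name}: slope={beta[1]:.3f}, se={se_slope:.3f}, "
          f"z={beta[1] / se_slope:.2f}")
```

Calling `predicted_with_ci(*results["female"], year)` for each year would give the points and band for the female trend line.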

Question:

**1) Does one need to account for the fact that the same person can appear in different years?** For my purposes, I just want to know whether the overall percentage increased over the years, regardless of whether some applicants were the same people or not. Also, the outcome (gender) does not change over time within a given person. It therefore seems that my goal is to treat the observations as independent, even though some of the same people appear in more than one year. Can one fit a regular logit, or does one need something like a GEE to account for the panel structure of the data?
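One check I could imagine running is to fit the ordinary logit and then compare the usual standard error for the year slope with a cluster-robust (sandwich) one that groups rows by person. This is only a sketch under my own assumptions: the data are simulated, `person_id` is a hypothetical identifier, no small-sample correction is applied, and it is not a full GEE. The point estimate is the same in both cases; only the standard error changes.

```python
import math
import random
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(rows, iters=25):
    """rows: (person_id, year, y). Newton-Raphson fit of y on year."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for _pid, x, y in rows:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def slope_standard_errors(rows, b0, b1):
    """Return (naive SE, cluster-robust SE) for the year slope."""
    h00 = h01 = h11 = 0.0
    score = defaultdict(lambda: [0.0, 0.0])   # per-person summed score
    for pid, x, y in rows:
        p = sigmoid(b0 + b1 * x)
        w = p * (1.0 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
        score[pid][0] += y - p
        score[pid][1] += (y - p) * x
    det = h00 * h11 - h01 * h01
    naive = math.sqrt(h00 / det)              # independence assumed
    a0, a1 = -h01 / det, h00 / det            # slope row of information^{-1}
    # Sandwich variance for the slope: sum over people of (a . score)^2
    robust = math.sqrt(sum((a0 * s0 + a1 * s1) ** 2
                           for s0, s1 in score.values()))
    return naive, robust


# Simulated applicants (NOT real data): gender is fixed within a person,
# and each person applies in 1 to 3 consecutive years.
rng = random.Random(0)
rows = []
for person_id in range(300):
    first_year = rng.randint(1, 5)
    n_years = rng.randint(1, 3)
    female = 1 if rng.random() < 0.30 + 0.04 * first_year else 0
    for year in range(first_year, min(first_year + n_years, 6)):
        rows.append((person_id, year, female))

b0, b1 = fit_logit(rows)
se_naive, se_cluster = slope_standard_errors(rows, b0, b1)
print(f"slope={b1:.3f}, naive se={se_naive:.3f}, cluster se={se_cluster:.3f}")
```

If the two standard errors come out close on the real data, treating the rows as independent presumably costs little; if they differ noticeably, that would suggest the repeated applicants matter for inference.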

Note: this question is cross-posted here (no replies as of now): https://stats.stackexchange.com/ques...-for-clusterin