Hi everyone,

Unfortunately I'm not able to download the dataex package, since I'm working on an external server. I hope you can still understand my query and are willing to help me out! I know some questions have been asked about simulation before, but none of the posts really matches what I'm looking for.

I am investigating the relationship between employment and crime at the individual, monthly level. I have data on around 1 million individuals over 96 time periods (8 years, 12 months per year). For each person-month I know whether they were employed, whether they committed an offence, their monthly income and some other control variables.

My original dataset looks approximately like this:
id  time  emp  crime  income  age  crimehist
 1     1    1      0    2000   19          0
 1     2    1      0    1800   19          0
 1     3    0      1       0   19          0
 1     4    0      0       0   20          1
 1     5    1      0    1400   20          1
 2     1    1      0    1500   24          3
 2     2    1      1    1100   24          3
 2     3    1      1    1400   24          4
 2     4    0      0       0   24          5
 2     5    0      0       0   25          5

Here crime = 1 if someone committed a crime in that period, and emp = 1 if someone was employed in that period. crimehist is the number of crimes committed in the past year (not including the current period).

I want to run a fixed effects logistic regression to test for this relationship, as follows:

Code:
xtlogit crime emp age age2 crimehist, fe

To verify that a fixed effects logistic model is appropriate for this data, I have been asked to do a simulation study, mainly to check that the model provides consistent estimates of the parameters. The values of the independent variables and the error terms should be simulated, and the parameters should be given fixed values. The dependent variable can then be calculated for every observation. After estimating the model on the simulated data, I can check whether the estimated parameters are close to the true (chosen) values. By repeating this for different values of T, I should be able to see whether the estimates are consistent as long as T is large enough.
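
To make this concrete, my rough understanding of a single replication is sketched below. This is only a guess at the structure, not tested on my real data: the sample sizes, the distribution of the fixed effect alpha, the regressor sim_x and the "true" coefficients are all placeholders I made up, and it is meant to be run as a do-file.

```stata
* One replication of the simulation; all numbers are made-up placeholders
clear
set seed 12345
* hypothetical number of individuals (N) and periods (T)
local N 1000
local T 24
set obs `N'
gen long id = _n
* individual fixed effect, drawn once per person
gen alpha = rnormal(-2, 1)
expand `T'
bysort id: gen time = _n
xtset id time

* simulated regressors, deliberately correlated with the fixed effect
gen sim_emp = runiform() < invlogit(0.5 - 0.3*alpha)
gen sim_x   = rnormal() + 0.5*alpha

* true (chosen) parameter values
local b_emp = -0.5
local b_x   = 0.3

* dependent variable: equals 1 with probability invlogit(linear index)
gen sim_crime = runiform() < invlogit(alpha + (`b_emp')*sim_emp + (`b_x')*sim_x)

* estimate the fixed effects logit and compare with the true values
xtlogit sim_crime sim_emp sim_x, fe
display "true b_emp = `b_emp', estimated = " _b[sim_emp]
display "true b_x   = `b_x', estimated = " _b[sim_x]
```

Drawing the outcome as runiform() < invlogit(index) is meant to be equivalent to adding a logistic error term to the index and checking whether the sum is positive, so no separate error variable is generated.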

Even though there is some documentation on simulation studies online, I have not been able to find suitable code for this kind of simulation study.

I think I first need to know what kind of distribution my variables have. How do I find out? For example, income does not seem to be perfectly normally distributed (see picture). Should my simulated independent variable then have a distribution similar to the real data, or can I assume a normal distribution?

[Picture: distribution of income in the original data]
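
So far the only checks I can think of are the standard descriptive ones below, restricted to employed person-months because income is 0 otherwise. This assumes the variable names from my dataset above, and I am not sure these checks are sufficient. With this many observations the plots may be slow, so drawing a random subsample first might be necessary.

```stata
* skewness, kurtosis and percentiles of income among the employed
summarize income if emp == 1, detail

* histogram with a normal density overlaid for comparison
histogram income if emp == 1, normal

* quantile-normal plot: points far from the line suggest non-normality
qnorm income if emp == 1
```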

If I assume a normal distribution for all my independent variables, what would the next steps be?

To generate the income variable, I first used the code below, because income had a mean of 2416 and an S.D. of 1226 in the original dataset:
Code:
* sim_emp is the simulated 0/1 employment indicator (must already exist)
gen sim_inc = 0
replace sim_inc = rnormal(2416, 1226) if sim_emp != 0

However, this also produces negative values for income and a very different distribution overall (partly because I set income to zero whenever emp = 0).
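
To avoid the negative values, one option I am considering is a lognormal distribution with the same mean and S.D. For a lognormal with mean m and S.D. s, the parameters of the underlying normal are sigma^2 = ln(1 + (s/m)^2) and mu = ln(m) - sigma^2/2. A minimal sketch follows. Note that the 70% employment rate is a placeholder I made up, and I am assuming the mean and S.D. above refer to employed person-months only.

```stata
* hypothetical stand-in for the employment indicator
clear
set seed 2024
set obs 10000
gen sim_emp = runiform() < 0.7

* lognormal parameters that reproduce mean 2416 and S.D. 1226
local m = 2416
local s = 1226
local sigma2 = ln(1 + (`s'/`m')^2)
local mu     = ln(`m') - `sigma2'/2

* strictly positive income if employed, zero otherwise
gen sim_inc = cond(sim_emp, exp(rnormal(`mu', sqrt(`sigma2'))), 0)

* check the simulated moments among the employed
summarize sim_inc if sim_emp, detail
```

Whether a lognormal is actually a good match for the shape of my real income data is something I would still need to check.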

Once I have generated all independent variables, how do I create the dependent variable?

And how do I then run the regression, and check whether the logistic model leads to consistent parameter estimates?
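
My current guess is that this would involve wrapping the data generation and estimation in a program and repeating it many times with simulate, once for each value of T. The program name simfe, the number of replications, the sample sizes and the true coefficient of -0.5 in the sketch below are all made up for illustration, and I have not been able to verify that this is the recommended approach.

```stata
capture program drop simfe
program define simfe, rclass
    * n = number of individuals, t = number of periods (hypothetical defaults)
    syntax [, n(integer 500) t(integer 10)]
    drop _all
    set obs `n'
    gen long id = _n
    gen alpha = rnormal(-2, 1)
    expand `t'
    bysort id: gen time = _n
    xtset id time
    gen sim_emp = runiform() < invlogit(0.5 - 0.3*alpha)
    * true coefficient on sim_emp is -0.5 (chosen by me)
    gen sim_crime = runiform() < invlogit(alpha - 0.5*sim_emp)
    xtlogit sim_crime sim_emp, fe
    return scalar b_emp = _b[sim_emp]
end

set seed 12345
foreach T of numlist 5 20 50 {
    * 200 replications for each panel length T
    simulate b_emp = r(b_emp), reps(200) nodots: simfe, n(500) t(`T')
    display "T = `T'"
    * mean should be close to -0.5 if the estimator behaves well
    summarize b_emp
}
```

If I understand correctly, I would then compare the mean and spread of b_emp across replications with the true value for each T.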

Thanks a lot in advance for your help!