Hello everyone,

I have multilevel data (individuals nested within municipalities). My dataset contains information on individuals and on the municipality where they live. The table below shows an example of the structure of my data:
individual  age  sex     fear  municipality  homicide_rate  gini  security_spending
1           20   male    0.4   1             8              0.2   1000
2           25   male    0.2   1             8              0.2   1000
3           50   female  0.8   2             12             0.5   500
4           89   male    0.8   3             21             0.4   1200
5           75   male    0.4   3             21             0.4   1200
6           12   female  0.2   3             21             0.4   1200
7           54   female  0.1   4             17             0.3   3000
8           33   female  0.5   4             17             0.3   3000
9           60   female  0.7   4             17             0.3   3000
Overall, there are 740 different municipalities in my dataset, with a minimum of 19 and a maximum of 1700 individuals per municipality.
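
For reference, a minimal sketch of how these counts can be checked, and of how to confirm that the municipality-level variables really are constant within each municipality. The names muni_tag and n_ind are only illustrative; they are not variables in the dataset.
Code:
* Flag one row per municipality, then count individuals in each one
egen byte muni_tag = tag(municipality)
bysort municipality: generate long n_ind = _N
* Number of municipalities, then min and max individuals per municipality
count if muni_tag
summarize n_ind if muni_tag
* Stop with an error if gini varies within any municipality
bysort municipality (gini): assert gini[1] == gini[_N]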

My main goal is to estimate the impact of a municipality's income inequality on the fear of its residents. However, I suspect that the Gini coefficient is endogenous. Because my data are hierarchical, I estimate the model with the mixed command (Stata 14), using a manual two-stage (2SLS) procedure.

First, I regress the Gini coefficient on the instrumental variables and the municipality-level controls, and then store the predicted Gini:
Code:
reg gini IV1 IV2 homicide_rate security_spending, cluster(municipality)
predict gini_est
My first question is: since my observations are at the individual level, the first-stage estimates are potentially biased, because some municipalities contribute 19 observations and others 1700. Is adding the cluster(municipality) option enough to correct this problem?

Then I use the predicted Gini to estimate the mixed model as follows:
Code:
mixed fear gini_est homicide_rate security_spending age sex || municipality: , vce(cluster municipality)
I added the vce(cluster municipality) option to obtain robust clustered standard errors.

My second question is: does it seem correct to estimate my model this way, or is there something I could improve?

Thank you very much,
Lucie