Hi Stata Forum,

I need theoretical and coding help. (I am a beginner in data analysis, so if something doesn't make sense, it is because I am still learning, and corrections are welcome.)

I have a survey panel data set with 3 waves. Because of attrition and refreshment samples that were only interviewed once, I am first doing a simple evaluation with pooled OLS on a binary dependent variable. My goal is then to estimate a logit, a fixed-effects model and a fixed-effects logit. At the very least, the pooled OLS also helps me check whether I did the data cleaning correctly.

It looks like I did not! When I run the regression, a key variable is omitted due to collinearity. When I use "vif", the mean VIF is 1.24 (below ten), and the individual VIFs range from 1.06 to 1.55.

Following the literature, I planned to build country groups and include them in the regression, but country group 2 was omitted. The idea behind the groups is that certain European countries have similar characteristics, and previous causal analyses on the topic show that the outcome follows a northern-southern European pattern, so I tried to do the same. The code I wrote looks like this. It seems this data cleaning step is what gave me problems. As you can see, group 1 is much smaller than groups 2 and 3.

Code:
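* Countries not listed in any inlist() end up with groups == 0 (unlabeled)
gen groups = 1 * inlist(country, 15, 16) + ///
             2 * inlist(country, 13, 18, 17, 23) + ///
             3 * inlist(country, 12, 11, 34, 28, 35, 20)
label define gnames ///
       1 "South"  ///
       2 "North"  ///
       3 "Middle"
label values groups gnames
label variable groups "Country groups"
* One 0/1 dummy per observed value of groups: groupy_1, groupy_2, ...
tabulate groups, generate(groupy_)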
Then I tried running the "reg" command with the variables of interest separately for each group: reg $dep $indep if groups == 1, then if groups == 2, and so forth. No variable was omitted, but the R-squared drops from 0.0882 to 0.0408, and the new mean VIF is 1.26 (individual VIFs ranging from 1.03 to 1.78).
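
To make the two specifications described above concrete, here is a minimal sketch. It assumes that the globals $dep and $indep are already defined, that $indep does not itself contain the groupy_ dummies, and that every observation falls into group 1, 2 or 3. None of this is confirmed by the output, so treat it as an illustration rather than the exact commands that were run. One thing the first command illustrates: when a full set of group indicators is entered together with a constant, one category has to be left out as the reference, so a single omitted group in the pooled model is not necessarily a data cleaning error.

```stata
* Assumes $dep and $indep are defined and that groups takes the values 1, 2, 3.

* Pooled OLS with the country group as a factor variable.
* ib1. makes group 1 (South) the reference category, so only
* groups 2 and 3 get coefficients; this is by design.
regress $dep $indep ib1.groups
estat vif

* Separate regressions for each country group, with VIFs after each one.
forvalues g = 1/3 {
    display as text "Country group `g'"
    regress $dep $indep if groups == `g'
    estat vif
}
```

If some countries fall outside all three lists, groups also takes the value 0 and the factor variable gains a fourth level, which would be worth checking with a plain tabulate groups first.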



Question 1. Do you have any ideas on how to tackle this correctly? The idea was to form country groups, or country clusters, so that the estimates for all variables are specific to each country group, because the outcome variable depends on the region (e.g. Northern Europe has more active elderly people than Southern Europe). Each country group has a different number of observations.

Question 2. What information exactly do you need in order to clarify this? I don't know how to copy the whole data set or the regression output. I could not get dataex to work correctly.
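
For reference, a minimal sketch of the kind of commands that would produce a postable data excerpt and a copyable regression output. It assumes the globals $dep and $indep and the variables country and groups exist as described above; the log file name is only a placeholder.

```stata
* Assumes $dep, $indep, country and groups exist as described in the post.

* Post a small excerpt of the relevant variables, not the whole data set.
dataex $dep $indep country groups in 1/20

* Write the regression output to a plain-text log that can be pasted.
log using reg_output.txt, text replace   // placeholder file name
regress $dep $indep
log close
```

The dataex output can be pasted as is, and the text log can be opened in any editor to copy the regression table.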

If anyone could help me with this problem, that would be amazing; I lack the knowledge for this.

Thanks