Difference-in-Difference - collinearity

Dear Statalisters,

I am relatively new to Stata, so please bear with me. Also, I am aware that this issue has already been addressed on this forum, but I don't seem to be able to find the solution to my problem.

I am using logit in Stata 15.1 to understand whether migration changes employment outcomes. My dataset is an unbalanced panel. For the purpose of this explanation, I will use the most basic specification (i.e. without any socio-economic control variables and without margins, which I use at a later stage).

I am typing:

Code:

logit employed i.migrant##i.migration i.year, cluster(ident)

where:
- employed is a binary variable equal to 1 for years when respondents were economically active, 0 otherwise.
- migrant is a 'treatment': a time-invariant binary variable for control and treatment groups, which equals 1 for migrants (those who migrated); 0 for non-migrants (those who stayed behind).
- migration is 'time' or 'post': a binary variable equal to 1 for years after migration, 0 for years before migration. As such, migration == 0 for both groups in the years before migration, and migration == 1 occurs only in the group that underwent the treatment, i.e. migrants. Non-migrants have migration == 0 in every year.

The problem I encounter is as follows: the interaction term is omitted due to collinearity, while the main effects of migrant and migration are estimated without problems. In the output below, the migration variable appears as l_mig2. More specifically, I obtain the following output:

Code:

 logit employed i.migrant##i.l_mig2 i.year, cluster(ident)

note: 1950.year != 0 predicts success perfectly
      1950.year dropped and 1 obs not used

note: 0.migrant#1.l_mig2 identifies no observations in the sample
note: 1.migrant#1.l_mig2 omitted because of collinearity
note: 2009.year omitted because of collinearity
Iteration 0:   log pseudolikelihood = -67798.349  
Iteration 1:   log pseudolikelihood = -67286.391  
Iteration 2:   log pseudolikelihood = -67285.374  
Iteration 3:   log pseudolikelihood = -67285.374  

Logistic regression                             Number of obs     =    104,797
                                                Wald chi2(60)     =     224.75
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -67285.374               Pseudo R2         =     0.0076

                                (Std. Err. adjusted for 4,502 clusters in ident)
--------------------------------------------------------------------------------
               |               Robust
      employed |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
     1.migrant |   -.259598   .0554714    -4.68   0.000      -.36832    -.150876
      1.l_mig2 |   .4030797   .0643309     6.27   0.000     .2769934     .529166
               |
migrant#l_mig2 |
          0 1  |          0  (empty)
          1 1  |          0  (omitted)
               |
          year |
         1950  |          0  (empty)
         1951  |  -.8324219   1.000956    -0.83   0.406     -2.79426    1.129416
         1952  |  -.9865726   .5585038    -1.77   0.077     -2.08122    .1080748
         1953  |  -.8025823   .3977641    -2.02   0.044    -1.582186   -.0229789
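
The first note in the output says that the cell with migrant == 0 and l_mig2 == 1 contains no observations. A cross-tabulation of the two indicators shows which treatment-by-period cells are populated. The following is a minimal sketch on made-up toy data (the rows are illustrative, not my real data), assuming the coding described above. The variable name inter is hypothetical and introduced only for this check.

```stata
* Toy data mirroring the coding described above (values are illustrative only)
clear
input str8 ident int year byte(migrant migration)
"B0000001" 2001 1 0
"B0000001" 2002 1 0
"B0000001" 2003 1 1
"B0000001" 2004 1 1
"C0001002" 2001 0 0
"C0001002" 2002 0 0
"C0001002" 2003 0 0
"C0001002" 2004 0 0
end

* Which treatment-by-post cells contain observations?
tabulate migrant migration

* If no non-migrant ever has migration == 1, the product of the two
* indicators is identical to the post indicator itself
generate byte inter = migrant * migration
assert inter == migration
```

If the assert passes on the real data as well, the interaction column carries no information beyond the post indicator, which would be consistent with the collinearity note above.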

Please note that I triple-checked the data to make sure they are coded correctly. An example for a migrant in my data would be:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 ident double year float(migrant migration)
"B0000001" 1991 1 0
"B0000001" 1992 1 0
"B0000001" 1993 1 0
"B0000001" 1994 1 0
"B0000001" 1995 1 0
"B0000001" 1996 1 0
"B0000001" 1997 1 0
"B0000001" 1998 1 0
"B0000001" 1999 1 0
"B0000001" 2000 1 0
"B0000001" 2001 1 0
"B0000001" 2002 1 0
"B0000001" 2003 1 1
"B0000001" 2004 1 1
"B0000001" 2005 1 1
"B0000001" 2006 1 1
"B0000001" 2007 1 1
"B0000001" 2008 1 1
"B0000001" 2009 1 1
end
format %ty year

A corresponding example for a non-migrant:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 ident double year float(migrant migration)
"C0001002" 1995 0 0
"C0001002" 1996 0 0
"C0001002" 1997 0 0
"C0001002" 1998 0 0
"C0001002" 1999 0 0
"C0001002" 2000 0 0
"C0001002" 2001 0 0
"C0001002" 2002 0 0
"C0001002" 2003 0 0
"C0001002" 2004 0 0
"C0001002" 2005 0 0
"C0001002" 2006 0 0
"C0001002" 2007 0 0
"C0001002" 2008 0 0
"C0001002" 2009 0 0
end
format %ty year

Is the problem driven by the fact that my time/post variable (here: migration) varies only for the treatment group (i.e. migrants)? Or is there another issue I am not aware of? I would be most grateful for your help.

Best wishes,

Justyna

BJ Data Tech Solution
