Hi everyone,

I am trying to identify the causal effect that X has on y.
I am running pooled OLS on an unbalanced panel covering the years 2011, 2012, 2013, 2016, 2017 and 2018.

y_{it} = a + BX_{it} + Controls + d_{t} + u_{it}
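
For concreteness, here is a minimal sketch of this specification on simulated data. It assumes d_{t} stands for year dummies and leaves out the other controls; the sample size, the true value of B and the data-generating process are all made up for illustration and are not the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

YEARS = np.array([2011, 2012, 2013, 2016, 2017, 2018])  # waves listed above
N = 500        # hypothetical number of individuals
TRUE_B = 0.5   # hypothetical true effect of X on y

# Unbalanced panel: randomly drop about 20% of person-years.
person = np.repeat(np.arange(N), YEARS.size)
year = np.tile(YEARS, N)
keep = rng.random(person.size) < 0.8
person, year = person[keep], year[keep]

# Hypothetical data-generating process with a linear year effect.
x = rng.normal(size=year.size)
y = 1.0 + TRUE_B * x + 0.1 * (year - 2011) + rng.normal(size=year.size)


def pooled_ols_slope(y, x, year):
    """Pooled OLS of y on x, a constant and year dummies; returns the estimate of B."""
    waves = np.unique(year)
    # One dummy per wave except the first, which serves as the baseline.
    dummies = (year[:, None] == waves[None, 1:]).astype(float)
    design = np.column_stack([x, np.ones_like(x), dummies])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0])


b_pooled = pooled_ols_slope(y, x, year)
print(f"pooled OLS estimate of B: {b_pooled:.3f} (true value {TRUE_B})")
```

In this simulation X is exogenous by construction, so pooled OLS should land close to the true B; the sketch only fixes notation and is not evidence about the real data.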

I am aware that pooled OLS may be biased for many reasons, but I want to focus on one in particular: reverse causality. I hypothesise that y_{it} does not contemporaneously cause X_{it} (i.e. current y has no causal effect on current X). However, I think that distantly lagged values of y, such as y_{i,t-20}, do affect X_{it}. Because these lagged values are highly correlated with current y, I expect this to bias the estimate of B.

In summary: I am trying to test if X has a causal effect on y. Current y will not have a causal effect on current X, but lagged y (by 15-20 years) will, and this is highly correlated with current y.

My first question: is this reverse causality, or is it in fact omitted variable bias (since I have omitted lagged y)?

My proposed solution is to use a fixed effects (FE) or first-differenced (FD) model. This might address other problems with pooled OLS, but will it remove the problem described above?

My logic is as follows: FE/FD measures changes in variables within a person. If I find a positive correlation between a change in X and a change in y (using a fixed effects or first-differenced model), I can assume that the change in X caused the change in y, because y can only affect X in the long term (say 20 years), whereas X can affect y contemporaneously.
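
To make the within-person logic concrete, here is a minimal sketch of the FE (within) and FD transformations on a simulated unbalanced panel. All names and numbers are hypothetical. Note that the bias in this simulation comes from a time-invariant person effect correlated with X, not from the lagged-y feedback described above, so it only shows what the transformations do mechanically and says nothing about whether they fix that feedback problem.

```python
import numpy as np

rng = np.random.default_rng(1)

YEARS = np.array([2011, 2012, 2013, 2016, 2017, 2018])  # waves listed above
N = 500        # hypothetical number of people
TRUE_B = 0.5   # hypothetical true effect of X on y

# Unbalanced panel: randomly drop about 20% of person-years.
person = np.repeat(np.arange(N), YEARS.size)
year = np.tile(YEARS, N)
keep = rng.random(person.size) < 0.8
person, year = person[keep], year[keep]

# Hypothetical DGP: a fixed person effect alpha that is correlated with X.
alpha = rng.normal(size=N)
x = 0.8 * alpha[person] + rng.normal(size=person.size)
y = 1.0 + TRUE_B * x + alpha[person] + 0.1 * (year - 2011) + rng.normal(size=person.size)


def year_dummies(year):
    """Dummies for every wave except the first, which is the baseline."""
    waves = np.unique(year)
    return (year[:, None] == waves[None, 1:]).astype(float)


def first_coef(Z, y):
    """OLS coefficient on the first column of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(coef[0])


def demean_by_person(M, person):
    """Within transformation: subtract each person's own mean from every column."""
    M = M.reshape(len(M), -1)  # treat a 1-D vector as a single column
    counts = np.maximum(np.bincount(person), 1)[:, None]
    sums = np.zeros((counts.shape[0], M.shape[1]))
    np.add.at(sums, person, M)
    return M - (sums / counts)[person]


D = year_dummies(year)

# Pooled OLS: x, constant, year dummies.
b_pooled = first_coef(np.column_stack([x, np.ones_like(x), D]), y)

# FE: demean y, x and the year dummies within person, then run OLS.
Z_within = demean_by_person(np.column_stack([x, D]), person)
y_within = demean_by_person(y, person).ravel()
b_fe = first_coef(Z_within, y_within)

# FD: difference consecutive observed waves within each person.
order = np.lexsort((year, person))  # sort by person, then by year
Z_sorted = np.column_stack([x, D])[order]
y_sorted, p_sorted = y[order], person[order]
same_person = p_sorted[1:] == p_sorted[:-1]
dZ = (Z_sorted[1:] - Z_sorted[:-1])[same_person]
dy = (y_sorted[1:] - y_sorted[:-1])[same_person]
b_fd = first_coef(dZ, dy)  # differenced dummies absorb the year effects

print(f"pooled: {b_pooled:.3f}  FE: {b_fe:.3f}  FD: {b_fd:.3f}  (true B = {TRUE_B})")
```

Under these assumptions pooled OLS should overstate B while FE and FD should come out near the true value, because both transformations remove anything that is constant within a person.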

In theory, suppose I could run the following regression:

y_{it} = a + BX_{it} + Controls + Gy_{i,t-20} + u_{it} for t=2018.

I think this would remove the problem, but in practice I do not have any lagged values beyond 7 years, which is unlikely to be long enough.
Also, the assumption that y does not contemporaneously affect X, but that lagged y does, is based on intuitive/theoretical arguments.
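
The idea behind controlling for the 20-year lag can be sketched on a simulated cross-section. This is one possible way of formalising the hypothesis, not a claim about the real data: it assumes lagged y affects current X, that y is persistent because current y depends directly on lagged y, and that there is no contemporaneous feedback from y to X. All parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 5000       # hypothetical cross-section size (a single year, e.g. 2018)
TRUE_B = 0.5   # hypothetical effect of X on y
GAMMA = 0.7    # hypothetical effect of 20-year-lagged y on current X
RHO = 0.6      # hypothetical persistence of y over 20 years

# Hypothetical DGP: lagged y drives X and also carries over into current y.
y_lag20 = rng.normal(size=N)
x = GAMMA * y_lag20 + rng.normal(size=N)
y = 1.0 + TRUE_B * x + RHO * y_lag20 + rng.normal(size=N)


def ols_slopes(columns, y):
    """OLS of y on a constant plus the given regressors; returns the slopes only."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


b_without_lag = float(ols_slopes([x], y)[0])        # lagged y omitted
b_with_lag = float(ols_slopes([x, y_lag20], y)[0])  # lagged y included

print(f"B without lagged y: {b_without_lag:.3f}")
print(f"B with lagged y:    {b_with_lag:.3f}  (true B = {TRUE_B})")
```

In this setup the bias from leaving out lagged y behaves like ordinary omitted variable bias, and including the lag removes it. Whether the same holds in practice depends on whether the assumptions above are a fair description of the real process.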

Any thoughts on this question would be much appreciated.