Hello everyone,

I have an (unbalacend) panel data set of German companies with investments in multiple countries over ten years and want to analyze the effect of certain country-level variables on the investment amount of German firms. The following gives you an example of the structure of the data.

Please note that it is just example data. Investor_id refers to the ID of the German investor, invest_amount is the invested amount held by the German company in that country, GDP refers to the gross domestic product of the country in which this investment is seated, GDP_capita is the GDP per capita and EU_dummy is a dummy for the country being part of the EU (=1) or not (=0). My actual data set contains 8 more country-level variables.

Code:
.list, sepby(year) noobs abbrev(16)


  +---------------------------------------------------------------------------------+
  | year   country   investor_id   invest_amount        GDP   GDP_capita   EU_dummy |
  |---------------------------------------------------------------------------------|
  | 2017     China            45            1300   1.00e+07         4320          0 |
  | 2017    France            45             100     400000         5675          1 |
  | 2017    France            86             670     400000         5675          1 |
  |---------------------------------------------------------------------------------|
  | 2018     China            45            1500   1.10e+07         4520          0 |
  | 2018    France            45             105     390000         5575          1 |
  | 2018    France            86             660     390000         5575          1 |
  +---------------------------------------------------------------------------------+
I want to use a regression with fixed effects as is done in the related literature. However, I am struggling with the data due to its "three dimensions": the observations are not only on a year-level and on a company-level but also on a country-level, since a German company can have investments in more than one country at a time. Because of this I cannot simply use xtset and xtreg as in

Code:
xtset investor_id year
because those two variables are no unique identifiers, since a German investor (investor_id) can have multiple investments in one year (that is because I look at multiple countries). However, it works to use xtset only on the investors and use fixed effects for the years.

Code:
xtset investor_id
xtreg log_invest_amount GDP GDP_capita EU_dummy i.year, fe
But my question now is if this would really be the correct approach for this kind of data? In the literature to related research projects I have seen many different approaches of how to use fixed effects on this kind of data set. Unfortunately, most of the time there is no explanation at all and the papers simply state what they did without explaining why. I could also not find an explanation that I understood in statistic books (mostly looking in chapters on fixed effects and clustering).

Here are the alternatives to the above fixed effects for investor_id and year that I could also use and that I encountered in the literature:

1) Create a new country-investor-level variabel
Code:
egen ic_id = group(investor_id country)
xtset ic_id year
xtreg log_invest_amount GDP GDP_capita EU_dummy, fe
2) Create a new country-time-level variable
Code:
egen it_id = group(investor_id year)
xtset it_id
xtreg log_invest_amount GDP GDP_capita EU_dummy, fe
3) Create a new country-time-level variable
Code:
egen ct_id = group(country year)
xtset ct_id
xtreg log_invest_amount GDP GDP_capita EU_dummy, fe
Does anyone know and can also reason which of the above approaches would be most appropriate from a theoretical perspective for my data?

Furthermore, I believe that the standard errors have to be made cluster robust. Here I encounter the exact same problem as with the fixed effects and don't know what to define as the group that should be clustered after (is it the investors? or the countries? or the investors in a given country?). And in the literature it seems again very ambigious what to use and most often it comes without an explanation.

So my second question is what is your thought on which cluster variable to use? Country, investor_id, ic_id, it_id or ct_id from above?

I am looking forward to your helpful and interesting input

For the sake of completeness I want to mention that I posted a somewhat related question in 2017 on Statalist when I was for the first time working on a related project, but it was a different approach/model and question and my Stata and statistics knowledge was on a different level back then: https://www.statalist.org/forums/for...led-panel-data

Best regards,
Anton