Hi,

I have a dataset of daily trading data for individual investors. In total, there are about 1 million observations covering about 40,000 distinct investors with, on average, 25 trades each.
The data are three-dimensional in the sense that there is a time variable date (in days), an investorID variable, and a stockID variable.

Let's say I would like to investigate the effect of some exogenous day- and stock-specific signal (such as an analyst forecast or a news announcement for that particular stock) on the volume traded by each investor per day per stock.

Example of the data for 2 investorIDs:


Code:
clear
input float date int stockID double(investorID volume) float signal
17591 128 1   13 0
17591 449 1   80 0
17885  61 1   80 0
17885 686 1   60 1
17896 449 1  350 0
17896 752 1   80 0
18155 743 1  250 0
18851 760 1 1000 1
16502 775 2   50 0
16628 698 2   50 0
17021 625 2   13 0
17021 625 2   37 0
17554 775 2  100 0
17793 585 2   50 0
17793 752 2   50 0
17805 752 2   50 0
17815  61 2   50 0
17815 585 2  100 1
17815 585 2  100 1
17821  75 2   50 0
17821 591 2   50 0
17821 752 2  100 0
18522  61 2   50 0
18913  61 2   50 0
18913 760 2  200 0
end
format %td date
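
Note that date, investorID, and stockID do not uniquely identify observations (e.g., investor 2 trades stock 625 twice on date 17021). A quick way to check this in the full data:

Code:
duplicates report date investorID stockID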


I tried the following two approaches:

1.) I collapsed the data by summing the trading volume per day per stockID. This eliminates the investorID dimension (all volume traded in one stock on one day is aggregated), and I obtain panel data that I can group by stockID over time.
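Concretely, the collapse step looks like this (controls omitted; because signal is day- and stock-specific, it is constant within each date-stock cell, so (max) simply carries it along):

Code:
collapse (sum) volume (max) signal, by(date stockID)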
This yields about 500 stockID groups. If I run
Code:
xtset stockID date
xtreg volume signal CONTROLS, fe vce(cluster stockID)
the expected effect of signal on volume does not show up:

Code:
Fixed-effects (within) regression               Number of obs      =    753191
Group variable: stockID                         Number of groups   =       451

R-sq:  within  = 0.0615                         Obs per group: min =        10
       between = 0.3930                                        avg =    1479.0
       overall = 0.1741                                        max =      1878

                                                F(120,489)         =      7.86
corr(u_i, Xb)  = -0.0281                        Prob > F           =    0.0000

                                (Std. Err. adjusted for 451 clusters in stockID)
--------------------------------------------------------------------------------
               |               Robust
        volume |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
        signal |   -2.11476    2.80901    -0.75   0.452    -7.620337    3.390816


2.) Actually, I do not want to sum volume across investors. Therefore, I tried to cope with the three dimensions by collapsing the data by date, investorID, and stockID, so that the resulting dataset contains each investor's total volume per stock per day (some investors trade a specific stock multiple times per day, which is why I had to collapse at all).
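Concretely (again with controls omitted; signal is also constant within each date-investor-stock cell):

Code:
collapse (sum) volume (max) signal, by(date investorID stockID)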
Then I run
Code:
egen grouping = group(investorID stockID)
xtset grouping date
xtreg volume signal CONTROLS, fe vce(cluster grouping)
In this case I get the following output, but with a lower R-squared and, of course, an extremely high number of groups relative to the number of observations.

Code:
Fixed-effects (within) regression               Number of obs      =    854643
Group variable: grouping                        Number of groups   =    367014

R-sq:  within  = 0.0222                         Obs per group: min =         1
       between = 0.0368                                        avg =       2.3
       overall = 0.0344                                        max =       257

                                                F(117,368003)      =     33.31
corr(u_i, Xb)  = -0.2855                        Prob > F           =    0.0000

                            (Std. Err. adjusted for 367014 clusters in grouping)
--------------------------------------------------------------------------------
               |               Robust
        volume |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
        signal |   15.64934   3.922797     3.99   0.000     7.960774    23.33791

I am no expert on panel data regressions. Is it common, or at least acceptable, to have such a high number of groups in panel data? Is there a better approach for this kind of problem?

Any comments are very welcome. Thank you!