I have a dataset of individual daily investor trading data: about 1 million observations covering roughly 40,000 distinct investors with about 25 trades each on average.
The data is 3-dimensional in the sense that there are three identifying variables: a time variable date (in days), an investorID variable, and a stockID variable.
Let's say I would like to investigate the effect of some exogenous day- and stock-specific signal (such as an analyst forecast or a news announcement on that particular stock) on the volume traded by each investor per day per stock.
Example of the data for 2 investorIDs:
Code:
clear
input float date int stockID double(investorID volume) float signal
17591 128 1   13 0
17591 449 1   80 0
17885  61 1   80 0
17885 686 1   60 1
17896 449 1  350 0
17896 752 1   80 0
18155 743 1  250 0
18851 760 1 1000 1
16502 775 2   50 0
16628 698 2   50 0
17021 625 2   13 0
17021 625 2   37 0
17554 775 2  100 0
17793 585 2   50 0
17793 752 2   50 0
17805 752 2   50 0
17815  61 2   50 0
17815 585 2  100 1
17815 585 2  100 1
17821  75 2   50 0
17821 591 2   50 0
17821 752 2  100 0
18522  61 2   50 0
18913  61 2   50 0
18913 760 2  200 0
end
format %td date
I tried the following two approaches:
1.) I collapsed the data by summing the trading volume per day per stockID. This eliminates the investorID dimension (all volume traded in one stock on one day is aggregated), and I obtain panel data that I can group by stockID over time.
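For reference, the aggregation step looks roughly like this (a sketch using the variable names from the example above; since signal is day- and stock-specific, (max) simply carries its constant value through each date-stockID cell, and any controls would need to be carried along analogously):
Code:
* sum volume per stock per day; keep the (constant) signal per date-stockID cell
collapse (sum) volume (max) signal, by(date stockID)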
This yields about 500 stockID groups. If I run
Code:
xtset stockID date
xtreg volume signal CONTROLS, cluster(stockID) fe
Code:
Fixed-effects (within) regression               Number of obs      =    753191
Group variable: stockID                         Number of groups   =       451

R-sq:  within  = 0.0615                         Obs per group: min =        10
       between = 0.3930                                        avg =    1479.0
       overall = 0.1741                                        max =      1878

                                                F(120,489)         =      7.86
corr(u_i, Xb)  = -0.0281                        Prob > F           =    0.0000

                              (Std. Err. adjusted for 451 clusters in stockID)
--------------------------------------------------------------------------------
               |               Robust
        volume |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
        signal |   -2.11476    2.80901    -0.75   0.452    -7.620337    3.390816
2.) Actually, I do not want to sum volume across investors. Therefore, I tried to cope with the three dimensions by collapsing the data by date, investorID and stockID, so that the resulting dataset contains each investor's total volume per stock per day (some investors trade a given stock multiple times per day, which is why this step is needed; see the sketch below).
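The corresponding collapse is roughly this (again a sketch with the example's variable names):
Code:
* sum an investor's multiple same-day trades in the same stock into one observation
collapse (sum) volume (max) signal, by(date investorID stockID)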
Then I run
Code:
egen grouping = group(investorID stockID)
xtset grouping date
xtreg volume signal CONTROLS, cluster(grouping) fe
Code:
Fixed-effects (within) regression               Number of obs      =    854643
Group variable: grouping                        Number of groups   =    367014

R-sq:  within  = 0.0222                         Obs per group: min =         1
       between = 0.0368                                        avg =       2.3
       overall = 0.0344                                        max =       257

                                                F(117,368003)      =     33.31
corr(u_i, Xb)  = -0.2855                        Prob > F           =    0.0000

                           (Std. Err. adjusted for 367014 clusters in grouping)
--------------------------------------------------------------------------------
               |               Robust
        volume |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
        signal |   15.64934   3.922797     3.99   0.000     7.960774    23.33791
I am no expert on panel data regressions. Is such a high number of groups common/acceptable in panel data? Is there a better approach to my problem?
Any comments are very welcome. Thank you!