I have a methodological question concerning lasso regressions and the lasso linear command in Stata.
I have a dataset on daily investment flows of firms and a large set of dummy variables that represent daily signals on which the firms may invest.
There are more than one million observations, more than 2,000 dummy variables (D_*), and a few further controls (C_*).
I want to find out which of the dummy variables are most relevant for explaining the dependent variable (FLOW).
To do so, I run lasso linear of FLOW on D_*, with the C_* variables always included. Because estimation on the whole sample takes a long time, I first ran the command on a random subsample of 10,000 observations.
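For reference, a minimal sketch of how a subsample flag like random_sample could be built. The seed value and the helper variables u and obs_id are placeholders and not taken from my actual do-file:

```stata
* Hypothetical construction of the subsample indicator (names and seed are assumptions)
set seed 12345                               // make the draw reproducible
generate long obs_id = _n                    // remember the original sort order
generate double u = runiform()               // one uniform draw per observation
sort u                                       // put observations in random order
generate byte random_sample = (_n <= 10000)  // flag the first 10,000
sort obs_id                                  // restore the original order
drop u obs_id
```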
Code:
lasso linear FLOW (C_*) D_* if random_sample == 1
Code:
Lasso linear model                          No. of obs        =     10,000
                                            No. of covariates =      2,179
Selection: Cross-validation                 No. of CV folds   =         10
--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |      Description     lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
       1 |     first lambda    612.519        32       0.1412     1.13e+08
       6 |    lambda before   384.6798        34       0.1427     1.13e+08
     * 7 |  selected lambda   350.5059        35       0.1427     1.13e+08
       8 |     lambda after   319.3679        35       0.1426     1.13e+08
      12 |      last lambda   220.1279        57       0.1407     1.14e+08
--------------------------------------------------------------------------
Therefore, my question: is it possible to run lasso so that it sets to zero those coefficients that are close to zero or negative?
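For context, the coefficients in question can be listed with the lasso postestimation commands. This is a minimal sketch, assuming Stata 16 or later and that the lasso linear command above has just been run:

```stata
* List the selected variables with their penalized coefficients, sorted by size
lassocoef, display(coef, penalized) sort(coef, penalized)

* Show the knots: the lambda values at which variables enter or leave the model
lassoknots
```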
Thanks