Hi,

I have a methodological question concerning lasso regressions and the lasso linear command in Stata.

I have a dataset on daily investment flows of firms and a large collection of dummy variables that represent daily signals on which the firms potentially invest.
There are more than one million observations, more than 2,000 dummy variables (D_*), and a few further control variables (C_*).

I want to find out which of the dummy variables are most relevant for explaining the dependent variable (FLOW).

To do so, I estimate a lasso linear regression of FLOW on D_*, with the C_* variables always included. Because computation over the whole sample takes a long time, I first ran the command on a random subsample of 10,000 observations.


Code:
lasso linear FLOW (C_*) D_* if random_sample == 1
I obtained:

Code:
Lasso linear model                          No. of obs        =     10,000
                                            No. of covariates =      2,179
Selection: Cross-validation                 No. of CV folds   =         10

--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |     Description      lambda     coef.    R-squared        error
---------+----------------------------------------------------------------
       1 |    first lambda     612.519        32       0.1412     1.13e+08
       6 |   lambda before    384.6798        34       0.1427     1.13e+08
     * 7 | selected lambda    350.5059        35       0.1427     1.13e+08
       8 |    lambda after    319.3679        35       0.1426     1.13e+08
      12 |     last lambda    220.1279        57       0.1407     1.14e+08
--------------------------------------------------------------------------
From a conceptual point of view, only explanatory variables with a positive effect on the dependent variable are of interest (i.e., these can be thought of as positive stimuli to invest). Explanatory variables with a significantly negative effect or a negligible effect are of no interest. However, my lasso specification naturally also selects variables with large negative coefficients if such variables exist (and they do, as I found out after checking the variables selected at the chosen lambda).
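
For reference, the signs can be inspected roughly as follows. This is only a sketch: it assumes the lasso results from the command above are still in memory, and it uses the display() option of lassocoef to show postselection coefficients (i.e., coefficients from an unpenalized regression on the selected variables) rather than just selection markers.

Code:
* Sketch: run right after the lasso command above
* Lists the selected variables together with their postselection coefficients
lassocoef, display(coef, postselection)

Negative entries in that list are the variables I would like the procedure to drop.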

Therefore, my question: is it possible to run lasso so that it sets to zero all coefficients that are close to zero or negative, i.e., so that only variables with positive coefficients are selected?

Thanks