Summary
Consider factor variable group that takes on the values 1, 2, and 3. If you type
. lasso linear y i.group ...
lasso will know that separate covariates for group 1, 2, and 3 are to be included among the variables
to be potentially included in the model.
If you create your own indicator variables, you need to create and specify indicators for all the
values of the factor variable:
. generate g1 = (group==1)
. generate g2 = (group==2)
. generate g3 = (group==3)
. lasso linear y g1 g2 g3 ...
It is important that you do not omit one of them, say, g1, and instead type
. lasso linear y g2 g3 ...
While tinkering around, I discovered that one must not use ib#.group in place of i.group. Doing so causes the specified base level to be omitted from the candidate covariates, and will therefore give different results. I think a warning about this should be added to the documentation. For example, something like this could be added to the Summary section:
Note as well that you must not use the ib# prefix, because that will cause the selected base level to be omitted. For example, using ib1.group is equivalent to including g2 and g3 but not g1.
For anyone who is interested, the code for my "tinkering" is pasted below.
Cheers,
Bruce
Code:
// File: LASSO_collinear_covariates.do
// Date: 25-Oct-2022
// Name: Bruce Weaver, bweaver@lakeheadu.ca

// Suggestion: Caution users of LASSO that factor variables will not
// be handled as described in the documentation if one uses ib#.variable.
// Only the i.variable form of factor variable notation is handled properly.

// The relevant documentation can be seen here:
// https://www.stata.com/manuals/lassocollinearcovariates.pdf#lassoCollinearcovariates

// Use auto.dta to create an example like the one described.
clear *
sysuse auto

// Create 5 indicator variables for rep78
forvalues i = 1(1)5 {
    generate byte rep`i' = rep78 == `i' if !missing(rep78)
}
summarize rep1-rep5

// NOTE that you must reset the seed before estimating each model.

* [1] Use factor variable notation for rep78
set seed 1234
quietly lasso linear mpg i.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
* Show which variables have been retained
lassocoef, display(coef)

* [2] Use the 5 indicator variables for rep78
set seed 1234
quietly lasso linear mpg rep1 rep2 rep3 rep4 rep5 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
* Show which variables have been retained
lassocoef, display(coef)

// Q. What happens if one uses ib#.rep78 rather than i.rep78?
forvalues i = 1(1)5 {
    set seed 1234
    display "Base level for rep78 = "`i'
    quietly lasso linear mpg ib`i'.rep78 ///
        foreign headroom weight turn gear_ratio price trunk length displacement
    * Show which variables have been retained
    lassocoef, display(coef)
}

// A. Stata omits the base level when I do that.
// Let's check a couple of them to verify.

* Factor variable notation with ib3.rep78
set seed 1234
quietly lasso linear mpg ib3.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
* Show which variables have been retained
lassocoef, display(coef)

* Indicator variables with rep3 omitted
set seed 1234
quietly lasso linear mpg rep1 rep2 rep4 rep5 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
* Show which variables have been retained
lassocoef, display(coef)

* Factor variable notation with ib5.rep78
set seed 1234
quietly lasso linear mpg ib5.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
* Show which variables have been retained
lassocoef, display(coef)

* Indicator variables with rep5 omitted
set seed 1234
quietly lasso linear mpg rep1 rep2 rep3 rep4 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
* Show which variables have been retained
lassocoef, display(coef)

// Confirmed.
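A more compact way to eyeball the difference, offered only as a sketch: store each fit with estimates store and hand both names to lassocoef, which then shows the two models as adjacent columns of one table. The names fv_i and fv_ib3 are arbitrary labels chosen for this illustration; the model, seed, and covariates are the same as in the do-file above.

```stata
// Sketch: compare i.rep78 with ib3.rep78 in a single lassocoef table.
// fv_i and fv_ib3 are illustrative names for the stored estimates.
sysuse auto, clear

* Fit with plain factor-variable notation
set seed 1234
quietly lasso linear mpg i.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
estimates store fv_i

* Fit with an explicit base level (3)
set seed 1234
quietly lasso linear mpg ib3.rep78 ///
    foreign headroom weight turn gear_ratio price trunk length displacement
estimates store fv_ib3

* One column per stored model; compare the rep78 rows across columns
lassocoef fv_i fv_ib3, display(coef)
```

Looking across the rep78 rows of the two columns should make it easier to spot a level that is a candidate under i.rep78 but never appears under ib3.rep78.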