Hi all, this is a statistical question rather than a specific coding one. I am trying to build a linear model that predicts the cost of a particular hip surgery. To set the scene, my n is about 19,000 and I am starting with about 460 candidate variables, roughly 440 of which are dummies. There are so many because many different medications or procedures can be given during a surgery, and across 19,000 patients each medication or procedure ends up with its own dummy variable.

Having said that, I will first use Lasso with 5-fold cross-validation as a guide to weed out variables that contribute little to the cost of the procedure. Because Lasso does not select variables based on p-values, its output does not include them. My concern is that submitting a model without p-values for publication will not go over well, given reviewers' heavy reliance on them.

My plan is then to take the variables that Lasso selected and use them as the independent variables in an OLS model. That way I can present p-values and use them to evaluate each independent variable for significance when determining the final model.
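To make the two-step procedure concrete, here is a minimal sketch on simulated data, assuming scikit-learn's `LassoCV` for the selection step and NumPy/SciPy for the OLS refit. The data, sizes, effect values, and the helper `ols_with_pvalues` are made up for illustration and are not my actual dataset or code.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated stand-in for the real data (hypothetical, far smaller than 19,000 x 460).
n, p = 2000, 60
X = rng.binomial(1, 0.3, size=(n, p)).astype(float)  # dummy variables
true_beta = np.zeros(p)
true_beta[:5] = [500.0, -300.0, 800.0, 250.0, -600.0]  # only 5 dummies affect cost
y = 10000.0 + X @ true_beta + rng.normal(0.0, 400.0, size=n)

# Step 1: Lasso with 5-fold cross-validation to pick the penalty,
# then keep the variables with nonzero coefficients.
X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
selected = np.flatnonzero(lasso.coef_ != 0)


def ols_with_pvalues(X_sel, y):
    """Ordinary least squares with an intercept; returns coefficients,
    standard errors, and two-sided t-test p-values."""
    n_obs = X_sel.shape[0]
    design = np.column_stack([np.ones(n_obs), X_sel])
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = n_obs - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    p_val = 2 * stats.t.sf(np.abs(coef / se), dof)
    return coef, se, p_val


# Step 2: refit OLS on the original (unscaled) selected columns.
# These are the usual OLS p-values; they treat the selected set as given.
coef, se, p_val = ols_with_pvalues(X[:, selected], y)

print(f"Lasso kept {selected.size} of {p} variables")
for idx, b, pv in zip(selected, coef[1:], p_val[1:]):
    print(f"x{idx}: coef = {b:9.2f}, p = {pv:.4f}")
```

The p-values printed at the end are the ones I would be reporting and using to decide on the final model.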

1. Is this sequence of model specification steps reasonable and statistically sound? Or will my OLS results be biased in some way?
2. Alternatively, I've seen Elastic Net used and read a paper showing that its results can be better than Lasso's. I was therefore considering switching from Lasso to Elastic Net, but I am not sure whether or how that would affect my interpretation of the results after I run the OLS in the second step.
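
For the second question, the switch I have in mind would only change the selection step. A minimal sketch on simulated data, assuming scikit-learn's `ElasticNetCV` (the data, sizes, and the `l1_ratio` grid are hypothetical choices for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated stand-in for the real data (hypothetical sizes and effects).
n, p = 2000, 60
X = rng.binomial(1, 0.3, size=(n, p)).astype(float)  # dummy variables
true_beta = np.zeros(p)
true_beta[:5] = [500.0, -300.0, 800.0, 250.0, -600.0]
y = 10000.0 + X @ true_beta + rng.normal(0.0, 400.0, size=n)

# Elastic Net mixes the L1 (Lasso) and L2 (ridge) penalties.
# l1_ratio=1.0 is plain Lasso; 5-fold CV picks both the mix and the penalty strength.
l1_grid = [0.2, 0.5, 0.8, 1.0]
X_std = StandardScaler().fit_transform(X)
enet = ElasticNetCV(l1_ratio=l1_grid, cv=5, random_state=0).fit(X_std, y)

# Variables with nonzero coefficients would go into the second-step OLS.
enet_selected = np.flatnonzero(enet.coef_ != 0)
print(f"chosen l1_ratio = {enet.l1_ratio_}, alpha = {enet.alpha_:.3f}")
print(f"Elastic Net kept {enet_selected.size} of {p} variables")
```

The second-step OLS would then be run on `enet_selected` in place of the Lasso-selected columns.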

Thanks in advance for the input!