Hi, I’m trying to develop a linear predictive model using pooled cross-sectional data from eight years of household surveys in Cambodia, but I am stuck. I would greatly appreciate any help.

Because I am pooling eight years of survey data, I first redefined the sampling design so that all years are treated as a single design. In the original sampling frame, the 24 provinces were grouped into 19 provincial groups, and each primary sampling unit (PSU), primarily a village, was classified as urban or rural. The sample was allocated proportionally among the resulting 38 provincial groups (provG). In the first stage, PSUs were defined independently in each provincial group. In the second stage, one enumeration area (EA) was selected from each PSU by simple random sampling, and finally 10 households were selected by systematic sampling.

I defined the variable "clusters" from the "year" and "PSU" variables, and "strata" from the "year" and "provG" (provincial group) variables, as shown below:

egen clusters=group(year PSU), label
egen strata=group(year provG), label
svyset clusters [pweight=hhweight], strata(strata) vce(linearized) singleunit(centered) || EA || hhid

As you probably know, an estimation command must be prefixed with "svy:" for it to account for the survey design declared with svyset. However, I am trying to develop a predictive model using lasso regression, and "lasso" is not supported by the "svy:" prefix. I have three questions:
  1. Would it be a problem if I fit the model without the "svy:" prefix, as shown below?
lasso linear Y X1 X2 X3 ……
  2. Would using importance weights, as shown below, solve the problem?
lasso linear Y X1 X2 X3 …… [iweight=hhweight]
  3. Is there any other solution?


Thank you very much in advance.

Best regards,
Haruyo