Hi:

This is not a Stata-related question, so please forgive me if it is not allowed here.

I am working with the NHIS database, which has a complex survey design with clustering, stratification, and oversampling of certain subpopulations. The PSUs are mainly counties or groups of contiguous counties, which are later stratified based on MSA status. For confidentiality reasons, however, the NHIS provides only pseudo-stratum and pseudo-PSU codes. For the survey period that I am interested in, there were 304 strata and 482 PSUs. However, there are 300 pseudo-strata, each containing 2 pseudo-PSUs, so 600 pseudo-PSUs in total. My confusion stems from the fact that the manual says the pseudo-PSUs were constructed by collapsing the original PSUs into bigger clusters, so that it would be more difficult to identify any given cluster. If that is the case, how can there be more pseudo-PSUs than original PSUs?

I am trying to include some measure of area-specific fixed effects in my panel regression, and I was thinking of using the pseudo-PSUs as a proxy for geographic area. The paper linked below says that "a given geographic area within a given NHIS sample PSU should have the same set of Pseudo-Stratum and Pseudo-PSU codes assignments if it is present in more than one NHIS annual microdata file." Doesn't that imply that the original PSUs are broken down into pseudo-PSUs, which would explain why there are more pseudo-PSUs than original ones? If so, why does the manual say that the PSUs are merged or collapsed?
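
To make the idea concrete, here is a minimal sketch (in Python, only for illustration) of the area identifier I have in mind. The record layout and the sample values are made up, and I am assuming that the pseudo-PSU codes are numbered within each pseudo-stratum, so the same PSU code can appear in different strata and the identifier has to combine both codes. Whether such an identifier really corresponds to a stable geographic area is exactly what I am unsure about.

```python
# Hypothetical records: (pseudo_stratum, pseudo_psu, survey_year).
# These values are invented for illustration; they are not NHIS data.
records = [
    (1, 1, 2004),
    (1, 2, 2004),
    (2, 1, 2004),
    (1, 1, 2005),
    (2, 2, 2005),
]


def area_id(stratum, psu):
    """Combine both codes into one label, since PSU codes are assumed
    to repeat across strata and are not unique on their own."""
    return f"{stratum:03d}-{psu}"


# One identifier per record; the same stratum/PSU pair maps to the same
# identifier in every survey year, which is what a fixed effect needs.
area_ids = [area_id(stratum, psu) for stratum, psu, _year in records]
distinct_areas = sorted(set(area_ids))
print(distinct_areas)
```

With 300 pseudo-strata and 2 pseudo-PSUs in each, this would give at most 600 distinct identifiers to use as fixed-effect groups.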

Here is a link to their manual: http://www.asasrms.org/Proceedings/y2007/Files/JSM2007-000353.pdf

I would be really grateful if any kind soul could help me out!