I want to pool two datasets to obtain a nationally representative sample. Each study has study-specific weights that account for sampling and response. How should I recalibrate the weights of the pooled dataset?
Information about the sampling and weighting methods of each study is as follows:
Longitudinal study A
The baseline survey and the follow-up surveys with replacement for deceased elders were conducted in a randomly selected half of the counties and cities in 22 of China’s 31 provinces in 1998, 2000, 2002, and 2005. In the 1998 baseline survey, we tried to interview all centenarians who voluntarily agreed to participate in the study in the sampled counties and cities; for each centenarian interviewee, one nearby octogenarian and one nearby nonagenarian of predefined age and sex were interviewed. In the 2002 and 2005 waves, three nearby elders aged 65–79 of predefined age and sex were interviewed in conjunction with every two centenarians. “Nearby” is loosely defined: it could be in the same village or on the same street, if available, or in the same town or in the same sampled county or city. The predefined age and sex are randomly determined, based on the randomly assigned code numbers of the centenarians, to have more or less randomly selected comparable numbers of males and females at each age from 65 to 99.
Those interviewees who were still surviving in the follow-up waves were re-interviewed. In our 1998 baseline survey and 2000, 2002, and 2005 follow-up surveys, we tried to interview all centenarians who voluntarily agreed to participate in the study, in order to keep a large sub-sample of centenarians in each of the waves. New interviewees of the same sex and age (or within the same 5-year age group) replaced those elderly who were interviewed but subsequently died before the next wave.
In sum, the Longitudinal study A interviewed 8,959 and 11,161 oldest-old aged 80–112 in 1998 and 2000, and 16,057 and 15,638 elderly aged 65–112 in 2002 and 2005, respectively. In the four waves, in total, 10,964, 14,384, 16,526, and 9,941 face-to-face interviews were conducted with centenarians, nonagenarians, octogenarians, and younger elders aged 65–79, respectively (see Table 2.2 for more detailed information). At each wave, the longitudinal survivors were re-interviewed, and additional participants replaced the deceased interviewees.
In the first four nationwide follow-up surveys conducted in 2000, 2002, 2005 and 2008, Longitudinal study A also included new recruits to replace all deceased and lost-to-follow-up elders of the same sex and age, but no such replacement recruits were included in the LONGITUDINAL A 2011 and 2014 nationwide follow-up surveys (due to budget constraints), except in the eight selected longevity areas where the density of centenarians was exceptionally high.
Consequently, the Longitudinal study A participants of the oldest-old interviewed in 1998 and 2008 were nationally representative samples, and adequate for comparative analysis of cohorts born ten years apart.
However, the Longitudinal study A participants aged 65-79 interviewed in 2011 or 2014 were only survivors from those interviewed in the 2008 wave, so they were not a nationally representative sample and are not compatible with the young-old participants recruited in 2002 for the first time.
Longitudinal study B
The Longitudinal study B national baseline survey was conducted in 28 provinces, 150 counties/districts, and 450 villages/urban communities across the country. The sample is representative of people aged 45 and over living in households; institutionalized elderly are not sampled, but Wave 1 respondents who later enter an institution will be followed. All samples were drawn in four stages: county level, neighborhood level, household level, and respondent level.
County-level sampling
At the first stage, all county-level units with the exception of Tibet were sorted (stratified) by region, within region by urban district or rural county, and by GDP per capita. Region was a categorical variable based on the NBS division of province area. After this sorting (stratification), the population of each county was listed, along with the cumulative population (the population of each county plus all the counties higher on the list). If N is the total population of all the county-level units and 150 is the number of counties to be sampled, then define an interval n=N/150. The first county was selected by choosing a random number r from 0 to 1, and selecting the first county with cumulative population greater than r*n. Then the interval n was added to this starting point, and the second county was the first county on the list with cumulative population greater than r*n+n. The third county was chosen by once again adding the interval n, and picking the first county on the list with cumulative population greater than r*n+n+n, and so on until 150 counties were selected.
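The county selection described above is systematic PPS sampling from a sorted list. A minimal Python sketch of that procedure follows; the function name `systematic_pps` and the toy populations are hypothetical and are not taken from the study documentation.

```python
import bisect
import random
from itertools import accumulate


def systematic_pps(populations, k, r=None):
    """Return the list positions of k units chosen by systematic PPS sampling.

    populations: unit sizes, already in the sorted (stratified) list order.
    k: number of units to select (150 counties in the description above).
    r: random start in [0, 1); drawn at random if omitted.
    """
    cumulative = list(accumulate(populations))
    interval = cumulative[-1] / k  # n = N / k
    if r is None:
        r = random.random()
    picks = []
    for j in range(k):
        threshold = r * interval + j * interval  # r*n, r*n+n, r*n+n+n, ...
        # position of the first unit whose cumulative population exceeds the threshold
        picks.append(bisect.bisect_right(cumulative, threshold))
    return picks


# Toy example: four units, two to be selected, fixed start r = 0.5
print(systematic_pps([10, 20, 30, 40], k=2, r=0.5))
```

Note that in this sketch a unit whose population exceeds the interval n can be picked more than once; the description above does not say how such cases were handled.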
Neighborhood-level sampling
Our sample used administrative villages (cun) in rural areas and neighborhoods (shequ) in urban areas, which comprise one or more former resident committees (juweihui), as primary sampling units (PSUs). We selected 3 PSUs within each county-level unit, using PPS (probabilities proportional to size) sampling. Note that rural counties contain both rural villages and urban neighborhoods and it is also possible for urban districts to contain rural administrative villages. For each county-level unit, the list of all PSUs was randomly sorted. Then, the population of each PSU was listed, along with the cumulative population (the population of each PSU plus all the PSUs higher on the list). If N is the total population of the county-level unit and 3 is the number of PSUs to be sampled, then define an interval n=N/3. The first PSU is selected by choosing a random number r from 0 to 1, and selecting the first PSU with cumulative population greater than r*n. Then the interval n is added to this starting point, and the second PSU is the first PSU on the list with cumulative population greater than r*n+n. The third PSU is chosen by once again adding the interval n, and picking the first PSU on the list with cumulative population greater than r*n+n+n. This procedure was implemented using the Stata command samplepps. In neighborhoods with very large populations (over 2000 households), given the high costs of preparing map-based sampling frames, supervisors were permitted to select a geographic subset of the neighborhood as the PSU, for example one or more former neighborhood committees (juweihui) in the community (shequ). Enough sub-neighborhoods were to be sampled to ensure that there were a sufficient number of eligible sample respondents. Sub-neighborhoods would then be selected based on the estimated population of each sub-neighborhood. There were 30 communities that had to be split this way.
Due to mistakes in the original sampling frame, of the 450 communities originally chosen, we had to replace 6 for the following reasons: two villages disappeared due to resettlement, one urban community was expanded to become a county-level urban district, and two communities consisted almost entirely of collective dwelling residents, one being university dormitories and the other a prison, which are not supposed to be part of our sample. The choice of replacement communities followed the exact procedure outlined above. In 6 counties, the administrative boundaries changed so that the chosen communities fell within two counties. We did not replace these communities. As a result, the final number of counties became 156.
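Since the weights depend on these selection probabilities, here is a short sketch of the inclusion probability implied by PPS selection of k units and the resulting design weight for the first two stages. It uses the standard PPS result k * size / N, capped at 1. The function names and the toy numbers are hypothetical, and the studies' published weights may include further adjustments not shown here.

```python
def pps_inclusion_probability(unit_population, total_population, k):
    """Probability that a unit is selected when k units are drawn by PPS."""
    return min(1.0, k * unit_population / total_population)


def two_stage_design_weight(p_county, p_psu_within_county):
    """Inverse of the overall probability of selecting a PSU (county stage x PSU stage)."""
    return 1.0 / (p_county * p_psu_within_county)


# Toy example: a PSU of 500 people in a county-level unit of 3000, with 3 PSUs drawn
p_psu = pps_inclusion_probability(500, 3000, k=3)
print(p_psu)
print(two_stage_design_weight(p_county=0.25, p_psu_within_county=p_psu))
```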
Household-level sampling
In each PSU, we selected a sample of dwellings from our frame, which was constructed based on maps prepared by mappers/listers with the support of local informants. In order to get an accurate sampling frame of households in each village or community, a mapping/listing software named CHARLS-GIS was developed. For each PSU, a mapper was first sent to the community with a GPS unit to collect the boundary, then the CHARLS office used the boundary information to capture Google Earth map images, which were used as the basis for the mapping and listing. Then, all buildings in each PSU were enumerated with photos and GPS readings, and dwellings within each building were listed. Collective living dwellings, such as military bases, schools, dormitories or nursing homes, were excluded.
Then each PSU sampling frame was checked by the CHARLS headquarters to ensure that all buildings within the community boundary were enumerated. After verification, the supervisors used CHARLS-GIS software to randomly sample 80 households, which were marked on the map and sent back to mappers/listers in the field to collect information for these households, including age of the oldest person, name of household head, telephone number, and whether the dwelling unit was empty or not. The number of households sampled was greater than the targeted sample size of 24 households per PSU in anticipation of sampled households not having any members aged 45 or older, the possibility of an empty house, and household non-response. Based on this information, the supervisor randomly sampled a specific number of households for each community/village using the CHARLS-GIS software. The initial sampling was a random sample from the 80 households. From these households we computed the fraction of households that were age-eligible and the number of empty dwellings. From this we derived neighborhood/village-specific sampling proportions and then chose our sample from the entire sampling frame.
After final sampling work in the PSU was completed, the information on the sampled households was sent back to the mappers/listers, who loaded this information in the CHARLS-GIS software on their computer. The mappers/listers then sent ‘A letter to the respondent’. Simultaneously, the IT staff in the CHARLS project office transferred the sampled household lists and addresses for a given PSU to the interviewer’s CAPI system. We interviewed all age-eligible sample households in each PSU who were found and willing to participate in the survey. Some dwellings had multiple households living in them. In these cases we randomly chose one household that had an age-eligible member. Thus, variation in the share of sampled households that could be found, had an age-eligible member, or were willing to participate in the survey led to different numbers of completed household surveys in each PSU. This is corrected for in the sampling weights.
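The documentation does not give the exact formula for the neighborhood/village-specific sampling proportions. As a rough illustration only, the sketch below assumes the number of dwellings to draw is the target of 24 divided by the share of the 80 screened households that were occupied and age-eligible. The function name, its parameters, and the optional response-rate adjustment are all assumptions.

```python
import math


def dwellings_to_sample(n_screened, n_usable, target=24, response_rate=1.0):
    """Estimate how many dwellings to draw so that about `target` interviews are completed.

    n_screened: households in the initial random sample (80 in the description above).
    n_usable: screened households that were occupied and had an age-eligible member.
    response_rate: assumed share of usable households that respond (hypothetical parameter).
    """
    if n_usable <= 0:
        raise ValueError("no usable households among those screened")
    return math.ceil(target * n_screened / (n_usable * response_rate))


# Toy example: 48 of the 80 screened households are occupied and age-eligible
print(dwellings_to_sample(n_screened=80, n_usable=48))
```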
Respondent-level sampling
In each sampled household, a short screening form was used to identify whether the household had a member meeting our age eligibility requirements. If a household had persons older than 40 and meeting our residence criterion, we randomly selected one of them. If the chosen person was 45 or older, he/she became a main respondent and his or her spouse was also interviewed. If the chosen person was between ages 40 and 44, he/she was reserved as a refresher sample for future rounds of the survey. If an age-eligible person was too frail to answer questions, we identified a proxy respondent to help him/her answer questions, usually a spouse or knowledgeable adult child, if there was one in the house. Households without members 45 years or older were not interviewed.
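The respondent-level step can be sketched as follows. This is an illustration under stated assumptions: "older than 40" is read as age 40 or above (to match the 40 to 44 refresher group), the residence criterion is ignored, and the function name and return format are hypothetical.

```python
import random


def select_respondent(ages, rng=random):
    """Pick one household member aged 40+ at random and classify the pick.

    Returns None if nobody qualifies; otherwise a dict with the chosen age,
    the role, and the within-household selection probability.
    """
    candidates = [age for age in ages if age >= 40]  # assumption: age 40 itself qualifies
    if not candidates:
        return None
    chosen = rng.choice(candidates)
    role = "main respondent" if chosen >= 45 else "refresher (reserved)"
    return {"age": chosen, "role": role, "selection_prob": 1 / len(candidates)}


# Toy household with members aged 12, 43 and 67
print(select_respondent([12, 43, 67], rng=random.Random(1)))
```

The `selection_prob` value is the within-household piece that would enter an individual-level weight; because spouses of main respondents are also interviewed, a person's actual inclusion probability can be higher than this figure.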
Questions concerning household roster in section A, household organization and financial transfers in section C were answered by the “Family Respondent”, who could be either the main respondent or the spouse of the main respondent; whenever possible the person chosen was the individual most able to answer the questions in these sections accurately. Similarly, a “Financial Respondent” was chosen to answer questions on family income, expenditure, and assets. In this case, any household member aged 18 or above could be selected as the financial respondent (including the main respondent and spouse), with the main criteria again being which person is most knowledgeable about these matters.
My questions are:
If we combine the LONGITUDINAL A participants interviewed in 2011 and the LONGITUDINAL B national baseline survey, are the pooled individuals a nationally representative sample?
Are they compatible with the young-old participants recruited in LONGITUDINAL A 2002 for the first time?
For example, suppose we aim to estimate the incidence of cognitive impairment, and the years lived with it, at age 65 years in 1991 and 2011.
Each study has study-specific weights that account for sampling and response. Both studies provide the weight values, but the values vary widely. How should we recalibrate the weights of the pooled dataset?