I am analysing survey data and am relatively new to this commonly used study design. I am seeking help with survey weighting, selection bias and paradata.
Survey data setup
The survey asked doctors (registered under a particular program in the country) about the impact of the pandemic on health services. The sampling frame consists of 23,900 doctors covered by 3 agencies: 13,400 under Agency A, 6,000 under Agency B and 4,500 under Agency C. From these, 700, 1,000 and 1,100 doctors were randomly sampled from Agencies A, B and C respectively (total sample = 2,800). Of the 2,800 sampled doctors, responses were received from 400 in Agency A, 800 in Agency B and 700 in Agency C.
Given the above, I have assumed that the survey used stratified random sampling with agency as the stratum. The dataset I have (Data respondents) covers these 1,900 responding doctors, with about 200 variables.
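As a quick check of the design arithmetic, the stratum counts above can be typed into Stata to get sampling fractions, response rates and candidate weights. This is only a sketch: the variable names (N_h, n_h, r_h and the derived ones) are labels I made up for illustration and are not in the survey data.

```stata
* Sketch only: stratum counts typed in from the description above.
* -clear- drops whatever is in memory, so run this in a fresh session.
clear
input str1 agency long(N_h n_h r_h)
"A" 13400  700 400
"B"  6000 1000 800
"C"  4500 1100 700
end

gen double samp_frac = n_h / N_h   // sampling fraction within stratum
gen double resp_rate = r_h / n_h   // unit response rate within stratum
gen double w_design  = N_h / n_h   // design (base) weight for a sampled doctor
gen double w_resp    = N_h / r_h   // design weight inflated by 1/response rate

list, noobs
```

The response rates come out at roughly 57% (A), 80% (B) and 64% (C), so non-response is clearly uneven across agencies. The last column, N_h / r_h, is the weight I use in my own code further down.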
Data available on non-respondents and paradata
The central concern is non-response and how to account for the resulting bias. The challenge is that the data I have on the 900 non-responders are minimal. In the dataset with the full 2,800 doctors (Data full), the only variables common to responders and non-responders are (1) agency (A, B or C), (2) qualification (3 categories: bachelors, specialization, super specialization) and (3) province (5 categories). Additionally, I have paradata on the number of attempts made to contact each doctor (1, 2 or 3), which I refer to as Var 4. The reason for non-response among the 900 doctors is also recorded (10 categories).
The analysis will involve estimating frequencies and proportions, and a few regression models giving crude odds ratio estimates. What is a practical way to analyse this survey data while accounting for selection bias?
Below is a sample dataset with 30 observations and a few variables, produced by -dataex-. It contains respondents only.
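To make the non-response question concrete, this is the kind of check I imagine running on Data full to see whether response depends on the three shared variables. It is a minimal sketch: `responded` is a hypothetical 0/1 indicator that would have to be created first, and I am assuming agency, qualification and province are stored as strings, as they are in the respondent file.

```stata
* Sketch only. Assumes Data full (2,800 rows) is in memory and that a
* hypothetical 0/1 variable -responded- flags the 1,900 responders.

* Response rates by each shared variable, with a chi-squared test
tabulate agency        responded, row chi2
tabulate qualification responded, row chi2
tabulate province      responded, row chi2

* Factor-variable notation needs numeric variables, so encode the strings
encode agency,        gen(agency_n)
encode qualification, gen(qual_n)
encode province,      gen(prov_n)

* Which characteristics predict responding, adjusted for the others?
logistic responded i.agency_n i.qual_n i.prov_n
```

If none of the three variables predicts response, that would suggest (but not prove) that weighting on them changes little. Variables I do not observe for non-responders could still drive selection.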
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int dateofsurvey str1 agency str2 province str19 qualification byte numberofattemptstocontact str23 age byte opdload_ct str14 opdload_hilo str64 servicesb4c19 byte(services_tests_b4c19 services_meds_b4c20 services_ehealth_b4c21)
 1 22494 "B" "P2" "Superspecialization" 1 "30 - 45 years old"        10 "Same as before" "Testing, Providing medication" 1 1 0
 2 22494 "B" "P2" "Superspecialization" 1 "30 - 45 years old"        10 "Higher" "Testing, Providing medication" 1 1 0
 3 22494 "C" "P1" "Superspecialization" 1 "Older than 61 years old"  20 "Lower" "Testing, Providing medication" 1 1 0
 4 22494 "C" "P1" "Superspecialization" 1 "30 - 45 years old"        20 "Lower" "Testing, Providing medication" 1 1 0
 5 22494 "C" "P1" "Superspecialization" 1 "30 - 45 years old"        40 "Same as before" "Testing, Providing medication" 1 1 0
 6 22494 "B" "P2" "Superspecialization" 2 "46 - 60 years old"        30 "Same as before" "Testing, Providing medication, Home consultation" 1 1 0
 7 22494 "B" "P3" "Superspecialization" 1 "46 - 60 years old"        15 "Same as before" "Testing " 1 0 0
 8 22494 "B" "P3" "Specialization"      1 "30 - 45 years old"        20 "Higher" "Testing " 1 0 0
 9 22494 "B" "P3" "Superspecialization" 2 "46 - 60 years old"        25 "Lower" "Other, please specify" 0 0 0
10 22494 "B" "P3" "Superspecialization" 1 "30 - 45 years old"        60 "Lower" "Testing, Other, please specify" 1 0 0
11 22494 "B" "P3" "Superspecialization" 2 "30 - 45 years old"        25 "Lower" "Testing, Providing medication" 1 1 0
12 22525 "C" "P1" "Superspecialization" 1 "30 - 45 years old"        25 "Same as before" "Providing medication, testing" 1 1 0
13 22525 "C" "P1" "Superspecialization" 1 "46 - 60 years old"        30 "Lower" "Testing, Providing medication" 1 1 0
14 22525 "C" "P1" "Superspecialization" 2 "30 - 45 years old"         3 "Lower" "Providing medication, testing, Other, please specify" 1 1 0
15 22525 "C" "P1" "Superspecialization" 1 "30 - 45 years old"        10 "Lower" "Providing medication, testing" 1 1 0
16 22525 "C" "P1" "Superspecialization" 1 "46 - 60 years old"        40 "Lower" "Providing medication, testing" 1 1 0
17 22525 "C" "P1" "Superspecialization" 2 "46 - 60 years old"        10 "Lower" "Providing medication " 0 1 0
18 22525 "C" "P1" "Superspecialization" 1 "30 - 45 years old"        10 "Lower" "Testing, Providing medication" 1 1 0
19 22525 "C" "P1" "Superspecialization" 1 "46 - 60 years old"        50 "Lower" "Testing, Providing medication" 1 1 0
20 22555 "B" "P2" "Superspecialization" 3 "30 - 45 years old"        30 "Same as before" "Testing, Providing medication, Home consultation" 1 1 0
21 22555 "B" "P2" "Superspecialization" 1 "30 - 45 years old"        15 "Lower" "Testing, Providing medication" 1 1 0
22 22555 "B" "P2" "Superspecialization" 3 "30 - 45 years old"        20 "Lower" "Testing, Other, please specify Providing medication, E-health " 1 1 1
23 22555 "B" "P2" "Superspecialization" 1 "Less than 30 years old"   20 "Lower" "Testing, Providing medication" 1 1 0
24 22555 "A" "P5" "Superspecialization" 1 "Less than 30 years old"   20 "Same as before" "Providing medication, testing, Other, please specify" 1 1 0
25 22555 "A" "P4" "Bachelors"           1 "30 - 45 years old"        20 "Higher" "Testing " 1 0 0
26 22555 "A" "P4" "Superspecialization" 3 "Less than 30 years old"   20 "Lower" "Testing " 1 0 0
27 22555 "A" "P4" "Specialization"      1 "Less than 30 years old"  100 "Lower" "E-health " 0 0 1
28 22555 "A" "P4" "Superspecialization" 3 "30 - 45 years old"         5 "Lower" "Providing medication " 0 1 0
29 22494 "B" "P2" "Superspecialization" 1 "30 - 45 years old"        60 "Lower" "Testing, Other, please specify" 1 0 0
30 22555 "A" "P5" "Superspecialization" 1 "30 - 45 years old"        30 "Same as before" "Testing, Providing medication, Home consultation" 1 1 0
end
format %tdnn/dd/CCYY dateofsurvey
The following is the code I have started with (I am using Stata/MP 13 on Windows 10):
Code:
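* Weight = stratum population / number of responders in the stratum
gen wt_strat=13400/400
replace wt_strat=6000/800 if agency=="B"
replace wt_strat=4500/700 if agency=="C"
* Named fpc_strat to match the -svyset- call below (Stata names are case-sensitive)
gen fpc_strat=1/wt_strat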
Code:
svyset id [pweight=wt_strat], strata(agency) fpc(fpc_strat)
Code:
      pweight: wt_strat
          VCE: linearized
  Single unit: missing
     Strata 1: agency
         SU 1: id
        FPC 1: fpc_strat
Please correct me if I have gone wrong in the above steps, assuming stratified random sampling. Or should strategies to account for non-response be incorporated into the commands above?
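For context, this is roughly how I expect to run the planned estimates once the design is declared. It is a sketch under the assumption that the -svyset- above is appropriate. The variable names come from the -dataex- example, except `opdload_cat` and `qual_cat`, which are new numeric copies I create here, and the choice of outcome is only illustrative.

```stata
* Sketch only. Assumes the respondent file is in memory and -svyset-
* has been run as shown above.

* Summarise the declared design (strata and number of units in each)
svydescribe

* Estimation commands need numeric categorical variables
encode opdload_hilo,  gen(opdload_cat)
encode qualification, gen(qual_cat)

* Weighted proportions, overall and by qualification
svy: proportion opdload_cat
svy: proportion opdload_cat, over(qual_cat)

* Crude odds ratios: offering e-health before the pandemic, by qualification
svy: logistic services_ehealth_b4c21 i.qual_cat
```

`svy: logistic` reports odds ratios by default, and with a single covariate these are the crude estimates I am after.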
Accounting for non-response
To account for non-response, I have read about post-stratification (in previous threads on Statalist, and in the literature), but I have data on only 3 variables common to non-responders and responders. I have also read that paradata can be used to investigate and adjust for non-response (Kreuter F, Olson K. Paradata for Nonresponse Error Investigation. 2013). I have 1 paradata variable, "number of attempts to contact", recording how many times (maximum 3) a particular doctor was contacted to obtain a successful interview. But I do not know how to use this variable in Stata, or whether this variable alone is enough to account for bias.
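One option I have seen described is a weighting-class adjustment, where each respondent's design weight is inflated by the inverse of the response rate in their cell. Below is a minimal sketch of what I think that would look like, not a recommendation. It assumes Data full is in memory with the respondents' answers merged in, `responded` is a hypothetical 0/1 indicator, and I have chosen agency by qualification as the cells purely for illustration.

```stata
* Sketch only. Assumes Data full (2,800 rows) with a hypothetical 0/1
* variable -responded-; the choice of cells is illustrative.

* Design weight = stratum population / number sampled
gen double w_design = cond(agency=="A", 13400/700, ///
                      cond(agency=="B", 6000/1000, 4500/1100))

* Weighting classes: agency x qualification
egen cell = group(agency qualification)
bysort cell: egen n_sampled   = count(id)
bysort cell: egen n_responded = total(responded)

* Inflate respondents' design weights by the inverse cell response rate
gen double w_nr = w_design * n_sampled / n_responded if responded == 1

* Check: the adjusted weights sum to the frame size (23,900)
* as long as every cell contains at least one respondent
quietly summarize w_nr
display r(sum)

* Level-of-effort check: do doctors who needed more contact attempts
* answer differently? (unweighted, respondents only)
encode opdload_hilo, gen(opdload_cat)
tabulate numberofattemptstocontact opdload_cat if responded == 1, row chi2
```

The last tabulation is how I understand the paradata might be used: if answers shift with the number of attempts, non-responders (the hardest to reach) may differ in the same direction. I am not sure this reasoning is sound, which is part of my question.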
Requesting your insights on the aforementioned.
Thank you!