This is my first post on this site, so please bear with me if I commit any 'mild' transgressions. I have been using Statalist for quite some time; it is a great resource that has solved most of the problems and questions I have run into. However, I could not find an existing thread that addresses my current dilemma.
Long story short, I want to create a (random) miniature replica of my population for testing some model fitting. My data are panel data in which each firm-year row is a uniquely identified observation, covering 1998-2017. When sampling, I want to preserve the proportions of two stratification variables: the first is the occurrence of defaults (binary variable "Def_1y", =0 for non-default and =1 for default), the second is the share of observations in each year (variable "ser_year"). For example, if 1% of my firm-year observations are defaults (the remaining 99% being non-defaults) and the 2017 data make up 20% of the firm-year observations, then a random sample should preserve these shares.
Firms are observed on several occasions, so when I draw, say, a 60% random sample of the population, I want to draw without replacement and include every observation of each selected firm (i.e. cluster on the firm). For example, if a firm (variable "orgnr", an organizational number) has not defaulted in 1998 and 1999 but defaults in 2000, then all of its observations should be included whenever that firm is selected, all the while keeping the proportions of defaults and yearly observations.
I hope this makes sense. An extract of the relevant data (in order: "orgnr", "ser_year", "Def_1y") is below:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double orgnr float(ser_year Def_1y)
5560001538 1998 0
5560001538 1999 0
5560001538 2000 0
5560001991 2002 0
5560001991 2003 0
5560001991 2004 0
5560001991 2005 0
5560001991 2006 0
5560001991 2007 0
5560001991 2008 0
5560001991 2009 0
5560001991 2010 0
5560001991 2011 0
5560001991 2012 0
5560001991 2013 0
5560001991 2014 0
5560002296 1998 0
5560002296 1999 0
5560002296 2000 0
5560002296 2001 0
5560002296 2002 0
5560002296 2003 0
5560002296 2004 0
5560002296 2005 0
5560002296 2006 0
5560002296 2007 0
5560002296 2008 0
5560002296 2009 0
5560003682 1999 0
5560003682 2000 0
5560003682 2001 0
5560003682 2002 0
5560003682 2003 0
5560003682 2004 0
5560003682 2005 0
5560008293 2007 0
5560008293 2008 0
5560008855 1998 0
5560008855 1999 0
5560010554 2004 0
5560010554 2005 0
5560010554 2006 0
5560010554 2007 1
5560010554 2008 0
5560010554 2009 0
5560010554 2010 0
5560010554 2011 0
5560010554 2012 0
5560010554 2013 0
5560010554 2014 0
end
I have been trying to achieve this using -gsample- in Stata 16 (see below), but it has not produced the results I am looking for.
(i) When I specify both strata variables, clustering seems to be ignored almost haphazardly (as far as I can tell): a firm may be in the sample in one year but not the next, which is not desirable. (ii) When I specify only one strata variable ("Def_1y"), the second command below accounts for clustering almost perfectly (a few firm-years are still left out when their firm is sampled, but I guess it will never be perfect given the constraint of simultaneously keeping the default proportions). However, the distribution of yearly observations then does not mimic the population distribution, which is not desirable either.
Code:
gsample 60, percent wor strata(Def_1y ser_year) cluster(orgnr) keep generate(sample60)
gsample 60, percent wor strata(Def_1y) cluster(orgnr) keep generate(sample60)
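To make problems (i) and (ii) concrete, a check along the following lines could be run after either command. This is only a sketch: it assumes that the variable created by generate(sample60) is positive for sampled rows and zero otherwise, and the names insample, n_in, partial and firmtag are made up for this illustration.

```stata
* Assumption: sample60 > 0 marks a sampled firm-year, 0 marks a non-sampled one.
generate byte insample = sample60 > 0 & !missing(sample60)

* Count sampled rows per firm and flag firms that are only partly included,
* i.e. firms for which clustering was not respected.
bysort orgnr: egen n_in = total(insample)
bysort orgnr: generate byte partial = n_in > 0 & n_in < _N

* Tabulate the flag once per firm rather than once per firm-year.
egen byte firmtag = tag(orgnr)
tabulate partial if firmtag

* Compare default and year shares in the population and in the sample.
tabulate Def_1y
tabulate Def_1y if insample
tabulate ser_year
tabulate ser_year if insample
```

Any firm with partial equal to 1 has some, but not all, of its firm-years in the sample.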
Do I have to break up the sampling into multiple stages? Keeping the firm-level clustering while maintaining the (rare) default and year proportions might require a complex solution.
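For what it is worth, one way the multi-stage idea might be sketched is to move the stratification to the firm level and then draw whole firms. This is not something I have verified on the full data. It assumes that stratifying firms on "ever defaulted" is an acceptable stand-in for the firm-year default proportion, and it does not guarantee the year proportions, which would have to be checked afterwards. The names ever_def, onefirm, u and pick60, and the seed, are hypothetical.

```stata
* Sketch only: stratify at the firm level, then draw whole firms.
set seed 12345

* Firm-level stratum: 1 if the firm defaults in any observed year.
bysort orgnr: egen byte ever_def = max(Def_1y)

* Tag one row per firm and give that row a random number.
egen byte onefirm = tag(orgnr)
generate double u = runiform() if onefirm

* Within each stratum, pick the 60% of firms with the smallest random numbers.
bysort ever_def onefirm (u): generate byte pick60 = _n <= round(0.6 * _N) if onefirm

* Copy the firm's pick to all of its rows (missing values sort last).
bysort orgnr (pick60): replace pick60 = pick60[1]

* Check how close the sample comes to the population proportions.
tabulate Def_1y if pick60
tabulate ser_year if pick60
```

Because selection happens once per firm, every selected firm contributes all of its firm-years, and the draw is without replacement by construction.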
Again, I am just looking to create a roughly 60% random sample, drawn without replacement, that mimics the population proportions of defaults ("Def_1y") and yearly observations ("ser_year"), clustered on firm so that if one firm-year is selected, all other firm-year observations for that firm are selected as well.
Any advice would be extremely helpful and appreciated,
Best,
John-Edward