Balance an unbalanced dataset

Hi all,

I have a strongly unbalanced dataset of countries observed by year.
I would like to balance it while retaining as many observations as possible. For instance, say the maximum time span is 46 periods, but only 10 out of 100 countries are observed in all 46. If dropping just 2 time periods (going from 46 to 44) brings in 20 more countries, I end up with 30 countries observed in 44 time periods, and I would prefer that second choice. Say the minimum acceptable number of time periods is 28.

Is there a way to automate this, ideally with -preserve- and -restore-: drop all countries having fewer than k time periods, count the number of countries with k time periods, then repeat the process for the next k?

Of course, the time periods should be the same across countries. For instance, if I have 2 countries with 3 time periods each, and country A has 1980-1981-1982 while country B has 2020-2021-2022, then the time periods do not match. The retained time periods should coincide.

This is a snapshot of my data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 person_ctrycode int earliest_publn_year
"AD" 1983
"AD" 1990
"AD" 1998
"AD" 2005
"AD" 2006
"AD" 2009
"AD" 2013
"AD" 2014
"AD" 2015
"AD" 2017
"AD" 2018
"AD" 2019
"AD" 2020
"AD" 2021
"AE" 1978
"AE" 1984
"AE" 1989
"AE" 1990
"AE" 1991
"AE" 1992
"AE" 1993
"AE" 1994
"AE" 1996
"AE" 1997
"AE" 1998
"AE" 2001
"AE" 2003
"AE" 2004
"AE" 2005
"AE" 2006
"AE" 2007
"AE" 2008
"AE" 2009
"AE" 2010
"AE" 2011
"AE" 2012
"AE" 2013
"AE" 2014
"AE" 2015
"AE" 2016
"AE" 2017
"AE" 2018
"AE" 2019
"AE" 2020
"AE" 2021
"AF" 2015
"AF" 2018
"AF" 2020
"AG" 2010
"AG" 2013
"AI" 1982
"AI" 1984
"AI" 1986
"AI" 1988
"AI" 1990
"AI" 1995
"AI" 1996
"AI" 2002
"AI" 2008
"AI" 2009
"AI" 2010
"AI" 2019
"AI" 2020
"AL" 1988
"AL" 1993
"AL" 2003
"AL" 2009
"AL" 2010
"AL" 2011
"AL" 2015
"AL" 2017
"AL" 2018
"AM" 1995
"AM" 2000
"AM" 2004
"AM" 2005
"AM" 2006
"AM" 2007
"AM" 2008
"AM" 2010
"AM" 2011
"AM" 2012
"AM" 2013
"AM" 2014
"AM" 2015
"AM" 2016
"AM" 2017
"AM" 2018
"AM" 2019
"AM" 2020
"AN" 1979
"AN" 1980
"AN" 1981
"AN" 1982
"AN" 1983
"AN" 1984
"AN" 1985
"AN" 1986
"AN" 1987
"AN" 1988
end

However, I think that Stata provides a default unbalanced dataset, which might be more useful for an MWE.
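
To make the request concrete, here is a rough sketch of the kind of search I have in mind, not a tested solution. It assumes one observation per country-year, treats a "window" as a run of consecutive calendar years, and scores each window by countries times years (the number of observations retained), which is my own reading of "as many observations as possible". The local kmin is a hypothetical minimum window length, set to 3 here only so that the small example above returns something; in my real data it would be 28. Ties are resolved in favour of the first window found.

```stata
* Sketch only: try every window of consecutive years and remember the one
* that keeps the most observations in a balanced panel.
isid person_ctrycode earliest_publn_year   // assumes one obs per country-year

local kmin 3                               // hypothetical; 28 in the real data
summarize earliest_publn_year, meanonly
local ymin = r(min)
local ymax = r(max)

tempvar tag nin
egen byte `tag' = tag(person_ctrycode)     // marks one row per country

local best_obs 0
forvalues s = `ymin'/`ymax' {
    local e0 = `s' + `kmin' - 1            // shortest acceptable window end
    forvalues e = `e0'/`ymax' {
        local k = `e' - `s' + 1
        * number of years each country is observed inside the window
        quietly egen `nin' = total(inrange(earliest_publn_year, `s', `e')), ///
            by(person_ctrycode)
        * countries observed in every year of the window
        quietly count if `nin' == `k' & `tag'
        local n = r(N)
        drop `nin'
        if `n' * `k' > `best_obs' {
            local best_obs = `n' * `k'
            local best_n `n'
            local best_s `s'
            local best_e `e'
        }
    }
}

if `best_obs' > 0 {
    display "Best window: `best_s'-`best_e', `best_n' countries, `best_obs' obs"
    * keep only the complete countries, restricted to the chosen window
    egen nin = total(inrange(earliest_publn_year, `best_s', `best_e')), ///
        by(person_ctrycode)
    keep if nin == `best_e' - `best_s' + 1 & ///
        inrange(earliest_publn_year, `best_s', `best_e')
    drop nin
}
else {
    display "No window of at least `kmin' years is complete for any country"
}
```

Because the loop only counts and never drops anything until the end, it would not need -preserve- and -restore- at all. Whether this is a sensible way to do it is exactly what I am asking.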

Thank you
