Hi all,

I have a strongly unbalanced dataset of countries observed by year.
I would like to balance it while retaining as many observations as possible. For instance, say the maximum time span is 46 periods, but only 10 out of 100 countries are observed in all 46. If dropping just 2 time periods (going from 46 to 44) adds 20 more countries, I end up with 30 countries observed in 44 time periods, and I would prefer that second choice. Say the minimum acceptable number of time periods is 28.

Is there a way to make the process automatic, ideally with "preserve" and "restore": dropping all countries observed in fewer than k time periods, counting the number of countries with k time periods, and then repeating the process for the next k?

Of course, the time periods should be the same across countries. For instance, if I have 2 countries with 3 time periods each, and country A has 1980-1981-1982 while country B has 2020-2021-2022, then the time periods do not match. Ideally, the time periods should coincide.
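
To make the idea concrete, here is a rough, untested sketch of the kind of loop I have in mind, using the variable names from the data example below. It assumes one row per country-year and that a "window" means k consecutive calendar years. The names kmin, ctry_tag and nyears are just illustrative. Instead of "preserve" and "restore" it counts complete countries in place, which avoids reloading the data on every pass.

```stata
* Sketch only: assumes one row per country-year (isid stops with an error otherwise)
isid person_ctrycode earliest_publn_year

local kmin 28                              // minimum acceptable number of periods
quietly summarize earliest_publn_year
local ymin = r(min)
local ymax = r(max)
local kmax = `ymax' - `ymin' + 1           // longest possible window

* flag exactly one row per country so that countries are counted once
egen byte ctry_tag = tag(person_ctrycode)

forvalues k = `kmin'/`kmax' {
    local best_n 0
    local best_start .
    local laststart = `ymax' - `k' + 1
    forvalues start = `ymin'/`laststart' {
        local stop = `start' + `k' - 1
        capture drop nyears
        * number of years each country is observed inside [start, stop]
        quietly bysort person_ctrycode: egen nyears = total(inrange(earliest_publn_year, `start', `stop'))
        * a country is complete in this window if it has all k years
        quietly count if ctry_tag & nyears == `k'
        if r(N) > `best_n' {               // ties keep the earliest window
            local best_n = r(N)
            local best_start `start'
        }
    }
    display "k = `k': best start year `best_start', complete countries = `best_n'"
}
capture drop nyears
```

Once a k and a start year look acceptable, the balanced sample would be the rows inside that window for the countries with all k years observed.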

This is a snapshot of my data:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str4 person_ctrycode int earliest_publn_year
"AD" 1983
"AD" 1990
"AD" 1998
"AD" 2005
"AD" 2006
"AD" 2009
"AD" 2013
"AD" 2014
"AD" 2015
"AD" 2017
"AD" 2018
"AD" 2019
"AD" 2020
"AD" 2021
"AE" 1978
"AE" 1984
"AE" 1989
"AE" 1990
"AE" 1991
"AE" 1992
"AE" 1993
"AE" 1994
"AE" 1996
"AE" 1997
"AE" 1998
"AE" 2001
"AE" 2003
"AE" 2004
"AE" 2005
"AE" 2006
"AE" 2007
"AE" 2008
"AE" 2009
"AE" 2010
"AE" 2011
"AE" 2012
"AE" 2013
"AE" 2014
"AE" 2015
"AE" 2016
"AE" 2017
"AE" 2018
"AE" 2019
"AE" 2020
"AE" 2021
"AF" 2015
"AF" 2018
"AF" 2020
"AG" 2010
"AG" 2013
"AI" 1982
"AI" 1984
"AI" 1986
"AI" 1988
"AI" 1990
"AI" 1995
"AI" 1996
"AI" 2002
"AI" 2008
"AI" 2009
"AI" 2010
"AI" 2019
"AI" 2020
"AL" 1988
"AL" 1993
"AL" 2003
"AL" 2009
"AL" 2010
"AL" 2011
"AL" 2015
"AL" 2017
"AL" 2018
"AM" 1995
"AM" 2000
"AM" 2004
"AM" 2005
"AM" 2006
"AM" 2007
"AM" 2008
"AM" 2010
"AM" 2011
"AM" 2012
"AM" 2013
"AM" 2014
"AM" 2015
"AM" 2016
"AM" 2017
"AM" 2018
"AM" 2019
"AM" 2020
"AN" 1979
"AN" 1980
"AN" 1981
"AN" 1982
"AN" 1983
"AN" 1984
"AN" 1985
"AN" 1986
"AN" 1987
"AN" 1988
end
However, I think that Stata provides a built-in unbalanced dataset which might be more useful for an MWE.

Thank you