I have struggled with a problem in STATA 16.0 for a couple of days now, and hope that someone will be able to help.
In a study I am looking at changes in microorganisms at a danish hospital before and during the ongoing COVID-19 pandemic.
I am working with a data set of 18 variables and around 250.000 observations form a period of 01. jan 2016 until 31.10.2020.
I am trying to find out how many new events occur on the various hospital departments - a new event being defined as a new species found for a given ID per calender month.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte v1 long id int born str1 sex str9 pnr str10 modtaget str1 kundetype str7 afdeling str12 afsnit str17 materialegrp str11 materiale str13 lokalisation str3 gruppe int bakterie str22 species str7(vre mrsa) byte isolatnr 1 73887 1948 "F" "D19002307" "06-01-2019" "H" "O Q" "O Q" "Vćv og pus" "Vćv" "" "Eco" 2 "E. coli" "" "" 1 2 218978 1913 "F" "D15014168" "02-02-2015" "H" "O X" "O X 3" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 1 3 218978 1913 "F" "D15030215" "02-02-2015" "H" "O N" "O N AMB HO" "Urin" "Urin" "" "Eco" 2 "Staphylococcus aureus" "" "" 2 4 218978 1913 "F" "D15040981" "06-02-2015" "H" "O Q" "O Q" "Podning og sekret" "Podning" "Knć" "Eco" 2 "Staphylococcus aureus" "" "" 3 5 218978 1913 "F" "D15063931" "06-02-2015" "H" "O R" "O R AMB A" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 4 6 218978 1913 "F" "D17140288" "12-02-2015" "H" "O Q" "O Q 1" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 5 7 218978 1913 "F" "D18006462" "15-01-2018" "H" "O Q" "O Q AMB INDV" "Luftvejsprřve" "Ekspektorat" "" "Eco" 2 "E. coli" "" "" 6 8 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q" "O Q" "Podning og sekret" "Podning" "Rectum" "Eco" 2 "E. coli" "" "" 7 9 218978 1913 "F" "D15030216" "09-03-2015" "H" "O Q" "O Q AMB 1" "Urin" "Urin" "" "Eb" 29 "Enterobacter aerogenes" "" "" 1 10 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q" "O Q" "Podning og sekret" "Podning" "Rectum" "Kp" 22 "Klebsiella pneumoniae" "" "" 1 11 218978 1913 "F" "D17124169" "17-10-2017" "H" "O Q" "O Q AMB" "Podning og sekret" "Podning" "Svćlg/tonsil" "Sa" 340 "Staphylococcus aureus" "" "Positiv" 1 12 218978 1913 "F" "D18006462" "25-11-2017" "H" "O Q" "O Q AMB INDV" "Luftvejsprřve" "Ekspektorat" "" "Sa" 340 "Staphylococcus aureus" "" "Negativ" 2 13 681838 2004 "M" "D18122519" "10-10-2018" "H" "O Z" "O Z BKC" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 1 14 682201 2072 "M" "D19129122" "09-10-2018" "H" "O Q" "O Q AMB INDV" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 1 15 682201 2072 "M" "T18536251" "22-10-2018" "D" "Diverse" "KMAEL" "Fćces" "Fćces" "" "Ef" 362 "Enterococcus faecium" "Positiv" "" 1 16 682201 2072 "M" "T18542523" "23-10-2018" "D" "Diverse" "KMAEL" "Fćces" "Fćces" "" "Ef" 362 "Enterococcus faecium" "Positiv" "" 2 17 682201 2072 "M" "D18150117" "10-12-2018" "D" "Diverse" "KMAEL" "Urin" "Urin" "" "Kp" 22 "Klebsiella pneumoniae" "" "" 1 18 682201 2072 "M" "D19010402" "10-12-2018" "H" "O X" "O X 1" "Podning og sekret" "Podning" "Svćlg/tonsil" "Kp" 22 "Klebsiella pneumoniae" "" "" 2 19 682201 2072 "M" "D18127433" "22-10-2018" "D" "Diverse" "KMAEL" "Ursteril vćske" "Ledvćske" "Knć" "Sa" 340 "Staphylococcus aureus" "" "Positiv" 1 20 682201 2072 "M" "D19042194" "01-04-2019" "H" "O Q" "O Q AMB IND5" "Podning og sekret" "Podning" "Nćse" "Sa" 340 "Staphylococcus aureus" "" "Positiv" 2 21 682201 2072 "M" "D19129122" "01-04-2019" "H" "O Q" "O Q AMB INDV" "Urin" "Urin" "" "Sa" 340 "Staphylococcus aureus" "" "Positiv" 3 end
Therefore I have been trying to come up with a way of dropping all observations which share ID, species and Month/Year, except for the earliest observation in each month/year.
I have tried the following:
Code:
gen date = date(modtaget, "DMY") format date %tdMon_dd,_CCYY gen month = month(date) gen year = year(date) quietly by id year month species: gen duplicate = cond(_N==1,0,_n) drop if duplicate>1
I have then tried with the following commands:
Code:
bysort id species year month: egen flag=min(date) duplicates tag id species year month, gen(dup) drop if dup>0 & date!=flag
Code:
duplicates tag id species date department, gen(dup1) duplicates drop id species date department, force
Code:
duplicates tag id species date, gen(dup2) drop if dup2>0
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte v1 long id int born str1 sex str9 pnr str10 modtaget str1 kundetype str7 afdeling str12 afsnit str17 materialegrp str11 materiale str13 lokalisation str3 gruppe int bakterie str22 species str7(vre mrsa) byte isolatnr 1 73887 1948 "F" "D19002307" "06-01-2019" "H" "O Q" "O Q" "Vćv og pus" "Vćv" "" "Eco" 2 "E. coli" "" "" 1 2 218978 1913 "F" "D15014168" "02-02-2015" "H" "O X" "O X 3" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 1 7 218978 1913 "F" "D18006462" "15-01-2018" "H" "O Q" "O Q AMB INDV" "Luftvejsprřve" "Ekspektorat" "" "Eco" 2 "E. coli" "" "" 6 8 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q" "O Q" "Podning og sekret" "Podning" "Rectum" "Eco" 2 "E. coli" "" "" 7 9 218978 1913 "F" "D15030216" "09-03-2015" "H" "O Q" "O Q AMB 1" "Urin" "Urin" "" "Eb" 29 "Enterobacter aerogenes" "" "" 1 10 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q" "O Q" "Podning og sekret" "Podning" "Rectum" "Kp" 22 "Klebsiella pneumoniae" "" "" 1 3 218978 1913 "F" "D15030215" "02-02-2015" "H" "O N" "O N AMB HO" "Urin" "Urin" "" "Eco" 2 "Staphylococcus aureus" "" "" 2 11 218978 1913 "F" "D17124169" "17-10-2017" "H" "O Q" "O Q AMB" "Podning og sekret" "Podning" "Svćlg/tonsil" "Sa" 340 "Staphylococcus aureus" "" "Positiv" 1 12 218978 1913 "F" "D18006462" "25-11-2017" "H" "O Q" "O Q AMB INDV" "Luftvejsprřve" "Ekspektorat" "" "Sa" 340 "Staphylococcus aureus" "" "Negativ" 2 13 681838 2004 "M" "D18122519" "10-10-2018" "H" "O Z" "O Z BKC" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 1 14 682201 2072 "M" "D19129122" "09-10-2018" "H" "O Q" "O Q AMB INDV" "Urin" "Urin" "" "Eco" 2 "E. coli" "" "" 1 15 682201 2072 "M" "T18536251" "22-10-2018" "D" "Diverse" "KMAEL" "Fćces" "Fćces" "" "Ef" 362 "Enterococcus faecium" "Positiv" "" 1 19 682201 2072 "M" "D18127433" "22-10-2018" "D" "Diverse" "KMAEL" "Ursteril vćske" "Ledvćske" "Knć" "Sa" 340 "Staphylococcus aureus" "" "Positiv" 1 20 682201 2072 "M" "D19042194" "01-04-2019" "H" "O Q" "O Q AMB IND5" "Podning og sekret" "Podning" "Nćse" "Sa" 340 "Staphylococcus aureus" "" "Positiv" 2 end
I am aware that this is a long post, but I hope that it makes sort of sense and that someone can see where I go wrong, and perhaps have a good sugestion as to how I end up with only new event - defined as a new species found for a given ID per calender month.
0 Response to Drop all observations with similar variables, except for the earliest?
Post a Comment