Drop all observations with similar variables, except for the earliest?

Dear statalists

I have struggled with a problem in STATA 16.0 for a couple of days now, and hope that someone will be able to help.

In a study I am looking at changes in microorganisms at a danish hospital before and during the ongoing COVID-19 pandemic.
I am working with a data set of 18 variables and around 250.000 observations form a period of 01. jan 2016 until 31.10.2020.
I am trying to find out how many new events occur on the various hospital departments - a new event being defined as a new species found for a given ID per calender month.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte v1 long id int born str1 sex str9 pnr str10 modtaget str1 kundetype str7 afdeling str12 afsnit str17 materialegrp str11 materiale str13 lokalisation str3 gruppe int bakterie str22 species str7(vre mrsa) byte isolatnr
 1  73887 1948 "F" "D19002307" "06-01-2019" "H" "O Q"     "O Q"          "Vćv og pus"       "Vćv"        ""              "Eco"   2 "E. coli"                ""        ""        1
 2 218978 1913 "F" "D15014168" "02-02-2015" "H" "O X"     "O X 3"        "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        1
 3 218978 1913 "F" "D15030215" "02-02-2015" "H" "O N"     "O N AMB HO"   "Urin"              "Urin"        ""              "Eco"   2 "Staphylococcus aureus"  ""        ""        2
 4 218978 1913 "F" "D15040981" "06-02-2015" "H" "O Q"     "O Q"          "Podning og sekret" "Podning"     "Knć"          "Eco"   2 "Staphylococcus aureus"  ""        ""        3
 5 218978 1913 "F" "D15063931" "06-02-2015" "H" "O R"     "O R AMB A"    "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        4
 6 218978 1913 "F" "D17140288" "12-02-2015" "H" "O Q"     "O Q 1"        "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        5
 7 218978 1913 "F" "D18006462" "15-01-2018" "H" "O Q"     "O Q AMB INDV" "Luftvejsprřve"    "Ekspektorat" ""              "Eco"   2 "E. coli"                ""        ""        6
 8 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q"     "O Q"          "Podning og sekret" "Podning"     "Rectum"        "Eco"   2 "E. coli"                ""        ""        7
 9 218978 1913 "F" "D15030216" "09-03-2015" "H" "O Q"     "O Q AMB 1"    "Urin"              "Urin"        ""              "Eb"   29 "Enterobacter aerogenes" ""        ""        1
10 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q"     "O Q"          "Podning og sekret" "Podning"     "Rectum"        "Kp"   22 "Klebsiella pneumoniae"  ""        ""        1
11 218978 1913 "F" "D17124169" "17-10-2017" "H" "O Q"     "O Q AMB"      "Podning og sekret" "Podning"     "Svćlg/tonsil" "Sa"  340 "Staphylococcus aureus"  ""        "Positiv" 1
12 218978 1913 "F" "D18006462" "25-11-2017" "H" "O Q"     "O Q AMB INDV" "Luftvejsprřve"    "Ekspektorat" ""              "Sa"  340 "Staphylococcus aureus"  ""        "Negativ" 2
13 681838 2004 "M" "D18122519" "10-10-2018" "H" "O Z"     "O Z BKC"      "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        1
14 682201 2072 "M" "D19129122" "09-10-2018" "H" "O Q"     "O Q AMB INDV" "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        1
15 682201 2072 "M" "T18536251" "22-10-2018" "D" "Diverse" "KMAEL"        "Fćces"            "Fćces"      ""              "Ef"  362 "Enterococcus faecium"   "Positiv" ""        1
16 682201 2072 "M" "T18542523" "23-10-2018" "D" "Diverse" "KMAEL"        "Fćces"            "Fćces"      ""              "Ef"  362 "Enterococcus faecium"   "Positiv" ""        2
17 682201 2072 "M" "D18150117" "10-12-2018" "D" "Diverse" "KMAEL"        "Urin"              "Urin"        ""              "Kp"   22 "Klebsiella pneumoniae"  ""        ""        1
18 682201 2072 "M" "D19010402" "10-12-2018" "H" "O X"     "O X 1"        "Podning og sekret" "Podning"     "Svćlg/tonsil" "Kp"   22 "Klebsiella pneumoniae"  ""        ""        2
19 682201 2072 "M" "D18127433" "22-10-2018" "D" "Diverse" "KMAEL"        "Ursteril vćske"   "Ledvćske"   "Knć"          "Sa"  340 "Staphylococcus aureus"  ""        "Positiv" 1
20 682201 2072 "M" "D19042194" "01-04-2019" "H" "O Q"     "O Q AMB IND5" "Podning og sekret" "Podning"     "Nćse"         "Sa"  340 "Staphylococcus aureus"  ""        "Positiv" 2
21 682201 2072 "M" "D19129122" "01-04-2019" "H" "O Q"     "O Q AMB INDV" "Urin"              "Urin"        ""              "Sa"  340 "Staphylococcus aureus"  ""        "Positiv" 3
end

In the example above ID "218978" has 3 positive "E. Coli" samples and 2 positive Staph aureus samples from february 2015, but with my approach, only the first positive species sample counts as a new event.

Therefore I have been trying to come up with a way of dropping all observations which share ID, species and Month/Year, except for the earliest observation in each month/year.

I have tried the following:

Code:

gen date = date(modtaget, "DMY")
format date %tdMon_dd,_CCYY
gen month = month(date)
gen year = year(date)
quietly by id year month species:  gen duplicate = cond(_N==1,0,_n)
drop if duplicate>1

However it seems to be random which "duplicate" observation gets which duplicate number, and therefore it isnt necessary the earliest observation which is kept.

I have then tried with the following commands:

Code:

bysort id species year month: egen flag=min(date)
 duplicates tag id species year month, gen(dup)
 drop if dup>0 & date!=flag

This seems to work in the dataex above and I drop the correct observations, except when there are more than one observations which share the earliest date - all of these observations are kept. This doesnt seem to be a problem when these samples come from the same department, then I simply just drop all of these remaining duplicates except 1 by:

Code:

duplicates tag id species date department, gen(dup1)
duplicates drop id species date department, force

The only "duplicates" left after doing this are observations which share id, species and date, but which are found on different departments. In the dataex, this is the case with observation 17 and 18. With the data I have available I have no way of knowing from which department the sample was first taken and therefore I drop all of the observations which share id, species and the earliest date, but are sampled on different departments.

Code:

duplicates tag id species date, gen(dup2)
 drop if dup2>0

That all leaves me with the following in this dataex, which I am satisfied with.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte v1 long id int born str1 sex str9 pnr str10 modtaget str1 kundetype str7 afdeling str12 afsnit str17 materialegrp str11 materiale str13 lokalisation str3 gruppe int bakterie str22 species str7(vre mrsa) byte isolatnr
 1  73887 1948 "F" "D19002307" "06-01-2019" "H" "O Q"     "O Q"          "Vćv og pus"       "Vćv"        ""              "Eco"   2 "E. coli"                ""        ""        1
 2 218978 1913 "F" "D15014168" "02-02-2015" "H" "O X"     "O X 3"        "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        1
 7 218978 1913 "F" "D18006462" "15-01-2018" "H" "O Q"     "O Q AMB INDV" "Luftvejsprřve"    "Ekspektorat" ""              "Eco"   2 "E. coli"                ""        ""        6
 8 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q"     "O Q"          "Podning og sekret" "Podning"     "Rectum"        "Eco"   2 "E. coli"                ""        ""        7
 9 218978 1913 "F" "D15030216" "09-03-2015" "H" "O Q"     "O Q AMB 1"    "Urin"              "Urin"        ""              "Eb"   29 "Enterobacter aerogenes" ""        ""        1
10 218978 1913 "F" "D18041761" "04-04-2018" "H" "O Q"     "O Q"          "Podning og sekret" "Podning"     "Rectum"        "Kp"   22 "Klebsiella pneumoniae"  ""        ""        1
 3 218978 1913 "F" "D15030215" "02-02-2015" "H" "O N"     "O N AMB HO"   "Urin"              "Urin"        ""              "Eco"   2 "Staphylococcus aureus"  ""        ""        2
11 218978 1913 "F" "D17124169" "17-10-2017" "H" "O Q"     "O Q AMB"      "Podning og sekret" "Podning"     "Svćlg/tonsil" "Sa"  340 "Staphylococcus aureus"  ""        "Positiv" 1
12 218978 1913 "F" "D18006462" "25-11-2017" "H" "O Q"     "O Q AMB INDV" "Luftvejsprřve"    "Ekspektorat" ""              "Sa"  340 "Staphylococcus aureus"  ""        "Negativ" 2
13 681838 2004 "M" "D18122519" "10-10-2018" "H" "O Z"     "O Z BKC"      "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        1
14 682201 2072 "M" "D19129122" "09-10-2018" "H" "O Q"     "O Q AMB INDV" "Urin"              "Urin"        ""              "Eco"   2 "E. coli"                ""        ""        1
15 682201 2072 "M" "T18536251" "22-10-2018" "D" "Diverse" "KMAEL"        "Fćces"            "Fćces"      ""              "Ef"  362 "Enterococcus faecium"   "Positiv" ""        1
19 682201 2072 "M" "D18127433" "22-10-2018" "D" "Diverse" "KMAEL"        "Ursteril vćske"   "Ledvćske"   "Knć"          "Sa"  340 "Staphylococcus aureus"  ""        "Positiv" 1
20 682201 2072 "M" "D19042194" "01-04-2019" "H" "O Q"     "O Q AMB IND5" "Podning og sekret" "Podning"     "Nćse"         "Sa"  340 "Staphylococcus aureus"  ""        "Positiv" 2
end

However, when I run my do-file through my entire datasat the outcome varies slightly each time (how many observations are left from the different departments, and how many of the different species are fount where) and that tells me the unfortunet tale that I am doing something, possibly several things, very wrong.

I am aware that this is a long post, but I hope that it makes sort of sense and that someone can see where I go wrong, and perhaps have a good sugestion as to how I end up with only new event - defined as a new species found for a given ID per calender month.

BJ Data Tech Solution

Home / Data Cleaning / Data management / Data Processing / Drop all observations with similar variables, except for the earliest?
Drop all observations with similar variables, except for the earliest?

0 Response to Drop all observations with similar variables, except for the earliest?

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Drop all observations with similar variables, except for the earliest? Drop all observations with similar variables, except for the earliest?

Related Posts with Drop all observations with similar variables, except for the earliest?

0 Response to Drop all observations with similar variables, except for the earliest?

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Drop all observations with similar variables, except for the earliest?
Drop all observations with similar variables, except for the earliest?