I have a dataset that I need to clean but I'm having a hard time coming up with a way to do what I need.
This is employees absences (leave) by days or fractions of days.
Sometimes an employee put in a leave request and then it didn't take it, or took it but for a different reason, so there are corrections (the negative numbers, such as in teacher id 86 date 20781).
I need to drop the observations that have the leave that wasn't taken, the correction, and keep only the leave that was actually taken. The corrections don't always have the same leave code.
Sometimes an employee put in two different leave requests for the same day, and they do add up to a full day, so those need to stay (example: teacher id 152 date 20895)
Sometimes HR messed up and generated duplicates of leave. One of those duplicates needs to go. (example the last two obs in this example, teacher id 694 date 20877)
My data looks like this:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input long teacherid int date double timeabsent str5 LEAVECODE 86 20781 1 "70" 86 20781 -1 "70" 86 20781 1 "01" 152 20895 .5 "01" 152 20895 .5 "01B" 158 20961 1 "04" 158 20961 1 "61" 158 20961 -1 "61" 158 20962 1 "04" 158 20962 -1 "61" 158 20962 1 "61" 158 20963 1 "04" 158 20963 -1 "61" 158 20963 1 "61" 158 20964 1 "04" 158 20964 1 "61" 158 20964 -1 "61" 158 20965 1 "04" 158 20965 -1 "61" 158 20965 1 "61" 206 20912 .55 "04" 206 20912 .45 "57" 442 20846 .5 "01" 442 20846 .5 "01B" 446 20972 .5 "01" 446 20972 .5 "01B" 474 20709 -1 "04-UN" 474 20709 1 "04" 474 20709 1 "61-UN" 474 20710 -1 "04-UN" 474 20710 1 "04" 474 20710 1 "61-UN" 474 20711 -1 "04-UN" 474 20711 1 "04" 474 20711 1 "61-UN" 474 20712 -1 "04-UN" 474 20712 1 "04" 474 20712 1 "61-UN" 474 20713 -1 "04-UN" 474 20713 1 "04" 474 20713 1 "61-UN" 474 20716 -1 "04-UN" 474 20716 1 "04" 474 20716 1 "61-UN" 474 20717 -1 "04-UN" 474 20717 1 "04" 474 20717 1 "61-UN" 474 20718 -1 "04-UN" 474 20718 1 "04" 474 20718 1 "61-UN" 474 20719 -1 "04-UN" 474 20719 1 "04" 474 20719 1 "61-UN" 474 20720 -1 "04-UN" 474 20720 1 "04" 474 20720 1 "61-UN" 532 20940 1 "04" 532 20940 -1 "01" 532 20940 1 "01" 532 20941 1 "04" 532 20941 1 "01" 532 20941 -1 "01" 532 20942 1 "04" 532 20942 1 "01" 532 20942 -1 "01" 532 20943 -1 "01E" 532 20943 1 "01E" 532 20943 1 "04" 532 20944 1 "01E" 532 20944 -1 "01E" 532 20944 1 "04" 532 20947 1 "04" 532 20947 1 "01B" 532 20947 -1 "01B" 532 20948 .1 "59" 532 20948 -.1 "59-UY" 532 20948 1 "04" 532 20948 .9 "01B" 532 20948 -.9 "01BUY" 532 20949 1 "59" 532 20949 -1 "59-UY" 532 20949 1 "04" 574 20894 1 "04" 574 20894 -1 "01" 574 20894 1 "01" 574 20895 1 "04" 574 20895 -1 "01" 574 20895 1 "01" 672 20989 .5 "01" 672 20989 .5 "01B" 682 20919 1 "04H" 682 20920 1 "04H" 682 20921 1 "04H" 682 20922 1 "04H" 682 20923 1 "04H" 682 20928 1 "01" 682 20965 1 "04H" 682 20968 1 "04H" 694 20877 1 "31" 694 20877 1 "04" end format %tdnn/dd/CCYY date
I tried collapsing and tagging duplicates that add up to more than one day, but when there are negative numbers, that doesn't help. I tried tagging duplicates and keeping only the last instance, but the absence I need to keep isn't always where _n==_N
How can I get rid of the negative numbers and what they are correcting (their matching positives) and leave only the positives that have no correction, while also leaving only the positives that have no duplicates, or one instance of the positives with duplicates that add up to more than 1, or the positives that have duplicates but that add up to 1 or less?
(I have a quarter million observations for one district)
Thanks in advance.
0 Response to Help with dropping selected duplicates
Post a Comment