Hello all,

I'm a new Stata user and new to Statalist; I'm currently using StataMP 14.

Purpose: I'm doing some internal reporting and analysis for my jurisdiction's contact tracing efforts, and attempting to clump together duplicate observations that were auto-generated from different sources and then give these dates a sub-group id. I feel like I'm almost there but having trouble applying some of the literature on the forum (e.g., Stata Tip 51: Events in intervals).

Goal: My ultimate goal is to create a new unique id based off of four criteria (first 3 letters of first name + first three of last + DOB + date of last exposure only if dates fall within 7 days of each other. I'm stuck at the last criteria and don't know how to apply a loop to my given problem (if that's what's even needed) of parsing out only certain dates (those that fall within one week of each other), and excluding the rest.

What I've done so far: I was able to group duplicates and assign them an ID in order to calculate the elapsed time within the group; however, I'm inadvertently excluding some of the dates that I wish to keep.

I then totaled the elapsed time within groups with the variable "between" and then created a binary variable to flag if groups were within one week of each other (varname: withinoneweek), at which point I flagged groups that met the 'within one week' criteria. Once I group dates that fall within one week I can assign them another group number and then concatenate that to the first part of my unique id. As you can see, my method is incomplete and the first four observations that have the same last exposure date were excluded.

I've generated some dummy data that exemplifies the issue at hand. Below, I've included syntax that I've used to get me up to the point where I'm currently stuck. My apologies ahead of time for the confusion as I'm new to Stata; any tips or advice would be most helpful and much appreciated.

Code:
clear
input str12 testid byte dup_testid float(groupid lastexposure newgroupid between totaldays withinoneweek)
"NatGar17866" 5 134 22512   .  0 14 0
"NatGar17866" 5 134 22512   .  0 14 0
"NatGar17866" 5 134 22512   .  . 14 0
"NatGar17866" 5 134 22512   .  0 14 0
"NatGar17866" 5 134 22522   . 10 14 0
"NatGar17866" 5 134 22526   .  4 14 0
"JohSmi17836" 4 133 22512   .  . 14 0
"JohSmi17836" 4 133 22512   .  0 14 0
"JohSmi17836" 4 133 22512   .  0 14 0
"JohSmi17836" 4 133 22525   . 13 14 0
"JohSmi17836" 4 133 22526   .  1 14 0
"JanDoe17502" 4 173 22531 116  0  0 1
"JanDoe17502" 4 173 22531 116  .  0 1
"JanDoe17502" 4 173 22531 116  0  0 1
"JanDoe17502" 4 173 22531 116  0  0 1
"JanDoe17502" 4 173 22531 116  0  0 1
end
format %tdnn/dd/CCYY lastexposure
Code:
egen testid = concat(partfirst partlast dob)
duplicates report testid
duplicates tag testid, generate(dup_testid)
 gsort -dup_testid testid lastexposure

egen groupid = group(testid) if dup_testid >=1

bysort groupid (lastexposure): gen between = lastexposure - lastexposure[_n-1] if dup_testid >=1    

bysort groupid (lastexposure): egen totaldays = total(between) if dup_testid >=1
    
gen withinoneweek = cond(totaldays >= 0 & totaldays <=7, 1, 0)

egen newgroupid = group(groupid) if withinoneweek == 1