I cannot include my source data due to sensitivity, but if needed I will post a template with made up values. I don't know if reshaping is the proper term here...

My data is indexed by COVID-19 case IDs in my country. The variables I am interested in are when the case's earliest test (eRT) was sampled versus when it was logged to a database of interest (the difference being 'delay'). Each case has info on the patient's name and demographics, as well as test type, which lab ordered and/or processed it, what city the lab is in, etc.

I am interested in verifying if there are significant differences between cities and labs in terms of delays, as defined. This is what I am running after cleaning up my data a bit.


Code:
cap gen eRT_createdAt_cont=date(eRT_createdAt,"YMD")
cap gen eRT_sampleCollectedDate_cont=date(eRT_sampleCollectedDate,"MDY")

cap gen delay = eRT_createdAt_cont - eRT_sampleCollectedDate_cont
replace delay = . if delay < 0 | delay > 5

encode eRT_processedByEntity_city, gen(city)
encode eRT_processedByEntity_name, gen(lab)
keep if eRT_testType == "Molecular"
by lab, sort: drop if _N < 500
fvset base 365 lab
asdoc reg delay i.lab, label
I believe it would be appropriate to transform/reshape my data such that it is panel data, with id being lab and time being ert_sampleCollectedDate, and the value of interest being delays_mean during that day, for that lab. So delays would be calculated by summing all delays for lab X on day Y, and dividing the total by the number of samplesCollected that day.

I am having a bit of trouble making sure this is the way to go. The above regression give very significant results, as does the one with cities without the need to drop observations. If this is indeed the way to go, I am having trouble envisioning how to generate this variable in Stata, create two columns of unique sampleCollectedDate, Lab pairs, and merge the variable with that table.

I am also open to exploring delays being, instead, the mean for a week; daily data may be noisy, I figure.