Dear Experts,

I have a large population database of about 20,000 patients who have been treated for localized prostate cancer with various therapies (surgery, radiation, systemic drugs, etc.). It contains oncologic outcomes (e.g., time to recurrence, metastasis, death, death from prostate cancer, next therapy, or last follow-up if none of these occurred). Patients are tracked longitudinally, and the database is mostly used for comparative effectiveness research centered on survival analyses.

I would like to select a subset of patients from this database whose baseline characteristics have a distribution similar to that of patients in published, prospective clinical trials.

For example, the published RTOG 9601 trial was a two-arm randomized trial that investigated the addition (or omission) of a drug therapy for patients receiving radiation therapy after surgery for localized prostate cancer.

Here is a table from the publication showing the baseline characteristics of the population:
[Table image not reproduced here: baseline characteristics of the RTOG 9601 trial population]

How would I go about selecting a population from my database of 20,000 patients that would match the distribution of these published patients?

The end goal is to see how different therapies applied to a similar, matched patient population compare to published studies. Without the individual patient-level data from the published studies, I am not sure how to proceed.

Here is my data structure that matches the categories above:

Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float(race Age Karnofsky personid Gleason_Score) byte pick3
5  65.41547 3    50 1 1
7  66.90486 9   155 1 1
7  78.17933 3   200 1 1
7   60.4846 3   278 0 1
5  62.89117 9   339 1 1
7         . 9   500 0 1
7  69.65366 9   527 1 1
7  66.49418 9   901 2 1
7  73.58248 9  1103 1 1
7 66.428474 9  2193 2 1
7         . 9  5267 0 1
7  63.96714 3  5343 0 1
6         . 9  5388 0 1
4         . 9  5465 0 1
7  77.16632 3  7921 0 1
6  77.37167 9  8124 0 1
6   72.1013 9  8556 2 1
7         . 9  8992 0 1
7  69.89733 9  8994 2 1
7  81.27584 8  9017 2 1
7  65.98768 9  9126 0 1
6  70.57358 3  9855 2 1
7  58.82272 9 10155 1 1
5   70.8063 9 10244 2 1
7  71.66872 8 10395 1 1
7  60.94456 9 10734 0 1
7 65.420944 9 10863 1 1
7   69.9165 9 11436 1 1
7  66.49145 9 11453 1 1
7  70.87474 3 11537 0 1
7  64.19439 3 11622 1 1
7  73.52772 9 11716 2 1
7  51.40862 3 11844 0 1
7  63.01985 9 12173 1 1
7  56.53388 9 12293 1 1
7  74.06434 9 12427 2 1
7  65.94661 9 12914 1 1
7  47.20329 9 13012 2 1
7  68.98015 9 13277 2 1
7  59.82204 9 13591 2 1
end
label values race race
label def race 4 "Native Am.", modify
label def race 5 "Other", modify
label def race 6 "Unknown", modify
label def race 7 "White", modify
label values Karnofsky RT_KPS
label def RT_KPS 3 "(Karnofsky)  100 - Normal, no co", modify
label def RT_KPS 8 "(Karnofsky)  80 - Normal activit", modify
label def RT_KPS 9 "(Karnofsky)  90 - Able to carry", modify
label values Gleason_Score GS
label def GS 0 "2-6", modify
label def GS 1 "7", modify
label def GS 2 "8-10", modify
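
The only approach I can think of so far is to sample within the categories of a single variable so that its marginal distribution matches the published one. Below is a rough sketch for Gleason score using the example data above. The target shares (0.30, 0.50, 0.20), the subset size of 20, and the variable names target, u, and in_subset are made-up placeholders, not values from the trial. It also ignores the joint distribution with age, race, and Karnofsky status, which is the part I do not know how to handle.

```stata
* Sketch only: the target shares are PLACEHOLDERS, not the RTOG 9601 values
set seed 12345
local n_target 20

* Hypothetical target share for each Gleason_Score category
* (0 = "2-6", 1 = "7", 2 = "8-10"); stays missing if Gleason_Score is missing
generate double target = .
replace target = 0.30 if Gleason_Score == 0
replace target = 0.50 if Gleason_Score == 1
replace target = 0.20 if Gleason_Score == 2

* Put patients in random order within each category, then flag the first
* round(share * n_target) of them; a category with too few patients
* simply contributes everyone it has
generate double u = runiform()
bysort Gleason_Score (u): generate byte in_subset = ///
    (_n <= round(target * `n_target')) & !missing(target)

* Check: the subset's Gleason distribution should approximate the targets
tabulate Gleason_Score if in_subset
```

This would need to be run from a do-file because of the line continuation. Matching several characteristics at once this way would require the published cross-tabulation, which the trial table does not give me.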




Any help appreciated.

Cheers,

JT