Short version: I’m seeking a way to do 1:m matching of cases and controls without replacement, *given a file in long (“edge”) format of all possible matching pairs.* Code to create sample data appears at the end of this post.

Longer version:
In corresponding off-list with Rima Saliba about this problem, I discovered that the code I had posted here several years ago in response to a question about it is wrong.

While there have been a number of threads over the years about case-control matching (see, e.g., here), I have now encountered and refined the problem in a way that seems different enough from previous work to be worth posting in a more generalized form. What follows is my refined version of the problem, to which I'm interested in solutions.

Via the use of -joinby- or perhaps -rangejoin- or -cross-, one can create a file that pairs each case or treatment subject with its potential matching controls. In such situations, analysts may want to have 1:m matching *without replacement.* Complications include that:
1) The controls that match one case/treatment subject may also match many others.
2) There are varying numbers of controls available for each case, sometimes more and sometimes fewer than m.
3) If possible, one wants to avoid too “greedy” an algorithm, which in the extreme can result in one case being assigned all m controls while some other, similar case gets none.
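To make complications 1) and 2) concrete, here is a minimal sketch of how one might count them in an edge file. The six-row toy file, the value of m, and the variable names (pairtag, ncand, nshare, casetag, ctrltag) are all made up for illustration and are not part of the problem statement.

```stata
* Hypothetical toy edge file, for illustration only
clear
input int caseid int controlid
1 101
1 102
1 103
2 101
2 102
3 103
end
local m = 2                                     // hypothetical target number of controls per case
egen byte pairtag = tag(caseid controlid)       // flags one row per distinct case-control pair
bysort caseid: egen ncand = total(pairtag)      // distinct candidate controls for each case
bysort controlid: egen nshare = total(pairtag)  // distinct cases competing for each control
egen byte casetag = tag(caseid)                 // one row per case
egen byte ctrltag = tag(controlid)              // one row per control
count if casetag & ncand < `m'                  // cases that can never reach m controls
tabulate nshare if ctrltag                      // how widely each control is shared
```

In this toy file, case 3 has only one candidate control, so it cannot receive m = 2 controls no matter how the matching is done, and every control is a candidate for two cases.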

I have the idea that some solution involving -merge- should be possible, per some earlier threads, but I have not successfully figured out how to do that. I also have the thought that one of the many built-in or community-contributed matching commands might be used, but I have not worked that out either. I *have* discovered that some of the “greediness” problem can be avoided by having an algorithm that picks only *1* control without replacement for every case, and then applying this iteratively, so a solution that only picks one control per case would solve the problem.

In that context, here is a code snippet to create what I’d consider a representative kind of data set with which to work:

Code:
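// Create an “edge” file of matched pairs.
set seed 82743
local ncases = 100          // number of cases
local maxmatch = 100        // largest number of candidate controls per case
local maxcontrolid = 2000   // control ids are drawn from 1..maxcontrolid
clear
set obs `ncases'
gen int caseid = _n
// Each case gets between 1 and maxmatch candidate controls.
gen navail = ceil(runiform() * `maxmatch')
label var navail "# controls matched to this case"
expand navail
// Random control ids, so the same control can be a candidate for several
// cases (and can occasionally repeat within a case).
gen int controlid = ceil(runiform() * `maxcontrolid')
summ navail
order caseid controlid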
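
For concreteness, the "one control per case, applied iteratively" idea described earlier might be sketched as follows against the sample file just created. This is only a random greedy heuristic under my own assumptions, not a known-correct or optimal solution: the value of m and the helper variables (u, round, ctrl_used, case_done, inelig, pick, nuse) are hypothetical names, and nothing here guarantees that the largest possible number of cases ends up with m controls.

```stata
* Assumes the edge file (caseid, controlid) created above is in memory.
local m = 3                       // hypothetical target number of controls per case
gen double u = runiform()         // random order used to break ties
gen byte round = 0                // 0 = edge not chosen; r = chosen in pass r
gen byte ctrl_used = 0            // 1 once the control is assigned to any case
gen byte case_done = 0            // 1 once the case has its control for this pass
gen byte inelig = 0
gen byte pick = 0
quietly {
    forvalues r = 1/`m' {
        replace case_done = 0
        local more = 1
        while `more' {
            // edges that cannot be used in this step
            replace inelig = (round > 0) | ctrl_used | case_done
            // each unserved case proposes its first eligible edge
            bysort caseid (inelig u): replace pick = !inelig & (_n == 1)
            // if several cases propose the same control, keep one proposal
            bysort controlid (pick u): replace pick = 0 if _n < _N
            replace round = `r' if pick
            // retire the chosen controls and mark the served cases
            bysort controlid (pick): replace ctrl_used = 1 if pick[_N]
            bysort caseid (pick): replace case_done = 1 if pick[_N]
            count if pick
            local more = r(N)     // stop when no new pair was chosen
        }
    }
}
* Check: no control is used more than once among the chosen pairs.
bysort controlid: egen nuse = total(round > 0)
assert nuse <= 1
tabulate round
```

The rows with round > 0 are the selected pairs. Because every case gets a chance at a first control before any case gets a second, a case with few candidates is less likely to be crowded out than under a single greedy pass that fills each case up to m in turn.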

I realize that “without replacement” is not necessarily analytically preferable, but that’s another issue.