Identifying duplicates within groups based on two variables in reverse order

Dear all,

I matched 2 datasets using the joinby command on the variable "project_id". The result is a table with pairs of workers who collaborate on specific projects.

As you can see, there are some duplicate pairs when worker1_id and worker2_id are exchanged. For example, the worker pairs in line 5 and 6 appear in reversed order in line 7 and 9.

How would I go about identifying and dropping such duplicates (for the same project_id)?

Note that in those cases where a pair appears more than one time per project_id, I would like to keep the observation where worker1 is the lead (variable "worker1_lead"==1). So for example, when deciding whether to keep the pair in line 5 vs. its duplicate in line 9, we would keep the observation in line 5. When neither observation of a duplicate has a lead worker, then it is irrelevant which one we keep.

I would greatly appreciate your help, thanks a lot in advance!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(project_id worker1_id worker2_id) byte worker1_lead
  533 2400  837 1
  747   45 2020 0
 1107  418   72 1
 1107   72  418 0
31587  708  462 1
31587  708 2782 1
31587 2782  708 0
31587 2782  462 0
31587  462  708 0
end

BJ Data Tech Solution

Home / Data Cleaning / Data management / Data Processing / Identifying duplicates within groups based on two variables in reverse order
Identifying duplicates within groups based on two variables in reverse order

0 Response to Identifying duplicates within groups based on two variables in reverse order

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Identifying duplicates within groups based on two variables in reverse order Identifying duplicates within groups based on two variables in reverse order

Related Posts with Identifying duplicates within groups based on two variables in reverse order

0 Response to Identifying duplicates within groups based on two variables in reverse order

Post a Comment

Home / Data Cleaning / Data management / Data Processing / Identifying duplicates within groups based on two variables in reverse order
Identifying duplicates within groups based on two variables in reverse order