Dear Statalist,

I have a large dataset that contains several million emails (one row per email sent). I am trying to create a variable that indicates whether a given email sender has written to an email recipient before. To illustrate the problem, I created a short example:

message_id  sender_id    time              recipient_1  recipient_2  recipient_3  num_recipients  new_recipients
1           john_peters  31.05.2010 02:57  abc          def          ghi          3               3
2           john_peters  31.05.2010 03:47  klm                                    1               1
3           john_peters  31.05.2010 03:54  nop          ghi                       2               1
4           john_peters  31.05.2010 05:04  abc          klm                       2               0
5           john_peters  31.05.2010 06:28  def                                    1               0
6           john_peters  31.05.2010 06:42  abc          klm                       2               0
7           john_peters  31.05.2010 06:59  qrs          def                       2               1
8           mary_clark   01.06.2010 06:28  abc          klm                       2               2
9           mary_clark   01.06.2010 06:42  tuv                                    1               1
10          mary_clark   01.06.2010 06:59  tuv          xyz                       2               1

The variable message_id is a unique identifier for each email.
The variable sender_id is a unique identifier for each email sender.
The variable time is the time stamp when the email was sent.
The variables recipient_* are unique string identifiers for each email recipient.
The variable num_recipients is the number of recipients (= the number of non-empty recipient_* variables in the row).

The variable new_recipients is the variable of interest. It counts the recipients of a given email who have not previously received an email from that sender. For instance, in message 3, john_peters sent an email to recipients nop and ghi. Given that john_peters had already sent a message to ghi before (see message 1) but not to nop, the number of new recipients is one for message 3. In message 4, by contrast, both abc and klm had already been written to (messages 1 and 2), so the number of new recipients is zero.
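
To pin down the definition, here is a minimal sketch of the counting logic, written in Python rather than Stata purely for illustration. The function name count_new_recipients and the (sender, recipients) row layout are made up for this example, and it assumes the rows are already sorted by time within each sender. It only shows the rule I am trying to implement, not a solution that scales to my data.

```python
def count_new_recipients(rows):
    """Return, for each row, how many recipients the sender has not written to before.

    rows: iterable of (sender_id, list_of_recipients), assumed to be in
    chronological order within each sender. Empty strings stand for
    unused recipient_* slots and are ignored.
    """
    seen = set()   # (sender, recipient) pairs encountered so far
    counts = []
    for sender, recipients in rows:
        new = 0
        for recipient in recipients:
            if recipient and (sender, recipient) not in seen:
                seen.add((sender, recipient))
                new += 1
        counts.append(new)
    return counts


# The ten example messages from the table above, in order.
rows = [
    ("john_peters", ["abc", "def", "ghi"]),
    ("john_peters", ["klm", "", ""]),
    ("john_peters", ["nop", "ghi", ""]),
    ("john_peters", ["abc", "klm", ""]),
    ("john_peters", ["def", "", ""]),
    ("john_peters", ["abc", "klm", ""]),
    ("john_peters", ["qrs", "def", ""]),
    ("mary_clark", ["abc", "klm", ""]),
    ("mary_clark", ["tuv", "", ""]),
    ("mary_clark", ["tuv", "xyz", ""]),
]

print(count_new_recipients(rows))  # [3, 1, 1, 0, 0, 0, 1, 2, 1, 1]
```

The printed counts match the new_recipients column in the example. The sketch makes a single pass over the rows and keeps only the set of sender-recipient pairs in memory, which is the behaviour I am hoping to reproduce in Stata without reshaping the data.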

It would be relatively easy to create the variable new_recipients using the reshape command. However, given the size of the dataset, reshape is too time-consuming and memory-intensive. Thus, I was wondering whether there is an alternative solution (maybe a loop) that would let me create the variable new_recipients more efficiently.

Best,

Marvin