I have a large dataset that contains several million emails (one row per email sent). I am trying to create a variable that indicates whether a given email sender has written to an email recipient before. To illustrate the problem, I created a short example:
message_id | sender_id | time | recipient_1 | recipient_2 | recipient_3 | num_recipients | new_recipients |
1 | john_peters | 31.05.2010 02:57 | abc | def | ghi | 3 | 3 |
2 | john_peters | 31.05.2010 03:47 | klm | | | 1 | 1 |
3 | john_peters | 31.05.2010 03:54 | nop | ghi | | 2 | 1 |
4 | john_peters | 31.05.2010 05:04 | abc | klm | | 2 | 0 |
5 | john_peters | 31.05.2010 06:28 | def | | | 1 | 0 |
6 | john_peters | 31.05.2010 06:42 | abc | klm | | 2 | 0 |
7 | john_peters | 31.05.2010 06:59 | qrs | def | | 2 | 1 |
8 | mary_clark | 01.06.2010 06:28 | abc | klm | | 2 | 2 |
9 | mary_clark | 01.06.2010 06:42 | tuv | | | 1 | 1 |
10 | mary_clark | 01.06.2010 06:59 | tuv | xyz | | 2 | 1 |
The variable message_id is a unique identifier for each email.
The variable sender_id is a unique identifier for each email sender.
The variable time is the time stamp when the email was sent.
The variables recipient_* are unique string identifiers for each email recipient.
The variable num_recipients is the number of recipients (= the count of non-missing recipient_* values in the row).
The variable new_recipients is the variable of interest. It counts the recipients of a given email who have not received an earlier email from the same sender. For instance, in message 3, john_peters sent an email to recipients nop and ghi. Given that john_peters had already written to ghi (see message 1) but not to nop, the number of new recipients is one for message 3. In message 4, both recipients had been written to before (abc in message 1, klm in message 2), so the number of new recipients is zero.
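The rule above can be sketched as a single pass over the data. The following is a minimal illustration in Python, not a tuned solution. It assumes the rows are already sorted by time within sender and that missing recipients are stored as empty strings; the function name count_new_recipients and the list emails are made up for this example.

```python
from collections import defaultdict


def count_new_recipients(rows):
    """Count first-time recipients per email.

    rows: iterable of (sender_id, [recipient, ...]) in chronological order
    (an assumption of this sketch). Returns one count per row.
    """
    seen = defaultdict(set)  # sender_id -> recipients already written to
    counts = []
    for sender, recipients in rows:
        # Drop missing values (empty strings) and duplicates within one email.
        current = {r for r in recipients if r}
        new = current - seen[sender]
        counts.append(len(new))
        # Remember these recipients for the sender's later emails.
        seen[sender] |= new
    return counts


# The ten example messages, in the same order as the table.
emails = [
    ("john_peters", ["abc", "def", "ghi"]),
    ("john_peters", ["klm", "", ""]),
    ("john_peters", ["nop", "ghi", ""]),
    ("john_peters", ["abc", "klm", ""]),
    ("john_peters", ["def", "", ""]),
    ("john_peters", ["abc", "klm", ""]),
    ("john_peters", ["qrs", "def", ""]),
    ("mary_clark", ["abc", "klm", ""]),
    ("mary_clark", ["tuv", "", ""]),
    ("mary_clark", ["tuv", "xyz", ""]),
]

print(count_new_recipients(emails))
```

On the example data this prints [3, 1, 1, 0, 0, 0, 1, 2, 1, 1], which matches the new_recipients column. The sketch keeps one set of recipients per sender, so its memory use grows with the number of distinct sender-recipient pairs rather than with the number of rows.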
It would be relatively easy to create the variable new_recipients using a reshape command. However, given the size of the dataset, reshape is too time-consuming and memory-intensive. I was therefore wondering whether there is an alternative solution (maybe a loop) that would let me create new_recipients more efficiently.
Best,
Marvin