Panel data: Identify recurrent strings across columns

Dear Statalist,

I have a large dataset that contains several million emails (one row per email sent). I am trying to create a variable that indicates whether a given email sender has written to an email recipient before. To illustrate the problem, I created a short example:

message_id	sender_id	time	recipient_1	recipient_2	recipient_3	num_recipients	new_recipients
1	john_peters	31.05.2010 02:57	abc	def	ghi	3	3
2	john_peters	31.05.2010 03:47	klm			1	1
3	john_peters	31.05.2010 03:54	nop	ghi		2	1
4	john_peters	31.05.2010 05:04	abc	klm		2	0
5	john_peters	31.05.2010 06:28	def			1	0
6	john_peters	31.05.2010 06:42	abc	klm		2	0
7	john_peters	31.05.2010 06:59	qrs	def		2	1
8	mary_clark	01.06.2010 06:28	abc	klm		2	2
9	mary_clark	01.06.2010 06:42	tuv			1	1
10	mary_clark	01.06.2010 06:59	tuv	xyz		2	1

The variable message_id is a unique identifier for each email.
The variable sender_id is a unique identifier for each email sender.
The variable time is the time stamp when the email was sent.
The variables recipient_* are unique string identifiers for each email recipient.
The variable num_recipients is the number of recipients (= the number of non-missing recipient_* values in the row).

The variable new_recipients is the variable of interest. It counts the recipients of a given email who have not previously received an email from that sender. For instance, in message 3, john_peters wrote to recipients nop and ghi. Since john_peters had already written to ghi (see message 1) but not to nop, the number of new recipients for message 3 is one. In message 4, by contrast, both abc (message 1) and klm (message 2) had been contacted before, so the count is zero.
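To pin down the definition, here is a minimal sketch of the counting logic in Python rather than Stata. It is only meant to show what new_recipients should contain, not how to compute it efficiently in Stata. The function name count_new_recipients and the tuple layout of the input are assumptions made for illustration, and the rows are assumed to be sorted by time within each sender already.

```python
from collections import defaultdict


def count_new_recipients(emails):
    """Return {message_id: number of first-time recipients}.

    `emails` is an iterable of (message_id, sender_id, recipients) tuples,
    assumed to be in chronological order. This layout is hypothetical.
    """
    seen = defaultdict(set)  # sender_id -> recipients already written to
    counts = {}
    for message_id, sender_id, recipients in emails:
        current = {r for r in recipients if r}  # ignore empty recipient slots
        counts[message_id] = len(current - seen[sender_id])
        seen[sender_id] |= current  # remember them for later emails
    return counts


# The example data from the table above, with empty slots as "".
emails = [
    (1, "john_peters", ["abc", "def", "ghi"]),
    (2, "john_peters", ["klm", "", ""]),
    (3, "john_peters", ["nop", "ghi", ""]),
    (4, "john_peters", ["abc", "klm", ""]),
    (5, "john_peters", ["def", "", ""]),
    (6, "john_peters", ["abc", "klm", ""]),
    (7, "john_peters", ["qrs", "def", ""]),
    (8, "mary_clark", ["abc", "klm", ""]),
    (9, "mary_clark", ["tuv", "", ""]),
    (10, "mary_clark", ["tuv", "xyz", ""]),
]

counts = count_new_recipients(emails)
```

The sketch keeps one set of known recipients per sender and makes a single pass over the rows, so it never needs the long layout that reshape would produce. Whether an equivalent single-pass approach is practical in Stata is exactly what I am unsure about.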

It would be relatively easy to create the variable new_recipients after a reshape to long format. However, given the size of the dataset, reshape is too time-consuming and memory-intensive. I was therefore wondering whether there is an alternative solution (maybe a loop) that would let me create new_recipients more efficiently.

Best,

Marvin
