This is my first post here and I am fairly new to Stata, but I'll try to describe my problem as precisely as possible:
I am working with several datasets, each of which has different unique identifiers. In order to merge them, I need a linked dataset, which is causing me quite a headache.
To start, I would like to eliminate adjacent observations if they share the same two identifiers (for instance id1 and id2). I therefore created a new variable that should take the value 1 if that is the case and 0 otherwise. My code looks as follows:
gen sameid=0
replace sameid=1 if id1[_n]==id1[_n-1] & id2[_n]==id2[_n-1]
The code does what it is supposed to do, except that it does not adhere to my previous sorting (I sorted the data in ascending order of the "linkenddt" dates, which is important for later operations). Specifically, the following is returned (as an excerpt):
    | id1  | id2  | linkdt    | linkenddt | dup | sameid |
 1. | 1076 | 6765 | 04nov1982 | 31dec1992 | 5   | 0      |
 2. | 1076 | 6765 | 03nov1992 | 31dec1992 | 5   | 1      |
 3. | 1076 | 6765 | 01jan1993 | 30nov2010 | 5   | 1      |
 4. | 1076 | 6765 | 01jan1993 | 30nov2010 | 5   | 1      |
 5. | 1076 | 6765 | 01dec2010 | 10dec2010 | 5   | 1      |
 6. | 1076 | 6765 | 01dec2010 | 31dec2019 | 5   | 1      |
 7. | 1000 | 8987 | 01jan1950 | 30jan1962 | 1   | 0      |
 8. | 1000 | 8987 | 31jan1962 | 31dec2019 | 1   | 1      |
 9. | 375  | 55   | 08jun1983 | 09mar1998 | 0   | 0      |
10. | 375  | 55   | 05jan1971 | 15aug2003 | 0   | 0      |
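For comparison, here is a minimal sketch of a variant that fixes the sort order explicitly instead of relying on an earlier sort. It is only an illustration under assumptions: linkenddt is taken to be a numeric Stata date, the intended order is by linkenddt within each id1/id2 pair, and obs_order and sameid2 are hypothetical names not present in my data. Note that, unlike the adjacent-row comparison above, it flags every repeat of a pair within the group, not only repeats that happen to sit next to each other in the current order.

```stata
* Sketch only; obs_order and sameid2 are hypothetical variable names.
* Assumes linkenddt is stored as a numeric Stata date.

* Record the current row order so that it can be restored afterwards.
gen long obs_order = _n

* Group by id1 and id2, order rows by linkenddt within each group,
* and flag every row after the first one in its group.
bysort id1 id2 (linkenddt): gen byte sameid2 = _n > 1

* Return to the original row order.
sort obs_order
```

Putting linkenddt in parentheses tells bysort to sort on it without treating it as part of the group definition, so the within-pair order is stated in the command itself.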
Many thanks in advance and kind regards,
Jasper