Problems with duplicates in a panel dataset

Hello, I am using Stata 16.0 on Windows 10.

I am working with a dataset like this one:

Code:

. input id year income

            id       year     income
  1. 9 1 10
  2. 9 1 5
  3. 9 1 7
  4. 9 1 14
  5. 9 1 18
  6. 9 2 11
  7. 9 2 6
  8. 9 2 8
  9. 9 2 15
 10. 9 2 19
 11. 10 1 3
 12. 10 2 4
 13. 11 1 1
 14. 11 1 2
 15. 11 2 4
 16. 11 2 4
 17. 12 1 2
 18. 12 1 3
 19. 12 2 3
 20. 12 2 4 
 21. end

. list

     +--------------------+
     | id   year   income |
     |--------------------|
  1. |  9      1       10 |
  2. |  9      1        5 |
  3. |  9      1        7 |
  4. |  9      1       14 |
  5. |  9      1       18 |
     |--------------------|
  6. |  9      2       11 |
  7. |  9      2        6 |
  8. |  9      2        8 |
  9. |  9      2       15 |
 10. |  9      2       19 |
     |--------------------|
 11. | 10      1        3 |
 12. | 10      2        4 |
 13. | 11      1        1 |
 14. | 11      1        2 |
 15. | 11      2        4 |
     |--------------------|
 16. | 11      2        4 |
 17. | 12      1        2 |
 18. | 12      1        3 |
 19. | 12      2        3 |
 20. | 12      2        4 |
     +--------------------+

. 
end of do-file

.

For simplicity, let's say that id is the family identifier and income is the income of one family member. The duplicates in id and year are therefore not errors: they are the incomes of different members of the same family in the same year. What I would like to do is build a variable equal to the sum of the incomes of all family members in each year, that is, a family income indicator.
The final dataset should look like this:

Code:

     +--------------------+
     | id   year   income |
     |--------------------|
  1. |  9      1       54 |
  2. |  9      2       59 |
  3. | 10      1        3 |
  4. | 10      2        4 |
  5. | 11      1        3 |
     |--------------------|
  6. | 11      2        8 |
  7. | 12      1        5 |
  8. | 12      2        7 |
     +--------------------+
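
One way the target layout above could be produced is with collapse, which replaces the data in memory with one row per id-year combination. This is only a minimal sketch, assuming the member-level data are already in memory with exactly the variables id, year and income; any other variables would be dropped unless they are also listed in the collapse call.

```stata
* Sketch under the assumptions stated above: data in memory hold
* one row per family member with variables id, year and income.

* preserve keeps a copy of the member-level data; it is restored
* automatically when the do-file ends (or explicitly with restore).
preserve

* Sum income over all rows that share the same id and year.
* Caution: (sum) treats missing values as zero, so an id-year cell
* whose incomes are all missing comes out as 0, not missing.
collapse (sum) income, by(id year)

* Stop with an error if id and year do not uniquely identify rows.
isid id year

list, sepby(id)
```

If the member-level rows need to stay in the dataset, an alternative along the lines of `bysort id year: egen famincome = total(income)` would add the family total as a new variable instead of collapsing; famincome is a hypothetical name chosen here for illustration.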

I used this simplified dataset to sketch my problem. My actual dataset has about 97,000 observations, of which about 4,000 share their id and year with at least one other observation:

Code:

. duplicates tag id year, gen(isdup)

Duplicates in terms of id year

. tab isdup

      isdup |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     93,230       95.93       95.93
          1 |      2,128        2.19       98.12
          2 |        930        0.96       99.08
          3 |        492        0.51       99.58
          4 |        205        0.21       99.79
          5 |        126        0.13       99.92
          7 |         40        0.04       99.97
          9 |         10        0.01       99.98
         10 |         11        0.01       99.99
         12 |         13        0.01      100.00
------------+-----------------------------------
      Total |     97,185      100.00
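
The tabulation above can also be read as a distribution of family sizes, because duplicates tag stores the number of other observations that share the same id and year. The lines below are a rough sketch of how that could be used as a sanity check, assuming isdup exists as created above; nmembers and onepercell are hypothetical variable names introduced only for this illustration.

```stata
* Sketch: assumes the full dataset is in memory and isdup was
* created by "duplicates tag id year, gen(isdup)".

* isdup counts the OTHER rows with the same id and year, so the
* number of family members observed in that year is isdup + 1.
generate nmembers = isdup + 1

* Flag exactly one row per distinct id-year combination.
egen byte onepercell = tag(id year)

* The number of flagged rows is the number of rows that a
* one-row-per-family-year dataset would contain.
count if onepercell
```

Dividing each frequency in the table by isdup + 1 and adding up the results (93,230 + 1,064 + 310 + 123 + 41 + 21 + 5 + 1 + 1 + 1) suggests that this count should be 94,797 on the full data.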

Thanks in advance
