Hello everyone!
I have been removing duplicated observations for a long time, using various methods I found by googling.
However, I have realized that my approach is not efficient, because it usually takes me a very long time to detect and remove the duplicates.
Sometimes I have to inspect the duplicated observations and choose which one to keep and which ones to delete. This is time-consuming when there are thousands of duplicated observations.
So I was wondering whether there is a solution that, for each id-year, keeps only the observation with the most data available and removes the others.

For example, in the attached picture, I want to build a panel dataset identified by "cusipid" and "fyear".
For each cusipid-fyear, there is more than one observation.


[Attached picture: example of duplicated cusipid-fyear observations]

The code I used is as follows:

Code:
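* isdup = number of other observations sharing the same cusipid-fyear (0 if unique)
duplicates tag cusipid fyear, gen(isdup)
* open the Data Editor on the duplicated observations only
edit if isdup
* move isdup to the first column
order isdup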

Then I have to select by hand which observation has more data available and should be kept.

I cannot simply keep the first observation and drop the rest because, in the case of cusipid = 267 and fyear = 2007, I actually need the second one, where the other variables equal 0.
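
To make the goal concrete, here is a rough sketch of what I have in mind. It has not been run on my real data and it rests on assumptions: "most data available" is taken to mean the largest number of non-missing values across all variables other than cusipid and fyear, zeros count as available data, and nonmiss is just a name I made up for the helper variable. If two observations in the same cusipid-fyear have the same count, the one that survives is arbitrary.

Code:
* list every variable except the two panel identifiers
ds cusipid fyear, not
* count non-missing values per observation (strok allows string variables too)
egen nonmiss = rownonmiss(`r(varlist)'), strok
* within each cusipid-fyear, sort by that count and keep the fullest observation
bysort cusipid fyear (nonmiss): keep if _n == _N
drop nonmiss

Something along these lines would still need a proper rule for ties, and I am not sure that counting non-missing values is the right definition of "more data".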

So, is there any code that selects the observation with the most data available and keeps it as the only observation for each cusipid-fyear?

Thank you very much!

Kind regards
Shengze