Hello Statalist,

I'm currently struggling with a large dataset of around 7,000,000 observations of Swedish companies from 2004 to 2017.

The companies will be divided into two periods: Period=1 is 2010-2017 and Period=0 is 2004-2009.
They will also be divided by whether they can opt out of an audit: Audit=1 means they cannot opt out, and Audit=0 means they can.

Each observation is one company's annual report for one year, so a company usually appears in several observations if it has submitted an annual report for more than one year (please see http://prntscr.com/rh3vcc for an extract from my data).

What I want to do is take a random sample of 1,500 unique companies by Period and Audit. That is, no company should be drawn into the sample more than once, but I still want to keep all the "duplicate" yearly observations of each sampled company. I need this because I will have to manually check each company and all of its annual reports in another database, so I need to know which year each annual report in my Stata dataset is from.
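
To make the goal concrete, the sampling described above could be sketched roughly as follows. This is only an illustration under assumptions that are not stated in the post: the variable names company_id, period and audit are hypothetical (Stata is case-sensitive, so they would need to match the real names), the seed is arbitrary, and "1,500" is read as 1,500 companies per combination of period and audit.

```stata
* Hypothetical variable names: company_id, period, audit (adjust to the real data)
set seed 12345                      // arbitrary seed, only for reproducibility

preserve
    * Reduce to one row per company within each period/audit cell
    bysort period audit company_id: keep if _n == 1

    * Put the companies in random order within each cell and keep the first 1,500
    generate double u = runiform()
    bysort period audit (u): keep if _n <= 1500

    * Store the list of sampled companies
    keep company_id period audit
    tempfile sampled
    save `sampled'
restore

* Keep every yearly observation of the sampled companies in their cell
merge m:1 company_id period audit using `sampled', keep(match) nogenerate
```

With this reading, a company that appears in both periods, or whose audit status changes, could be drawn in more than one cell. If each company must appear only once overall, the first step would need to keep one row per company_id instead.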

Is this possible, or do I need to rethink the whole approach?

Kind regards,

Thomas