Hello fellow Stata users,

I have a fairly specific question that I have been trying to resolve for the last couple of days using the FAQ and the forum search, but I now need some guidance.

I have a dataset of coded sentences that looks like the following (the full dataset has more than 2,000 observations). DOC_id identifies the newspaper ad and CS_id identifies the sentence within the ad, so one DOC_id can have several CS_id values. Name (SACT_name in the dataex code below) is the code for the candidate.
Code:
case_id   DOC_id  CS_id  Name
 1        120700 200831 4350
 2        120701 200833 4350
 3        120703 200275 4350
 4        120704 200276 4350
 5        120705 200277 4350
 6        120727 200882 4233
 7        120728 200889 4233
 8        120738 200980 5034
 9        120738 200979 5034
10        120739 200981 5034
11        120739 200982 5034
12        120740 200984 4210
13        120740 200983 4210
14        120741 200985 4210
15        120741 200986 4210
To replicate:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float case_id double(DOC_id CS_id SACT_name)
 1 120700 200831 4350
 2 120701 200833 4350
 3 120703 200275 4350
 4 120704 200276 4350
 5 120705 200277 4350
 6 120727 200882 4233
 7 120728 200889 4233
 8 120738 200980 5034
 9 120738 200979 5034
10 120739 200981 5034
11 120739 200982 5034
12 120740 200984 4210
13 120740 200983 4210
14 120741 200985 4210
15 120741 200986 4210
end


Ideally, I need to expand the observations according to how many times each DOC_id (each ad) was repeated. For example, 120700 would repeat twice, 120701 would not repeat, and 120738, which already appears three times, would be repeated 5 times per cluster, so 3 x 5 = 15 observations. The number of repetitions is stored in a separate dataset. I know that a simple expandcl can help me, but I face the following problems:

1) As I said, the repetition counts are external, so for now I would have to specify the expansion for each DOC_id manually, and I would like to avoid that with this many observations.
2) I need to assign a new DOC_id to each new set of observations. The new variable created by expandcl numbers the clusters but does not distinguish old observations from new ones. My aim is to see how many ads a candidate has published, including repeated ones, and having the same DOC_id for duplicate ads is not informative.
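
To make the goal concrete, here is a rough sketch of the result I am after, using merge and expand instead of expandcl. It is only an illustration under assumptions: the lookup variable n_copies (the total number of copies wanted per ad, 1 = no repeat) and the new variables copy, new_DOC_id, first and n_ads are hypothetical names, and the two counts in the lookup are made up. I am not sure this is the right or the best way to do it.

```stata
* Hypothetical lookup standing in for the external dataset:
* one row per DOC_id, n_copies = total copies wanted (assumed layout)
clear
input double DOC_id byte n_copies
120700 2
120738 5
end
tempfile reps
save `reps'

* A few rows from the example data above
clear
input float case_id double(DOC_id CS_id SACT_name)
1 120700 200831 4350
2 120701 200833 4350
8 120738 200980 5034
9 120738 200979 5034
end

* Attach the counts; ads missing from the lookup are kept once
merge m:1 DOC_id using `reps', keep(master match) nogenerate
replace n_copies = 1 if missing(n_copies)

* Duplicate every sentence of an ad n_copies times
expand n_copies

* Number the copies of each original row (copy 1 is the original)
bysort case_id: gen int copy = _n

* One new ad identifier per (original ad, copy) pair
egen long new_DOC_id = group(DOC_id copy)

* Count distinct ads per candidate, repeats included
egen byte first = tag(SACT_name new_DOC_id)
egen int n_ads = total(first), by(SACT_name)
```

Here copy == 1 marks the original observations and copy > 1 the added ones, which is the information I am missing from the variable that expandcl creates.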

I suspect there is an easy solution to this problem, but so far searching for an answer has not helped me much. I would be grateful for any suggestions.