I have a very particular question that I have been trying to resolve for the last couple of days using the FAQ and search, but I now need some guidance.
I have a dataset of coded sentences that looks like the following (I have >2000 observations). DOC_id identifies the newspaper ad and CS_id identifies the sentence within the ad, so there can be several CS_id values per value of DOC_id. Name (SACT_name in the dataex listing) is the code for the candidate.
Code:
case_id  DOC_id  CS_id   Name
1        120700  200831  4350
2        120701  200833  4350
3        120703  200275  4350
4        120704  200276  4350
5        120705  200277  4350
6        120727  200882  4233
7        120728  200889  4233
8        120738  200980  5034
9        120738  200979  5034
10       120739  200981  5034
11       120739  200982  5034
12       120740  200984  4210
13       120740  200983  4210
14       120741  200985  4210
15       120741  200986  4210
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float case_id double(DOC_id CS_id SACT_name)
1 120700 200831 4350
2 120701 200833 4350
3 120703 200275 4350
4 120704 200276 4350
5 120705 200277 4350
6 120727 200882 4233
7 120728 200889 4233
8 120738 200980 5034
9 120738 200979 5034
10 120739 200981 5034
11 120739 200982 5034
12 120740 200984 4210
13 120740 200983 4210
14 120741 200985 4210
15 120741 200986 4210
end
Ideally, I need to expand the observations according to how many times each DOC_id should repeat: e.g. 120700 would repeat twice, 120701 would not repeat, and 120738, which already appears three times, would have another 5 repetitions per cluster, so 3 x 5 = 15. The repetition counts are stored in a separate dataset. I know that a simple expandcl can help me, but I face the following problems:
1) As I said, the repetition counts are external, so for now I have to write the expandcl command for each DOC_id by hand, and I would like to avoid doing that for this many observations.
2) I need to assign a new DOC_id to each new (set of) observations. The new variable generated by expandcl numbers the clusters but does not distinguish old observations from new ones. My aim is to see how many ads a candidate has published, including repeated ones, and having the same DOC_id for duplicate ads is not informative.
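To make the two problems concrete, here is a minimal sketch of what I am after, under assumptions that may not match the real data. It assumes the external dataset has one row per DOC_id and a variable I call n_rep here (a hypothetical name) holding the total number of copies of that ad; the three lookup values are invented for illustration. It also assumes CS_id is unique within DOC_id. It uses merge and plain expand rather than expandcl, and builds the new identifier with egen group(); new_DOC_id, copy, first and n_ads are likewise names made up for the sketch.

```stata
* Hypothetical lookup data: one row per DOC_id, n_rep = total copies wanted
* (values invented for illustration)
clear
input double DOC_id byte n_rep
120700 2
120701 1
120738 5
end
tempfile repeats
save `repeats'

* A few rows of the main data from the listing above
clear
input float case_id double(DOC_id CS_id SACT_name)
1 120700 200831 4350
2 120701 200833 4350
8 120738 200980 5034
9 120738 200979 5034
end

* Attach the repeat count to every sentence of the ad
merge m:1 DOC_id using `repeats', keep(master match) nogenerate
replace n_rep = 1 if missing(n_rep)   // ads absent from the lookup are kept once

* Make n_rep copies of every sentence, so each ad is duplicated as a whole
expand n_rep

* Number the copies of each sentence: copy 1, 2, ... within DOC_id and CS_id
bysort DOC_id CS_id: gen copy = _n

* One new identifier per (original ad, copy) pair
egen long new_DOC_id = group(DOC_id copy)

* Count distinct ads per candidate, repeated ads included
egen byte first = tag(new_DOC_id)
egen n_ads = total(first), by(SACT_name)

sort new_DOC_id CS_id
list, sepby(new_DOC_id)
```

With these made-up counts the four input rows become 13 rows in 8 distinct ads. Keeping DOC_id and copy alongside new_DOC_id preserves the link to the original ad, and copy > 1 flags the duplicated sets.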
I know there may be an easy solution to this problem, but so far searching for the answer has not helped me much. I would be grateful for your suggestions.