I am using Stata 17/SE on Mac and am trying to generate subsets of my panel data based on dates.
For context, I have 50 years of tennis data... around 6 million observations where each observation is one match. Variables include things like tournament number, tournament start/end dates, player name, score, etc..
I am attempting to construct my own rank within the dataset. Still, I need a little help and can't find similar examples.
My objective is to:
* Generate points won (based on tournament type, what round, whether the player won or lost, etc.).. this is the points awarded for that particular match/observation
* Calculate the player's start and end date (e.g., career period)
* Generate subsets based on mm-yy to only include players who were active within this given month (this will give approx 540 smaller datasets...daily would be better but would be way too many files)
* Take the monthly datasets back into one dataset
* Then generate a 12-month cumulative sum of points
* Collapse into a player-level dataset, taking the max cumulative sum
* Then merge onto the larger panel dataset using m:1 name_merge and month variable
Noting that the first tournament is 06jan1976 and the final tournament date is 27dec2022.
So far, I have the player points and career dates.
However, what I am struggling with is how to generate a loop to capture each month in its own dataset and then generate the cumulative sum.
E.g., Should I be starting with something like the following:
Setting up:
Code:
gen month = month(tourn_date_start) gen year = year(tourn_date_start) gen date_monthly = ym(year,month) format %tm date_monthly
Code:
forval m=1/12 { forval y=1976/2022 { keep if month == `m' & year == `y' gen active = inrange(tourn_date_start, career_start, career_end) ....??? save `m'-`y', replace } }
Code:
cd "...." clear append using `: dir . files "*.dta"'
...
Noting that I likely will need to save these files to my drive (rather than temporary files) as I am using my little Macbook and it struggles at the best of times, haha.
I am a little lost in some areas... Any help is appreciated!
Data is as per the below (noting this is only for the first month/year given the size of the data hard to find a subset to show the context..):
Code:
* Example generated by -dataex-. For more info, type help dataex clear input int tourn_date_start float month int year float date_monthly str64 name_merge float(career_start career_end est_points) 5868 1 1976 192 "jiri hrebec" 5857 7969 2 5868 1 1976 192 "jan kodes" 5857 13639 2 5868 1 1976 192 "marty riessen" 5857 9181 2 5868 1 1976 192 "dick stockton" 5849 10734 32 5868 1 1976 192 "bob giltinan" 5857 7299 . 5868 1 1976 192 "allan stone" 5857 7299 . 5868 1 1976 192 "bob lutz" 5857 10363 32 5868 1 1976 192 "erik van dillen" 5863 10363 1 5868 1 1976 192 "dick stockton" 5849 10734 . 5868 1 1976 192 "ilie nastase" 5857 14668 2 5868 1 1976 192 "sherwood stewart" 5849 11032 1 5868 1 1976 192 "jan kodes" 5857 13639 2 5868 1 1976 192 "billy martin" 5849 8354 . 5868 1 1976 192 "arthur ashe" 5849 10363 . 5868 1 1976 192 "sandy mayer" 5849 10734 2 5868 1 1976 192 "bjorn borg" 5857 12365 64 5868 1 1976 192 "bob lutz" 5857 10363 . 5868 1 1976 192 "tom okker" 5849 10363 4 5868 1 1976 192 "jaime fillol" 5868 15627 . 5868 1 1976 192 "mark cox" 5849 10363 . 5868 1 1976 192 "dick crealy" 5857 10363 . 5868 1 1976 192 "frew mcmillan" 5849 10363 . 5868 1 1976 192 "tom gorman" 5849 14668 . 5868 1 1976 192 "rod laver" 5868 10734 . 5868 1 1976 192 "dennis ralston" 5849 8278 2 5868 1 1976 192 "aleksandr metreveli" 5849 6905 . 5868 1 1976 192 "brian gottfried" 5849 10804 4 5868 1 1976 192 "karl meiler" 5857 9138 1 5868 1 1976 192 "tom okker" 5849 10363 1 5868 1 1976 192 "rod laver" 5868 10734 32 5868 1 1976 192 "vitas gerulaitis" 5849 10741 2 5868 1 1976 192 "ismail el shafei" 5868 10363 . 5868 1 1976 192 "vitas gerulaitis" 5849 10741 2 5868 1 1976 192 "jan kodes" 5857 13639 1 5868 1 1976 192 "jan kodes" 5857 13639 1 5868 1 1976 192 "allan stone" 5857 7299 . 5868 1 1976 192 "ray moore" 5868 9166 . 5868 1 1976 192 "fred stolle" 5868 8984 . 5868 1 1976 192 "wojtek fibak" 5857 11525 . 5868 1 1976 192 "ilie nastase" 5857 14668 1 5868 1 1976 192 "frew mcmillan" 5849 10363 1 5868 1 1976 192 "dennis ralston" 5849 8278 64 5868 1 1976 192 "jiri hrebec" 5857 7969 . 5868 1 1976 192 "charlie pasarell" 5857 8452 2 5868 1 1976 192 "dennis ralston" 5849 8278 2 5868 1 1976 192 "jiri hrebec" 5857 7969 1 5868 1 1976 192 "jan kodes" 5857 13639 . 5868 1 1976 192 "geoff masters" 5849 8030 1 5868 1 1976 192 "tom okker" 5849 10363 . 5868 1 1976 192 "harold solomon" 5849 11403 . 5868 1 1976 192 "andrew pattison" 5849 8655 . 5868 1 1976 192 "billy martin" 5849 8354 . 5868 1 1976 192 "bjorn borg" 5857 12365 2 5868 1 1976 192 "kim warwick" 5849 13658 . 5868 1 1976 192 "stan smith" 5857 14668 . 5868 1 1976 192 "jimmy connors" 5863 14668 . 5868 1 1976 192 "jeff borowiak" 5857 10734 . 5868 1 1976 192 "dennis ralston" 5849 8278 1 5868 1 1976 192 "sandy mayer" 5849 10734 1 5868 1 1976 192 "andrew pattison" 5849 8655 . 5868 1 1976 192 "allan stone" 5857 7299 . 5868 1 1976 192 "vijay amritraj" 5849 12057 . 5868 1 1976 192 "allan stone" 5857 7299 2 5868 1 1976 192 "tom okker" 5849 10363 . 5868 1 1976 192 "jimmy connors" 5863 14668 . 5868 1 1976 192 "jiri hrebec" 5857 7969 . 5868 1 1976 192 "frew mcmillan" 5849 10363 2 5868 1 1976 192 "jan kodes" 5857 13639 . 5868 1 1976 192 "harold solomon" 5849 11403 . 5868 1 1976 192 "bob hewitt" 5849 7634 . 5868 1 1976 192 "dick stockton" 5849 10734 1 5868 1 1976 192 "tom okker" 5849 10363 2 5868 1 1976 192 "erik van dillen" 5863 10363 1 5868 1 1976 192 "brian gottfried" 5849 10804 . 5868 1 1976 192 "bob giltinan" 5857 7299 1 5868 1 1976 192 "dick crealy" 5857 10363 . 5868 1 1976 192 "tom gorman" 5849 14668 4 5868 1 1976 192 "arthur ashe" 5849 10363 . 5868 1 1976 192 "rod laver" 5868 10734 32 5868 1 1976 192 "karl meiler" 5857 9138 . 5868 1 1976 192 "stan smith" 5857 14668 . 5868 1 1976 192 "dick stockton" 5849 10734 . 5868 1 1976 192 "wojtek fibak" 5857 11525 1 5868 1 1976 192 "charlie pasarell" 5857 8452 . 5868 1 1976 192 "roscoe tanner" 5857 15187 . 5868 1 1976 192 "tom okker" 5849 10363 . 5868 1 1976 192 "bob lutz" 5857 10363 4 5868 1 1976 192 "bob hewitt" 5849 7634 2 5868 1 1976 192 "tom gorman" 5849 14668 2 5868 1 1976 192 "jimmy connors" 5863 14668 1 5868 1 1976 192 "ismail el shafei" 5868 10363 . 5868 1 1976 192 "bjorn borg" 5857 12365 . 5868 1 1976 192 "roscoe tanner" 5857 15187 2 5868 1 1976 192 "dick stockton" 5849 10734 64 5868 1 1976 192 "brian gottfried" 5849 10804 2 5868 1 1976 192 "erik van dillen" 5863 10363 . 5868 1 1976 192 "marty riessen" 5857 9181 . 5868 1 1976 192 "tony roche" 5868 10734 . 5868 1 1976 192 "cliff richey" 5849 14668 . 5868 1 1976 192 "alex metreveli" 5849 6905 1 end format %td tourn_date_start format %tm date_monthly format %td career_start format %td career_end
0 Response to Splitting a dataset into subsets based on active dates
Post a Comment