Hello,
here is a small part of my dataset. The full dataset is about 50million observations:
You can see that some observations are missing due to holidays, weekends etc..
LPERMNO datadate ajexdi prccd
10001 03jan1989 3 6.313
10001 04jan1989 3 6.25
10001 05jan1989 3 6.5
10001 06jan1989 3 6.125
10001 09jan1989 3 6.5
10001 10jan1989 3 6.5
10001 11jan1989 3 6.25
10001 12jan1989 3 6.25
10001 13jan1989 3 6.5
10001 16jan1989 3 6.5
10001 17jan1989 3 6.25
10001 18jan1989 3 6.5
10001 19jan1989 3 6.375
10001 20jan1989 3 6.25
10001 23jan1989 3 6.5
10001 24jan1989 3 6.25
Now i want this dataset to merge with another dataset by the unique identifier (LPERMNO) and by the date (datadate).
Now in my second dataset, there is always one date per year per LPERMNO.
The issue is that this date is sometimes on a holiday or weekend so i cant merge it fully with this dataset.
In this dataset i would like to generate new observations for the missing ones and just take the values from the next available date per LPERMNO.
So for example i would want the missing values 21jan1989 and 22jan1989 to be generated and to take on the values for ajexdi and prccd from 23jan1989.
So the newly, fully complete dataset would look like this:
LPERMNO datadate ajexdi prccd
10001 03jan1989 3 6.313
10001 04jan1989 3 6.25
10001 05jan1989 3 6.5
10001 06jan1989 3 6.125
10001 07jan1989 3 6.5
10001 08jan1989 3 6.5
10001 09jan1989 3 6.5
10001 10jan1989 3 6.5
10001 11jan1989 3 6.25
10001 12jan1989 3 6.25
10001 13jan1989 3 6.5
10001 14jan1989 3 6.5
10001 15jan1989 3 6.5
10001 16jan1989 3 6.5
10001 17jan1989 3 6.25
10001 18jan1989 3 6.5
10001 19jan1989 3 6.375
10001 20jan1989 3 6.25
10001 21jan1989 3 6.5
10001 22jan1989 3 6.5
10001 23jan1989 3 6.5
10001 24jan1989 3 6.25
To ensure that the newly generated observations are actually quite close to the next one and not years away (As this is stock data and they sometimes start trading again after stopping for years) i would also like to limit the "next available date" to be maximimum 30 days away from the missing date
How do i accomplish this? Ideally with a command that doesnt take hours as the dataset is quite huge but it doesn't matter if it isn't possible otherwise. Feel free to ask any clarifying questions or you might have a better solution to my problem.
Thanks
Nick
0 Response to Generating missing observations with the next available observation
Post a Comment