I am new to this forum, so sorry if this question has been asked before (tried searching for it but could not find anything similar to my issue). I am currently writing my master thesis using insurance data and I'm having some difficulties using the relevant variables in the dataset because of how it is organized. What i would like to do is merge all the "insurance_policy_ID" observations that are duplicates into just 1 observation while at the same time summing the "insurance_premium_amount". I don't need the "coverage_ID" variable to be organized so that the final merged observations will display the "highest" coverage within the policy ID. If I just had a couple thousand observations, it would take time to merge them manually but it would be doable, but i have over 6 million observations (The dataset is from one of the biggest insurance companies in my country and its over a year of insurance data).
Simple illustration of how the data is displayed:
insurance_policy_ID | product_ID | coverage_ID | insurance_premium_amount |
12345 | Car | Sub coverage 1 (f.ex required) | 100 |
12345 | Car | Sub coverage 2 (f.ex partial coverage) | 400 |
12345 | Car | Sub coverage 3 (f.ex comprehensive coverage) | 700 |
12345 | Car | Sub coverage 4 (F.ex full comprehensive coverage ("top" insurance)) | 1000 |
How i would want it to look:
insurance_policy_ID | product_ID | coverage_ID | insurace_premium_amount |
12345 | Car | Sub coverage x (not important) | 2200 |
Thanks in advance for any help!
Tor
0 Response to Question: Merging observations in one variable while simultaneously summing the numbers of the related observations in another variable
Post a Comment