Please help me with cleaning this data:


below is my data from excel file and I have 1000s of rows more like this.
  • patientid and patientid2 are same patient id number. (a patient has many measurements of iron & calcium)
  • startdate is the date we started treatment-baseline date
  • testdate is the laboratory test date for iron or calcium
  • value is the measured value of calcium or iron concentration
Now I need concentration values at baseline for each patient for calcium and iron which is the closest date after treatment or the same date as treatment started.
& also calcium_final iron_final as the final measurements which is the latest tested measurement
calcium_final_date and iron_final_date are basically the latest test dates.

what could be the best possible approach to solve this mess?


patientid startdate patientid2 testdate parameter value calcium_base calcium_final calcium_final_date iron_base iron_final iron_final_date
1122 10/07/2016 1122 9/26/2018 iron 227
1122 11/08/2017 iron 194
1122 9/29/2017 iron 4520
1122 12/29/2016 iron 416
1122 9/30/2016 iron 3238
1122 11/21/2017 calcium 4
1122 9/28/2017 calcium 11
1122 01/09/2017 calcium 22
1122 9/29/2016 calcium 33
1127 10/20/2016 1127 10/10/2018 iron 25
1127 10/23/2017 iron 121
1127 01/09/2017 iron 94
1127 10/07/2016 iron 152
1127 10/09/2017 calcium 24
1127 01/09/2017 calcium 22
1127 10/11/2016 calcium 222
1199 4/17/2018 1199 3/28/2018 calcium 5000
1199 1/22/2018 iron 669
1289 08/11/2017 1289 6/27/2018 iron 219
1289 01/08/2018 iron 221
1289 9/21/2017 calcium 6000
1289 6/27/2018 calcium 5
1450 11/13/2015 1450 11/08/2021 iron 509
1450 5/17/2021 iron 30
1450 5/29/2020 iron 117
1450 01/08/2020 calcium 3457
1450 07/02/2019 calcium 502