Hi all,

Currently I am working with a mortgage portfolio on which I would like to perform a logistic regression.

The portfolio consists of 14 quarterly datasets, and each dataset contains all the loans at that point in time with their related characteristics/variables. Since the data is captured per loan type and not per loan, loans with multiple loan types (e.g. 3 or 5) are overrepresented relative to loans that consist of only 1 loan type (a loan with 3 loan types has 3 observations in the dataset, while a loan with 1 loan type has 1 observation). Secondly, there is an overrepresentation in the data due to the age of a loan: older loans can be observed in all 14 datasets, but newer loans occur in fewer datasets.

I merged all the datasets into 1 and reshaped the result from long to wide to overcome the over/underrepresentation caused by loan types. Ideally, I would like to have 1 observation per loan, but currently each observation (row) contains the information of a loan for one quarter, with all of its loan types in that same row (wide).

As the reshaping led to many variables, my dependent variable (behind on payment) is now split across many variables, one per loan type, and therefore I am not able to perform the logit regression.
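
To make the structure concrete, here is a minimal sketch in Python (pandas) of the merge and reshape on toy data. The column names (loan_id, loan_type, quarter, behind) and all values are made up for illustration; they are not my actual variables or data, and only 2 quarters are shown instead of 14.

```python
import pandas as pd

# Hypothetical toy data: two quarterly datasets, one row per loan type.
q1 = pd.DataFrame({
    "loan_id": [1, 1, 1, 2],
    "loan_type": ["A", "B", "C", "A"],
    "behind": [0, 1, 0, 0],  # 1 = behind on payment
})
q2 = pd.DataFrame({
    "loan_id": [1, 1, 1, 2, 3],
    "loan_type": ["A", "B", "C", "A", "A"],
    "behind": [0, 0, 0, 1, 0],
})

# Merge (stack) the quarterly datasets, tagging each row with its quarter.
long = pd.concat(
    [q1.assign(quarter=1), q2.assign(quarter=2)],
    ignore_index=True,
)

# Reshape long to wide: one row per loan and quarter,
# one "behind" column per loan type.
wide = (
    long.set_index(["loan_id", "quarter", "loan_type"])["behind"]
    .unstack("loan_type")
    .add_prefix("behind_")
    .reset_index()
)

print(wide)
```

After the reshape, the payment indicator sits in one column per loan type (behind_A, behind_B, behind_C), loans without a given loan type have missing values in those columns, and there is still one row per loan per quarter rather than one row per loan.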

Any suggestions on how I could tackle these issues and restructure the dataset in order to perform a logistic regression?


Thank you in advance,

Django