Predicting Loan Defaults with a Supervised Machine Learning Model
Nabil Salehiyan, Algorithm Engineer
Reyhaneh Hosseinpour, Software Engineer
Shuang Qi, Test Engineer
The University of Texas at Dallas
Neural Net Mathematics
Dr. Richard Golden
Abstract
Peer-to-peer (P2P) lending platforms have grown rapidly around the world for over a decade because they offer easier access to credit than traditional bank loans. However, P2P lending carries the downsides of unsecured loans, such as loan defaults (e.g., overdue loans). It is therefore important for loan providers to have sound strategies for deciding whether or not to lend to a client. This process can be automated with a supervised machine learning method, in which we take a large dataset of previous borrowers and predict whose loan will go into default, saving time and money for loan providers.
We use publicly available data from LendingClub.com covering 2007 to 2010, which contains four continuous independent variables and one binary dependent variable across 9579 samples. Preprocessing of the data included normalizing continuous variables such as the FICO score and the debt-to-income (DTI) ratio. Recoding was also necessary for our output variable, where we had to decide whether our algorithm worked better with binary codes or character labels.
After normalizing the data, extracting the input patterns, and deploying empirical risk minimization, we initialized our theta vector and ran the gradient descent algorithm. Each estimated theta value corresponds to the correlation between an input variable and the output. Although our results suggest the model needs higher accuracy, likely due to its architecture, the implications for the loan industry are clear: an optimized model can help loan providers efficiently predict how a borrower will honor their contract.
Introduction
The peer-to-peer (P2P) lending platform has become popular all around the world in
recent years, as an alternative financial service mode to traditional bank loans (Bachmann et al.,
2011). Through a P2P platform, borrowers can access cheaper loans online, with fewer requirements (e.g., a high credit score), instead of going through a bank's rigorous loan process. However, as P2P lending has developed, loan defaults have kept emerging due to the lack of a proper credit rating system, laws, and regulations, underscoring the need for a reliable credit risk rating system (Sun, 2020). "Lending Club" is the biggest P2P lending platform in the United States ("LendingClub," 2022), and it is where the data for our model was collected.
Recently, evaluation approaches such as Machine Learning (ML) are used to predict the
probability of a borrower’s loan performance and default risk (Carmichael, 2014; Croux et al.,
2020; Emekter et al., 2015; Serrano-Cinca et al., 2015; Singh et al., 2021). For example,
Carmichael (2014) reported that defaulting was determined by the FICO score levels, recent
inquiries, income, and loan purposes using a discrete hazard time model. Croux et al. (2020)
used logistic regression to show the correlation between default risk and loan purposes, loan
maturity, homeownership, occupation, etc. Emekter et al. (2015) found that a lower probability of default was associated with a higher FICO score and a lower debt-to-income ratio. A study
from Serrano-Cinca et al. (2015) indicated that significant factors determining loan default
included loan purpose, applicant income, current housing situation, and level of indebtedness.
Singh et al. (2021) highlighted the importance of loan duration, loan amount, age, and income in
loan default using different classification algorithms in ML.
Based on the work mentioned above, we built a logistic regression model to analyze the factors that are both potentially most predictive and most frequently evaluated, including the expected monthly installments owed by the borrower, the borrower's self-reported annual income, credit score (FICO), and the number of days the borrower has had a credit line. We expected to see how each of these factors contributes to whether the borrower pays back the loan on time, and to measure the model's prediction accuracy.
Methods
A total of 9579 samples are included in this study, as shown in Figure 1, collected from LendingClub.com between 2007 and 2010. Our simplified data set consisted of four variables: installment (the monthly installments owed by the borrower if the loan is funded), log.annual.inc (the natural log of the self-reported annual income of the borrower), fico (the FICO credit score of the borrower), and days.with.cr.line (the number of days the borrower has had a credit line).
These four variables appeared to make the best predictions given a simplified data set, and they are also among the most commonly used variables in banking; all four are almost always required before a loan provider will issue a loan. Generally, the better your FICO score, the more likely you are to pay off a loan, since this score reflects your history of repayments and your reputation with credit companies. A higher annual income also usually indicates a higher chance of repaying the loan, as someone with more money is more likely to pay back what they owe. The days with a credit line measure how long the borrower has had access to a credit line, which usually reflects an established habit of paying back money that is owed. Lastly, a higher installment amount can be expected to correlate with less loan repayment, as someone who has more money to pay back can be assumed to be less motivated to repay.
Figure 1. Summary of the count of borrowers who paid their loans in full or not
To begin creating our model, data preprocessing was essential to ensuring proper predictions. This is one of the most important steps: we transform the data into a format that can be processed more easily and effectively by the machine learning model, to ensure accurate results. To preprocess the data, a profiling pass was performed to gather statistics about data quality and characteristics. The dimensionality of the data was reduced to remove features that are not relevant to this particular ML task. As part of preprocessing, all of the variables were standardized. Data normalization rescales variables so that features measured on very different scales (such as FICO score and installment amount) contribute comparably to the model; in other words, its goal is to make values comparable across all records. Data validation was the last step of preprocessing. At this stage, the data is split into two sets: the first set is used to train the machine learning model, and the second set is the testing data used to gauge the accuracy and robustness of the resulting model.
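A minimal sketch of this preprocessing pipeline, assuming scikit-learn and the loan_data.csv file and column names used in the Appendix:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load the simplified data set (file path is an assumption)
loan = pd.read_csv("loan_data.csv")
X = loan.drop(["not_fully_paid"], axis=1)   # four continuous predictors
y = loan["not_fully_paid"]                  # binary outcome

# standardize each predictor to zero mean and unit variance
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# hold out a testing split for validating the trained model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)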
To make our prediction model, we use a logistic regression network architecture with zero hidden layers. Logistic regression is a statistical method for analyzing a dataset in which the dependent variable is dichotomous. It models the relationship between one binary dependent variable and several independent variables, which are multiplied by weights and summed. This sum is passed through the sigmoid function to produce a value between 0 and 1: values above 0.5 are classified as 1, and values at or below 0.5 as 0. Optimization techniques are used to estimate the regression weights.
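A small sketch of this forward pass, assuming theta holds the four weights followed by the intercept:

import numpy as np

def predict(s, theta):
    # weighted sum of the input pattern plus the intercept
    psi = np.dot(s, theta[:-1]) + theta[-1]
    # sigmoid maps the sum into (0, 1)
    y_hat = 1.0 / (1.0 + np.exp(-psi))
    # threshold at 0.5 to obtain a binary prediction
    return 1 if y_hat > 0.5 else 0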
The data generating process generates an event that is detected as a feature vector by the supervised learning machine. The feature vector consists of an input pattern and a desired response. The learning machine uses the input pattern and its internal knowledge state to produce a response $\hat{y}$. A discrepancy between $\hat{y}$ and the desired response $y$ results in an update to the learning machine's internal knowledge state.
Logistic regression is best suited to cases where the output variable is expected to be binary. The model estimates the probability of $y = 1$ given an input pattern $s$ and parameter vector $\theta$ as:

$$\hat{y}(s, \theta) = \left[1 + \exp\left(-\hat{\psi}(s, \theta)\right)\right]^{-1}$$

where

$$\hat{\psi}(s, \theta) = \theta^{T}\left[s^{T}, 1\right]^{T}$$
The discrepancy function $D$ is chosen such that

$$D(\hat{y}, y) = -\left[y \log \hat{y}(s, \theta) + (1 - y) \log\left(1 - \hat{y}(s, \theta)\right)\right]$$
With this choice of the discrepancy function $D$, the empirical risk function $\hat{\ell}_{n}(\theta)$ is defined for the training data set $\{(s_{1}, y_{1}), \ldots, (s_{n}, y_{n})\}$ such that:

$$\hat{\ell}_{n}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[y_{i} \log \hat{y}(s_{i}, \theta) + (1 - y_{i}) \log\left(1 - \hat{y}(s_{i}, \theta)\right)\right]$$
The parameters of the logistic regression model correspond to a minimizer of $\hat{\ell}_{n}(\theta)$, with the loss function

$$c([s, y], \theta) = -\left[y \log \hat{y}(s, \theta) + (1 - y) \log\left(1 - \hat{y}(s, \theta)\right)\right]$$
One of the optimization methods used to find the best regression weights in logistic regression is gradient descent. This algorithm updates a set of parameters in an iterative manner to minimize an error function (Golden, 2020). Batch gradient descent uses the entire training set to perform a single parameter update in each iteration.

The proper way to create a gradient descent algorithm is as follows. The first step is to construct a smooth objective function that is twice continuously differentiable. Then the gradient descent algorithm is derived, with $\theta(0)$ being the initial guess:

$$\theta(t + 1) = \theta(t) - \gamma_{t} \nabla \hat{\ell}_{n}\left(\theta(t)\right)$$

This algorithm generates a sequence of parameter estimates. The step sizes $\gamma_{t}$ are chosen such that

$$\hat{\ell}_{n}\left(\theta(t + 1)\right) \leq \hat{\ell}_{n}\left(\theta(t)\right)$$

for all $t$.

The next step is to derive the stopping criterion. Let $\epsilon$ be a small positive number. If

$$\left|\nabla \hat{\ell}_{n}\left(\theta(t)\right)\right|_{\infty} < \epsilon,$$

then stop iterating.
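Putting the update rule and stopping criterion together, here is a minimal batch gradient descent sketch; the step size gamma and tolerance eps are illustrative defaults, not the tuned values:

import numpy as np

def gradient_descent(S, y, theta0, gamma=0.2, eps=1e-3, max_iters=10000):
    theta = theta0
    for t in range(max_iters):
        y_hat = 1.0 / (1.0 + np.exp(-S @ theta))
        grad = S.T @ (y_hat - y) / len(y)   # gradient of the empirical risk
        if np.max(np.abs(grad)) < eps:      # stopping criterion
            break
        theta = theta - gamma * grad        # descent update
    return theta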
The best way to evaluate the performance of an algorithm is to make predictions for new data (test data) to which you already know the answers. For this, we split up our dataset and created useful estimates of the performance of our algorithm. Step 1: train the algorithm on the first part (67% of the data). Step 2: make predictions on the second part (the remaining 33%). Step 3: evaluate the predictions against the expected results.
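A sketch of step 3, comparing the predicted labels against the held-out answers:

import numpy as np

def accuracy(y_true, y_pred):
    # fraction of test predictions that match the known answers
    return np.mean(np.equal(y_true, y_pred))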
Results
Table 1. Coefficients of the four main variables and the intercept

variable          coefficient
days.with.cr      0.779578
fico score        0.004776
log.annual.inc    0.071965
installment       -1.098573
intercept         -0.732803

As shown in Table 1, the coefficients from the trained model are: days.with.cr (0.78), fico score (0.005), log.annual.inc (0.07), and installment (-1.10). There is a positive coefficient (0.78) for days with a credit line, suggesting that a larger number of days with a credit line results in a higher chance of the loan being sent to collections. The coefficient for FICO score is very small, suggesting a slightly positive but essentially negligible relationship between credit score and default status in this model, even though a higher credit score would ordinarily be expected to go with more paid-off loans. For the annual income variable, we also see a small positive coefficient of 0.07, meaning a higher annual income is associated with a slightly higher chance of default in this model. Lastly, the installment amount was the only variable with a negative coefficient, suggesting that a larger installment is associated with a lower chance of the loan going into default. The overall accuracy of our model is 0.85.
As shown in Figure 2, the gradient norm decreases over iterations, indicating that gradient descent is converging on the theta values.

Figure 2. Gradient descent of the gradient norm
As shown in Figure 3, the plot of the loss over iterations for the training and test sets shows that the test set loss decreased more than the training set loss.

Figure 3. Gradient descent of the loss for the training and testing data
Given the estimated coefficients, the fitted logistic model is:

$$\hat{y}(s, \theta) = \left[1 + \exp\left(-\left(-0.733 + 0.780\, s_{1} + 0.005\, s_{2} + 0.072\, s_{3} - 1.099\, s_{4}\right)\right)\right]^{-1}$$

where $s_{1}, \ldots, s_{4}$ are the standardized values of days.with.cr, fico score, log.annual.inc, and installment, respectively.
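As an illustration, the fitted coefficients from Table 1 can be used to score a new borrower; the standardized feature values below are hypothetical, not drawn from the data:

import numpy as np

# weights for days.with.cr, fico, log.annual.inc, installment (Table 1)
theta = np.array([0.779578, 0.004776, 0.071965, -1.098573])
intercept = -0.732803

s = np.array([0.5, -1.2, 0.3, 0.8])   # hypothetical standardized inputs
p_default = 1.0 / (1.0 + np.exp(-(np.dot(s, theta) + intercept)))
print(round(p_default, 3))            # predicted probability of default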
Discussion
This study explores the relationship of four determinant variables and default of loans in
P2P lending settings utilizing 9579 samples of individually approved loans from the
LendingClub consumer platform from 2007 to 2010. The four variables are the expected monthly
installments owed by the borrower, self-reported annual income, credit score (FICO), and the
number of days with a credit line. The findings are mainly as follows: with an accuracy of around 85% on our test set, the model shows that days with a credit line, annual income, and FICO score have positive correlations with default status, while the installment amount was the only negative correlation. These results indicate which variables loan providers could use to predict whether a borrower's loan will be sent to a collection agency.
As mentioned earlier, credit risk rating is a priority on P2P lending platforms in order to avoid potential loan defaults (Sun, 2020). Our findings can therefore help P2P lenders make faster, better-informed decisions that could save millions of dollars in the long run. As suggested by our findings, Lending Club lenders should pay more attention to the borrower's days of credit history and the installment amount, taking actions such as learning more about the borrower's credit history (especially in countries without well-developed credit records) or finding more creative ways to lower a borrower's default rate. In addition, borrowers with relatively higher incomes or FICO scores might not participate in the P2P market; providing greater incentives to attract borrowers with a lower probability of loan default would possibly decrease the default risk (Emekter et al., 2015).
Future directions for this study include using a larger dataset with more complex variables. The simplified data set in our model imposed limits on how much it could be used in a real-world scenario. More work in this area is important not only for loan providers but also for the broader economy.
References
Bachmann, A., Becker, A., Buerckner, D., Hilker, M., Kock, F., Lehmann, M., Tiburtius, P., &
Funk, B. (2011). Online Peer-to-Peer Lending: A Literature Review. Journal of Internet
Banking and Commerce, 16(2), 1–18.
Carmichael, D. (2014). Modeling Default for Peer-to-Peer Loans (SSRN Scholarly Paper No.
2529240). https://doi.org/10.2139/ssrn.2529240
Croux, C., Jagtiani, J., Korivi, T., & Vulanovic, M. (2020). Important factors determining
Fintech loan default: Evidence from a lendingclub consumer platform. Journal of
Economic Behavior & Organization, 173, 270–296.
https://doi.org/10.1016/j.jebo.2020.03.016
Emekter, R., Tu, Y., Jirasakuldech, B., & Lu, M. (2015). Evaluating credit risk and loan
performance in online Peer-to-Peer (P2P) lending. Applied Economics, 47(1), 54–70.
https://doi.org/10.1080/00036846.2014.962222
Golden, R. M. (2020). Statistical Machine Learning: A Unified Framework. Chapman and
Hall/CRC. https://doi.org/10.1201/9781351051507
LendingClub. (2022). In Wikipedia.
https://en.wikipedia.org/w/index.php?title=LendingClub&oldid=1121898345
Serrano-Cinca, C., Gutiérrez-Nieto, B., & López-Palacios, L. (2015). Determinants of Default in
P2P Lending. PLOS ONE, 10(10), e0139427.
https://doi.org/10.1371/journal.pone.0139427
Singh, V., Yadav, A., Awasthi, R., & Partheeban, G. N. (2021). Prediction of Modernized Loan
Approval System Based on Machine Learning Approach. 2021 International Conference
on Intelligent Technologies (CONIT), 1–4.
https://doi.org/10.1109/CONIT51480.2021.9498475
Sun, X. (2020). Prediction of the Borrowers’ Payback to the Loan with Lending Club Data. 2020
International Conference on Modern Education and Information Management
(ICMEIM), 375–379. https://doi.org/10.1109/ICMEIM51375.2020.00092
Appendix
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# load the simplified LendingClub data set and inspect it
loan = pd.read_csv('/home/yat-lok/Downloads/loan_data.csv')
loan.head()
loan.info()
loan.describe()
sns.pairplot(loan, hue='not_fully_paid')
loan.columns

# separate the features from the binary outcome
X1 = loan.drop(['not_fully_paid'], axis=1)
y = loan['not_fully_paid']
# normalize, then standardize, the features
x = preprocessing.normalize(X1)
x = pd.DataFrame(x, columns=X1.columns, index=X1.index)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
# keep the scaled features in a DataFrame so the model code below can
# use .index and .columns
X = pd.DataFrame(X_scaled, columns=X1.columns, index=X1.index)
# define a class for logistic regression trained by batch gradient descent
class Modeling:
    def __init__(self, theta, gamma=0.0001, max_iters=1000):
        self.gamma = gamma          # step size
        self.max_iters = max_iters  # maximum number of iterations
        self.theta = theta          # weights followed by the intercept
        self.grad = None

    def _S_one(self, x):
        # append a column of ones so the last entry of theta acts as the
        # intercept; copy to avoid mutating the caller's DataFrame
        S = x.copy()
        S["ones"] = np.ones(len(x.index))
        return S

    def _y_hat(self, S):
        # predicted probabilities: sigmoid of the weighted sums
        return self._logistic(np.dot(S, self.theta))

    def _cost_function(self, S, y):
        # per-sample cross-entropy loss
        y_hat = self._y_hat(S)
        return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    def _loss_func(self, S, y):
        # empirical risk: mean cross-entropy over the sample
        return np.mean(self._cost_function(S, y))

    def _gradient_descent_func(self, S, y):
        # unscaled gradient of the cross-entropy risk
        y_hat = self._y_hat(S)
        return np.dot(S.transpose(), (y_hat - y))

    def _gradient_iteration(self, S, y):
        # gradient of the empirical risk, scaled by 1/n
        return 1 / len(y) * self._gradient_descent_func(S, y)

    def _gradient_descent(self, S, y, S_test=None, y_test=None):
        # iterate until the infinity norm of the gradient is small enough
        # or the iteration cap is reached
        grad, loss, loss_test, thetas = [], [], [], []
        t = 0
        gradnorm = np.inf
        while gradnorm >= 0.001 and t <= self.max_iters:
            gt = self._gradient_iteration(S, y)
            self.theta = self.theta - self.gamma * gt  # descent update
            gradnorm = np.max(np.abs(gt))              # infinity norm
            grad.append(gradnorm)
            loss.append(self._loss_func(S, y))
            thetas.append(self.theta)
            if S_test is not None:
                loss_test.append(self._loss_func(S_test, y_test))
            t += 1
        return grad, loss, thetas, loss_test

    def _logistic(self, x):
        # sigmoid function
        return 1 / (1 + np.exp(-x))

    def training(self, X, y, X_test=None, y_test=None):
        S = self._S_one(X)
        S_test = self._S_one(X_test) if X_test is not None else None
        return self._gradient_descent(S, y, S_test, y_test)

    def fitting(self, X):
        # return predicted probabilities for a feature matrix
        return self._y_hat(self._S_one(X))
# train and test split (67% train, 33% test)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

def probs_to_binary_label(probs, threshold):
    # threshold predicted probabilities into 0/1 labels
    return [1 if i > threshold else 0 for i in probs]

# model training process with training data
start = time.time()
# initialize theta to all zeros (four weights plus the intercept)
init_theta = np.zeros(5)
model_train = Modeling(theta=init_theta, gamma=0.2, max_iters=10000)
grad_train, loss_train, thetas_train, loss_test = model_train.training(
    x_train, y_train, x_test, y_test)
y_hat_train_list = model_train.fitting(x_train)
y_hat_test_list = model_train.fitting(x_test)
y_hat_train_binary = probs_to_binary_label(y_hat_train_list, 0.5)
y_hat_test_binary = probs_to_binary_label(y_hat_test_list, 0.5)
end = time.time()
print(f"Time to train model and make train and test predictions: {round(end - start, 1)}s")

def accuracy(y, y_hat):
    # fraction of predictions that match the true labels
    return np.sum(np.equal(y, y_hat)) / len(y)

print(f"Test accuracy: {accuracy(y_test, y_hat_test_binary)}")
fig = plt.figure()
axes = fig.add_subplot(111)
axes.plot(loss_train)
axes.plot(loss_test)
axes.set_xlabel("Iteration")
axes.set_ylabel("Loss")
axes.legend(["Train", "Test"])
plt.title("Gradient Descent of Loss, Train and Test")
plt.show()
def grad_desc_viz(grad, ylabel, title):
    # visualize the gradient descent trajectory
    fig = plt.figure()
    axes = fig.add_subplot(111)
    axes.plot(grad)
    axes.set_xlabel("Iteration")
    axes.set_ylabel(ylabel)
    plt.title(title)
    plt.show()

grad_desc_viz(grad_train, ylabel="Gradient Norm", title="Gradient descent of Gradient Norm")
# Final thetas after gradient descent
pd.Series(model_train.theta.flatten(), index = X.columns.tolist() + ["intercept"])