Predicting Loan Defaults with a Supervised Machine Learning Model
Nabil Salehiyan, Algorithm Engineer
Reyhaneh Hosseinpour, Software Engineer
Shuang Qi, Test Engineer
The University of Texas at Dallas
Neural Net Mathematics
Dr. Richard Golden
Abstract
Peer-to-peer (P2P) lending platforms have grown rapidly around the world for over a decade because they offer easier access to credit than traditional bank loans. However, P2P lending carries the downsides of unsecured loans, such as loan defaults (e.g., overdue loans). It is therefore important for loan providers to have sound strategies for deciding whether or not to lend to a client. This process can be automated with a supervised machine learning method, in which we take a large dataset of previous borrowers and predict whose loan will go into default, saving time and money for loan providers.
We use publicly available data from LendingClub.com covering 2007 to 2010, which contains four continuous independent variables and one binary dependent variable across 9579 samples. Preprocessing of the data included normalizing continuous variables such as the FICO score and the debt-to-income (DTI) ratio. Recoding was also necessary for our output variable, where we had to decide whether our algorithm worked better with binary codes or character labels.
After normalizing the data, extracting the input patterns, and deploying empirical risk minimization, we initialized our theta vector and ran the gradient descent algorithm. Each estimated theta value corresponds to the correlation between an input variable and the output. Although our results suggest the model needs higher accuracy, likely due to its architecture, the implications for the loan industry are clear: an optimized model can help loan providers efficiently predict how a borrower will honor their contract.
Introduction
The peer-to-peer (P2P) lending platform has become popular all around the world in
recent years, as an alternative financial service mode to traditional bank loans (Bachmann et al.,
2011). Through a P2P platform, borrowers can access cheaper loans online, with fewer requirements (e.g., a high credit score), instead of going through a bank's rigorous loan process. However, as P2P lending has developed, loan defaults have kept emerging due to the lack of a proper credit rating system, laws, and regulations, underscoring the need for a reliable credit risk rating system (Sun, 2020). "Lending Club" is the biggest P2P lending platform in the United States ("LendingClub," 2022), and it is where the data for our model was collected.
Recently, evaluation approaches such as Machine Learning (ML) are used to predict the
probability of a borrower’s loan performance and default risk (Carmichael, 2014; Croux et al.,
2020; Emekter et al., 2015; Serrano-Cinca et al., 2015; Singh et al., 2021). For example,
Carmichael (2014) reported that defaulting was determined by the FICO score levels, recent
inquiries, income, and loan purposes using a discrete hazard time model. Croux et al. (2020)
used logistic regression to show the correlation between default risk and loan purposes, loan
maturity, homeownership, occupation, etc. Emekter et al. (2015) found that a lower probability of default was associated with a higher FICO score and a lower debt-to-income ratio. A study
from Serrano-Cinca et al. (2015) indicated that significant factors determining loan default
included loan purpose, applicant income, current housing situation, and level of indebtedness.
Singh et al. (2021) highlighted the importance of loan duration, loan amount, age, and income in
loan default using different classification algorithms in ML.
Based on the work mentioned above, we built a logistic regression model to analyze the factors that are both potentially most predictive and most frequently evaluated, including the expected monthly installments owed by the borrower, the borrower's self-reported annual income, credit score (FICO), and the number of days the borrower has had a credit line. We expected to see how each of these factors contributes to whether the borrower pays back the loan on time, and to measure the model's prediction accuracy.
Methods
A total of 9579 samples are included in this study, as shown in Figure 1, collected from LendingClub.com between 2007 and 2010. Our simplified data set consisted of four variables: installment (the monthly installments owed by the borrower if the loan is funded), log.annual.inc (the natural log of the self-reported annual income of the borrower), fico (the FICO credit score of the borrower), and days.with.cr.line (the number of days the borrower has had a credit line).
These four variables appeared to make the best predictions given a simplified data set, and they are also among the most commonly used variables in banking; all four are almost always required before a loan provider will issue a loan. Generally, the better your FICO score, the more likely you are to pay off a loan, since this score reflects your history of repayments and your reputation with credit companies. A higher annual income also usually indicates a higher chance of repaying the loan, as someone with more money is more likely to pay back what they owe. The days with a credit line measure how long the borrower has had access to a credit line, which usually reflects an established habit of paying back money that is owed. Lastly, a higher installment amount can be expected to correlate with less loan repayment, as someone who has more money to pay back can be assumed to be less motivated to repay.
Figure 1. Summary of the count of borrowers who paid their loans in full or not
To begin creating our model, data preprocessing was essential to ensuring proper predictions. This is one of the most important steps: we transform the data into a format that can be processed more easily and effectively by the machine learning model, to ensure accurate results. To preprocess the data, a profiling pass was performed to gather statistics about data quality and characteristics. The dimensionality of the data was reduced to remove features that are not relevant to this particular ML task. As part of preprocessing, all of the variables were standardized. Data normalization rescales variables so that features measured on very different scales (such as FICO score and installment amount) contribute comparably to the model; in other words, its goal is to make values comparable across all records. Data validation was the last step of preprocessing. At this stage, the data is split into two sets: the first set is used to train the machine learning model, and the second set is the testing data used to gauge the accuracy and robustness of the resulting model.
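A minimal sketch of this preprocessing pipeline, assuming scikit-learn and the loan_data.csv file and column names used in the Appendix:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# load the simplified data set (file path is an assumption)
loan = pd.read_csv("loan_data.csv")
X = loan.drop(["not_fully_paid"], axis=1)   # four continuous predictors
y = loan["not_fully_paid"]                  # binary outcome

# standardize each predictor to zero mean and unit variance
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# hold out a testing split for validating the trained model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)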
To make our prediction model, we use a logistic regression network architecture with zero hidden layers. Logistic regression is a statistical method for analyzing a dataset in which the dependent variable is dichotomous. It models the relationship between one binary dependent variable and several independent variables, which are multiplied by weights and summed. This sum is passed through the sigmoid function to produce a value between 0 and 1: values above 0.5 are classified as 1, and values at or below 0.5 as 0. Optimization techniques are used to estimate the regression weights.
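A small sketch of this forward pass, assuming theta holds the four weights followed by the intercept:

import numpy as np

def predict(s, theta):
    # weighted sum of the input pattern plus the intercept
    psi = np.dot(s, theta[:-1]) + theta[-1]
    # sigmoid maps the sum into (0, 1)
    y_hat = 1.0 / (1.0 + np.exp(-psi))
    # threshold at 0.5 to obtain a binary prediction
    return 1 if y_hat > 0.5 else 0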
The data generating process generates an event that is detected as a feature vector by the supervised learning machine. The feature vector consists of an input pattern and a desired response. The learning machine uses the input pattern and its internal knowledge state to produce a response $\hat{y}$. A discrepancy between $\hat{y}$ and the desired response $y$ results in an update to the learning machine's internal knowledge state.
Logistic regression is best suited to cases where the output variable is expected to be binary. The model estimates the probability of $y = 1$ given an input pattern $s$ and parameter vector $\theta$ as:

$$\hat{y}(s, \theta) = \left[1 + \exp\left(-\hat{\psi}(s, \theta)\right)\right]^{-1}$$

where

$$\hat{\psi}(s, \theta) = \theta^{T}\left[s^{T}, 1\right]^{T}$$
The discrepancy function $D$ is chosen such that

$$D(\hat{y}, y) = -\left[y \log \hat{y}(s, \theta) + (1 - y) \log\left(1 - \hat{y}(s, \theta)\right)\right]$$
With this choice of the discrepancy function $D$, the empirical risk function $\hat{\ell}_{n}(\theta)$ is defined for the training data set $\{(s_{1}, y_{1}), \ldots, (s_{n}, y_{n})\}$ such that:

$$\hat{\ell}_{n}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[y_{i} \log \hat{y}(s_{i}, \theta) + (1 - y_{i}) \log\left(1 - \hat{y}(s_{i}, \theta)\right)\right]$$
The parameters of the logistic regression model correspond to a minimizer of $\hat{\ell}_{n}(\theta)$, with the loss function

$$c([s, y], \theta) = -\left[y \log \hat{y}(s, \theta) + (1 - y) \log\left(1 - \hat{y}(s, \theta)\right)\right]$$
One of the optimization methods used to find the best regression weights in logistic regression is gradient descent. This algorithm updates a set of parameters in an iterative manner to minimize an error function (Golden, 2020). Batch gradient descent uses the entire training set to perform a single parameter update in each iteration.

The proper way to create a gradient descent algorithm is as follows. The first step is to construct a smooth objective function that is twice continuously differentiable. Then the gradient descent algorithm is derived, with $\theta(0)$ being the initial guess:

$$\theta(t + 1) = \theta(t) - \gamma_{t} \nabla \hat{\ell}_{n}\left(\theta(t)\right)$$

This algorithm generates a sequence of parameter estimates. The step sizes $\gamma_{t}$ are chosen such that

$$\hat{\ell}_{n}\left(\theta(t + 1)\right) \leq \hat{\ell}_{n}\left(\theta(t)\right)$$

for all $t$.

The next step is to derive the stopping criterion. Let $\epsilon$ be a small positive number. If

$$\left|\nabla \hat{\ell}_{n}\left(\theta(t)\right)\right|_{\infty} < \epsilon,$$

then stop iterating.
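Putting the update rule and stopping criterion together, here is a minimal batch gradient descent sketch; the step size gamma and tolerance eps are illustrative defaults, not the tuned values:

import numpy as np

def gradient_descent(S, y, theta0, gamma=0.2, eps=1e-3, max_iters=10000):
    theta = theta0
    for t in range(max_iters):
        y_hat = 1.0 / (1.0 + np.exp(-S @ theta))
        grad = S.T @ (y_hat - y) / len(y)   # gradient of the empirical risk
        if np.max(np.abs(grad)) < eps:      # stopping criterion
            break
        theta = theta - gamma * grad        # descent update
    return theta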
The best way to evaluate the performance of an algorithm is to make predictions for new data (test data) to which you already know the answers. For this, we split up our dataset and created useful estimates of the performance of our algorithm. Step 1: train the algorithm on the first part (67% of the data). Step 2: make predictions on the second part (the remaining 33%). Step 3: evaluate the predictions against the expected results.
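A sketch of step 3, comparing the predicted labels against the held-out answers:

import numpy as np

def accuracy(y_true, y_pred):
    # fraction of test predictions that match the known answers
    return np.mean(np.equal(y_true, y_pred))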
Results
Table 1. Coefficients of the four main variables and the intercept

variable          coefficient
days.with.cr      0.779578
fico score        0.004776
log.annual.inc    0.071965
installment       -1.098573
intercept         -0.732803

As shown in Table 1, the coefficients from the trained model are: days.with.cr (0.78), fico score (0.005), log.annual.inc (0.07), and installment (-1.10). There is a positive coefficient (0.78) for days with a credit line, suggesting that a larger number of days with a credit line results in a higher chance of the loan being sent to collections. The coefficient for FICO score is very small, suggesting a slightly positive but essentially negligible relationship between credit score and default status in this model, even though a higher credit score would ordinarily be expected to go with more paid-off loans. For the annual income variable, we also see a small positive coefficient of 0.07, meaning a higher annual income is associated with a slightly higher chance of default in this model. Lastly, the installment amount was the only variable with a negative coefficient, suggesting that a larger installment is associated with a lower chance of the loan going into default. The overall accuracy of our model is 0.85.
As shown in Figure 2, the gradient norm decreases over iterations, indicating that gradient descent is converging on the theta values.

Figure 2. Gradient descent of the gradient norm
As shown in Figure 3, the plot of the loss over iterations for the training and test sets shows that the test set loss decreased more than the training set loss.

Figure 3. Gradient descent of the loss for the training and testing data
Given the estimated coefficients, the fitted logistic model is:

$$\hat{y}(s, \theta) = \left[1 + \exp\left(-\left(-0.733 + 0.780\, s_{1} + 0.005\, s_{2} + 0.072\, s_{3} - 1.099\, s_{4}\right)\right)\right]^{-1}$$

where $s_{1}, \ldots, s_{4}$ are the standardized values of days.with.cr, fico score, log.annual.inc, and installment, respectively.
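As an illustration, the fitted coefficients from Table 1 can be used to score a new borrower; the standardized feature values below are hypothetical, not drawn from the data:

import numpy as np

# weights for days.with.cr, fico, log.annual.inc, installment (Table 1)
theta = np.array([0.779578, 0.004776, 0.071965, -1.098573])
intercept = -0.732803

s = np.array([0.5, -1.2, 0.3, 0.8])   # hypothetical standardized inputs
p_default = 1.0 / (1.0 + np.exp(-(np.dot(s, theta) + intercept)))
print(round(p_default, 3))            # predicted probability of default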
Discussion
This study explores the relationship of four determinant variables and default of loans in
P2P lending settings utilizing 9579 samples of individually approved loans from the
LendingClub consumer platform from 2007 to 2010. The four variables are the expected monthly
installments owed by the borrower, self-reported annual income, credit score (FICO), and the
number of days with a credit line. The findings are mainly as follows: with an accuracy of around 85% on our test set, the model shows that days with a credit line, annual income, and FICO score have positive correlations with default status, while the installment amount was the only negative correlation. These results indicate which variables loan providers could use to predict whether a borrower's loan will be sent to a collection agency.
As mentioned earlier, credit risk rating is a priority on P2P lending platforms in order to avoid potential loan defaults (Sun, 2020). Our findings can therefore help P2P lenders make faster, better-informed decisions that could save millions of dollars in the long run. As suggested by our findings, Lending Club lenders should pay more attention to the borrower's days of credit history and the installment amount, taking actions such as learning more about the borrower's credit history (especially in countries without well-developed credit records) or finding more creative ways to lower a borrower's default rate. In addition, borrowers with relatively higher incomes or FICO scores might not participate in the P2P market; providing greater incentives to attract borrowers with a lower probability of loan default would possibly decrease the default risk (Emekter et al., 2015).
Future directions for this study include using a larger dataset with more complex variables. The simplified data set in our model imposed limits on how much it could be used in a real-world scenario. More work in this area is important not only for loan providers but also for the broader economy.
References
Bachmann, A., Becker, A., Buerckner, D., Hilker, M., Kock, F., Lehmann, M., Tiburtius, P., &
Funk, B. (2011). Online Peer-to-Peer Lending: A Literature Review. Journal of Internet
Banking and Commerce, 16(2), 1–18.
Carmichael, D. (2014). Modeling Default for Peer-to-Peer Loans (SSRN Scholarly Paper No.
2529240). https://doi.org/10.2139/ssrn.2529240
Croux, C., Jagtiani, J., Korivi, T., & Vulanovic, M. (2020). Important factors determining
Fintech loan default: Evidence from a lendingclub consumer platform. Journal of
Economic Behavior & Organization, 173, 270–296.
https://doi.org/10.1016/j.jebo.2020.03.016
Emekter, R., Tu, Y., Jirasakuldech, B., & Lu, M. (2015). Evaluating credit risk and loan
performance in online Peer-to-Peer (P2P) lending. Applied Economics, 47(1), 54–70.
https://doi.org/10.1080/00036846.2014.962222
Golden, R. M. (2020). Statistical Machine Learning: A Unified Framework. Chapman and
Hall/CRC. https://doi.org/10.1201/9781351051507
LendingClub. (2022). In Wikipedia.
https://en.wikipedia.org/w/index.php?title=LendingClub&oldid=1121898345
Serrano-Cinca, C., Gutiérrez-Nieto, B., & López-Palacios, L. (2015). Determinants of Default in
P2P Lending. PLOS ONE, 10(10), e0139427.
https://doi.org/10.1371/journal.pone.0139427
Singh, V., Yadav, A., Awasthi, R., & Partheeban, G. N. (2021). Prediction of Modernized Loan
Approval System Based on Machine Learning Approach. 2021 International Conference
on Intelligent Technologies (CONIT), 1–4.
https://doi.org/10.1109/CONIT51480.2021.9498475
Sun, X. (2020). Prediction of the Borrowers’ Payback to the Loan with Lending Club Data. 2020
International Conference on Modern Education and Information Management
(ICMEIM), 375–379. https://doi.org/10.1109/ICMEIM51375.2020.00092
Appendix
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# load the simplified LendingClub data set and inspect it
loan = pd.read_csv('/home/yat-lok/Downloads/loan_data.csv')
loan.head()
loan.info()
loan.describe()
sns.pairplot(loan, hue='not_fully_paid')
loan.columns

# separate the features from the binary outcome
X1 = loan.drop(['not_fully_paid'], axis=1)
y = loan['not_fully_paid']
# normalize, then standardize, the features
x = preprocessing.normalize(X1)
x = pd.DataFrame(x, columns=X1.columns, index=X1.index)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
# keep the scaled features in a DataFrame so the model code below can
# use .index and .columns
X = pd.DataFrame(X_scaled, columns=X1.columns, index=X1.index)
# define a class for logistic regression trained by batch gradient descent
class Modeling:
    def __init__(self, theta, gamma=0.0001, max_iters=1000):
        self.gamma = gamma          # step size
        self.max_iters = max_iters  # maximum number of iterations
        self.theta = theta          # weights followed by the intercept
        self.grad = None

    def _S_one(self, x):
        # append a column of ones so the last entry of theta acts as the
        # intercept; copy to avoid mutating the caller's DataFrame
        S = x.copy()
        S["ones"] = np.ones(len(x.index))
        return S

    def _y_hat(self, S):
        # predicted probabilities: sigmoid of the weighted sums
        return self._logistic(np.dot(S, self.theta))

    def _cost_function(self, S, y):
        # per-sample cross-entropy loss
        y_hat = self._y_hat(S)
        return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

    def _loss_func(self, S, y):
        # empirical risk: mean cross-entropy over the sample
        return np.mean(self._cost_function(S, y))

    def _gradient_descent_func(self, S, y):
        # unscaled gradient of the cross-entropy risk
        y_hat = self._y_hat(S)
        return np.dot(S.transpose(), (y_hat - y))

    def _gradient_iteration(self, S, y):
        # gradient of the empirical risk, scaled by 1/n
        return 1 / len(y) * self._gradient_descent_func(S, y)

    def _gradient_descent(self, S, y, S_test=None, y_test=None):
        # iterate until the infinity norm of the gradient is small enough
        # or the iteration cap is reached
        grad, loss, loss_test, thetas = [], [], [], []
        t = 0
        gradnorm = np.inf
        while gradnorm >= 0.001 and t <= self.max_iters:
            gt = self._gradient_iteration(S, y)
            self.theta = self.theta - self.gamma * gt  # descent update
            gradnorm = np.max(np.abs(gt))              # infinity norm
            grad.append(gradnorm)
            loss.append(self._loss_func(S, y))
            thetas.append(self.theta)
            if S_test is not None:
                loss_test.append(self._loss_func(S_test, y_test))
            t += 1
        return grad, loss, thetas, loss_test

    def _logistic(self, x):
        # sigmoid function
        return 1 / (1 + np.exp(-x))

    def training(self, X, y, X_test=None, y_test=None):
        S = self._S_one(X)
        S_test = self._S_one(X_test) if X_test is not None else None
        return self._gradient_descent(S, y, S_test, y_test)

    def fitting(self, X):
        # return predicted probabilities for a feature matrix
        return self._y_hat(self._S_one(X))
# train and test split (67% train, 33% test)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

def probs_to_binary_label(probs, threshold):
    # threshold predicted probabilities into 0/1 labels
    return [1 if i > threshold else 0 for i in probs]

# model training process with training data
start = time.time()
# initialize theta to all zeros (four weights plus the intercept)
init_theta = np.zeros(5)
model_train = Modeling(theta=init_theta, gamma=0.2, max_iters=10000)
grad_train, loss_train, thetas_train, loss_test = model_train.training(
    x_train, y_train, x_test, y_test)
y_hat_train_list = model_train.fitting(x_train)
y_hat_test_list = model_train.fitting(x_test)
y_hat_train_binary = probs_to_binary_label(y_hat_train_list, 0.5)
y_hat_test_binary = probs_to_binary_label(y_hat_test_list, 0.5)
end = time.time()
print(f"Time to train model and make train and test predictions: {round(end - start, 1)}s")

def accuracy(y, y_hat):
    # fraction of predictions that match the true labels
    return np.sum(np.equal(y, y_hat)) / len(y)

print(f"Test accuracy: {accuracy(y_test, y_hat_test_binary)}")
fig = plt.figure()
axes = fig.add_subplot(111)
axes.plot(loss_train)
axes.plot(loss_test)
axes.set_xlabel("Iteration")
axes.set_ylabel("Loss")
axes.legend(["Train", "Test"])
plt.title("Gradient Descent of Loss, Train and Test")
plt.show()
def grad_desc_viz(grad, ylabel, title):
    # visualize the gradient descent trajectory
    fig = plt.figure()
    axes = fig.add_subplot(111)
    axes.plot(grad)
    axes.set_xlabel("Iteration")
    axes.set_ylabel(ylabel)
    plt.title(title)
    plt.show()

grad_desc_viz(grad_train, ylabel="Gradient Norm", title="Gradient descent of Gradient Norm")
# Final thetas after gradient descent
pd.Series(model_train.theta.flatten(), index = X.columns.tolist() + ["intercept"])