Predicting Loan Defaults on the Lending Club Dataset

Introduction

We will use classifiers to predict loan defaults by borrowers. For this, we will use a real-world dataset provided by Lending Club, a fintech firm that makes its loan data publicly available. If you are interested, you can download the dataset from Kaggle via this link: Lending Club Dataset. The data is useful for analytical studies and contains hundreds of features. Examining all of them is beyond the scope of this study, so we will use only a subset for our predictions. This study will give us an idea of how real business problems are solved using EDA and Machine Learning.

We will use Jupyter-Notebook on Linux Ubuntu 22.04.

Import Libraries

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import pandas as pd
import numpy as np
import seaborn as sns
import category_encoders as ce
from scipy import stats 
import matplotlib.pyplot as plt
import hvplot.pandas
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, 
    roc_auc_score, roc_curve, auc,
    ConfusionMatrixDisplay, RocCurveDisplay
)

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization 
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC

pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

import warnings
warnings.filterwarnings("ignore")

Load Dataset

In [2]:
file_location = "/home/hduser/backup/data/accepted_2007_to_2018Q4.csv.gz"

selected_columns = [
    "id",
    "purpose",
    "term",
    "verification_status",
    "acc_now_delinq",
    "addr_state",
    "annual_inc",
    "application_type",
    "dti",
    "grade",
    "home_ownership",
    "initial_list_status",
    "installment",
    "int_rate",
    "loan_amnt",
    "loan_status",
    "tax_liens",
    "delinq_amnt",
    "pub_rec",
    "last_fico_range_high",
    "last_fico_range_low",
    "recoveries",
    "collection_recovery_fee"
]

# Load only the selected columns (Pandas can read gzip directly)
df = pd.read_csv(file_location, usecols=selected_columns, compression='gzip', low_memory=False)

# Show the first few rows
df.head()
Out[2]:
id loan_amnt term int_rate installment grade home_ownership annual_inc verification_status loan_status purpose addr_state dti pub_rec initial_list_status recoveries collection_recovery_fee last_fico_range_high last_fico_range_low application_type acc_now_delinq delinq_amnt tax_liens
0 68407277 3600.00 36 months 13.99 123.03 C MORTGAGE 55000.00 Not Verified Fully Paid debt_consolidation PA 5.91 0.00 w 0.00 0.00 564.00 560.00 Individual 0.00 0.00 0.00
1 68355089 24700.00 36 months 11.99 820.28 C MORTGAGE 65000.00 Not Verified Fully Paid small_business SD 16.06 0.00 w 0.00 0.00 699.00 695.00 Individual 0.00 0.00 0.00
2 68341763 20000.00 60 months 10.78 432.66 B MORTGAGE 63000.00 Not Verified Fully Paid home_improvement IL 10.78 0.00 w 0.00 0.00 704.00 700.00 Joint App 0.00 0.00 0.00
3 66310712 35000.00 60 months 14.85 829.90 C MORTGAGE 110000.00 Source Verified Current debt_consolidation NJ 17.06 0.00 w 0.00 0.00 679.00 675.00 Individual 0.00 0.00 0.00
4 68476807 10400.00 60 months 22.45 289.91 F MORTGAGE 104433.00 Source Verified Fully Paid major_purchase PA 25.37 0.00 w 0.00 0.00 704.00 700.00 Individual 0.00 0.00 0.00
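If the ~2.2-million-row file strains memory, pandas can also stream it in chunks and filter while reading. A minimal sketch on an in-memory stand-in for the real gzipped CSV (the toy rows and the `chunksize` value are illustrative):

```python
import io
import pandas as pd

# Tiny CSV standing in for accepted_2007_to_2018Q4.csv.gz.
csv_data = io.StringIO(
    "id,loan_amnt,loan_status\n"
    "1,3600,Fully Paid\n"
    "2,24700,Charged Off\n"
    "3,20000,Current\n"
)

chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Drop rows we never use (e.g. loans still in progress) as we stream,
    # so the full file is never held in memory at once.
    chunks.append(chunk[chunk["loan_status"] != "Current"])

df_small = pd.concat(chunks, ignore_index=True)
print(df_small.shape)  # (2, 3)
```

On the real file, the same loop would take `file_location`, `usecols=selected_columns`, and `compression='gzip'` exactly as in the cell above.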
In [3]:
print("Shape of the data frame :",df.shape)
Shape of the data frame : (2260701, 23)

Dropping all missing values

In [4]:
df = df.dropna()
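Before dropping rows wholesale, it is worth auditing how much data each column would cost us. A toy sketch; on the real frame, `df.isna().sum()` plays the same role:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

print(toy.isna().sum())                   # missing count per column: a: 1, b: 1
print(len(toy), "->", len(toy.dropna()))  # rows before and after a blanket dropna: 3 -> 1
```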
In [5]:
df.dtypes
Out[5]:
id                          object
loan_amnt                  float64
term                        object
int_rate                   float64
installment                float64
grade                       object
home_ownership              object
annual_inc                 float64
verification_status         object
loan_status                 object
purpose                     object
addr_state                  object
dti                        float64
pub_rec                    float64
initial_list_status         object
recoveries                 float64
collection_recovery_fee    float64
last_fico_range_high       float64
last_fico_range_low        float64
application_type            object
acc_now_delinq             float64
delinq_amnt                float64
tax_liens                  float64
dtype: object
In [6]:
df.describe()
Out[6]:
loan_amnt int_rate installment annual_inc dti pub_rec recoveries collection_recovery_fee last_fico_range_high last_fico_range_low acc_now_delinq delinq_amnt tax_liens
count 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00 2258852.00
mean 15044.31 13.09 445.74 78051.79 18.82 0.20 143.96 24.00 687.65 675.53 0.00 12.38 0.05
std 9188.00 4.83 267.11 112720.16 14.18 0.57 748.38 131.26 72.97 111.11 0.07 726.74 0.38
min 500.00 5.31 4.93 0.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 8000.00 9.49 251.62 46000.00 11.90 0.00 0.00 0.00 654.00 650.00 0.00 0.00 0.00
50% 12900.00 12.62 377.94 65000.00 17.84 0.00 0.00 0.00 699.00 695.00 0.00 0.00 0.00
75% 20000.00 15.99 593.06 93000.00 24.49 0.00 0.00 0.00 734.00 730.00 0.00 0.00 0.00
max 40000.00 30.99 1719.83 110000000.00 999.00 86.00 39859.55 7174.72 850.00 845.00 14.00 249925.00 85.00

Exploratory Data Analysis

Target feature

The loan_status feature, our target variable, contains values other than Fully Paid and Charged Off, so we will need to map every status into one of two classes. First, let's inspect the unique values of the "loan_status" column.

In [7]:
df["loan_status"].unique()
Out[7]:
array(['Fully Paid', 'Current', 'Charged Off', 'In Grace Period',
       'Late (31-120 days)', 'Late (16-30 days)', 'Default',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)

Current: the applicant is still paying the instalments, i.e. the tenure of the loan is not yet complete. These applicants cannot yet be labelled as defaulted or not, so we drop them from the dataset. The id column is also not needed and will be dropped.

In [8]:
df = df[df.loan_status != "Current"]
In [9]:
df["loan_status"].unique()
Out[9]:
array(['Fully Paid', 'Charged Off', 'In Grace Period',
       'Late (31-120 days)', 'Late (16-30 days)', 'Default',
       'Does not meet the credit policy. Status:Fully Paid',
       'Does not meet the credit policy. Status:Charged Off'],
      dtype=object)
In [10]:
df.drop('id', axis=1, inplace=True)

Correlation heatmap of our dataset.

In [11]:
# all columns
data_encoded = df.copy()
for col in data_encoded.select_dtypes(include=['object']).columns:
    data_encoded[col] = data_encoded[col].astype('category').cat.codes

plt.figure(figsize=(21, 14))
sns.heatmap(data_encoded.corr(), annot=True, cmap='viridis', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap (Numeric + Encoded Categorical Features)", fontsize=18, pad=12)
plt.tight_layout()
plt.show()

Define the mapping of loan_status to "Good Loan" and "Bad Loan"

In [12]:
Good_Loan_statuses = [
    'Fully Paid',
    'In Grace Period',
    'Does not meet the credit policy. Status:Fully Paid'
]
Bad_Loan_statuses = [
    'Charged Off',
    'Does not meet the credit policy. Status:Charged Off',
    'Late (16-30 days)',
    'Late (31-120 days)',
    'Default'
]

# Filter the DataFrame for the relevant statuses
df_filtered = df[df['loan_status'].isin(Good_Loan_statuses + Bad_Loan_statuses)].copy()

# Create a new column 'target' to group the statuses
df_filtered.loc[:, 'target_status'] = df_filtered['loan_status'].apply(
    lambda x: 'Good Loan' if x in Good_Loan_statuses else 'Bad Loan'
)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
sns.despine()
sns.countplot(data=df_filtered, x='target_status', palette=['lightgreen', 'salmon'])

# Add legend
handles = ax.patches
labels = ['Good Loan', 'Bad Loan']
ax.legend(handles=handles, labels=labels, loc='upper right')

ax.set(xlabel='Loan Status', ylabel='Count')
ax.set_title('Loan Status Count', size=20)
plt.tight_layout()
plt.show()

Reduce the data size to speed up the following steps; otherwise, memory will soon run out.

In [13]:
filtered_df = df.sample(n=100000, random_state=42)
In [14]:
# Define the mapping for "Good Loan" and "Bad Loan"
Good_Loan_statuses = [
    'Fully Paid',
    'In Grace Period',
    'Does not meet the credit policy. Status:Fully Paid'
]

# Update the loan_status column
filtered_df['loan_status'] = filtered_df['loan_status'].apply(
    lambda x: 'Good Loan' if x in Good_Loan_statuses else 'Bad Loan'
)

# Verify the updated values
print(filtered_df["loan_status"].unique())
['Good Loan' 'Bad Loan']
In [15]:
filtered_df.head()
Out[15]:
loan_amnt term int_rate installment grade home_ownership annual_inc verification_status loan_status purpose addr_state dti pub_rec initial_list_status recoveries collection_recovery_fee last_fico_range_high last_fico_range_low application_type acc_now_delinq delinq_amnt tax_liens
767727 10000.00 36 months 16.14 352.27 C RENT 60000.00 Source Verified Good Loan other CA 11.90 0.00 w 0.00 0.00 679.00 675.00 Individual 0.00 0.00 0.00
2008787 12000.00 36 months 10.49 389.98 B MORTGAGE 69000.00 Source Verified Good Loan debt_consolidation GA 17.40 1.00 w 0.00 0.00 704.00 700.00 Individual 0.00 0.00 0.00
119744 12000.00 60 months 16.99 298.17 D MORTGAGE 80000.00 Not Verified Good Loan other FL 34.80 0.00 w 0.00 0.00 684.00 680.00 Individual 0.00 0.00 0.00
2157409 9525.00 36 months 11.39 313.60 B MORTGAGE 100000.00 Not Verified Good Loan debt_consolidation FL 9.18 0.00 w 0.00 0.00 714.00 710.00 Individual 0.00 0.00 0.00
1961525 25000.00 36 months 12.79 839.83 C MORTGAGE 88000.00 Verified Good Loan debt_consolidation NH 18.30 0.00 f 0.00 0.00 719.00 715.00 Individual 0.00 0.00 0.00

Some Categorical features

addr_state

In [16]:
# Visualization for total loan count by state (In the USA)

fig, ax =plt.subplots(figsize=(20,10))
sns.despine()
order = filtered_df["addr_state"].value_counts().index
sns.countplot(data=filtered_df,x="addr_state",order=order)
ax.tick_params(axis='x', labelrotation=90)
ax.set(xlabel='State', ylabel='')
ax.set_title('Loan count by state', size=20)
Out[16]:
Text(0.5, 1.0, 'Loan count by state')
In [17]:
# Grade count by loan status

# Ensure 'grade' is treated as a string and drop NaN values if any
filtered_df['grade'] = filtered_df['grade'].astype(str)
order = sorted(filtered_df["grade"].unique())

# Plot
fig, ax = plt.subplots(figsize=(12, 8))
sns.despine()
sns.countplot(data=filtered_df, x="grade", hue="loan_status", order=order)
ax.tick_params(axis='x', labelrotation=0)
ax.set(xlabel='Grade', ylabel='Count')
ax.set_title('Grade assigned by LC', size=20)
plt.tight_layout()
plt.show()
In [18]:
# Term count by loan status

fig, ax =plt.subplots(figsize=(12,8)) 
sns.despine() 
order=sorted(filtered_df["term"].unique())
sns.countplot(data=filtered_df,x="term",hue="loan_status",order=order)
ax.tick_params(axis='x', labelrotation=0)
ax.set(xlabel='Months', ylabel='')
ax.set_title('Term of the loan', size=20)
Out[18]:
Text(0.5, 1.0, 'Term of the loan')
In [19]:
# Purpose of loan count by loan status

fig, ax =plt.subplots(1,2,figsize=(20,8))

sns.despine() 

ax[0].tick_params(axis='x', labelrotation=90)
ax[0].set(xlabel='Purpose', ylabel='')
ax[0].set_title('Purpose of loan - Full', size=20)
ax[1].tick_params(axis='x', labelrotation=90)
ax[1].set(xlabel='Purpose', ylabel='')
ax[1].set_title('Purpose of loan - Last values zoom-in', size=20)

sns.countplot(data=filtered_df,x="purpose",hue="loan_status",
              order=filtered_df["purpose"].value_counts().index,ax=ax[0])

sns.countplot(data=filtered_df,x="purpose",hue="loan_status",
              order=["house","wedding","renewable_energy",
                    "educational"],ax=ax[1])
Out[19]:
<Axes: title={'center': 'Purpose of loan - Last values zoom-in'}, xlabel='Purpose', ylabel='count'>
In [20]:
# Home ownership status count by loan status

fig, ax =plt.subplots(1,2,figsize=(20,8))

sns.despine() 

ax[0].tick_params(axis='x', labelrotation=0)
ax[0].set(xlabel='Ownership status', ylabel='')
ax[0].set_title('Ownership - Full', size=20)
ax[1].tick_params(axis='x', labelrotation=0)
ax[1].set(xlabel='Ownership status', ylabel='')
ax[1].set_title('Ownership - Last values zoom-in', size=20)

sns.countplot(data=filtered_df,x="home_ownership",hue="loan_status",ax=ax[0])
sns.countplot(data=filtered_df,x="home_ownership",hue="loan_status",order=["ANY","NONE","OTHER"],ax=ax[1])
Out[20]:
<Axes: title={'center': 'Ownership - Last values zoom-in'}, xlabel='Ownership status', ylabel='count'>

Some Numerical features

In [21]:
# Installment amount count by loan status

fig, ax =plt.subplots(1,2,figsize=(20,8))

sns.despine() 

ax[0].tick_params(axis='x', labelrotation=0)
ax[0].set(xlabel='Installments amount in USD', ylabel='')
ax[0].set_title('Installment amount by loan type - Distribution', size=20)
ax[1].tick_params(axis='x', labelrotation=0)
ax[1].set_title('Installment amount by loan type - Boxplot', size=20)


sns.histplot(data=filtered_df,x="installment",hue="loan_status",bins=30,
            kde=True,ax=ax[0])
sns.boxplot(data=filtered_df,x="loan_status",y="installment",ax=ax[1]).set(xlabel='Loan Status', 
                                                                       ylabel='Amount in USD')
Out[21]:
[Text(0.5, 0, 'Loan Status'), Text(0, 0.5, 'Amount in USD')]
In [22]:
# Interest rate count by loan status

fig, ax =plt.subplots(1,2,figsize=(20,8))

sns.despine() 

ax[0].tick_params(axis='x', labelrotation=0)
ax[0].set(xlabel='Interest rate in %', ylabel='')
ax[0].set_title('Interest rate by loan type - Distribution', size=20)
ax[1].tick_params(axis='x', labelrotation=0)
ax[1].set_title('Interest rate by loan type - Boxplot', size=20)


sns.histplot(data=filtered_df,x="int_rate",hue="loan_status",bins=30,
            kde=True,ax=ax[0])

sns.boxplot(data=filtered_df,x="loan_status",y="int_rate",ax=ax[1]).set(xlabel='Loan Status', 
                                                                    ylabel='Interest rate in %')
Out[22]:
[Text(0.5, 0, 'Loan Status'), Text(0, 0.5, 'Interest rate in %')]
In [23]:
loan_amnt_box = filtered_df.hvplot.box(
    y='loan_amnt', subplots=True, by='loan_status', width=300, height=350, 
    title="Loan Status by Loan Amount ", xlabel='Loan Status', ylabel='Loan Amount'
)

installment_box = filtered_df.hvplot.box(
    y='installment', subplots=True, by='loan_status', width=300, height=350, 
    title="Loan Status by Installment", xlabel='Loan Status', ylabel='Installment'
)

loan_amnt_box + installment_box
Out[23]:
In [24]:
def pub_rec(number):
    if number == 0.0:
        return 0
    else:
        return 1
    
def delinq_amnt(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number
    
def acc_now_delinq(number):
    if number == 0.0:
        return 0
    elif number >= 1.0:
        return 1
    else:
        return number
In [25]:
filtered_df['pub_rec'] = filtered_df.pub_rec.apply(pub_rec)
filtered_df['delinq_amnt'] = filtered_df.delinq_amnt.apply(delinq_amnt)
filtered_df['acc_now_delinq'] = filtered_df.acc_now_delinq.apply(acc_now_delinq)
In [26]:
plt.figure(figsize=(12, 30))

plt.subplot(6, 2, 1)
sns.countplot(x='pub_rec', data=filtered_df, hue='loan_status')

plt.subplot(6, 2, 2)
sns.countplot(x='initial_list_status', data=filtered_df, hue='loan_status')

plt.subplot(6, 2, 3)
sns.countplot(x='application_type', data=filtered_df, hue='loan_status')

plt.subplot(6, 2, 4)
sns.countplot(x='delinq_amnt', data=filtered_df, hue='loan_status')

plt.subplot(6, 2, 5)
sns.countplot(x='acc_now_delinq', data=filtered_df, hue='loan_status')
Out[26]:
<Axes: xlabel='acc_now_delinq', ylabel='count'>

For all columns, both numerical and categorical

In [27]:
encoder = ce.OrdinalEncoder()
encoded_data = encoder.fit_transform(filtered_df)

# Compute correlation on all columns
encoded_data.corr()['loan_status'] \
    .drop('loan_status') \
    .sort_values() \
    .hvplot.barh(
        width=600, height=400,
        title="Correlation between Loan Status and All Features",
        ylabel='Correlation', xlabel='All Encoded Features'
    )
Out[27]:

Data Preprocessing

In [28]:
# Define the mapping for "Good Loan" and "Bad Loan"
Good_Loan_statuses = [
    'Fully Paid',
    'In Grace Period',
    'Does not meet the credit policy. Status:Fully Paid'
]

# Update the loan_status column
df['loan_status'] = df['loan_status'].apply(
    lambda x: 'Good Loan' if x in Good_Loan_statuses else 'Bad Loan'
)

# Verify the updated values
print(df["loan_status"].unique())
['Good Loan' 'Bad Loan']
In [29]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1381834 entries, 0 to 2260697
Data columns (total 22 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   loan_amnt                1381834 non-null  float64
 1   term                     1381834 non-null  object 
 2   int_rate                 1381834 non-null  float64
 3   installment              1381834 non-null  float64
 4   grade                    1381834 non-null  object 
 5   home_ownership           1381834 non-null  object 
 6   annual_inc               1381834 non-null  float64
 7   verification_status      1381834 non-null  object 
 8   loan_status              1381834 non-null  object 
 9   purpose                  1381834 non-null  object 
 10  addr_state               1381834 non-null  object 
 11  dti                      1381834 non-null  float64
 12  pub_rec                  1381834 non-null  float64
 13  initial_list_status      1381834 non-null  object 
 14  recoveries               1381834 non-null  float64
 15  collection_recovery_fee  1381834 non-null  float64
 16  last_fico_range_high     1381834 non-null  float64
 17  last_fico_range_low      1381834 non-null  float64
 18  application_type         1381834 non-null  object 
 19  acc_now_delinq           1381834 non-null  float64
 20  delinq_amnt              1381834 non-null  float64
 21  tax_liens                1381834 non-null  float64
dtypes: float64(13), object(9)
memory usage: 242.5+ MB
In [30]:
print([column for column in df.columns if df[column].dtype == object])
['term', 'grade', 'home_ownership', 'verification_status', 'loan_status', 'purpose', 'addr_state', 'initial_list_status', 'application_type']
In [31]:
print([column for column in df.columns if pd.api.types.is_numeric_dtype(df[column])])
['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti', 'pub_rec', 'recoveries', 'collection_recovery_fee', 'last_fico_range_high', 'last_fico_range_low', 'acc_now_delinq', 'delinq_amnt', 'tax_liens']

Let's encode the target loan_status to facilitate our calculations.

In [32]:
# Encoding target values to dummy values 

df['loan_status'] = df['loan_status'].map({'Good Loan':0,'Bad Loan':1})
In [33]:
df.term.unique()
Out[33]:
array([' 36 months', ' 60 months'], dtype=object)
In [34]:
term_values = {' 36 months': 36, ' 60 months': 60}
df['term'] = df.term.map(term_values)
In [35]:
df.term.unique()
Out[35]:
array([36, 60])
In [36]:
dummies = ['grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state', 
           'initial_list_status', 'application_type']
df = pd.get_dummies(df, columns=dummies, drop_first=True)
In [37]:
df.head()
Out[37]:
loan_amnt term int_rate installment annual_inc loan_status dti pub_rec recoveries collection_recovery_fee last_fico_range_high last_fico_range_low acc_now_delinq delinq_amnt tax_liens grade_B grade_C grade_D grade_E grade_F grade_G home_ownership_MORTGAGE home_ownership_NONE home_ownership_OTHER home_ownership_OWN ... addr_state_ND addr_state_NE addr_state_NH addr_state_NJ addr_state_NM addr_state_NV addr_state_NY addr_state_OH addr_state_OK addr_state_OR addr_state_PA addr_state_RI addr_state_SC addr_state_SD addr_state_TN addr_state_TX addr_state_UT addr_state_VA addr_state_VT addr_state_WA addr_state_WI addr_state_WV addr_state_WY initial_list_status_w application_type_Joint App
0 3600.00 36 13.99 123.03 55000.00 0 5.91 0.00 0.00 0.00 564.00 560.00 0.00 0.00 0.00 False True False False False False True False False False ... False False False False False False False False False False True False False False False False False False False False False False False True False
1 24700.00 36 11.99 820.28 65000.00 0 16.06 0.00 0.00 0.00 699.00 695.00 0.00 0.00 0.00 False True False False False False True False False False ... False False False False False False False False False False False False False True False False False False False False False False False True False
2 20000.00 60 10.78 432.66 63000.00 0 10.78 0.00 0.00 0.00 704.00 700.00 0.00 0.00 0.00 True False False False False False True False False False ... False False False False False False False False False False False False False False False False False False False False False False False True True
4 10400.00 60 22.45 289.91 104433.00 0 25.37 0.00 0.00 0.00 704.00 700.00 0.00 0.00 0.00 False False False False True False True False False False ... False False False False False False False False False False True False False False False False False False False False False False False True False
5 11950.00 36 13.44 405.18 34000.00 0 10.20 0.00 0.00 0.00 759.00 755.00 0.00 0.00 0.00 False True False False False False False False False False ... False False False False False False False False False False False False False False False False False False False False False False False True False

5 rows × 93 columns

Train Test Split

In [38]:
w_good = df.loan_status.value_counts()[0] / df.shape[0]
w_bad = df.loan_status.value_counts()[1] / df.shape[0]

print(f"Proportion of good loans (class 0): {w_good}")
print(f"Proportion of bad loans (class 1): {w_bad}")
Proportion of good loans (class 0): 0.7864685627940838
Proportion of bad loans (class 1): 0.2135314372059162
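These class proportions are informative but are not used further below. One possible way to exploit the imbalance (a sketch, not part of the original pipeline) is scikit-learn's `class_weight` option, shown here on a synthetic stand-in with roughly the same 79/21 split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with an imbalance similar to loan_status (~79% class 0).
X, y = make_classification(n_samples=2000, weights=[0.79], random_state=42)

# class_weight='balanced' reweights each class by n_samples / (n_classes * count),
# so the minority ("Bad Loan") class is not drowned out during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)
clf.fit(X, y)
print(clf.classes_)  # [0 1]
```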
In [39]:
train, test = train_test_split(df, test_size=0.33, random_state=42)

print(train.shape)
print(test.shape)
(925828, 93)
(456006, 93)

Removing Outliers

In [40]:
print(train.shape)
train = train[train['annual_inc'] <= 250000]
train = train[train['dti'] <= 50]
print(train.shape)
(925828, 93)
(914316, 93)
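The cutoffs above (250000 for annual_inc, 50 for dti) are hand-picked. A data-driven alternative (a sketch only; the lognormal parameters are illustrative, not fitted to the real data) is to cap at a high quantile:

```python
import numpy as np
import pandas as pd

# Synthetic income-like data standing in for train['annual_inc'].
rng = np.random.default_rng(42)
toy = pd.DataFrame({"annual_inc": rng.lognormal(mean=11, sigma=0.6, size=1000)})

# Keep everything at or below the 99th percentile instead of a fixed threshold.
cap = toy["annual_inc"].quantile(0.99)
trimmed = toy[toy["annual_inc"] <= cap]
print(len(trimmed))  # 990
```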

Normalizing the data

In [41]:
X_train, y_train = train.drop('loan_status', axis=1), train.loan_status
X_test, y_test = test.drop('loan_status', axis=1), test.loan_status
In [42]:
X_train.head()
Out[42]:
loan_amnt term int_rate installment annual_inc dti pub_rec recoveries collection_recovery_fee last_fico_range_high last_fico_range_low acc_now_delinq delinq_amnt tax_liens grade_B grade_C grade_D grade_E grade_F grade_G home_ownership_MORTGAGE home_ownership_NONE home_ownership_OTHER home_ownership_OWN home_ownership_RENT ... addr_state_ND addr_state_NE addr_state_NH addr_state_NJ addr_state_NM addr_state_NV addr_state_NY addr_state_OH addr_state_OK addr_state_OR addr_state_PA addr_state_RI addr_state_SC addr_state_SD addr_state_TN addr_state_TX addr_state_UT addr_state_VA addr_state_VT addr_state_WA addr_state_WI addr_state_WV addr_state_WY initial_list_status_w application_type_Joint App
1304582 25000.00 36 15.31 870.44 150000.00 12.68 0.00 0.00 0.00 639.00 635.00 0.00 0.00 0.00 False True False False False False False False False True False ... False False False False False False False False False False False False False False False False False False False False False False False False False
1118221 4850.00 36 22.99 187.72 48000.00 20.95 0.00 231.28 41.63 559.00 555.00 0.00 0.00 0.00 False False False False True False True False False False False ... False False False False False False False False False False False False False False False False False False False False False False False False False
1862395 6625.00 36 13.11 223.58 22500.00 33.23 0.00 487.13 86.96 609.00 605.00 0.00 0.00 0.00 True False False False False False False False False False True ... False False False False False False False False False False False False False False False False False False False False False False False False False
420596 10000.00 36 12.39 334.01 31814.00 36.70 0.00 0.00 0.00 634.00 630.00 0.00 0.00 0.00 False True False False False False False False False False True ... False False False False False False False False False True False False False False False False False False False False False False False False False
2035521 12000.00 36 13.99 410.08 30000.00 7.00 0.00 0.00 0.00 519.00 515.00 0.00 0.00 0.00 False True False False False False False False False False True ... False False False False False True False False False False False False False False False False False False False False False False False False False

5 rows × 92 columns

In [43]:
y_train.head()
Out[43]:
1304582    0
1118221    1
1862395    1
420596     1
2035521    1
Name: loan_status, dtype: int64
In [44]:
X_train.dtypes
Out[44]:
loan_amnt                     float64
term                            int64
int_rate                      float64
installment                   float64
annual_inc                    float64
                               ...   
addr_state_WI                    bool
addr_state_WV                    bool
addr_state_WY                    bool
initial_list_status_w            bool
application_type_Joint App       bool
Length: 92, dtype: object
In [45]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
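Note that the scaler is fit on the training set only and merely applied to the test set. A tiny sketch (toy arrays, hypothetical values) of why that matters:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on train only: the test set is rescaled with the *training* min/max,
# so no test-set statistics leak into preprocessing.
scaler = MinMaxScaler()
X_tr = np.array([[0.0], [10.0]])   # train range: [0, 10]
X_te = np.array([[12.0]])          # test value outside the train range

scaler.fit(X_tr)
print(scaler.transform(X_te))      # a value > 1, scaled by the train range
```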

Model Building

In [46]:
def print_score(true, pred, train=True):
    """Print accuracy, a classification report, and a confusion matrix."""
    label = "Train" if train else "Test"
    clf_report = pd.DataFrame(classification_report(true, pred, output_dict=True))
    print(f"{label} Result:\n________________________________________________")
    print(f"Accuracy Score: {accuracy_score(true, pred) * 100:.2f}%")
    print("_______________________________________________")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")
    print("_______________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(true, pred)}\n")
In [47]:
X_train = np.array(X_train).astype(np.float32)
X_test = np.array(X_test).astype(np.float32)
y_train = np.array(y_train).astype(np.float32)
y_test = np.array(y_test).astype(np.float32)

Logistic Regression

In [48]:
# Initialize model
lr_clf = LogisticRegression(max_iter=1000, random_state=42)

# Train
lr_clf.fit(X_train, y_train)

# Predictions
y_train_pred = lr_clf.predict(X_train)
y_test_pred = lr_clf.predict(X_test)

# Evaluate
print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)
Train Result:
________________________________________________
Accuracy Score: 92.74%
_______________________________________________
CLASSIFICATION REPORT:
                0.0       1.0  accuracy  macro avg  weighted avg
precision      0.94      0.87      0.93       0.91          0.93
recall         0.97      0.78      0.93       0.87          0.93
f1-score       0.95      0.82      0.93       0.89          0.93
support   719181.00 195135.00      0.93  914316.00     914316.00
_______________________________________________
Confusion Matrix: 
 [[696420  22761]
 [ 43594 151541]]

Test Result:
________________________________________________
Accuracy Score: 92.68%
_______________________________________________
CLASSIFICATION REPORT:
                0.0      1.0  accuracy  macro avg  weighted avg
precision      0.94     0.87      0.93       0.90          0.93
recall         0.97     0.78      0.93       0.87          0.93
f1-score       0.95     0.82      0.93       0.89          0.93
support   358208.00 97798.00      0.93  456006.00     456006.00
_______________________________________________
Confusion Matrix: 
 [[346817  11391]
 [ 21981  75817]]

In [49]:
disp = ConfusionMatrixDisplay.from_estimator(
    lr_clf, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Good Loan', 'Bad Loan']
)

disp = RocCurveDisplay.from_estimator(lr_clf, X_test, y_test)
In [50]:
scores_dict = {
    'Logistic Regression': {
        'Train': roc_auc_score(y_train, lr_clf.predict(X_train)),
        'Test': roc_auc_score(y_test, lr_clf.predict(X_test)),
    },
}
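Note that `roc_auc_score` is fed hard `predict()` labels here, which collapses the ROC curve to a single threshold. A sketch (on synthetic data) of the alternative using `predict_proba`, which yields the usual ranking-based AUC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# AUC from hard labels: equivalent to (TPR + TNR) / 2 at the 0.5 threshold.
auc_labels = roc_auc_score(y, clf.predict(X))
# AUC from probabilities: ranks every sample, using the whole ROC curve.
auc_proba = roc_auc_score(y, clf.predict_proba(X)[:, 1])
print(f"label AUC {auc_labels:.3f} vs probability AUC {auc_proba:.3f}")
```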

Decision Tree Classifier

In [51]:
# Initialize Decision Tree model
dt_clf = DecisionTreeClassifier(max_depth=10, criterion='entropy', random_state=42)

# Train the model
dt_clf.fit(X_train, y_train)

# Predictions
y_train_pred = dt_clf.predict(X_train)
y_test_pred = dt_clf.predict(X_test)

# Evaluate
print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)
Train Result:
________________________________________________
Accuracy Score: 94.34%
_______________________________________________
CLASSIFICATION REPORT:
                0.0       1.0  accuracy  macro avg  weighted avg
precision      0.95      0.91      0.94       0.93          0.94
recall         0.98      0.82      0.94       0.90          0.94
f1-score       0.96      0.86      0.94       0.91          0.94
support   719181.00 195135.00      0.94  914316.00     914316.00
_______________________________________________
Confusion Matrix: 
 [[703447  15734]
 [ 36016 159119]]

Test Result:
________________________________________________
Accuracy Score: 94.16%
_______________________________________________
CLASSIFICATION REPORT:
                0.0      1.0  accuracy  macro avg  weighted avg
precision      0.95     0.91      0.94       0.93          0.94
recall         0.98     0.81      0.94       0.89          0.94
f1-score       0.96     0.86      0.94       0.91          0.94
support   358208.00 97798.00      0.94  456006.00     456006.00
_______________________________________________
Confusion Matrix: 
 [[350067   8141]
 [ 18508  79290]]

In [52]:
disp = ConfusionMatrixDisplay.from_estimator(
    dt_clf, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Good Loan', 'Bad Loan']
)

disp = RocCurveDisplay.from_estimator(dt_clf, X_test, y_test)
In [53]:
# Score with predicted probabilities so the ROC AUC reflects the full
# ranking, not just the 0.5-thresholded labels
scores_dict['Decision Tree'] = {
    'Train': roc_auc_score(y_train, dt_clf.predict_proba(X_train)[:, 1]),
    'Test': roc_auc_score(y_test, dt_clf.predict_proba(X_test)[:, 1]),
}
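The fitted tree also exposes `feature_importances_`, which is useful for seeing which attributes drive the splits. A minimal sketch on synthetic data (the feature names are illustrative, not the Lending Club columns):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features (names are illustrative only)
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=42)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=10, criterion='entropy', random_state=42)
tree.fit(X, y)

# Entropy-based importances are normalized to sum to 1
ranked = sorted(zip(feature_names, tree.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```

Sorting the importances this way makes it easy to spot when a handful of features dominate the model's decisions.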

Gaussian Naive Bayes¶

In [54]:
# Initialize model
gnb_clf = GaussianNB()

# Train
gnb_clf.fit(X_train, y_train)

# Predictions
y_train_pred = gnb_clf.predict(X_train)
y_test_pred = gnb_clf.predict(X_test)

# Evaluate
print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)
Train Result:
________________________________________________
Accuracy Score: 90.66%
_______________________________________________
CLASSIFICATION REPORT:
                0.0       1.0  accuracy  macro avg  weighted avg
precision      0.92      0.84      0.91       0.88          0.90
recall         0.96      0.69      0.91       0.83          0.91
f1-score       0.94      0.76      0.91       0.85          0.90
support   719181.00 195135.00      0.91  914316.00     914316.00
_______________________________________________
Confusion Matrix: 
 [[693934  25247]
 [ 60159 134976]]

Test Result:
________________________________________________
Accuracy Score: 90.57%
_______________________________________________
CLASSIFICATION REPORT:
                0.0      1.0  accuracy  macro avg  weighted avg
precision      0.92     0.84      0.91       0.88          0.90
recall         0.96     0.69      0.91       0.83          0.91
f1-score       0.94     0.76      0.91       0.85          0.90
support   358208.00 97798.00      0.91  456006.00     456006.00
_______________________________________________
Confusion Matrix: 
 [[345411  12797]
 [ 30219  67579]]

In [55]:
disp = ConfusionMatrixDisplay.from_estimator(
    gnb_clf, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Default', 'Fully-Paid']
)

disp = RocCurveDisplay.from_estimator(gnb_clf, X_test, y_test)
[Figure: confusion matrix and ROC curve for Gaussian Naive Bayes on the test set]
In [56]:
# Score with predicted probabilities rather than hard labels
scores_dict['GNB'] = {
    'Train': roc_auc_score(y_train, gnb_clf.predict_proba(X_train)[:, 1]),
    'Test': roc_auc_score(y_test, gnb_clf.predict_proba(X_test)[:, 1]),
}

Gradient Boosting¶

In [57]:
# Initialize model
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Train
gb_clf.fit(X_train, y_train)

# Predictions
y_train_pred = gb_clf.predict(X_train)
y_test_pred = gb_clf.predict(X_test)

# Evaluate
print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)
Train Result:
________________________________________________
Accuracy Score: 94.35%
_______________________________________________
CLASSIFICATION REPORT:
                0.0       1.0  accuracy  macro avg  weighted avg
precision      0.95      0.92      0.94       0.93          0.94
recall         0.98      0.81      0.94       0.89          0.94
f1-score       0.96      0.86      0.94       0.91          0.94
support   719181.00 195135.00      0.94  914316.00     914316.00
_______________________________________________
Confusion Matrix: 
 [[704611  14570]
 [ 37111 158024]]

Test Result:
________________________________________________
Accuracy Score: 94.32%
_______________________________________________
CLASSIFICATION REPORT:
                0.0      1.0  accuracy  macro avg  weighted avg
precision      0.95     0.92      0.94       0.93          0.94
recall         0.98     0.81      0.94       0.89          0.94
f1-score       0.96     0.86      0.94       0.91          0.94
support   358208.00 97798.00      0.94  456006.00     456006.00
_______________________________________________
Confusion Matrix: 
 [[351035   7173]
 [ 18740  79058]]

In [58]:
disp = ConfusionMatrixDisplay.from_estimator(
    gb_clf, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Default', 'Fully-Paid']
)

disp = RocCurveDisplay.from_estimator(gb_clf, X_test, y_test)
[Figure: confusion matrix and ROC curve for Gradient Boosting on the test set]
In [59]:
# Score with predicted probabilities rather than hard labels
scores_dict['Gradient Boosting'] = {
    'Train': roc_auc_score(y_train, gb_clf.predict_proba(X_train)[:, 1]),
    'Test': roc_auc_score(y_test, gb_clf.predict_proba(X_test)[:, 1]),
}
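`n_estimators=100` is a guess; `staged_predict_proba` lets us check, round by round, where the held-out AUC actually peaks, without refitting. A sketch on synthetic data (sizes and seeds are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=42).fit(X_tr, y_tr)

# Test AUC after each boosting round, all from the same fitted model
aucs = [roc_auc_score(y_te, p[:, 1]) for p in gb.staged_predict_proba(X_te)]
best_round = int(np.argmax(aucs)) + 1
print(f"best AUC {max(aucs):.3f} at round {best_round}")
```

If the curve flattens well before round 100, fewer estimators would train faster with no loss in test performance.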

Random Forest Classifier¶

In [60]:
# Fix the seed so the forest (and its reported scores) are reproducible
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

y_train_pred = rf_clf.predict(X_train)
y_test_pred = rf_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)
Train Result:
________________________________________________
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
                0.0       1.0  accuracy  macro avg  weighted avg
precision      1.00      1.00      1.00       1.00          1.00
recall         1.00      1.00      1.00       1.00          1.00
f1-score       1.00      1.00      1.00       1.00          1.00
support   719181.00 195135.00      1.00  914316.00     914316.00
_______________________________________________
Confusion Matrix: 
 [[719181      0]
 [    22 195113]]

Test Result:
________________________________________________
Accuracy Score: 94.53%
_______________________________________________
CLASSIFICATION REPORT:
                0.0      1.0  accuracy  macro avg  weighted avg
precision      0.95     0.92      0.95       0.94          0.94
recall         0.98     0.82      0.95       0.90          0.95
f1-score       0.97     0.86      0.95       0.92          0.94
support   358208.00 97798.00      0.95  456006.00     456006.00
_______________________________________________
Confusion Matrix: 
 [[351279   6929]
 [ 18022  79776]]

In [61]:
disp = ConfusionMatrixDisplay.from_estimator(rf_clf, X_test, y_test, 
                             cmap='Blues', values_format='d', 
                             display_labels=['Default', 'Fully-Paid'])

disp = RocCurveDisplay.from_estimator(rf_clf, X_test, y_test)
[Figure: confusion matrix and ROC curve for the Random Forest on the test set]
In [62]:
# Score with predicted probabilities rather than hard labels
scores_dict['Random Forest'] = {
    'Train': roc_auc_score(y_train, rf_clf.predict_proba(X_train)[:, 1]),
    'Test': roc_auc_score(y_test, rf_clf.predict_proba(X_test)[:, 1]),
}
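The 100% train accuracy above is expected: fully grown trees memorize their bootstrap samples. Setting `oob_score=True` gives a test-like estimate for free, computed on the samples each tree never saw; a sketch under the same hyperparameters on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, random_state=42)

# Each tree is scored on the ~37% of rows left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=42).fit(X, y)
print(f"train acc: {rf.score(X, y):.3f}, OOB acc: {rf.oob_score_:.3f}")
```

The gap between the two numbers is a quick overfitting check that mirrors what the held-out test set shows here.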

XGBoost Classifier¶

In [63]:
# use_label_encoder is deprecated in recent xgboost releases; passing an
# explicit eval_metric keeps older versions from warning about the default
xgb_clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_clf.fit(X_train, y_train)

y_train_pred = xgb_clf.predict(X_train)
y_test_pred = xgb_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)
Train Result:
________________________________________________
Accuracy Score: 95.16%
_______________________________________________
CLASSIFICATION REPORT:
                0.0       1.0  accuracy  macro avg  weighted avg
precision      0.96      0.93      0.95       0.94          0.95
recall         0.98      0.84      0.95       0.91          0.95
f1-score       0.97      0.88      0.95       0.93          0.95
support   719181.00 195135.00      0.95  914316.00     914316.00
_______________________________________________
Confusion Matrix: 
 [[706188  12993]
 [ 31270 163865]]

Test Result:
________________________________________________
Accuracy Score: 94.89%
_______________________________________________
CLASSIFICATION REPORT:
                0.0      1.0  accuracy  macro avg  weighted avg
precision      0.96     0.92      0.95       0.94          0.95
recall         0.98     0.83      0.95       0.91          0.95
f1-score       0.97     0.87      0.95       0.92          0.95
support   358208.00 97798.00      0.95  456006.00     456006.00
_______________________________________________
Confusion Matrix: 
 [[351288   6920]
 [ 16388  81410]]

In [64]:
disp = ConfusionMatrixDisplay.from_estimator(
    xgb_clf, X_test, y_test, 
    cmap='Blues', values_format='d', 
    display_labels=['Default', 'Fully-Paid']
)

disp = RocCurveDisplay.from_estimator(xgb_clf, X_test, y_test)
[Figure: confusion matrix and ROC curve for XGBoost on the test set]
In [65]:
# Score with predicted probabilities rather than hard labels
scores_dict['XGBoost'] = {
    'Train': roc_auc_score(y_train, xgb_clf.predict_proba(X_train)[:, 1]),
    'Test': roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1]),
}

Artificial Neural Networks (ANNs)¶

In [66]:
def evaluate_nn(true, pred, train=True):
    label = "Train" if train else "Test"
    clf_report = pd.DataFrame(classification_report(true, pred, output_dict=True))
    print(f"{label} Result:\n================================================")
    print(f"Accuracy Score: {accuracy_score(true, pred) * 100:.2f}%")
    print("_______________________________________________")
    print(f"CLASSIFICATION REPORT:\n{clf_report}")
    print("_______________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(true, pred)}\n")
        
def plot_learning_evolution(r):
    plt.figure(figsize=(12, 8))
    
    plt.subplot(2, 2, 1)
    plt.plot(r.history['loss'], label='Loss')
    plt.plot(r.history['val_loss'], label='val_Loss')
    plt.title('Loss evolution during training')
    plt.legend()

    plt.subplot(2, 2, 2)
    plt.plot(r.history['AUC'], label='AUC')
    plt.plot(r.history['val_AUC'], label='val_AUC')
    plt.title('AUC score evolution during training')
    plt.legend();

def nn_model(num_columns, num_labels, hidden_units, dropout_rates, learning_rate):
    inp = tf.keras.layers.Input(shape=(num_columns, ))
    x = BatchNormalization()(inp)
    x = Dropout(dropout_rates[0])(x)
    for i in range(len(hidden_units)):
        x = Dense(hidden_units[i], activation='relu')(x)
        x = BatchNormalization()(x)
        x = Dropout(dropout_rates[i + 1])(x)
    x = Dense(num_labels, activation='sigmoid')(x)
  
    model = Model(inputs=inp, outputs=x)
    model.compile(optimizer=Adam(learning_rate), loss='binary_crossentropy', metrics=[AUC(name='AUC')])
    return model
In [67]:
def setup_gpu():
    """Setup GPU configuration with error handling"""
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Enable memory growth
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            print(f"Using {len(gpus)} GPU(s) with memory growth")
        except RuntimeError as e:
            print(f"GPU setup error: {e}")
            print("Falling back to CPU")
            tf.config.set_visible_devices([], 'GPU')
    else:
        print("No GPU available, using CPU")
In [68]:
# Setup GPU
setup_gpu()

# --- Model hyperparameters ---
num_columns = X_train.shape[1]
num_labels = 1
hidden_units = [150, 150, 150]
dropout_rates = [0.1, 0, 0.1, 0]
learning_rate = 1e-3

# Build the network with the nn_model helper defined above
model = nn_model(
    num_columns=num_columns, 
    num_labels=num_labels,
    hidden_units=hidden_units,
    dropout_rates=dropout_rates,
    learning_rate=learning_rate
)

# Train with error handling
try:
    r = model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        epochs=20,
        batch_size=32,
        verbose=1
    )
    print("Training completed successfully!")
    
except Exception as e:
    print(f"Training error: {e}")
    print("Trying with CPU...")
    
    # Fallback to CPU
    tf.config.set_visible_devices([], 'GPU')
    model = nn_model(
        num_columns=num_columns, 
        num_labels=num_labels,
        hidden_units=hidden_units,
        dropout_rates=dropout_rates,
        learning_rate=learning_rate
    )
    
    r = model.fit(
        X_train, y_train,
        validation_data=(X_test, y_test),
        epochs=20,
        batch_size=32,
        verbose=1
    )
2025-11-18 10:14:02.963038: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
No GPU available, using CPU
Epoch 1/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 99s 3ms/step - AUC: 0.9584 - loss: 0.1863 - val_AUC: 0.9677 - val_loss: 0.1640
Epoch 2/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 96s 3ms/step - AUC: 0.9628 - loss: 0.1741 - val_AUC: 0.9683 - val_loss: 0.1669
Epoch 3/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 96s 3ms/step - AUC: 0.9639 - loss: 0.1713 - val_AUC: 0.9687 - val_loss: 0.1677
Epoch 4/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 110s 4ms/step - AUC: 0.9643 - loss: 0.1701 - val_AUC: 0.9686 - val_loss: 0.1705
Epoch 5/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 121s 4ms/step - AUC: 0.9644 - loss: 0.1699 - val_AUC: 0.9688 - val_loss: 0.1695
Epoch 6/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 121s 4ms/step - AUC: 0.9645 - loss: 0.1696 - val_AUC: 0.9687 - val_loss: 0.1749
Epoch 7/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 125s 4ms/step - AUC: 0.9649 - loss: 0.1690 - val_AUC: 0.9685 - val_loss: 0.1727
Epoch 8/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 122s 4ms/step - AUC: 0.9648 - loss: 0.1690 - val_AUC: 0.9690 - val_loss: 0.1804
Epoch 9/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 122s 4ms/step - AUC: 0.9651 - loss: 0.1686 - val_AUC: 0.9687 - val_loss: 0.1796
Epoch 10/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 122s 4ms/step - AUC: 0.9651 - loss: 0.1685 - val_AUC: 0.9688 - val_loss: 0.1663
Epoch 11/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 120s 4ms/step - AUC: 0.9651 - loss: 0.1682 - val_AUC: 0.9693 - val_loss: 0.1665
Epoch 12/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 122s 4ms/step - AUC: 0.9652 - loss: 0.1681 - val_AUC: 0.9682 - val_loss: 0.1813
Epoch 13/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 124s 4ms/step - AUC: 0.9653 - loss: 0.1678 - val_AUC: 0.9687 - val_loss: 0.2017
Epoch 14/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 123s 4ms/step - AUC: 0.9651 - loss: 0.1684 - val_AUC: 0.9678 - val_loss: 0.1829
Epoch 15/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 124s 4ms/step - AUC: 0.9654 - loss: 0.1677 - val_AUC: 0.9691 - val_loss: 0.2634
Epoch 16/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 127s 4ms/step - AUC: 0.9657 - loss: 0.1673 - val_AUC: 0.9686 - val_loss: 0.1743
Epoch 17/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 122s 4ms/step - AUC: 0.9656 - loss: 0.1671 - val_AUC: 0.9688 - val_loss: 0.1779
Epoch 18/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 123s 4ms/step - AUC: 0.9655 - loss: 0.1674 - val_AUC: 0.9689 - val_loss: 0.1688
Epoch 19/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 121s 4ms/step - AUC: 0.9655 - loss: 0.1673 - val_AUC: 0.9689 - val_loss: 0.1653
Epoch 20/20
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 124s 4ms/step - AUC: 0.9655 - loss: 0.1673 - val_AUC: 0.9685 - val_loss: 0.2408
Training completed successfully!
In [69]:
plot_learning_evolution(r)
[Figure: loss and AUC evolution over the 20 training epochs]
In [70]:
y_train_pred = model.predict(X_train)
evaluate_nn(y_train, y_train_pred.round(), train=True)
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 33s 1ms/step
Train Result:
================================================
Accuracy Score: 93.59%
_______________________________________________
CLASSIFICATION REPORT:
                0.0       1.0  accuracy  macro avg  weighted avg
precision      0.93      0.95      0.94       0.94          0.94
recall         0.99      0.74      0.94       0.87          0.94
f1-score       0.96      0.83      0.94       0.90          0.93
support   719181.00 195135.00      0.94  914316.00     914316.00
_______________________________________________
Confusion Matrix: 
 [[710799   8382]
 [ 50200 144935]]

In [71]:
y_test_pred = model.predict(X_test)
evaluate_nn(y_test, y_test_pred.round(), train=False)
14251/14251 ━━━━━━━━━━━━━━━━━━━━ 16s 1ms/step
Test Result:
================================================
Accuracy Score: 93.40%
_______________________________________________
CLASSIFICATION REPORT:
                0.0      1.0  accuracy  macro avg  weighted avg
precision      0.93     0.94      0.93       0.94          0.93
recall         0.99     0.74      0.93       0.86          0.93
f1-score       0.96     0.83      0.93       0.89          0.93
support   358208.00 97798.00      0.93  456006.00     456006.00
_______________________________________________
Confusion Matrix: 
 [[353616   4592]
 [ 25494  72304]]

In [72]:
scores_dict['ANNs'] = {
    'Train': roc_auc_score(y_train, model.predict(X_train).ravel()),
    'Test': roc_auc_score(y_test, model.predict(X_test).ravel()),
}
28573/28573 ━━━━━━━━━━━━━━━━━━━━ 33s 1ms/step
14251/14251 ━━━━━━━━━━━━━━━━━━━━ 17s 1ms/step
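One caveat before comparing models: `roc_auc_score` behaves very differently on hard 0/1 predictions than on probabilities, which is why the comparison below scores with `predict_proba`. This toy example (made-up scores) shows why thresholded labels understate AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.45, 0.4, 0.9])   # model probabilities (made up)
labels = (scores >= 0.5).astype(int)            # thresholded at 0.5

# Probabilities preserve the full ranking; labels collapse it to a single
# operating point, so the label-based AUC is lower
print(roc_auc_score(y_true, scores))   # 0.8333...
print(roc_auc_score(y_true, labels))   # 0.75
```

The same effect, at scale, is why AUCs computed from `predict()` output look pessimistic next to probability-based ones.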
In [73]:
ml_models = {
    'Logistic Regression': lr_clf,
    'Decision Tree': dt_clf,
    'GNB': gnb_clf,
    'Gradient Boosting': gb_clf,
    'Random Forest': rf_clf,
    'XGBoost': xgb_clf,
    'ANNs': model
}

for name, clf in ml_models.items():
    try:
        # Keras models output probabilities directly; flatten to 1-D
        if 'ANN' in name:
            y_pred_prob = clf.predict(X_test).ravel()
        elif hasattr(clf, "predict_proba"):
            y_pred_prob = clf.predict_proba(X_test)[:, 1]
        else:
            # Fallback for models without predict_proba: hard predictions,
            # clipped to [0, 1] so roc_auc_score receives valid scores
            y_pred_prob = clf.predict(X_test)
            y_pred_prob = np.clip(y_pred_prob, 0, 1)

        # Named auc_score to avoid shadowing sklearn.metrics.auc
        auc_score = roc_auc_score(y_test, y_pred_prob)
        print(f"{name.upper():30} roc_auc_score: {auc_score:.3f}")
        
    except Exception as e:
        print(f"{name.upper():30} error: {e}")
LOGISTIC REGRESSION            roc_auc_score: 0.964
DECISION TREE                  roc_auc_score: 0.973
GNB                            roc_auc_score: 0.925
GRADIENT BOOSTING              roc_auc_score: 0.974
RANDOM FOREST                  roc_auc_score: 0.973
XGBOOST                        roc_auc_score: 0.977
14251/14251 ━━━━━━━━━━━━━━━━━━━━ 17s 1ms/step
ANNS                           roc_auc_score: 0.968
In [74]:
scores_df = pd.DataFrame(scores_dict)
scores_df.hvplot.barh(
    width=500, height=400, 
    title="ROC Scores of ML Models", xlabel="ROC Scores", 
    alpha=0.4, legend='top'
)
Out[74]:
[Figure: horizontal bar chart of train/test ROC AUC for each model]
References¶

  • Lending Club Loan 💰 Defaulters 🏃‍♂ Prediction
  • Lending Club Loan Default Prediction Model Pyspark
  • Other loan-defaulter prediction tutorials on Kaggle