Lending Club Loan Prediction


A Bank Loan Default Prediction with Machine Learning Classification Model


Posted by Yun (Jessica) Yan on March 8, 2020

Lending Club - Loan Default Modeling


0. Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches
from sklearn.metrics import roc_curve, auc
from scipy import interp
In [2]:
pd.set_option('display.max_columns', 500)
In [93]:
loan = pd.read_csv("E:/Downloads/CreditNinja_ModelingChallenge/data.csv")
print(loan.shape)
loan.head()
(80000, 26)
Out[93]:
id loan_amnt term installment emp_length home_ownership annual_inc verification_status issue_d loan_status purpose addr_state dti earliest_cr_line fico_range_low fico_range_high last_credit_pull_d last_fico_range_high last_fico_range_low acc_now_delinq delinq_amnt delinq_2yrs mths_since_last_delinq mths_since_last_record inq_last_6mths inq_last_12m
0 88787390 30000 60 months 761.64 6 years RENT 100100.0 Verified Sep-2016 Current debt_consolidation SC 27.42 Jan-1976 690 694 Dec-2016 709 705 0 0 0 26.0 NaN 0 0.0
1 74722660 30000 60 months 692.93 10+ years MORTGAGE 235000.0 Source Verified Apr-2016 Current home_improvement PA 7.50 Jan-1995 670 674 Dec-2016 649 645 0 0 1 16.0 NaN 1 3.0
2 67919555 16000 60 months 379.39 6 years RENT 84000.0 Not Verified Jan-2016 Current debt_consolidation FL 27.87 Aug-1998 735 739 Dec-2016 739 735 0 0 0 80.0 NaN 0 2.0
3 54027458 14000 36 months 439.88 5 years RENT 50000.0 Source Verified Jul-2015 Fully Paid debt_consolidation WV 21.65 Oct-1996 725 729 Oct-2016 749 745 0 0 0 NaN 42.0 0 NaN
4 72594974 7500 36 months 252.67 3 years MORTGAGE 68000.0 Not Verified Feb-2016 Fully Paid debt_consolidation OH 29.09 Nov-2000 660 664 Dec-2016 709 705 0 0 0 NaN NaN 2 5.0

1. Data Review and Dependent Variable Definition

I chose loan_status as the dependent variable and turned the task into a binary classification problem by coding 'Default' as 1 and every other status as 0. Since we only care about identifying default clients, there is no need to distinguish between 'Current' and 'Fully Paid'.

In [94]:
# create new dependent variable: default
loan.loc[loan['loan_status'] == 'Default', 'Default'] = 1
loan.loc[loan['loan_status'] != 'Default', 'Default'] = 0

# drop the original unique id and loan_status column
loan = loan.drop(['id', 'loan_status'], axis=1)
print(loan.shape)
loan.head()
(80000, 25)
Out[94]:
loan_amnt term installment emp_length home_ownership annual_inc verification_status issue_d purpose addr_state dti earliest_cr_line fico_range_low fico_range_high last_credit_pull_d last_fico_range_high last_fico_range_low acc_now_delinq delinq_amnt delinq_2yrs mths_since_last_delinq mths_since_last_record inq_last_6mths inq_last_12m Default
0 30000 60 months 761.64 6 years RENT 100100.0 Verified Sep-2016 debt_consolidation SC 27.42 Jan-1976 690 694 Dec-2016 709 705 0 0 0 26.0 NaN 0 0.0 0.0
1 30000 60 months 692.93 10+ years MORTGAGE 235000.0 Source Verified Apr-2016 home_improvement PA 7.50 Jan-1995 670 674 Dec-2016 649 645 0 0 1 16.0 NaN 1 3.0 0.0
2 16000 60 months 379.39 6 years RENT 84000.0 Not Verified Jan-2016 debt_consolidation FL 27.87 Aug-1998 735 739 Dec-2016 739 735 0 0 0 80.0 NaN 0 2.0 0.0
3 14000 36 months 439.88 5 years RENT 50000.0 Source Verified Jul-2015 debt_consolidation WV 21.65 Oct-1996 725 729 Oct-2016 749 745 0 0 0 NaN 42.0 0 NaN 0.0
4 7500 36 months 252.67 3 years MORTGAGE 68000.0 Not Verified Feb-2016 debt_consolidation OH 29.09 Nov-2000 660 664 Dec-2016 709 705 0 0 0 NaN NaN 2 5.0 0.0
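Before cleaning the data, it is worth checking how rare the default class actually is; a one-line check on the new target (the default share of roughly 7.5% also appears in the describe() output later):

loan['Default'].value_counts(normalize=True)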

2. Data Cleansing

My data cleaning methodology and findings are listed below.

  • First, I reviewed all the variables. Some contain information that only becomes available after loan underwriting; because they would cause data leakage, I dropped them. I also dropped variables that are not relevant to a borrower's ability to repay the loan, such as the state information.
  • Second, four columns contain missing values. I dropped the columns with more than 40% missing values, and for the emp_length column, which has only a small share of missing values, I dropped the affected rows instead.
  • Third, the dti column contains a few outliers (9999); the rows containing them were already removed when the emp_length rows were dropped in the second step.
  • Fourth, I handled the nominal and ordinal variables. emp_length is the only ordinal variable, so I mapped its levels to numbers to turn it into a numerical feature the model can use. For the four nominal variables, I used get_dummies() to perform one-hot encoding.
  • Fifth, since some variables take very large values (for example annual_inc and dti), I normalized (standardized) the numerical features so they are on a comparable scale.

2.1 Select appropriate variables

First, there are several variables we would like to drop.

  • last_credit_pull_d: observed after the loan decision, so it would not be available at application time and would leak information about the outcome.
  • last_fico_range_high: observed after the loan decision, so it would not be available at application time and would leak information about the outcome.
  • last_fico_range_low: observed after the loan decision, so it would not be available at application time and would leak information about the outcome.
  • issue_d: determined at the loan decision itself and carries no information about the borrower's ability to repay.
  • addr_state: the borrower's state does not reflect his/her ability to repay the loan.
  • earliest_cr_line: the month the borrower's earliest reported credit line was opened is not directly relevant to his/her ability to repay the loan.
In [95]:
loan = loan.drop(['last_credit_pull_d','last_fico_range_high','last_fico_range_low','addr_state','issue_d','earliest_cr_line'], axis=1)
print(loan.shape)
loan.head()
(80000, 19)
Out[95]:
loan_amnt term installment emp_length home_ownership annual_inc verification_status purpose dti fico_range_low fico_range_high acc_now_delinq delinq_amnt delinq_2yrs mths_since_last_delinq mths_since_last_record inq_last_6mths inq_last_12m Default
0 30000 60 months 761.64 6 years RENT 100100.0 Verified debt_consolidation 27.42 690 694 0 0 0 26.0 NaN 0 0.0 0.0
1 30000 60 months 692.93 10+ years MORTGAGE 235000.0 Source Verified home_improvement 7.50 670 674 0 0 1 16.0 NaN 1 3.0 0.0
2 16000 60 months 379.39 6 years RENT 84000.0 Not Verified debt_consolidation 27.87 735 739 0 0 0 80.0 NaN 0 2.0 0.0
3 14000 36 months 439.88 5 years RENT 50000.0 Source Verified debt_consolidation 21.65 725 729 0 0 0 NaN 42.0 0 NaN 0.0
4 7500 36 months 252.67 3 years MORTGAGE 68000.0 Not Verified debt_consolidation 29.09 660 664 0 0 0 NaN NaN 2 5.0 0.0

2.2 Dealing with NAs

In [6]:
# check the NAs in each column
isnull = loan.isnull().sum().sort_values(ascending=False)/len(loan)
print('col_count:',isnull.count())
isnull
col_count: 19
Out[6]:
mths_since_last_record    0.818350
inq_last_12m              0.532563
mths_since_last_delinq    0.477675
emp_length                0.060862
Default                   0.000000
purpose                   0.000000
term                      0.000000
installment               0.000000
home_ownership            0.000000
annual_inc                0.000000
verification_status       0.000000
fico_range_low            0.000000
dti                       0.000000
fico_range_high           0.000000
acc_now_delinq            0.000000
delinq_amnt               0.000000
delinq_2yrs               0.000000
inq_last_6mths            0.000000
loan_amnt                 0.000000
dtype: float64
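Equivalently, the columns above the 40%-missing threshold can be listed explicitly before dropping them; a small sketch using the isnull series computed above:

# columns removed by the dropna(thresh=...) call below
cols_over_40pct = isnull[isnull > 0.4].index.tolist()
print(cols_over_40pct)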
In [7]:
# drop the columns with over 40% missing values
loan = loan.dropna(thresh=len(loan)*0.6, axis=1)

print(loan.shape)
loan.head()
(80000, 16)
Out[7]:
loan_amnt term installment emp_length home_ownership annual_inc verification_status purpose dti fico_range_low fico_range_high acc_now_delinq delinq_amnt delinq_2yrs inq_last_6mths Default
0 30000 60 months 761.64 6 years RENT 100100.0 Verified debt_consolidation 27.42 690 694 0 0 0 0 0.0
1 30000 60 months 692.93 10+ years MORTGAGE 235000.0 Source Verified home_improvement 7.50 670 674 0 0 1 1 0.0
2 16000 60 months 379.39 6 years RENT 84000.0 Not Verified debt_consolidation 27.87 735 739 0 0 0 0 0.0
3 14000 36 months 439.88 5 years RENT 50000.0 Source Verified debt_consolidation 21.65 725 729 0 0 0 0 0.0
4 7500 36 months 252.67 3 years MORTGAGE 68000.0 Not Verified debt_consolidation 29.09 660 664 0 0 0 2 0.0
In [8]:
loan.select_dtypes(include=[np.number]).describe().T
Out[8]:
count mean std min 25% 50% 75% max
loan_amnt 80000.0 15055.793750 8729.299624 1000.00 8000.00 13275.00 20000.00 40000.00
installment 80000.0 443.595158 255.693326 30.12 258.10 382.50 585.08 1536.95
annual_inc 80000.0 77789.952824 86486.732410 0.00 47000.00 65000.00 92500.00 8706582.00
dti 80000.0 19.446568 61.871127 0.00 12.43 18.44 25.17 9999.00
fico_range_low 80000.0 694.314000 30.744603 660.00 670.00 685.00 710.00 845.00
fico_range_high 80000.0 698.314112 30.745156 664.00 674.00 689.00 714.00 850.00
acc_now_delinq 80000.0 0.006488 0.084235 0.00 0.00 0.00 0.00 3.00
delinq_amnt 80000.0 16.692025 848.289643 0.00 0.00 0.00 0.00 110626.00
delinq_2yrs 80000.0 0.352275 0.942411 0.00 0.00 0.00 0.00 39.00
inq_last_6mths 80000.0 0.565950 0.863342 0.00 0.00 0.00 1.00 6.00
Default 80000.0 0.075463 0.264138 0.00 0.00 0.00 0.00 1.00
In [9]:
loan.select_dtypes(include=['object']).describe().T
Out[9]:
count unique top freq
term 80000 2 36 months 56346
emp_length 75131 11 10+ years 26972
home_ownership 80000 4 MORTGAGE 39054
verification_status 80000 3 Source Verified 33189
purpose 80000 13 debt_consolidation 46721
In [10]:
# drop the rows with missing values in the `emp_length` variable
loan = loan.dropna(axis=0)
print(loan.shape)
loan.head()
(75131, 16)
Out[10]:
loan_amnt term installment emp_length home_ownership annual_inc verification_status purpose dti fico_range_low fico_range_high acc_now_delinq delinq_amnt delinq_2yrs inq_last_6mths Default
0 30000 60 months 761.64 6 years RENT 100100.0 Verified debt_consolidation 27.42 690 694 0 0 0 0 0.0
1 30000 60 months 692.93 10+ years MORTGAGE 235000.0 Source Verified home_improvement 7.50 670 674 0 0 1 1 0.0
2 16000 60 months 379.39 6 years RENT 84000.0 Not Verified debt_consolidation 27.87 735 739 0 0 0 0 0.0
3 14000 36 months 439.88 5 years RENT 50000.0 Source Verified debt_consolidation 21.65 725 729 0 0 0 0 0.0
4 7500 36 months 252.67 3 years MORTGAGE 68000.0 Not Verified debt_consolidation 29.09 660 664 0 0 0 2 0.0
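Since the 9999 outliers in dti are said to disappear together with the rows that have missing emp_length, this can be confirmed with a one-line check (a sketch):

# expected to print 0 if the outlier rows were indeed removed in the previous step
print((loan['dti'] == 9999).sum())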

2.3 Dealing with Ordinal / Nominal variables

The emp_length variable is ordinal. Therefore, I assigned a number to each level to convert it into a numerical variable that can be fed into the model.

In [11]:
# ordinal encoding for emp_length
emp_length_map = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0
    }
}

loan = loan.replace(emp_length_map)
print(loan.shape)
loan.head()
(75131, 16)
Out[11]:
loan_amnt term installment emp_length home_ownership annual_inc verification_status purpose dti fico_range_low fico_range_high acc_now_delinq delinq_amnt delinq_2yrs inq_last_6mths Default
0 30000 60 months 761.64 6 RENT 100100.0 Verified debt_consolidation 27.42 690 694 0 0 0 0 0.0
1 30000 60 months 692.93 10 MORTGAGE 235000.0 Source Verified home_improvement 7.50 670 674 0 0 1 1 0.0
2 16000 60 months 379.39 6 RENT 84000.0 Not Verified debt_consolidation 27.87 735 739 0 0 0 0 0.0
3 14000 36 months 439.88 5 RENT 50000.0 Source Verified debt_consolidation 21.65 725 729 0 0 0 0 0.0
4 7500 36 months 252.67 3 MORTGAGE 68000.0 Not Verified debt_consolidation 29.09 660 664 0 0 0 2 0.0

As for the four nominal variables, I used pd.get_dummies() with drop_first=True for one-hot encoding, dropping the first level of each variable to avoid perfect collinearity among the dummy columns.

In [12]:
n_columns = ["term","home_ownership", "verification_status", "purpose"] 
dummy_df = pd.get_dummies(loan[n_columns],drop_first=True)
loan = pd.concat([loan, dummy_df], axis=1)
loan = loan.drop(n_columns, axis=1)
print(loan.shape)
loan.head()
(75131, 30)
Out[12]:
loan_amnt installment emp_length annual_inc dti fico_range_low fico_range_high acc_now_delinq delinq_amnt delinq_2yrs inq_last_6mths Default term_60 months home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT verification_status_Source Verified verification_status_Verified purpose_credit_card purpose_debt_consolidation purpose_home_improvement purpose_house purpose_major_purchase purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding
0 30000 761.64 6 100100.0 27.42 690 694 0 0 0 0 0.0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
1 30000 692.93 10 235000.0 7.50 670 674 0 0 1 1 0.0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
2 16000 379.39 6 84000.0 27.87 735 739 0 0 0 0 0.0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3 14000 439.88 5 50000.0 21.65 725 729 0 0 0 0 0.0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0
4 7500 252.67 3 68000.0 29.09 660 664 0 0 0 2 0.0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

2.4 Normalize the Data

In [13]:
columns = loan.columns
features = columns.drop('Default')
features_s = features.drop(dummy_df.columns)   # scale only the non-dummy (numeric) features

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
loan[features_s] = sc.fit_transform(loan[features_s])
print(loan.shape)
loan.head()
(75131, 30)
Out[13]:
loan_amnt installment emp_length annual_inc dti fico_range_low fico_range_high acc_now_delinq delinq_amnt delinq_2yrs inq_last_6mths Default term_60 months home_ownership_MORTGAGE home_ownership_OWN home_ownership_RENT verification_status_Source Verified verification_status_Verified purpose_credit_card purpose_debt_consolidation purpose_home_improvement purpose_house purpose_major_purchase purpose_medical purpose_moving purpose_other purpose_renewable_energy purpose_small_business purpose_vacation purpose_wedding
0 1.685021 1.220027 -0.004434 0.231708 0.945250 -0.139242 -0.139244 -0.078531 -0.019703 -0.376920 -0.655131 0.0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
1 1.685021 0.952119 1.072726 1.753647 -1.275759 -0.793480 -0.793469 -0.078531 -0.019703 0.675221 0.503220 0.0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
2 0.084683 -0.270408 -0.004434 0.050069 0.995424 1.332793 1.332763 -0.078531 -0.019703 -0.376920 -0.655131 0.0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3 -0.143937 -0.034550 -0.273724 -0.333519 0.301916 1.005674 1.005650 -0.078531 -0.019703 -0.376920 -0.655131 0.0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0
4 -0.886951 -0.764503 -0.812303 -0.130443 1.131449 -1.120599 -1.120581 -0.078531 -0.019703 -0.376920 1.661570 0.0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
In [14]:
loan.describe().T
Out[14]:
count mean std min 25% 50% 75% max
loan_amnt 75131.0 -3.289603e-16 1.000007 -1.629965 -0.786930 -0.143937 0.541922 2.828119
installment 75131.0 1.773333e-16 1.000007 -1.632250 -0.727461 -0.240151 0.562715 4.243047
emp_length 75131.0 -1.368099e-15 1.000007 -1.620173 -1.081593 -0.004434 1.072726 1.072726
annual_inc 75131.0 -3.025151e-16 1.000007 -0.857002 -0.356082 -0.142041 0.174170 97.329825
dti 75131.0 3.722471e-16 1.000007 -2.111982 -0.731656 -0.068252 0.677101 59.582347
fico_range_low 75131.0 5.781003e-16 1.000007 -1.120599 -0.793480 -0.302802 0.514995 4.931101
fico_range_high 75131.0 7.216102e-16 1.000007 -1.120581 -0.793469 -0.302800 0.514981 4.963713
acc_now_delinq 75131.0 1.778125e-15 1.000007 -0.078531 -0.078531 -0.078531 -0.078531 34.902169
delinq_amnt 75131.0 -1.362343e-15 1.000007 -0.019703 -0.019703 -0.019703 -0.019703 135.862351
delinq_2yrs 75131.0 4.843200e-16 1.000007 -0.376920 -0.376920 -0.376920 -0.376920 40.656587
inq_last_6mths 75131.0 -7.583344e-17 1.000007 -0.655131 -0.655131 -0.655131 0.503220 6.294972
Default 75131.0 7.481599e-02 0.263096 0.000000 0.000000 0.000000 0.000000 1.000000
term_60 months 75131.0 3.027246e-01 0.459440 0.000000 0.000000 0.000000 1.000000 1.000000
home_ownership_MORTGAGE 75131.0 4.907428e-01 0.499918 0.000000 0.000000 0.000000 1.000000 1.000000
home_ownership_OWN 75131.0 1.100478e-01 0.312951 0.000000 0.000000 0.000000 0.000000 1.000000
home_ownership_RENT 75131.0 3.991961e-01 0.489736 0.000000 0.000000 0.000000 1.000000 1.000000
verification_status_Source Verified 75131.0 4.310205e-01 0.495222 0.000000 0.000000 0.000000 1.000000 1.000000
verification_status_Verified 75131.0 2.710333e-01 0.444496 0.000000 0.000000 0.000000 1.000000 1.000000
purpose_credit_card 75131.0 2.305440e-01 0.421184 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_debt_consolidation 75131.0 5.860164e-01 0.492549 0.000000 0.000000 1.000000 1.000000 1.000000
purpose_home_improvement 75131.0 6.250416e-02 0.242071 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_house 75131.0 4.192677e-03 0.064615 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_major_purchase 75131.0 2.052415e-02 0.141786 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_medical 75131.0 1.019553e-02 0.100458 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_moving 75131.0 6.774833e-03 0.082031 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_other 75131.0 5.325365e-02 0.224540 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_renewable_energy 75131.0 6.788143e-04 0.026045 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_small_business 75131.0 9.636502e-03 0.097692 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_vacation 75131.0 6.162569e-03 0.078260 0.000000 0.000000 0.000000 0.000000 1.000000
purpose_wedding 75131.0 1.331009e-05 0.003648 0.000000 0.000000 0.000000 0.000000 1.000000
In [17]:
loan.to_csv('loan_clean.csv')

3. Analysis

3.1 Feature Selection - RFE

Using Recursive Feature Elimination (RFE) with a logistic regression estimator, I selected the 20 features most relevant to the dependent variable from the 29 candidates.

In [18]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

x_val = loan[features]
y_val = loan['Default']

estimator = LogisticRegression(class_weight='balanced',solver='liblinear')
rfe = RFE(estimator=estimator,n_features_to_select=20).fit(x_val, y_val)

x_chosed = features[rfe.support_]
x_chosed
Out[18]:
Index(['loan_amnt', 'installment', 'dti', 'fico_range_low', 'fico_range_high',
       'inq_last_6mths', 'term_60 months', 'home_ownership_MORTGAGE',
       'verification_status_Source Verified', 'verification_status_Verified',
       'purpose_credit_card', 'purpose_house', 'purpose_major_purchase',
       'purpose_medical', 'purpose_moving', 'purpose_other',
       'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding'],
      dtype='object')
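To see which features were eliminated, and how RFE ranked them, the fitted selector can be inspected directly; a small sketch, assuming the rfe object and features index from the cell above:

# features dropped by RFE and the full elimination ranking (1 = selected)
print(features[~rfe.support_])
print(pd.Series(rfe.ranking_, index=features).sort_values())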

3.2 Feature Selection - Pearson Correlation (Multicollinearity)

Based on the correlation heatmap, I dropped the variables that are highly correlated with another retained feature in order to reduce multicollinearity and improve model stability.

In [19]:
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(loan[x_chosed].corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x244dc7e1e88>
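Instead of eyeballing the heatmap, the highly correlated pairs can also be listed programmatically; a small sketch, assuming loan and x_chosed from the cells above (the 0.8 cutoff is an illustrative threshold):

# keep only the upper triangle of the absolute correlation matrix and report large values
corr = loan[x_chosed].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().loc[lambda s: s > 0.8].sort_values(ascending=False))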
In [20]:
drop_col = ['installment','fico_range_high','term_60 months','verification_status_Verified']
x_new = x_chosed.drop(drop_col)
x_new
Out[20]:
Index(['loan_amnt', 'dti', 'fico_range_low', 'inq_last_6mths',
       'home_ownership_MORTGAGE', 'verification_status_Source Verified',
       'purpose_credit_card', 'purpose_house', 'purpose_major_purchase',
       'purpose_medical', 'purpose_moving', 'purpose_other',
       'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding'],
      dtype='object')

3.3 Split the Data

In [21]:
# Partition the data
X = loan[x_new]
y = loan["Default"]

from sklearn.model_selection import train_test_split 

train_X, test_X, train_y, test_y = train_test_split(X,y,train_size=0.8, random_state = 1)
print ('Train Features: ',train_X.shape ,
      'Test Features: ',test_X.shape)
print ('Train Labels: ',train_y.shape ,
      'Test Labels: ',test_y.shape)
Train Features:  (60104, 16) Test Features:  (15027, 16)
Train Labels:  (60104,) Test Labels:  (15027,)

3.4 Balance the Data

The dataset is clearly imbalanced, since most borrowers repay their loans: only about 7.5% of the observations are defaults. My solution is to use SMOTE to oversample the minority class in the training set.

In [22]:
# class distribution before SMOTE (Default = 1 is the minority class)
n_sample = train_y.shape[0]
n_nondefault = train_y[train_y == 0].shape[0]
n_default = train_y[train_y == 1].shape[0]
print('Observations: {}; Non-default: {:.2%}; Default: {:.2%}'.format(n_sample,
                                                   n_nondefault / n_sample,
                                                   n_default / n_sample))
print('Features: ', train_X.shape[1])
Observations: 60104; Non-default: 92.58%; Default: 7.42%
Features:  16
In [23]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=1)
train_X, train_y = sm.fit_sample(train_X, train_y)
print('After SMOTE: ')
n_sample = train_y.shape[0]
n_nondefault = train_y[train_y == 0].shape[0]
n_default = train_y[train_y == 1].shape[0]
print('Observations: {}; Non-default: {:.2%}; Default: {:.2%}'.format(n_sample,
                                                   n_nondefault / n_sample,
                                                   n_default / n_sample))
After SMOTE: 
Observations: 111284; Non-default: 50.00%; Default: 50.00%

4. Modeling

4.1 Logistic Regression

In [24]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
model.fit(train_X,train_y)
Out[24]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
In [25]:
import sklearn.metrics as sklmetrics
predict_y=model.predict(test_X)
sklmetrics.accuracy_score(test_y, predict_y)
Out[25]:
0.5764290942969322
In [26]:
# Confusion Matrix
conf_mat = sklmetrics.confusion_matrix(test_y, predict_y, labels =[0,1])
conf_mat
Out[26]:
array([[7938, 5930],
       [ 435,  724]], dtype=int64)
In [27]:
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(test_y, predict_y)
print("Area under the ROC curve : %f" % roc_auc)
Area under the ROC curve : 0.598537
In [28]:
def plot_feature_importance_coeff(model, Xnames, cls_nm = None):

    imp_features = pd.DataFrame(np.column_stack((Xnames, model.coef_.ravel())), columns = ['feature', 'importance'])
    imp_features[['importance']] = imp_features[['importance']].astype(float)
    imp_features[['abs_importance']] = imp_features[['importance']].abs()
    # sort the features by the absolute value of the coefficient
    imp_features = imp_features.sort_values(by=['abs_importance'], ascending=True)
    
    # plot the signed logistic-regression coefficients as a horizontal bar chart
    plt.figure()
    plt.title(cls_nm + " - Feature Importance")
    plt.barh(range(imp_features.shape[0]), imp_features['importance'],
            color="b", align="center")
    plt.yticks(range(imp_features.shape[0]), imp_features['feature'], )
    plt.ylim([-1, imp_features.shape[0]])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout() 
    plt.savefig(cls_nm + "_feature_imp.png", bbox_inches='tight')
    plt.show()
In [29]:
plot_feature_importance_coeff(model, X.columns, cls_nm="Logistic Regression")

We found that the purpose of the specific loan matters a great deal for its default probability, and that the FICO score at application has a negative effect on the default probability. These features deserve extra attention when processing a loan application.

Moreover, dti, the number of inquiries in the past 6 months, and the loan amount are all positively related to the default probability. This matches the intuition behind the business rules discussed in the Last Thoughts section: the heavier the financial burden, the higher the risk of default.
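The signed coefficients behind this interpretation can also be listed as a table; a quick sketch, assuming the fitted model and the feature frame X from the cells above:

# negative coefficients reduce the predicted default probability, positive ones increase it
coef = pd.Series(model.coef_.ravel(), index=X.columns).sort_values()
print(coef)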

4.2 Grid Search CV

In [57]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': 10.**np.arange(-5, 5),
              'penalty': ['l1', 'l2']}

grid_search = GridSearchCV(LogisticRegression(solver='liblinear'),  param_grid, cv=10)
grid_search.fit(train_X, train_y)
Out[57]:
GridSearchCV(cv=10, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': array([1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02,
       1.e+03, 1.e+04]),
                         'penalty': ['l1', 'l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [31]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.5f}".format(grid_search.best_score_))
Best parameters: {'C': 10.0, 'penalty': 'l2'}
Best cross-validation score: 0.61060
In [32]:
predict_y=grid_search.predict(test_X)
sklmetrics.accuracy_score(test_y, predict_y)
Out[32]:
0.5762960005323751
In [33]:
# Confusion Matrix
conf_mat = sklmetrics.confusion_matrix(test_y, predict_y, labels =[0,1])
conf_mat
Out[33]:
array([[7935, 5933],
       [ 434,  725]], dtype=int64)
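Because accuracy is not very informative on an imbalanced test set, the same search can be scored by ROC AUC instead; a sketch, assuming param_grid and the SMOTE-resampled train_X / train_y from above:

# identical grid, but model selection driven by ROC AUC rather than accuracy
grid_search_auc = GridSearchCV(LogisticRegression(solver='liblinear'),
                               param_grid, cv=10, scoring='roc_auc')
grid_search_auc.fit(train_X, train_y)
print(grid_search_auc.best_params_, grid_search_auc.best_score_)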

4.3 K-fold Cross Validation

We used the optimal hyperparameters (C=10.0, penalty='l2') found by the grid search above.

In [73]:
from sklearn.model_selection import cross_val_predict, KFold, cross_val_score
lr = LogisticRegression(C=10.0,penalty='l2',solver='liblinear')
kf = KFold(n_splits=10, shuffle=True)
cross_val_score(lr,train_X, train_y,cv=kf,scoring='accuracy').mean()
Out[73]:
0.610590864865008
In [84]:
fig1 = plt.figure(figsize=[12,12])
ax1 = fig1.add_subplot(111,aspect = 'equal')
ax1.add_patch(
    patches.Arrow(0.45,0.5,-0.25,0.25,width=0.3,color='green',alpha = 0.5)
    )
ax1.add_patch(
    patches.Arrow(0.5,0.45,0.25,-0.25,width=0.3,color='red',alpha = 0.5)
    )

tprs = []
aucs = []
mean_fpr = np.linspace(0,1,100)
i = 1
for train,test in kf.split(train_X, train_y):
    prediction = lr.fit(train_X.iloc[train],train_y.iloc[train]).predict_proba(train_X.iloc[test])
    fpr, tpr, t = roc_curve(train_y[test], prediction[:, 1])
    tprs.append(interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i= i+1

plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32,0.7,'More accurate area',fontsize = 12)
plt.text(0.63,0.4,'Less accurate area',fontsize = 12)
plt.show()

The cross-validated accuracy on the balanced training data is about 61%, consistent with the grid search result, and the AUC is relatively stable across the different folds.

In addition, a heavier penalty could be placed on false negatives, since classifying a borrower who will default as non-default is more costly than the opposite error. Incorporating that cost would make the model more useful to the business even if plain accuracy does not improve; a sketch of two ways to do this follows.
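The class weight and threshold below are illustrative placeholders, not values tuned on this data; the sketch assumes train_X, train_y, and test_X from the cells above:

# option 1: weight the default class (label 1) more heavily in the loss
lr_weighted = LogisticRegression(C=10.0, penalty='l2', solver='liblinear',
                                 class_weight={0: 1, 1: 3})
lr_weighted.fit(train_X, train_y)

# option 2: keep the tuned model but lower the decision threshold for flagging a default
proba = lr.fit(train_X, train_y).predict_proba(test_X)[:, 1]
predict_y_conservative = (proba > 0.3).astype(int)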

Last Thoughts

Are there any ways to derive additional variables that would improve the model prediction accuracy?

  • dti_range: Since the original dti variable is continuous and has to be scaled before being fed into the model, we could instead bin it into several groups; for example, an observation with dti > 10 and dti <= 20 would fall into a dti_range_1020 group (see the sketch after this list).
  • annual_inc_range: The same idea applied to annual_inc.
  • loan_pay_times: When making decisions, one of our assumptions is that all defaults are created equal. A borrower who defaults on the first payment should not be treated the same as one who defaults on the very last payment, so a label that distinguishes different default behaviours is needed.
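A minimal sketch of the binning idea with pd.cut, applied to the raw (unscaled) columns; loan_raw stands for a hypothetical copy of the data taken before standardization, and the bin edges are placeholders rather than values tuned on this dataset:

# bin dti into labelled ranges
dti_bins = [-np.inf, 10, 20, 30, np.inf]
dti_labels = ['dti_le10', 'dti_range_1020', 'dti_range_2030', 'dti_gt30']
loan_raw['dti_range'] = pd.cut(loan_raw['dti'], bins=dti_bins, labels=dti_labels)

# the same idea applied to annual income
inc_bins = [-np.inf, 40000, 80000, 120000, np.inf]
loan_raw['annual_inc_range'] = pd.cut(loan_raw['annual_inc'], bins=inc_bins)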

What variables, if any, can be used to create business rules that decline a customer's application before the model runs? (A sketch of such rules follows the list below.)

  • Prior defaults: If a customer has several previous default records (for example within the last 12 months), we need to be careful when considering, or simply decline, his/her application.
  • Employment length (emp_length) and income (annual_inc): A very short employment history or a low income suggests weaker ability to repay the loan, so these variables can be used to decline an application before the model runs.
  • Low FICO / credit score: the loan company would not want to lend to a person with a very low FICO / credit score.
  • High dti / large loan amount: dti reflects a person's debt burden relative to income, so a high dti gives the lender less confidence in the applicant's ability to repay a new loan. Likewise, a large loan amount places a heavy financial burden on the applicant. Both raise the risk of the loan not being paid back.
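A minimal sketch of such pre-model decline rules; every threshold below is an illustrative placeholder, not a value derived from this dataset:

def pre_model_decline(applicant):
    """Return True if the application should be declined before model scoring."""
    if applicant['fico_range_low'] < 620:    # very low credit score (placeholder cutoff)
        return True
    if applicant['dti'] > 45:                # heavy debt burden (placeholder cutoff)
        return True
    if applicant['annual_inc'] < 20000:      # very low income (placeholder cutoff)
        return True
    if applicant['emp_length'] < 1:          # very short employment history (placeholder cutoff)
        return True
    return False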