Lending Club Loan Prediction
Bank Loan Default Prediction with a Machine Learning Classification Model
Posted by Yun (Jessica) Yan on March 8, 2020
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches
from sklearn.metrics import roc_curve, auc
pd.set_option('display.max_columns', 500)
loan = pd.read_csv("E:/Downloads/CreditNinja_ModelingChallenge/data.csv")
print(loan.shape)
loan.head()
I chose loan_status as the dependent variable and turned it into a binary classification problem by coding 'Default' as 1 and every other status as 0. Since we only want to focus on defaulting clients, there is no need to distinguish between 'Current' and 'Fully Paid'.
# create new dependent variable: default
loan.loc[loan['loan_status'] == 'Default', 'Default'] = 1
loan.loc[loan['loan_status'] != 'Default', 'Default'] = 0
# drop the original unique id and loan_status column
loan = loan.drop(['id', 'loan_status'], axis=1)
print(loan.shape)
loan.head()
My data cleaning methodology and findings are listed below. Problematic values related to the dti column had already been dropped in the second step. Because variables such as annual_inc and dti sit on very different scales, I decided to normalize the data to ensure the model performance.
First, there are several variables we would like to drop.
loan = loan.drop(['last_credit_pull_d','last_fico_range_high','last_fico_range_low','addr_state','issue_d','earliest_cr_line'], axis=1)
print(loan.shape)
loan.head()
# check the NAs in each column
isnull = loan.isnull().sum().sort_values(ascending=False)/len(loan)
print('col_count:',isnull.count())
isnull
# drop the columns with over 40% missing values (i.e. keep columns with at least 60% non-null values)
loan = loan.dropna(thresh=len(loan)*0.6, axis=1)
print(loan.shape)
loan.head()
loan.select_dtypes(include=[np.number]).describe().T
loan.select_dtypes(include=['object']).describe().T
# drop the remaining rows with missing values (mainly in the `emp_length` variable)
loan = loan.dropna(axis=0)
print(loan.shape)
loan.head()
The emp_length variable is an ordinal variable. Therefore, I assigned a number to each level to convert it into a numerical variable that can be fed to the model.
emp_length_map = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0
}
}
loan = loan.replace(emp_length_map)
print(loan.shape)
loan.head()
As for the four nominal variables, I used the get_dummies() function to conduct one-hot encoding.
n_columns = ["term","home_ownership", "verification_status", "purpose"]
dummy_df = pd.get_dummies(loan[n_columns],drop_first=True)
loan = pd.concat([loan, dummy_df], axis=1)
loan = loan.drop(n_columns, axis=1)
print(loan.shape)
loan.head()
columns = loan.columns
features = columns.drop('Default')
features_s = columns.drop('Default').drop(dummy_df.columns)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
loan[features_s] = sc.fit_transform(loan[features_s])
print(loan.shape)
loan.head()
loan.describe().T
loan.to_csv('loan_clean.csv')
Using Recursive Feature Elimination, I selected 20 out of 30 variables that are most relevant to the dependent variable.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
x_val = loan[features]
y_val = loan['Default']
estimator = LogisticRegression(class_weight='balanced',solver='liblinear')
rfe = RFE(estimator=estimator,n_features_to_select=20).fit(x_val, y_val)
x_chosed = features[rfe.support_]
x_chosed
Based on the correlation heatmap, I dropped some variables that are highly correlated with other features in order to avoid multicollinearity and improve the model performance.
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(loan[x_chosed].corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
drop_col = ['installment','fico_range_high','term_60 months','verification_status_Verified']
x_new = x_chosed.drop(drop_col)
x_new
# Partition the data
X = loan[x_new]
y = loan["Default"]
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X,y,train_size=0.8, random_state = 1)
print ('Train Features: ',train_X.shape ,
'Test Features: ',test_X.shape)
print ('Train Labels: ',train_y.shape ,
'Test Labels: ',test_y.shape)
Obviously, our dataset is not very balanced, since most borrowers repay their loans. My solution is to use the SMOTE() function to balance the training dataset.
# SMOTE
n_sample = train_y.shape[0]
n_nondefault = train_y[train_y == 0].shape[0]
n_default = train_y[train_y == 1].shape[0]
print('Observations: {}; Non-default: {:.2%}; Default: {:.2%}'.format(
    n_sample, n_nondefault / n_sample, n_default / n_sample))
print('Features: ', train_X.shape[1])
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=1)
train_X, train_y = sm.fit_resample(train_X, train_y)
print('After SMOTE: ')
n_sample = train_y.shape[0]
n_nondefault = train_y[train_y == 0].shape[0]
n_default = train_y[train_y == 1].shape[0]
print('Observations: {}; Non-default: {:.2%}; Default: {:.2%}'.format(
    n_sample, n_nondefault / n_sample, n_default / n_sample))
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
model.fit(train_X,train_y)
import sklearn.metrics as sklmetrics
predict_y=model.predict(test_X)
sklmetrics.accuracy_score(test_y, predict_y)
# Confusion Matrix
conf_mat = sklmetrics.confusion_matrix(test_y, predict_y, labels =[0,1])
conf_mat
from sklearn.metrics import roc_auc_score
# use predicted probabilities rather than hard class labels for the AUC
roc_auc = roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])
print("Area under the ROC curve : %f" % roc_auc)
def plot_feature_importance_coeff(model, Xnames, cls_nm=None):
    imp_features = pd.DataFrame(np.column_stack((Xnames, model.coef_.ravel())), columns=['feature', 'importance'])
    imp_features[['importance']] = imp_features[['importance']].astype(float)
    imp_features[['abs_importance']] = imp_features[['importance']].abs()
    # Sort the features by the absolute value of the coefficient
    imp_features = imp_features.sort_values(by=['abs_importance'], ascending=[1])
    # Plot the signed coefficients as feature importances
    plt.figure()
    plt.title(cls_nm + " - Feature Importance")
    plt.barh(range(imp_features.shape[0]), imp_features['importance'],
             color="b", align="center")
    plt.yticks(range(imp_features.shape[0]), imp_features['feature'])
    plt.ylim([-1, imp_features.shape[0]])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.savefig(cls_nm + "_feature_imp.png", bbox_inches='tight')
    plt.show()
plot_feature_importance_coeff(model, X.columns, cls_nm="Logistic Regression")
We found that the purpose of the specific loan matters a great deal for its default probability. Also, the FICO score at application has a negative effect on the default probability. We should weigh these features more carefully when processing a loan application.
Moreover, we found that dti, the number of inquiries in the past 6 months, and the loan amount have a positive relationship with the default probability. This aligns with the business rules we created before the model runs: the heavier the financial burden, the higher the risk that the loan will default.
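For reference, the signed coefficients behind the plot can also be inspected as a plain table. This is a small sketch that simply reuses the fitted model and feature list from above.
# Tabulate the signed logistic regression coefficients shown in the plot above
coef_table = pd.DataFrame({'feature': X.columns, 'coefficient': model.coef_.ravel()})
print(coef_table.sort_values('coefficient', ascending=False))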
from sklearn.model_selection import GridSearchCV
param_grid = {'C': 10.**np.arange(-5, 5),
'penalty': [ 'l1', 'l2']}
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=10)
grid_search.fit(train_X, train_y)
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.5f}".format(grid_search.best_score_))
predict_y=grid_search.predict(test_X)
sklmetrics.accuracy_score(test_y, predict_y)
# Confusion Matrix
conf_mat = sklmetrics.confusion_matrix(test_y, predict_y, labels =[0,1])
conf_mat
We used the optimal hyperparameters (C=10.0, penalty='l2') found in the grid search above.
from sklearn.model_selection import cross_val_predict, KFold, cross_val_score
lr = LogisticRegression(C=10.0,penalty='l2',solver='liblinear')
kf = KFold(n_splits=10, shuffle=True)
cross_val_score(lr,train_X, train_y,cv=kf,scoring='accuracy').mean()
fig1 = plt.figure(figsize=[12,12])
ax1 = fig1.add_subplot(111,aspect = 'equal')
ax1.add_patch(
patches.Arrow(0.45,0.5,-0.25,0.25,width=0.3,color='green',alpha = 0.5)
)
ax1.add_patch(
patches.Arrow(0.5,0.45,0.25,-0.25,width=0.3,color='red',alpha = 0.5)
)
tprs = []
aucs = []
mean_fpr = np.linspace(0,1,100)
i = 1
for train, test in kf.split(train_X, train_y):
    prediction = lr.fit(train_X.iloc[train], train_y.iloc[train]).predict_proba(train_X.iloc[test])
    fpr, tpr, t = roc_curve(train_y.iloc[test], prediction[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i + 1
plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32,0.7,'More accurate area',fontsize = 12)
plt.text(0.63,0.4,'Less accurate area',fontsize = 12)
plt.show()
The model accuracy improved to 61% after cross-validation. Also, we can see that the AUC is relatively stable across the different folds.
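As a quick sanity check on that stability claim, the per-fold AUC values collected in the aucs list above can be summarized directly (a minimal sketch):
# Summarize the spread of the per-fold AUC values from the loop above
print('Fold AUC: mean = {:.3f}, std = {:.3f}'.format(np.mean(aucs), np.std(aucs)))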
In addition, a penalty could be placed on false negatives, since classifying a borrower who will default as non-defaulting is more serious than the opposite error. This could further improve our results.
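One simple way to approximate such a penalty in scikit-learn is through the class_weight argument of LogisticRegression. The sketch below is illustrative only, and the 1:5 weighting is an assumed value rather than a tuned one.
# Illustrative sketch: weight the default class more heavily so that
# false negatives are penalized harder during training.
# The {0: 1, 1: 5} weighting is an assumption, not a tuned value.
lr_weighted = LogisticRegression(C=10.0, penalty='l2', solver='liblinear',
                                 class_weight={0: 1, 1: 5})
lr_weighted.fit(train_X, train_y)
print(sklmetrics.confusion_matrix(test_y, lr_weighted.predict(test_X), labels=[0, 1]))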
Are there any ways to derive additional variables that would improve the model's prediction accuracy?
What variables, if any, can be used to create business rules to decline a customer's application before the model runs?
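As a starting point for both questions, here is a hedged sketch: derive a loan-to-income ratio as an additional variable and use it as a simple pre-model decline rule. The loan_amnt column name and the 40% cut-off are assumptions for illustration, not validated choices.
# Illustrative sketch only -- the loan_amnt column and the 0.4 cut-off are assumptions.
raw = pd.read_csv("E:/Downloads/CreditNinja_ModelingChallenge/data.csv")
# Derived variable: requested loan amount relative to annual income
raw['loan_to_income'] = raw['loan_amnt'] / raw['annual_inc'].replace(0, np.nan)
# Example business rule applied before the model runs:
# flag applications where the loan exceeds 40% of annual income
declined = raw[raw['loan_to_income'] > 0.4]
print('Applications flagged by rule: {} of {}'.format(len(declined), len(raw)))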