Financial Data Analysis - Data Processing 1: Loan Eligibility Prediction


By Sabber Ahamed, Computational Geophysicist and Machine Learning Enthusiast


Introduction


Financial institutions and companies have been using predictive analytics for quite a long time. More recently, the availability of computational resources and tremendous progress in machine-learning research have made much better data analysis, and hence better prediction, possible. In this series of articles, I explain how to create a predictive loan model that identifies bad applicants, those who are more likely to be charged off. Step by step, I show how to process the raw data, clean up its unnecessary parts, select relevant features, perform exploratory data analysis, and finally build a model.

As an example, I use the Lending Club loan dataset. Lending Club is the world’s largest online marketplace connecting borrowers and investors, and an inevitable outcome of lending is default by borrowers. The idea of this tutorial is to create a predictive model that identifies applicants who are relatively risky for a loan. To accomplish this, I have organized the series into four parts:

  • Data processing-1: In this first part, I show how to clean the data and remove unnecessary features. Data processing is very time-consuming, but better data produce a better model, so a careful and very detailed examination is required. I show how to identify constant features, duplicate features, duplicate rows, and features with a high number of missing values.
  • Data processing-2: In this part, I manually go through every feature selected in part 1. This is the most time-consuming step, but worth it for a better model.
  • EDA: In this part, I do some exploratory data analysis (EDA) on the features selected in parts 1 and 2. A good EDA is required to gain a better knowledge of the domain; we need to spend some quality time finding the relations between the features.
  • Create a model: Finally, in this last part, I create the models. Creating a model is not an easy task either; it is an iterative process. I show how to start with a simple model and then slowly add complexity for better performance.

Alright, let’s get started with part 1: data processing, cleaning, and feature selection.

Data processing-1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In this project, I used three years of data (2014, 2015, and the first through third quarters of 2017), stored in five separate CSV files. Let’s read the files first:

df1 = pd.read_csv('./data/2017Q1.csv', skiprows=[0])
df2 = pd.read_csv('./data/2017Q2.csv', skiprows=[0])
df3 = pd.read_csv('./data/2017Q3.csv', skiprows=[0])
df4 = pd.read_csv('./data/2014.csv', skiprows=[0])
df5 = pd.read_csv('./data/2015.csv', skiprows=[0])

Since the data are stored in separate files, we have to make sure that each file has the same set of features. We can check this with the following code snippet:

columns = np.dstack((list(df1.columns), list(df2.columns), list(df3.columns),
                     list(df4.columns), list(df5.columns)))
coldf = pd.DataFrame(columns[0])

The above code is self-explanatory: we first extract the column names, then stack them together using NumPy’s ‘dstack’ function. If you look at the Jupyter notebook on GitHub, you will see that the columns are the same, which is good for us.
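If you would rather verify this programmatically than by eye, a one-line check along the following lines works (my own addition, not from the original notebook; it reuses the ‘coldf’ frame built above):

# Each row of coldf holds one column name across the five files; if every
# row contains exactly one unique value, all files share the same columns.
assert coldf.nunique(axis=1).eq(1).all(), "column names differ across files"

We can move on to the next step. It’s time to check the shape of the data: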

df = pd.concat([df1, df2, df3, df4, df5])
df.shape
(981665, 151)

We see that there are approximately one million examples, and that each example has 151 features, including the target variable. Let’s look at the feature names to get familiar with the data. It is imperative to get to know the domain, especially the details of each feature’s relationship with the target variable. This is not something you learn overnight, which is why it is worth spending a few days, or maybe a week, getting familiar with the data before jumping into further detailed analysis. Let’s see the feature names:

print(list(df.columns))
['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'last_fico_range_high', 'last_fico_range_low', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'revol_bal_joint', 'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'sec_app_mths_since_last_major_derog', 'hardship_flag', 'hardship_type', 'hardship_reason', 'hardship_status', 'deferral_term', 'hardship_amount', 'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount', 'disbursement_method', 'debt_settlement_flag', 'debt_settlement_flag_date', 'settlement_status', 'settlement_date', 'settlement_amount', 'settlement_percentage', 'settlement_term']

The list of features above may seem scary at first, but we will get through every one of them and then select the relevant ones. Let’s start with the target feature, “loan_status”:

df.loan_status.value_counts()
Current               500937
Fully Paid            358629
Charged Off            99099
Late (31-120 days)     13203
In Grace Period         6337
Late (16-30 days)       3414
Default                   36
Name: loan_status, dtype: int64

We see that there are seven types of loan status. In this tutorial, however, we are interested in only two classes: 1) Fully Paid, borrowers who paid back the loan with interest, and 2) Charged Off, borrowers who could not pay and were finally charged off. Therefore, we select the rows belonging to these two classes:

df = df.loc[(df['loan_status'].isin(['Fully Paid', 'Charged Off']))]
df.shape
(457728, 151)

Looking at the shape, we see that we now have about half as many data points as in the original data, with the same number of features. Before processing and cleaning manually, let’s do some general data-processing steps first:

  • Remove features with more than 85% missing values
  • Remove constant features
  • Remove duplicate features
  • Remove duplicate rows
  • Remove highly collinear features (in part 3, EDA)

Alright, let’s get started with the typical data processing:

1. Remove features with more than 85% missing values: In the code below, I first use pandas’ built-in ‘isnull()’ method to flag the missing values. Then I sum them up to get the count for each feature. Finally, I sort the features by their number of missing values and create a data frame for further analysis.
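The exact snippet lives in the notebook on GitHub; a minimal sketch of what the text describes might look like this (the names ‘missing’ and ‘missing_df’ are placeholders of mine):

# Count missing values per feature and sort in descending order.
missing = df.isnull().sum().sort_values(ascending=False)
missing_df = pd.DataFrame({'feature': missing.index,
                           'n_missing': missing.values})
print(missing_df.head(10))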

In the result above, we see that there are 53 features with roughly 400,000 missing values each. I use pandas’ drop method to remove these 53 features. Notice that I set the ‘inplace’ option to True, which removes these features from the original data frame df without returning anything.
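Concretely, the drop might look like the following sketch, which reuses the hypothetical ‘missing_df’ frame from the block above:

# Drop every feature that is missing in more than 85% of the rows.
threshold = 0.85 * len(df)
cols_to_drop = missing_df.loc[missing_df['n_missing'] > threshold,
                              'feature'].tolist()
df.drop(columns=cols_to_drop, inplace=True)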

2. Remove constant features: At this step, we remove features that have a single unique value. A feature with only one unique value cannot help the model generalize, since its variance is zero; a tree-based model, for instance, cannot split on such a feature. Identifying features with a single unique value is relatively straightforward:
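The original snippet is in the notebook on GitHub; a minimal reconstruction consistent with the description below might be:

def find_constant_features(dataframe):
    # Collect every column that holds fewer than two unique values.
    constant_features = []
    for column in dataframe.columns:
        if dataframe[column].nunique() < 2:
            constant_features.append(column)
    return constant_features

const_features = find_constant_features(df)
print('Constant features:', const_features)
df.drop(columns=const_features, inplace=True)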

In the above code, I create a function, ‘find_constant_features’, to identify constant features. The function goes through each feature and checks whether it has fewer than two unique values; if so, the feature is added to the constant-feature list. We could also find constant features by looking at the variance or standard deviation: if either is zero, the feature holds a single unique value. The print statement shows that five features have a single unique value, so we remove them with the ‘inplace’ option set to True.

3. Remove duplicate features: Duplicate features are columns that hold exactly the same values, whether under the same or a different name. To find the duplicate features, I borrowed the following code from this Stack Overflow link:
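Since the embedded snippet does not render in this copy, here is a rough sketch of the Stack Overflow approach (my own paraphrase; pandas’ ‘Series.equals’ treats NaNs in matching positions as equal):

def duplicate_columns(frame):
    # Return names of columns whose values exactly duplicate an earlier column.
    duplicates = []
    cols = list(frame.columns)
    for i in range(len(cols)):
        if cols[i] in duplicates:
            continue
        for j in range(i + 1, len(cols)):
            if cols[j] not in duplicates and frame[cols[i]].equals(frame[cols[j]]):
                duplicates.append(cols[j])
    return duplicates

print(duplicate_columns(df))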

We see only one feature that appears to be duplicated. I am not going to remove it yet; rather, I will wait until we do the EDA in the next part.

4. Remove duplicate rows: In this step, we remove all duplicate rows. I use pandas’ built-in ‘drop_duplicates(inplace=True)’ method to perform this action:

df.drop_duplicates(inplace=True)

These four processing steps are basics that we need to perform in any data science project. Let’s see the shape of the data after all of these steps:

df.shape
(457728, 93)

We see that we have 93 features after performing the above steps.

In the next part of this tutorial, I will go through each feature, perform cleaning, and remove features where necessary. In the meantime, if you have any questions about this part, please feel free to write your comment below, or reach out to me directly.


Bio: Sabber Ahamed is the Founder of xoolooloo.com. Computational Geophysicist and Machine Learning Enthusiast.

Original. Reposted with permission.

Related:

  • Text Mining on the Command Line
  • Three techniques to improve machine learning model performance with imbalanced datasets


