Financial Data Analysis - Data Processing 1: Loan Eligibility Prediction


By Sabber Ahamed, Computational Geophysicist and Machine Learning Enthusiast


Introduction


Financial institutions and companies have been using predictive analytics for quite a long time. More recently, the availability of computational resources and tremendous progress in machine-learning research have made much better data analysis, and hence better prediction, possible. In this series of articles, I explain how to create a predictive loan model that identifies bad applicants, those who are more likely to be charged off. Step by step, I show how to process the raw data, clean up its unnecessary parts, select relevant features, perform exploratory data analysis, and finally build a model.

As an example, I use the Lending Club loan dataset. Lending Club is the world’s largest online marketplace connecting borrowers and investors, and an inevitable outcome of lending is default by borrowers. The idea of this tutorial is to create a predictive model that identifies applicants who are relatively risky for a loan. To accomplish this, I have organized the series into four parts:

  • Data processing-1: In this first part, I show how to clean the data and remove unnecessary features. Data processing is very time-consuming, but better data produce a better model, so a careful and very detailed examination is required. I show how to identify constant features, duplicate features, duplicate rows, and features with a high number of missing values.
  • Data processing-2: In this part, I manually go through every feature selected in part 1. This is the most time-consuming step, but worth it for a better model.
  • EDA: In this part, I do some exploratory data analysis (EDA) on the features selected in parts 1 and 2. A good EDA is required to gain a better knowledge of the domain; we need to spend some quality time finding the relations between the features.
  • Create a model: Finally, in this last part, I create the models. Creating a model is not an easy task either; it is an iterative process. I show how to start with a simple model and then slowly add complexity for better performance.

Alright, let’s get started with part 1: data processing, cleaning, and feature selection.

Data processing-1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In this project, I used three years of data (2014, 2015, and the first through third quarters of 2017), stored in five separate CSV files. Let’s read the files first:

df1 = pd.read_csv('./data/2017Q1.csv', skiprows=[0])
df2 = pd.read_csv('./data/2017Q2.csv', skiprows=[0])
df3 = pd.read_csv('./data/2017Q3.csv', skiprows=[0])
df4 = pd.read_csv('./data/2014.csv', skiprows=[0])
df5 = pd.read_csv('./data/2015.csv', skiprows=[0])

Since the data are stored in separate files, we have to make sure that each file has the same set of features. We can check this with the following code snippet:

columns = np.dstack((list(df1.columns), list(df2.columns), list(df3.columns),
                     list(df4.columns), list(df5.columns)))
coldf = pd.DataFrame(columns[0])

The above code is self-explanatory: we first extract the column names, then stack them together using NumPy’s ‘dstack’ function. If you look at the Jupyter notebook on GitHub, you will see that the columns are the same, which is good for us.
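If you would rather verify this programmatically than by eye, a one-line check along the following lines works (my own addition, not from the original notebook; it reuses the ‘coldf’ frame built above):

# Each row of coldf holds one column name across the five files; if every
# row contains exactly one unique value, all files share the same columns.
assert coldf.nunique(axis=1).eq(1).all(), "column names differ across files"

We can move on to the next step. It’s time to check the shape of the data: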

df = pd.concat([df1, df2, df3, df4, df5])
df.shape
(981665, 151)

We see that there are approximately one million examples, and that each example has 151 features, including the target variable. Let’s look at the feature names to get familiar with the data. It is imperative to get to know the domain, especially the details of each feature’s relationship with the target variable. This is not something you learn overnight, which is why it is worth spending a few days, or maybe a week, getting familiar with the data before jumping into further detailed analysis. Let’s see the feature names:

print(list(df.columns))
['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'last_fico_range_high', 'last_fico_range_low', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'revol_bal_joint', 'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'sec_app_mths_since_last_major_derog', 'hardship_flag', 'hardship_type', 'hardship_reason', 'hardship_status', 'deferral_term', 'hardship_amount', 'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount', 'disbursement_method', 'debt_settlement_flag', 'debt_settlement_flag_date', 'settlement_status', 'settlement_date', 'settlement_amount', 'settlement_percentage', 'settlement_term']

The list of features above may seem scary at first, but we will get through every one of them and then select the relevant ones. Let’s start with the target feature, “loan_status”:

df.loan_status.value_counts()
Current               500937
Fully Paid            358629
Charged Off            99099
Late (31-120 days)     13203
In Grace Period         6337
Late (16-30 days)       3414
Default                   36
Name: loan_status, dtype: int64

We see that there are seven types of loan status. In this tutorial, however, we are interested in only two classes: 1) Fully Paid, borrowers who paid back the loan with interest, and 2) Charged Off, borrowers who could not pay and were finally charged off. Therefore, we select the rows belonging to these two classes:

df = df.loc[(df['loan_status'].isin(['Fully Paid', 'Charged Off']))]
df.shape
(457728, 151)

Looking at the shape, we see that we now have about half as many data points as in the original data, with the same number of features. Before processing and cleaning manually, let’s do some general data-processing steps first:

  • Remove features with more than 85% missing values
  • Remove constant features
  • Remove duplicate features
  • Remove duplicate rows
  • Remove highly collinear features (in part 3, EDA)

Alright, let’s get started with the typical data processing:

1. Remove features with more than 85% missing values: In the code below, I first use pandas’ built-in ‘isnull()’ method to flag the missing values. Then I sum them up to get the count for each feature. Finally, I sort the features by their number of missing values and create a data frame for further analysis.
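The exact snippet lives in the notebook on GitHub; a minimal sketch of what the text describes might look like this (the names ‘missing’ and ‘missing_df’ are placeholders of mine):

# Count missing values per feature and sort in descending order.
missing = df.isnull().sum().sort_values(ascending=False)
missing_df = pd.DataFrame({'feature': missing.index,
                           'n_missing': missing.values})
print(missing_df.head(10))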

In the result above, we see that there are 53 features with roughly 400,000 missing values each. I use pandas’ drop method to remove these 53 features. Notice that I set the ‘inplace’ option to True, which removes these features from the original data frame df without returning anything.
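Concretely, the drop might look like the following sketch, which reuses the hypothetical ‘missing_df’ frame from the block above:

# Drop every feature that is missing in more than 85% of the rows.
threshold = 0.85 * len(df)
cols_to_drop = missing_df.loc[missing_df['n_missing'] > threshold,
                              'feature'].tolist()
df.drop(columns=cols_to_drop, inplace=True)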

2. Remove constant features: At this step, we remove features that have a single unique value. A feature with only one unique value cannot help the model generalize, since its variance is zero; a tree-based model, for instance, cannot split on such a feature. Identifying features with a single unique value is relatively straightforward:
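The original snippet is in the notebook on GitHub; a minimal reconstruction consistent with the description below might be:

def find_constant_features(dataframe):
    # Collect every column that holds fewer than two unique values.
    constant_features = []
    for column in dataframe.columns:
        if dataframe[column].nunique() < 2:
            constant_features.append(column)
    return constant_features

const_features = find_constant_features(df)
print('Constant features:', const_features)
df.drop(columns=const_features, inplace=True)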

In the above code, I create a function, ‘find_constant_features’, to identify constant features. The function goes through each feature and checks whether it has fewer than two unique values; if so, the feature is added to the constant-feature list. We could also find constant features by looking at the variance or standard deviation: if either is zero, the feature holds a single unique value. The print statement shows that five features have a single unique value, so we remove them with the ‘inplace’ option set to True.

3. Remove duplicate features: Duplicate features are columns that hold exactly the same values, whether under the same or a different name. To find the duplicate features, I borrowed the following code from this Stack Overflow link:
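Since the embedded snippet does not render in this copy, here is a rough sketch of the Stack Overflow approach (my own paraphrase; pandas’ ‘Series.equals’ treats NaNs in matching positions as equal):

def duplicate_columns(frame):
    # Return names of columns whose values exactly duplicate an earlier column.
    duplicates = []
    cols = list(frame.columns)
    for i in range(len(cols)):
        if cols[i] in duplicates:
            continue
        for j in range(i + 1, len(cols)):
            if cols[j] not in duplicates and frame[cols[i]].equals(frame[cols[j]]):
                duplicates.append(cols[j])
    return duplicates

print(duplicate_columns(df))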

We see only one feature that appears to be duplicated. I am not going to remove it yet; rather, I will wait until we do the EDA in the next part.

4. Remove duplicate rows: In this step, we remove all duplicate rows. I use pandas’ built-in ‘drop_duplicates(inplace=True)’ method to perform this action:

df.drop_duplicates(inplace=True)

These four processing steps are basics that we need to perform in any data science project. Let’s see the shape of the data after all of these steps:

df.shape
(457728, 93)

We see that we have 93 features after performing the above steps.

In the next part of this tutorial, I will go through each feature, perform cleaning, and remove features where necessary. In the meantime, if you have any questions about this part, please feel free to write your comment below, or reach out to me directly.


Bio: Sabber Ahamed is the Founder of xoolooloo.com. Computational Geophysicist and Machine Learning Enthusiast.

Original. Reposted with permission.

Related:

  • Text Mining on the Command Line
  • Three techniques to improve machine learning model performance with imbalanced datasets


