Default Prediction Model: The Significant Role of Data
Engineering in the Quality of Outcomes
Ahmad Al-Qerem1, Ghazi Al-Naymat2,3, Mays Alhasan3,
and Mutaz Al-Debei4
1Computer Science Department, Zarqa University, Jordan
2Department of Information Technology, Ajman University, United Arab
Emirates
3King Hussein School of Computing Sciences, Princess Sumaya University for
Technology, Jordan
4Management Information Systems Department, The University of Jordan,
Jordan
Abstract: For financial institutions and
the banking industry, it is crucial to have predictive models for their
core financial activities, especially those that play major roles in risk
management. Predicting loan default is one of the critical issues that banks
and financial institutions focus on, as large revenue losses can be prevented
by predicting not only a customer’s ability to pay back a loan, but also the
ability to do so on time. Customer loan default prediction is the task of
proactively identifying the customers who are most likely to stop paying back
their loans. This is usually done by dynamically analyzing customers’ relevant
information and behaviors, so that the bank or financial institution can
estimate the borrowers’ risk. Many different machine learning
classification models and algorithms have been used to predict customers’
ability to pay back loans. In this paper, three different classification
methods (Naïve Bayes, Decision Tree, and Random Forest) are used for
prediction. A comprehensive set of pre-processing techniques is applied
to the dataset to improve data quality by fixing some of the main
data issues, such as missing values and imbalanced data, and three different
feature extraction algorithms are used to enhance accuracy and
performance. The results of the competing models varied after applying the
data pre-processing techniques and feature selection. The results were compared
using the F1 measure. The best model achieved an improvement of about 40%,
whilst the least-performing model achieved an improvement of only 3%. This
implies the significance of the data engineering process (e.g., data
pre-processing techniques and feature selection) in machine
learning exercises.
Keywords: Default Prediction, Classification, Pre-processing,
Prediction, Feature Selection, Genetic Algorithm, PSO Algorithm, Naïve Bayes, Decision
Tree, SVM, Random Forest, Banking, Risk Management.
Received February 29, 2020; accepted June 9, 2020