A Novel Feature Selection Method Based on Maximum Likelihood Logistic Regression
for Imbalanced Learning in Software Defect Prediction
Kamal Bashir¹, Tianrui Li¹, and Mahama Yahaya²
¹School of Information Science and Technology, Southwest Jiaotong University, China
²School of Transport and Logistics Engineering, Southwest Jiaotong University, China
Abstract: The most frequently used machine learning feature ranking
approaches fail to produce an optimal feature subset for accurate prediction of
defective software modules in out-of-sample data. Machine learning Feature Selection
(FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio
(GR), ReliefF (RF) and Symmetric Uncertainty (SU) perform relatively poorly at
prediction, even after the class distribution in the training data has been
balanced. In this study, we propose a novel FS method based on Maximum Likelihood
Logistic Regression (MLLR). We apply this method to six software defect datasets,
in both their sampled and unsampled forms, to select useful features for
classification in the context of Software Defect Prediction (SDP). Support Vector
Machine (SVM) and Random Forest (RaF) classifiers are then trained on the feature
subsets selected from the sampled and unsampled datasets. The performance of the
resulting models, measured using the Area Under the Receiver Operating
Characteristic Curve (AUC), is compared across all FS methods considered. The
Analysis Of Variance (ANOVA) F-test results validate the superiority of the
proposed method over all the other FS techniques, on both sampled and unsampled
data. The results confirm that MLLR can be useful in selecting an optimal feature
subset for more accurate prediction of defective modules in the software
development process.
Keywords: Software defect prediction, machine learning, class imbalance,
maximum-likelihood logistic regression.
Received April 30, 2018; accepted January 28, 2020
https://doi.org/10.34028/iajit/17/5/5
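
To make the idea of likelihood-based feature ranking in the abstract concrete, the following is a minimal sketch and not the authors' MLLR procedure, whose scoring rule the abstract does not specify. It assumes one plausible reading: fit a single-feature logistic regression per feature by maximizing the log-likelihood, then rank features by the log-likelihood reached (higher is better). The function names, the learning rate, the iteration count, and the toy data are all hypothetical.

```python
import math


def fit_univariate_logit(x, y, lr=0.5, iters=5000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient ascent on the mean
    log-likelihood. Returns (w, b, log_likelihood)."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(iters):
        grad_w = 0.0
        grad_b = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            grad_w += (yi - p) * xi  # d(log-likelihood)/dw
            grad_b += (yi - p)       # d(log-likelihood)/db
        w += lr * grad_w / n
        b += lr * grad_b / n

    log_likelihood = 0.0
    for xi, yi in zip(x, y):
        z = w * xi + b
        # log sigmoid(z) for a defective module, log(1 - sigmoid(z)) otherwise
        if yi == 1:
            log_likelihood += -math.log1p(math.exp(-z))
        else:
            log_likelihood += -math.log1p(math.exp(z))
    return w, b, log_likelihood


def rank_features(columns, y):
    """Rank feature columns by the maximized log-likelihood of a
    single-feature logistic model (assumed scoring rule, see lead-in)."""
    scores = [fit_univariate_logit(col, y)[2] for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy data: 8 modules, 2 of them defective (imbalanced classes).
y = [0, 0, 0, 0, 0, 0, 1, 1]
columns = [
    [0.1, 0.2, 0.3, 0.2, 0.4, 0.8, 0.7, 0.9],  # feature 0: tracks the label
    [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.4, 0.6],  # feature 1: close to noise
]

order, scores = rank_features(columns, y)
print("ranking (best first):", order)
```

On this toy data the informative feature 0 ranks ahead of feature 1. In a fuller pipeline along the lines the abstract describes, the top-ranked subset would then be passed to a classifier such as SVM or Random Forest and scored by AUC on held-out data.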