In today’s competitive economy, credit scoring constitutes one of the most significant and successful operational research techniques used in banking and finance. It was developed by Fair and Isaac in the early 1960s and refers to the procedure of estimating the risk related to credit products, which is evaluated using applicants’ credentials (such as annual income, job status, residential status, etc.) and historical data [1]. In simple terms, credit scoring produces a score which can be used to classify customers into two separate groups: the “credit-worthy” (likely to repay the credit loan) and the “non credit-worthy” (rejected due to their high probability of defaulting).
The global financial crisis of 2008 (which resulted in the collapse of many venerable names in the industry) caused a ripple effect throughout the economy and demonstrated the potentially large losses incurred when a credit applicant defaults on a loan [3]. Therefore, credit scoring systems are of great interest to banks and financial institutions, not only because they must measure credit risk, but also because any small improvement would produce great profits [5]. For this task, many researchers have developed credit-scoring models by exploiting the knowledge acquired from individual and company records of past borrowing and repayment actions gathered by banks and financial institutions [8]. In the field of credit scoring, imbalanced datasets frequently occur, as the number of non-worthy applicants is usually much smaller than the number of worthy ones. In order to address this difficulty, ensemble learning methods have been proposed as a new direction for obtaining a composite global model with more accurate and reliable decisions than can be obtained from a single model [15]. The basic idea of ensemble learning is to combine a set of diverse prediction models into one prediction model with improved classification accuracy. Nevertheless, the vigorous development of the Internet, the emergence of vast collections and the widespread adoption of electronic records have led to the development of large repositories of labeled and, mostly, unlabeled data. Most conventional credit-scoring models are based on individual supervised classifiers, or a simple combination of such classifiers, which exploit only labeled data, ignoring the knowledge hidden in the unlabeled data.
Semi-Supervised Learning (SSL) algorithms constitute a hybrid model which comprises characteristics of both supervised and unsupervised learning algorithms. More specifically, these algorithms combine the hidden knowledge in the unlabeled data with the explicit classification knowledge from the labeled data. To this end, they are generally considered an appropriate machine learning methodology for building powerful classifiers by extracting information from both labeled and unlabeled data [16]. Self-labeled algorithms constitute the most popular and frequently used class of SSL algorithms and have been efficiently applied to several real-world problems [17]. These algorithms wrap around a supervised base learner and exploit the unlabeled data via a self-learning philosophy. Recently, Triguero et al. [25] presented an in-depth taxonomy of these algorithms, demonstrating their simplicity of implementation and their wrapper-based methodology [16].
In this work, we examine and evaluate the performance of two ensemble-based self-labeled algorithms for the credit risk scoring problem. The proposed algorithms combine the predictions of three of the most productive and frequently used self-labeled algorithms, using different methodologies. Our experimental results demonstrate the classification accuracy of the presented algorithms on three credit scoring datasets.
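The idea of fusing the predictions of three self-labeled learners can be illustrated with a minimal Java sketch. It assumes simple majority voting over integer class labels, which is only one possible combination rule and is not necessarily the rule used by either of the proposed algorithms; the class name `MajorityVote` and its method are hypothetical and introduced only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: fuses class labels predicted by several learners. */
final class MajorityVote {
    private MajorityVote() {}

    /**
     * Returns the label predicted by the largest number of learners.
     * Ties are resolved in favour of the label that appears first in the input.
     * The input must contain at least one prediction.
     */
    static int vote(int[] predictions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : predictions) {
            counts.merge(label, 1, Integer::sum);
        }
        int best = predictions[0];
        for (int label : predictions) {
            // Strict comparison keeps the earliest label when counts are equal.
            if (counts.get(label) > counts.get(best)) {
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed encoding: 1 = credit-worthy, 0 = non credit-worthy.
        int[] threeLearners = {1, 0, 1};
        System.out.println("Fused decision: " + vote(threeLearners));
    }
}
```

With three learners and two classes a tie cannot occur, so the tie-breaking rule only matters if the sketch is reused with an even number of voters or more classes.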
The remainder of this paper is organized as follows: Section 2 presents a survey of recent studies concerning the application of data mining to the credit scoring problem. Section 3 presents a brief description of self-labeled methods and the proposed ensemble-based SSL algorithms. Section 4 presents a series of experiments to evaluate the accuracy of the proposed algorithms for the credit scoring problem. Finally, Section 5 sketches our concluding remarks and our future work.
2. Related Work
During the last decades, the developments and advances of machine learning systems in credit decision making have gained popularity, addressing many issues in banking and finance. Louzada et al. [27] presented an extensive review, discussing the chronicles of recent credit scoring financial analysis and developments and analyzing the outcomes produced by a machine learning approach. Additionally, they described in detail the most accurate prediction models used for gaining significant insights into the credit scoring problem and conducted a variety of experiments using three real-world datasets (Australian credit scoring, Japanese credit scoring and German credit scoring). A number of rewarding studies have been carried out in recent years; some of their useful outcomes are briefly presented below.
Kennedy et al. [28] evaluated the suitability of semi-supervised one-class classification algorithms against supervised two-class classification algorithms on the low-default portfolio problem. Nine banking datasets were used and class imbalance was artificially created by removing 10% of the defaulting observations from the training set after each run. Additionally, they investigated the suitability of oversampling, which constitutes a common approach to dealing with low-default portfolios. Their experimental results demonstrated that semi-supervised techniques should not be expected to outperform the supervised two-class classification techniques and that they should be used only in the near or complete absence of defaulters. Moreover, although oversampling improved the performance of some two-class classifiers, it did not lead to an overall improvement of the best performing classifiers.
Alaraj and Abbod [29] introduced a model based on the combination of hybrid and ensemble methods for credit scoring. Firstly, they combined filtering and feature selection methods to develop an effective pre-processor for machine learning models. In addition, they proposed a new classifier combination rule based on the consensus approach of different classification algorithms during the ensemble modeling phase. Their experimental analysis on seven real-world credit datasets illustrated that the proposed model exhibited better predictive performance than the individual classifiers.
Abellán and Castellano [30] performed a comparative study on several base classifiers used in different ensemble schemes for credit scoring tasks. Additionally, they evaluated the performance of the Credal Decision Tree (CDT), which uses imprecise probabilities and uncertainty measures to build a decision tree. Via an experimental study, they concluded that all the investigated ensemble schemes present better performance when they use the CDT model as a base learner on credit scoring problems.
In more recent works, Tripathi et al. [31] proposed a hybrid credit scoring model based on dimensionality reduction by the Neighborhood Rough Set algorithm for feature selection and layered ensemble classification with a weighted voting approach to enhance the classification performance. They also proposed a novel classifier ranking algorithm as an underlying model for representing the ranks of the classifiers based on classifier accuracy. The experimental results revealed the efficacy and robustness of the proposed method on two benchmark credit scoring datasets.
Zhang et al. [32] proposed a new predictive model which is based on a novel technique for selecting classifiers using a genetic algorithm, considering both the accuracy and diversity of the ensemble. They conducted a variety of experiments using three real-world datasets (Australian credit scoring, Japanese credit scoring and German credit scoring) to explore the effectiveness of their proposed model. Based on their numerical experiments, the authors concluded that their proposed ensemble method outperforms classical classifiers in terms of prediction accuracy.
Levatić et al. [33] proposed a method for semi-supervised learning of classification trees. The trees can be trained with nominal and/or numeric descriptive attributes on binary and multi-class classification datasets. Additionally, they performed an extensive empirical evaluation of their framework using an ensemble of decision trees as base learners, obtaining some interesting results. Along this line, they extended their work, presenting some ensemble-based algorithms for multi-target regression problems [33].
4. Experimental Methodology
In this section, we present a series of experiments conducted to evaluate the performance of the CST-Voting and EnSSL algorithms against the most popular and frequently used self-labeled algorithms.
The implementation code was written in Java, making use of the WEKA Machine Learning Toolkit [46]. To study the influence of the amount of labeled data, three different ratios (R) of the training data were used, and all self-labeled algorithms were evaluated using stratified 10-fold cross-validation.
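To make the role of the labeled ratio concrete, the following minimal Java sketch selects which training instances keep their labels while the rest are treated as unlabeled. It is written under stated assumptions: the selection is stratified per class, it is applied to the training part of each cross-validation fold, and the ratio value used in `main` is purely illustrative. The class `LabeledRatioSplit` and its method are hypothetical and are not part of WEKA.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical helper: keeps a fraction R of the training labels, per class. */
final class LabeledRatioSplit {
    private LabeledRatioSplit() {}

    /**
     * Returns the sorted indices of the instances that remain labeled.
     * Every index not returned is treated as unlabeled by the SSL algorithm.
     */
    static List<Integer> labeledIndices(int[] classes, double ratio, long seed) {
        // Group instance indices by class so the class proportions are preserved.
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < classes.length; i++) {
            byClass.computeIfAbsent(classes[i], k -> new ArrayList<>()).add(i);
        }
        Random rng = new Random(seed);
        List<Integer> labeled = new ArrayList<>();
        for (List<Integer> members : byClass.values()) {
            Collections.shuffle(members, rng);
            int keep = (int) Math.round(ratio * members.size());
            labeled.addAll(members.subList(0, keep));
        }
        Collections.sort(labeled);
        return labeled;
    }

    public static void main(String[] args) {
        // Toy class vector: 70 positive and 30 negative training instances.
        int[] classes = new int[100];
        for (int i = 0; i < 70; i++) {
            classes[i] = 1;
        }
        // The ratio 0.2 is an illustrative value, not one taken from the experiments.
        List<Integer> labeled = labeledIndices(classes, 0.2, 42L);
        System.out.println("Labeled instances: " + labeled.size());
    }
}
```

Fixing the seed makes the split reproducible, so that every compared algorithm can be given the same labeled and unlabeled instances.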
The experiments in our study took place in two distinct phases. In the first phase, we evaluated the classification performance of CST-Voting and EnSSL against the most popular SSL algorithms, namely Self-training, Co-training, and Tri-training, while in the second phase, we compared their performance against some state-of-the-art self-labeled algorithms, namely SETRED, Democratic-Co learning and Co-Forest. Table 1 reports the configuration parameters of all evaluated self-labeled algorithms, while all base learners were used with their default parameter settings included in the WEKA Machine Learning Toolkit (University of Waikato, Hamilton, New Zealand) to minimize the effect of any expert bias.
All algorithms were evaluated using three different benchmark datasets: Australian credit, Japanese credit and German credit, which are publicly available in the UCI Machine Learning Repository [48] and concern approved or rejected credit card applications. The first has 690 instances, with 14 explanatory variables (6 continuous and 8 categorical); the second has 653 instances, with 15 explanatory variables (3 continuous, 3 integer and 9 categorical); while the third has 1000 instances, with 20 explanatory variables (7 continuous and 13 categorical). With regard to the cardinality of each class, in the Australian dataset there is a small imbalance of rejected and accepted instances, namely 383 and 307, respectively. In the Japanese dataset there is also a small imbalance of rejected and accepted instances, namely 357 and 296, respectively, while in the German dataset a sharper imbalance is observed, with 300 negative decisions and 700 positive ones.
The performance of the classification algorithms was evaluated using the following four performance metrics: Sensitivity (Sen), Specificity (Spe), F-measure ($F_1$) and Accuracy (Acc), which are respectively defined by

$$\mathrm{Sen} = \frac{TP}{TP + FN}, \qquad \mathrm{Spe} = \frac{TN}{TN + FP},$$

$$F_1 = \frac{2\,TP}{2\,TP + FN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$ stands for the number of instances which have been correctly classified as positive, $TN$ stands for the number of instances which have been correctly classified as negative, $FP$ (type I error) stands for the number of instances which have been wrongly classified as positive, and $FN$ (type II error) stands for the number of instances which have been wrongly classified as negative.

It is worth mentioning that Sensitivity is the proportion of actual positives which are predicted as positive; Specificity is the proportion of actual negatives which are predicted as negative; $F_1$ is the harmonic mean of precision and recall; and Accuracy is the ratio of correct predictions of a classification model.
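The four metrics follow directly from the confusion-matrix counts, as the minimal Java sketch below illustrates. It uses the standard definitions of Sensitivity, Specificity, the F-measure (harmonic mean of precision and recall) and Accuracy; the class name `CreditMetrics` and the counts in `main` are hypothetical and chosen only for illustration.

```java
/** Hypothetical helper: confusion-matrix metrics for a binary classifier. */
final class CreditMetrics {
    private CreditMetrics() {}

    /** Proportion of actual positives predicted as positive: TP / (TP + FN). */
    static double sensitivity(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    /** Proportion of actual negatives predicted as negative: TN / (TN + FP). */
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    /** Harmonic mean of precision and recall: 2TP / (2TP + FN + FP). */
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fn + fp);
    }

    /** Ratio of correct predictions: (TP + TN) / (TP + TN + FP + FN). */
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    public static void main(String[] args) {
        // Illustrative counts only; they are not results from the experiments.
        int tp = 80, tn = 90, fp = 10, fn = 20;
        System.out.printf("Sen=%.3f Spe=%.3f F1=%.3f Acc=%.3f%n",
                sensitivity(tp, fn), specificity(tn, fp),
                f1(tp, fp, fn), accuracy(tp, tn, fp, fn));
    }
}
```

Each method returns NaN when its denominator is zero (for example, Sensitivity on a fold without positive instances), so callers may want to handle that case explicitly.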
4.1. First Phase of Experiments
In the sequel, we focus on the experimental analysis evaluating the classification performance of the CST-Voting and EnSSL algorithms against their component self-labeled methods, i.e., Self-training, Co-training, and Tri-training. All SSL algorithms were evaluated by deploying as base learners the Naive Bayes (NB) [49], the Sequential Minimal Optimization (SMO) [50], the Multilayer Perceptron (MLP) [51] and the kNN algorithm [52]. These algorithms probably constitute the most effective and popular machine learning algorithms for classification problems [53]. Moreover, similar to Blum and Mitchell [37], a limit on the number of iterations of all self-labeled algorithms is established. This strategy has also been adopted by many researchers [18].

Table 2, Table 3 and Table 4 present the performance of each compared SSL algorithm for the Australian, Japanese and German datasets, respectively, relative to all performance metrics. Notice that the highest classification accuracy is highlighted in bold for each base learner. Firstly, it is worth mentioning that the ensemble SSL methods, CST-Voting and EnSSL, exhibited the best performance on all datasets and improved their performance metrics as the labeled ratio increased. In more detail:
CST-Voting exhibited the best performance in 10, 8 and 8 cases for the Australian, Japanese and German datasets, respectively, while EnSSL exhibited the highest accuracy in 6, 8 and 8 cases for the same datasets.
Depending upon the base classifier, CST-Voting is the most effective method using NB or SMO as base learner, while EnSSL reported the highest performance using MLP as base learner.
In machine learning, the statistical comparison of several algorithms over multiple datasets is fundamental and is frequently performed by means of a statistical test [20]. We are interested in testing the hypothesis that all algorithms perform equally well, at a given significance level, based on their classification accuracy, and in highlighting significant differences between our proposed algorithms and the classical self-labeled algorithms. For this purpose, we used the non-parametric Friedman Aligned Ranking (FAR) [57] test. Moreover, the Finner test [58] is applied as a post-hoc procedure to find out which algorithms present significant differences.
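The ranking step behind the FAR test can be sketched in Java as follows. The sketch computes only the mean aligned rank of each algorithm, not the test statistic or its p-value: the accuracies of each problem are centred on that problem's mean, all aligned values are ranked jointly (rank 1 for the largest value, so a lower rank is better), and tied values share their average rank. The class name `AlignedRanks` and the accuracies in `main` are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: mean Friedman aligned ranks, lower is better. */
final class AlignedRanks {
    private AlignedRanks() {}

    /** acc[i][j] is the accuracy of algorithm j on problem i. */
    static double[] meanAlignedRanks(double[][] acc) {
        int n = acc.length;
        int k = acc[0].length;
        // Align each observation by subtracting the mean accuracy of its problem.
        double[] aligned = new double[n * k];
        for (int i = 0; i < n; i++) {
            double rowMean = Arrays.stream(acc[i]).average().orElse(0.0);
            for (int j = 0; j < k; j++) {
                aligned[i * k + j] = acc[i][j] - rowMean;
            }
        }
        // Order all n*k positions by aligned value, largest first.
        Integer[] order = new Integer[n * k];
        for (int p = 0; p < order.length; p++) {
            order[p] = p;
        }
        Arrays.sort(order, (a, b) -> Double.compare(aligned[b], aligned[a]));
        // Assign 1-based ranks; exactly equal values receive their average rank.
        double[] rank = new double[n * k];
        int start = 0;
        while (start < order.length) {
            int end = start;
            while (end + 1 < order.length
                    && aligned[order[end + 1]] == aligned[order[start]]) {
                end++;
            }
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int t = start; t <= end; t++) {
                rank[order[t]] = averageRank;
            }
            start = end + 1;
        }
        // Average the ranks of each algorithm over the n problems.
        double[] mean = new double[k];
        for (int p = 0; p < rank.length; p++) {
            mean[p % k] += rank[p] / n;
        }
        return mean;
    }

    public static void main(String[] args) {
        // Illustrative accuracies: two problems (rows), two algorithms (columns).
        double[][] acc = {{0.9, 0.8}, {0.7, 0.5}};
        System.out.println(Arrays.toString(meanAlignedRanks(acc)));
    }
}
```

In this setting, the algorithm with the lowest mean aligned rank would serve as the control algorithm for the post-hoc comparison.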
Table 5 presents the statistical analysis performed by nonparametric multiple comparison procedures for the Self-training, Co-training, Tri-training, CST-Voting and EnSSL algorithms. Notice that the control algorithm for the post-hoc test is determined by the best (i.e., lowest) ranking obtained in each FAR test. Moreover, the adjusted p-value with Finner’s test is presented based on the corresponding control algorithm at the α level of significance. The post-hoc test rejects the hypothesis of equality when the adjusted p-value is less than the value of α.
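The Finner adjustment used in the post-hoc step can be sketched in Java as follows, assuming the usual step-down formulation: with m comparisons against the control and unadjusted p-values sorted in ascending order, the i-th adjusted value is the largest of 1 − (1 − p_j)^(m/j) over j ≤ i, capped at 1. The class name `FinnerAdjustment` and the p-values in `main` are hypothetical; computing the unadjusted p-values from the aligned ranks is outside the scope of the sketch.

```java
import java.util.Arrays;

/** Hypothetical helper: Finner step-down adjustment of p-values. */
final class FinnerAdjustment {
    private FinnerAdjustment() {}

    /** Returns the adjusted p-values in the same order as the input. */
    static double[] adjust(double[] p) {
        int m = p.length;
        // Visit the hypotheses from the smallest to the largest p-value.
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(p[a], p[b]));
        double[] adjusted = new double[m];
        double runningMax = 0.0;
        for (int j = 1; j <= m; j++) {
            int idx = order[j - 1];
            double v = 1.0 - Math.pow(1.0 - p[idx], (double) m / j);
            // The running maximum keeps the adjusted values monotone.
            runningMax = Math.max(runningMax, v);
            adjusted[idx] = Math.min(1.0, runningMax);
        }
        return adjusted;
    }

    public static void main(String[] args) {
        // Illustrative unadjusted p-values for four comparisons with the control.
        double[] p = {0.01, 0.04, 0.03, 0.5};
        System.out.println(Arrays.toString(adjust(p)));
    }
}
```

A comparison is then declared significant when its adjusted p-value falls below the chosen significance level.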
Clearly, CST-Voting and EnSSL demonstrate the best overall performance, as they outperform the rest of the self-labeled algorithms. This is because they report the highest probability-based ranking by statistically presenting better results, relative to all used base learners. CST-Voting exhibited the best performance using NB and SMO as base learners, while EnSSL presented the best performance using MLP and kNN as base learners. Furthermore, the FAR test and, above all, the Finner post-hoc test revealed that CST-Voting and EnSSL perform equally well.
4.2. Second Phase of Experiments
Next, we evaluated the classification performance of the presented ensemble algorithms, CST-Voting and EnSSL, against some other state-of-the-art self-labeled algorithms such as SETRED, Co-Forest and Democratic-Co learning. Notice that CST-Voting and EnSSL use SMO and MLP as base learners, respectively, which exhibited the best performance relative to all performance metrics.
Table 6, Table 7 and Table 8 report the performance of each tested self-labeled algorithm on the Australian, Japanese and German credit datasets, respectively. As mentioned above, the accuracy of the best performing algorithm is highlighted in bold. Clearly, the presented ensemble SSL algorithms illustrate the best performance, independent of the labeled ratio used. Furthermore, it is worth noticing that EnSSL exhibits slightly better average performance than CST-Voting.
Table 9 presents the statistical analysis for SETRED, Co-Forest, Democratic-Co learning, CST-Voting and EnSSL, performed by nonparametric multiple comparison procedures. As mentioned above, the control algorithm for the post-hoc test is determined by the best (i.e., lowest) ranking obtained in each FAR test, while the adjusted p-value with Finner’s test is presented based on the corresponding control algorithm at the α level of significance. Table 9 illustrates that CST-Voting and EnSSL exhibit the highest probability-based ranking by statistically presenting better results. Moreover, it is worth noticing that CST-Voting and EnSSL perform similarly, with EnSSL presenting slightly better performance according to the FAR test.
5. Conclusions
In this work, we evaluated the performance of two ensemble SSL algorithms, named CST-Voting and EnSSL, for the credit scoring problem. The proposed ensemble algorithms combine the individual predictions of three of the most efficient and popular self-labeled algorithms, i.e., Co-training, Self-training, and Tri-training, using two different voting methodologies. The numerical experiments demonstrated the efficacy of the ensemble SSL algorithms on three well-known credit scoring datasets, illustrating that reliable and robust prediction models could be developed by the adaptation of ensemble techniques in the semi-supervised learning framework.
It is worth noticing that we understand the limitations imposed on the generalizability of the presented results by the use of only three freely available datasets, as compared to other works [26]. Furthermore, since we do not know whether the values of the key parameters of the base learners within WEKA 3.9 are randomly initialized, optimized, or adapted, one may generally consider this approach a limitation when comparisons of algorithms are conducted using only three datasets. We certainly intend to investigate this further in the near future.
Additionally, another interesting aspect for future research could be the development of a decision-support tool based on an ensemble SSL algorithm for the credit risk scoring problem. The use of a predictive tool could assist financial institutions in deciding whether to grant credit to consumers who apply. Since our numerical experiments are quite encouraging, our future work will concentrate on evaluating the proposed algorithms against relevant methodologies and frameworks addressing the credit scoring problem such as [27] and against recently proposed advanced SSL algorithms such as [59].