Informatics

Abstract: Credit scoring is generally recognized as one of the most significant operational research techniques used in banking and finance, aiming to identify whether a credit consumer belongs to a legitimate or a suspicious customer group. With the vigorous development of the Internet and the widespread adoption of electronic records, banks and financial institutions have accumulated large repositories of labeled and mostly unlabeled data. Semi-supervised learning constitutes an appropriate machine-learning methodology for extracting useful knowledge from both labeled and unlabeled data. In this work, we evaluate the performance of two ensemble semi-supervised learning algorithms for the credit scoring problem. Our numerical experiments indicate that the proposed algorithms outperform their component semi-supervised learning algorithms, illustrating that reliable and robust prediction models can be developed by the adaptation of ensemble techniques in the semi-supervised learning framework.


Introduction
In today's competitive economy, credit scoring constitutes one of the most significant and successful operational research techniques used in banking and finance. It was developed by Fair and Isaac in the early 1960s and corresponds to the procedure of estimating the risk related to credit products, which is evaluated using applicants' credentials (such as annual income, job status, residential status, etc.) and historical data [1,2]. In simple terms, credit scoring produces a score which can be used to classify customers into two separate groups: the "credit-worthy" (likely to repay the credit loan) and the "non-credit-worthy" (rejected due to their high probability of defaulting).
The global financial crisis of 2008 (which resulted in the collapse of many venerable names in the industry) caused a ripple effect throughout the economy and demonstrated the potentially large losses incurred when a credit applicant defaults on a loan [3,4]. Therefore, credit scoring systems are of great interest to banks and financial institutions, not only because they must measure credit risk, but also because any small improvement can produce great profits [5][6][7]. For this task, many researchers have developed credit-scoring models by exploiting the knowledge acquired from individual and company records of past borrowing and repaying actions gathered by banks and financial institutions [8][9][10][11][12][13][14]. In the field of credit scoring, imbalanced datasets frequently occur, as the number of non-worthy applicants is usually much smaller than the number of worthy ones. In order to address this difficulty, ensemble learning methods have been proposed as a new direction for obtaining a better composite global model, with more accurate and reliable decisions than can be obtained from a single model [15]. The basic idea of ensemble learning is the combination of a set of diverse prediction models in order to develop a prediction model with improved classification accuracy. Nevertheless, the vigorous development of the Internet, the emergence of vast collections and the widespread adoption of electronic records have led to large repositories of labeled and mostly unlabeled data. Most conventional credit-scoring models are based on individual supervised classifiers, or on a simple combination of such classifiers, which exploit only labeled data, ignoring the knowledge hidden in the unlabeled data.
Semi-Supervised Learning (SSL) algorithms constitute a hybrid model which comprises characteristics of both supervised and unsupervised learning algorithms. More specifically, these algorithms efficiently combine the hidden knowledge in the unlabeled data with the explicit classification knowledge from the labeled data. To this end, they are generally considered the appropriate machine learning methodology for building powerful classifiers by extracting information from both labeled and unlabeled data [16]. Self-labeled algorithms constitute the most popular and frequently used class of SSL algorithms and have been efficiently applied to several real-world problems [17][18][19][20][21][22][23][24]. These algorithms wrap around a supervised base learner and exploit the unlabeled data via a self-learning philosophy. Recently, Triguero et al. [25] presented an in-depth taxonomy of self-labeled methods, demonstrating their simplicity of implementation and their wrapper-based methodology [16,26].
In this work, we examine and evaluate the performance of two ensemble-based self-labeled algorithms for the credit risk scoring problem. The proposed algorithms combine the predictions of three of the most productive and frequently used self-labeled algorithms, using different methodologies. Our experimental results demonstrate the classification accuracy of the presented algorithms on three credit scoring datasets.
The remainder of this paper is organized as follows: Section 2 presents a survey of recent studies concerning the application of data mining to the credit scoring problem. Section 3 presents a brief description of self-labeled methods and the proposed ensemble-based SSL algorithms. Section 4 presents a series of experiments to evaluate the accuracy of the proposed algorithms for the credit scoring problem. Finally, Section 5 sketches our concluding remarks and future work.

Related Work
During the last decades, developments and advances of machine learning systems in credit decision making have gained popularity, addressing many issues in banking and finance. Louzada et al. [27] presented an extensive review of recent developments in credit scoring and analyzed the outcomes produced by machine learning approaches. Additionally, they described in detail the most accurate prediction models used for gaining significant insights into the credit scoring problem, and conducted a variety of experiments using three real-world datasets (Australian, Japanese and German credit scoring). A number of rewarding studies have been carried out in recent years; some useful outcomes are briefly presented below.
Kennedy et al. [28] evaluated the suitability of semi-supervised one-class classification algorithms against supervised two-class classification algorithms on the low-default portfolio problem. Nine banking datasets were used, and class imbalance was artificially created by removing 10% of the defaulting observations from the training set after each run. Additionally, they investigated the suitability of oversampling, which constitutes a common approach to dealing with low-default portfolios. Their experimental results demonstrated that semi-supervised techniques should not be expected to outperform supervised two-class classification techniques, and that they should be used only in the near or complete absence of defaulters. Moreover, although oversampling improved the performance of some two-class classifiers, it did not lead to an overall improvement of the best performing classifiers. Alaraj and Abbod [29] introduced a model based on the combination of hybrid and ensemble methods for credit scoring. Firstly, they combined filtering and feature selection methods to develop an effective pre-processor for machine learning models. In addition, they proposed a new classifier combination rule, based on the consensus approach of different classification algorithms, during the ensemble modeling phase. Their experimental analysis on seven real-world credit datasets illustrated that the proposed model exhibited better predictive performance than the individual classifiers.
Abellán and Castellano [30] performed a comparative study of several base classifiers used in different ensemble schemes for credit scoring tasks. Additionally, they evaluated the performance of the Credal Decision Tree (CDT), which uses imprecise probabilities and uncertainty measures to build a decision tree. Via an experimental study, they concluded that all the investigated ensemble schemes present better performance when they use the CDT model as a base learner on credit scoring problems.
In more recent work, Tripathi et al. [31] proposed a hybrid credit scoring model based on dimensionality reduction via the Neighborhood Rough Set algorithm for feature selection, and on layered ensemble classification with a weighted voting approach, to enhance classification performance. They also proposed a novel classifier ranking algorithm as an underlying model for representing the ranks of the classifiers based on classifier accuracy. The experimental results revealed the efficacy and robustness of the proposed method on two benchmark credit scoring datasets.
Zhang et al. [32] proposed a new predictive model based on a novel technique for selecting classifiers using a genetic algorithm, considering both the accuracy and the diversity of the ensemble. They conducted a variety of experiments using three real-world datasets (Australian, Japanese and German credit scoring) to explore the effectiveness of their model. Based on their numerical experiments, the authors concluded that the proposed ensemble method outperforms classical classifiers in terms of prediction accuracy.
Levatić et al. [33] proposed a method for semi-supervised learning of classification trees. The trees can be trained with nominal and/or numeric descriptive attributes on binary and multi-class classification datasets. Additionally, they performed an extensive empirical evaluation of their framework, using an ensemble of decision trees as base learners, and obtained some interesting results. Along this line, they extended their work by presenting ensemble-based algorithms for multi-target regression problems [33,34].

A Review of Semi-Supervised Self-Labeled Classification Methods
The basic aim of semi-supervised self-labeled methods is the enrichment of the initial labeled data through the labeling of unlabeled data via a self-learning process based on supervised prediction models. In the literature, several self-labeled methods have been proposed, each one based on a different philosophy for exploiting knowledge from unlabeled data. In the sequel, we briefly describe the most popular and frequently used semi-supervised self-labeled methods.

Self-Labeled Methods
Self-training [35] is a semi-supervised learning algorithm characterized by its simplicity and its good classification performance. In self-training, a supervised base learner is trained using the labeled data; its training set is then gradually augmented with its most confident predictions on unlabeled examples, and the learner is re-trained. Nevertheless, this methodology can lead to erroneous predictions if noisy examples are classified as the most confident ones and incorporated into the labeled training set.
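As an illustration, the self-training loop described above can be sketched in a few lines of Python. The sketch below uses a toy nearest-centroid base learner, whose confidence for an example is the margin between its two closest class centroids; this is an assumption for illustration only, not one of the base learners evaluated in this paper.

```python
def centroid_fit(X, y):
    """Compute the mean feature vector (centroid) of each class."""
    cents = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def centroid_predict(cents, x):
    """Return (label, confidence); confidence is the squared-distance
    margin between the two closest centroids (larger = more confident)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(c, x)), lab)
                   for lab, c in cents.items())
    conf = dists[1][0] - dists[0][0] if len(dists) > 1 else 1.0
    return dists[0][1], conf

def self_train(X_lab, y_lab, X_unlab, per_iter=1, max_iter=10):
    """Classic self-training: repeatedly move the most confident
    unlabeled points into the labeled set and re-train."""
    X_lab, y_lab, X_unlab = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_iter):
        if not X_unlab:
            break
        model = centroid_fit(X_lab, y_lab)
        preds = [centroid_predict(model, x) for x in X_unlab]
        # indices of the per_iter most confident predictions
        top = sorted(range(len(X_unlab)),
                     key=lambda i: preds[i][1], reverse=True)[:per_iter]
        for i in sorted(top, reverse=True):
            X_lab.append(X_unlab.pop(i))
            y_lab.append(preds[i][0])
    return centroid_fit(X_lab, y_lab)
```

Note that the loop also illustrates the weakness discussed above: a noisy example labeled with high confidence is absorbed into the training set and never revisited.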
To address this difficulty, Li and Zhou [36] proposed SETRED (Self-training with editing), which is based on the adaptation of data editing in the self-training framework. This method constructs a neighboring graph in the D-dimensional feature space, and at each iteration a hypothesis test filters the candidate unlabeled instances. Finally, the unlabeled instances which have successfully passed the test are added to the training set.
The standard Co-training [37] is a multi-view learning method, based on the assumption that the feature space can be split into two different views which are conditionally independent. Under the assumption of the existence of sufficient and redundant views, Co-training trains a separate base learner on each specific view. Then, iteratively, each base learner teaches the other with its most confidently predicted unlabeled examples, hence augmenting their training sets. However, in most real-case scenarios this assumption is a luxury hardly met [21].
Zhou and Goldman [38] adopted the idea of incorporating majority voting in the semi-supervised framework and proposed the Democratic co-learning (Demo-Co) algorithm. Although this algorithm belongs to the multi-view family, it operates in a different manner: instead of requiring multiple views of the corresponding data, it uses multiple algorithms for producing the necessary information and endorses a majority voting process for the final decision. Based on this previous work, Li and Zhou [39] proposed Co-Forest, in which a number of Random trees are trained on bootstrap data from the dataset and the output is defined as the combined individual predictions of the trees, via simple majority voting. The basic idea behind this algorithm is that, during the training process, the algorithm assigns a few unlabeled instances to each Random tree. Notice that the efficiency of Co-Forest is mainly based on the use of Random trees, even when the number of available labeled instances is significantly reduced.
Another approach which is also based on an ensemble methodology is the Tri-training algorithm [40], which constitutes an improved single-view extension of the Co-training algorithm. This algorithm uses a labeled dataset to initially train three base learners, which are then used to make predictions on the instances of the unlabeled dataset. If two base learners agree on the labeling of an example, then this example is also labeled for the third base learner. This "majority teach minority" strategy has the advantage of avoiding explicitly measuring the confidence of labeling, since such measuring is sometimes a quite complicated and time-consuming process; therefore, the training process is efficient [21].
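The "majority teach minority" rule can be made concrete with a short sketch. The helper below is a hypothetical illustration (not taken from the original implementation): given the three learners' labels for one unlabeled example, it returns, for each learner, the pseudo-label its two peers agree on, or None when the peers disagree and the example is skipped for that learner.

```python
def majority_teach_minority(preds):
    """preds: the three base learners' labels for one unlabeled example.
    For each learner, return the pseudo-label its two peers agree on,
    or None when the peers disagree (the example is then skipped)."""
    taught = []
    for i in range(3):
        peers = [p for j, p in enumerate(preds) if j != i]
        taught.append(peers[0] if peers[0] == peers[1] else None)
    return taught
```

For example, when learners predict ("good", "good", "bad"), only the third learner is taught the label "good" by its two agreeing peers; no confidence estimate is ever computed, which is precisely the efficiency argument made above.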

Ensemble Self-Labeled Methods
The development of an ensemble of classifiers consists of two main steps: selection and combination. The selection of the component classifiers is considered essential for the efficiency of the ensemble, while the key point for its efficacy is the diversity and the accuracy of the component classifiers [41]. Furthermore, the combination of the individual predictions of the classifiers can take place through several techniques and methodologies with different philosophies and classification performance [15,42].
By taking these into consideration, Kostopoulos et al. [43] and Livieris et al. [18,20] proposed two ensemble SSL algorithms, called CST-Voting and EnSSL, respectively. Both ensemble SSL algorithms exploit the individual predictions of three self-labeled algorithms, i.e., Self-training, Co-training and Tri-training, using different combination techniques and mechanisms.
CST-Voting [20,43] is based on the idea of combining the predictions of the self-labeled algorithms which constitute the ensemble using a simple majority voting methodology. Initially, the classical self-labeled algorithms are trained using the same labeled set L and unlabeled set U. Next, the final hypothesis on an instance from the testing set combines the individual predictions of the self-labeled algorithms using simple majority voting; hence, the output of the ensemble is the prediction made by more than half of them. A high-level description of CST-Voting is presented in Algorithm 1.
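A minimal sketch of this voting rule, assuming the three component predictions for a test instance are already available (with three voters and two classes, a strict majority always exists):

```python
from collections import Counter

def cst_voting(self_pred, co_pred, tri_pred):
    """Return the label predicted by at least two of the three
    component self-labeled algorithms (simple majority voting)."""
    return Counter([self_pred, co_pred, tri_pred]).most_common(1)[0][0]
```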
EnSSL [18,44] combines the individual predictions of the same self-labeled algorithms using a maximum-probability-based voting scheme. More specifically, the self-labeled algorithm which exhibits the most confident prediction on an example of the test set is selected. If the confidence of the prediction of the selected classifier meets a predefined threshold, then this classifier labels the example; otherwise, the prediction is not considered reliable enough and the output is defined as the combined prediction of the three self-labeled algorithms via simple majority voting. It is worth mentioning that the way in which the confidence of a prediction is measured depends on the type of base learner used (see [45][46][47] and the references therein). A high-level description of the EnSSL algorithm is presented in Algorithm 2.
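A minimal sketch of this fusion rule, assuming each component algorithm returns a (label, confidence) pair and that the confidences are comparable across components (how confidence is computed depends on the base learner, as noted above):

```python
from collections import Counter

def enssl_predict(preds, thres_lev):
    """preds: (label, confidence) pairs, one per component algorithm.
    Use the most confident component's label if its confidence reaches
    the threshold ThresLev; otherwise fall back to majority voting."""
    best_label, best_conf = max(preds, key=lambda p: p[1])
    if best_conf >= thres_lev:
        return best_label
    return Counter(label for label, _ in preds).most_common(1)[0][0]
```

For instance, with predictions [("good", 0.95), ("bad", 0.60), ("bad", 0.55)] and threshold 0.9, the most confident component wins and "good" is returned; if no component reaches the threshold, the majority label "bad" is returned instead.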

Algorithm 2, steps 6-12:
6: Find the classifier C* with the highest confidence prediction on x.
7: if (Confidence of C* ≥ ThresLev) then
8:    C* predicts the label y* of x.
9: else
10:   Use majority vote to predict the label y* of x.
11: end if
12: end for

Experimental Methodology
In this section, we conduct a series of experiments to evaluate the performance of the CST-Voting and EnSSL algorithms against the most popular and frequently used self-labeled algorithms.
The implementation code was written in Java, making use of the WEKA 3.9 Machine Learning Toolkit [46]. To study the influence of the amount of labeled data, three different ratios (R) of the training data were used, i.e., 10%, 20% and 30%, and all self-labeled algorithms were evaluated using stratified 10-fold cross-validation.
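The labeled-ratio setup can be sketched as follows. The function below is our own simplified stand-in (not the WEKA-based implementation used in the experiments): it keeps a fraction R of the training instances labeled, stratified per class, and treats the rest as unlabeled.

```python
import random

def split_labeled_unlabeled(X, y, ratio, seed=0):
    """Keep a fraction `ratio` of the training instances labeled
    (stratified per class) and treat the rest as unlabeled."""
    rng = random.Random(seed)
    keep = set()
    for label in set(y):
        idx = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(idx)
        keep.update(idx[:max(1, round(ratio * len(idx)))])
    L = [(X[i], y[i]) for i in range(len(X)) if i in keep]   # labeled set
    U = [X[i] for i in range(len(X)) if i not in keep]       # unlabeled set
    return L, U
```

Stratifying per class preserves the class imbalance of the original dataset in the small labeled subset, which matters for imbalanced credit datasets such as the German one.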
The experiments in our study took place in two distinct phases. In the first phase, we evaluated the classification performance of CST-Voting and EnSSL against the most popular SSL algorithms, namely Self-training, Co-training and Tri-training; in the second phase, we compared their performance against some state-of-the-art self-labeled algorithms, namely SETRED, Demo-Co and Co-Forest. Table 1 reports the configuration parameters of all evaluated self-labeled algorithms; all base learners were used with their default parameter settings as included in the WEKA 3.9 Machine Learning Toolkit (University of Waikato, Hamilton, New Zealand), in order to minimize the effect of any expert bias. All algorithms were evaluated using three benchmark datasets: Australian credit, Japanese credit and German credit, which are publicly available in the UCI Machine Learning Repository [48] and concern approved or rejected credit card applications. The first has 690 instances with 14 explanatory variables (6 continuous and 8 categorical); the second has 653 instances with 14 explanatory variables (3 continuous, 3 integer and 9 categorical); the third has 1000 instances with 20 explanatory variables (7 continuous and 13 categorical). With regard to the cardinality of each class, in the Australian dataset there is a small imbalance between rejected and accepted instances, namely 383 and 307, respectively. In the Japanese dataset there is a small imbalance between rejected and accepted instances, namely 357 and 296, respectively; in the German dataset a sharper imbalance is observed, with 300 negative decisions and 700 positive.

The performance of the classification algorithms was evaluated using the following four performance metrics: Sensitivity (Sen), Specificity (Spe), F1-score (F1) and Accuracy (Acc), which are respectively defined by

Sen = TP / (TP + FN),
Spe = TN / (TN + FP),
F1 = 2TP / (2TP + FP + FN),
Acc = (TP + TN) / (TP + TN + FP + FN),

where TP stands for the number of instances which have been correctly classified as positive, TN for the number of instances which have been correctly classified as negative, FP (type I error) for the number of instances which have been wrongly classified as positive, and FN (type II error) for the number of instances which have been wrongly classified as negative.
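These four metrics are a direct translation of the confusion-matrix counts; a minimal sketch, assuming TP, TN, FP and FN are given:

```python
def performance_metrics(tp, tn, fp, fn):
    """Sensitivity, Specificity, F1 and Accuracy from the
    confusion-matrix counts TP, TN, FP, FN."""
    sen = tp / (tp + fn)                   # true positive rate
    spe = tn / (tn + fp)                   # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall fraction of correct predictions
    return sen, spe, f1, acc
```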
It is worth mentioning that Sensitivity is the proportion of actual positives which are predicted as positive; Specificity represents the proportion of actual negatives which are predicted as negative; F1 is the harmonic mean of precision and recall; and Accuracy is the ratio of correct predictions of a classification model.

First Phase of Experiments
In the sequel, we focus on the experimental analysis evaluating the classification performance of the CST-Voting and EnSSL algorithms against their component self-labeled methods, i.e., Self-training, Co-training and Tri-training. All SSL algorithms were evaluated by deploying as base learners Naive Bayes (NB) [49], Sequential Minimal Optimization (SMO) [50], the Multilayer Perceptron (MLP) [51] and the kNN algorithm [52]. These algorithms probably constitute the most effective and popular machine learning algorithms for classification problems [53]. Moreover, similar to Blum and Mitchell [37], a limit on the number of iterations of all self-labeled algorithms was established. This strategy has also been adopted by many researchers [18][19][20][21][22][25][47][54][55].
Tables 2-4 present the performance of each compared SSL algorithm on the Australian, Japanese and German datasets, respectively, relative to all performance metrics. Notice that the highest classification accuracy is highlighted in bold for each base learner. Firstly, it is worth mentioning that the ensemble SSL methods, CST-Voting and EnSSL, exhibited the best performance on all datasets and improved their performance metrics as the labeled ratio increased. In more detail:
• CST-Voting exhibited the best performance in 10, 8 and 8 cases for the Australian, Japanese and German datasets, respectively, while EnSSL exhibited the highest accuracy in 6, 8 and 8 cases in the same situations.
• Depending upon the base classifier, CST-Voting is the most effective method using NB or SMO as base learner, while EnSSL reported the highest performance using MLP as base learner.
In machine learning, the statistical comparison of several algorithms over multiple datasets is fundamental, and it is frequently performed by means of a statistical test [20,21,56]. Since we are interested in evaluating the rejection of the hypothesis that all the algorithms perform equally well for a given level, based on their classification accuracy, and in highlighting the existence of significant differences between the proposed algorithms and the classical self-labeled algorithms, we used the non-parametric Friedman Aligned Ranking (FAR) test [57]. Moreover, the Finner test [58] is applied as a post-hoc procedure to find out which algorithms present significant differences.
Table 5 presents the statistical analysis performed by nonparametric multiple comparison procedures for the Self-training, Co-training, Tri-training, CST-Voting and EnSSL algorithms. Notice that the control algorithm for the post-hoc test is determined by the best (i.e., lowest) ranking obtained in each FAR test. Moreover, the adjusted p-value of the Finner test (pF) is presented, based on the corresponding control algorithm, at the α = 0.05 level of significance. The post-hoc test rejects the hypothesis of equality when the value of pF is less than α.
Clearly, CST-Voting and EnSSL demonstrate the best overall performance, as they outperform the rest of the self-labeled algorithms: they report the highest probability-based rankings, statistically presenting better results relative to all used base learners. CST-Voting exhibited the best performance using NB and SMO as base learners, while EnSSL presented the best performance using MLP and kNN as base learners. Furthermore, the FAR test, and especially the Finner post-hoc test, revealed that CST-Voting and EnSSL perform equally well.

Second Phase of Experiments
Next, we evaluated the classification performance of the presented ensemble algorithms, CST-Voting and EnSSL, against some other state-of-the-art self-labeled algorithms, namely SETRED, Co-Forest and Democratic co-learning. Notice that CST-Voting and EnSSL use SMO and MLP as base learners, respectively, which exhibited the best performance relative to all performance metrics. Tables 6-8 report the performance of each tested self-labeled algorithm on the Australian, Japanese and German credit datasets, respectively. As mentioned above, the accuracy measure of the best performing algorithm is highlighted in bold. Clearly, the presented ensemble SSL algorithms illustrate the best performance, independent of the used labeled ratio. Furthermore, it is worth noticing that EnSSL exhibits slightly better average performance than CST-Voting.
Table 9 presents the statistical analysis for SETRED, Co-Forest, Democratic co-learning, CST-Voting and EnSSL, performed by nonparametric multiple comparison procedures. As mentioned above, the control algorithm for the post-hoc test is determined by the best (i.e., lowest) ranking obtained in each FAR test, while the adjusted p-value of the Finner test (pF) is presented based on the corresponding control algorithm at the α = 0.05 level of significance. The interpretation of Table 9 illustrates that CST-Voting and EnSSL exhibit the highest probability-based rankings, statistically presenting better results. Moreover, it is worth noticing that CST-Voting and EnSSL perform similarly, with EnSSL presenting slightly better performance according to the FAR test.

Conclusions
In this work, we evaluated the performance of two ensemble SSL algorithms, entitled CST-Voting and EnSSL, for the credit scoring problem. The proposed ensemble algorithms combine the individual predictions of three of the most efficient and popular self-labeled algorithms, i.e., Co-training, Self-training and Tri-training, using two different voting methodologies. The numerical experiments demonstrated the efficacy of the ensemble SSL algorithms on three well-known credit scoring datasets, illustrating that reliable and robust prediction models can be developed by the adaptation of ensemble techniques in the semi-supervised learning framework.
It is worth noticing that we understand the limitations imposed on the generalizability of the presented results due to the use of only three freely available datasets, as compared to other works [26][27][28][29]. Furthermore, since we do not know whether the values of the key parameters of the base learners within WEKA 3.9 are randomly initialized, optimized or adapted, one may generally consider this a limitation when comparisons of algorithms are conducted using only three datasets. We certainly intend to investigate this further in the near future.
Additionally, another interesting aspect for future research could be the development of a decision-support tool based on an ensemble SSL algorithm for the credit risk scoring problem. The use of such a predictive tool could assist financial institutions in deciding whether to grant credit to consumers who apply. Since our numerical experiments are quite encouraging, our future work will concentrate on evaluating the proposed algorithms against relevant methodologies and frameworks addressing the credit scoring problem, such as [27][28][29][30][31][32], and against recently proposed advanced SSL algorithms, such as [59][60][61].

Algorithm 1: CST-Voting
Input: L − Set of labeled training instances.
       U − Set of unlabeled training instances.
Output: The labels of the instances in the testing set T.
/* Phase I: Training */
1: Self-training(L, U)
2: Co-training(L, U)
3: Tri-training(L, U)
/* Phase II: Voting-Fusion */
4: for each x ∈ T do
5:    Apply Self-training, Co-training and Tri-training on x.
6:    Use majority vote to predict the label y* of x.
7: end for

Algorithm 2: EnSSL
Input: L − Set of labeled training instances.
       U − Set of unlabeled training instances.
       ThresLev − Threshold level.
Output: The labels of the instances in the testing set T.
/* Phase I: Training */
1: Self-training(L, U)
2: Co-training(L, U)
3: Tri-training(L, U)
/* Phase II: Voting-Fusion */
4: for each x ∈ T do
5:    Apply Self-training, Co-training and Tri-training on x.
(Steps 6-12 of Algorithm 2 are listed after the EnSSL description in Section 3.)

Table 1 .
Parameter specification for all SSL methods.

Table 2 .
Performance evaluation of Self-training, Co-training, Tri-training, CST-Voting and EnSSL on the Australian credit dataset.

Table 3 .
Performance evaluation of Self-training, Co-training, Tri-training, CST-Voting and EnSSL on the Japanese credit dataset. Notice that the highest classification accuracy is highlighted in bold for each base learner.

Table 4 .
Performance evaluation of Self-training, Co-training, Tri-training, CST-Voting and EnSSL on the German credit dataset.

Table 5 .
Friedman Aligned Ranking (FAR) test and Finner post-hoc test for Self-training, Co-training, Tri-training, CST-Voting and EnSSL.

Table 6 .
Performance evaluation of SETRED, Co-Forest, Democratic-Co learning, CST-Voting and EnSSL on the Australian credit dataset.
The accuracy measure of the best performing algorithm is highlighted in bold.

Table 7 .
Performance evaluation of SETRED, Co-Forest, Democratic-Co learning, CST-Voting and EnSSL on the Japanese credit dataset.
The accuracy measure of the best performing algorithm is highlighted in bold.

Table 8 .
Performance evaluation of SETRED, Co-Forest, Democratic-Co learning, CST-Voting and EnSSL on the German credit dataset.