1. Introduction
The 2007–2008 global financial crisis and the subsequent recommendations on banking regulation have sparked growing institutional interest in credit and operational risk management, which has become a key determinant of success because incorrect decisions may lead to heavy losses. One major difficulty for financial institutions relates to credit granting and, more specifically, how to discriminate between default and non-default applicants.
Conventional methods for credit risk management have usually relied on subjective decisions made by analysts, drawing on past experience and well-established guidelines, but the increasing needs of companies and the huge amounts of financial data now available have motivated the design and application of more formal and precise techniques to make credit granting decisions more efficiently. The use of statistical and operations research methods represented a first step towards this objective [1,2,3]. However, some assumptions of the statistical models are often difficult to meet in practice, which renders these methods theoretically invalid for databases with a limited number of samples [4]. In more recent years, important efforts have been devoted to exploiting a variety of artificial intelligence and machine learning techniques, ranging from biologically inspired algorithms [5,6,7,8] to ensembles of classifiers [9,10,11,12], cluster analysis [13,14,15,16], and support vector machines [17,18,19], to shape solutions for both bankruptcy and credit risk prediction. An interesting advantage of these methods over the statistical models is that they automatically derive information from the past observations available in a data set, without assuming any specific prior knowledge.
From a practical viewpoint, the credit granting decision can be expressed in the form of a two-class prediction problem in which a new case has to be assigned to one of the predetermined classes according to a set of input or explanatory attributes. These attributes or variables gather a diversity of information that summarizes both the socio-demographic features and the financial status of the credit applicants, whereas the classifier gives an output based on their financial solvency. Generally, a credit risk prediction system attempts to assign a credit applicant to either the non-defaulter or the defaulter class. Let us assume a set of $n$ past observations $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where each instance $\mathbf{x}_i$ is described by $D$ input attributes, $\mathbf{x}_i = (x_{i1}, \ldots, x_{iD})$, and $y_i$ is the class (defaulter/non-defaulter); then the objective of a prediction model is to estimate the value $y$ for a new sample $\mathbf{x}$, that is, $y = f(\mathbf{x})$.
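To make this formulation concrete, the following minimal Python sketch fits and applies such a two-class prediction model with scikit-learn; the synthetic data and the logistic regression choice are purely illustrative stand-ins, not the setup used in the experiments:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for n past observations with D explanatory attributes
# (e.g., age, income, debt ratio); a real application would load a credit data set.
rng = np.random.default_rng(0)
n, D = 200, 3
X = rng.normal(size=(n, D))                                   # input attributes x_i
y = (X[:, 0] - 0.5 * X[:, 1]                                  # 1 = defaulter, 0 = non-defaulter
     + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = LogisticRegression().fit(X, y)                        # estimate f from past observations

x_new = rng.normal(size=(1, D))                               # a new credit applicant
print("predicted class:", model.predict(x_new)[0])            # y = f(x)
print("default probability:", model.predict_proba(x_new)[0, 1])
```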
A considerable number of papers whose purpose has been to conduct a comparison of credit risk prediction algorithms are available in the literature, but their conclusions are often contradictory because of the criteria used for the evaluation. For instance, Desai et al. [20] showed that linear models perform worse than artificial neural networks when using the proportion of defaulters correctly predicted, and that logistic regression achieves the highest proportion of non-defaulters and defaulters correctly predicted. Bensic et al. [6] noticed that probabilistic neural networks are superior to learning vector quantization, classification and regression trees (CART), logistic regression, the multilayer perceptron, and radial basis function networks based on the prediction accuracy. Yobas et al. [21] concluded that linear discriminant analysis is superior to decision trees, genetic algorithms, and neural networks when using the percentage of applicants correctly classified. Wang [12] showed that bagging and stacking with a decision tree as base classifier were the best performing algorithms when using type-I error, type-II error, and overall accuracy. Baesens et al. [17] found that neural networks are superior to other methods based on the area under the receiver operating characteristic (ROC) curve, while support vector machines perform the best in terms of overall accuracy. Bhaduri [22] tested some artificial immune systems against well-known classifiers on accuracy for two benchmark credit scoring data sets. Antonakis and Sfakianakis [23] compared linear discriminant analysis, decision trees, the k-nearest neighbors decision rule, the multilayer perceptron, the naïve Bayes classifier, and logistic regression, pointing out that the k-nearest neighbors model performed the best in terms of accuracy, and the multilayer perceptron achieved the highest rate based on the Gini coefficient.
The contradictory conclusions of those studies and other similar works suggest that no classifier can be considered the best on every performance evaluation metric. However, model selection is a subject of great interest for credit risk management, which underlines the need for more powerful techniques for assessing the performance of prediction methods. Taking the limitations of individual performance scores into account, this paper suggests the synergetic application of MCDM models to provide a more comprehensive evaluation of credit granting decision systems. Thus, the TOPSIS and PROMETHEE methods rank a set of prediction models using a single scalar score derived from aggregating their preference rates, and the experiments show that this strategy allows for more consistent conclusions regarding the effectiveness of credit risk prediction models than the use of individual performance measures.
The rest of the paper is organized as follows. Section 2 offers an overview of MCDM and describes the two methods used here. Section 3 presents the details of the experimental design, with the description of the databases and the performance measures. Section 4 discusses the results of the experiments conducted. Section 5 summarizes the main conclusions that can be drawn from the present work and outlines possible avenues of further research.
2. Multiple-Criteria Decision-Making
Over the past several years, MCDM models have acquired great relevance because this paradigm presents a number of features that make it especially suitable for analyzing hard real-life problems. One of the fundamental features of the MCDM methodologies is that most of them can cope with both quantitative and qualitative data, along with the subjective opinions and/or preferences of experts [24]. From a theoretical viewpoint, MCDM is a powerful branch of operations research that encompasses analytical tools and techniques to appraise the strengths and weaknesses of a set of $M$ competing alternatives $A = \{a_1, \ldots, a_M\}$ evaluated on a family of $N$ (usually conflicting) criteria of different nature $G = \{g_1, \ldots, g_N\}$, with the objective of making an accurate decision regarding the preference judgment of the decision-maker [25,26]. Thus, an MCDM problem can be generally represented by means of an $M \times N$ decision matrix as that shown in Table 1.
Choosing the best alternative requires combining the partial evaluations of each alternative into an aggregated value by using an aggregation operator $\Phi$ that relates a global value $V(a_i)$ to each alternative $a_i$. This aggregation operator depends on the preferences of the analyst, which can be expressed regarding the relevance of the criteria through weights $w = (w_1, \ldots, w_N)$. Thus, the aggregation operator can be defined as $V(a_i) = \Phi(g_1(a_i), \ldots, g_N(a_i); w)$, where $g_j(a_i)$ are the partial evaluations of the alternative $a_i$.
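As a minimal illustration of such an operator, the following sketch applies a weighted-sum aggregation, one common choice of $\Phi$, to a made-up decision matrix; all scores and weights are illustrative only:

```python
import numpy as np

# Decision matrix: M alternatives (rows) evaluated on N criteria (columns).
# The scores below are illustrative placeholders.
g = np.array([[0.80, 0.70, 0.90],
              [0.60, 0.95, 0.85],
              [0.75, 0.80, 0.70]])
w = np.array([0.5, 0.3, 0.2])        # analyst's weights over the criteria, summing to 1

# Weighted-sum aggregation: V(a_i) = sum_j w_j * g_j(a_i)
V = g @ w
print("global values:", V, "-> best alternative:", np.argmax(V))
```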
MCDM methods can be categorized into two general groups [27]: the multi-objective decision-making approach assumes a theoretically infinite (or very large) number of alternatives, whereas multi-attribute decision-making requires the assessment of a finite number of alternatives, which corresponds to the most common situation in financial decision-making problems (e.g., credit approval applications).
A rather different taxonomy identifies four categories [28]: (i) multi-objective mathematical programming, (ii) multi-attribute utility/value theory, (iii) outranking relations, and (iv) preference disaggregation analysis. As already pointed out, the present work concentrates on the outranking relations approach because it is recognized as one of the most effective ways to face the complexity of business and financial decision-making problems. In addition, unlike other MCDM techniques, the outranking relations methods are able to deal with any kind of decision problematic.
Performance assessment of classification algorithms requires dealing with various complementary criteria of interest, typically weighing the gains on each criterion against the others. Taking this into account, choosing the best performing prediction model can be considered as a particular MCDM problem, where $M$ represents the number of prediction models (alternatives) and $N$ expresses the number of performance assessment measures (criteria). In the framework of credit risk analysis, the MCDM techniques should allow analysts and decision-makers to select the algorithm that yields a near-optimal compromise between the evaluation criteria.
Well-known examples of the numerous MCDM algorithms presented in the literature are TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution), which is a representative of multi-attribute value theory, and PROMETHEE (Preference Ranking Organization METHod for Enrichment of Evaluations), which belongs to the outranking techniques. Apart from their conceptual and implementational simplicity, both of these methods present some interesting benefits over other models [29]; for instance, they provide a single result in the form of a scalar value that matches the logic of human decision-making.
2.1. The TOPSIS Method
The basis of TOPSIS is to rank the alternatives, or to discover the best alternative, by simultaneously minimizing the distance to the positive ideal solution and maximizing the distance from the negative ideal solution [30]. The positive ideal solution ($A^+$) is shaped as a combination of the best performance values of any alternative for each criterion, whilst the negative ideal solution ($A^-$) corresponds to the combination of the worst performance values. Afterwards, the procedure computes the separations of each alternative $a_i$ from the positive and negative ideal solutions, $S_i^+$ and $S_i^-$, using the $N$-dimensional Euclidean distance. Finally, the relative proximity to the ideal solution is computed as $C_i = S_i^- / (S_i^+ + S_i^-)$. Note that $0 \le C_i \le 1$ because $S_i^+ \ge 0$ and $S_i^- \ge 0$. Then, the alternatives can be ranked using this index in decreasing order, without the need for criterion preferences to be independent [31].
Let us assume an MCDM problem with $M$ alternatives and $N$ criteria represented as a decision matrix (Table 1); then, the TOPSIS method can be defined following the steps of Algorithm 1, which ranks the alternatives completely based on their global utilities [30].
Algorithm 1 TOPSIS
1: Compute the normalized decision matrix, where the normalized value of the original score $x_{ij}$ is computed as $r_{ij} = x_{ij} / \sqrt{\sum_{i=1}^{M} x_{ij}^2}$
2: Compute the weighted normalized values $v_{ij} = w_j r_{ij}$, where $w_j$ denotes the weight of the $j$-th criterion and $\sum_{j=1}^{N} w_j = 1$
3: Compute the positive and negative ideal solutions, $A^+ = \{v_1^+, \ldots, v_N^+\} = \{(\max_i v_{ij} \mid j \in I), (\min_i v_{ij} \mid j \in J)\}$ and $A^- = \{v_1^-, \ldots, v_N^-\} = \{(\min_i v_{ij} \mid j \in I), (\max_i v_{ij} \mid j \in J)\}$, where $I$ and $J$ are associated with benefit and cost criteria, respectively
4: Compute the separation of each alternative from the positive and negative ideal solutions, $S_i^+ = \sqrt{\sum_{j=1}^{N} (v_{ij} - v_j^+)^2}$ and $S_i^- = \sqrt{\sum_{j=1}^{N} (v_{ij} - v_j^-)^2}$
5: Compute the relative proximity to the ideal solution: the relative closeness of the alternative $a_i$ with respect to $A^+$ is defined as $C_i = S_i^- / (S_i^+ + S_i^-)$
6: Rank the alternatives based on the decreasing order of $C_i$
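As an illustration, the steps of Algorithm 1 can be rendered in a compact NumPy sketch; the decision matrix, weights, and benefit/cost flags below are made-up examples, not values from the experiments:

```python
import numpy as np

def topsis(X, w, benefit):
    """Rank M alternatives (rows of X) on N criteria with weights w.
    benefit[j] is True for a benefit criterion, False for a cost criterion."""
    R = X / np.sqrt((X ** 2).sum(axis=0))                     # step 1: vector normalization
    V = R * w                                                 # step 2: weighted normalized matrix
    A_pos = np.where(benefit, V.max(axis=0), V.min(axis=0))   # step 3: positive ideal solution
    A_neg = np.where(benefit, V.min(axis=0), V.max(axis=0))   #         negative ideal solution
    S_pos = np.sqrt(((V - A_pos) ** 2).sum(axis=1))           # step 4: separations
    S_neg = np.sqrt(((V - A_neg) ** 2).sum(axis=1))
    C = S_neg / (S_pos + S_neg)                               # step 5: relative proximity
    return C, np.argsort(-C)                                  # step 6: decreasing order of C

# Three classifiers scored on accuracy, AUC (benefit) and RMSE (cost); made-up numbers.
X = np.array([[0.86, 0.91, 0.33],
              [0.84, 0.93, 0.35],
              [0.81, 0.88, 0.39]])
C, ranking = topsis(X, np.array([0.4, 0.4, 0.2]), np.array([True, True, False]))
print("proximities:", C, "-> ranking:", ranking)
```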
2.2. The PROMETHEE Method
The PROMETHEE methodology [32] intends to select the best alternatives (PROMETHEE I) or to sort the alternatives based on their values over different criteria (PROMETHEE II). As an outranking relations technique, the PROMETHEE method quantifies a ranking through the pairwise comparisons (differences) of alternatives $(a, b)$ to determine the preference index $\pi(a, b)$, which reflects how $a$ is preferred to $b$. The calculation of the preference index is based on the specification of the normalized weights $w_j$ and the preference functions $P_j(a, b)$ for each criterion $g_j$. The idea of this index is similar to that of the global concordance index in the ELECTRE methodology: the higher the preference index, the higher the strength of the preference for $a$ over $b$.
On the other hand, the PROMETHEE methodology also makes use of the concepts of positive and negative preference flows [33]: the positive preference flow $\phi^+(a)$ evaluates how a given alternative $a$ outranks the remaining alternatives, and the negative preference flow $\phi^-(a)$ measures how an alternative $a$ is outranked by all the other alternatives. Finally, the global net preference flow, which is calculated as $\phi(a) = \phi^+(a) - \phi^-(a)$, indicates how an alternative $a$ is outranking ($\phi(a) > 0$) or outranked ($\phi(a) < 0$) by all the other alternatives on all the evaluation criteria. As a result, the alternative with the maximum global net preference flow will be deemed to be the best.
The general PROMETHEE methodology can be easily implemented in the form of a stepwise procedure as defined in Algorithm 2.
Algorithm 2 PROMETHEE
1: For each pair $(a, b)$ of a finite set of alternatives $A$, compute the aggregated preference indices $\pi(a, b) = \sum_{j=1}^{N} w_j P_j(a, b)$ and $\pi(b, a) = \sum_{j=1}^{N} w_j P_j(b, a)$
2: Compute the positive and negative preference flows, $\phi^+(a) = \frac{1}{M-1} \sum_{x \in A} \pi(a, x)$ and $\phi^-(a) = \frac{1}{M-1} \sum_{x \in A} \pi(x, a)$
3: Compute the net preference flow for each alternative as $\phi(a) = \phi^+(a) - \phi^-(a)$
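A minimal sketch of these steps is given below, assuming the usual-criterion preference function $P_j(a, b) = 1$ if $g_j(a) > g_j(b)$ and $0$ otherwise (the simplest of the standard PROMETHEE preference functions; other shapes can be substituted). The decision matrix and weights are illustrative only:

```python
import numpy as np

def promethee2(X, w):
    """Net preference flows for M alternatives (rows of X, all criteria treated
    as benefit) using the usual-criterion function P_j(a,b) = 1{g_j(a) > g_j(b)}."""
    M, N = X.shape
    pi = np.zeros((M, M))                         # step 1: aggregated preference indices
    for a in range(M):
        for b in range(M):
            pi[a, b] = np.sum(w * (X[a] > X[b]))  # pi(a,b) = sum_j w_j * P_j(a,b)
    phi_pos = pi.sum(axis=1) / (M - 1)            # step 2: positive preference flow
    phi_neg = pi.sum(axis=0) / (M - 1)            #         negative preference flow
    return phi_pos - phi_neg                      # step 3: net flow phi(a)

# Same illustrative matrix as before, with the cost criterion (RMSE) negated
# so that larger values are always preferred.
X = np.array([[0.86, 0.91, -0.33],
              [0.84, 0.93, -0.35],
              [0.81, 0.88, -0.39]])
phi = promethee2(X, np.array([0.4, 0.4, 0.2]))
print("net flows:", phi, "-> ranking:", np.argsort(-phi))  # maximum phi is best
```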
4. Results
Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10 provide the results of each classifier on the seven performance assessment criteria (accuracy, root mean squared error, true-positive and true-negative rates, AUC, geometric mean, and F-measure) for each database. On the other hand, Table 11 reports the mean value across all data sets generated by each prediction model on each metric, which is here used to illustrate the performance of that classifier. For each performance metric, the best performing algorithm has been highlighted in boldface.
As can be observed in Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10, no algorithm achieved the best performance across all criteria. For instance, when analyzing the results over the Australian database, logistic regression, RIPPER, and random forest were the prediction methods with the highest accuracy rate and F-measure, whereas the naïve Bayes classifier was the best performing algorithm in terms of the TN-rate. An even more obvious example is given by the results over the Thomas database: the Bayesian belief network, logistic regression, MLP, and SVM achieved the highest rates when using the accuracy, the naïve Bayes classifier was the model with the highest true-negative rate and geometric mean, and MLP and random forest were the best algorithms on the F-measure.
These results show that there was a significant discrepancy across the set of criteria. Consequently, different conclusions about the best performing method could be drawn depending on the performance assessment metric used. These conflicting outcomes depict a realistic scenario in which a pool of analysts or decision-makers might make very different decisions depending on the criteria used to measure the performance of a credit granting decision system. In our opinion, this is an illustrative example of the real-life applications where the MCDM techniques should be taken into consideration for making more consistent, trustworthy decisions.
The conflicting results associated with the use of single performance assessment criteria led us to carry out some experiments with the MCDM methods included in this study. Taking into account that identifying the relative weights of criterion importance is nontrivial, one can use either subjective or objective weighting methods [45]. While the subjective methods determine weights solely according to the decision-maker's judgments/preferences, the objective methods define weights by solving mathematical models automatically, without any consideration of the decision-maker's preferences. In general, objective weighting is applied to situations where reliable subjective weights cannot be obtained [46].
In this work, the weights used by the TOPSIS and PROMETHEE methods were set in line with the relative relevance of the performance evaluation measures for credit granting decision problems. For instance, AUC, G-mean, and F-measure have traditionally been deemed significant performance metrics for this application domain because they choose optimal methods independently of the class distribution and the misclassification costs [44,47]. Keeping these questions in mind, the elicitation of weights was based on the subjective procedure of the fuzzy approach proposed by Wang and Lee [45], and the weights were then normalized in the interval $[0, 1]$ (see the last row of Table 11).
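The fuzzy elicitation itself is beyond the scope of a short sketch, but the final normalization step can be illustrated as follows; the raw importance scores below are purely hypothetical placeholders (the actual values are those reported in Table 11):

```python
import numpy as np

# Hypothetical raw importance scores for the seven criteria (accuracy, RMSE,
# TP-rate, TN-rate, AUC, G-mean, F-measure); the real values came from the
# fuzzy elicitation procedure and are not reproduced here.
raw = np.array([0.6, 0.4, 0.5, 0.5, 0.9, 0.8, 0.8])
w = raw / raw.sum()          # normalize so the weights lie in [0, 1] and sum to 1
print("normalized weights:", w)
```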
Table 12 reports the ranks and the preference values of the prediction models given by TOPSIS and PROMETHEE. Note that the higher the rank, the better the classifier. The analysis of the ranks produced by these two MCDM techniques reveals that random forest and logistic regression were the best performing algorithms, since both TOPSIS and PROMETHEE agreed on this decision. Paradoxically, despite the conclusions drawn by some authors [17], the SVM appeared as one of the worst alternatives for credit granting decision problems according to the ranks produced by TOPSIS and PROMETHEE; this situation could be explained by the employment of unsuitable performance assessment criteria in those studies, a misleading result that the MCDM techniques are able to correct. In addition, the naïve Bayes classifier and the 1NN decision rule were among the worst ranked classification algorithms.
Despite the ranks achieved with TOPSIS and PROMETHEE being rather similar to one another, a composite ranking score was further defined as the mean of the preference values of both techniques for each prediction method $i$. This composite score allows for combining the preference rates $C_i$ (the TOPSIS relative proximity) and $\phi_i$ (the PROMETHEE net preference flow) of an alternative (prediction model) $i$ in a fair manner as follows:
$$R_i = \frac{C_i + \phi_i}{2}$$
Furthermore, this score can be easily generalized to $L$ different MCDM methods as
$$R_i = \frac{1}{L} \sum_{j=1}^{L} p_{ij}$$
where $p_{ij}$ denotes the preference value given by the method $j$.
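A direct rendering of this composite score, assuming the preference values of the individual methods have already been scaled to a comparable range (the numbers below are made up, not taken from Table 12):

```python
import numpy as np

# Illustrative preference values for four classifiers from L = 2 MCDM methods,
# e.g., TOPSIS relative proximities and (rescaled) PROMETHEE net flows.
prefs = np.array([[0.74, 0.71],      # rows: prediction models i
                  [0.69, 0.66],      # columns: MCDM methods j = 1..L
                  [0.41, 0.45],
                  [0.33, 0.30]])
composite = prefs.mean(axis=1)       # R_i = (1/L) * sum_j p_ij
print("composite scores:", composite)
print("ranking:", np.argsort(-composite))   # higher composite score = better model
```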
Figure 2 displays a graphical representation of the composite scores, which provides a simple way of visualizing the rationale of the decisions made. It clearly shows that both random forest and logistic regression are superior to all the other classifiers; on the other hand, the poor performance achieved by the naïve Bayes, SVM, and 1NN algorithms is also apparent.
5. Conclusions
The present analysis supports the synergetic application of MCDM techniques for the performance assessment of credit granting decision systems. Through a series of experiments, it has been shown that the use of an individual metric may give rise to inconsistent conclusions about which prediction model is the best for a given problem, which could lead to selecting an inappropriate method with less reliable results.
TOPSIS and PROMETHEE, two well-known MCDM techniques, have been tested in experiments applying ten prediction models (alternatives) to six real-world bankruptcy and credit data sets and using seven performance evaluation criteria. The use of single performance metrics has designated different classifiers as the most suitable alternatives. These results suggest that credit granting decision corresponds to a real-world application where the MCDM techniques are especially useful to consistently assess a pool of classifiers and help decision-makers choose the most beneficial model. In our experiments, both TOPSIS and PROMETHEE have determined that random forest and logistic regression are the best performing prediction methods on most of the performance evaluation measures.
Furthermore, we have also introduced a simple score that can be expressed as a linear combination of the preference values given by a number of MCDM methods. The most important advantages of this score are two-fold: (i) it converts the individual preference values of the MCDM models into a single scalar, thus allowing for more trustworthy decisions; and (ii) it can be graphically represented for a better understanding of the decisions made.
In the experiments, we have tested 10 classification models using their default parameter values given in WEKA. It is known that some of these classifiers can yield widely different results depending on the value of their parameters (e.g., the kernel function used in SVM, or the number of decision trees in a random forest). As future work, a more exhaustive analysis of the optimal parameter values for the classification problem here addressed should be performed.