Comparison of Profit-Based Multi-Objective Approaches for Feature Selection in Credit Scoring

Abstract: Feature selection is crucial to the credit-scoring process, allowing for the removal of irrelevant variables with low predictive power. Conventional credit-scoring techniques treat this as a separate process wherein features are selected based on improving a single statistical measure, such as accuracy; however, recent research has focused on meaningful business parameters such as profit. More than one factor may be important to the selection process, making multi-objective optimization methods a necessity. However, the comparative performance of multi-objective methods is known to vary depending on the test problem and specific implementation. This research employed a recent hybrid non-dominated sorting binary Grasshopper Optimization Algorithm (NSBGOA) and compared its performance on multi-objective feature selection for credit scoring to that of two popular benchmark algorithms in this space. A further comparison was made to determine the impact of changing the profit-maximizing base classifier on algorithm performance. Experiments demonstrate that, of the base classifiers used, the neural network classifier improved the profit-based measure and minimized the mean number of features in the population the most. Additionally, NSBGOA gave relatively smaller hypervolumes and increased computational time across all base classifiers, while giving the highest mean objective values for the solutions. The base classifier clearly has a significant impact on the results of multi-objective optimization; therefore, it should be chosen carefully for each application scenario.


Introduction
Credit-scoring evaluations are an important part of the lending process, allowing financial institutions to manage risks [1]. Feature selection, a crucial part of the credit-scoring process, typically aims to minimize the number of features, thereby reducing model complexity, data acquisition costs, and computation time [2]. Traditionally, feature selection is conducted as a separate step before model training and is used to improve a single statistical measure, such as the area under the receiver operating characteristic curve (AUC) [3]. Beyond this, other factors, such as the profitability of the resulting model [4,5], have been the focus of the feature-selection process. These factors, which depend on the data and application, can be incorporated into the feature-selection process as objectives in multi-objective optimization (MOO).
MOO algorithms allow designers to balance several, often conflicting, objectives [6]. These methods have been applied to simultaneously consider the number of features and another training objective, such as profit, in feature selection [7]. Several algorithms have been developed to handle MOO problems, including the Strength Pareto Evolutionary Algorithm (SPEA-II), non-dominated sorting genetic algorithm (NSGA-II) [8,9], and its reference-based adaptation for many-objective problems, NSGA-III. Hybrid algorithms, which integrate aspects of two or more optimization methods, have also been employed.
An example is the adaptation of the continuous Grasshopper Optimization Algorithm (GOA) for filter-based feature selection through the introduction of binary conversion [10,11]. Further examples used non-dominated sorting strategies to adapt the Cuckoo Optimization Algorithm (COA) to multiple objectives [12]. A non-dominated sorting binary GOA, NSBGOA, was proposed for feature selection with optimization of multiple objectives [13].
Existing research has shown that even for closely related multi-objective algorithms, performance varies depending on the test problem [14]. However, there is limited research comparing the performance of different multi-objective algorithms for feature selection. In particular, existing research tends to rely on one base classifier, chosen at the analyst's discretion. Changes in performance due to different base classifiers require further examination. This research aims to fill the gap in three ways. First, it compares the performance of several multi-objective methods on feature selection in credit scoring, namely NSGA-II, NSGA-III, and the newly proposed hybrid meta-heuristic, NSBGOA. Second, the effect of different base classifiers on performance is considered to determine the most suitable. Third, these multi-objective methods are compared to conventional feature-selection techniques. Three common objectives for credit-scoring feature selection are employed: maximizing profit, selecting features that are more easily explained to stakeholders, and minimizing the number of features [4,15].
Related research is considered in Section 2, while Section 3 introduces the methods used in this research. Section 4 details the problem formulation and empirical evaluation. Results of the evaluation are shown in Section 5 and discussed in Section 6. Lastly, the conclusion is given in Section 7.

Profit Scoring
Credit scoring is defined as "…a set of decision models and their underlying techniques that aid lenders in granting consumer credit" [1]. Its core purpose is to assess the risk of lending to a prospective borrower. Historically, credit decisions were based on the lender's knowledge of the borrower. In modern times, statistical and machine learning approaches have taken precedence. The goal of these approaches to credit scoring is to identify borrowers who are likely to show some negative behavior. Recent research has trended towards evaluation of profit as part of the credit-scoring process because it allows for improved decision-making by lenders. A profit measure comprising benefits versus losses due to misclassification was proposed, with varying data variable acquisition costs also being considered [5]. The Internal Rate of Return (IRR) was used to measure the profitability of peer-to-peer loans [16]. A new measure, expected maximum profit (EMP), which is composed of the benefits of correct classification and costs of misclassification, was suggested [17] and reworked for consumer credit scoring [18].

Feature Selection
The selection of input features is an important part of model building. Typically, the process aims to ensure optimum model performance with minimum features. This reduces noise, data costs, and the risk of overfitting. Feature-selection methods are generally classified into wrapper, filter, and embedded methods [2]. With wrapper methods, models are fit with subsets of the features, and the resulting model performance is evaluated. However, due to the high computational cost, they are difficult to run on datasets with a large number of features. Examples include backward and forward selection. For filter methods, features are selected based on inherent properties, such as variance. Analysis of variance (ANOVA) is an example of this [19]. Finally, embedded methods, such as LASSO [20] and ridge regression, perform feature selection and model fitting simultaneously.
Feature selection has been conducted by using support vector machines based on EMP in an embedded method [4]. A profit-based measure was applied with a Holdout Support Vector Machine (HOSVM) to extract the features with highest profitability [5].
Feature selection was conducted by using mixed-integer linear programming models with varying acquisition costs as constraints [15]. The orthogonal transform was used for dimensionality reduction, thus reducing the number of features for model training, leading to faster convergence and better performance [21]. Feature selection is carried out by integrating a multicriteria optimization classifier (MCOC) with a one-norm regularization term inspired by the LASSO regression method to create a sparse feature vector [22].
Multiple objectives have also been optimized in feature selection through multi-objective feature analysis. Existing literature has several examples of optimizing the feature-selection process with two objectives. A non-dominated sorting genetic algorithm-II (NSGA-II) fitted to maximize the expected maximum profit (EMP) and minimize the number of features was demonstrated [7]. Mutual information and entropy were optimized for filter-based feature selection with a non-dominated sorting binary Particle Swarm Optimization (NSBPSO) [23]. Binary Grasshopper Optimization Algorithms were applied for filter-based feature selection based on error rate and number of features in References [10,11]. A wrapper-based multi-objective evolutionary algorithm optimized feature selection with three objectives: default prediction, exposure at default, and number of features [24]. Two objectives, number of features and root mean square error (RMSE), were optimized in feature selection with a multi-objective genetic algorithm and neuro-fuzzy models [25].

Multi-Objective Optimization
Multi-objective evolutionary algorithms (MOEAs), including the Strength Pareto Evolutionary Algorithm (SPEA-II) and the binary non-dominated sorting genetic algorithm (NSGA-II), have been applied to problems with two objectives [8,9]. Non-dominated sorting has also been incorporated with meta-heuristic optimizers for multi-objective problems. For instance, the Particle Swarm Optimizer [26] and the Ant Colony optimizer [27] have both been adapted to multi-objective problems. A binary version of the Grasshopper Optimization Algorithm has been developed for feature selection [11].
For many-objective optimization problems (MaOPs), which typically involve three or more objectives, the performance of these algorithms degrades as the number of objectives increases, due in part to the large number of mutually non-dominated solutions [28]. As such, indicator, aggregation, and reference-based methods have been proposed to tackle MaOPs. Examples include NSGA-III, a reference-based extension of NSGA-II for many-objective problems [29]. When compared on several test problems with different numbers of objectives, it was determined that NSGA-III does not always outperform NSGA-II. In fact, the performance is affected by the number of objectives and the specific test problem evaluated [14].

Multi-Objective Optimization
The Pareto-optimal set is a non-dominated set of solutions which allows decision-makers to find a trade-off where more than one objective is involved [30]. Multi-objective optimization (MOO) methods guide the search for solutions towards the Pareto-optimal set. MOOs are especially important for feature selection where more than one objective must be considered, for instance, maximizing profit while minimizing the number of features. Mathematically, multi-objective optimization problems may be expressed by Equations (1)-(3): minimize $F(x) = (f_1(x), f_2(x), \dots, f_m(x))^T$ subject to $x \in \Omega$, where the objective function vector $F(x)$ maps $F: \Omega \to \Lambda$. Here, the decision space and decision vector are $\Omega$ and $x$, respectively, and $m$ is the number of objectives. For many-objective optimization problems (MaOPs), $m \geq 3$.
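The notion of a non-dominated set can be illustrated with a minimal Python sketch. It assumes every objective is minimized (a maximized objective such as profit is negated), and the candidate values are hypothetical, not results from this study.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical objective vectors: (negated profit, number of features).
candidates = [(-0.030, 12), (-0.028, 7), (-0.025, 9), (-0.020, 4)]
front = non_dominated(candidates)
```

In this example, the third candidate is dropped because the second achieves both a higher profit and fewer features; the remaining three represent distinct trade-offs.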

Non-Dominated Sorting Genetic Algorithm (NSGA-II)
NSGA-II is a popular multi-objective optimization algorithm that can be applied to feature-selection problems [8]. The algorithm begins by evaluating the fitness of the initial population of potential solutions. From these, "parent solutions" are selected and crossed to generate "child solutions". Mutation may occur, in which some components of the solutions are randomly altered. If the stopping criteria are not met, a non-dominated sorting scheme is used to select the best solutions, with diversity maintained by calculating and maximizing the crowding distance between solutions. The result of this sorting process becomes the population for the next round of evaluation. In this manner, the population converges towards the set of overall best, non-dominated solutions known as the Pareto frontier. NSGA-II can be adapted to feature selection by denoting the individuals as different feature combinations and setting the number of features as an objective in addition to the main objective, such as profit. The process is given in Algorithm 1.
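The two selection ingredients described above, non-dominated sorting and crowding distance, can be sketched as follows. This is a simplified illustration under the assumption that all objectives are minimized; it uses a straightforward repeated-filtering sort rather than the bookkeeping of the original fast non-dominated sort, and the function names are illustrative.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def sort_into_fronts(objs: List[Sequence[float]]) -> List[List[int]]:
    """Group solution indices into successive non-dominated fronts."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objs[j], objs[i]) for j in remaining))
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(objs: List[Sequence[float]], front: List[int]) -> Dict[int, float]:
    """Sum, over objectives, of the normalized gap between each solution's neighbors."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        # Boundary solutions are always kept, so they get infinite distance.
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = objs[ordered[k + 1]][m] - objs[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist


# Hypothetical objective vectors for five solutions.
objs = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
fronts = sort_into_fronts(objs)
dist = crowding_distance(objs, fronts[0])
```

When the next population is filled, whole fronts are taken in order, and the last front that does not fit entirely is truncated by preferring solutions with larger crowding distance.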

Non-Dominated Sorting Genetic Algorithm (NSGA-III)
Deb and Jain [29] proposed the NSGA-III algorithm as an extension of NSGA-II to deal with MaOPs. NSGA-III generates a reference set from virtual points in the objective space to measure the quality of solutions. A population, composed of potential solutions, is initialized. As with genetic algorithms, the fitness of the solutions is assessed by computing the fitness or objective functions. So-called "parent" solutions are selected and crossed over to obtain "child" solutions.
Mutations may also occur, in which components of the solutions are altered. The population is normalized and associated with the reference set by the orthogonal distance to reference lines. During selection, rather than preserving diversity by using the crowding distance, as is the case with NSGA-II, NSGA-III uses niche preservation based on the reference set [31]. This process is shown in Algorithm 2.
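The reference-set association step can be sketched in Python as below. The sketch assumes the commonly used Das-Dennis construction of uniformly spaced reference points on the unit simplex and assumes the objective vectors have already been normalized; the function names and the number of divisions are illustrative choices, not settings from this study.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence, Tuple


def das_dennis(m: int, p: int) -> List[Tuple[float, ...]]:
    """All m-dimensional points with coordinates k/p (k integer) that sum to 1."""
    points = []
    # Stars and bars: choose positions of m - 1 bars among p + m - 1 slots.
    for bars in combinations(range(p + m - 1), m - 1):
        prev, counts = -1, []
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(p + m - 2 - prev)
        points.append(tuple(c / p for c in counts))
    return points


def perpendicular_distance(f: Sequence[float], w: Sequence[float]) -> float:
    """Distance from point f to the line through the origin with direction w."""
    scale = sum(a * b for a, b in zip(f, w)) / sum(b * b for b in w)
    return sqrt(sum((a - scale * b) ** 2 for a, b in zip(f, w)))


def associate(f: Sequence[float], refs: List[Sequence[float]]) -> int:
    """Index of the reference line closest to the normalized objective vector f."""
    return min(range(len(refs)), key=lambda k: perpendicular_distance(f, refs[k]))


refs = das_dennis(m=3, p=4)                      # 15 reference points for 3 objectives
nearest = refs[associate((1.0, 1.0, 0.0), refs)]
```

Niche preservation then favors solutions attached to reference lines that currently have few associated members.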

Non-Dominated Sorting Binary Grasshopper Optimization Algorithm (NSBGOA)
NSBGOA is a hybrid meta-heuristic proposed to handle multi-objective feature selection [13]. It adapts the Grasshopper Optimization Algorithm (GOA) [32], a swarm intelligence based optimizer that models the behavior of grasshoppers in a swarm. Each individual grasshopper's position is a possible solution, and the velocity of the individual grasshoppers as they attempt to swarm into the so-called "comfort zone" is updated with each iteration. This velocity is a function of their social interaction and movement towards the target.
For feature selection, the continuous GOA algorithm is converted to a binary GOA by introducing the sigmoidal transfer function of Equation (4) [10]. The velocity, ΔX, is adapted to Equation (5), where $d_{ij}$ is the distance between two grasshoppers, the function $s$ is the strength of the social forces, parameter $c$ decreases the comfort zone with each iteration, and $ub_d$ and $lb_d$ are the upper bound and lower bound in the dth dimension. Equation (6) gives the dth dimension of a grasshopper in the next iteration. Additionally, non-dominated sorting is integrated into the algorithm to allow for comparison of multiple objectives. The resulting procedure is given in Algorithm 3.
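The binary conversion can be illustrated with a short Python sketch. It assumes the commonly used sigmoid transfer function, T(ΔX) = 1/(1 + e^(−ΔX)), and a probabilistic update in which a dimension is set to 1 when a uniform random draw falls below T(ΔX); the exact forms of Equations (4) and (6) are not reproduced in this text, so both are assumptions, as are the function names.

```python
import math
import random
from typing import List, Sequence


def sigmoid_transfer(step: float) -> float:
    """Map a continuous step to a probability in (0, 1), avoiding overflow."""
    if step >= 0:
        return 1.0 / (1.0 + math.exp(-step))
    e = math.exp(step)
    return e / (1.0 + e)


def binarize(steps: Sequence[float], rng: random.Random) -> List[int]:
    """Assumed update rule: bit d becomes 1 if a uniform draw is below T(step_d)."""
    return [1 if rng.random() < sigmoid_transfer(s) else 0 for s in steps]


rng = random.Random(0)
# Hypothetical step values: a large positive step almost surely selects the feature.
mask = binarize([50.0, -50.0], rng)
```

Each resulting bit indicates whether the corresponding feature is included in the candidate subset.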

Expected Maximum Profit (EMP)
Expected maximum profit (EMP), the maximum profit obtainable, is a profit-based metric [18] that is applicable to credit scoring. Four potential classification outcomes exist, as per Table 1. When a good borrower is rejected, the lender loses the return on investment (ROI). Additionally, accepting a bad borrower results in loss of the benefit, b ∈ [0,1], expressed by Equation (7), $b = \mathrm{LGD} \times \mathrm{EAD} / A$, with exposure at default (EAD), loss given default (LGD), and principal amount (A) of the loan [33]. In Table 1, this benefit is listed with probability π0F0. The benefit, b, depends on how much of the loan is repaid:
• b = 0 with probability p0 that the loan is repaid in full,
• b = 1 with probability p1 that no portion of the loan is repaid.
Finally, EMP is given by Equation (8), with prior probability of default (π0), prior probability of non-default (π1), predicted cumulative distribution functions (F0 and F1), and constant ROI.
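A minimal numerical sketch of an EMP-style calculation is given below. It makes several assumptions that are not stated in the text: scores measure creditworthiness, so applicants at or below a cutoff are rejected; the profit at a cutoff t is b·π0·F0(t) − ROI·π1·F1(t); and the probability mass not covered by p0 and p1 is spread uniformly over b in (0, 1). The function names are illustrative, and this is not the implementation of the "EMP" R package.

```python
from typing import List, Sequence, Tuple


def roc_points(bad: Sequence[float], good: Sequence[float]) -> List[Tuple[float, float]]:
    """(F0, F1) for every cutoff: shares of bad and good scores at or below it."""
    points = [(0.0, 0.0)]  # Cutoff below all scores: nobody is rejected.
    for t in sorted(set(bad) | set(good)):
        f0 = sum(s <= t for s in bad) / len(bad)
        f1 = sum(s <= t for s in good) / len(good)
        points.append((f0, f1))
    return points


def max_profit(points, b: float, pi0: float, pi1: float, roi: float) -> float:
    """Best achievable profit over all cutoffs for a given benefit b."""
    return max(b * pi0 * f0 - roi * pi1 * f1 for f0, f1 in points)


def emp(bad, good, p0=0.55, p1=0.10, roi=0.2644, steps=1000) -> float:
    pi0 = len(bad) / (len(bad) + len(good))  # Prior probability of default.
    pi1 = 1.0 - pi0
    points = roc_points(bad, good)
    # Midpoint rule for the assumed uniform part of the benefit distribution.
    interior = sum(max_profit(points, (k + 0.5) / steps, pi0, pi1, roi)
                   for k in range(steps)) / steps
    return (p0 * max_profit(points, 0.0, pi0, pi1, roi)
            + p1 * max_profit(points, 1.0, pi0, pi1, roi)
            + (1.0 - p0 - p1) * interior)


# Hypothetical scores with perfect separation of defaulters and non-defaulters.
example_emp = emp(bad=[0.1, 0.2], good=[0.8, 0.9])
```

Maximizing over the empirical cutoffs gives the same result as maximizing over the convex hull of the ROC curve, because the profit is linear in F0 and F1.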

Performance Metrics
To evaluate the output of many- and multi-objective optimizations, the hypervolume indicator (HV) may be used. HV, which can be used to evaluate convergence and distribution, is denoted by Equation (9), where λm is the m-dimensional Lebesgue measure and m is the number of objectives [34]. It calculates the volume of objective space dominated by the Pareto front approximation, P, and delimited from above by the reference point r, such that every z ∈ P dominates r.
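For intuition, the hypervolume of a two-objective front can be computed with a simple sweep, as in the Python sketch below. It assumes both objectives are minimized and covers only the two-dimensional case, whereas this study uses three objectives; the function name and example values are hypothetical.

```python
from typing import Iterable, Tuple


def hypervolume_2d(front: Iterable[Tuple[float, float]], ref: Tuple[float, float]) -> float:
    """Area dominated by the front and bounded above by the reference point."""
    # Only points that dominate the reference point contribute.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:  # Points that do not improve on best_y add no new area.
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


# Hypothetical front with one dominated point, (2.5, 2.5).
hv = hypervolume_2d([(1, 3), (2, 2), (2.5, 2.5), (3, 1)], ref=(4, 4))
```

A larger value indicates that the front covers more of the objective space below the reference point, so the choice of reference point affects the comparison between algorithms.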

Problem Formulation
With credit-scoring data identified, objectives may be listed. To achieve this, the following definitions are given:
1. Available features, X (Equation (10)): a set of j variables that could be used to predict loan repayment.
2. Number of features, n (Equation (11)): the number of selected features per solution.
3. Expected Maximum Profit, EMP (Equation (8)): a profit-based measure for credit scoring.
4. Ease of explanation, C (Equation (12)): a vector representing the ease of explaining each variable to stakeholders.
5. Default status, D (Equation (13)): a vector with loan repayment information for each borrower.
6. Borrower information, B (Equation (14)).
Each solution, P (Equation (15)), comprises n features such that P ⊂ X. The goal is to select a set of solutions, each a distinct subset of X, whose objective values are non-dominated.

Contribution
NSGA-II was used for feature selection in credit scoring with two objectives, profit and number of features [7], and different base classifiers were applied. However, its performance on different base classifiers was not compared to that of other multi-objective methods. Feature selection was conducted by introducing data-acquisition costs as constraints to a support vector machine (SVM) classifier for credit scoring [15]. Although the performance of NSGA-II and NSGA-III on different problems was compared [14], these comparisons were limited to test problems, not feature selection. Interestingly, it was found that their performance varied depending on the test problem and number of objectives. This leaves open the question of which multi-objective method and base classifier are most suitable for credit-scoring problems. Furthermore, the performance of the NSBGOA algorithm with different base classifiers is still in question. This research aims to answer these open research questions.

Data and Objectives
To test the various algorithms, the German credit dataset [35] was used. It contains 1000 entries, of which 700 are non-default and 300 are default. There are 20 features in the initial dataset: 13 qualitative and 7 numerical. For the purposes of this evaluation, the variables that were judged to have ambiguous definitions, namely V2, V3, V12, V14, V17, and V18, were given ease-of-explanation values of 0. The remaining variables were assigned values of 1. The targets of the optimization are given in Table 2.
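The ease-of-explanation assignment can be sketched in Python as follows. The per-feature values follow the text; how they are aggregated into a single objective is defined in Table 2, which is not reproduced here, so the mean over selected features used below is an assumption, as are the names `EASE` and `objectives`.

```python
from typing import List, Sequence, Tuple

FEATURES: List[str] = [f"V{i}" for i in range(1, 21)]
AMBIGUOUS = {"V2", "V3", "V12", "V14", "V17", "V18"}
# Ease-of-explanation value per feature: 0 if ambiguous, 1 otherwise.
EASE = {name: 0 if name in AMBIGUOUS else 1 for name in FEATURES}


def objectives(mask: Sequence[int]) -> Tuple[int, float]:
    """Return (number of selected features, mean ease of explanation)."""
    selected = [name for name, bit in zip(FEATURES, mask) if bit]
    cardinality = len(selected)
    # Assumed aggregation: average ease over the selected features.
    ease = sum(EASE[n] for n in selected) / cardinality if selected else 0.0
    return cardinality, ease


# Hypothetical subset selecting V1, V2, and V3 only.
subset = [1, 1, 1] + [0] * 17
cardinality, ease = objectives(subset)
```

With all 20 features selected, this assumed aggregation gives an ease-of-explanation value of 0.7, since 14 of the 20 variables carry a value of 1.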

Analysis
Numeric variables went through max-min rescaling, and class imbalance was addressed with the random over-sampling examples technique (ROSE). An assortment of R packages from CRAN was used to conduct the analysis. The NSGA-III and NSGA-II algorithms were implemented with the "rmoo" package. The analysis of NSBGOA was implemented with an appropriately modified version of the GOA function from the "metaheuristicOpt" package. To calculate the objective values for each subset, three classifiers from the "caret" package were trained by using tenfold cross-validation, and the best model was selected based on the EMP ("EMP" package). The parameters for EMP evaluation were set to p0 = 0.55, p1 = 0.1, and ROI = 0.2644, as proposed by the authors in Reference [18]. The three base classifiers were logistic regression (LR), support vector machine with a linear kernel (SVM LIN), and artificial neural networks (NN). Following this, the values of the remaining objectives were calculated according to Table 2. For comparison, a LASSO [20] regression model with alpha = 1 and lambda = 0 was fit on the data, using tenfold cross-validation to maximize EMP. Additionally, a single-objective genetic algorithm (GA) [36] was used for feature selection, with the fitness function being a tenfold cross-validation classifier trained to maximize EMP. Lastly, the three base classifiers were also trained on all the original features to maximize EMP. A population size of 10 and a maximum iteration number of 50 were used for the NSGA-III, NSBGOA, NSGA-II, and GA methods.

Results
The mean hypervolumes of the final populations from five runs of the NSGA-III, NSGA-II, and NSBGOA algorithms are given in Figure 1. Further, Figure 2 shows the mean computational time from five runs of each algorithm on a computer with 8 GB RAM and an Intel(R) Core(TM) i5-7200U CPU at 2.50 GHz, while the objective values of the feature set with the maximum EMP are shown in Figure 3. For a single run of each algorithm, the distribution of points in the outputs of the multi-objective algorithms is shown with scatter plots in Figure 4. For the same run, a statistical summary of the final populations is given in Table 3, and the three algorithms trained to optimize a single objective have their results given in Table 4. The nadir point, which describes the solution set, is also given.

Discussion

Base Classifier
The performance of three base classifiers, namely logistic regression, neural network classifier, and linear support vector machine, was compared. To achieve this, each multi-objective optimization (MOO) algorithm was evaluated with the three base classifiers in turn. The neural network classifier gave the smallest mean number of features per solution (Table 3), with the best EMP values across all multi-objective methods. It is possible that the neural network classifier's ability to consider non-linear interactions in the data leads to better performance with fewer features. However, this classifier also required the greatest computational time (Figure 2), regardless of the MOO algorithm applied. This is to be expected, as the neural network method has a higher computational complexity than the other two classifiers.
Based on the results given in Figure 1, the base classifier does not appear to have a uniform impact on hypervolume across all multi-objective methods. The hypervolumes obtained by the neural network and linear support vector machine classifiers were similar for the NSGA-III and NSGA-II algorithms. On the other hand, for the logistic regression classifier, the hypervolume was higher for NSGA-III. Where the number of features and EMP are more of a concern than computational time, the neural network may be the best option, with NSGA-II or NSBGOA for reduced hypervolume.

Feature Selection Algorithm
Several feature-selection algorithms were compared, namely LASSO, Genetic Algorithm, NSGA-II, NSGA-III, and NSBGOA. Of the methods that focused on optimizing one objective, LASSO and logistic regression gave the poorest values for cardinality, even when the other two objectives performed well. As observed in Table 4 and Figure 3, LASSO regression gave a high EMP value at the expense of cardinality and explainability. Additionally, each of the base classifiers gave high values for explainability and EMP with poor cardinality when computed with all the available features. This was as expected and highlights the insufficiency of such methods when multiple objectives are to be optimized. Furthermore, it is observed that GA methods achieved improved cardinality compared to the LASSO and all feature evaluations. However, the multi-objective methods gave better cardinality than GA in all but two cases (GA with linear support vector machine outperformed NSGA-II and NSGA-III with linear support vector machine). The GA method would be the best option where a single solution is required, and where a high EMP is the main concern with at least some reduction in the number of features.
Among the multi-objective algorithms (NSGA-II, NSGA-III, and NSBGOA), NSBGOA had the longest computation time regardless of base classifier (Figure 2). It produced the smallest hypervolume compared to the other multi-objective methods (Figure 1). The NSBGOA methods also gave the smallest average number of features per feature set (Table 3). In addition, in most cases, the mean objective values were higher than those obtained with NSGA-II and NSGA-III, using comparable base classifiers. Of the multi-objective methods, the NSBGOA algorithm would work best where computational time is not a concern, minimizing the number of features is a major consideration, and multiple solutions are required.

Application
Overall, the multi-objective methods were more efficient in reducing the number of features; however, this led to lower EMP and explainability in some cases. This demonstrates the ability of multi-objective methods to balance several factors. Such methods may be advantageous in cases where the decision-maker is willing to make slight sacrifices in EMP and explainability to reduce the number of features, which could result in reduced data-acquisition costs. Furthermore, it should be noted that the multi-objective methods produce several non-dominated solutions or feature sets with different objective values for each. From the business perspective, these methods give the benefit of a posteriori decision-making. To apply these results, the decision-maker would consider the objective values obtained from each subset of features and select the final solution based on these objectives and priorities.

Conclusions
A comparison was made of the effect of different base classifiers on multi-objective feature-selection methods in credit scoring. Of the base classifiers used (neural network, logistic regression, and linear support vector machine), the neural network classifier improved the profit-based measure and reduced the number of features the most. However, it also significantly increased computational time. Further, the base classifier was found to have an uneven impact on the hypervolume of multi-objective optimization output. It was found that all the multi-objective methods gave a better balance of objectives than the single-objective methods. Additionally, the NSBGOA algorithm gave relatively smaller hypervolumes and increased computational time across all base classifiers. It also resulted in better mean objective values in most cases. This study showed that the performance of the multi-objective method is affected by the base classifier chosen. As such, the implementation of multi-objective methods should carefully consider the base classifier used for evaluation.

Institutional Review Board Statement:
This was waived because data were anonymized by the data provider before they were provided to authors for research purposes.

Informed Consent Statement:
This was waived because data were anonymized by the data provider before they were provided to authors for research purposes.

Data Availability Statement:
Restrictions apply to the availability of these data. Data were obtained from Agribuddy Ltd. and are available with the permission of Agribuddy Ltd.