Cost-Sensitive Metaheuristic Optimization-Based Neural Network with Ensemble Learning for Financial Distress Prediction

Abstract: Financial distress prediction is crucial in the financial domain because of its implications for banks, businesses, and corporations. Poor financial distress prediction can lead to serious financial losses. As a result, significant efforts have been made to develop prediction models that help decision-makers anticipate events before they occur and avoid bankruptcy. Financial distress prediction is a challenging task because the data are usually highly imbalanced. Hence, a wide range of methods and algorithms have been developed over recent decades to address the classification of imbalanced datasets. Metaheuristic optimization-based artificial neural networks have shown promising results in a variety of applications, including classification problems. However, less consideration has been paid to using a cost sensitivity fitness function in metaheuristic optimization-based artificial neural networks to solve the financial distress prediction problem. In this work, we propose ENS_PSONN_cost and ENS_CSONN_cost: metaheuristic optimization-based artificial neural networks that utilize a particle swarm optimizer and a competitive swarm optimizer and five cost sensitivity fitness functions as the base learners in a majority voting ensemble learning paradigm. Three extremely imbalanced datasets from Spanish, Taiwanese, and Polish companies were considered to avoid dataset bias. The results showed significant improvements in the g-mean (the geometric mean of sensitivity and specificity) metric and the F1 score (the harmonic mean of precision and sensitivity) while maintaining adequately high accuracy.


Introduction
The phrases bankruptcy and insolvency are frequently used interchangeably in the literature [1]. Bankruptcy is a legal financial procedure in which an individual or an organization declares that they are unable to pay their obligations. As an outcome of this legal position, the debtor's assets are liquidated to repay some of their debts, while the remainder of their debts are ignored [2]. Insolvency is defined as the failure to pay or the scenario in which a corporation, another legal entity or an individual cannot meet their financial commitments by the maturity date [1]. Hence, financial distress (i.e., bankruptcy or insolvency) prediction is a critical tool within the financial industry that serves as an aid for making appropriate business decisions [3]. The successful forecasting of this challenge provides a broader view of the business's health and assists decision-makers in anticipating occurrences before they happen.
As a result, there has been a significant effort in the literature to construct statistics- and artificial intelligence-based models that can accurately estimate a company's financial state. From a machine learning perspective, previous assessments of a company's condition (i.e., whether or not it has experienced financial distress) are generally treated as a binary classification problem.
The challenge in dealing with financial distress datasets is that they are highly imbalanced. When there are significantly more samples from one class than other classes, the dataset is said to be imbalanced. Due to the effects of the majority class on the traditional training criteria, classifiers may have a high accuracy for the majority class but an extremely low accuracy for the minority class(es). The goal of most original classification algorithms is to reduce the error rate or the percentage of erroneous class label predictions [4].
There are two primary techniques for dealing with imbalanced datasets: at the data level, by resizing the training datasets (undersampling or oversampling), and at the algorithmic level, by using cost-sensitive classifiers [4]. In this work, we evaluated the algorithmic-level approach using a metaheuristic optimization-based artificial neural network (MHOANN) as our classifier, which was based on a particle swarm optimizer (PSO) [5] and a competitive swarm optimizer (CSO) [6] with a cost sensitivity fitness function. We then improved the capabilities of our model using homogeneous majority voting ensemble learning.
Evolutionary neural networks (ENNs) [7-12] are a subset of neural networks (NNs) in which evolution is a key type of adaptation, in addition to learning. Connection weight training, architectural design, learning rule adaptation, input feature selection, connection weight initialization, rule extraction from NNs, and other activities are performed using evolutionary algorithms (EAs) [13].
MHOANNs are a subset of artificial neural networks (ANNs) in which the selection of weights and biases is performed using metaheuristic optimization algorithms [14]. Inspired by the collective behavior of social animals, swarm-based algorithms have been developed into a strong family of optimization approaches. The collection of potential solutions to the optimization issue is characterized in a PSO as a swarm of particles that flow across the parameter space, establishing trajectories that are driven by their own and their neighbors' best performances [15]. On the other hand, a CSO is a recent variation of a PSO in which a pairwise competition mechanism is implemented that causes the losing particle to learn from the winner and update its location [6].
This paper proposes using a cost-sensitive MHOANN to improve the prediction of minor classes in a financial distress dataset and then applying majority voting ensemble learning to create a strong learner out of several weak learners. The cost-sensitive component is used to improve the prediction of the minority classes, whereas the majority voting attempts to mitigate the negative influences of cost on the prediction of the majority class. Applying a cost sensitivity fitness function in an ensemble learning paradigm is different from existing cost-sensitive methods because it reduces the effects of the bias toward the minority classes, which is caused by the costs that are associated with the misclassification of minor class instances in the classical cost-sensitive methods. Moreover, the evolutionary nature of the utilized metaheuristic algorithms provides the accuracy and diversity that are required by ensemble learning to achieve a high prediction capability that exceeds the prediction capability of a single learner. The reason for selecting a PSO and a CSO as the optimization techniques in this work was that, compared to other metaheuristic algorithms, a PSO requires a small number of parameters and a correspondingly lower number of iterations [16]. On the other hand, a CSO is a relatively recent variation of a PSO that was designed to be used for large-scale optimization problems because half of the population is updated during each iteration [17].
To validate this, we used three different datasets from Spanish, Taiwanese, and Polish companies to evaluate the proposed method. The dataset of Spanish companies was considered very challenging, owing to its highly imbalanced distribution in which insolvency cases only formed 2% of the whole sample. In the datasets of Taiwanese companies and Polish companies, insolvency cases formed approximately 3% and 2% of the samples, respectively.
When applying the cost sensitivity fitness function, we noticed a significant improvement in the number of true positive (TP) predictions but an increase in the number of false positive (FP) predictions. To overcome this problem, we used majority voting ensemble learning to maintain the high TP prediction rate and reduce the number of FP predictions. This work proposes a framework for solving financial distress prediction problems for extremely imbalanced datasets. The framework uses a cost sensitivity fitness function to reduce the number of FN predictions. Moreover, it relies on ensemble learning to compensate for the faults of individual learners and reduce the number of FP predictions. All of the steps in the framework are internal and do not affect the data; hence, it can be a helpful tool in financial distress prediction. To the best of our knowledge, our work is the first to combine a cost-sensitive MHOANN with majority voting ensemble learning for financial distress prediction. Another contribution of this work is the comparison of a PSO and CSO as optimization techniques for the MHOANN.
The remainder of this paper is organized as follows. In the following section, we review the related works. Then, in Section 3, we explain the optimization algorithms that were used in our study. In Section 4, we describe the considered datasets. Section 5 describes the proposed method and in Section 6, we describe the evaluation metrics that were used. The experiments that were conducted and the obtained results are explained in Section 7. Finally, the conclusions and future work are discussed in Section 8.

Related Works
In the literature, much research has been conducted on the problem of imbalanced datasets using a variety of methods and approaches in different combinations. For example, a modified version of a support vector machine (SVM) that was based on density weight was proposed in [18] to tackle the binary class imbalance classification problem. Experimental analyses were performed on imbalanced artificial and real-world datasets and the performances were measured using the area under the curve and the geometric mean. The results were compared to those from an SVM, a least squares SVM, a fuzzy SVM, an improved fuzzy least squares SVM, a fuzzy SVM that was based on affinity and class probability, and an entropy-based fuzzy least squares SVM. The similar or better generalization results indicated the efficacy and applicability of the proposed algorithms. Deep learning (DL) methods have also been considered to overcome the class imbalance challenge. In [19], the authors presented a novel comparison between three different DL methods: a deep belief network (DBN), long short-term memory (LSTM), and a multilayer perceptron model (MLP). They also compared five ensemble classifiers for financial distress prediction: XGBoost, SVM, K-nearest neighbor (KNN), and AdaBoost. A new selective oversampling approach (SOA) that uses an outlier identification technique to separate the most representative samples from the minority classes and then uses these samples for synthetic oversampling was proposed in [20]. Their experiments demonstrated that the suggested method outperformed two state-of-the-art oversampling strategies: synthetic minority oversampling and adaptive synthetic sampling.
Moreover, using cost-sensitive learning to solve the imbalanced classification problem has also been very popular in the literature. Robust cost-sensitive classifiers have been constructed by changing the objective functions of well-known algorithms, including logistic regression, decision trees, extreme gradient boosting, and random forests, which can then be utilized to predict medical diagnoses effectively, as proposed in [21]. According to the findings of those experiments, the cost-sensitive approaches outperformed the standard algorithms. In another study, the authors used decision trees as a boosting method to improve business failure prediction performance. A weighted objective function, weighted cross-entropy, was incorporated into the boosted tree architecture to overcome the class imbalance issue in the business failure datasets, making the weighted XGBoost a cost-sensitive business failure prediction model [22].
Furthermore, using evolutionary algorithms to train artificial neural networks (ANNs) has been very popular since the 1980s. The use of the genetic algorithm (GA) to train an ANN for image classification was discussed in [23]. Additionally, using metaheuristic algorithms to train ANNs to manage the disadvantages of gradient-based methods, particularly backpropagation techniques, has also been extensively researched. During the early 2000s, numerous studies focused on the use of metaheuristic algorithms in neural network training for binary classification tasks, such as financial distress prediction. Metaheuristic approaches were shown to perform better than gradient-based algorithms in [24]. The effects of fitness functions on MHOANN learning when dealing with imbalanced datasets were also discussed in [25]. A PSO algorithm was used to optimize the weights and biases in a neural network architecture to predict bankruptcy among Indian firms in [26]. An artificial neural network that was trained by a metaheuristic artificial bee colony (ABC) algorithm was proposed in [27]. The model was used for corporate bankruptcy prediction and was compared to the multiple discriminant analysis (MDA) model and an ANN that was trained by the most common learning algorithm, backpropagation (BPNN). Their experimental results showed that the ABC algorithm could be used as an optimization algorithm for artificial neural networks to predict potential corporate bankruptcy. In another study, the authors conducted a comprehensive benchmark of 15 population-based optimization algorithms that were used to train ANNs. Their experimental results on a challenging set of eight classification problems showed that the PSO yielded the best performance among the population-based metaheuristic algorithms considered [28].
On the other hand, ensemble classifiers have been effectively employed in credit scoring and the forecasting of company insolvency in recent years. For example, a cost-sensitive neural network ensemble for credit scoring was proposed in [29]. The suggested method outperformed the benchmark individual and ensemble methods, as evidenced by the comparative results. In another study, an ensemble classifier-based scoring model for the early prediction of the risk of bankruptcy among Polish businesses was proposed in [30]. Their results indicated that ensemble classifiers could be very powerful for foreseeing bankruptcy. Additionally, an ensemble classifier for classifying binary, nonstationary, and imbalanced data streams in which the Hellinger distance was used to prune the ensemble was implemented in [31]. The Hellinger distance weighted ensemble approach was thoroughly tested using many imbalanced data streams and the results demonstrated the usefulness of the method.
MHOANN, cost-sensitive learning, and ensemble learning have shown promising results for classification problems. However, little attention has been paid to the effects of combining the cost sensitivity fitness function within an MHOANN with ensemble learning for financial distress prediction.

Background
Optimization algorithms are methods that are used to update the weights and biases in an ANN to overcome the disadvantages of conventional training algorithms. This work utilized state-of-the-art PSO and CSO (a recent variant of a PSO) metaheuristic algorithms as optimization techniques for our ANN.

Particle Swarm Optimization (PSO)
This population-based optimization technique was inspired by the movement of flocks of birds and schools of fish. It uses social interactions to find the best solutions. The swarm is randomly initialized with a population of solutions that are called particles (or agents). The search for the optimal solution is repeated in iterations, during which these particles move around the search space according to a mathematical formula that governs the position and velocity of the particles. The motion of each particle is affected by the best solution that has been achieved so far by that particular particle and is guided to the known best positions within the search space, which are adjusted when better positions are discovered by other particles in the swarm. Hence, the swarm moves toward the optimal solution [15].
In this study, the velocity was modeled mathematically as stated in Equation (1), where $v_{id}(t)$ is the velocity of particle $i$ in dimension $d = 1, \ldots, n_p$ at time step $t$, $w$ is the inertia weight, $r_1$ and $r_2$ are random values drawn from a uniform distribution over $[0, 1]$, $c_1$ and $c_2$ are positive acceleration constants, $p_{id}(t)$ is the best position that particle $i$ has visited in dimension $d$ up to time step $t$, and $g_d(t)$ is the global best position in dimension $d$. The position was modeled as stated in Equation (2), where $x_{id}(t)$ is the position of particle $i$ in dimension $d$ at time step $t$ and $v_{id}(t+1)$ is its velocity at time step $t + 1$ [32].
$$v_{id}(t+1) = w\,v_{id}(t) + c_1 r_1 \left(p_{id}(t) - x_{id}(t)\right) + c_2 r_2 \left(g_d(t) - x_{id}(t)\right) \tag{1}$$
$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1) \tag{2}$$
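The update rule described above can be illustrated with a short NumPy sketch. This is a minimal illustration rather than the implementation used in the experiments: the function name `pso_step` and the default values of `w`, `c1`, and `c2` are assumptions chosen for the example, not settings reported in this paper.

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO update following Equations (1) and (2).

    x, v, p_best: arrays of shape (n_particles, n_dims).
    g_best: array of shape (n_dims,), the best position found by the swarm.
    w, c1, c2: illustrative defaults (assumed, not taken from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(x.shape)  # r_1 ~ U[0, 1], one value per particle and dimension
    r2 = rng.random(x.shape)  # r_2 ~ U[0, 1]
    # Equation (1): inertia + pull toward personal best + pull toward global best
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    # Equation (2): move each particle along its new velocity
    x_new = x + v_new
    return x_new, v_new
```

A particle that already sits on both its personal best and the global best, with zero velocity, stays where it is, because every term of the velocity update vanishes.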

Competitive Swarm Optimizer (CSO)
The CSO is derived from the PSO but differs from it significantly. In a CSO, neither the particle's personal best position nor the global best position (or the neighborhood best positions) is used to update the particles. Instead, a pairwise competition mechanism is implemented in which the losing particle learns from the winner and updates its location. Despite its algorithmic simplicity, the CSO has been reported to outperform several state-of-the-art metaheuristic algorithms in terms of overall performance [6].
In our CSO, $P(t)$ denotes a swarm of $m$ particles, where $m$ is the size of the swarm and $t$ is the index of the generation. Each particle represented a candidate solution for the optimization problem. In each generation, the CSO compared pairs of particles that were randomly picked from $P(t)$ until all particles had taken part in a competition, provided that the swarm size was an even number. The comparison was made by calculating the fitness of each particle. The particle with the better fitness was considered the winner and was passed directly to the next generation $P(t + 1)$, while the particle that lost the competition was passed to the next generation after learning from the winner. The velocity of the losing particle was updated using Equation (3), where $x_{w,i}(t)$ and $x_{l,i}(t)$ are the positions of the winning and losing particles in the $i$-th round of competition in generation $t$, $v_{w,i}(t)$ and $v_{l,i}(t)$ are the corresponding velocities, $i = 1, 2, \ldots, m/2$, $m$ is the population size, $r_1(i, t)$, $r_2(i, t)$, and $r_3(i, t)$ are three vectors with entries in $[0, 1]$ that were randomly generated after the $i$-th competition and learning process in generation $t$, $\bar{x}(t)$ is the mean position of all particles (which can be regarded as the center of the swarm in generation $t$), and $\phi$ is the parameter that controlled the influence of $\bar{x}(t)$. Then, the position of the losing particle was updated using the newly calculated velocity, according to Equation (4) [6].
$$v_{l,i}(t+1) = r_1(i,t)\,v_{l,i}(t) + r_2(i,t)\left(x_{w,i}(t) - x_{l,i}(t)\right) + \phi\,r_3(i,t)\left(\bar{x}(t) - x_{l,i}(t)\right) \tag{3}$$
$$x_{l,i}(t+1) = x_{l,i}(t) + v_{l,i}(t+1) \tag{4}$$
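The pairwise competition can be sketched in a few lines of NumPy. The sketch below assumes a minimization problem and an even swarm size; the function name `cso_generation` and the default value of `phi` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def cso_generation(x, v, fitness, phi=0.1, rng=None):
    """One CSO generation (minimization) following Equations (3) and (4).

    x, v: arrays of shape (m, n_dims) with m even.
    fitness: callable mapping a position vector to a scalar cost.
    phi: illustrative default (assumed, not taken from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    m, d = x.shape
    if m % 2 != 0:
        raise ValueError("the swarm size must be even")
    scores = np.array([fitness(p) for p in x])
    # Randomly pair the particles: a[i] competes against b[i]
    order = rng.permutation(m)
    a, b = order[: m // 2], order[m // 2:]
    a_wins = scores[a] <= scores[b]
    winners = np.where(a_wins, a, b)
    losers = np.where(a_wins, b, a)
    x_mean = x.mean(axis=0)  # center of the swarm in this generation
    r1, r2, r3 = rng.random((3, m // 2, d))
    x_new, v_new = x.copy(), v.copy()
    # Equation (3): only the losers update their velocity
    v_new[losers] = (r1 * v[losers]
                     + r2 * (x[winners] - x[losers])
                     + phi * r3 * (x_mean - x[losers]))
    # Equation (4): losers move, winners pass through unchanged
    x_new[losers] = x[losers] + v_new[losers]
    return x_new, v_new
```

Because winners are copied unchanged, the best particle of a generation is never lost, and only half of the swarm is updated per generation.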

The Considered Datasets
As previously indicated, three different datasets were selected to verify the effectiveness of the proposed method. While the independent variables and the number of independent variables varied per dataset, forecasting the financial distress of companies was treated as a classification problem in this work and the effectiveness of the proposed method was validated separately for each dataset. The following is a brief description of each dataset.

Dataset of Spanish Companies
This dataset was for Spanish companies, from which we considered several financial and non-financial features. We considered the dependent variable of bankruptcy as the class for each record or sample and we aimed to classify the instances according to class. The dependent variable was insolvency, which corresponded to the existence of continued losses over three years [33].
This dataset was extracted from the Infotel database (which was bought from http://infotel.es, accessed on 1 May 2017). As a result, we had data from 470 businesses that were gathered over six years (from 1998 to 2003). There were 2860 samples in all, with 62 corresponding to insolvent companies, meaning that insolvency cases only formed 2% of the whole sample.
Initially, each row of the dataset had 37 independent variables and 1 dependent variable (bankruptcy). A prior effort by the authors in [33] changed this list by removing unnecessary variables (i.e., those without significance, for instance, internal database firm codes), resulting in 33 independent variables. So, every record in the dataset that was used in this work had 33 features, which comprised a mix of financial indicators and non-financial indicators. Each feature had either a qualitative (categorical) or a quantitative (numerical) value. Table 1 shows the independent variables after removing the unnecessary variables, as well as their type and description. The size of the firm, the kind of company, provincial code (i.e., where the company is situated), and the auditor's judgments were among the non-financial data that had categorical value. Usually, the size of the firm is a number, but in this dataset, it was either small, medium or large, based on the size of the company. Moreover, in this work, we used all 33 features without applying feature selection because, as pointed out by [34], adding a feature selection step would not improve the results.

Dataset of Taiwanese Companies
This dataset was compiled from 10 years (1999-2009) of records from the Taiwan Economic Journal and comprised 6819 entries in total, with 6599 records relating to non-bankrupt firms (97%) and the remaining 220 records representing bankrupt firms, meaning that bankruptcy cases formed approximately 3% of the whole sample. The dataset had 95 financial characteristics. The firms in this dataset were chosen based on two criteria: the company's information had to be accessible for three years (so that a decision on its financial state could be made) and there had to be a sufficient number of firms of a similar size for comparison. The judgments concerning each firm's financial standing were mostly based on the trading regulations of the stock exchange in Taiwan. Additional information can be found in [35].

Dataset of Polish Companies
This dataset contained information about the likelihood of a Polish company becoming bankrupt. The information was gathered from the Emerging Markets Information Service (EMIS), which is a global collection of information on emerging markets. The insolvent firms were studied from 2007 to 2012, while the enterprises that were still running were assessed from 2007 to 2013. This dataset was also extremely imbalanced, with the number of insolvent companies (203) forming around 2% of the whole sample, which contained around 10,000 instances. The dataset had 64 numerical financial characteristics with no categorical values. More information about this dataset can be found in [36] and the dataset itself can be downloaded from the Kaggle ML community website (https://www.kaggle.com/competitions/companies-bankruptcy-forecast/data, accessed on 28 June 2022).

The Proposed Method
This section presents the proposed method for classifying insolvent companies using an MHOANN with a PSO and a CSO as the optimization algorithms and a cost sensitivity fitness function within a homogeneous majority voting ensemble learning paradigm (ENS_PSONN_cost and ENS_CSONN_cost). The system architecture of the MHOANN with the embedded cost sensitivity fitness function is illustrated in Figure 1. Furthermore, the proposed architecture for the MHOANN in the majority voting ensemble learning paradigm is shown in Figure 2.
First, we discuss the ANN as a classifier and then we explain how the optimizers (PSO and CSO) were used to set the weights and biases of the ANN, how the fitness functions were used to obtain the best solutions, and finally, how all of these components fit within a majority voting ensemble learning paradigm. An illustration of the proposed method is presented in Figure 3.
Figure 1. The cost sensitivity fitness function that was embedded in the metaheuristic optimization-based neural network architecture. Here, the metaheuristic optimizer (PSO or CSO) generated the NN weights and biases. After the optimizer found a solution, the solution was used to set the weights and biases for the NN and then the constructed NN was used to generate the predictions. After that, the costs were calculated by the cost sensitivity fitness function and the best solution was saved. These steps were repeated up to the maximum number of iterations and then the saved best solution was used to set up the NN weights and biases. Then, the trained NN was used to classify the instances in the testing dataset and all of the evaluation metrics were calculated and reported.
Figure 2. The proposed architecture for the MHOANN in the majority voting ensemble learning paradigm. Here, the main blocks of our framework can be seen. Each inducer was an MHOANN with a PSO or a CSO as the optimizer for the NN, with an embedded custom fitness function that was cost-sensitive. In the second block, the output of each inducer was combined with the output of the other inducers to generate the final predictions, based on the majority voting method.

ANN Classifier
Artificial neural networks (ANNs) [13,37-39] are one of the main tools that are used to solve classification problems and are brain-inspired systems that are intended to simulate the way that humans learn. The learning process of an ANN is very difficult, owing to its nonlinear nature and the unknown optimal set of weights and biases of the neural network. The efficiency of an ANN is significantly affected by its learning process. An architectural diagram of a standard ANN is shown in Figure 4.

The Optimizer
Optimization algorithms are methods that are used to update the weights and biases in an ANN to overcome the disadvantages of conventional training algorithms. In this work, we utilized state-of-the-art PSO and CSO metaheuristic algorithms as the optimization techniques for our ANN.
In this work, we constructed a neural network model using two sets of weights, $(w_{11}, \ldots, w_{nm})$ and $(w_{11}, \ldots, w_{mk})$, and two sets of biases, $(\beta_1, \ldots, \beta_m)$ and $(\beta_1, \ldots, \beta_k)$, where $n$ is the total number of input features, $m$ is the number of hidden neurons, $k$ is the number of output neurons, $w$ represents the weights between the input and hidden layers (and the weights between the hidden and output layers), and $\beta$ represents the biases of the hidden and output layers. Every particle in the swarm population corresponded to one solution vector. The total length of the solution vector ($l_s$) could be calculated using Equation (5). An illustration of a solution vector (particle) is shown in Figure 5. In the binary classification, as we had a single neuron in the output layer, $k$ was equal to 1 and the total length of a solution vector ($l_{s\_binary}$) could be simplified, as in Equation (6).
$$l_s = (n \times m) + (m \times k) + m + k \tag{5}$$
$$l_{s\_binary} = (n \times m) + 2m + 1 \tag{6}$$
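To make the encoding concrete, the following sketch computes the length of a solution vector and decodes a flat particle into the weights and biases of a single-hidden-layer network. The function names, the sigmoid activation, and the choice of five hidden neurons in the usage note are assumptions made for illustration; the ordering of the four segments follows the layout of the solution vector described for Figure 5.

```python
import numpy as np

def solution_length(n, m, k=1):
    """Length of a particle: n*m + m*k weights plus m + k biases."""
    return n * m + m * k + m + k

def decode(solution, n, m, k=1):
    """Split a flat particle into (W1, W2, b1, b2).

    Assumed segment order: input-hidden weights, hidden-output weights,
    hidden biases, output biases.
    """
    solution = np.asarray(solution, dtype=float)
    if solution.size != solution_length(n, m, k):
        raise ValueError("solution vector has the wrong length")
    i = 0
    W1 = solution[i:i + n * m].reshape(n, m); i += n * m
    W2 = solution[i:i + m * k].reshape(m, k); i += m * k
    b1 = solution[i:i + m]; i += m
    b2 = solution[i:i + k]
    return W1, W2, b1, b2

def forward(solution, X, n, m, k=1):
    """Forward pass with sigmoid activations (an assumed choice)."""
    W1, W2, b1, b2 = decode(solution, n, m, k)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hidden = sigmoid(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)  # shape (n_samples, k)
```

For instance, with the 33 features of the dataset of Spanish companies and a hypothetical hidden layer of 5 neurons, `solution_length(33, 5)` gives 33 × 5 + 2 × 5 + 1 = 176 values per particle.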

Figure 5. Structure of a solution vector (particle): weights between the input and hidden layers, weights between the hidden and output layers, biases of the hidden layer, and biases of the output layer.

Fitness Functions
In evolutionary computing, the population evolves to improve its fitness, as measured by the selected fitness function [40]. In this work, we used the mean squared error (MSE) and accuracy as the benchmark fitness functions against which the proposed cost sensitivity fitness function was compared. The fitness was calculated using the following functions.

Mean Squared Error (MSE)
MSE is considered to be one of the most common fitness functions that are used in MHOANNs and ENNs [41,42]. The value is the mean of the squared differences between the predictions and the ground truths, as described in Equation (7), where $i = 1, 2, \ldots, n$, $n$ is the number of samples, $y_i$ is the actual or ground truth value, and $\hat{y}_i$ is the prediction.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{7}$$

Accuracy
Accuracy is the proportion of correctly predicted data points out of all the data points. Because the optimizer minimizes the fitness, the fitness value was simply the accuracy subtracted from 1 (see Equation (8), where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives).
$$\mathrm{Fitness}_{\mathrm{accuracy}} = 1 - \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
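The two benchmark fitness functions can be written as a brief sketch, in which lower values indicate better solutions. The function names are assumptions chosen for this example.

```python
import numpy as np

def mse_fitness(y_true, y_prob):
    """Mean squared error between the ground truths and the network outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_true - y_prob) ** 2))

def accuracy_fitness(y_true, y_pred):
    """1 - accuracy, so that a perfect classifier has a fitness of 0."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(1.0 - np.mean(y_true == y_pred))
```

On an imbalanced sample, both functions reward a model that ignores the minority class: predicting the majority class for one minority instance among four samples still gives an accuracy-based fitness of only 0.25.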

Cost Sensitivity
We took misclassification costs into consideration using a cost matrix. Similar to a confusion matrix, a cost matrix is an n × n matrix (where n is the number of classes) and each element within the cost matrix represents the weight of the misclassification costs of the corresponding element in the confusion matrix.
We let $A$ be the confusion matrix and $C$ be the cost matrix. We multiplied each element in the confusion matrix by its corresponding weight in the cost matrix to obtain matrix $A'$, which was our newly updated confusion matrix. We then calculated the accuracy using the updated confusion matrix and subtracted the resulting value from 1 to obtain the final cost. The steps that were followed to calculate the costs of the cost sensitivity fitness function are illustrated in Equation (9).
$$A'_{ij} = A_{ij}\,C_{ij}, \qquad \mathrm{Accuracy}' = \frac{\sum_{i} A'_{ii}}{\sum_{i}\sum_{j} A'_{ij}}, \qquad \mathrm{Cost} = 1 - \mathrm{Accuracy}' \tag{9}$$
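The steps above can be sketched as follows. The matrix orientation (rows for actual classes, columns for predicted classes), the function names, and the cost values in the usage note are assumptions made for illustration; they are not the cost matrix used in the experiments.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """Confusion matrix A with rows = actual class, columns = predicted class."""
    A = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        A[int(t), int(p)] += 1
    return A

def cost_sensitive_fitness(y_true, y_pred, C):
    """1 - accuracy computed on the cost-weighted confusion matrix A' = A * C."""
    C = np.asarray(C, dtype=float)
    A = confusion_matrix(y_true, y_pred, n_classes=C.shape[0])
    A_weighted = A * C  # element-wise product
    total = A_weighted.sum()
    if total == 0:
        return 1.0  # no weighted evidence: treat as the worst case
    return float(1.0 - np.trace(A_weighted) / total)
```

As a worked example under these assumptions, take 98 solvent and 2 insolvent companies, a model that predicts "solvent" for all of them, and a hypothetical cost matrix `[[1, 1], [10, 1]]` that weights each false negative 10 times. The weighted matrix has 98 on the diagonal and 20 off it, so the cost is 1 - 98/118 ≈ 0.169, compared with 0.02 for the unweighted accuracy-based fitness. The penalty for missing the minority class is therefore much larger.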

Majority Voting Ensemble Learning
Ensemble learning refers to methods for making predictions that combine several inducers. It is often used in supervised machine learning applications. An inducer, also known as a base learner or weak learner, is a machine learning algorithm that takes a set of labeled examples as its input and produces a model. The model can then be used to make predictions for new unlabeled samples. Any type of machine learning approach can be employed as an ensemble inducer (e.g., decision trees, neural networks, linear regression models, etc.). The predictions of these models are then integrated to generate a final prediction. The core concept of ensemble learning is that by combining multiple models, the faults of an individual inducer can be compensated for by the other inducers, which creates a strong learner out of several weak learners [43].
Ensemble members can be of the same or various types and they may or may not be trained using the same training dataset [44]. When all individual learners in an ensemble are of the same type, the ensemble is said to be homogeneous. For example, a "neural network ensemble" contains only neural networks [45].
In the case of classification, the combination of the results from all of the base learners can be accomplished using majority voting, which has three types: (1) unanimous voting, in which all of the classifiers agree on the prediction; (2) simple majority, in which more than half of the classifiers predict the same class; (3) plurality voting, in which the prediction receives the most votes, regardless of whether the total number of votes exceeds 50% of the classifiers [46].
In this work, we trained a homogeneous ensemble whose members were MHOANNs with the cost sensitivity fitness function, each of which was trained on a sample that was drawn with replacement from the training dataset. Subsequently, majority (plurality) voting was implemented to generate the final predictions using the testing dataset.
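The two ensemble steps, sampling with replacement and voting, can be sketched as below. The function names are assumptions made for this example, and the sketch covers the binary case only, in which plurality voting and simple majority voting coincide when the number of learners is odd.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) training rows with replacement for one base learner."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def majority_vote(predictions):
    """Combine binary predictions of shape (n_learners, n_samples).

    A sample is labeled 1 when more than half of the learners vote 1;
    ties (possible only with an even number of learners) resolve to 0.
    """
    predictions = np.asarray(predictions)
    votes_for_one = predictions.sum(axis=0)
    return (2 * votes_for_one > predictions.shape[0]).astype(int)
```

With five learners, a false positive produced by one or two of them is outvoted by the others, which is the mechanism that is relied on here to reduce the number of FP predictions.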

Evaluation Measurements
The obvious challenge when dealing with the binary classification of an imbalanced dataset is that the trained model is biased toward the majority class, resulting in a high accuracy for the majority class but a failure to predict instances from the minority class. In this work, we used the following metrics: accuracy, which was calculated from the confusion matrix as defined in Equation (10), where TP represents the number of true positives, TN represents the number of true negatives, FP represents the number of false positives, and FN represents the number of false negatives [47]; g-mean, which was the geometric mean of the sensitivity (Equation (11)) and specificity (Equation (12)), as defined in Equation (13); and the F1 score, which was the harmonic mean of the precision and sensitivity, as defined in Equation (14), where $\beta$ is a positive real factor that was chosen such that the sensitivity was considered to be $\beta$ times more important than the precision and $\mathrm{Precision} = TP / (TP + FP)$. In this work, we used $\beta = 1$, which allocated the same weighting to both the sensitivity and precision.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{11}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{12}$$
$$\text{G-mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}} \tag{13}$$
$$F_{\beta} = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\beta^2 \times \mathrm{Precision} + \mathrm{Sensitivity}} \tag{14}$$
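The metrics can be computed directly from the four confusion matrix counts, as in the following sketch. The function name and the returned dictionary keys are assumptions made for this example, and undefined ratios (zero denominators) are reported as 0.

```python
import math

def evaluation_metrics(tp, tn, fp, fn, beta=1.0):
    """Accuracy, sensitivity, specificity, g-mean, and F-beta from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    denom = beta ** 2 * precision + sensitivity
    f_beta = (1 + beta ** 2) * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
        "f_beta": f_beta,
    }
```

For example, a model that labels every one of 98 solvent and 2 insolvent companies as solvent (`tp=0, tn=98, fp=0, fn=2`) reaches an accuracy of 0.98 while its g-mean and F1 score are both 0, which is why accuracy alone is not a sufficient measure for these datasets.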

Experiments and Results
This section provides the experimental setups, benchmarks, and steps that were used throughout the experiments, along with the results that were obtained and their analysis.

Environmental and Experimental Setups
The experiments were executed using a laptop with 16 GB of RAM and eight 2.3-GHz CPU cores. We used Evolopy-NN [48] to implement the ANN, which was trained using a PSO or a CSO as the optimization technique with the cost sensitivity fitness function. Evolopy-NN is an open-source nature-inspired optimization framework for training neural networks using evolutionary and metaheuristic algorithms, which was built with Python 3.7. All three datasets were split into a training dataset (66%) and a testing dataset (34%) [49,50]. We used stratified sampling to maintain the ratio between the minor and major classes in the resulting datasets. Thus, after the sampling, the minor class formed 2% of both the training and testing datasets for the Spanish companies, 3% for the Taiwanese companies, and 2% for the Polish companies. Each experiment was executed 10 different times for 100 iterations, with the population size set to 50. During the ensemble learning, we used five weak learners and majority voting to generate the final predictions.
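The stratified 66%/34% split described above can be sketched with the standard library alone. The function `stratified_split`, the seed, and the toy label vector are hypothetical and serve only to show that the class ratio is preserved in both parts; this is not the splitting code used in the experiments.

```python
import random


def stratified_split(labels, train_fraction=0.66, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    # Split every class separately so that the minority class is not lost by chance.
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy data: 1000 companies, 2% of which are distressed (label 1).
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels)
train_ratio = sum(labels[i] for i in train_idx) / len(train_idx)
test_ratio = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_ratio, 3), round(test_ratio, 3))
```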
As described in Section 5, we proposed the use of two optimization algorithms, a PSO and a CSO, and three fitness functions: MSE, accuracy, and cost sensitivity. In this experiment, we constructed six variations of the MHOANN, as follows:
1. The ANN with a PSO and MSE as the fitness function;
2. The ANN with a PSO and accuracy as the fitness function;
3. The ANN with a PSO and cost sensitivity as the fitness function (ENS_PSONN cost );
4. The ANN with a CSO and MSE as the fitness function;
5. The ANN with a CSO and accuracy as the fitness function;
6. The ANN with a CSO and cost sensitivity as the fitness function (ENS_CSONN cost ).

Effects of Fitness Function
We extended the MHOANN to account for the costs of misclassified instances during model training by implementing a cost sensitivity fitness function based on the confusion matrix, as described in Section 5. For the problem in question, we tried to avoid FN predictions, i.e., cases in which the model predicts that a company is financially stable while it is actually in financial distress. Hence, we assigned a weighted cost to FN predictions. The proper weight for the FN predictions depended on the dataset and the algorithm that were being used, so we determined it by experimenting with different weights while monitoring the metrics. Since the datasets in this work were relatively small in size, we were able to experiment using the whole datasets; however, in real applications with large datasets, we recommend using a sample of the dataset to find the best weight in order to reduce the computational costs. We used the weight that yielded the highest g-mean score in the subsequent experiments.

Tables 2 and 3 show the results for the dataset of Spanish companies with the PSO and the CSO, respectively; Tables 4 and 5 show the corresponding results for the dataset of Taiwanese companies; and Tables 6 and 7 show those for the dataset of Polish companies. From these experiments, we observed that the best weight for FN predictions for the dataset of Spanish companies was 100 when using the PSO (Figure 6) and 75 when using the CSO (Figure 7). For the dataset of Taiwanese companies, the best weight was 50 with both the PSO (Figure 8) and the CSO (Figure 9). For the dataset of Polish companies, the best weight was 175 with both the PSO (Figure 10) and the CSO (Figure 11).

After determining the best FN weight for each combination of optimization algorithm and dataset, we trained the MHOANN using the cost sensitivity fitness function with the corresponding FN weight and then used the trained model to classify the instances in the testing dataset. We observed that in order to obtain reasonable g-mean scores, the weight of the FN predictions needed to be considerably high, from 50 to 175. This could be explained by the extreme imbalance of the data in the considered datasets.

To assess the effects of the cost sensitivity fitness function, we compared our results against a benchmark. In the benchmark, we used each optimizer (PSO and CSO) with two different fitness functions, MSE and accuracy, and then trained the ANN using each of the three datasets to observe the evaluation metrics without cost-sensitive learning. For each dataset, we executed four experiments: the ANN with the PSO and MSE as the fitness function, the ANN with the PSO and accuracy as the fitness function, the ANN with the CSO and MSE as the fitness function, and the ANN with the CSO and accuracy as the fitness function. The averages and standard deviations were calculated, along with the best scores for each metric. Tables 8, 9, and 10 show the results for all of the fitness functions that were applied to the datasets of Spanish, Taiwanese, and Polish companies, respectively.
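The effect of weighting FN predictions in the fitness function can be illustrated with a minimal sketch. The cost form used here, (FP + w × FN) / n with lower values being better, is an assumption chosen for illustration; the exact fitness function of this work is the one defined in Section 5. The names `confusion_counts` and `cost_sensitive_fitness` and the toy predictions are likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def cost_sensitive_fitness(y_true, y_pred, fn_weight):
    """Assumed cost form: each FN costs fn_weight, each FP costs 1 (lower is better)."""
    _, _, fp, fn = confusion_counts(y_true, y_pred)
    return (fp + fn_weight * fn) / len(y_true)


# Hypothetical toy case: 2 distressed companies among 100.
y_true = [1, 1] + [0] * 98
always_stable = [0] * 100                      # misses both distressed companies
cautious = [1, 1] + [1] * 10 + [0] * 88        # finds both, raises 10 false alarms

for weight in (1, 100):
    print(weight,
          cost_sensitive_fitness(y_true, always_stable, weight),
          cost_sensitive_fitness(y_true, cautious, weight))
```

With a weight of 1, the candidate that never predicts distress has the lower cost, so an optimizer that minimizes this fitness would favor it; with a weight of 100, the ranking is reversed. This is consistent with the observation that high weights were needed for these extremely imbalanced datasets.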
The cost-sensitive MHOANN showed major improvements when predicting the minority classes, which had a major positive impact on the g-mean and F1 score metrics and a negative impact on the accuracy.  Using the dataset of Spanish companies, when comparing the ANN with the PSO and the cost sensitivity fitness function to the same classifier with MSE as the fitness function, we noticed a major improvement in the g-mean from 0.211 to 0.842, an improvement in the F1 score from 0.104 to 0.141, and a drop in the accuracy from 0.978 to 0.749. When comparing the ANN with the PSO and the cost sensitivity fitness function to the same classifier with accuracy as the fitness function, we observed similar results: a major increase in the g-mean from 0.131 to 0.842, an improvement in the F1 score from 0.054 to 0.141, and a drop in the accuracy from 0.979 to 0.749. Similarly, when comparing the ANN with the CSO and the cost sensitivity fitness function to the same classifier with MSE as the fitness function, we noticed a major increase in the g-mean from 0.211 to 0.793, an improvement in the F1 score from 0.104 to 0.134, and a drop in the accuracy from 0.980 to 0.768. When comparing the ANN with the CSO and the cost sensitivity fitness function to the same classifier with accuracy as the fitness function, we also observed a major increase in the g-mean from 0.062 to 0.793, an improvement in the F1 score from 0.032 to 0.134, and a drop in the accuracy from 0.980 to 0.768.
We also observed similar results while using the dataset of Taiwanese companies. When comparing the ANN with the PSO and the cost sensitivity fitness function to the same classifier with MSE as the fitness function, we noticed a major increase in the g-mean from 0.332 to 0.834, an improvement in the F1 score from 0.186 to 0.242, and a drop in the accuracy from 0.968 to 0.824. When comparing the ANN with the PSO and the cost sensitivity fitness function to the same classifier with accuracy as the fitness function, we also noticed a major increase in the g-mean from 0.244 to 0.834, an increase in the F1 score from 0.110 to 0.242, and a drop in the accuracy from 0.967 to 0.824. When comparing the ANN with the CSO and the cost sensitivity fitness function to the same classifier with MSE as the fitness function, the increase in the g-mean was from 0.290 to 0.845, the increase in the F1 score was from 0.147 to 0.237, and the drop in the accuracy was from 0.967 to 0.808. Likewise, when comparing the ANN with the CSO and the cost sensitivity fitness function to the same classifier with accuracy as the fitness function, the increase in g-mean was from 0.207 to 0.845, the increase in the F1 score was from 0.087 to 0.237, and the drop in the accuracy was from 0.968 to 0.808.
Moreover, we observed similar results while using the dataset of Polish companies. When comparing the ANN with the PSO and the cost sensitivity fitness function to the same classifier with MSE as the fitness function, we noticed a major increase in the g-mean from 0.118 to 0.842, an improvement in the F1 score from 0.019 to 0.149, and a drop in the accuracy from 0.970 to 0.790. When comparing the ANN with the PSO and the cost sensitivity fitness function to the same classifier with accuracy as the fitness function, we also noticed a similar increase in the g-mean from 0.118 to 0.842, an increase in the F1 score from 0.018 to 0.149, and a drop in the accuracy from 0.967 to 0.790. When comparing the ANN with the CSO and the cost sensitivity fitness function to the same classifier with MSE as the fitness function, the increase in the g-mean was from 0.118 to 0.848, the increase in the F1 score was from 0.020 to 0.150, and the drop in the accuracy was from 0.970 to 0.790. Likewise, when comparing the ANN with the CSO and the cost sensitivity fitness function to the same classifier with accuracy as the fitness function, the increase in the g-mean was from 0.117 to 0.848, the increase in the F1 score was from 0.017 to 0.150, and the drop in the accuracy was from 0.967 to 0.790.
We could see that by applying the weight of the FN predictions, the number of TP instances increased, which explained the improvements in the g-mean and F1 score values. However, it also caused an increase in the number of FP instances, which explained the decrease in the accuracy score. Next, we used majority voting ensemble learning to decrease the number of FP instances while maintaining the number of TP instances.
Additionally, the PSO and CSO produced similar results, which indicated that a lightweight optimizer with a simple mechanism for updating the particles within the search space, such as the CSO, could achieve results comparable to those of the PSO when used as the optimizer for an MHOANN.
Another observation was that, whereas the PSO and CSO produced similar results when using similar fitness functions, the CSO was better in terms of execution time. Using the same population size (50) and the same number of iterations (100), the CSO was 22.4% faster for the dataset of Spanish companies, 34.4% faster for the dataset of Taiwanese companies, and 48.3% faster for the dataset of Polish companies. Table 11 lists the actual execution times in seconds.
In this work, as discussed in Section 7.2, we noticed a direct relationship between the weight of the FN predictions and the set of metrics that were monitored. While we chose the weight that produced the best g-mean score, which meant a weight that produced a balance between sensitivity and specificity, a lower weight could produce a better specificity score and a higher weight could produce a better sensitivity score, depending on which metric the user focused on.

Effects of the Ensemble Learning Framework
Whereas the cost-sensitive MHOANN performed better when predicting the minority class and significantly reduced the number of FN instances, there was an increase in FP predictions as well. For these particular datasets, the minority class was far more valuable and essential than the majority class. In other words, predicting that a company is solvent when it is actually in financial distress has considerably higher costs than predicting that a company is in financial distress when it is actually stable [51]. Nevertheless, maintaining a high accuracy score in the classification model was still crucial.
As described in Section 5, the key premise of ensemble learning is that by mixing many models, the flaws of one model can most likely be cancelled out by the other models. Hence, we used sampling with replacement to create five training sets per dataset and then trained the cost-sensitive MHOANN using each new training set. We then generated predictions using the existing testing dataset and used majority voting to obtain the final predictions. Tables 12, 13, and 14 compare the results from the cost-sensitive MHOANN to those from the cost-sensitive MHOANN within the majority voting ensemble learning system using the datasets of Spanish, Taiwanese, and Polish companies, respectively. By reviewing the results, we observed improvements in most of the evaluation metrics; specifically, we noticed an improvement in the accuracy of between 8.4% and 15.0%, an improvement in the g-mean score of between 4.2% and 12.6%, and a significant improvement in the F1 score of between 36.7% and 87.3%. The main idea of ensemble learning is to achieve a high prediction capability that at least exceeds the individual prediction capabilities of the techniques that make up the ensemble. To achieve this, the weak learners within the ensemble should be both accurate and diverse [52]. The improvements that were achieved across the metrics indicated that the ANNs optimized by the PSO and CSO were both accurate and diverse enough to be utilized within a homogeneous ensemble learning system.
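The procedure described above (five training sets drawn with replacement, one learner per set, and a vote on the testing dataset) can be sketched as follows. The names `bootstrap_sample`, `bagged_predict`, and `fit_threshold` are hypothetical, and `fit_threshold` is a deliberately trivial stand-in that ignores the labels; in this work, each base learner would instead be a cost-sensitive MHOANN.

```python
import random
from collections import Counter


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def bagged_predict(train_rows, test_rows, fit, n_learners=5, seed=0):
    """Train n_learners models on bootstrap samples and combine them by voting."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_learners):
        idx = bootstrap_sample(len(train_rows), rng)
        models.append(fit([train_rows[i] for i in idx]))
    predictions = []
    for row in test_rows:
        votes = Counter(model(row) for model in models)
        predictions.append(votes.most_common(1)[0][0])  # most voted label
    return predictions


def fit_threshold(rows):
    """Stand-in learner (not an MHOANN): threshold at the mean feature value."""
    threshold = sum(x for x, _ in rows) / len(rows)
    return lambda x: 1 if x > threshold else 0


# Hypothetical one-feature training rows given as (feature, label) pairs.
train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.8, 1), (0.9, 1)]
print(bagged_predict(train, [0.0, 1.0], fit_threshold))
```

Because each learner sees a different bootstrap sample, the learners can disagree on borderline instances, which is the source of the diversity that the vote relies on.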

Comparison to Other Approaches
In [34], the authors proposed a hybrid method that combined the synthetic minority oversampling technique (SMOTE) with other ensemble methods. Additionally, the authors applied five different feature selection methods to determine the most dominant attributes for insolvency prediction using the same dataset of Spanish companies. First, the authors compared four oversampling methods, using the C4.5 decision tree classifier to determine the best method; SMOTE was selected since it produced the best results. Second, the authors compared several standard basic and ensemble classification algorithms as the baseline for the study. Table 15 shows the g-mean scores when using the standard classifiers in [34] compared to those when using the two methods that are proposed in this work. It can be seen that the proposed methods produced higher g-mean scores than all of the other classifiers in the related study. Third, the authors compared several basic and ensemble classification algorithms after applying oversampling using SMOTE in order to select the best-performing classifier. The AB-Rep tree was selected as the best classifier. Finally, the authors applied different attribute selectors for feature selection and then applied oversampling using SMOTE and classification using the AB-Rep tree algorithm before comparing the results. Table 16 shows the best results, based on the g-mean scores, in [34] and those of the two methods that are proposed in this work. It is clear that the proposed methods significantly improved the g-mean scores. These results illustrate the benefits of applying cost-sensitive learning to our MHOANN, as well as the advantages that could be gained by using ensemble learning to improve financial distress prediction.
Although the same dataset was used in this work and in [34], it is worth mentioning that there were some differences between the experimental setups: (1) in this work, we used a 66%/34% split for the training and testing datasets, while the authors of [34] used 10-fold cross-validation, which meant that 90% of their data were used to train the model in each fold (even so, the approach that is proposed in this work showed better results); (2) ten separate runs were performed in [34] for each combination, while we performed five separate runs per combination in this work.

Table 15. The results for the g-mean scores of the standard classifiers that were used in the related work compared to those of the two methods that are proposed in this work using the dataset of Spanish companies. The best g-mean result per classification approach is marked in boldface.

Classifier | Oversampling | Feature Selection | G-Mean
Random Tree [34] | No | No | 0.602
AB-J48(20) [34] | No | No | 0.609
Random Tree [34] | Yes | No | 0.696
AB-Rep Tree/(90) [34] | Yes | No | 0.730
AB-Rep Tree/(90) [34] | Yes | Yes | 0.720

In another study that used the dataset of Taiwanese companies [35], the authors established that the integration of financial ratios (FRs) and corporate governance indicators (CGIs) could enhance the performance of the classifiers when forecasting the financial health of Taiwanese firms. Following this combination, five feature selection methodologies were evaluated to see whether they could lower data dimensionality. The best results were achieved using an SVM with the stepwise discriminant analysis (SDA) feature selection method, along with the combination of FRs and CGIs (FC). The g-mean was not used as an evaluation metric in that study. Instead, type I and type II errors were used.
A type I error [53] is also known as the false positive rate (FPR). In binary classification tasks, the FPR quantifies the proportion of false positives among all of the actual negative samples. It is defined in Equation (15):

\[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{15} \]

A type II error [53] is also known as the false negative rate (FNR). In binary classification tasks, the FNR quantifies the proportion of false negatives among all of the actual positive samples. It is defined in Equation (16):

\[ \mathrm{FNR} = \frac{FN}{FN + TP} \tag{16} \]

Since the specificity equals 1 - FPR and the sensitivity equals 1 - FNR, the g-mean score could be extracted using Equation (17):

\[ \text{g-mean} = \sqrt{(1 - \mathrm{FPR}) \times (1 - \mathrm{FNR})} \tag{17} \]

Table 17 shows the best results for the calculated g-mean scores using the type I and type II errors in [35] and the two methods that are proposed in this work. It can be seen that both of the proposed methods produced higher g-mean scores.

Table 17. The best results for the g-mean scores that were obtained in the related work compared to those of the two methods that are proposed in this work using the dataset of Taiwanese companies. The best g-mean result is marked in boldface.

Analysis and Discussion
The results from our experiments indicated that for highly imbalanced datasets, the proposed method had a significant positive impact on the g-mean score (which measures the balance between the classification performances for the majority and minority classes) while maintaining an acceptable accuracy score. We found that the cost sensitivity fitness function helped to shift the bias away from the majority class and toward the minority class and that ensemble learning could help to decrease the side effects of that bias shift.
In line with our hypothesis, applying a weight to the misclassified positive instances increased the number of TP predictions and decreased the number of FN predictions. However, as a side effect, the number of FP predictions increased and the number of TN predictions decreased. Since we were dealing with highly imbalanced datasets, the number of instances that belonged to the minor class (TP + FN) was much lower than the number of instances that belonged to the major class (FP + TN); so, the improvement in the sensitivity score was significant and the drop in the specificity score was not as drastic, which led to an overall improved g-mean score, as observed in the results from all experiments.
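The reasoning above can be checked with a small worked example. The confusion-matrix counts below are hypothetical and were chosen only to mimic a 2% minority class; they are not results from this work, and the helper `metrics` is an illustrative name.

```python
import math


def metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that precision and sensitivity are defined.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical test set: 20 distressed and 980 stable companies.
unweighted = metrics(tp=1, tn=978, fp=2, fn=19)    # biased toward the majority class
weighted = metrics(tp=17, tn=780, fp=200, fn=3)    # after weighting FN predictions

for name in ("accuracy", "sensitivity", "specificity", "g_mean", "f1"):
    print(f"{name:12s} {unweighted[name]:.3f} -> {weighted[name]:.3f}")
```

In this example, 16 additional TP predictions raise the sensitivity from 0.05 to 0.85, whereas 198 additional FP predictions, spread over 980 negative instances, lower the specificity only from about 0.998 to about 0.796. The g-mean therefore rises sharply while the accuracy falls, which is the pattern described above.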
Moreover, when applying ensemble learning, we observed an overall improvement in all of the evaluation measurements that were used. This suggested that the MHOANN learners were diverse enough to be used in a homogeneous ensemble learning system. The ensemble learning created a stronger learner that approximately maintained the number of FN predictions but decreased the number of FP predictions, resulting in a slightly better g-mean score and a significant improvement in the accuracy score.
In terms of performance, as previously mentioned, the CSO outperformed the PSO regarding execution time. In contrast to the PSO, in which every particle is updated in each iteration, only half of the population was updated per iteration in the CSO, which explained the faster execution times.
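The half-population update can be seen in a minimal sketch of one CSO iteration. The sketch follows a commonly described formulation of the competitive swarm optimizer (random pairing, with the loser of each pair learning from the winner and from the swarm mean); whether Evolopy-NN implements exactly this variant is an assumption, and the function name `cso_iteration`, the value of `phi`, and the sphere test function are hypothetical.

```python
import random


def cso_iteration(positions, velocities, fitness, phi=0.1, rng=None):
    """Run one CSO iteration (minimization) in place; return the number of updates."""
    rng = rng or random.Random(0)
    n, dim = len(positions), len(positions[0])
    mean = [sum(p[d] for p in positions) / n for d in range(dim)]
    scores = [fitness(p) for p in positions]
    order = list(range(n))
    rng.shuffle(order)
    updated = 0
    # Particles compete in random pairs; only the loser of each pair moves.
    for a, b in zip(order[0::2], order[1::2]):
        winner, loser = (a, b) if scores[a] <= scores[b] else (b, a)
        for d in range(dim):
            r1, r2, r3 = rng.random(), rng.random(), rng.random()
            velocities[loser][d] = (
                r1 * velocities[loser][d]
                + r2 * (positions[winner][d] - positions[loser][d])
                + phi * r3 * (mean[d] - positions[loser][d])
            )
            positions[loser][d] += velocities[loser][d]
        updated += 1
    return updated


def sphere(p):
    """Toy objective used in place of the ANN training error."""
    return sum(x * x for x in p)


rng = random.Random(1)
swarm = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(50)]
speeds = [[0.0] * 3 for _ in range(50)]
print(cso_iteration(swarm, speeds, sphere, rng=rng))
```

With a population of 50, each iteration updates 25 particles, while the winners pass to the next iteration unchanged.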
In Appendix A, we show the convergence (learning) curve graphs for sample runs using both optimizers (the PSO and CSO) for each fitness function and each dataset. We noticed that the fitness values were minimal in the cases of the MSE and accuracy fitness functions, which indicated that the model had a high accuracy (as confirmed by the previous results) but was biased toward the majority class and failed to predict the minority classes (as previously discussed). On the other hand, the fitness value was higher when using the cost sensitivity fitness function, which was expected because the number of FN predictions was multiplied by the allocated weight. Additionally, in all of our experiments, the fitness scores stabilized when approaching 100 iterations, which indicated that additional training would not significantly improve the model.

Conclusions and Future Work
This paper proposed the use of an MHOANN with a PSO or CSO as the optimization technique and a cost sensitivity fitness function within a majority voting ensemble learning system to handle the imbalanced distribution of financial distress datasets and maximize the prediction of positive instances. Experiments were conducted using datasets of Spanish companies, Taiwanese companies, and Polish companies. Then, we compared the results from the proposed approach to those that were obtained by applying the same MHOANN with a PSO or CSO but using MSE or accuracy as the fitness function.
The proposed method was able to provide better estimations for financial distress prediction by avoiding biased results. The results showed that the cost sensitivity fitness function had an extremely positive overall effect on the accurate prediction of the minor class in imbalanced datasets, with a significant improvement in the g-mean score and a moderately positive impact on the F1 score. Moreover, adopting the majority voting ensemble learning system improved the accuracy and the g-mean scores, along with a significant increase in the F1 scores. One primary limitation of this work was not having access to a domain expert to define the weights for the FN predictions, which is common in cost-sensitive learning [54]. It would be beneficial to obtain domain expert opinions and compare them to the proposed method to find the best weight for the FN instances.
In the future, we aim to investigate the application of the proposed method to other bankruptcy datasets. Additionally, we aim to use the same proposed approach for other imbalanced classification problems. Moreover, we aim to explore other methods for hyperparameter tuning, including finding the costs of misclassified instances, such as AutoML [55].

Data Availability Statement:
The dataset of Spanish companies was bought from http://infotel.es (accessed on 1 May 2017), the dataset of Taiwanese companies was downloaded from https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction (accessed on 1 March 2020), and the dataset of Polish companies was downloaded from https://www.kaggle.com/competitions/companies-bankruptcy-forecast/data (accessed on 28 June 2022).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Here, we present the figures that show the convergence (learning) curves for the sample runs using both optimizers (the PSO and CSO) for each fitness function and each dataset.