A Weighted Ensemble Learning Algorithm Based on Diversity Using a Novel Particle Swarm Optimization Approach

Abstract: In ensemble learning, accuracy and diversity are the main factors affecting its performance. In previous studies, diversity was regarded only as a regularization term, which does not sufficiently indicate that diversity should implicitly be treated as an accuracy factor. In this study, a two-stage weighted ensemble learning method using the particle swarm optimization (PSO) algorithm is proposed to balance the diversity and accuracy in ensemble learning. The first stage is to enhance the diversity of the individual learner, which can be achieved by manipulating the datasets and the input features via a mixed-binary PSO algorithm to search for a set of individual learners with appropriate diversity. The purpose of the second stage is to improve the accuracy of the ensemble classifier using a weighted ensemble method that considers both diversity and accuracy. The set of weighted classifier ensembles is obtained by optimization via the PSO algorithm. The experimental results on 30 UCI datasets demonstrate that the proposed algorithm outperforms other state-of-the-art baselines.


Introduction
Ensemble learning combines multiple individual learners to improve the generalization performance of an individual learner [1,2]. It is an important and popular branch of machine learning and is widely used in attack detection [3][4][5], fraud recognition [6,7], image recognition [8,9], biomedicine [10,11], intelligent manufacturing [12,13], time series analysis [14,15] and other fields. Ensemble learning usually involves multiple weak classifiers, such as decision trees [16], support vector machines [17], neural networks [18] and k-nearest neighbors [19], to form a strong classifier, multiple strong classifiers [20] or even a combination of multiple machine learners to complete the learning task. Currently, ensemble learning is regarded as the best way to solve machine learning problems [21].
Classic ensemble learning algorithms include bagging and boosting [22,23]. The basic idea of bagging is that the original dataset is first repeatedly sampled uniformly at random with replacement to obtain sample subsets; then, different individual learners are trained on these subsets, and their predictions are finally integrated by voting. The basic idea of boosting is to select a subset from the sample set as a training set according to a uniform distribution, then run a weak classifier multiple times, giving a larger distribution weight to the samples that were misclassified, and finally integrate the resulting classifiers by weighting. Bagging and boosting are among the most elegant and widely utilized ensemble strategies; for example, Livieris et al. [24] develop two ensemble prediction models that use these strategies to combine the predictions of multiple weight-constrained neural network classifiers. Based on the bagging and boosting design philosophy, the currently popular ensemble learning algorithms mainly include AdaBoost [25], random forest (RF) [26], gradient boosting decision tree (GBDT) [27], and extremely randomized trees (ERTs) [28].
To obtain good ensemble effects, individual learners should be accurate and diverse [23]. Theoretical and experimental studies have shown that combining a set of accurate and complementary learners can improve the generalization ability of ensemble learning. The diversity between learners is an important factor affecting the generalization performance of ensemble systems. Most existing classical ensemble learning algorithms implicitly use the diversity between learners. To enhance the diversity, Rokach [29] proposes a variety of methods, such as manipulating the inducer, manipulating the training sample, changing the target attribute representation and partitioning the search space. For example, the DECORATE algorithm [30] adds artificial data to the training dataset to train the classifier to increase the diversity of the training data and reduce the classification error.
Although the diversity of learners is necessary for improving the ensemble effect, maximizing diversity does not result in the best generalization performance. The experimental results also show that no strong correlation exists between the accuracy of the learners and diversity. An increase in diversity can actually reduce the overall accuracy [31], as its improvement comes at the cost of reducing the accuracy of each individual classifier [32]. Compared with the overall diversity, the accuracy of an individual learner is the main factor in the success of ensemble learning [33]. Liu et al. [34] introduce the diversity factor as a regularization term for ensemble learning to avoid overfitting. Therefore, to improve the overall ensemble learning performance, a balance between the diversity and accuracy of the learners must be established. Under the premise of considering accuracy and diversity, Zhang et al. [35] propose a classifier selection method based on a genetic algorithm. They integrate unsupervised clustering with a fuzzy assignment process to make full use of data patterns to improve the ensemble performance. Mao et al. [36] propose a transformation ensemble learning framework in which the combination of multiple base learners is converted into a linear transformation of all these base learners and which constructs an optimization objective function for balancing accuracy and diversity. The alternating direction method of multipliers is used to solve this problem. This method effectively improves the performance of ensemble learning.
In previous studies [35,36], although the objective function of the ensemble learning model includes accuracy and diversity factors, the diversity factor is used as a regularization term to avoid overfitting. Moreover, the diversity factor considers only the prediction results of the classifier. The diversity does not express the accuracy factor that the diversity should imply, nor does it reflect how to produce an individual learner with moderate diversity. The weighted ensemble strategy assigns a weight to each individual learner so that it properly represents its contribution to the ensemble. It has been demonstrated to be highly efficient in many real-world fields, such as biochemistry [37], medical diagnosis [38] and statistical modeling [39]. In the past two decades, there has been an unprecedented development in the field of computational intelligence. Evolutionary computing and swarm intelligence have proved effective in many applications, such as deep learning [40] and multiobjective optimization [41][42][43], because they are flexible and make few assumptions about the objective function. Particle swarm optimization (PSO) is one of many such techniques and has been widely used in optimization problems over continuous and discrete functions with complex structures [44].
In this paper, a novel weighted ensemble learning algorithm based on diversity (WELAD) is proposed to balance the accuracy and diversity in ensemble learning. The method is divided into two stages: the first stage follows the principle of moderate diversity of individual learners to generate diversified individual learners by manipulating training samples and input features, and the diversity measure adopts the disagreement equation of the pairwise measure. This method includes the accuracy factor of the ensemble classifier and uses this value as the fitness function value of the PSO algorithm [45] to control the iterative search process, after which a set of ensemble classifiers with appropriate accuracy and adequate diversity is obtained. In the second stage, the weighted ensemble of the classifiers is used. In the ensemble process, the optimization objective function considers both the accuracy and diversity simultaneously for ensemble learning. The diversity factor at this stage is used as a regularization term to prevent overfitting. The diversity measurement method at this stage considers only the dissimilarity in the prediction results of the ensemble classifiers. Then, the PSO algorithm is used to optimize the weight values of a set of classifiers, and finally, a weighted ensemble classifier with good generalization performance is obtained. The major distinction between the WELAD design philosophy and previous studies is in the use of ensemble classifiers with appropriate diversity as an important factor when constructing ensemble learning models. Therefore, in the first stage, when the PSO algorithm is used to search for appropriate diversified individual classifiers, the fitness function of the PSO algorithm takes into account the accuracy of the individual learner and deliberately generates the individual learner to improve the performance of the ensemble classifier.
The rest of this paper is organized as follows: Section 2 provides the relevant theoretical background for this study, including introductions to the strategies for combining ensemble classifiers, the diversity measure and the PSO algorithm. In Section 3, we describe the proposed WELAD approach using a two-stage PSO algorithm. We present our experimental results in Section 4 and analyze the proposed method and the current and classic ensemble learning algorithms. Finally, in Section 5, we provide conclusions, summarize this study and suggest future research directions.

Notations and definitions
For convenience of description, the following notations and definitions are shown in Table 1.
$h_i(x)$: The predicted output of classifier $h_i$ on sample $x$
$w$: The weight vector of an ensemble classifier
$.*$: The multiplication operation of the corresponding elements in the same dimension vector
$\oplus$: The XOR operation
$H_1(x)$: The prediction output of ensemble learning by the majority voting method
$H_2(x, w)$: The prediction output of ensemble learning by the weighted voting method
$Div_1(h)$: Diversity measures based on consideration of accuracy
$Div_2(h, w)$: Diversity measures based on avoiding overfitting
$K$: The number of subsets on a dataset using the K-fold cross-validation method
$M$: The number of clusters on a dataset using the K-means clustering method

Combining Strategies of Classifiers
Consider an ensemble learning system that contains L classifiers $\{h_1, h_2, \ldots, h_L\}$ and a dataset $X = \{x_n\}$, $n = 1, 2, \ldots, N$, $X \in R^{D \times N}$, that includes C categories $\{c_1, c_2, \ldots, c_C\}$. For classification problems, if the prediction output of $h_i$ for sample $x$ is represented by one-hot coding, then the prediction output can be expressed as a C-dimensional vector $h_i(x) = (h_i^1(x), h_i^2(x), \ldots, h_i^C(x))^T$, where $h_i^j(x) \in \{0, 1\}$ and $h_i^j(x) = 1$ if $h_i$ assigns $x$ to category $c_j$.
A common classifier combination strategy is the relative majority voting method. The prediction output of ensemble learning $H_1(x)$ can be obtained as follows:
$$H_1(x) = c_{\arg\max_j \sum_{i=1}^{L} h_i^j(x)} \quad (1)$$
Another classifier combination strategy is the weighted voting method. The prediction output of ensemble learning $H_2(x, w)$ can be obtained as follows:
$$H_2(x, w) = c_{\arg\max_j \sum_{i=1}^{L} w_i h_i^j(x)} \quad (2)$$
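The two combination strategies can be illustrated with a minimal sketch. It works directly on predicted labels rather than one-hot vectors, and the function names and the tie-breaking rule (the first label to reach the top score wins) are assumptions made for illustration, not part of the original method.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Weighted voting: each classifier adds its weight to the class it predicts."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # Assumed tie-break: the label that was inserted first among the top scores.
    return max(scores, key=scores.get)


def majority_vote(predictions):
    """Relative majority voting is weighted voting with equal weights."""
    return weighted_vote(predictions, [1.0] * len(predictions))


votes = ["a", "b", "b"]                       # labels predicted by three classifiers
plain = majority_vote(votes)                  # two of three classifiers say "b"
weighted = weighted_vote(votes, [0.7, 0.2, 0.1])  # the first classifier dominates
```

With equal weights the ensemble follows the most frequent label, whereas a sufficiently large weight lets a single classifier override the others.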

Classifier Diversity Metrics
Many methods exist for measuring the diversity between learners. Kuncheva and Whitaker [46] collected more than 10 different measurement methods from different fields and different perspectives, but none has yet been generally recognized by scholars. To meet the needs of optimizing the ensemble learner in two stages, this paper uses the following two different measurement methods.

Individual Learner Diversity Measurement Method
Let $y_n$, $n = 1, 2, \ldots, N$, be the category labels of the samples in X, and let the vector $q_i$ denote whether the prediction result of the ith classifier on each sample of X is correct. The value of each element can be obtained by Equation (3), and the pairwise diversity metric between classifiers $h_i$ and $h_j$ can be calculated by Equation (4).
Equation (4) gives the ratio of the number of samples that one classifier classifies correctly while the other misclassifies them to the total number of samples. Because it is computed from the correctness of the predictions rather than from the raw predictions, this diversity measurement includes an accuracy factor. For the ensemble classifier with L classifiers, the diversity measure considering accuracy in the first stage of this paper can be calculated by Equation (5), which represents the average of the diversity values of the paired classifiers.
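The first-stage measure can be sketched as follows, assuming that it takes the standard form of the pairwise disagreement measure: the correctness vectors are combined with XOR, and the pairwise values are averaged over all L(L-1)/2 classifier pairs. The function names are hypothetical.

```python
from itertools import combinations


def correctness(preds, labels):
    """q_i: 1 where the classifier predicts the true label, 0 otherwise."""
    return [int(p == y) for p, y in zip(preds, labels)]


def disagreement(q_i, q_j):
    """Fraction of samples on which exactly one of the two classifiers is correct."""
    return sum(a ^ b for a, b in zip(q_i, q_j)) / len(q_i)


def div1(all_preds, labels):
    """Average pairwise disagreement over all classifier pairs (assumed Div_1)."""
    qs = [correctness(preds, labels) for preds in all_preds]
    pairs = list(combinations(qs, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)


labels = [0, 1, 1, 0]
all_preds = [
    [0, 1, 1, 0],  # h_1: all correct
    [0, 1, 0, 1],  # h_2: correct on the first two samples
    [1, 1, 1, 0],  # h_3: wrong only on the first sample
]
value = div1(all_preds, labels)
```

Two classifiers that are both wrong on the same sample contribute nothing to this measure, which is why it rewards diversity only when at least one of the pair is correct.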

Weighted Ensemble Process Diversity Measurement Method
Unlike the first stage, in the second stage of the classifier weighted ensemble, the diversity between classifiers is mainly used as a regularization term to prevent overfitting of the ensemble classifier. It does not consider the correctness of the classifiers and considers only the diversity in the prediction results between the classifiers. Similar to Mao et al. [36], the weighted diversity metric between classifiers $h_i$ and $h_j$ can be defined as Equation (6), and the weighted diversity metric of the ensemble classifier can be defined as Equation (7).
Following [36], Equation (7) can be transformed into the matrix form of Equation (8), where D denotes the diversity matrix between the classifiers and each element $d_{ij}$, the diversity between $h_i$ and $h_j$, can be obtained by Equation (9).
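A rough sketch of a weighted diversity measure of this kind is given below. It assumes that $d_{ij}$ is the fraction of samples on which $h_i$ and $h_j$ predict different labels and that the ensemble measure is the quadratic form $w^T D w$; both are assumptions made for illustration, and the exact definitions in Equations (6) to (9) may differ.

```python
def diversity_matrix(all_preds):
    """Assumed d_ij: fraction of samples on which h_i and h_j predict differently."""
    n_learners = len(all_preds)
    n_samples = len(all_preds[0])
    return [
        [
            sum(a != b for a, b in zip(all_preds[i], all_preds[j])) / n_samples
            for j in range(n_learners)
        ]
        for i in range(n_learners)
    ]


def div2(all_preds, w):
    """Assumed Div_2(h, w): the quadratic form w^T D w."""
    d = diversity_matrix(all_preds)
    n_learners = len(w)
    return sum(
        w[i] * d[i][j] * w[j] for i in range(n_learners) for j in range(n_learners)
    )


all_preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],  # identical to the first classifier
]
value = div2(all_preds, [0.5, 0.5, 0.0])
```

Unlike the first-stage measure, this one never looks at the true labels, so it captures only how differently the classifiers behave.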

PSO Algorithm
The PSO algorithm was initially developed by Eberhart and Kennedy [45]. It is inspired by the intelligent behavior of birds in search of food. As a swarm intelligence optimization algorithm, PSO has many advantages, such as stability, fast convergence, few parameters to adjust and ease of implementation. PSO has been successfully applied to combinatorial optimization in various engineering problem areas, such as data mining [47], artificial neural network training [48], vehicle path planning [49], medical diagnosis [50,51] and system and engineering design [52].

Basic PSO Solution Process
In the PSO algorithm, each particle represents a potential solution, and several particles form a swarm. Let $R^D$ denote the D-dimensional problem space. In the tth generation, the position and velocity of the ith particle are expressed as $x_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{iD}^t)$ and $v_i^t = (v_{i1}^t, v_{i2}^t, \ldots, v_{iD}^t)$, respectively. Let $pbest_i^t$ be the individual best position of this particle and $gbest^t$ be the swarm best position. In the (t + 1)th generation, the velocity and position of the particle are updated as follows:
$$v_i^{t+1} = \omega v_i^t + c_1 r_1 (pbest_i^t - x_i^t) + c_2 r_2 (gbest^t - x_i^t) \quad (10)$$
$$x_i^{t+1} = x_i^t + v_i^{t+1} \quad (11)$$
where $c_1$ is the cognitive learning factor, $c_2$ is the social learning factor, $c_1$ and $c_2$ determine the magnitude of the random force in the direction of the particle's previous best-visited position and of the best particle, respectively, $r_1$ and $r_2$ are two independent uniform random numbers in [0,1], and $\omega$ is the inertia weight used to control the balance between exploration and exploitation. A large $\omega$ is useful for jumping out of a local optimal solution and improves the global search ability. In contrast, a small $\omega$ is suitable for algorithm convergence and enhances the local search ability. Since the inertia weight is one of the essential parameters in PSO, in this study, a dynamic inertia weight [53] is used to improve PSO performance through a linearly decreasing mechanism, as shown in Equation (12). This mechanism emphasizes the particle's exploration ability in the early period and the particle's exploitation ability in the later period.
$$\omega = \omega_{max} - (\omega_{max} - \omega_{min}) \frac{t}{T_{max}} \quad (12)$$
where $\omega_{max}$ and $\omega_{min}$ are the bounds on the inertia weight, t is the current iteration number, and $T_{max}$ is the maximum number of iterations. Generally, $\omega_{max} = 0.9$ and $\omega_{min} = 0.4$ are fixed values. In addition, a maximum velocity $v_{max}$ serves as a constraint to keep the positions of the swarm within the solution search space.
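The update rules with a linearly decreasing inertia weight can be sketched for a generic minimization problem as follows. The learning factors $c_1 = c_2 = 2.0$, the velocity limit of 20% of the search range, and the function name are assumptions chosen for illustration; they are not values reported in this paper.

```python
import random


def pso(f, dim, bounds=(-5.0, 5.0), swarm=30, t_max=50,
        c1=2.0, c2=2.0, w_max=0.9, w_min=0.4, seed=0):
    """Minimize f over a box using PSO with a linearly decreasing inertia weight."""
    rng = random.Random(seed)
    lo, hi = bounds
    v_max = 0.2 * (hi - lo)  # assumed velocity limit
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(swarm)]
    v = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in x]
    pbest_val = [f(p) for p in x]
    g = min(range(swarm), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for t in range(t_max):
        w = w_max - (w_max - w_min) * t / t_max  # inertia decreases linearly with t
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])   # cognitive term
                           + c2 * r2 * (gbest[d] - x[i][d]))     # social term
                v[i][d] = max(-v_max, min(v_max, v[i][d]))       # clamp velocity
                x[i][d] = max(lo, min(hi, x[i][d] + v[i][d]))    # stay in the box
            val = f(x[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = x[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = x[i][:], val
    return gbest, gbest_val


best, best_val = pso(lambda p: sum(q * q for q in p), dim=2)
```

Here the swarm best is updated as soon as a particle improves on it, which is one common implementation choice; updating it once per generation is an equally valid variant.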

PSO Algorithm for Binary Optimization
The original PSO algorithm is able to optimize only continuous problems and cannot be used directly for optimization in binary spaces. Therefore, an early binary version of PSO (BPSO) was proposed in 1997 [54]. In BPSO, first, the velocity of the particle is converted into the probability of a bit taking the value 1 by the sigmoid function, as shown in Equation (13).
$$sig(v_i^t) = \frac{1}{1 + e^{-v_i^t}} \quad (13)$$
Then, the current binary state of each bit of the particle's position is determined by comparison with a random value:
$$x_i^t = \begin{cases} 1, & \text{if } rand() < sig(v_i^t) \\ 0, & \text{otherwise} \end{cases} \quad (14)$$
where rand() is a function that generates a number in the interval (0, 1) using a uniform distribution. $v_{max}$ is set to prevent $sig(v_i^t)$ from being too close to 0 or 1.
Although the early version of BPSO was successfully applied to many combinatorial optimization problems, it has some issues. In particular, the velocity formula of BPSO is exactly the same as that of PSO, which might not be suitable because the three most important components, i.e., momentum, the cognitive component (defined by pbest) and the social component (defined by gbest), may have different meanings in a binary space [55]. Thus, new velocity and momentum concepts must be defined.
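The sigmoid-based position update of the early BPSO can be sketched as below. The velocity limit of 6.0 is a commonly used setting and is an assumption here, as are the function names.

```python
import math
import random


def sigmoid(v):
    """Map a velocity to the probability that the corresponding bit is 1."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_position(velocity, v_max=6.0, rng=random):
    """Sample a binary position from a velocity vector."""
    bits = []
    for v in velocity:
        v = max(-v_max, min(v_max, v))  # keep sig(v) away from 0 and 1
        bits.append(1 if rng.random() < sigmoid(v) else 0)
    return bits


bits = bpso_position([100.0, -100.0, 0.0], rng=random.Random(1))
```

Because of the clamp, even an extreme velocity leaves a small probability of the opposite bit value, so the search never becomes fully deterministic.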

Mixed PSO Algorithm
In the first stage of this study, in the process of generating a set of individual classifiers with moderate diversity, the characteristics of the dataset are selected as discrete values of {0, 1}. The resampling ratio of the sub-dataset is [0,1] continuous values; hence, this stage involves discrete and continuous mixed space optimization problems. Therefore, we designed a mixed PSO algorithm (MPSO), and the design search space was divided into a continuous domain and a binary domain, which correspond to the continuous and binary components of the design variable vector, respectively [56].
The position code of a particle consists of two parts: continuous and binary, as shown in Figure 1.
The MPSO solution process is divided into two steps: a continuous search step and a binary search step. The continuous search step uses the conventional PSO method, and the binary search step is implemented as the primary search strategy and uses a "stickiness" momentum mechanism in BPSO. The term "stickiness" was proposed by Nguyen et al. [55,57]. Since applying the velocity concept of PSO directly to BPSO is not appropriate, the main idea of "stickiness" is that a particle moves by flipping its position entries in BPSO. When a bit has just flipped, it should tend to keep its new value for a while, and this tendency then decreases over successive iterations.
In the binary search step, a vector $curLife_j^t$ ($j = 1, 2, \ldots, D$) records the number of iterations that have passed since the jth bit of the particle's binary position was last flipped. If the bit has just been flipped or initialized, then $curLife_j^t$ is 1; for each subsequent iteration in which the bit does not flip, $curLife_j^t$ increases by 1. The stickiness of the jth bit is calculated from $curLife_j^t$, where $\alpha \in (0, 1)$ in Equation (15) is the scale factor of the number of iterations (in this study, $\alpha = 0.2$). If a bit is not flipped, its stickiness decreases as the number of iterations increases. For each bit of the particle's binary position, the flipping probability is given by Equation (17), where $\beta_m$, $\beta_p$ and $\beta_g$ are the proportion factors, which represent the contributions of the momentum, the cognitive factor and the social factor to the flipping probability, respectively. In this study, $\beta_m = 0.25$, $\beta_p = 0.25$, and $\beta_g = 0.5$, which means that the global best particle's position is more important. Based on Equation (17), the new position of a particle can be calculated by Equation (18).
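The stickiness mechanism can be sketched as follows. The exact formulas are assumptions modelled on the sticky BPSO of Nguyen et al.: stickiness is taken to decay linearly to zero over $\alpha T_{max}$ iterations, and the flipping probability is taken to be a weighted sum of the lost stickiness and the disagreement with pbest and gbest. Only the values of $\alpha$, $\beta_m$, $\beta_p$ and $\beta_g$ come from the text; $T_{max} = 50$ is the setting reported in the experimental setup.

```python
import random


def stickiness(cur_life, alpha=0.2, t_max=50):
    """Assumed form: decays linearly from near 1 to 0 over alpha * t_max iterations."""
    return max(0.0, 1.0 - cur_life / (alpha * t_max))


def flip_probability(x, pbest, gbest, stk, beta_m=0.25, beta_p=0.25, beta_g=0.5):
    """Assumed form: momentum term plus cognitive and social disagreement terms."""
    return beta_m * (1.0 - stk) + beta_p * abs(pbest - x) + beta_g * abs(gbest - x)


def update_bits(x, pbest, gbest, cur_life, rng=random):
    """Flip each bit with its flipping probability and maintain curLife."""
    new_x, new_life = [], []
    for x_j, p_j, g_j, life in zip(x, pbest, gbest, cur_life):
        prob = flip_probability(x_j, p_j, g_j, stickiness(life))
        if rng.random() < prob:
            new_x.append(1 - x_j)      # flip the bit
            new_life.append(1)         # a freshly flipped bit restarts its life
        else:
            new_x.append(x_j)
            new_life.append(life + 1)  # the bit ages by one iteration
    return new_x, new_life


new_x, new_life = update_bits([0, 1], [1, 1], [1, 1], [1, 5], rng=random.Random(0))
```

A bit that disagrees with both pbest and gbest is therefore very likely to flip, while a recently flipped bit that already agrees with both is almost certain to stay.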

Architecture of the Proposed Methodology
The architecture of this research method is shown in Figure 2. Following the K-fold cross-validation method, the basic idea is to split the training dataset into K subsets, denoted as $\{SX_i, ST_i\}$, $i = 1, 2, \ldots, K$, where $SX_i$ is used for individual learner diversity training and $ST_i$ is used for individual learner weighted ensemble training. With the WELAD (refer to Section 3.2 for the specific two-stage implementation process), K sets of ensemble classifiers are generated, denoted as $E_k = \{h_{k1}, \ldots, h_{kl}, \ldots, h_{kL}\}$, $k = 1, 2, \ldots, K$, where L is the number of classifiers in each set. Each classifier in each set of ensemble classifiers has a corresponding weight value, $w_k = \{w_{k1}, \ldots, w_{kl}, \ldots, w_{kL}\}$, $k = 1, 2, \ldots, K$. When making predictions on the test dataset, each of the K sets of ensemble classifiers first performs weighted ensemble voting, which yields K prediction results. Then, the K prediction results are combined by majority voting to obtain the final prediction result. The idea behind this approach (i.e., K sets of ensemble classifiers) is to test the model's ability to predict new data, to avoid overfitting or selection bias, and to provide insight into how the model will generalize to an independent dataset.
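The two-level prediction step described above can be sketched as follows, with each classifier modelled as a plain callable. The function names and the toy classifiers are hypothetical.

```python
from collections import Counter, defaultdict


def weighted_vote(labels, weights):
    """Return the label with the largest total weight."""
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


def predict(sample, ensembles):
    """ensembles: K pairs (classifiers, weights), one pair per training split."""
    fold_votes = []
    for classifiers, weights in ensembles:
        labels = [h(sample) for h in classifiers]
        fold_votes.append(weighted_vote(labels, weights))   # inner weighted vote
    return Counter(fold_votes).most_common(1)[0][0]          # outer majority vote


# Toy stand-ins for trained classifiers; each ignores its input.
def says_a(_):
    return "a"


def says_b(_):
    return "b"


ensembles = [
    ([says_a, says_b], [0.9, 0.1]),   # weighted vote gives "a"
    ([says_b, says_b], [0.5, 0.5]),   # weighted vote gives "b"
    ([says_a, says_b], [0.6, 0.4]),   # weighted vote gives "a"
]
final = predict(None, ensembles)
```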

Description of the WELAD
The two-stage implementation of the WELAD is shown in Figure 3. In the first stage, to enhance the diversity of the training set, K-means clustering, resampling and feature selection are performed on the training dataset $SX_i$. With the diversity measure that considers accuracy as the fitness function of the MPSO algorithm, iterative optimization generates a set of individual learners with appropriate diversity. The second stage involves the weighted ensemble of individual learners. First, the set of individual learners created in the first stage is used to predict the training dataset $ST_i$, and then the weight coefficients of the individual learners are optimized by the PSO algorithm. Finally, the weighted ensemble classifier $(E_i, w_i)$ is obtained.
For a set of ensemble classifiers with resampling ratios and feature selection parameters, MPSO is used to search for the parameter values that yield appropriate diversity. Resampling and feature selection are implemented on the clustered subsets so that the training samples of each classifier are different, and MPSO optimizes the sample resampling ratio and feature selection parameters of each classifier (i.e., individual learner). After training, a set of ensemble classifiers $E_i$ with appropriate diversity is obtained. Let $T_{max}$ represent the maximum number of iterations, SS represent the population size of the swarm, and $\{B_1, B_2, \ldots, B_M\}$ represent the subsets after clustering; the pseudocode of Algorithm 1 is as follows.
5. for j = 1 to SS do
6.   for l = 1 to L do
7.     Initialize $f_l$, $f_l$ ← rand{0, 1};
10.    Use $NS_l$ to train the learner $h_l$;
11.    Use learner $h_l$ to conduct the classification prediction on $SX_i$, and obtain the category label;
19. for j = 1 to SS do
21.   Update the binary part of $x_j^t$ using Equation (18);
22.   Update the continuous part of $x_j^t$ using PSO;
23.   According to $x_j^t$, train and evaluate the learners;
24. end
26. Evaluate the fitness of particles, and set $pbest^t$ and $gbest^t$;
27. end
28. Output: F, R
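How one learner's part of a particle might be decoded into a training view can be sketched as below. The sketch assumes that the binary part is a 0/1 mask over features, that the continuous part holds one resampling ratio per cluster, and that resampling draws with replacement; the two guards and all names are hypothetical and are not taken from Algorithm 1.

```python
import random


def decode_learner(feature_bits, ratios, clusters, rng):
    """Turn one learner's particle segment into (feature indices, sample indices).

    feature_bits: 0/1 flag per feature (binary part of the particle)
    ratios:       value in [0, 1] per cluster (continuous part of the particle)
    clusters:     list of lists of sample indices, one list per cluster B_m
    """
    features = [j for j, bit in enumerate(feature_bits) if bit == 1]
    if not features:
        # Assumed guard: a learner needs at least one input feature.
        features = [rng.randrange(len(feature_bits))]
    sample_idx = []
    for ratio, block in zip(ratios, clusters):
        # Assumed guard: draw at least one sample from every cluster.
        k = max(1, round(ratio * len(block)))
        sample_idx.extend(rng.choices(block, k=k))  # sampling with replacement
    return features, sample_idx


clusters = [[0, 1, 2, 3], [4, 5], [6, 7, 8, 9]]
features, sample_idx = decode_learner(
    [1, 0, 1], [0.5, 1.0, 0.25], clusters, random.Random(0)
)
```

Each learner would then be trained on the selected rows and columns, and the resulting set of learners scored with the first-stage diversity measure to obtain the particle's fitness.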

Stage II: Generate the Weighted Ensemble Classifier
The purpose of this stage is to determine the weight values of a set of ensemble classifiers. Given a training dataset used for the weighted ensemble, which has G samples, under the effect of weight w, the prediction result of the weighted ensemble classifier for x i is expressed by Equation (20), and the training error is expressed by Equation (21).
At this stage, the PSO algorithm is used to solve an optimization problem in which $Div_2(h, w)$ is a regularization term used to prevent overfitting and $\lambda \in (0, 1)$ is a regularization coefficient used to balance the accuracy and diversity of the ensemble learning. Algorithm 2 is implemented to assign different weights to the obtained set of classifiers with appropriate diversity. Let the prediction results of the L classifiers on the weighted ensemble training dataset be $q_1, q_2, \ldots, q_L$; the pseudocode is as follows.
Algorithm 2. Using PSO to optimize the weight parameters of the classifier w

4. for j = 1 to SS do
15. Evaluate the fitness of particles, and set $pbest^t$ and $gbest^t$;
16. end
17. Output: w
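A possible shape of the second-stage fitness function is sketched below: the training error of the weighted vote combined with a diversity term scaled by $\lambda$. The sign convention (diversity is rewarded, so it is subtracted from the error to be minimized), the normalization of the weights, and the `diversity` callable are all assumptions made for illustration; $\lambda = 0.4$ is the value reported in the experimental setup.

```python
from collections import defaultdict


def weighted_error(all_preds, labels, w):
    """Fraction of samples misclassified by the weighted vote."""
    errors = 0
    for n, y in enumerate(labels):
        scores = defaultdict(float)
        for preds, w_i in zip(all_preds, w):
            scores[preds[n]] += w_i
        if max(scores, key=scores.get) != y:
            errors += 1
    return errors / len(labels)


def fitness(all_preds, labels, w, diversity, lam=0.4):
    """Assumed objective to minimize: error minus lam times a diversity term."""
    total = sum(w)
    w = [w_i / total for w_i in w]  # assumed: weights are normalized to sum to 1
    return weighted_error(all_preds, labels, w) - lam * diversity(all_preds, w)


labels = [0, 1, 1, 0]
all_preds = [[0, 1, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0]]
err_balanced = weighted_error(all_preds, labels, [0.3, 0.4, 0.3])
err_skewed = weighted_error(all_preds, labels, [0.1, 0.8, 0.1])
```

In this toy example the balanced weights classify every sample correctly, whereas concentrating the weight on the second classifier reproduces its two mistakes, which is the kind of difference the PSO search over w is meant to exploit.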

Experimental Dataset
The experiment used 30 UCI public datasets (http://archive.ics.uci.edu/ml/datasets.php) as test datasets to verify the performance of the algorithm. The information regarding the samples, attributes, and categories of each dataset is shown in Table 2. The computational complexity of the WELAD algorithm is composed of two factors: the decision tree algorithm and the feature selection required to generate the base classifier with diversity. The computational complexity of the decision tree algorithm is $O(N^2 F)$. The computational complexity of the feature selection required to generate the base classifier with diversity is $O(2^F)$, where N is the number of samples and F is the number of features in the dataset. Hence, the overall computational complexity of the WELAD algorithm for each dataset is $O(N^2 F 2^F)$.

Experimental Setup
In this study, WELAD is defined as the experimental group, and the AdaBoost, Bagging, DECORATE, ERT, GBDT, and RF algorithms form the control group. The performance of the WELAD is verified through experiments. The WELAD is implemented in Python. Because the decision tree classifier in the machine learning toolkit scikit-learn (https://scikit-learn.org/stable/) uses classification and regression trees (CART) as its default classification algorithm, all algorithms in the control group except DECORATE also use CART as the base classifier. Since scikit-learn does not integrate the DECORATE algorithm, we use the WEKA machine learning toolkit (https://www.cs.waikato.ac.nz/ml/weka/) instead, and its base learner uses the J48 decision tree.
The parameter values in the two stages of the WELAD are determined by the grid search method: λ ranges from 0.1 to 0.9 in steps of 0.1, M ∈ {3, 5, 7}, and SS ∈ {20, 30, 50}; the final choice is M = 3, λ = 0.4, SS = 30, and T max = 50. The number L of base classifier trees of the experimental group and the control group is uniformly set to 30. For the selection of the training dataset splitting parameter K, according to the idea of the majority voting ensemble, a larger K value will improve accuracy, but for a dataset with a small number of samples, a larger K value will result in too few samples for training, which increases the errors of the individual learners. In preliminary experiments, we tried K ∈ {5, 7, 9} and finally chose K = 7.

Experimental Results
The experiment uses the accuracy as the performance index. To better check the generalization ability of each algorithm and to prevent the experimental results from being affected by overfitting to a particular split of the dataset, the accuracy of each algorithm on each dataset is tested by 10-fold cross-validation repeated ten times (i.e., 10×10 cross-validation), and then the average is taken. The accuracy results of the seven algorithms on the 30 UCI datasets are shown in Table 3, and the box plots of nine selected datasets are shown in Figure 4. The bold values in Table 3 indicate the best performance in each row. The WELAD achieved the highest mean accuracy of 0.8123 on the 30 datasets, being superior to the six algorithms of the control group on nine of the datasets. The standard deviation of the mean value of the WELAD is also the lowest, showing that the algorithm has good stability. ERT, GBDT and RF all achieve good and similar performances. The experimental results also reflect the instability of AdaBoost: although it reached the highest accuracy on 4 of the 30 datasets, it had the lowest accuracy on 11 datasets, resulting in its mean accuracy being the worst among all algorithms. The box plots of the nine datasets also reflect the advantages of the WELAD.
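The 10×10 cross-validation protocol can be sketched as an index generator. This is a plain, non-stratified version with a hypothetical function name; whether the original experiments used stratified folds is not stated.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                          # a fresh shuffle for every repetition
        folds = [idx[i::k] for i in range(k)]     # k nearly equal-sized folds
        for i in range(k):
            test = folds[i]
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test


splits = list(repeated_kfold(25))
```

The reported accuracy of an algorithm on a dataset would then be the mean of the test accuracies over all 100 splits.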

Statistical Test
To further verify the overall performance of the WELAD, this study conducts a statistical analysis in which we follow the procedure proposed by Demsar [58] and use nonparametric statistical tests for multiple comparisons by combining all the datasets. First, we apply the Friedman test to rank the performances of the seven ensemble learning methods and evaluate whether these ensemble learning methods show statistically significant differences on the 30 datasets. Table 4 shows the Friedman rank comparison, in which a higher mean rank indicates better performance. The ensemble learning algorithms, ranked from best to worst, are WELAD (5.70), RF (4.60), ERT (4.38), GBDT (4.22), AdaBoost (3.47), DECORATE (3.15), and Bagging (2.48). The WELAD method is the best, as expected. RF, ERT, and GBDT are recently developed, well-known approaches that achieve the second-best results. The traditional ensemble learning methods AdaBoost and Bagging, as well as the diversity-oriented DECORATE, perform relatively poorly.
Table 5 reports the results of the Friedman test and shows whether an overall statistically significant difference exists between the mean ranks of the related groups. Based on the Friedman test results in Table 5, the p-value of 0.000, which is below 0.05, indicates that the differences between the seven ensemble learning algorithms in this study are statistically significant. Then, we apply the nonparametric Wilcoxon signed-rank test to analyze the differences between paired ensemble learning algorithms. Table 6 shows whether statistically significant differences exist between the WELAD and the other ensemble learning methods according to the Wilcoxon signed-rank test. As shown in Table 6, the WELAD and the comparison group algorithms are statistically significantly different.
Table 6. Wilcoxon signed-rank test.
In summary, the experimental results on these 30 datasets show that the WELAD ensemble method proposed in this study has the highest accuracy and the best overall performance.
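As a rough check of the Friedman test, the test statistic can be recomputed from the reported mean ranks with the textbook formula $\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$. The sketch below uses the rounded ranks quoted above and applies no tie correction, so it need not match the value in Table 5 exactly; the function name is hypothetical.

```python
def friedman_chi2(mean_ranks, n_datasets):
    """Friedman chi-square statistic from mean ranks (no tie correction)."""
    k = len(mean_ranks)
    sum_sq = sum(r * r for r in mean_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Mean ranks as reported in the text for the seven algorithms.
ranks = {
    "WELAD": 5.70, "RF": 4.60, "ERT": 4.38, "GBDT": 4.22,
    "AdaBoost": 3.47, "DECORATE": 3.15, "Bagging": 2.48,
}
chi2 = friedman_chi2(list(ranks.values()), n_datasets=30)
```

With these rounded ranks the statistic is about 43.4, well above the 5% critical value of 12.59 for a chi-square distribution with 6 degrees of freedom, which is consistent with the reported significance.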

Conclusions
To obtain a good ensemble performance, each individual learner that composes an ensemble classifier is required to have high accuracy and appropriate diversity. Existing research on balancing diversity and accuracy in ensemble learning does not fully address the problem of how to generate individual learners with appropriate diversity. In this study, a two-stage weighted ensemble learning method is proposed. Both when creating the individual classifiers and when integrating them, diversity and accuracy are considered at the same time, the corresponding fitness function of the PSO algorithm is constructed, and the PSO algorithm is used to optimize the target model.
The proposed ensemble learning method is evaluated on 30 UCI datasets, and the experimental results show that the proposed method achieves the highest mean accuracy and the lowest standard deviation of the mean value, which means that the proposed method has better classification performance and stability. In subsequent work, other evolutionary optimization algorithms (e.g., differential evolution [59]) or different swarm intelligence optimization algorithms (e.g., cuckoo search [60] and artificial bee colony [61]) can be used to compare whether their classification performances are different. Another ensemble model, stacking [62], will also be explored. The method can also be applied to related fields, such as high-tech industry or quality medical services management.