Feature Selection Using Enhanced Particle Swarm Optimisation for Classification Models

In this research, we propose two Particle Swarm Optimisation (PSO) variants to undertake feature selection tasks. The aim is to overcome two major shortcomings of the original PSO model, i.e., premature convergence and weak exploitation around the near-optimal solutions. The first proposed PSO variant incorporates four key operations, including a modified PSO operation with rectified personal and global best signals, spiral search based local exploitation, Gaussian distribution-based swarm leader enhancement, and mirroring and mutation operations for worst solution improvement. The second proposed PSO model enhances the first one through four new strategies, i.e., an adaptive exemplar breeding mechanism incorporating multiple optimal signals, nonlinear function oriented search coefficients, exponential and scattering schemes for swarm leader, and worst solution enhancement, respectively. In comparison with a set of 15 classical and advanced search methods, the proposed models demonstrate statistically significant superiority in discriminative feature selection across a total of 13 datasets.


Introduction
The knowledge discovery processes in real-world applications often involve datasets with large numbers of features [1]. The high dimensionalities of datasets increase the likelihood of overfitting and impair generalization capability. Besides that, the inclusion of redundant or even contradictory features can severely reduce the performance of classification, regression and clustering algorithms [2]. As a result, feature selection and dimensionality reduction become critical in overcoming the aforementioned challenges by eliminating certain irrelevant and redundant features while identifying the most effective and discriminative ones [3,4]. Moreover, for datasets with high dimensionalities, it is computationally impractical to conduct an exhaustive search of all possible combinations of the feature subsets [5]. In addition, the search landscape becomes extremely complicated, owing to the sophisticated confounding effects of various feature interactions in terms of redundancy and complementarity [6]. Therefore, effective and robust search methods are required to thoroughly explore the complex effects of feature interactions while satisfying the constraints of practicality in terms of computational cost to undertake large-scale feature selection tasks.
Evolutionary Computation (EC) techniques have been widely employed to comprehensively explore the complex effects of feature interactions, owing to the significant capability of EC in finding global optimality [4]. Inspired by natural evolution, EC techniques employ a population-based evolving mechanism to supervise the individual solutions to move towards the promising search territory iteratively and identify the global optima. In EC-based feature selection methods, the coevolution mechanisms based on diverse evolving operators, e.g., crossover and mutation, are capable of producing various feature representations of the original problem in one single run. Therefore, the confounding effects of feature interactions can be thoroughly explored through the evaluation of various feature constitutions during the iterative process. The effectiveness and superiority of various EC techniques over other methods in undertaking feature selection tasks have been extensively verified in many existing studies, such as feature optimisation using Genetic Algorithm (GA) [7], Differential Evolution (DE) [8,9], Particle Swarm Optimisation (PSO) [10], Moth-flame optimisation (MFO) [11], Firefly Algorithm (FA) [3,12], Ant Colony Optimisation (ACO) [13], Grey Wolf Optimisation (GWO) [14], Whale Optimisation Algorithm (WOA) [15], and Sine Cosine Algorithm (SCA) [16]. Nevertheless, empirical studies indicate that these original EC algorithms tend to be trapped in local optima, and they can be further improved in terms of search diversity and capability of avoiding local stagnation.
As one of the most acknowledged and widely-used EC algorithms, PSO has been adopted in various optimisation problems, owing to its simplicity, fast convergence speed, as well as effectiveness and robust generalization capability. In PSO, each particle adjusts its search trajectory by learning from two historical best experiences, i.e., its own best position and the global best solution. Despite the great advantages in following both the local and global best signals, PSO suffers from local optima traps as well as inefficient fine-tuning capabilities owing to its working principles [17-19]. Firstly, PSO lacks the operation of exchanging information between particles, owing to the fact that only the global best solution is exploited as the reference for coevolution [20]. Secondly, the swarm often tends to revisit previously explored regions, owing to the strict adherence to the historical best experiences of each particle [21]. These limitations in the original PSO model severely constrain the search diversity and search scope, hence resulting in early stagnation and premature convergence. Such constraints of the PSO algorithm become worse when undertaking feature selection tasks with complex problem landscapes.
In this research, we propose two enhanced PSO models to address the identified limitations of the original PSO algorithm as well as undertake complex feature selection problems. Specifically, the research overcomes the lack of cooperation between individual particles and the ineffectiveness of search owing to frequent re-visits to previously explored regions in the original PSO model. The proposed PSO models employ several key strategies, including leader/exemplar generation using dynamic absorption of elite genes, search operations with differentiated nonlinear trajectories, exploitation schemes for swarm leader enhancement, as well as re-dispatching mechanisms for enhancement of the worst solutions. These strategies work cooperatively as augmentations to accelerate convergence while preserving diversity. A summary of the research contributions is presented, as follows:

• Two new PSO variants for feature selection are proposed to overcome two major shortcomings of the original PSO algorithm, i.e., premature convergence and weak local exploitation capability around the near-optimal solutions.
• The first proposed PSO model, i.e., PSOVA1 (PSO variant 1), comprises the following mechanisms: (1) a modified PSO operation with rectified global and personal best signals, (2) spiral search based local exploitation, (3) Gaussian distribution based swarm leader enhancement, as well as (4) mirroring and DE mutation operations for worst solution improvement.
• The second proposed PSO model, i.e., PSOVA2 (PSO variant 2), enhances PSOVA1 through four mechanisms: (1) an adaptive exemplar breeding mechanism incorporating multiple optimal signals, (2) search coefficient generation using sine, cosine, and hyperbolic tangent functions, (3) worst solution enhancement using a hybrid re-dispatching scheme, and (4) an exponential exploitation scheme for swarm leader improvement. Moreover, the search diversity and scope in PSOVA2 are further elevated in comparison with those of PSOVA1, owing to the adoption of diverse exemplars to guide the search in each dimension, as well as the employment of versatile search trajectories to calibrate the particle positions.
• Evaluation using 13 datasets with a wide spectrum of dimensionalities: the empirical results indicate that both proposed models outperform five classical search methods and ten advanced PSO variants with significant advantages, as evidenced by the statistical test outcomes.
The rest of the paper is organized as follows. Section 2 introduces the original and diverse PSO models, and their applications to feature selection. We present the two proposed PSO models with elaborations and analysis for each proposed enhancement in Sections 3 and 4, respectively. Section 5 discusses the evaluation of the proposed and the baseline search methods on a variety of feature selection tasks. Conclusions are drawn and future research directions are presented in Section 6.

Related Studies
In this section, we firstly introduce the original PSO model. Then, the state-of-the-art PSO variants are presented. We also conduct a literature review on the application of PSO variants to feature selection. Finally, we discuss the motivation of this research.

Particle Swarm Optimisation
PSO is a population-based self-adaptive optimisation technique developed by Kennedy and Eberhart [22] based on swarm social behaviors, such as fish in a school and birds in a flock. The PSO algorithm conducts search in the landscape of the objective function by adjusting the trajectories of individual particles in a quasi-stochastic manner [23,24]. Each particle adjusts its velocity and position by following its own best experience in history and the global best solution of the swarm. In the PSO model, the updating equations of the velocity v_{id}^{t+1} and position x_{id}^{t+1} of the ith particle in the dth dimension are prescribed in Equations (1) and (2) [22]:

v_{id}^{t+1} = w v_{id}^{t} + c_1 r_1 (pbest_{id} − x_{id}^{t}) + c_2 r_2 (gbest_{d} − x_{id}^{t})    (1)

x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}    (2)

where v_i and x_i represent the velocity and position of the ith particle, while pbest_i and gbest represent the historical best solution of the ith particle and the global best solution, respectively. Besides that, c_1 and c_2 denote the acceleration constants, while r_1 and r_2 are two random values generated from [0, 1]. Moreover, t and w represent the current iteration number and inertia weight, respectively.
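The canonical update rules above can be sketched in a few lines of NumPy; the coefficient values used here (w = 0.7, c1 = c2 = 1.5) are illustrative defaults rather than the settings used in this research.

```python
import numpy as np

rng = np.random.default_rng(42)

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One iteration of the canonical PSO update, Equations (1) and (2).

    x, v  : (n_particles, n_dims) positions and velocities
    pbest : (n_particles, n_dims) personal best positions
    gbest : (n_dims,) global best position
    """
    r1 = rng.random(x.shape)  # fresh random values in [0, 1) per dimension
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (1)
    x_new = x + v_new                                              # Eq. (2)
    return x_new, v_new
```

Note that when a particle already coincides with both its personal best and the global best and has zero velocity, the update leaves it stationary, which is one symptom of the stagnation discussed below.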

PSO Variants
Despite its simplicity and fast convergence speed, the PSO model is subject to local optima traps and premature convergence, owing to the constant reference to the global best solution for all swarm particles. The particle positions also become increasingly similar over iterations. As such, various diversity enhancing strategies have been proposed, e.g., repulsion strategies [23], mutation operators [24], multi-swarm concepts [25,26], multiple leaders [25,27], and hybridization with other search methods [27]. Such strategies enable the search process to balance between convergence and diversity while searching for the global optimality.
Chen et al. [28] proposed a dynamic PSO model with escaping prey schemes (DPSOEP). In DPSOEP, swarm particles were categorized into three sub-swarms according to their fitness scores, i.e., 'preys' (top ranked particles), 'strong particles' (middle ranked particles), and 'weak particles' (lower ranked particles). The particles in the above groups subsequently followed distinctive search operations, i.e., Lévy flights, the original PSO position updating operation, and a multivariate normal distribution, respectively, to search for global optimality. Li et al. [29] proposed a multi-information fusion "triple variables with iteration" inertia weight PSO (MFTIWPSO) model, in which the inertia weight was generated using multiple information, including the particle velocity, position, random disturbance, number of iterations, as well as inertia weight score from the last iteration. The MFTIWPSO outperformed a number of baseline models for solving benchmark functions and hyperparameter tuning in classification methods.
Wang et al. [24] proposed a diversity enhancing and neighborhood search PSO (DNSPSO) model for solving multimodal high-dimensional benchmark functions. It employed a crossover factor and a DE-based operation for trial particle generation. Moreover, a ring topology was also utilized to facilitate local and global neighborhood search operations. In addition, an eXpanded PSO (XPSO) model was proposed by Xia et al. [30], where the swarm leader and a dynamic neighboring best solution were employed to guide the social component in the PSO operation.
A distributed contribution based quantum-behaved PSO with controlled diversity (DC-QPSO) was proposed by Chen et al. [31] for solving large-scale global optimisation problems. Their model first decomposed the original problem into several sub-problems. A contribution-based mechanism was then employed to ensure more resources (i.e., a larger number of function evaluations) were awarded to the sub-swarms with comparatively more fitness enhancement. A diversity control strategy based on genotype diversity (i.e., distance-based diversity) was subsequently used to increase search diversity.
Lin et al. [32] proposed an enhanced genetic learning PSO (GL-PSO) algorithm for global optimisation. In GL-PSO, the genetic operators and a ring topology were employed for the generation of fitter exemplars, which were subsequently used to guide the swarm particles.
Tan et al. [27] proposed an asynchronized learning PSO model, i.e., ALPSO, by incorporating DE, Simulated Annealing (SA) and helix search actions, for hybrid clustering and hyper-parameter fine-tuning in deep Convolutional Neural Networks (CNN) for skin lesion segmentation. Zhang et al. [33] proposed an Enhanced Sine Cosine Algorithm (SCA), which employed two randomly selected neighboring solutions and the Gaussian distribution-based search parameters for the diversification of the global best signal. Moreover, Jordehi [34] proposed an Enhanced Leader PSO (ELPSO) where a five-staged mutation mechanism (e.g., Gaussian, Cauchy and opposition-based mutations) was used for swarm leader enhancement to avoid premature convergence.
Kang et al. [35] proposed a modified PSO algorithm for optimal hyper-parameter selection of Gaussian process regression (GPR). Instead of using the inertial component as in PSO, a momentum element was proposed, which was based on the mean distance of the swarm in two successive iterations. Subsequently a mutation mechanism based on a perturbation function was proposed to further enhance the global best solution.
Yu et al. [36] developed an enhanced DE algorithm for tackling multi-objective optimisation problems. It incorporated a Gaussian mutation operator for the improvement of infeasible solutions as well as a standard DE/rand/1 operation for evolving feasible solutions according to their dominance relationships.
Cao et al. [37] integrated comprehensive learning PSO (CLPSO) with an adaptive local search starting strategy to solve multimodal and CEC 2013 benchmark functions, whereas Xu et al. [38] proposed an accelerated two-stage PSO (ATPSO) method with the employment of intra-cluster distance and intra-cluster cohesion measures as objective functions, respectively, for tackling complex clustering problems. Elbaz et al. [39] developed an improved PSO-adaptive neurofuzzy inference system (ANFIS) model for the prediction of shield performance during tunneling. An improved PSO method with an adaptive inertia weight and a constriction factor was employed for the optimisation of parameters in ANFIS. The empirical results indicated that this PSO-ANFIS model offered better prediction accuracy in comparison with those of ANFIS and GA-ANFIS. Elbaz et al. [40] proposed a GA-based evolving group method of data handling (GMDH)-type neural network (GMDH-GA) model for the prediction of disc cutter life during shield tunneling. GA was adopted to identify the optimal network configurations for the GMDH-type neural network.
Besides the aforementioned studies, there are other related investigations on diversity enhancement. Among them include genetic PSO (GPSO) with a crossover operator [41], a Bare-bones PSO variant (BBPSOVA) with repulsive operations and sub-swarm mechanisms [42], a Micro-GA PSO [43], a PSO with multiple sub-swarms for multimodal function evaluation (MFOPSO) [44], and a modified PSO method (MPSOELM) with time-varying adaptive acceleration coefficients for hyper-parameter optimisation pertaining to an Extreme Learning Machine (ELM) [45].

PSO for Feature Selection
Feature selection methods can be broadly divided into two categories, i.e., filter and wrapper. The filter approach ranks the features individually based on certain statistical criteria, such as the chi-square test [46] and mutual information [47]. The feature ranking scores indicate their relative importance to the problem. It is challenging to identify the cut-off point for selecting the most important features. Besides that, the individual-based ranking mechanisms are incapable of measuring the confounding effects of feature interactions and feature composition [1]. Instead of measuring the impact of individual features, the wrapper approach evaluates the quality of various feature subsets by taking feature interaction into account, with the learning algorithm wrapped inside. Therefore, the wrapper technique interacts with classifiers to capture feature dependencies.
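The wrapper principle can be illustrated with a small sketch that scores a binary feature mask using a hand-rolled 1-NN classifier and a size-penalised fitness. The 1-NN learner, the accuracy weight alpha = 0.9, and the penalty form are assumptions made for illustration only, not the exact objective used in this research.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_test, y_test, mask):
    """1-NN accuracy on the feature subset selected by the boolean `mask`."""
    if not mask.any():            # an empty subset selects nothing: score 0
        return 0.0
    Xtr, Xte = X_train[:, mask], X_test[:, mask]
    # pairwise Euclidean distances between test and training samples
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    pred = y_train[d.argmin(axis=1)]      # label of the nearest neighbour
    return float((pred == y_test).mean())

def wrapper_fitness(mask, data, alpha=0.9):
    """Illustrative wrapper objective: trade classification accuracy
    against the fraction of selected features (alpha is an assumption)."""
    acc = knn_accuracy(*data, mask)
    size_penalty = mask.sum() / mask.size
    return alpha * acc + (1 - alpha) * (1 - size_penalty)
```

In a PSO-based wrapper, each particle encodes one such mask, and `wrapper_fitness` plays the role of the objective function f(x) evaluated at every iteration.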
In addition, PSO and its variants have been widely employed as the search engines in wrapper-based feature selection methods, owing to their fast convergence speed and powerful discriminative search capabilities [3,4,10,42,43]. As an example, Gu et al. [48] proposed a Competitive Swarm Optimiser, i.e., CSO, to undertake high-dimensional feature selection tasks. In CSO, the swarm was randomly divided into two sub-swarms, and pairwise competitions were conducted between particles from each sub-swarm. The winning particle was passed on to the next generation, while the defeated particle updated its position by learning from the position of the winning particle in the cognitive component as well as the mean position of the swarm in the social component. The CSO model outperformed several existing algorithms with various initialisation strategies for discriminative feature selection.
Moradi and Gholampour [49] proposed a hybrid PSO variant, i.e., HPSO-LS, for feature selection by integrating a local search strategy into the original PSO model. Two operators, i.e., "Add" and "Delete", were employed to enhance the local search capability of PSO. Specifically, the "Add" operator inserted the dissimilar features into the particle, while the similar features were deleted from the particle by the "Delete" operator. Evaluated with 13 classification problems, HPSO-LS significantly outperformed a number of existing dimension reduction methods. Another hybrid PSO model, i.e., HPSO-SSM, was proposed by Chen et al. [19]. Specifically, the Logistic chaotic map was used to generate the inertia weight. Subsequently, two dynamic nonlinear correction factors were employed as the search parameters in the position updating operation. A spiral search mechanism was also incorporated to increase search diversity. Evaluated with 20 UCI datasets, HPSO-SSM outperformed several feature selection methods, such as CatfishBPSO (binary PSO with catfish effect). Tan et al. [50] proposed a hybrid learning PSO model, i.e., HLPSO, to identify the most discriminative elements from the shape, color, and texture features extracted from dermoscopic images for the identification of malignant skin lesions. HLPSO adopted three probability distributions, i.e., Gaussian, Cauchy, and Lévy distributions, to further enhance the top 50% promising particles. Modified FA and spiral search actions were also employed to guide the lower-ranking 50% particles. Moreover, Xue et al. [4] conducted a comprehensive review on the applications of PSO as well as other EC techniques for tackling feature selection problems. Table 1 depicts a detailed comparison between several existing studies (including the original PSO algorithm) and this research. The original PSO model employs a search process led by a single swarm leader.
Comparatively, both proposed PSOVA1 and PSOVA2 models employ multiple hybrid global optimal signals and a number of cooperative search operations to mitigate premature convergence. In particular, PSOVA2 employs versatile search operations with diverse specified sine, cosine, and hyperbolic tangent search trajectories to overcome stagnation. Both proposed models show superior capabilities in accelerating convergence while preserving diversity, in order to mitigate premature convergence.

Research Motivations
The research motivations of the proposed models are as follows. The classical PSO algorithm explores the search space by following a single leader and each particle's own personal best experience; it therefore lacks interactions with other neighboring elite solutions accumulated through coevolution. Owing to a monotonous search operation led by a single leader, the particle positions become increasingly similar over iterations. In this research, PSOVA1 is firstly proposed to enhance the local and global optimal signals through the use of neighboring historical best experiences. A set of effective cooperative search strategies is introduced to overcome the limitations of the original PSO algorithm, namely a modified PSO operation with rectified local and global best signals, spiral-based local exploitation, enhancement of the swarm leader and the worst solutions using Gaussian distributions, as well as mirroring and DE-based mutations.
Secondly, PSOVA2 is further proposed to enhance the best leader generation and the search operation embedded in PSOVA1. In particular, it employs an adaptive exemplar breeding mechanism incorporating multiple local and global best solutions to guide the search process. A new search action is also proposed by embedding diverse search coefficients yielded using sine, cosine, and hyperbolic tangent formulae. In comparison with PSOVA1, where the search is mainly based on a modified PSO operation, the aforementioned new search operations equip the search process with a variety of distinctive search behaviors and irregular search trajectories. In short, the search mechanisms in PSOVA1 and PSOVA2 work in a collaborative manner to increase search diversity and mitigate premature convergence.
Moreover, most of the aforementioned existing PSO variants employed only the single global best solution [19,22,24,31,32,34,35,39,41,44,45,51] to guide the search process. In addition, except for a few studies such as Lin et al. [32], Srisukkham et al. [42], Tan et al. [27], and Yu et al. [36], other existing work did not adopt any exemplar breeding strategies to enhance the optimal signals or generate hybrid leaders. Although some studies adopted diverse search mechanisms [19,24,27,33,42,52], the search processes in many existing studies [22,31,32,34-36,39,41,44,45,51] are mainly conducted using a single position updating formula. Therefore, they are more likely to suffer from premature convergence. In comparison with these existing methods, the proposed PSOVA1 and PSOVA2 models employ exemplar breeding mechanisms as well as multiple global best signals to lead the search process and avoid local optima traps. A number of position updating operations (such as local and global based search actions) are embedded in both models. When a certain search operation becomes stagnant (e.g., the global search in PSOVA1 or sine-based search in PSOVA2), the proposed models are able to adopt an alternative search action (e.g., local search in PSOVA1 or cosine-based search in PSOVA2) to drive the search out of stagnation. In addition, swarm leader and worst solution enhancement is also conducted in both methods to reduce the probabilities of being trapped in local optima. The proposed search strategies in both models work cooperatively to overcome premature convergence and increase the chances of finding global optimality.

The Proposed PSOVA1 Model
In this research, we propose two PSO variants for feature selection, which aim to overcome two major shortcomings of the original PSO model, i.e., premature convergence and weak local exploitation near the optimal solutions [4,22]. We introduce the first proposed PSO model, i.e., PSOVA1, in this section. Specifically, the proposed PSOVA1 model employs four major strategies, including (1) Gaussian distribution-based swarm leader improvement, (2) DE and mirroring schemes for worst solution enhancement, (3) a modified PSO position updating strategy based on ameliorated pbest and gbest, and (4) spiral based local exploitation. The implementation of these four mechanisms is able to increase population and search diversity, therefore increasing the likelihood of attaining global optimality as compared with the original PSO algorithm.
The novel aspects of the proposed PSOVA1 model are presented below. Firstly, we propose a modified PSO operation where the rectified forms of gbest and pbest, as well as the Logistic map-oriented chaotic inertia weight are used to increase global exploration. In particular, the personal and global best signals in the search operation are further enhanced using remote and randomly selected promising neighboring solutions to overcome stagnation. Secondly, a logarithmic spiral search mechanism oriented by gbest is used to intensify local exploitation. A dynamic switching probability is designed to enable the search process to balance between the aforementioned global (first) and local (second) search operations. Thirdly, Gaussian distribution is used to enhance the swarm leader. It enables gbest to conduct local exploitation to avoid being trapped in local optima. Then, the mirroring and DE-based mutation operations are employed to improve the three weakest particles in the swarm. The details of the proposed PSOVA1 model are illustrated in Algorithm 1.
Overall, the Gaussian distribution based gbest enhancement, the mutation strategies for enhancement of the worst solutions, exploration schemes assisted by ameliorated gbest and pbest, as well as the intensified fine-tuning capability using the spiral search operation, cooperate with and benefit from each other to effectively avoid being trapped in local optima and increase the likelihood of attaining global optimality. We introduce each of the four proposed strategies in detail below.
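The interplay between the global search operation and the spiral-based local exploitation can be sketched as follows. The logarithmic-spiral form used here is the one popularised by the Whale Optimisation Algorithm, and the linearly decreasing switching schedule is an assumption for illustration; the paper's exact formulae are given later in Equation (11) and the related definitions.

```python
import numpy as np

rng = np.random.default_rng(7)

def spiral_exploitation(x, gbest, b=1.0):
    """Move a particle around gbest along a logarithmic spiral, using the
    assumed form x' = |gbest - x| * exp(b*l) * cos(2*pi*l) + gbest."""
    l = rng.uniform(-1.0, 1.0, size=x.shape)   # random spiral parameter
    d = np.abs(gbest - x)                      # distance to the swarm leader
    return d * np.exp(b * l) * np.cos(2.0 * np.pi * l) + gbest

def switching_probability(t, t_max):
    """An assumed linearly decreasing schedule: favour the global search
    operation early in the run and spiral exploitation later."""
    return 1.0 - t / t_max
```

At each iteration a particle would draw a random number and perform the global search if it falls below `switching_probability(t, t_max)`, otherwise the spiral move, so exploration dominates early and fine-tuning dominates late.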

A Swarm Leader Enhancing Mechanism
In the context of feature selection, both the elimination of critical features and inclusion of contradictory attributes can impose significant consequences on classification performance. Therefore, a swarm leader enhancing mechanism using the skewed Gaussian distributions is proposed to equip gbest with further discriminative capabilities to overcome local optima traps. Such Gaussian distributions and random walk strategies have also been widely adopted in existing studies for leader or swarm enhancement [33,34,36,50,55]. As shown in Equation (3), gbest is mutated successively based on three Gaussian distributions with different skewness settings. Specifically, on the basis of the gbest solution, the Gaussian distribution with a positive skewness (right-skewed) is likely to eliminate noisy or irrelevant features, while the operation with a negative skewness (left-skewed) is more inclined to include more discriminative features. In addition, the standard Gaussian distribution (non-skewed) is employed to conduct local exploitation of gbest with neutrality in determining the feature numbers [34,55,56].
gbest'_d = gbest_d + α (U_d − L_d) G_h(0, 1)    (3)

where gbest'_d represents the enhanced global best solution in the dth dimension, and G_h(0, 1) denotes a sample drawn from the Gaussian distribution with skewness h. Parameter α denotes the step size, and is assigned as 0.1 based on the recommendation of related studies [56]. Parameter h represents the skewness of the Gaussian distribution, and is set as −1, 1, and 0 for the left-, right- and non-skewed Gaussian distributions, respectively, based on extensive trial-and-error processes. Besides that, U_d and L_d represent the upper and lower boundaries of the dth dimension, respectively. The new solution generated by the Gaussian distribution is used to replace gbest if it is fitter.

Algorithm 1. The pseudo-code of the proposed PSOVA1 model.

Start
    Initialise a particle swarm using the Logistic chaotic map;
    Evaluate each particle using the objective function f(x), and identify the pbest solution of each particle and the global best solution, gbest;
    Construct a Worst_memory, which stores the three weakest particles with the lowest fitness values, and identify the worst solution as gworst;
    While (termination criteria are not met)
        Conduct swarm leader enhancement using the Gaussian distributions defined in Equation (3); use the new solution to replace gbest if it is fitter;
        For (each particle i in the population) do
            If (particle i belongs to Worst_memory)
                If (particle i is gworst)
                    Construct an offspring solution by employing the local mutation operation based on gbest, as defined in Equation (4);
                Else
                    Construct an offspring solution by employing the DE-based mutation operation based on three randomly selected pbest solutions, as defined in Equations (5)-(6);
                End If
                Evaluate the offspring solution and update the position of particle i in Worst_memory based on the annealing schedule defined in Equation (7);
                Update the pbest and gbest solutions;
            End If
        End For
        For (each particle i in the population) do
            If (the global search operation is selected according to the dynamic switching probability)
                Establish a memory of group_i, which includes all neighboring pbest solutions with fitness scores higher than or equal to that of the pbest solution of the current particle i, i.e., pbest_i;
                Identify the neighboring fitter pbest solution in group_i with the highest degree of dissimilarity to gbest, denoted as pbest_D;
                Calculate the ameliorated gbest solution, i.e., gbest_M, by averaging pbest_D and gbest, as indicated in Equation (8);
                Randomly select another neighboring fitter pbest solution from group_i, denoted as pbest_R;
                Calculate the ameliorated pbest solution, i.e., pbest_M, by averaging pbest_R and the personal best solution of particle i, pbest_i, as shown in Equation (9);
                Conduct position updating using gbest_M and pbest_M for particle i, as defined in Equation (10);
            Else
                Move particle i around gbest by following a logarithmic spiral search path, as shown in Equation (11);
            End If
        End For
    End While
End

Mutation-Based Worst Solution Enhancement
We subsequently enhance the weak particles in the swarm by conducting the mirroring mutation on the swarm leader and a DE-based operation on the local elite solutions.
Firstly, a gbest-based local mutation scheme is proposed to enhance the global worst solution in the swarm. As shown in Equation (4), the new particle is produced by mirroring, i.e., reversing the sign of, gbest with a mutation probability, r_mu, in each dimension. This simulates the effects of randomly activating or de-selecting some of the features on the basis of the current best feature subset represented by gbest. In short, the gbest-based local mutation scheme strikes a balance between preserving the effective information captured by the current gbest solution and introducing stochastic perturbations to create new momentum for the newly generated solution. Such mirroring actions were also widely adopted in existing studies [34,57] to increase population diversity:

x_new^d = −gbest_d, if r_d ≥ r_mu; gbest_d, otherwise    (4)

where r_mu represents the mutation probability, and is set to 0.9 based on trial-and-error and recommendations in related studies [56], while r_d is a randomly generated value in [0, 1]. When the randomly generated value is larger than or equal to r_mu, the new offspring is assigned the value of the mirrored gbest solution in the dth dimension; otherwise, it is assigned the value of the gbest solution in that dimension. This operation yields a new offspring solution to replace the worst particle in the swarm, if it is fitter.

Secondly, a DE-based mechanism is proposed to improve the second and third worst individuals in the swarm. It produces new particles by following the mutation and crossover operations of DE using three pbest solutions randomly selected from the collection of all pbest individuals in the swarm, as shown in Equations (5) and (6). The differential weight, F, in Equation (5) is generated using the Sinusoidal chaotic map, in order to increase the variety of perturbations for the donor vector, x_donor^d, in each dimension.
Furthermore, the crossover parameter, c_r, is generated by the Logistic chaotic map to introduce more randomness into the crossover process in each dimension and exploit more feature interactions on a global scale. When a randomly generated value is larger than c_r, the current dimension in the new solution is inherited from the corresponding dimension of the personal best solution; otherwise, it is inherited from that of the newly generated donor solution. Owing to the adoption of several distinctive personal best solutions in the search operations, this DE-based mutation operation is able to increase population diversity significantly when the pbest solutions of the particles show sufficient variance from one another in the early search stage:

x_donor^d = pbest_1^d + F (pbest_2^d − pbest_3^d)    (5)

x_new^d = pbest_i^d, if r_d > c_r; x_donor^d, otherwise    (6)

where pbest_1^d, pbest_2^d, and pbest_3^d represent three randomly selected pbest solutions of the swarm particles in the dth dimension, while pbest_i represents the pbest solution of the current particle i. x_donor^d and x_new^d denote the donor and new solutions in the dth dimension, respectively. In addition, F and c_r represent the differential weight and crossover factor, respectively.
The newly generated fitter solution is accepted directly, while the acceptance of a weaker mutated solution is determined by an annealing schedule, as defined in Equation (7) [56]: where T represents the temperature for controlling the annealing process, and ∆f indicates the fitness difference between the mutated and original solutions. δ is a randomly generated value in the range of [0, 1]. A geometric cooling schedule is employed to decrease the temperature, i.e., T = σT, where σ is set to 0.9 according to [56].
The two mutation operations based on the DE and gbest mirroring operations operate in parallel, in order to improve the weak particles in the swarm.
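The two worst-solution improvement operations above can be sketched as follows. This is a minimal illustration, not the paper's exact Equations (4)-(7): the chaotic-map constants (2.3 for the Sinusoidal map, 4.0 for the Logistic map), the initial chaotic states, and the helper names are assumptions.

```python
import math
import random

def sinusoidal_map(x):
    # Sinusoidal chaotic map, a common form: x' = 2.3 * x^2 * sin(pi * x)
    return 2.3 * x * x * math.sin(math.pi * x)

def logistic_map(x, mu=4.0):
    # Logistic chaotic map: x' = mu * x * (1 - x)
    return mu * x * (1.0 - x)

def mirror_mutation(gbest, r_mu=0.9):
    """gbest-based local mutation: each dimension takes the mirrored
    (sign-reversed) gbest value when rand >= r_mu, otherwise the gbest value."""
    return [-g if random.random() >= r_mu else g for g in gbest]

def de_offspring(pbest_pool, pbest_i, f_state, cr_state):
    """DE-style mutation and crossover over three randomly selected pbest
    solutions, with chaotic F and cr regenerated in each dimension."""
    p1, p2, p3 = random.sample(pbest_pool, 3)
    new = []
    for d in range(len(pbest_i)):
        f_state = sinusoidal_map(f_state)            # chaotic differential weight F
        cr_state = logistic_map(cr_state)            # chaotic crossover rate cr
        donor = p1[d] + f_state * (p2[d] - p3[d])    # donor vector
        # rand > cr: inherit from the particle's own pbest; else from the donor
        new.append(pbest_i[d] if random.random() > cr_state else donor)
    return new, f_state, cr_state

def accept(delta_f, temperature):
    """Annealing acceptance: a weaker offspring (delta_f > 0) is accepted
    with probability exp(-delta_f / temperature)."""
    return delta_f < 0 or random.random() < math.exp(-delta_f / temperature)
```

In a full implementation, the mirrored offspring would replace the worst particle only if fitter, and the DE offspring for the second and third worst particles would pass through the annealing test, with the temperature decayed by `T = 0.9 * T` each iteration.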

Diversity-Enhanced PSO Evolving Strategy
In order to address stagnations in the original PSO model, we construct two distinctive search mechanisms, i.e., a modified PSO search strategy and an intensified spiral exploitation action, to increase diversification and intensification. A dynamic switching probability schedule is also proposed to achieve the best trade-off between both mechanisms and exploit the merits of both search operations to the maximum extent.
We firstly upgrade the position updating strategy in the original PSO operation by introducing ameliorated p best and g best , combined with the Logistic chaotic map, to enhance search diversity. As indicated in Equation (8), the global best experience is ameliorated by adopting the mean position of two solutions, i.e., the g best solution and a neighboring superior p best solution, i.e., pbest D , possessing the highest degree of dissimilarity to g best . The dissimilarity measure between g best and any p best solution is determined by the number of distinctive units in their binary forms, which are converted by following the existing studies [10]. In other words, the p best solution that has the least number of shared selected features in comparison with those recommended by g best is selected as pbest D . Moreover, as defined in Equation (9), the local best experience is ameliorated by adopting the mean position of the particle's own p best and another randomly chosen superior p best solution, i.e., pbest R , in the neighborhood. Equation (10) is used to conduct position updating, which employs the enhanced global and local optimal signals defined in Equations (8) and (9), respectively: where pbest D represents the p best solution with the highest degree of dissimilarity to g best among all neighboring superior p best solutions, while pbest R represents a randomly chosen p best solution. Moreover, gbest M and pbest M represent the enhanced global and local optimal indicators in the proposed position updating strategy, respectively, while σ represents the inertia weight generated by the Logistic chaotic map.
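The construction of the rectified signals in Equations (8) and (9) can be sketched as below. This is a simplified illustration: the binarisation threshold and the helper names are assumptions, and the restriction to *superior* (fitter) neighboring pbest solutions is omitted for brevity.

```python
import random

def hamming_dissimilarity(a, b, threshold=0.5):
    """Dissimilarity = number of dimensions whose binarised values differ
    (an assumed 0.5 threshold converts continuous positions to binary)."""
    return sum((x > threshold) != (y > threshold) for x, y in zip(a, b))

def ameliorated_signals(gbest, pbest_all, pbest_i):
    """Build the rectified optimal signals: gbest_M averages gbest with the
    most dissimilar pbest (Eq. (8)), and pbest_M averages the particle's own
    pbest with a randomly chosen pbest (Eq. (9))."""
    pbest_D = max(pbest_all, key=lambda p: hamming_dissimilarity(gbest, p))
    gbest_M = [(g + p) / 2.0 for g, p in zip(gbest, pbest_D)]
    pbest_R = random.choice(pbest_all)
    pbest_M = [(a + b) / 2.0 for a, b in zip(pbest_i, pbest_R)]
    return gbest_M, pbest_M
```

The two averaged signals then take the places of gbest and the particle's own pbest in the velocity update of Equation (10).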

An Intensified Spiral Exploitation Scheme
An intensified spiral exploitation scheme is introduced to overcome the limitations of the fine-tuning capability of the original PSO algorithm in the near optimal regions. The logarithmic spiral search was originally proposed in the MFO algorithm [11]. We employ this spiral operation to fine-tune the swarm particles in the final iterations. By conducting this local spiral search action, a hyper-ellipse search space around g best is constructed on each dimension using the spiral function, as defined in Equations (11) and (12) [11]. As a result, the exploitation around the near-optimal solutions can be significantly intensified: where D denotes the distance between g best and particle i in the dth dimension, b is a constant controlling the shape of the logarithmic spiral, and l is a random number in the range of [-1, 1].
Moreover, we propose a dynamic switching probability schedule with the aim to achieve a trade-off between global exploration and local exploitation in the PSOVA1 model, as demonstrated in Equation (13): where p switch denotes the switching probability, while iter and Max_iter represent the current and maximum iterations, respectively. In each iteration, when the switching probability p switch is higher than a randomly generated value in the range of [0, 1], i.e., p switch > rand, the modified PSO operation discussed in Section 3.3 is conducted. Otherwise, the spiral search action depicted in this section is conducted. In general, the proposed dynamic schedule of p switch not only ensures sufficient global exploration to identify the promising regions in the early search stage, but also guarantees thorough exploitations in the near optimal region before converging in the final iterations.
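The spiral exploitation and the switching schedule can be sketched together as follows. The spiral move follows the MFO-style formula from Equations (11) and (12); the exact form of Equation (13) is not reproduced in the text, so a linearly decreasing schedule is assumed here purely for illustration.

```python
import math
import random

def spiral_step(x_i, gbest, b=1.0):
    """Logarithmic spiral move around gbest (MFO-style, Eqs. (11)-(12)):
    x_new_d = D_d * exp(b * l) * cos(2 * pi * l) + gbest_d, l ~ U[-1, 1]."""
    new = []
    for xd, gd in zip(x_i, gbest):
        D = abs(gd - xd)                     # distance to gbest in dimension d
        l = random.uniform(-1.0, 1.0)
        new.append(D * math.exp(b * l) * math.cos(2.0 * math.pi * l) + gd)
    return new

def switch_probability(it, max_it):
    """An assumed linearly decreasing schedule for p_switch: favour the
    modified PSO operation (exploration) early, spiral exploitation late."""
    return 1.0 - it / max_it
```

In each iteration, when `switch_probability(it, max_it) > random.random()`, the modified PSO operation is applied; otherwise the particle performs the spiral move around gbest.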

The Proposed PSOVA2 Model
We further enhance the PSOVA1 model by incorporating new search actions accompanied with diverse nonlinear search trajectories to extend search territory. Specifically, we propose four new strategies in PSOVA2 to refine the transition between search diversity and swarm convergence, i.e., (1) an adaptive exemplar breeding mechanism incorporating multiple local and global best solutions, (2) search coefficient generation using sine, cosine, and hyperbolic tangent functions, (3) worst solution enhancement using a hybrid re-dispatching scheme, and (4) an exponential exploitation mechanism for swarm leader improvement.
PSOVA2 further strengthens PSOVA1 by providing new search mechanisms for best leader generation and position updating. The novel aspects of the proposed PSOVA2 model are as follows. Firstly, an adaptive exemplar breeding mechanism is proposed, which produces a new exemplar by incorporating multiple local and global best solutions to guide the search process. On top of this, a new search action is proposed by embedding diverse search coefficients yielded using sine, cosine, and hyperbolic tangent formulae. In comparison with PSOVA1, in which the search principally relies on a modified PSO operation, the aforementioned new search operations equip the search process with a variety of distinctive search behaviors and irregular search trajectories. In addition, scattering and random permutations of the p best solutions are incorporated for enhancement of the worst solutions. An adaptive exponential search flight is also used for swarm leader improvement. These new strategies demonstrate great capabilities in accelerating convergence while preserving search diversity. The pseudo-code of PSOVA2 is provided in Algorithm 2. We introduce each proposed strategy in the following subsections.

A New Attraction Operation with Differentiated Search Trajectories
Firstly, a new search operation is proposed. It includes an exemplar breeding strategy and a search coefficient generation scheme using four nonlinear formulae. Equation (14) defines the proposed search action: where f s denotes a search coefficient generated by customized sine, cosine, and hyperbolic tangent functions, and x target represents a target optimal indicator such as the exemplar or the swarm leader. Gaussian (t) indicates a random walk following a Gaussian distribution. Equation (14) is used for position updating in PSOVA2. We introduce the exemplar breeding and nonlinear search coefficient generation in detail in Sections 4.1.1 and 4.1.2, respectively.
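One plausible reading of the attraction operation can be sketched as below. Equation (14) is not reproduced in the text, so the exact combination of the coefficient, the target signal, and the Gaussian term is an assumption; only the ingredients (f s, x target, and a Gaussian random walk) are taken from the description above.

```python
import random

def attraction_update(x_i, x_target, f_s):
    """Assumed sketch of Eq. (14): move each dimension towards the target
    optimal signal (the exemplar or gbest), scaled by the nonlinear
    coefficient f_s, plus a Gaussian random-walk perturbation."""
    return [xd + f_s * (td - xd) + random.gauss(0.0, 1.0)
            for xd, td in zip(x_i, x_target)]
```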
Algorithm 2. The pseudo-code of the proposed PSOVA2 model.

1 Start
2 Initialise a particle swarm using the Logistic chaotic map;
3 Evaluate each particle using the objective function f (x) and identify the p best solution of each particle, and the global best solution, gbest;
4 While (termination criteria are not met)
5 {
6 Conduct swarm leader enhancement as defined in Equations (26)-(27);
7 Implement the worst solution enhancement as defined in Equations (23)-(25);
8 For (each particle i in the population) do
9 {
10 Construct a breeding exemplar as defined in Equations (15)-(18);
…
16 Choose the breeding exemplar as the target signal for position updating;
17 Else
18 Choose the gbest solution as the target signal for position updating;
19 } End If
20 Update the position of particle i on dimension j as defined in Equation (14);
…

An Adaptive Exemplar Breeding Mechanism

Instead of completely following the g best solution over the search course as in the original PSO algorithm, an adaptive exemplar generation scheme is proposed. It incorporates two adaptive operations for exemplar generation, i.e., (1) stochastic recombination and dynamic incorporation of different numbers of the elite p best solutions, and (2) adaptive weight generation to attenuate the impact imposed by the p best solutions while amplifying the influence of the g best solution over the search course. Specifically, an exemplar is generated through the proposed breeding mechanism between the p best and g best solutions for each particle in three steps. Firstly, a predefined number of the p best solutions (i.e., three or fewer) are randomly selected, and then aggregated into one offspring solution by multiplying random but normalized weights on each dimension, as illustrated in Equation (15). Secondly, the adaptive weights governing the priority of the aggregated offspring and the g best solution during the breeding operation are generated by two proposed mathematical formulae defined in Equations (16) and (17). Figure 1 presents a visualization of the adaptive weight generation defined in Equations (16) and (17). Lastly, the exemplar solution is produced by conducting weighted aggregation between the g best solution and the offspring solution yielded by Equation (15) in each dimension, as defined in Equation (18): where x offspring and x exemplar represent the offspring solution generated from randomly sampled p best solutions and the obtained exemplar solution through the breeding mechanism, respectively, while m 1 and m 2 represent the adaptive weights for g best and x offspring , respectively. Parameters c 1 , c 2 , and c 3 possess randomly generated scores within [0, 1].
Figure 1. Adaptive coefficients for the g best solution (blue) and the p best signal (red) for exemplar generation (where the x axis denotes a randomly generated value between 0 and 1, and the y axis denotes the weight parameters, i.e., m 1 and m 2 defined in Equations (16) and (17)).
Specifically, we prescribe a decreasing process to control the number of selected p best solutions for exemplar generation. It starts with three p best solutions being randomly selected for breeding, then eliminates one in every 25 iterations. As a result, four different cases are produced through the iterative process for p best selection, i.e., 3 for iterations [1,25], 2 for iterations [25,50], 1 for iterations [50,75], and 0 for iterations [75,100]. At the beginning of the search process, the higher number of selected p best solutions aims to introduce more perturbation to g best during breeding, owing to the higher degree of optimal signal diversity and lower similarity among the p best solutions. This can further facilitate the exploration of previously unexploited search territory. As the selected p best solutions are eliminated through the iterative process, and as the similarity among elite solutions rises owing to a gradually converging population, the disturbance produced by breeding on g best becomes less significant as compared with that from the early stages. Therefore, the search becomes more accelerated through the incorporation of elite genes from g best , while maintaining the necessary level of diversity owing to distinctive elements from the recombination effect among the p best solutions. When the search comes to the final stage, none of the p best solutions are selected. As such, the exemplar becomes identical to the g best solution, in order to facilitate exploitation around the most optimal regions. As a result, the exploration at the early stage is intensified, and search diversity can be maintained through a dynamic incorporation of the p best solutions.
In addition to the above proposed mechanisms, we introduce two adaptive trajectories for regulating the impact of the g best and p best signals during breeding over the entire iterative process, as illustrated in Figure 1. The weighting factor of the g best signal (m 1 ) keeps increasing from 0.4 to approximately 1 as the iteration increases, while that of the p best signal (m 2 ) keeps decreasing from 0.4 to 0. Moreover, the slopes for both adaptive trajectories change slowly at the beginning of the iteration, and then gradually ascend as the number of iterations increases. As such, the impact of the p best indicators would not diminish too fast, in order to maintain diversity. At the same time, the influence of the g best solution becomes strengthened over iterations, in order to accelerate convergence. In other words, the proposed adaptive search trajectories enable the exemplar to conduct more exploration attempts in the early stage by receiving significant and diverse influence from the p best signals while ensuring a thorough exploitation around the promising regions in the final stage by receiving a dominant impact from the g best solution. As a result, the proposed trajectories are capable of accelerating convergence while preserving diversity.
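The breeding steps above can be sketched as follows. The elimination schedule (3, 2, 1, 0 pbest solutions over quarters of the run) and the per-dimension normalized random weights follow the description; the exact adaptive trajectories of Equations (16) and (17) are not reproduced in the text, so `m1` below is an assumed power curve that merely reproduces the described behaviour (rising from 0.4 towards 1, with a slope that steepens over iterations).

```python
import random

def breed_exemplar(gbest, pbest_pool, it, max_it=100):
    """Adaptive exemplar breeding (sketch of Eqs. (15)-(18)): aggregate up to
    three random pbest solutions with normalised random weights per dimension,
    then blend the offspring with gbest using adaptive weights m1 and m2."""
    k = max(0, 3 - int(it / (max_it / 4)))      # 3 -> 2 -> 1 -> 0 pbest parents
    m1 = 0.4 + 0.6 * (it / max_it) ** 2         # assumed trajectory for gbest weight
    m2 = 1.0 - m1                               # weight for the aggregated offspring
    if k == 0:
        return list(gbest)                      # final stage: exemplar == gbest
    parents = random.sample(pbest_pool, k)
    exemplar = []
    for d in range(len(gbest)):
        w = [random.random() for _ in range(k)]
        s = sum(w)
        offspring_d = sum(wi / s * p[d] for wi, p in zip(w, parents))  # Eq. (15)
        exemplar.append(m1 * gbest[d] + m2 * offspring_d)              # Eq. (18)
    return exemplar
```

Because the offspring is a convex combination of the pbest parents and the exemplar is a convex combination of that offspring and gbest, the exemplar always stays within the hull of the chosen elite positions.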
The generated exemplar is subsequently used to guide the search operation. To further increase diversification and avoid stagnation, we employ diverse search coefficients yielded using sine, cosine, and hyperbolic tangent functions, which are explained in detail in the next section.

Nonlinear Search Coefficient Generation
We further devise four nonlinear functions for coefficient generation in Equation (14). The objective is to conduct distinctive yet complementary search operations around the exemplar and the g best solution, in order to further increase diversity and overcome stagnation. The proposed coefficient generation functions are presented in Equations (19)-(22) and plotted in Figure 2. In general, the first two functions, i.e., f 1 and f 2 , enable the particles to jump randomly in all directions around the destination optimal signal. The other two functions, i.e., f 3 and f 4 , allow the particles to approach the optimal indicator with various speeds and intensities. Specifically, as illustrated in Figure 2 (blue line), f 1 takes a hyperbolic tangent formula, 2/3 × tanh(2x − 1/2), as defined in Equation (19). It increases in a gradual manner in the range of approximately [−0.3, 0.5]. It enables the particles to deploy a thorough exploration around the target optimal signal in two ways, i.e., approaching it slowly when positive values are generated and distancing from it mildly when negative values are yielded. In contrast, as illustrated in Figure 2 (red line), f 2 takes a sin (cos(2π × rand 2 )) formula, as defined in Equation (20). Compared with the other three functions, it changes more rapidly, with a wider range of approximately [−0.9, 0.9] for coefficient generation. It enables the particles to perform larger jumps to escape from local stagnation.


Figure 2. The four proposed coefficient generation functions, i.e., f 1 (blue), f 2 (red), f 3 (yellow), and f 4 (purple), as defined in Equations (19)-(22) (where the x axis denotes a randomly generated value between 0 and 1, and the y axis denotes the generated search coefficient used in Equation (14)).

As illustrated in Figure 2 (yellow line), f 3 takes a cos (sin(π/2 × rand 2 )) formula, as defined in Equation (21). It constantly maintains a high plateau in the range of [0.5, 1]. It regulates the particles to march towards the target optimal solution with a large step, in order to accelerate convergence. On the contrary, as indicated in Figure 2 (purple line), f 4 takes a cos −1 (cos(π/4 × x 2 )) formula, as defined in Equation (22). It increases in a gradual manner in the range of [0, 0.7]. It enables the particles to deploy an intensive exploitation in the promising regions. In each iteration, each particle chooses from the aforementioned four coefficient generation strategies with an equal probability to maintain search diversity. In comparison with standard sine, cosine, and hyperbolic tangent functions, the proposed refined formulae offer more erratic and irregular search trajectories:
where x is a randomly generated value within [0, 1], while f 1 , f 2 , f 3 , and f 4 are the four coefficient generation functions (i.e., specified sine, cosine, and hyperbolic tangent functions). The generated coefficients are used as search parameters f s in the position updating procedure as defined in Equation (14).
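The four generators can be sketched directly from the formulae above. The argument of the tanh term in f 1 is reconstructed from a garbled source and only approximates the stated [−0.3, 0.5] range; the other three match the stated ranges.

```python
import math
import random

def f1(x):
    # Hyperbolic tangent (Eq. (19)): mild approach/retreat, roughly [-0.3, 0.5]
    return (2.0 / 3.0) * math.tanh(2.0 * x - 0.5)

def f2(x):
    # Sine-cosine composite (Eq. (20)): large jumps, roughly [-0.9, 0.9]
    return math.sin(math.cos(2.0 * math.pi * x ** 2))

def f3(x):
    # High plateau in [0.5, 1] (Eq. (21)): large steps towards the target
    return math.cos(math.sin(math.pi / 2.0 * x ** 2))

def f4(x):
    # Gradual rise in [0, ~0.7] (Eq. (22)): intensive local exploitation
    return math.acos(math.cos(math.pi / 4.0 * x ** 2))

def random_coefficient():
    """Each particle picks one of the four generators with equal probability."""
    return random.choice([f1, f2, f3, f4])(random.random())
```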
As discussed earlier, we compose these four distinctive coefficient generation strategies in a complementary manner as an effort to enhance search diversity. When the particles are trapped at local optima, large jumps and reverse directions are able to drive the search out of stagnation. On the other hand, minor movements become dominant when a detailed, near optimal exploitation is needed. Moreover, in the entire input range, the generated coefficients from these four functions are always dominated by positive signals, i.e., at least three positive outputs among four, which are able to lead the swarm to the promising regions in an accelerated manner.
The parameter generation strategies are incorporated with the proposed exemplar breeding scheme to leverage their respective advantages, i.e., diversification of the movement strategies and the destination signals. Specifically, in every position updating process, each particle is able to choose one coefficient generation function from the proposed four formulae randomly. Then, in each dimension, the particle is able to choose one best signal to follow from the breeding exemplar and the g best solution.
To be specific, the four coefficient generation strategies possess equal probabilities to be chosen for each particle. Note that g best is allocated a higher probability to be chosen when updating the particle position in each dimension, as shown in lines 8-22 in Algorithm 2. A threshold of 0.4 is determined based on trial-and-error. Such a setting is able to achieve a reasonable balance between introducing a proper perturbation and inheriting benign signals from the swarm leader.

A Hybrid Re-Dispatching Scheme for Enhancement of the Worst Solutions
To further accelerate convergence, we enhance several worst solutions by diverting such solutions to the optimal regions using a hybrid dispatching scheme. Specifically, we enhance the worst solutions by exploiting the personal best solutions as well as stochastic disturbance induced by random initialisation. As shown in Equations (23) and (24), two donor vectors, denoted as x donor1 and x donor2 , are generated by random initialisation and random permutations of the p best solutions, respectively. In particular, the element on each dimension of x donor2 is obtained by inheriting the value from the corresponding dimension of a randomly selected p best solution. A random number is generated for each dimension as the determinant for the hybridization process, as shown in Equation (25). The element is inherited from the corresponding dimension of x donor1 when the determinant is less than or equal to 0.5. Otherwise, it inherits the corresponding element from x donor2 .
where x donor1 and x donor2 represent the donor vectors generated by random initialisation and random selection from the p best solutions, respectively, while β is a random number within [0, 1]. pbest d random denotes a randomly selected personal best solution in the dth dimension. This worst solution enhancement procedure is conducted three times to generate three offspring x new d solutions, which subsequently replace the three worst particles with the lowest fitness scores in the swarm.
Compared with a complete random initialisation, this hybridization scheme is capable of enhancing the worst solutions by exploiting elite genes from the population to accelerate convergence.
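The re-dispatching step can be sketched as follows; the helper name and the [lower, upper] bounds are assumptions, while the two donors and the 0.5 threshold follow Equations (23)-(25).

```python
import random

def redispatch(pbest_pool, dim, lower=0.0, upper=1.0):
    """Hybrid worst-solution replacement (sketch of Eqs. (23)-(25)):
    blend a randomly initialised donor with a donor assembled per-dimension
    from randomly chosen pbest solutions, selecting between them with a
    0.5 threshold on a fresh random number for each dimension."""
    donor1 = [random.uniform(lower, upper) for _ in range(dim)]   # Eq. (23)
    donor2 = [random.choice(pbest_pool)[d] for d in range(dim)]   # Eq. (24)
    return [donor1[d] if random.random() <= 0.5 else donor2[d]    # Eq. (25)
            for d in range(dim)]
```

Calling this three times yields the three offspring that replace the three lowest-fitness particles.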

Swarm Leader Enhancement Using an Adaptive Exponential Search Flight
In addition, we propose an exponential function to generate random search steps for enhancement of the swarm leader, as defined in Equation (26).
As depicted in Figure 3, the generated step g is confined within [−0.5, 0.5] with an input value between [0, 1]. As a result, the small magnitude of the steps enables the swarm leader to examine its vicinity thoroughly from all directions, in order to discover a better position to further improve its quality. Equation (27) is used to generate an offspring solution of gbest using the newly generated search step g.
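A sketch of the leader refinement is given below. The exact exponential mapping of Equation (26) is not reproduced in the text, so the normalised-exponential form here is an assumption chosen only to match the stated [−0.5, 0.5] output range; the additive perturbation stands in for Equation (27).

```python
import math
import random

def leader_step(x):
    """Assumed exponential mapping of x in [0, 1] to a step g in [-0.5, 0.5]
    (one plausible form of Eq. (26))."""
    return (math.exp(x) - 1.0) / (math.e - 1.0) - 0.5

def refine_leader(gbest):
    """Offspring of gbest via a small random step in every dimension
    (a sketch of Eq. (27) as an additive perturbation)."""
    return [g + leader_step(random.random()) for g in gbest]
```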

Figure 3. The governing function for generating the random step g (where the x axis represents a randomly generated value between 0 and 1, while the y axis signifies the search step g, as defined in Equation (26)).

Overall, the proposed PSOVA2 model incorporates the aforementioned four major improvements to further enhance search dynamics and diversity. They include an adaptive exemplar breeding mechanism, search coefficient generation using nonlinear functions, and exponential exploitation and re-dispatching schemes for swarm leader and worst solution enhancement, respectively. These strategies account for the efficiency of PSOVA2 in accelerating convergence while maintaining diversity.
Both proposed PSO variants are integrated with a K-Nearest-Neighbor (KNN) classifier to conduct fitness evaluation during the search process. Equation (28) [3,4,42,43] defines the objective function, which is used to assess the fitness of each particle: where k 1 and k 2 denote the weights of classification accuracy and the number of selected features, respectively. We assign k 1 = 0.9 and k 2 = 0.1 by following the recommendations in previous studies [3,4,42,43]. The optimisation objective of the proposed PSO models is to identify the most discriminative feature subset from a given database. The fitness function aims to maximize the classification accuracy rate while reducing the number of selected features.

The search process for the most significant feature subset is conducted as follows. The particles are initialised with continuous values in each dimension using the Logistic map at the beginning of the search process. Each particle represents an initial, randomly assigned feature subset, where the particle dimension equals the number of features in a given dataset. During fitness evaluation, we convert each element of each particle into a binary value, i.e., 1 or 0, representing the selection (1) or non-selection (0) of a particular feature. The feature subset recommended by each particle is evaluated using the training data set. The KNN model with five neighbors, as recommended in related studies [19,58], is employed to evaluate the fitness of the selected feature subset with a 10-fold cross-validation method. A fitness score is calculated using Equation (28). The identified final swarm leader represents the optimal feature subset. We subsequently evaluate the efficiency of this selected feature subset using the unseen test set in the test phase. The aforementioned feature selection process using each proposed PSO model combined with KNN is also illustrated in Algorithm 3.
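The fitness computation can be sketched as below. Equation (28) itself is not reproduced in the text, so the weighted form here is a common wrapper-fitness formulation consistent with its description (maximise accuracy, penalise the proportion of selected features); the 0.5 binarisation threshold and function names are assumptions.

```python
def binarise(particle, threshold=0.5):
    """Map a continuous particle to a feature mask: 1 = selected, 0 = not."""
    return [1 if v > threshold else 0 for v in particle]

def fitness(accuracy, n_selected, n_total, k1=0.9, k2=0.1):
    """Assumed wrapper fitness consistent with Eq. (28)'s description:
    weighted cross-validated KNN accuracy plus a reward for compact subsets."""
    return k1 * accuracy + k2 * (1.0 - n_selected / n_total)
```

In the full pipeline, `accuracy` would be the 10-fold cross-validated score of a 5-nearest-neighbor classifier trained on the features where `binarise` yields 1.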
We evaluate the effectiveness of both proposed PSO variants in feature selection tasks in Section 5.

Algorithm 3. The pseudo-code of the feature selection process using the proposed PSO models with the KNN classifier.

1 Start
2 Initialise a particle swarm using the Logistic chaotic map;
3 For (each particle i in the population) do
4 {
5 Convert particle i into a corresponding feature subset by selecting features on the dimensions where positive values are assigned;
6 Calculate classification performance of the feature subset encoded in particle i on the training data set using the KNN classifier;
7 Evaluate the fitness score of particle i based on its classification performance and number of selected features using the proposed objective function f (x) as shown in Equation (28);
8 Identify the pbest solution of each particle and the global best solution gbest;
9 } End For
10 While (termination criteria are not met)
11 {
12 Evolve swarm particles using the proposed mechanisms in PSOVA1 (i.e., lines 7-35 in Algorithm 1) or PSOVA2 (i.e., lines 6-22 in Algorithm 2);
13 For (each particle i in the population) do
14 {
15 Evaluate particle i using the objective function on the training set;
16 Update p best and gbest solutions;
17 } End For
18 } End While
19 Output gbest;
20 Convert gbest into the identified optimal feature subset;
21 Calculate classification performance on the unseen test set based on the yielded optimal feature subset using the KNN classifier;
22 Output the test classification results & the selected features;
23 End

Evaluation and Discussion
We employ a total of 13 datasets to investigate the efficiency of the proposed PSO models for feature selection. The employed datasets pose diverse challenges to feature selection problems, owing to a great variety of dimensionalities as well as complicated class distributions. The proposed PSO variants are integrated with a KNN-based wrapper model to conduct feature optimisation, where the number of nearest neighbors is set to 5 according to the recommendation in previous studies [19,58]. Three performance indicators are used to examine the effectiveness of the proposed PSO variants, i.e., classification accuracy, number of selected features, and F-score. Furthermore, we compare the proposed PSO variants against five classical search algorithms, i.e., PSO [22], DE [59], SCA [60], DA [61], and GWO [62], as well as ten PSO variants, i.e., CSO [52], HPSO-SSM [19], binary PSO (BPSO) [63], modified binary PSO with local search and a swarm variability controlling scheme (MBPSO) [53], CatfishBPSO [54], GPSO [41], MPSOELM [45], MFOPSO [44], BBPSOVA [42], and ALPSO [27]. To ensure a fair comparison, we employ the same number of function evaluations (i.e., population size × the maximum number of iterations) as the stopping criterion for all search methods. In our experiments, the population size and the maximum number of iterations are set to 30 and 100, respectively, based on trial and error. We conduct 30 runs for each experiment.

Data Sets
We employ the ALL-IDB2 database [64] for Acute Lymphoblastic Leukaemia (ALL) diagnosis, as well as ten UCI data sets [65], namely Arcene, MicroMass, Parkinson's disease (Parkinson), Human activity recognition (Activity), LSVT voice rehabilitation (Voice), Grammatical facial expressions (Facial Expression), Heart disease (Heart), Ionosphere, Epileptic seizure (Seizure) and Wisconsin breast cancer diagnostic data set (Wdbc), for evaluation. Besides that, two additional microarray gene expression data sets, i.e., Crohn's disease (Crohn) and Multiple Myeloma (Myeloma), from the Gene Expression Omnibus repository [66], are employed for evaluation. The details of each data set are shown in Table 2. These data sets pose diverse challenges to feature selection models, owing to a great variety of dimensionalities and class numbers, as well as complex data distributions. Specifically, the dimensionality of the employed data sets spans from 30 to 22,283, while the number of the classes ranges from 2 to 10. Moreover, according to previous studies [42,67,68], the employed data sets contain significant challenging factors (e.g., small inter-class and large intra-class variations) which can severely affect classification performance. Overall, a comprehensive evaluation can be established for the proposed PSO variants, in view of the dimensionality, number of classes, and sample distributions, pertaining to the data sets used for evaluation.

Parameter Settings
We compare the proposed PSO variants against fifteen baseline methods: five classical search algorithms (PSO, DE, SCA, DA, and GWO) and ten advanced PSO variants (CSO, HPSO-SSM, BPSO, MBPSO, CatfishBPSO, GPSO, MPSOELM, MFOPSO, BBPSOVA, and ALPSO). The parameter settings of each baseline method are adopted in accordance with the recommendations in the original studies. The detailed parameters of the proposed PSO models and the fifteen baseline methods are presented in Table 3.
Table 3. Parameter settings of each algorithm.
Proposed PSOVA2: switching probability for exemplar adoption = 0.4, initial weight for gbest = 0.4, and search coefficients implemented using exponential, sine, cosine, and hyperbolic tangent functions.

Results and Discussion
A comprehensive evaluation of the proposed PSO variants is established. Specifically, we adopt four different performance measures, i.e., classification accuracy, the F-score measure, the number of selected features, and convergence performance, in our experiments. A total of 30 runs are conducted in each experiment, and the average results are computed for comparison. Tables 4 and 5 summarise the classification accuracy rates and F-scores, respectively, along with the corresponding standard deviations, while Table 8 presents the numbers of selected features for all the search methods. The best results are marked in bold accordingly.

Classification Performance
With respect to classification accuracy in Table 4, PSOVA1 and PSOVA2 achieve the highest accuracy scores on all thirteen classification tasks, consistently outperforming all fifteen baseline algorithms. Specifically, PSOVA1 produces the highest accuracy scores on two data sets, i.e., Parkinson and Facial Expression, while PSOVA2 yields the best accuracy scores on the remaining eleven data sets. Moreover, the empirical results reveal the advantages of the proposed models over the fifteen baseline methods, especially on data sets with high dimensionalities, e.g., Crohn (22,283), Myeloma (12,625), and MicroMass (1300), as well as data sets with fuzzy boundaries and small inter-class variations, e.g., Heart (72).
Specifically, on the Heart data set, PSOVA2 outperforms the three best-performing baseline methods, i.e., SCA, HPSO-SSM, and DE, by 6.21%, 7.97%, and 8.06%, respectively. On the MicroMass data set, PSOVA2 outperforms the three best-performing baseline methods, i.e., GWO, BBPSOVA, and SCA, by 4.88%, 4.94%, and 5.51%, respectively. Evident performance gaps can also be observed between PSOVA1 and the fifteen baseline methods. The effectiveness of both proposed PSO models is further ascertained by the F-score measure, as shown in Table 5. Similar to the accuracy measure, the proposed PSO models achieve the highest F-score performances on all thirteen data sets and outperform the fifteen baseline methods with significant performance gaps, especially on feature selection tasks with high complexities, e.g., the MicroMass and Heart data sets. Moreover, in comparison with those of the fifteen baseline models, our proposed PSO variants demonstrate smaller or similar standard deviation results with respect to both the accuracy and F-score measures. This indicates the reliability of the proposed PSOVA1 and PSOVA2 models in producing superior classification performances across the employed feature selection tasks with various dimensionalities. The reliability of the proposed PSO variants is further examined below using the Wilcoxon statistical test.
We subsequently analyze the performance gaps pertaining to the challenges posed by some example data sets, as well as the superiority of both proposed models, as follows. With respect to ALL, the proposed models have successfully identified the clinical features critical to ALL diagnosis, e.g., cytoplasm and nucleus areas, the ratio between the nucleus and cytoplasm areas, form factor, compactness, perimeter, and eccentricity [42,67]. These features are commonly selected more than 15 times out of 30 trials by both proposed models. Specifically, as an important indicator of cell irregularity, the inclusion of the ratio between the nucleus and cytoplasm areas in the selected feature subsets can make a significant difference to accurate diagnosis of ALL. However, the baseline models often fail to capture the interactive impact between cytoplasm and nucleus, owing to the omission of one of them from the selected feature subsets. Overall, the feature selection results further indicate the effectiveness of both proposed models in identifying the most discriminative characteristics for ALL diagnosis. In comparison, the baseline models often only partially identify these important discriminative features, or overlook some aspects of the sophisticated feature interactions, owing to stagnation at local optima. Likewise, with respect to the diagnosis of coronary heart disease with three different severity levels [69], the feature subsets generated by the proposed PSO models reveal a number of key features, e.g., chest pain type, serum cholesterol, maximum heart rate, and ST depression. These have been identified as critical clinical features for the diagnosis of heart disease in existing studies [70].
The Wilcoxon rank sum test is conducted based on the mean classification accuracy rates, in order to further examine the statistical difference between both proposed PSO models and the baseline methods. As illustrated in Tables 6 and 7, most of the test results are lower than 0.05, indicating that both proposed PSO models significantly outperform the fifteen baseline models on most of the data sets. Compared with PSOVA1, PSOVA2 achieves a better statistical superiority. Specifically, PSOVA1 outperforms all the baseline methods for five data sets (Crohn, Myeloma, MicroMass, Parkinson, and Activity), while PSOVA2 outperforms all the baseline methods for eight data sets (Crohn, Myeloma, Arcene, MicroMass, Parkinson, Activity, Seizure, and Heart), with statistical significance. Out of 180 evaluations (12 data sets × 15 baseline algorithms), PSOVA1 does not show statistically significant differences with respect to the baseline methods in eighteen instances, as compared with eleven instances for PSOVA2.
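The pairwise significance check described above can be sketched as follows, using SciPy's two-sample Wilcoxon rank-sum implementation on per-run accuracies of two methods. The sample values below are synthetic and only illustrate the procedure, not the paper's results.

```python
from scipy.stats import ranksums

# Synthetic per-run accuracy samples (30-run experiments would have 30 values each).
acc_proposed = [0.95, 0.94, 0.96, 0.95, 0.97, 0.94, 0.96, 0.95, 0.96, 0.95]
acc_baseline = [0.90, 0.89, 0.91, 0.90, 0.88, 0.91, 0.90, 0.89, 0.90, 0.91]

# Two-sided rank-sum test; the null hypothesis is that both samples
# come from the same distribution.
stat, p_value = ranksums(acc_proposed, acc_baseline)
significant = p_value < 0.05   # reject the null at the 5% level
```

Each cell of Tables 6 and 7 corresponds to one such test between a proposed model and one baseline on one data set.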
The top three baseline models with the most competitive performances in comparison with those of our proposed PSO variants are DE, BBPSOVA, and SCA. Specifically, PSOVA1 shows similar result distributions to those of DE on the ALL, Ionosphere, and Wdbc data sets, to those of BBPSOVA on the Voice, Ionosphere, and Wdbc data sets, as well as to those of SCA on the Arcene, Heart, and Ionosphere data sets, whereas PSOVA2 demonstrates similar performance distributions to those of DE on the ALL and Wdbc data sets, to those of BBPSOVA on the Voice and Wdbc data sets, as well as to those of SCA on the Ionosphere data set.
The advantages of the proposed PSO models become more apparent on classification tasks with higher dimensionalities and sophisticated class distributions, i.e., Crohn (22,283), Myeloma (12,625), MicroMass (1300), Parkinson (753), and Activity (561). PSOVA2 demonstrates statistically significant superiority against all the baseline methods for these high-dimensional data sets. This is because of the adoption of diverse exemplars to guide the search in each dimension, as well as the employment of versatile search trajectories to rectify particle positions. The search strategies in most of the baseline models are monotonous, and are therefore more likely to be trapped in local optima in NP-hard problems, such as feature selection tasks. Owing to the proposed comprehensive strategies for avoiding local optima traps, the search diversity and robustness are significantly enhanced in both proposed PSO models, thereby increasing the likelihood of attaining the global optimum. Overall, the statistical results demonstrate the significant superiority of both proposed PSO models over the five classical search methods and ten advanced PSO variants, especially in feature selection tasks with higher complexities.

Selected Feature Sizes
With respect to the number of selected features, as shown in Table 8, CSO selects the fewest features on eight data sets, i.e., Crohn, Myeloma, Arcene, Voice, Facial Expression, Seizure, ALL, and Wdbc, while GWO obtains the smallest feature sizes on four data sets, i.e., MicroMass, Parkinson, Activity, and Heart. Owing to the excessive elimination of essential features, CSO achieves the lowest classification accuracy rates on five data sets, i.e., MicroMass, Voice, ALL, Wdbc, and Crohn. This indicates that CSO falls into local optima on the above data sets during training, which leads to stagnation in performance. According to the fitness evaluation illustrated in Equation (28), this phenomenon in turn results in the severe removal of features, in order to further improve the fitness scores. As such, very small feature subsets are identified during the feature selection process, which may not capture sufficient characteristics, leading to severe performance deterioration in the test stage. On the contrary, the proposed PSO variants succeed in achieving an efficient trade-off between eliminating redundant features and improving performance. They select comparatively smaller feature subsets than many search methods in most of the test cases, while achieving the highest accuracy rates and F-score results on all thirteen test data sets. In particular, the proportions of features eliminated by PSOVA2 are 65.46%, 66.22%, 65.88%, 63.32%, 63.51%, 66.93%, 64.84%, and 67.54% on eight high-dimensional data sets, i.e., Crohn, Myeloma, Arcene, MicroMass, Parkinson, Activity, Voice, and Facial Expression, respectively. A similar feature elimination capability is also depicted by PSOVA1.
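Equation (28) is not reproduced here, but wrapper-based feature selection commonly uses a fitness of the following form, which illustrates why aggressive feature elimination can keep improving the fitness score even when accuracy has stagnated. The weight alpha is an assumed illustrative value, not the paper's exact setting.

```python
def fitness(error_rate, n_selected, n_total, alpha=0.9):
    """Wrapper fitness to be minimised: a weighted sum of the classification
    error rate and the fraction of retained features. With a fixed error rate,
    dropping more features always lowers the fitness, which can reward
    over-aggressive elimination (as observed for CSO above)."""
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)
```

For instance, a particle stuck at a 10% error rate can still reduce its fitness by shrinking the subset from 10 to 5 of 100 features, even though the smaller subset may generalise worse at test time.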
In short, the empirical results indicate the significant capabilities of the proposed PSO models in removing irrelevant and noisy features while identifying the most discriminative and effective ones without falling into local optima traps during the search process.

Convergence Rates and Computational Costs
The mean convergence curves over a set of 30 runs for each search method on two high-dimensional data sets, i.e., Myeloma and Crohn, respectively, are provided to indicate model efficiency in the training stage.
As illustrated in Figure 4 (Myeloma) and Figure 5 (Crohn), both proposed PSO models (two dashed lines) achieve promising results. Specifically, the proposed PSO models illustrate faster convergence rates than those from the baseline models, while maintaining the momentum to improve the fitness score through the entire search course. PSOVA2 performs better than PSOVA1, especially during the later stage of the search course. The proposed exemplar breeding mechanisms and diverse attraction operations with non-linear parameters account for the superior capabilities of PSOVA2 in preserving diversity and overcoming local stagnation. Moreover, CSO illustrates faster convergence rates than those of the proposed models, but at the expense of excessive elimination of a large number of features. It is likely that CSO is trapped in local optima, and its performance becomes stagnant. This is supported by its deterioration in classification accuracy and F-measure results, as indicated in Tables 4 and 5. On the contrary, the proposed models achieve a comparatively balanced trade-off between feature elimination and performance improvement.
Since fitness evaluation is the most time-consuming procedure during the search cycle, the computational load of PSOVA1 and PSOVA2 primarily hinges on the population size × the maximum number of iterations. Note that all the search methods employ the same maximum number of function evaluations during the training stage. As such, all the search methods have a similar computational cost in principle, which is governed by the time taken for fitness evaluation. On the other hand, the internal search mechanisms differ from one algorithm to another, so the computational cost of each algorithm varies slightly. Table 9 depicts the average computational costs during training with respect to the proposed PSO models and other search methods over 30 runs on the Crohn, Myeloma, and Seizure data sets.
These data sets are selected for computational cost analysis because they have either high-dimensional feature spaces or large sample sizes. The computational costs of all the methods pertaining to other data sets may vary in accordance with the training sample sizes and dimensions. As indicated in Table 9, in most cases, both proposed PSO models show comparatively lower or comparable computational costs in comparison with those of most of the baseline methods. CSO, GWO, and PSOVA2 achieve the most efficient training computational costs for the Crohn, Myeloma, and Seizure data sets, respectively.

Evaluation of The Proposed Mechanisms in PSOVA1 and PSOVA2
We subsequently demonstrate the efficiency of each proposed mechanism in both PSOVA1 and PSOVA2 using the Seizure and Voice data sets. The mean classification accuracy rates over 30 runs are shown in Table 10. The empirical results indicate that each strategy in each proposed model is able to drive the search out of stagnation and enhance the feature selection performance. The results conform to the principles of the introduced mechanisms. In particular, the exemplar breeding mechanism and the versatile search operations using compound sine, cosine, and hyperbolic tangent functions in PSOVA2 are comparatively more effective than the modified PSO operation with ameliorated optimal signals and spiral-based local exploitation in PSOVA1. This is primarily owing to the employment of diverse exemplars to lead the search in each dimension, as well as the adoption of versatile search courses to rectify the particle positions in PSOVA2. In comparison with the original PSO model and PSOVA1, instead of using a single leader or rectified separate global and personal best experiences to guide the search process, an exemplar generation scheme with adaptive aggregation of the local and global optimal signals is used in PSOVA2. As such, the impact of the local optimal indicators is more significant at the beginning of the search process, while the influence of the global best solution is more dominant towards the final iterations. Such an exemplar breeding scheme in PSOVA2 is more capable of overcoming stagnation. Unlike PSOVA1, where the search mainly relies on a modified PSO algorithm, PSOVA2 employs four search strategies implemented using refined sine, cosine, and hyperbolic tangent formulae for the position updating procedure to increase search diversification.
The mechanisms proposed in both PSO models work in a collaborative manner to diversify the search process and mitigate premature convergence. In PSOVA1, when the modified PSO algorithm with rectified optimal signals becomes stagnant over the iterations, the local exploitation mechanism based on the spiral search action is able to further explore the near-optimal regions and drive the search out of stagnation. In PSOVA2, when the customized sine-based search operation is trapped in local optima, other search mechanisms, such as the cosine and hyperbolic tangent oriented search actions, are able to extend the search territory to overcome early stagnation. In short, the empirical results indicate that the proposed mechanisms in each model are effective in mitigating premature convergence, accelerating convergence while preserving search diversity.
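The two PSOVA2 ideas discussed above can be sketched as follows, under assumed formulae (the paper's exact equations are not reproduced here): an exemplar that adaptively blends the personal and global bests per dimension, with the global-best weight growing over the iterations, together with nonlinear search-coefficient schedules. The weight schedule, coefficient forms, and the 0.4 initial weight follow the parameter listing in Table 3, but the precise functional shapes are illustrative.

```python
import math
import random

def breed_exemplar(pbest, gbest, t, t_max, w0=0.4):
    """Blend pbest and gbest per dimension; pbest dominates early in the run,
    gbest dominates late (assumed linear growth of the weight from w0 to 1)."""
    w = w0 + (1 - w0) * t / t_max
    return [w * g + (1 - w) * p for p, g in zip(pbest, gbest)]

def coefficient(t, t_max, scheme="sine"):
    """Nonlinear search-coefficient schedules (illustrative forms)."""
    r = t / t_max
    if scheme == "sine":
        return math.sin(math.pi * r / 2)   # rises from 0 to 1
    if scheme == "cosine":
        return math.cos(math.pi * r / 2)   # decays from 1 to 0
    if scheme == "tanh":
        return math.tanh(3 * (1 - r))      # smooth decay
    return math.exp(-3 * r)                # exponential decay

def move(position, exemplar, t, t_max, scheme="sine"):
    """Step each dimension towards the exemplar with a scheduled coefficient."""
    c = coefficient(t, t_max, scheme)
    return [x + c * random.random() * (e - x) for x, e in zip(position, exemplar)]
```

The complementary schedules give a particle both large early jumps (exploration) and small late steps (exploitation), depending on which scheme is active.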
Besides the above, we further evaluate the efficiency of each proposed strategy in PSOVA1 and PSOVA2 for tackling minimization problems using a set of 11 benchmark functions. They include four multimodal functions (i.e., Ackley, Griewank, Rastrigin, and Powell) and seven unimodal landscapes (i.e., Dixon-Price, Rotated Hyper-Ellipsoid, Rosenbrock, Sphere, Sum of Different Powers, Sum Squares, and Zakharov). The definitions of these benchmark functions are provided in [23,34,50,71]. The following experimental settings are employed for model evaluation, i.e., population size = 30, dimension = 30, maximum number of iterations = 500, and trials = 30. Table 11 illustrates the mean, maximum, minimum, and standard deviation results for all the test functions, with the best results highlighted in bold. As shown in Table 11, both the mean and minimum results over 30 runs indicate that our models with individual or composite proposed mechanisms all significantly outperform the standard PSO model. For each of the proposed PSO variants, sequential aggregation of the proposed mechanisms leads to better search efficiency and capability, as evidenced by the enhanced performances. Moreover, PSOVA2 outperforms PSOVA1 on 9 out of 11 test functions. Overall, the empirical results of the test functions demonstrate the great superiority of the proposed models. The search mechanisms in PSOVA1 and PSOVA2 work in cooperation to achieve the best performances, owing to the advanced trade-offs between diversification and intensification.
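For reference, two of the benchmark landscapes named above are given below in their standard forms, as used for sanity-checking a minimiser; both have a global minimum of 0 at the origin.

```python
import math

def sphere(x):
    """Sphere function (unimodal): f(x) = sum(x_i^2)."""
    return sum(v * v for v in x)

def ackley(x, a=20.0, b=0.2, c=2 * math.pi):
    """Standard Ackley function (multimodal); its many shallow local minima
    surround a single global minimum of 0 at the origin."""
    n = len(x)
    s1 = sum(v * v for v in x) / n
    s2 = sum(math.cos(c * v) for v in x) / n
    return -a * math.exp(-b * math.sqrt(s1)) - math.exp(s2) + a + math.e
```

A search method that reliably drives both functions towards 0 in 30 dimensions demonstrates exactly the exploration/exploitation balance that Table 11 measures.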

Discussion
The empirical results of classification performance, feature elimination effects, as well as convergence rates all indicate the superiority of the proposed PSO variants over other baseline methods in undertaking feature selection tasks, i.e., constructing simplified but valid feature subsets while improving classification performance.
Both proposed PSOVA1 and PSOVA2 models adopt hybrid leader signals and diversified search operations to overcome local optima traps. In essence, PSOVA2 inherits all merits of PSOVA1. It further endows the particles with a higher degree of freedom in terms of (1) the choice of destination signals, and (2) the choice of movements to approach the destination solutions. Besides the generation of the combined best leader by adaptively incorporating both local and global best signals, PSOVA2 implements multiple movement operations towards the destination signal, where the search coefficients are delivered by four distinctive yet complementary nonlinear functions. These search mechanisms offer the choices of either a large jump to propel the convergence or a gradual stroll to intensify the exploitation, as well as the choices of either marching towards or distancing from the destination signals. As a result, PSOVA2 is more likely to attain global optimality while effectively avoiding stagnation at local optima traps.
In contrast, certain limitations of the employed baseline classical search methods have been identified in previous studies. Specifically, the search capability of DE can be severely compromised, owing to the failure of generating promising solutions within a limited number of function evaluations [56]. GWO demonstrates a strong bias towards the origin of the coordinate system, attributable to its simulated model, as well as stagnation at local optima traps owing to its poor exploration capability [72]. DA suffers from a poor exploitation capability, owing to the fact that it does not keep track of the elite solutions [2,61]. In addition, most of the existing PSO variants are equipped with improvements from the perspective of either exploration or exploitation, rather than comprehensively taking into account the trade-off between both operations. Overall, the proposed PSO models demonstrate great superiority over the baseline methods in attaining global optimality, owing to a delicate consideration of both global exploration and local exploitation. This is realised through attraction towards the elite solutions, as well as exploration with diverse step sizes and possible movements in all directions, respectively. Therefore, both proposed PSO models are capable of improving classification performance by identifying the most discriminative features and eliminating noisy and irrelevant ones, as evidenced by the empirical results along with the statistical tests. Moreover, PSOVA2 performs better than PSOVA1 in undertaking feature selection problems, owing to the enhanced diversity induced by a greater freedom in choosing the exemplar signals to guide the search in each dimension, as well as a greater versatility in ways of approaching such destination solutions.

Conclusions
In this research, we proposed two PSO models, namely PSOVA1 and PSOVA2, for undertaking a variety of feature selection tasks. Each of the proposed models incorporates a number of distinctive search mechanisms to enhance exploration of undiscovered search regions and exploitation around near-optimal solutions, guided by hybrid leader signals. These formulated strategies in each model work cooperatively to produce diverse search behaviors in terms of search flights and directions. In particular, PSOVA2 elevates search diversity by adopting adaptive exemplars as well as four search operations where the search coefficients are implemented using refined sine, cosine, and hyperbolic tangent functions to overcome stagnation.
Evaluated using a total of 13 data sets, with diverse dimensionalities from 30 to 22,283, both models outperform five classical search methods and ten advanced PSO variants significantly in most test cases, as evidenced by the empirical and statistical test results. Specifically, PSOVA1 outperforms all the baseline methods for five data sets (Crohn, Myeloma, MicroMass, Parkinson, and Activity), while PSOVA2 outperforms all the baseline methods for eight data sets (Crohn, Myeloma, Arcene, MicroMass, Parkinson, Activity, Seizure, and Heart), with statistical significance.
In future directions, other hybrid leader breeding mechanisms will be explored to further enhance performance. Moreover, we also aim to evaluate the proposed models using complex computer vision tasks, e.g., deep architecture generation for object detection and classification [51,73,74,75], as well as image description generation [76,77].