Competitive Particle Swarm Optimization for Multi-Category Text Feature Selection

Multi-label feature selection is an important task in text categorization because it enables learning algorithms to focus on the essential features that predict relevant categories, thereby improving categorization accuracy. Recent studies have considered hybridizing evolutionary feature wrappers with filters to enhance the evolutionary search process. However, the relative effectiveness of the evolutionary and filter operators in searching for feature subsets has not been considered, which can result in degenerate final feature subsets. In this paper, we propose a novel hybridization approach based on competition between the operators. Unlike conventional methods, the proposed algorithm applies each operator selectively and modifies the feature subset according to the operator's relative effectiveness. Experimental results on 16 text datasets verify that the proposed method is superior to conventional methods.


Introduction
Text categorization involves the identification of the categories associated with specified documents [1][2][3][4]. According to the presence or frequency of words within a document, the so-called bag-of-words model represents each document as a word vector [5]. Each word vector is then assigned to multiple categories because, in general, a document is relevant to multiple sub-concepts [6][7][8]. Text datasets are composed of a large number of words. However, not all the words are useful for solving the associated problem. Irrelevant and redundant words can confound a learning algorithm, deteriorating the performance of text categorization [9]. To resolve these issues, conventional methods have attempted to identify a subset of important words by discarding unnecessary ones prior to text categorization [10][11][12][13]. Thus, multi-label feature selection can be an effective preprocessing step for improving the accuracy of text categorization.
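As a minimal illustration of the bag-of-words model described above, the sketch below builds word-frequency vectors from toy documents. The documents, the whitespace tokenization, and the `bag_of_words` helper are illustrative, not part of the proposed method.

```python
from collections import Counter

def bag_of_words(documents):
    """Represent each document as a word-frequency vector (bag-of-words)."""
    # Build a fixed, sorted vocabulary over all documents.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        # One entry per vocabulary word: its frequency in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["stock market rises", "market crash hits stock market"]
vocab, vecs = bag_of_words(docs)
# vocab -> ['crash', 'hits', 'market', 'rises', 'stock']
# vecs[1] -> [1, 1, 2, 0, 1]  (the word "market" occurs twice)
```

Each such vector can then be associated with multiple category labels, yielding the multi-label setting discussed below.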
Given a set of word features F = {f_1, ..., f_d}, multi-label feature selection involves identifying a subset S ⊂ F, i.e., a solution composed of n ≪ d features that are significantly relevant to the label set L = {l_1, ..., l_|L|}. To solve this task, conventional approaches use feature wrappers and filters. At the risk of selecting features that are ineffective for the learning algorithm used subsequently, filters can rapidly identify a feature subset mostly composed of important features based on the intrinsic properties of the data [14]. In contrast, wrappers directly determine the superiority of candidate feature subsets by using a specific learning algorithm, and they generally outperform filters in terms of learning performance [10]. Notwithstanding their essential differences, devising an effective search method is important in both approaches, because the algorithm must locate the final feature subset in a vast search space specified by thousands of word features.
As an effective search method for feature wrappers, population-based evolutionary algorithms are frequently used in conventional studies because of their stochastic global search capability [15]. These evolutionary algorithms evaluate the fitness of a feature subset based on the categorization performance of the learning algorithm. Furthermore, an evolutionary operator such as a mutation operator modifies the feature subset. Moreover, recent studies have reported that the search capability of an evolutionary algorithm can be further improved through hybridization with a filter [16,17]. Specifically, the feature filter operator can rapidly improve the feature subset by considering only the intrinsic properties of the data, particularly when the solution is overwhelmed by unnecessary features [18].
To achieve an effective hybrid search, the fitness of the feature subset modified by an evolutionary or filter operator must be improved. However, the fitness of a feature subset is not always improved after modification, because the evolutionary operator exhibits random properties and the filter operator is independent of the fitness evaluation function [17,19,20,21]. If the fitness is not improved after modification by an operator, the modified feature subset is discarded, and the computations performed to evaluate its fitness are wasted. A preferred hybrid search is one in which the modification of a feature subset by each operator always improves the fitness, thus avoiding wasted computation. If an algorithm could ascertain the fitness after modification by each operator without evaluating the feature subset, it could decide in advance which operator should modify the feature subset. However, this is unfeasible in practice [20]. The second-best option is a method that estimates the relative effectiveness of each operator based on the fitness of the feature subsets already computed in the previous iteration and then decides which operator to apply. According to our experiments, selective engagement of the operators can significantly increase the effectiveness of a hybrid search; however, little attention has been paid to this aspect in recent studies.
To overcome the problems described above, we devise a competitive particle swarm optimization (PSO) algorithm. Unlike conventional PSOs, the proposed method applies each operator selectively based on a novel process for estimating the effectiveness of each operator for each particle. As a result, the particles can be separated into two groups depending on which operator is to be applied in the next iteration. Then, based on the fitness of the particles in each group, a tournament is run, and its results decide which operator is applied to each particle in the next iteration by changing the particle memberships. Consequently, the proposed method competitively engages each operator in the feature subset search through a fitness-based tournament in each iteration. Our contributions are as follows:
• We propose a novel competitive particle swarm optimization for the multi-label feature selection problem that employs an information-theoretic multi-label feature filter as a filter operator.
• To apply the evolutionary and filter operators selectively, we propose a new process for estimating their relative effectiveness based on a fitness-based tournament of the feature subsets in each iteration.
• To demonstrate the superiority of the information-theoretic measure for improving the search capability, we employ an information-theory-based feature filter and a frequency-based feature filter simultaneously and conduct an in-depth analysis.
Our experiments revealed that the proposed method outperformed conventional methods, indicating the effectiveness of the proposed estimation process and the information-theoretic feature filter operator.

Related Work
In the field of text categorization, feature selection is a crucial task because the feature space is generally high-dimensional. Conventional feature selection methods can be largely categorized into feature filters and feature wrappers. Feature filter methods assess the importance of features using a score function such as the χ2 statistic, information gain, or mutual information [14]; the top-n features with the highest scores are then selected. Uysal and Gunal [22] proposed a distinguishing feature selector that investigates the relationship between the absence or presence of a word within a document and the correct label for that document. Rehman et al. [23] proposed a normalized difference measure to remedy the problem of the balanced accuracy measure, which ignores the relative document frequency in the classes. Tang et al. [24] proposed a maximum discrimination method based on a new measure for multiple distributions, namely the Jeffreys-multi-hypothesis divergence. However, these methods exhibit limited categorization accuracy because they do not interact with the subsequent learning algorithm.
In contrast, feature wrapper methods evaluate the discriminative power of feature subsets based on a specific learning algorithm and select the best feature subset. Among feature wrapper methods, population-based evolutionary algorithms are widely used for text feature selection owing to their stochastic global search capability. Aghdam et al. [25] applied ant colony optimization to text feature selection. Meanwhile, Lin et al. [26] proposed an improved cat swarm optimization algorithm to reduce the computation time of their originally proposed method. Lu et al. [27] demonstrated the enhanced performance of PSO based on a functional constriction factor and an inertia weight. However, unlike feature filters, these methods generally require significant computational resources for identifying a high-quality feature subset because of their randomized mechanism [28].
To resolve this issue, recent studies have considered hybrid approaches that combine an evolutionary feature wrapper with a filter. These hybrid methods can be categorized into two types according to how the filter operator is applied. One type applies the filter operator to initialize the population of the evolutionary algorithm during the initialization step. For example, Lu and Chen [21] initialized the candidate feature subsets of a small-world algorithm using the χ2 statistic and information gain. Meanwhile, Mafarja and Mirjalili [18] initialized the ants in a binary ant lion optimizer using a quick reduct and an approximate entropy reduct based on rough set theory. Although this approach allows the algorithm to start its search from a promising region, the algorithm can lack diversity, resulting in premature convergence. In addition, these algorithms can fail to refine the final feature subset because the filter operator is not engaged in the final stage of the search.
The second type of hybrid approach applies the filter operator to modify the feature subset in each iteration during the search process. Ghareb et al. [16] proposed an enhanced genetic algorithm by modifying the crossover and mutation operations by using the ranks of features obtained from six filter methods. Lee et al. [29] proposed an exploration operation that uses a filter to select important features from among those not selected by a genetic operator. Then, a new feature subset is generated. Moradi and Gholampour [30] constructed an enhanced binary PSO using correlation information. Meanwhile, Mafarja and Mirjalili [31] improved the whale optimization algorithm using simulated annealing for the local search. Dong et al. [19] enhanced the genetic algorithm using granular information to address feature selection in high-dimensional data with a low sample size. Zhou et al. [32] proposed a hybrid search that adjusts the influence of the feature filter according to the degree of convergence. However, these methods exhibit limited performance because the evolutionary and filter operators are not engaged selectively. Table 1 presents a brief summary of conventional feature-selection approaches. Table 1. Brief summary of conventional feature selection approaches.

Approach | Advantages | Disadvantages
Filter methods | Rapid identification of a feature subset | Lower performance than that of wrappers
Wrapper methods | Higher performance than that of filters | High complexity
Hybrid methods (first type) | Search starts in a region exhibiting potential | Premature convergence
Hybrid methods (second type) | Improved search capability | Randomized engagement of the operators

Preliminary
To design a competitive hybrid search, we selected PSO as an evolutionary algorithm because it has been demonstrated to be effective in many applications including feature selection [33][34][35][36]. PSO techniques can be classified into continuous PSO and binary PSO. In the former, the population is composed of real numbers. Meanwhile, in binary PSO, the population is composed of zeros and ones. In this study, we considered continuous PSO because binary PSO exhibits potential limitations such as the update of particles based solely on the velocity [36].
In continuous PSO for feature selection, the population of particles is known as a swarm. The location of a particle with d elements can be regarded as a probability vector, each of whose elements is the probability that the corresponding feature is selected. The location of a particle is described as follows:

C_i = (C_i(1), C_i(2), ..., C_i(d)),    (1)

where C_i is the ith particle in particle group C and C_i(j) is the probability that the jth feature is selected when the feature subset is generated from C_i in our study. In the initialization step, the elements of each location are initialized as real numbers drawn at random from the uniform distribution [0, 1]. To find feature subsets exhibiting potential, the particle locations are iteratively updated as follows:

C_i ← C_i + V_i,    (2)

where V_i is the velocity vector of the ith particle; it refers to the magnitude and direction with which the particle moves across the search space. In the initialization step, the velocity of each particle is initialized with real numbers drawn at random from the uniform distribution [−1, 1]. The velocity is calculated as follows:

V_i ← w V_i + c_1 r_1 (P_i − C_i) + c_2 r_2 (G − C_i),    (3)

where P_i is the so-called "personal best" and denotes the best location identified so far by the ith particle, and G is the "global best" and denotes the best location identified so far by the swarm. Specifically, the best locations are selected according to a fitness value obtained by the specific learning algorithm. The inertia weight w controls the influence of the previous velocity on the present velocity. Here, c_1 and c_2 are acceleration constants, and r_1 and r_2 are random values uniformly distributed in [0, 1]. Additionally, the velocity is limited to a maximum velocity v_max such that ∀i, j : |V_i(j)| < v_max. In this study, these user-defined parameters are set based on conventional studies, to w = 0.7298, c_1 = c_2 = 1.49618, and v_max = 0.6 [36].
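The velocity and location updates of Equations (2) and (3) can be sketched as follows, using the parameter values stated above (w = 0.7298, c_1 = c_2 = 1.49618, v_max = 0.6). The swarm size, dimension, and placeholder personal/global bests are illustrative; in the actual algorithm, P and G would be chosen by fitness.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 4                                        # feature dimension, swarm size
w, c1, c2, v_max = 0.7298, 1.49618, 1.49618, 0.6   # parameters from the text

C = rng.uniform(0.0, 1.0, (m, d))    # particle locations, Eq. (1)
V = rng.uniform(-1.0, 1.0, (m, d))   # initial velocities
P = C.copy()                         # personal bests (placeholder)
G = C[0].copy()                      # global best (placeholder)

def step(C, V, P, G):
    """One PSO iteration: velocity update (Eq. 3), then location update (Eq. 2)."""
    r1 = rng.uniform(0.0, 1.0, C.shape)
    r2 = rng.uniform(0.0, 1.0, C.shape)
    V = w * V + c1 * r1 * (P - C) + c2 * r2 * (G - C)
    V = np.clip(V, -v_max, v_max)    # enforce |V_i(j)| <= v_max
    return C + V, V

C, V = step(C, V, P, G)
```

After a step, each row of C is again interpreted as a per-feature selection-probability vector for generating a candidate subset.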

Motivation and Approach
We enhance the performance of a hybrid search for multi-label text feature selection by implementing competitive engagement of the evolutionary and filter operators according to their relative effectiveness. To estimate their relative effectiveness and implement competitive engagement, each operator needs to modify the particles independently in each iteration. Therefore, we separate the particles into small groups depending on which operator is applied in the next iteration, i.e., evolution-based and filter-based particle groups. Figure 1 shows a schematic overview of the proposed algorithm.
First, we design the evolution-based particle group based on conventional PSO. In the initialization step, the evolution-based particles are assigned real numbers drawn at random from the uniform distribution [0, 1]. During the search process, feature subsets are generated from the particle locations (as in conventional PSO), as described in Section 3.1, and the locations are updated according to Equations (2) and (3) by the evolutionary operator. Second, the filter-based particle group is initialized and updated by the filter operator using a score vector obtained from the score function corresponding to the filter; the elements of the score vector are the importances of the features. During the search process, the algorithm changes the membership of each losing particle to that of the winning particle according to the tournament results based on the fitness values, as shown in Figure 1. For example, if a filter-based particle wins, the filter operator's search is regarded as more effective than that of the evolutionary operator in the previous iteration; thus, the algorithm applies the filter operator to the losing evolution-based particle in the next iteration. This procedure is repeated until the parameterized resources are exhausted.
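Generating a feature subset from a particle location can be sketched as below. The top-n selection rule and the `generate_subset` name are assumptions for illustration; the text states only that subsets are generated from the locations, which are per-feature selection probabilities.

```python
def generate_subset(location, n_features):
    """Select the n features with the highest selection probabilities.

    `location` is a particle location: one selection probability per feature.
    The top-n rule is an illustrative assumption, not the paper's exact rule.
    """
    order = sorted(range(len(location)), key=lambda j: location[j], reverse=True)
    return order[:n_features]

generate_subset([0.1, 0.9, 0.4, 0.8], 2)  # -> [1, 3]
```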

Competitive Particle Swarm Optimization
Although multiple filter operators can be employed in our proposed method, for simplicity, we outline the pseudocode for the case in which only one filter is used. This is illustrated in Algorithm 1. The terms used to describe the algorithm are summarized in Table 2.

Algorithm 1. Competitive PSO with a single filter operator.
1: S ← ∅
2: t ← 0
3: [C, F] ← initialize particles; Algorithm 2
4: [S_c, S_f] ← generate subsets based on C, F
5: [E_c, E_f] ← evaluate S_c, S_f with the fitness function
6: t ← t + m_c + m_f
7: while t < v do
8:    update C using Equations (2) and (3); update locations of particles
9:    [S_c, S_f] ← generate subsets based on C, F
10:   [E_c, E_f] ← evaluate S_c, S_f with the fitness function; t ← t + m_c + m_f
11:   S ← the best feature subset so far
12:   [w_c, l_c] ← compete based on E_c, E_f; Algorithm 3
13:   update particle memberships according to w_c, l_c
14: end while
15: return S

In the initialization step (Line 3), the algorithm generates the evolution-based and filter-based particles with Algorithm 2. On Lines 4 and 5, each particle generates a feature subset using its location, and this feature subset is evaluated by a fitness function; the obtained fitness values E_c and E_f denote the learning performance of the text categorization. In this pseudocode, a high fitness value indicates that the corresponding particle displays good fitness. Additionally, because m_c + m_f particles are evaluated, m_c + m_f fitness function calls (FFCs) are counted on Line 6. The number of FFCs is generally used as a stopping criterion [20]. Table 2. Notations used in the design of the proposed method.

Terms | Meanings
C | The evolution-based particle group
F | The filter-based particle group
m_c | The number of evolution-based particles
m_f | The number of filter-based particles
E_c | The fitness values for feature subsets generated from C
E_f | The fitness values for feature subsets generated from F

Algorithm 2. Initialization of the evolution- and filter-based particles.
1: C ← ∅
2: F ← ∅
3: for k = 1 to m_c do
4:    for j = 1 to d do
5:       C_k(j) ← sample from uniform [0, 1]
6:    end for
7: end for
8: X ← score vector calculated by the score function of the filter
9: σ ← average score difference; Equation (4)
10: σ ← σ m_f
11: for k = 1 to m_f do
12:    for j = 1 to d do
13:       F_k(j) ← sample from N(X(j), σ^2); use Gaussian distribution
14:    end for
15: end for

After the initialization process, the evolution-based particles are updated by the evolutionary operator on Line 8 of Algorithm 1. Moreover, all particles are evaluated by the fitness function on Lines 9 and 10.
On Lines 12 and 13, the evolution-and filter-based particles compete. The losing particles are updated in the next iteration by the winning operator, according to the competition results from Algorithm 3. This procedure is repeated until the algorithm attains the maximum FFCs, denoted by v. The output of Algorithm 1 is the best feature subset obtained during the search process.
Algorithm 3. Competition between the evolution- and filter-based particle groups.
1: w_c ← 0
2: l_c ← 0
3: E'_c ← E_c
4: E'_f ← E_f
5: n ← min(m_c, m_f); the number of competitions
6: for k = 1 to n do
7:    if max(E'_c) > max(E'_f) then w_c ← w_c + 1; add one whenever an evolution-based particle wins
8:    else if m_c − l_c > 1 then l_c ← l_c + 1; add one whenever an evolution-based particle loses
9:    end if
10:   if an evolution-based particle won then E'_c(i) ← −∞
11:   else E'_f(j) ← −∞; exclude the winning particle from the next competition
12:   end if
13: end for
14: for k = 1 to w_c do
15:   j ← arg min_k E_f(k)
16:   delete F_j; delete the particle with the lowest fitness value
17:   C_{end+1} ← a new particle; use the uniform distribution
18: end for
19: for k = 1 to l_c do
20:   j ← arg min_k E_c(k)
21:   delete C_j
22:   F_{end+1} ← a new particle; use the score vector of the feature filter
23: end for
24: return [C, F]

Algorithm 2 presents the detailed procedure for initializing the particles. On Lines 3-7, the evolution-based particles are initialized. The score function associated with the filter then calculates a score vector to initialize the filter-based particles. If only one filter-based particle group is used, it is generated by random diffusion of the score vector on Lines 9-15 to maintain diversity within the group. Here, random diffusion is implemented by diffusing the score vector according to a Gaussian distribution: the mean is set to the score vector, and the standard deviation is calculated as follows:

σ = (1/(d − 1)) ∑_{j=1}^{d−1} (X_s(j + 1) − X_s(j)),    (4)

where X_s is the score vector sorted in ascending order. The standard deviation is thus the average score difference, which prevents the diffusion from altering the ranking orders excessively. On Line 10, our algorithm multiplies the standard deviation by the number of filter-based particles to maintain diversity. Algorithm 3 presents the detailed procedure of the competition between the evolution- and filter-based particles. On Line 5, the number of competitions is set to the minimum of the particle group sizes. On Lines 6-12, after each group has selected the particle with the maximum fitness value, the particles compete based on the fitness value. The competition results are stored as the numbers of wins and losses for the evolution-based particle group. Our algorithm prevents the number of evolution-based particles from becoming zero on Line 8.
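The filter-side initialization, i.e., Gaussian diffusion of the score vector with the average-gap standard deviation scaled by the number of filter-based particles, can be sketched as follows. The function name and seed handling are illustrative.

```python
import random

def init_filter_particles(score_vector, m_f, seed=0):
    """Initialize m_f filter-based particles by Gaussian diffusion of the
    filter's score vector (illustrative sketch of Algorithm 2's filter side)."""
    rng = random.Random(seed)
    xs = sorted(score_vector)          # X_s: scores in ascending order
    d = len(xs)
    # Standard deviation = average difference between adjacent sorted scores.
    sigma = sum(xs[j + 1] - xs[j] for j in range(d - 1)) / (d - 1)
    sigma *= m_f                       # widen the diffusion with group size
    # Each particle: one Gaussian sample per feature, centered on its score.
    return [[rng.gauss(mu, sigma) for mu in score_vector] for _ in range(m_f)]
```

For a score vector [0.1, 0.4, 0.2, 0.3], the average adjacent gap is 0.1, so with m_f = 3 the sampling standard deviation becomes 0.3, keeping the diffusion proportional to the score scale.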
If one particle continues to win the competition, all particles can converge to it through genetic drift [37]. To circumvent this, a winning particle is excluded from the next competition (Lines 10 and 11), preventing any particle from winning continually. In our algorithm, the losing particles are updated by the winning operator in the next iteration on Lines 17 and 22.
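The tournament logic can be sketched as below: each round pits the best remaining particle of each group against the other, the round's winner is excluded from subsequent rounds, and at least one evolution-based particle is always retained. The function name, inputs, and return convention (counts of wins and losses for the evolution-based group) are illustrative assumptions, not the paper's exact Algorithm 3.

```python
def compete(fitness_evo, fitness_filt):
    """Fitness-based tournament between the two particle groups.

    Returns (wins, losses) for the evolution-based group: `wins` filter-based
    particles are converted to evolution-based ones in the next iteration,
    and `losses` evolution-based particles are converted the other way.
    """
    e, f = list(fitness_evo), list(fitness_filt)
    wins = losses = 0
    for _ in range(min(len(e), len(f))):        # number of competitions
        i = max(range(len(e)), key=lambda k: e[k])
        j = max(range(len(f)), key=lambda k: f[k])
        if e[i] > f[j]:
            wins += 1
            e[i] = float("-inf")                # winner sits out later rounds
        elif len(e) - losses > 1:               # keep >= 1 evolution-based particle
            losses += 1
            f[j] = float("-inf")
    return wins, losses
```

For example, `compete([0.9, 0.2], [0.5, 0.4])` yields one win and one loss for the evolution-based group: its best particle (0.9) beats 0.5, but after that winner is excluded, its remaining particle (0.2) loses to 0.5.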
Finally, we conducted a theoretical analysis of the time complexity of the proposed method. In the evolutionary search, each feature subset should be evaluated by the learning algorithm to obtain the fitness value. This involves complicated sub-procedures including a decision-making process for multiple categories and repetitive cross-validation to simulate realistic performance [38]. Thus, the maximum number of FFCs permitted can be used to represent the computational complexity of the proposed method, i.e., O(v).

Information-Theoretic Multi-Label Feature Filter Operator
Information theory is frequently used in conventional studies because of its capability to quantify the similarity between probability distributions. Information-theory-based feature filter methods generally evaluate the importance of features based on the mutual information between each feature and the labels. We selected an information-theoretic multi-label feature filter, namely quadratic programming-based multi-label feature selection [39], as a filter operator because it has performed effectively in multi-label feature selection problems. It calculates a score vector based on a criterion that maximizes the dependency on the labels and minimizes the redundancy among the features. Here, the score vector represents the importance of each feature.
Given a set of features F = {f_1, ..., f_d} and a label set L = {l_1, ..., l_|L|}, the score vector X is calculated by solving the following maximization problem:

max_X ∑_{j=1}^{d} X(j) ∑_{k=1}^{|L|} I(f_j; l_k) − ∑_{i=1}^{d} ∑_{j=1}^{d} X(i) X(j) I(f_i; f_j),    (5)

where I(a; b) = H(a) + H(b) − H(a, b) is Shannon's mutual information between the random variables a and b, H(a) = −∑_i p(a_i) log_2 (p(a_i)) is the entropy of the probability distribution p(a), and H(a, b) is the joint entropy of the joint distribution p(a, b). Specifically, the first term expresses the dependency between each feature and the multiple labels, and the second term expresses the redundancy among the features. In addition, the score vector X has the following constraints:

X(j) ≥ 0 for all j, and ∑_{j=1}^{d} X(j) = 1.    (6)

These constraints enable the score vector X to be regarded as a probability vector. Therefore, the score vector can be used as a particle's location.
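The information-theoretic quantities above can be estimated from samples as follows. This is a generic sketch of entropy and mutual information for discrete variables, not the quadratic-programming solver of [39].

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(a) = -sum_i p(a_i) log2 p(a_i), estimated from samples."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b), with H(a, b) the joint entropy."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# A feature identical to a binary label carries exactly H(label) = 1 bit.
label   = [0, 0, 1, 1]
feature = [0, 0, 1, 1]
mutual_information(feature, label)  # -> 1.0
```

A feature independent of the label (e.g., `[0, 1, 0, 1]` against the label above) yields zero mutual information, which is why such features receive low scores.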

Experimental Settings
We conducted experiments using 16 datasets from the RCV1 and Yahoo collections, which comprise over 10,000 features. We used the top 2% and 5%, respectively, of the features with the highest document frequency, because the categorization performance is not affected significantly by the removal of the remaining features [40,41]. The datasets contain text data with multiple labels, where the labels correspond to specific subjects related to the document. In the text data, each feature corresponds to the frequency of a word within the document. Table 3 presents the standard statistics for the multi-label datasets used in our experiments: the number of patterns in the dataset |W|, the number of features |F|, the feature type, and the number of labels |L|. In addition, the label cardinality Card represents the average number of labels for each pattern, the label density Den is the label cardinality over the total number of labels, and Distinct indicates the number of unique label subsets in L. The experiments conducted in this study included only text data. We compared the proposed method with two hybrid feature selection methods and a PSO-based feature selection method: EGA + CDM [16], bALO-QR [18], and the competitive swarm optimizer (CSO) [9], respectively. EGA + CDM combines an enhanced genetic algorithm (EGA) [16] with a class discriminating measure (CDM). bALO-QR initializes the ants in a binary ant lion optimizer (bALO) [42] using the quick reduct (QR). CSO is a PSO-based method that uses multiple swarms. For each method, the parameters were set to the values recommended in the original study, and a problem transformation enabled each label subset to be treated as a single class when calculating each filter algorithm, because these filters were designed to handle single-label datasets. To prevent bias, we set the maximum permissible FFCs to 300 and the maximum number of selected features to 50.
The population size was set to 30. To evaluate the quality of the feature subsets obtained by each method, we used the multi-label naive Bayes (MLNB) [43] and extreme learning machine for multi-label (ML-ELM) [44] classifiers with the holdout validation method. For each dataset, 80% of the data was selected as the training set, and the remaining 20% was used as the test set. We performed each experiment 10 times and used the average value to represent the categorization performance of each feature selection method.
In the proposed method, to demonstrate the superiority of the information-theoretic multi-label filter operator for improving the search capability, we employed an additional frequency-based filter operator, namely the normalized difference measure [23]. In our experiments, it competes with the evolutionary operator as well as the information-theoretic filter operator; a comparison between the operators is described in Section 5. For each of the three operators, we set the size of the corresponding particle group to 10.
To evaluate the performance of each feature selection method, we employed four evaluation metrics: Hamming loss, one-error, multi-label accuracy, and subset accuracy [45][46][47]. Let T = {(w_i, λ_i) | 1 ≤ i ≤ |T|} be a specified test set, where λ_i ⊆ L is the correct label subset of w_i. Given a test sample w_i, a predicted label set Y_i ⊆ L is estimated by a classifier such as MLNB. In detail, a family of |L| functions {f_1, f_2, ..., f_|L|} is induced from the multi-label training examples.
Here, each function f_k determines the class membership of l_k with respect to each instance, i.e., Y_i = {l_k | f_k(w_i) > θ, 1 ≤ k ≤ |L|}, where θ is a predefined threshold. Using the correct label subsets and predicted label sets, we can compute the four metrics. The Hamming loss is defined as follows:

Hamming loss = (1/|T|) ∑_{i=1}^{|T|} (1/|L|) |λ_i Δ Y_i|,    (7)

where Δ denotes the symmetric difference between two sets. The one-error is defined as

One-error = (1/|T|) ∑_{i=1}^{|T|} [ arg max_{l_k ∈ L} f_k(w_i) ∉ λ_i ],    (8)

where [·] returns a value of one if the proposition stated in the brackets is true, and zero otherwise. The multi-label accuracy is calculated as

Multi-label accuracy = (1/|T|) ∑_{i=1}^{|T|} |λ_i ∩ Y_i| / |λ_i ∪ Y_i|.    (9)

Finally, the subset accuracy is defined as

Subset accuracy = (1/|T|) ∑_{i=1}^{|T|} [ Y_i = λ_i ].    (10)

Higher values of the multi-label accuracy and subset accuracy and lower values of the Hamming loss and one-error indicate higher performance.
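The four metrics can be sketched directly from their definitions. Representing each label subset as a Python set and the function names are illustrative choices.

```python
def hamming_loss(true_sets, pred_sets, num_labels):
    """Average fraction of labels in the symmetric difference (lower is better)."""
    T = len(true_sets)
    return sum(len(t ^ p) for t, p in zip(true_sets, pred_sets)) / (T * num_labels)

def one_error(true_sets, top_ranked_labels):
    """Fraction of samples whose top-ranked label is not in the true set."""
    return sum(top not in t for t, top in zip(true_sets, top_ranked_labels)) / len(true_sets)

def multilabel_accuracy(true_sets, pred_sets):
    """Average Jaccard overlap |t & p| / |t | p| (higher is better)."""
    return sum(len(t & p) / len(t | p) for t, p in zip(true_sets, pred_sets)) / len(true_sets)

def subset_accuracy(true_sets, pred_sets):
    """Fraction of samples whose predicted set matches the true set exactly."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)

true_ = [{0, 2}, {1}]
pred  = [{0, 2}, {1, 2}]
# hamming_loss(true_, pred, 3) -> 1/6; multilabel_accuracy -> 0.75; subset_accuracy -> 0.5
```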
We conducted statistical tests to compare the proposed method with previous techniques. First, we employed the widely used Friedman test to compare multiple methods [38]. Based on the average rank of each method, the null hypothesis that all the methods perform equally well is either rejected or accepted. When the null hypothesis was rejected, we proceeded with a post-hoc test to analyze the relative performance of the compared methods [38]. Specifically, we employed the Bonferroni-Dunn test, which compares the difference between the average ranks of the proposed method and each other method [48]. For the Bonferroni-Dunn test, the performances of the proposed method and another method are regarded as statistically similar if their average ranks over all datasets are within one critical difference (CD). In our experiments, the CD was 1.093 [38].
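The Friedman statistic and the Bonferroni-Dunn critical difference can be sketched as follows. The average-rank computation ignores ties for simplicity, and q_α = 2.394 is the standard Bonferroni-Dunn value for comparing k = 4 methods at α = 0.05, which reproduces the CD of 1.093 for 16 datasets reported above.

```python
from math import sqrt

def average_ranks(scores):
    """scores[i][j]: performance of method j on dataset i (higher is better).
    Returns each method's average rank (rank 1 = best); ties are ignored."""
    k = len(scores[0])
    ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for r, j in enumerate(order, start=1):
            ranks[j] += r
    return [r / len(scores) for r in ranks]

def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic over k methods and n_datasets datasets."""
    k = len(avg_ranks)
    return (12 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)

def bonferroni_dunn_cd(k, n_datasets, q_alpha=2.394):
    """Critical difference CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n_datasets))

round(bonferroni_dunn_cd(4, 16), 3)  # -> 1.093
```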

Comparison Results
Tables 4-11 present the experimental results for the proposed method and the other methods on the 16 multi-label text datasets, reported as the average performance with the corresponding standard deviations. In Tables 4-7 and Tables 8-11, MLNB and ML-ELM, respectively, are used as classifiers. The highest performance is shown in bold font and indicated by a check mark. Finally, Tables 12 and 13 contain the Friedman statistics and the corresponding critical values on each evaluation measure for each classifier. Here, we set the significance level to α = 0.05. In Figures 2 and 3, the CD diagrams illustrate the relative performance of the proposed method and the other methods. Herein, the average rank of each method is marked along the upper axis, with the higher ranks placed on the right side of each subfigure. We also present the CD from the perspective of the proposed method above the graph; the methods within this range are not significantly different from the proposed method [38]. Methods whose differences are not significant are connected by a thick line. From the results in Tables 4-11, it is evident that the proposed method outperformed the state-of-the-art feature selection methods on most of the multi-label text datasets. For MLNB, the proposed method achieved the highest performance on 94% of the datasets in terms of Hamming loss, and on all datasets in terms of one-error, multi-label accuracy, and subset accuracy. For ML-ELM, the proposed method achieved the highest performance on all datasets in terms of Hamming loss and one-error, and on 94% of the datasets in terms of multi-label accuracy and subset accuracy. Consequently, the proposed method consistently achieved the highest average rank in all the experiments. As shown in Figure 2, the proposed method significantly outperformed all other algorithms in terms of one-error, multi-label accuracy, and subset accuracy for MLNB.
As shown in Figure 2a, the proposed method was significantly better than EGA + CDM and bALO-QR in terms of Hamming loss for MLNB. Figure 3 shows that the proposed method significantly outperformed all the other algorithms in terms of Hamming loss, one-error, and multi-label accuracy for ML-ELM. Figure 3d shows that the proposed method was significantly better than EGA + CDM and bALO-QR in terms of subset accuracy for ML-ELM.
In summary, the experimental results demonstrate that the proposed method outperformed the three reference algorithms on 16 text datasets. Statistical tests verified that the proposed method was significantly superior to the other methods in terms of one-error, multi-label accuracy, and subset accuracy for MLNB and in terms of Hamming loss, one-error, and multi-label accuracy for ML-ELM.

Analysis for Engagement of the Evolutionary and Filter Operators
To describe the competition results in each iteration, Figure 4 shows the engagement of each operator during the search process, where each engagement is averaged over 10 experimental trials with MLNB. Specifically, the engagement refers to the number of times the evolutionary operator and the two filter operators modify the particles in each iteration. As shown in Figure 4, the effectiveness of the operators varied according to the progress of the search on a specified dataset. This indicates that the capability of the evolutionary search and the performance of a filter method can vary, a situation that can be intensified in text applications owing to the sparsity of the data. Figure 4a-e shows that the filter operator could rapidly improve the particles on the RCV1 datasets in the early stages of the search process. Additionally, Figure 4m shows that the information-theory-based filter operator was engaged more frequently than the evolutionary operator in the early stages on the corresponding dataset. Moreover, the information-theory-based filter operator was engaged more frequently than the evolutionary operator across the entire search process on the datasets in Figure 4f-l,n-p, and the frequency-based filter operator was engaged more frequently than the evolutionary operator on the datasets in Figure 4h-j,o-p. In addition, the information-theory-based filter operator was engaged more frequently than the frequency-based filter operator on 81% of the datasets in Figure 4. This demonstrates the superiority of the information-theoretic measure in improving the search capability.
This study was motivated by the idea that selective engagement via competition between the evolutionary and filter operators could improve the performance of the learning algorithm. To validate this, we conducted an additional experiment comparing the proposed method to a non-competitive reference algorithm. Specifically, the particle groups were initialized as in the proposed method; however, during the search process, the evolutionary and filter operators modified the particles equally often, rather than according to the outcome of a competition. We set the maximum permissible number of FFCs to 300 and the size of each particle group to 10, as stated in Section 4. Figure 5 compares the subset accuracy of the proposed and reference algorithms on the 16 datasets using MLNB; the vertical axis indicates the subset accuracy. To determine whether the two methods differed statistically on each dataset, we conducted a Wilcoxon rank sum test [49] on the results of the 10 repeated experiments; the corresponding p-values are shown in each subfigure. As shown in Figure 5, the additional experiment demonstrates that the competitive engagement of the operators improves the search capability.

Discussion
The main contribution of this study is a new process that estimates the relative effectiveness of the evolutionary and filter operators in advance and selectively engages them in each iteration to improve the hybrid search. Our method compares the fitness of particles modified by each operator and selects the operator to be applied according to the outcome of this tournament.
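The tournament described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evolutionary_op`, `filter_op`, and `fitness` are hypothetical stand-ins for the actual PSO update rule, the filter-based update, and the multi-label fitness evaluation used in our experiments.

```python
import random

def tournament_select_operator(particle, operators, fitness):
    """Apply each candidate operator to the particle, evaluate the
    resulting feature subsets, and keep the operator whose subset
    achieves the highest fitness (the tournament winner).

    `operators` maps an operator name to a function that returns a
    modified feature subset (a set of feature indices)."""
    candidates = {name: op(particle) for name, op in operators.items()}
    winner = max(candidates, key=lambda name: fitness(candidates[name]))
    return winner, candidates[winner]

# --- Toy stand-ins for demonstration only ----------------------------
def evolutionary_op(subset):
    # Flip one random feature in or out (stand-in for a PSO update).
    s = set(subset)
    s.symmetric_difference_update({random.randrange(20)})
    return s

def filter_op(subset):
    # Add the highest-ranked unselected feature; features are assumed
    # pre-ranked 0..19 (stand-in for an information-theoretic filter).
    s = set(subset)
    for f in range(20):
        if f not in s:
            s.add(f)
            break
    return s

def fitness(subset):
    # Stand-in score: reward highly ranked (low-index) features and
    # lightly penalize subset size.
    return sum(1.0 / (f + 1) for f in subset) - 0.01 * len(subset)

random.seed(0)
winner, new_subset = tournament_select_operator(
    {3, 7},
    {"evolutionary": evolutionary_op, "filter": filter_op},
    fitness)
print(winner, sorted(new_subset))
```

Because the tournament operates on a name-to-function mapping, additional filter operators (e.g., frequency-based and information-theory-based) can compete in the same way by adding further entries to the dictionary.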
The proposed method has the following advantages. By applying each operator selectively, it reduces the number of feature subsets that are discarded because they were not improved by modification; that is, it increases the number of iterations in which the fitness improves. In addition, comparing the effectiveness of the operators requires no additional computations, because the fitness values used in the tournament are obtained from the normal evaluation of the modified particles. When the evolutionary operator is more effective than the filter operator, increasing its engagement permits more evolution-based particles to explore and exploit promising locations; in the converse case, increasing the engagement of the filter operator raises the probability that important features are selected. In this regard, the proposed method may be more stable than conventional hybrid methods in feature selection tasks. Other PSO variants, such as predator-prey PSO [50,51], can be incorporated into our method. For example, multi-swarm PSO methods, such as the competitive swarm optimizer [9], can be applied by dividing the evolution-based particles into multiple swarms. Similarly, other filter methods can be incorporated by using multiple filter operators.
In this study, a method for estimating the superiority of each operator was developed to improve the effectiveness of the hybrid search. After the operator to be applied is selected, a new feature subset is created and evaluated. Thus, the proposed method selects between two feasible feature subsets: one modified by the evolutionary operator and the other modified by the filter operator. This concept originates from the well-established informed best-first search [52]: when the algorithm encounters several nodes to visit, one is selected on the basis of its potential, typically measured by a heuristic function or process. In our experiments, the superiority of one operator over the other was determined by a fitness-based tournament, and the effectiveness of this design was verified by the fact that the proposed method outperformed the reference algorithms.
Tables 4-11 reveal that the proposed method outperformed the three state-of-the-art methods, demonstrating that it is an effective feature selection method. Figure 5b-e shows that the proposed method exhibited higher exploration and exploitation capability than the reference algorithm as the search progressed. This is because, as shown in Figure 4, the evolution-based particles generated better feature subsets than the filter-based particles on the RCV1 dataset, and increasing the engagement of the evolutionary operator permitted more evolution-based particles to explore and exploit promising locations. Figures 4j-l and 5j-l show that, when the effectiveness of the filter operator was higher than that of the evolutionary operator, increasing its engagement aided the selection of important features. Overall, the experimental results demonstrate that the competitive engagement of the operators successfully improves the search performance.

Conclusions
Most conventional hybrid approaches for multi-label feature selection do not consider the relative effectiveness of the evolutionary and filter operators. In this study, we proposed a novel competitive hybrid approach for multi-label text feature selection that improves learning performance through the selective engagement of the operators via competition. The experimental results and statistical tests verified that the proposed method significantly outperformed three state-of-the-art feature selection methods on 16 multi-label text datasets.
Future research will focus on applications of our approach. Although the proposed method was designed for multi-label text feature selection, the competitive hybridization strategy is not specific to text data and can be applied to other feature selection scenarios.