Multi-Population Genetic Algorithm for Multilabel Feature Selection Based on Label Complementary Communication

Multilabel feature selection is an effective preprocessing step for improving multilabel classification accuracy, because it highlights discriminative features for multiple labels. Recently, multi-population genetic algorithms have gained significant attention in feature selection studies, owing to the enhanced search capability, relative to traditional genetic algorithms, that results from communication among multiple populations. However, conventional methods employ a simple communication process without adapting it to the multilabel feature selection problem, which results in poor-quality final solutions. In this paper, we propose a new multi-population genetic algorithm based on a novel communication process specialized for the multilabel feature selection problem. Our experimental results on 17 multilabel datasets demonstrate that the proposed method is superior to other multi-population-based feature selection methods.


Introduction
Multilabel feature selection (MLFS) involves the identification of important features that depend on a given set of labels. It is often used as an effective preprocessing step for complicated learning processes, because noisy features in relationships between multiple labels can be eliminated from subsequent training, resulting in improved multilabel classification performance [1,2]. Given an original feature set F = {f_1, . . . , f_|F|}, MLFS identifies a feature subset S ⊂ F composed of n ≪ |F| features that are dependent on the label set L = {l_1, . . . , l_|L|}. Conventional studies on MLFS have indicated that population-based evolutionary algorithms are promising, owing to their global search capability [3,4].
Conventional genetic algorithms that are based on a single population have suffered from premature convergence of the population, resulting in local optimal solutions [5]. Multi-population genetic algorithms (MPGAs) have recently gained significant attention as a means for circumventing the aforementioned issue. This is because they enable one sub-population to avoid premature convergence by referencing individuals or solutions from other sub-populations [6][7][8]. With regard to the feature selection problem, the communication process would improve the search capability of the sub-populations, because they can acquire hints regarding important features by referencing the best individuals from other sub-populations [9].
To the best of our knowledge, most studies have used a traditional communication process to solve the MLFS problem, even though it is intended for solving a single-label feature selection problem. The contributions of this study are summarized as follows:
• We proposed an MPGA that specializes in solving the MLFS problem by introducing a novel communication process and improving the update process.
• We introduced a new concept of label complementarity, derived from the fact that feature subsets with a high discriminating power for different label subsets can complement each other.

Related Work
Recent MLFS methods can be broadly classified into filter-based and wrapper-based methods. The filter-based methods assess the importance of features through their own measures based on feature and label distributions; thereafter, the top-n features with the highest scores are selected. Li et al. [10] proposed a granular MLFS method that attempts to select a more compact feature subset using information granules of the labels instead of the entire label set. Kashef and Nezamabadi-pour [11] proposed a Pareto dominance-based multilabel feature filter for online feature selection, in which features arrive sequentially. Gonzalez-Lopez et al. [12,13] proposed distributed models that measure the quality of each feature based on mutual information on Apache Spark. Seo et al. [14] proposed a generalized information-theoretic criterion for MLFS. They introduced an entropy approximation generalized to a cardinality chosen by users based on the trade-off between approximation precision and computational cost. However, the classification performance of these methods is limited, because they work independently of the subsequent learning algorithm.
In contrast, wrapper-based methods evaluate the superiority of candidate feature subsets based on a specific learning algorithm, such as a multilabel naive Bayes classifier [15]. They generally outperform the filter-based methods in terms of classification accuracy [16]. Among the wrapper-based methods, population-based evolutionary search methods are frequently used for feature selection, owing to their stochastic global search capability [17]. Lu et al. [18] proposed a new functional constriction factor to avoid premature convergence in traditional particle swarm optimization. Mafarja and Mirjalili [19] proposed binary variants of a whale optimization algorithm and applied them to feature selection. Nakisa et al. [20] used five population-based methods to determine the best subset of electroencephalogram features. Dong et al. [21] improved a genetic algorithm using granular computing to select important features in high-dimensional data with a low sample size. Moreover, Lim and Kim [22] proposed an initialization method for evolutionary search-based MLFS algorithms by approximating conditional mutual information. Lee et al. [23] introduced a score function to deal with multilabel text datasets without problem transformation in a memetic search. However, these single-population-based methods suffer from premature convergence of the population, resulting in limited search capability. Although methods such as the multi-niche crowding genetic algorithm [24] can be used to mitigate premature convergence, they are still sensitive to the initialization of the population.
To resolve these issues, recent single-label feature selection studies have considered multi-population-based methods that use multiple isolated sub-populations. Ma and Xia [25] proposed a tribe competition-based genetic algorithm that attempts to ensure the diversity of solutions by allowing the sub-populations to generate feature subsets with different numbers of features. Additionally, it explores the entire search space by competitively allocating computing resources to the sub-populations. Zhang et al. [26] proposed an enhanced multi-population niche genetic algorithm. To avoid local optima, it includes a process of exchanging the best individuals, or solutions, between the sub-populations during the search process. It also reduces the chance of similar individuals being selected as parents, based on the Hamming distance. Wang et al. [27] proposed a bacterial colony optimization method considering a multi-dimensional population. Similar to the study conducted by Ma and Xia, the entire search space was divided based on the number of selected features, and the sub-populations explored different search spaces. Table 1 summarizes the notation used for describing the proposed method. Conventional MPGAs for single-label feature selection entail the following processes.

Table 1. Notation used for describing the proposed method.

Term	Meaning
P_k	The k-th sub-population
v_k	Fitness values for the individuals of P_k
A_k	Label-specific accuracy matrix for the individuals of P_k

Step 1: Initialization of sub-populations. Each sub-population P_k consists of a pre-defined number of individuals, and each individual represents a feature subset. For example, in a genetic algorithm, each individual is represented as a binary vector called a chromosome, which comprises ones and zeros representing selected and unselected features, respectively. In particle swarm optimization, each individual is represented as a probability vector; the components of a particle are regarded as the probabilities that the corresponding features will be selected. In most studies, the individuals are initialized randomly.
Step 2: Evaluation using a fitness function. The individuals of each sub-population can be evaluated using a fitness function. Given a feature subset represented by each individual ind_i, a learning algorithm, such as a naive Bayes classifier, is trained, and the trained classifier is used to predict the label for each test pattern. Given a correct label and the predicted label, a fitness value can be computed using evaluation metrics such as accuracy. Intuitively, a feature subset that results in better single-label prediction has a better fitness value.
Step 3: Communication among sub-populations. The sub-populations communicate with each other based on the best individuals in terms of the fitness value. In each sub-population, the worst individual (with the lowest fitness value) is replaced by the best individual of another sub-population.
Step 4: Sub-population update. The individuals generate offspring via genetic operators. First, each sub-population chooses the parents based on fitness values. For example, roulette-wheel selection uses the fitness-value share of each individual in each sub-population as the probability that the individual will be chosen as a parent. Subsequently, the offspring are generated via crossover of the parents or mutation.
Whenever the individuals are modified in Step 4, they are evaluated in the same manner as in Step 2. During the search process, MPGAs repeat Step 3 → Step 4 → Step 2 until a stopping criterion is met. On the left side of Figure 1, the aforementioned process is presented as a flowchart.
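As a concrete reference point, the four steps above can be sketched as a minimal loop. This is a sketch under stated assumptions, not the exact scheme of any cited method: the callables `init_pop`, `fitness`, `crossover`, and `mutate`, the ring-style choice of donor sub-population in Step 3, and the best-so-far bookkeeping are all illustrative.

```python
import random

def mpga(init_pop, fitness, crossover, mutate, m=5, pop_size=10, max_iter=100):
    """Minimal sketch of the conventional MPGA loop (Steps 1-4).
    All callables and the ring-style donor choice are illustrative."""
    # Step 1: initialize m isolated sub-populations.
    subpops = [[init_pop() for _ in range(pop_size)] for _ in range(m)]
    best = max((ind for pop in subpops for ind in pop), key=fitness)
    for _ in range(max_iter):
        # Step 2: evaluate every individual with the fitness function.
        scores = [[fitness(ind) for ind in pop] for pop in subpops]
        # Step 3: communication -- replace each sub-population's worst
        # individual with the best individual of another sub-population.
        for k in range(m):
            donor_best = max(subpops[(k + 1) % m], key=fitness)
            worst = min(range(pop_size), key=lambda i: scores[k][i])
            subpops[k][worst] = donor_best[:]
        # Step 4: update -- roulette-wheel parent selection, then
        # crossover and mutation produce the next generation.
        for k in range(m):
            weights = [fitness(ind) for ind in subpops[k]]
            nxt = []
            while len(nxt) < pop_size:
                p1, p2 = random.choices(subpops[k], weights=weights, k=2)
                nxt.extend(mutate(c) for c in crossover(p1, p2))
            subpops[k] = nxt[:pop_size]
        # Track the best individual seen so far across all sub-populations.
        cand = max((ind for pop in subpops for ind in pop), key=fitness)
        if fitness(cand) > fitness(best):
            best = cand[:]
    return best
```

On a toy OneMax-style fitness (count of ones in a binary chromosome), the loop steadily drives the tracked best individual toward the all-ones string.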

Motivation and Approach
We designed a novel MPGA that specializes in solving the MLFS problem. To extend the benefits of the communication process used in conventional studies to the MLFS problem, the following issues should be considered:

• Through communication between the sub-populations, the discriminating power for multiple labels should be complemented, and the referenced individuals should be used to generate offspring that are superior to the previous generation.
• Feature subsets with a high discriminating power for different label subsets can complement each other. Therefore, each sub-population should refer to an individual with the highest discriminating power for a subset of labels that are relatively difficult to discriminate, resulting in an improved search capability for MLFS.
Existing fitness-based parent selection methods may not fully use the individuals referenced from other sub-populations, because in our method these individuals are selected regardless of fitness. This issue can be resolved by ensuring that one of the important individuals in each sub-population is involved in generating the offspring.

Figure 1 presents a schematic overview of the proposed MPGA for solving the MLFS problem. Particularly, we modified the communication and update processes of the existing MPGA. First, with regard to sub-population communication, the conventional method communicates by exchanging the best individuals among sub-populations. Specifically, the sub-population P_1 imports the best individual ind_4 of P_2; then, the worst individual ind_3 is replaced by ind_4 of P_2. Similarly, ind_2 of P_2 is replaced by ind_1 of P_1. In the proposed label complementary communication for MLFS, the evaluation of the individuals is performed similarly to that in conventional methods for single-label feature selection; however, the learning algorithm is replaced by a multilabel classification algorithm, such as a multilabel naive Bayes classifier (MLNB) [15], which uses a series of functions that predict each label. Therefore, the discriminating power corresponding to each label can be obtained by reusing the learning algorithm that was trained to evaluate the fitness values of individuals; a detailed description of this process is presented in Section 3.3. As shown in Figure 1, the best individual ind_1 of P_1 lacks sufficient classification performance with regard to the label l_2. To complement the discriminating power with regard to l_2, P_1 refers to the individual ind_3 of P_2, which best discriminates l_2.
In the sub-population updating step, the conventional method stochastically or deterministically selects the parents of P_1 via fitness-based selection. Here, the individual ind_3 imported from P_2 is selected and used with high probability, because it had the highest fitness in P_2. In contrast, in the proposed label complementary updating step, the complementary individual ind_c referenced from P_2 is chosen regardless of fitness. Because the important individuals of P_1 are the complementary individual ind_c and the best individual ind_1, one of them is selected as a parent. In other words, one of the important individuals is always involved in the generation of offspring. For diversity, another parent is selected from the remaining individuals at random. Finally, the selected parents generate offspring using a genetic operator.
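The communication rule described above (find the labels on which the best individual is weakest, then fetch the most complementary individual from the other sub-populations) can be written compactly as follows; the array shapes, the function name, and the (pop_id, matrix) interface are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def complementary_individual(A_self, best_idx, other_pops_A, n_weak=None):
    """Sketch of label complementary communication.
    A_self: (|P| x |L|) label-specific accuracy matrix of this sub-population;
    best_idx: row index of its best individual; other_pops_A: list of
    (pop_id, accuracy matrix) pairs for the other sub-populations."""
    n_labels = A_self.shape[1]
    if n_weak is None:
        n_weak = n_labels // 2            # |L_e| = floor(|L| / 2)
    # L_e: the labels on which the best individual is weakest.
    L_e = np.argsort(A_self[best_idx])[:n_weak]
    # c_i: degree of complementarity, summed accuracy over the weak labels.
    best_c, best_pop, best_ind = -1.0, None, None
    for pop_id, A in other_pops_A:
        c = A[:, L_e].sum(axis=1)
        i = int(np.argmax(c))
        if c[i] > best_c:
            best_c, best_pop, best_ind = float(c[i]), pop_id, i
    return best_pop, best_ind, L_e
```

With accuracy matrices patterned after the worked example later in the paper (the best individual of P_1 weakest on l_3, and an individual of P_2 with c = 0.87 on that label), the function returns that individual of P_2.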
If an MPGA begins with promising initial sub-populations, a good-quality feature subset can be found in less time than when it begins with randomly initialized sub-populations. In this study, we introduce a simple but effective initialization method. Given an original feature set F = {f_1, . . . , f_|F|} and the number of sub-populations m, the spherical k-means algorithm partitions F into m clusters [28]; each cluster is composed of different features without overlap, such that |C_1| + |C_2| + · · · + |C_k| + · · · + |C_m| = |F|. Subsequently, each sub-population P_k is initialized via repeated entropy-based stochastic sampling from cluster C_k. Section 3.3 presents a detailed description of the sampling process.
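A minimal sketch of the entropy-based stochastic sampling for one sub-population, assuming (as Equation (1) suggests) that feature entropies are normalized within each cluster into sampling probabilities; the function name and interface are hypothetical.

```python
import numpy as np

def init_subpopulation(cluster_entropies, pop_size, n_select, rng):
    """Sketch: initialize one sub-population over one feature cluster.
    cluster_entropies: entropy H(f_i) of each feature in the cluster
    (zero-entropy features are assumed to be removed beforehand)."""
    H = np.asarray(cluster_entropies, dtype=float)
    p = H / H.sum()                       # importance scores p_i in [0, 1]
    pop = np.zeros((pop_size, H.size), dtype=int)
    for i in range(pop_size):
        # Stochastic sampling: pick n distinct features, biased by entropy.
        chosen = rng.choice(H.size, size=n_select, replace=False, p=p)
        pop[i, chosen] = 1                # binary chromosome over the cluster
    return pop
```

Each chromosome then selects exactly n features, with high-entropy (more informative) features sampled more often.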

Algorithm
Algorithm 1 represents the pseudocode of the proposed method. Each individual (chromosome) is represented by a binary string composed of ones and zeros, representing selected and unselected features, respectively. For simplicity, each sub-population is represented as a set of individuals, i.e., P_k = {ind_1, . . . , ind_|P_k|}. Additionally, all of the sub-populations have the same number of individuals. In the initialization step (line 4), the individuals of each sub-population are initialized by Algorithm 2 and then evaluated to obtain their fitness values (line 6). In this study, the MLNB is used as the learning algorithm. Given the trained learning algorithm, the fitness values are computed according to the multilabel evaluation metrics detailed in Section 4.1. To evaluate the discriminating power corresponding to each label, our algorithm uses an accuracy metric used in the fitness evaluation of single-label feature selection methods. For each individual ind_i belonging to P_k, the label-specific accuracy vector a_i = [a_i1, . . . , a_i|L|] is computed by reusing the already trained learning algorithm; here, a_ij is the accuracy corresponding to the j-th label predicted by ind_i. Consequently, the label-specific accuracy matrix A_k ∈ R^{|P_k|×|L|} is computed across all individuals of P_k (line 7).
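The construction of the label-specific accuracy matrix A^k described above can be sketched as follows; `fit_predict` is a hypothetical stand-in for the already-trained multilabel classifier (e.g., the MLNB reused from fitness evaluation), and all names are illustrative.

```python
import numpy as np

def label_accuracy_matrix(subpop, X, Y, fit_predict):
    """Sketch of computing A^k for one sub-population.
    subpop: list of binary feature masks (chromosomes);
    X: (patterns x features) data; Y: (patterns x labels) binary label matrix;
    fit_predict: stand-in for the trained multilabel classifier, returning
    a binary prediction matrix shaped like Y."""
    A = np.zeros((len(subpop), Y.shape[1]))
    for i, mask in enumerate(subpop):
        # Restrict the data to the features selected by this individual.
        Y_hat = fit_predict(X[:, mask.astype(bool)], Y)
        # a_ij: per-label accuracy of individual i on label j.
        A[i] = (Y_hat == Y).mean(axis=0)
    return A
```

Row i of A then holds the vector a_i = [a_i1, . . . , a_i|L|] used by the communication step.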
Algorithm 1: Proposed multi-population genetic algorithm
1: input: multilabel dataset D, number of sub-populations m;
2: output: best feature subset;
3: t ← 0;
4: initialize sub-populations P_1(t), . . . , P_m(t); use Algorithm 2
5: for each sub-population P_k do
6:   v_k(t) ← evaluate P_k(t) using D; compute fitness values via a fitness function
7:   A_k(t) ← compute the label-specific accuracy matrix for individuals of P_k(t); reuse the fitness function
8: end for
9: while (not termination-condition) do
10:   for each sub-population P_k do
11:     ind_c ← communication(P_k(t), A); use Algorithm 3
12:     P_k(t + 1) ← update(P_k(t), ind_c); use Algorithm 4
13:     v_k(t + 1) ← evaluate P_k(t + 1) using D;
14:     A_k(t + 1) ← compute the label-specific accuracy matrix for individuals of P_k(t + 1);
15:     t ← t + 1;
16:   end for
17: end while
18: return the best individual as the feature subset;

Algorithm 2: Sub-population initialization
1: input: feature set F, number of sub-populations m;
2: output: initialized sub-populations P_1, . . . , P_m;
3: for each feature f_i ∈ F do
4:   if H(f_i) = 0 then
5:     remove f_i from F;
6:   end if
7: end for
8: C_1, . . . , C_m ← partition F into m clusters; use the spherical k-means algorithm
9: for k = 1 to m do
10:   for each individual ind_i ∈ P_k do
11:     ind_i ← initialize by selecting n features via stochastic sampling; use Equation (1)
12:   end for
13: end for
After the initialization process, the sub-populations complement each other via the proposed label complementary communication (line 11), i.e., Algorithm 3. Specifically, each sub-population identifies a complementary individual ind_c that can complement itself from the other sub-populations. Next, our algorithm updates the sub-populations using ind_c via Algorithm 4. All of the sub-populations repeat these processes until the termination condition is met. We use the number of fitness function calls (FFCs) as the termination criterion, and the algorithm conducts the search until the available FFCs are exhausted. Finally, Algorithm 1 outputs the best feature subset.

Algorithm 2 represents the initialization procedure for each sub-population. With regard to lines 3-7, if the entropy of any feature is zero, it is preferentially removed because it does not carry any information. Each cluster C_k of features is generated by the spherical k-means algorithm (line 8) and is used to initialize each sub-population P_k (lines 9-13). Given each feature f_i^k ∈ C_k, its importance score p_i^k ∈ [0, 1] is calculated as

p_i^k = H(f_i^k) / Σ_{f_j^k ∈ C_k} H(f_j^k),   (1)

where H(x) is the entropy of a variable x. Finally, each individual of P_k is initialized via stochastic sampling based on the importance scores (line 11).

Algorithm 3 illustrates the procedure for realizing the label complementary communication between the sub-populations for multiple labels. For simplicity, an input sub-population and the others are denoted as P and P', respectively. With regard to lines 3-4, our algorithm finds an index set L_e of labels for which the best individual ind_b in P yields the lowest accuracies, where the size of L_e is set to half the size of the entire label set, ⌊|L|/2⌋.
To find the complementary individual ind_c from the other sub-populations P', our algorithm computes the degree of complementarity c_i for each individual ind_i in P', where c_i is regarded as the discriminating power with regard to the labels in L_e. Specifically, c_i is calculated by adding the accuracies corresponding to the labels in L_e (line 6). In contrast with the simple communication of exchanging the best individuals, the individual ind_c referenced from the other sub-populations can complement the discriminating power of the sub-population P for the entire label set L, which results in an improved search capability for MLFS.

Algorithm 4: Sub-population update
1: input: sub-population P(t), complementary individual ind_c;
2: output: updated sub-population P(t + 1);
3: [o1, o2] ← generate offspring by crossover from ind_c and the best individual of P(t);
4: add o1 and o2 to P(t + 1);
5: while |P(t + 1)| < |P(t)| do
6:   p1 ← select one of ind_c and the best individual of P(t) at random;
7:   p2 ← select an individual at random among remaining individuals of P(t);
8:   [o1, o2] ← generate new offspring by crossover from the p1 and p2;
9:   add o1 and o2 to P(t + 1);
10: end while
11: P(t + 1) ← run a mutation on overlapping individuals;

Algorithm 4 represents the detailed procedure for generating new offspring. Because the complementary individual ind_c and the best individual in P(t) are considered important, our algorithm generates offspring from them once (lines 3-4). With regard to lines 6-7, our algorithm conducts parent selection to generate offspring. Particularly, the first parent is randomly selected between ind_c and the best individual; consequently, the important individuals are always involved in the generation of offspring. Furthermore, to generate diverse offspring, the other parent is selected from one of the remaining individuals. As shown in line 8, the selected parent pair generates offspring via a restrictive crossover method that is frequently used to control the number of selected features in feature selection [29]. Compared to updating based on fitness-based parent selection, our algorithm can generate offspring that are superior to the previous generation by actively using the complementary individual ind_c. The generated offspring are sequentially added to P(t + 1) (line 9).
To maintain the number of individuals in each sub-population, the generation process is repeated until the offspring are as numerous as the number of individuals in P(t), i.e., |P(t + 1)| = |P(t)|. Furthermore, as described in line 11, a restrictive mutation is conducted on overlapping individuals.
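One plausible realization of a restrictive crossover that preserves the number of selected features is sketched below. This is an assumption-laden sketch, not necessarily the exact operator of [29]: it keeps the features selected by both parents and splits the disputed selections between the two offspring at random.

```python
import random

def restrictive_crossover(p1, p2, rng=random):
    """Sketch of a restrictive crossover over binary chromosomes.
    Assumes both parents select the same number of features n, as in the
    proposed method; each offspring then also selects exactly n features."""
    common = [i for i, (a, b) in enumerate(zip(p1, p2)) if a == 1 and b == 1]
    disputed = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
    n = sum(p1)                       # features selected by each parent
    need = n - len(common)            # features each child still needs
    rng.shuffle(disputed)
    o1, o2 = [0] * len(p1), [0] * len(p1)
    for i in common:                  # shared selections go to both children
        o1[i] = o2[i] = 1
    for i in disputed[:need]:         # split the disputed selections
        o1[i] = 1
    for i in disputed[need:need * 2]:
        o2[i] = 1
    return o1, o2
```

For parents 10010 and 00101 (no shared selections), the four disputed positions are split two-and-two, so outcomes such as the 00110/10001 pair in the worked example are possible.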
Finally, we analyzed the time complexity of the proposed method. Most of the time is spent evaluating feature subsets, because the learning algorithm must be trained through complicated sub-procedures for multiple labels [30]. Because the numbers of training patterns and given labels are regarded as constants during the evaluation process, the computation time required to evaluate a feature subset S is determined by the number of selected features |S| ≤ n, i.e., O(nσ), where σ represents the basic time associated with evaluating a single feature [3]. Given the total number of individuals N_ind and the maximum number of iterations N_iter, the feature subset evaluation is conducted N_ind · N_iter times. Thus, the time complexity of the proposed method is O(N_ind · N_iter · nσ).

Algorithm: Example
We illustrate the proposed method on the multilabel toy dataset provided in Table 2 as a representative example. In the table, each text pattern w_i is relevant to multiple labels, where the labels are represented as one if relevant and zero otherwise. Specifically, the first pattern w_1 includes the terms "Music", "The", "Funny", and "Lovely", but not "Boring." This pattern can be assigned to the labels "Comedy" and "Disney" simultaneously. For simplicity, we set the number of sub-populations to two and the number of selected features to two. Additionally, the number of individuals in each sub-population was set to three. To focus on the communication process, the two sub-populations were initialized at random in the initialization step. The MLNB and multilabel accuracy are used to evaluate each individual; a detailed description of the evaluation metrics, including multilabel accuracy, is given in Section 4.1. Additionally, the fitness values v_k for each sub-population P_k are calculated as the average over 10 repeated experiments, and the label-specific accuracy matrix A_k for P_k is calculated using the MLNB that was pretrained for fitness evaluation.

Table 2. Multilabel toy dataset used in the example: features Boring, Music, The, Funny, and Lovely; labels Comedy, Documentary, and Disney.

In the communication process for P_1, our algorithm determines the index set L_e of labels for which the lowest accuracies are yielded by the best individual ind_1 = 10010, which has the highest fitness of 0.65 in P_1. We indicate important individuals in the sub-population P_1 using bold font. In A_1 = (a_ij), ind_1 has its lowest accuracy, 30%, for l_3, i.e., min_{l_k ∈ L} a_1k = 0.30 at k = 3; because |L_e| = ⌊|L|/2⌋ = ⌊3/2⌋ = 1, we have L_e = {3}. To complement P_1, our algorithm finds the complementary individual ind_c from P_2. Based on A_2 and L_e, the degree of complementarity c_i is computed for each individual ind_i of P_2. Because the individual ind_3 belonging to P_2 has c_3 = 0.87, the complementary individual for P_1 is ind_c = 00101. Conventional methods would import the best individual ind_1 = 00110 of P_2, which in our example exhibits a low accuracy of 40% for l_3. However, our method refers to ind_3 = 00101 of P_2, which has the highest accuracy with regard to l_3. This indicates that our method can further complement the discriminating power of P_1 for multiple labels and increase the likelihood of avoiding local optima, resulting in improved multilabel accuracy. This process is similar for P_2.
In the update process, P_1 selects its best individual ind_1 and ind_c as the parental pair once. Next, one of ind_1 or ind_c is selected as a parent, and one of ind_2 or ind_3 is selected as the other parent at random. The selected parent pair generates offspring via the genetic operators used in conventional methods. Given ind_1 = 10010 and ind_c = 00101 as the parent pair, our algorithm generates the offspring 00110 and 10001 via the restrictive crossover. As a result, the feature subset {f_1, f_5} represented by the offspring 10001 achieved a multilabel accuracy of 91%. This search process is repeated until the stopping criterion is met.

Datasets and Evaluation
We conducted experiments using 17 multilabel datasets corresponding to various domains; these datasets can be obtained from http://mulan.sourceforge.net/datasets-mlc.html [31]. Specifically, the Emotions dataset [32] consists of 8 rhythmic features and 64 timbre features. The Enron dataset [33] was sampled from a large email message set, the Enron corpus. The Genbase and Yeast datasets [34,35] contain information regarding the functions of biological proteins and genes. The Medical dataset [36] is a subset of a large corpus of clinical free text associated with suicide letters. The Scene dataset [37] has indexing information on still images containing multiple objects. The remaining 11 datasets were obtained from the Yahoo dataset collection [38] and are composed of more than 10,000 features each. Table 3 summarizes standard statistics for the 17 datasets used in our experiments: the number of patterns |W|, the number of features |F|, the types of features, and the number of labels |L|. If the feature type was numeric, we discretized the features using label-attribute interdependence maximization, a discretization method specialized for multilabel data [39]. The label cardinality Card. represents the average number of labels in each pattern, and the label density Den. is the label cardinality divided by the total number of labels. Further, Distinct. indicates the number of unique label subsets in L, and Domain represents the application related to each dataset. We compared the proposed method with three state-of-the-art multi-population-based methods that have exhibited promising performance for solving the feature selection problem: TCbGA [25], EMPNGA [26], and BCO-MDP [27]. We set the parameters for each method to the values used in the corresponding original studies. For fairness, we set the maximum number of allowable FFCs and the maximum number of selected features to 300 and 50, respectively. The total population size was set to 50.
The MLNB and a holdout cross-validation method were used in order to evaluate the quality of the feature subsets obtained by each method. Furthermore, 80% and 20% of each dataset were used as the training and test sets, respectively. We repeated each experiment 10 times and used the average value of the results. In the proposed method, we set the number of sub-populations to five; thus, each sub-population size was 10.
We used four evaluation metrics to evaluate the quality of the feature subsets: Hamming loss, one-error, multilabel accuracy, and subset accuracy [40-42]. Let T = {(w_i, λ_i) | 1 ≤ i ≤ |T|} be a given test set, where λ_i ⊆ L is the correct label subset associated with a pattern w_i. Given a test pattern w_i, a multilabel classifier, such as the MLNB, estimates a predicted label set Y_i ⊆ L. Specifically, a series of functions {g_1, g_2, . . . , g_|L|} is induced from the training patterns. Next, each function g_k determines the class membership of l_k with respect to each pattern, i.e., Y_i = {l_k | g_k(w_i) > θ, 1 ≤ k ≤ |L|}, where θ is a predetermined threshold, such as 0.5. The four metrics can be computed given λ and Y according to the test patterns. The Hamming loss is defined as

Hamming loss = (1/|T|) Σ_{i=1}^{|T|} (1/|L|) |λ_i Δ Y_i|,

where Δ denotes the symmetric difference between two sets. Furthermore, the one-error is defined as

one-error = (1/|T|) Σ_{i=1}^{|T|} [argmax_{l_k ∈ L} g_k(w_i) ∉ λ_i],

where [·] returns one if the proposition stated in the brackets is true and zero otherwise. The multilabel accuracy is defined as

multilabel accuracy = (1/|T|) Σ_{i=1}^{|T|} |λ_i ∩ Y_i| / |λ_i ∪ Y_i|,

which computes the Jaccard coefficient between two sets. Finally, the subset accuracy is defined as

subset accuracy = (1/|T|) Σ_{i=1}^{|T|} [λ_i = Y_i],

which determines whether the two sets are exactly identical. A superior feature subset will exhibit higher values of the multilabel and subset accuracies and lower values of the Hamming loss and one-error.
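The four metrics can be sketched directly from their definitions; the container names (`pairs`, `top_label`) are illustrative, and the Jaccard term assumes every union of correct and predicted label sets is non-empty.

```python
def evaluate(pairs, n_labels, top_label):
    """Sketch of the four multilabel metrics over a test set T.
    pairs[i] = (correct label set, predicted label set) for pattern i;
    top_label[i] is the label the classifier ranks first (for one-error)."""
    T = len(pairs)
    # Hamming loss: fraction of mispredicted labels (symmetric difference).
    hl = sum(len(y ^ yhat) for y, yhat in pairs) / (T * n_labels)
    # One-error: the top-ranked label is not among the correct labels.
    oe = sum(top_label[i] not in y for i, (y, _) in enumerate(pairs)) / T
    # Multilabel accuracy: Jaccard coefficient of correct vs. predicted sets.
    ml = sum(len(y & yhat) / len(y | yhat) for y, yhat in pairs) / T
    # Subset accuracy: exact match between the two sets.
    sa = sum(y == yhat for y, yhat in pairs) / T
    return hl, oe, ml, sa
```

For example, with patterns ({0,1} vs. {0}) and ({2} vs. {2}) over three labels, one label out of six is mispredicted, giving a Hamming loss of 1/6, while subset accuracy is 0.5.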
We conducted additional statistical tests to verify the statistical significance of our results. First, we conducted a paired t-test [43] at the 95% significance level to compare the proposed method with each of the other MLFS methods on each dataset; because there are three comparison algorithms, the paired t-test is performed three times. Here, each of the three null hypotheses (i.e., that two methods have equal performance) can either be rejected or accepted. We also performed the Bonferroni-Dunn test to compare the average ranks of the proposed and other methods [44]. If the difference between the average rank of a comparison method and that of the proposed method is within the critical difference (CD), its performance is considered similar to that of the proposed method. In our experiments, we set the significance level α to 0.05; thus, the CD is computed as 1.0601 [45].
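The stated CD value can be reproduced from the standard Bonferroni-Dunn formula CD = q_α √(k(k+1)/(6N)) with k = 4 compared methods and N = 17 datasets, where q_0.05 ≈ 2.394 is the tabulated two-tailed critical value for four methods.

```python
import math

# Bonferroni-Dunn critical difference: CD = q_alpha * sqrt(k (k + 1) / (6 N)).
q_alpha = 2.394   # tabulated two-tailed q_0.05 for k = 4 compared methods
k, N = 4, 17      # four methods, seventeen datasets
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 4))  # 1.0601
```

This matches the CD of 1.0601 used in the experiments.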

Comparison Results
Tables 4-7 present the experimental results of the proposed method and compare them with those of the other methods on the 17 multilabel datasets. The resulting values are reported as average performances with the corresponding standard deviations; the better average value on each dataset is indicated in bold font. In addition, for each dataset, the paired t-test was conducted at the 95% significance level; in Tables 4-7, markers indicate that the corresponding method is significantly worse (better) than the proposed method based on the paired t-test. Table 4 shows that the proposed method is statistically superior or similar to TCbGA on 88% of the datasets, and to EMPNGA and BCO-MDP on all datasets, in terms of the Hamming loss. Table 5 shows that the proposed method is statistically superior or similar to the other methods on 94% of the datasets in terms of the one-error. Particularly, Tables 6 and 7 show that the proposed method is statistically superior or similar to the other methods on all datasets in terms of the multilabel accuracy and the subset accuracy.

Figure 2 illustrates the CD diagrams, showing the relative performance of the four methods. Here, the horizontal axis represents the average rank of each method, where higher ranks are placed on the right side of each subfigure. In addition, the methods within the same CD as the proposed method are connected by a bold red line, which means that the difference among them is not significant. Figure 2b indicates that the proposed method significantly outperformed TCbGA and BCO-MDP in terms of the one-error. The results for the one-error indicate that the simple communication of exchanging the best individuals in the EMPNGA can also yield good results, because the one-error is evaluated based only on the label predicted with the highest probability. In contrast, Figure 2a

Analysis
We conducted an in-depth analysis to determine whether the proposed communication process is effective for solving the MLFS problem via additional experiments on eight datasets using the MLNB. To validate the effectiveness of label complementary communication in the proposed method, we designed Proposed-SC, which is equivalent to the proposed method except that it does not include the proposed communication process, i.e., Algorithm 3. Specifically, Proposed-SC uses the simple communication method of exchanging the best individuals and roulette-wheel selection as the fitness-based parent selection method. For improved readability, we refer to the proposed method described in Section 3 as Proposed-LCC. In addition, we designed Proposed-NC, which is equivalent to Proposed-SC except that it does not conduct any communication process. Figure 3 shows the search capability of each sub-population during the search process. The vertical axis indicates the multilabel accuracy for the best individual in each method; the baseline indicates the multilabel accuracy obtained by random prediction, averaged over 10 repetitions. As stated in Section 4, the maximum number of FFCs and the total number of individuals are 300 and 50, respectively. Therefore, the sub-populations communicate with each other every 50 FFCs. Additionally, the number of sub-populations is five.
As shown in Figure 3, Proposed-LCC exhibited a better search capability than Proposed-SC and Proposed-NC on the eight multilabel datasets. We note that Proposed-SC and Proposed-NC exhibited a similar level of search capability in MLFS, and Proposed-SC even revealed a worse search capability than the method without communication on the Education dataset. This implies that the simple communication method of exchanging the best individuals failed to deal with the multiple labels. In contrast, Proposed-LCC conducted effective MLFS searches. Particularly, in Figure 3c, the initial sub-populations of Proposed-LCC revealed relatively low multilabel accuracy (50 FFCs). This is because each sub-population consists of different features owing to our initialization method and thus may not be related to the entire label set. During the search process, Proposed-LCC exhibited an effective improvement in multilabel accuracy. This indicates that the proposed label complementary communication method can improve the search capability of the sub-populations by referencing individuals from other sub-populations based on the discriminating power of subsets with regard to labels that are difficult to classify.

We also conducted the paired t-test at the 95% significance level to determine whether the three methods were statistically different. For fairness, Proposed-SC and Proposed-NC also obtained results from 10 repetitions on the eight datasets. Figure 4 presents the pairwise comparison results on each dataset in terms of the multilabel accuracy; the p-values for each test are shown in each subfigure, and an asterisk indicates that the corresponding hypothesis was rejected. As shown in Figure 4, Proposed-LCC significantly outperformed Proposed-SC on seven datasets (all except the Recreation dataset) and outperformed Proposed-NC on all datasets. On the other hand, Proposed-SC and Proposed-NC showed statistically equal performance on all datasets.
As a result, the additional experiment and statistical test verify that the proposed label complementary communication successfully improves the search capability of the sub-populations with regard to MLFS.

Conclusions
In this paper, we proposed a novel MPGA, with label complementary communication, which specializes in solving the MLFS problem. It is aimed at improving the search capability of sub-populations through a communication process that employs the complementary discriminating powers of sub-populations with regard to multiple labels. Our experimental results and statistical tests verified that the proposed method significantly outperformed three state-of-the-art multi-population-based feature selection methods on 17 multilabel datasets.
Future studies can be conducted to overcome the limitation of the proposed method: we have simply set the number of labels to be complemented to half the total number of labels. As the search progresses, this value can be adjusted according to the improvement in the discriminating power for each label. For example, the proposed label complementary communication may only be conducted for labels for which the discrimination performance is not better than that in the previous generation.