Next Article in Journal
The Problem of Engines in Statistical Physics
Previous Article in Journal
Self-Organization, Entropy Generation Rate, and Boundary Defects: A Control Volume Approach
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:

A Multi-Objective Multi-Label Feature Selection Algorithm Based on Shapley Value

Department of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
Author to whom correspondence should be addressed.
Entropy 2021, 23(8), 1094;
Submission received: 21 July 2021 / Revised: 15 August 2021 / Accepted: 18 August 2021 / Published: 22 August 2021


Multi-label learning is dedicated to learning functions so that each sample is labeled with a true label set. With the increase of data knowledge, the feature dimensionality is increasing. However, high-dimensional information may contain noisy data, making the process of multi-label learning difficult. Feature selection is a technical approach that can effectively reduce the data dimension. In the study of feature selection, the multi-objective optimization algorithm has shown an excellent global optimization performance. The Pareto relationship can handle contradictory objectives in the multi-objective problem well. Therefore, a Shapley value-fused feature selection algorithm for multi-label learning (SHAPFS-ML) is proposed. The method takes multi-label criteria as the optimization objectives and the proposed crossover and mutation operators based on Shapley value are conducive to identifying relevant, redundant and irrelevant features. The comparison of experimental results on real-world datasets reveals that SHAPFS-ML is an effective feature selection method for multi-label classification, which can reduce the classification algorithm’s computational complexity and improve the classification accuracy.

1. Introduction

Classification is an important technical task in pattern recognition. Traditional supervised learning mainly involves single-label classification. However, real-world problems are more complicated. Every sample is labeled with multiple labels. In order to study such objects, research on multi-label learning has emerged over the years. Multi-label learning methods were first used in text classification [1], and they have been applied to new applications such as image annotation [2,3], and biological information [4,5] with the development of research.
The process of multi-label learning is difficult, the size of the label set is uncertain and there is a certain correlation between the labels [6]. Multi-label learning methods are roughly divided into two categories: problem conversion and algorithm adaptation [7]. The original data is converted to problems that can be solved with single-label classifiers in the problem conversion method, such as label power-set method (LP) and binary relevance method (BR) [8]. The algorithm adaptation method does not need to transform the original data but improves the single-label learning methods to adapt to multi-label data, such as a lazy learning approach (MLKNN) [9] and a kernel method (RankSVM) [10]. The algorithm adaptation method does not destroy the original data and better consider the correlation between labels.
The combination of features is critical to the quality of the classification results. The original feature set may contain redundant features and irrelevant features. If the original features are directly input into the classifier, it may interfere with the classification decision of the classifier [11]. The objective of feature selection is to remove redundant features and irrelevant features, thereby reducing the dimensionality of the data and improving the classification accuracy [12]. Therefore, feature selection is able to avoid the disaster of dimensionality in single-label and multi-label data.
At present, most of the feature selection research focuses on single-label learning and has been integrated into practical problems. For example, Stai et al. used a weighted network graph structure to represent multimedia content features and feature weights [13]. The proposed framework structure can identify effective features through a relevance feedback mechanism and provide suitable recommendations. Hs et al. proposed a novel feature selection method that combines three different measurements. This method can qualitative information and can be effectively applied to intrusion detection scenarios [14]. Rauber et al. employed a feature extraction algorithm to extract features with strong discriminative information for bearing fault classification, and then used a greedy algorithm to remove irrelevant and redundant features. This framework has the potential to be extended to the industrial field [15].
Similarly, feature selection methods for multi-label data roughly fall into filter model, wrapper model and embedded model [16]. The filter model does not include a classifier in the evaluation process [17]. Different measurement methods are used to mine feature and label information, such as information entropy [18], correlation [19] and consistency measures [20]. Researchers apply the idea of feature selection study for single-label learning to multi-label task and many excellent algorithms such as multi-label feature selection based on max-dependency and min-redundancy (MDMR) [21], information gain based on the BR approach (IG_BR) [22] and multi-label informed feature selection method (MIFS) [6] are proposed. They are extensions of feature selection based on the maximum relevance and minimum redundancy criterion (mrmr) [23], feature selection via maximizing global information gain (IG) [24] and feature selection based on mutual information (MI) [25] respectively. There are also feature selection methods specifically designed for multi-label learning, such as manifold-based constraint Laplacian score (MCLS) [26] and manifold regularized discriminative multi-label feature selection algorithm MDFS [27]. Embedded model completes the feature reduction in multi-label learning [28,29].
Compared with filter feature selection, the wrapper type is better in classification accuracy, because the classification result is directly utilized to assess the feature subsets [30]. The evolutionary-based feature selection algorithms are popular because evolutionary algorithms have strong global optimization capabilities and can consider the combination of features. In feature selection, smaller-scale features are expected to obtain higher classification accuracy. However, traditional evolutionary algorithms can only set one fitness function, so multi-objective evolutionary algorithms have attracted attention, which can balance the relationship between multiple objectives [31]. The relationship between these objectives may be contradictory. Different from single-label learning, the assessment indicators of multi-label learning are intricate. Therefore, the classification accuracy criterion is specially used for multi-label data in the wrapper multi-label feature selection optimization algorithms.
A wrapper feature selection algorithm is proposed to decrease the features of the original multi-label samples in this paper. A classic multi-objective algorithm which is called the third generation of non-dominated genetic algorithm (NSGA III) [32] is employed to optimize three objectives of Hamming loss, average precision, and the scale of feature subset. The original NSGA III was first proposed for consecutive applications, but feature selection corresponds to the discrete optimization scenario. To cope with this problem, we have made improvements on the encoding method, crossover operator and mutation operator of NSGA III. In order to effectively identify irrelevant and redundant features, we introduce Shapley value to assess the contribution of the features and propose a crossover operator and a mutation operator based on Shapley value to make the global search and local search in a balanced state, thereby increasing the accuracy of the classifier results and improving the algorithm’s convergence speed. The main contributions of this work are presented as follows:
  • Shapley value and multi-objective multi-label feature selection are fused from two sides: feature and individual.
  • Two improved operators are proposed, which adaptively adjust the crossover and mutation probability by evaluating the features’ contribution and equate the algorithm’s global and local search.
  • An improved archive maintenance strategy is put forward to increase the convergence performance of the multi-objective optimization method.
  • Experiments on datasets of different scales prove the validity and adaptability of the proposed algorithm.
The rest of the paper is presented as follows: Section 2 analyzes the research status of multi-label feature selection and Shapely value in feature selection. Section 3 describes the basic knowledge of multi-objective optimization, Shapley value and multi-label learning. Section 4 introduces the objective functions, mutation operator, crossover operator, archive set maintenance strategy and the flow chart of the proposed algorithm SHAPFS-ML. Section 5 shows the experimental results on seven multi-label datasets. Section 6 gives a summary of this paper.

2. Related Works

The search ability of feature selection algorithm is an important factor that determines the quality of selected feature subsets. The exhaustive search selects the optimal feature subset by enumerating the possible combinations of features. Mnich et al. used multidimensional exhaustive analysis of the mutual information between features and labels [33]. This method requires a large number of samples and has a high computational cost. Although exhaustive searching can find the global optimal solution, it is inefficient. Heuristic search uses heuristic information to reduce the search range of the feature space. Hua et al. proposed an improved modified strong approximate Markov blanket method to remove redundant features, and then used sequential forward selection (SFS) method to remove irrelevant features [34]. Fa et al. proposed a backward selection (SBS) approach to eliminate a set of features that are not helpful for classification [35]. Both SFS and SBS are greedy algorithms which selects the current best solution and are can easily fall into the local optimum. Evolutionary computing technology belongs to the heuristic search. Such methods use a random search with heuristic information, which can obtain approximate optimal solutions. Common evolutionary computing methods include genetic algorithm (GA), particle swarm optimization (PSO), differential evolution (DE), etc. Therefore, we use evolutionary computing theory as the methodology in this paper.
In recent years, multi-objective optimization has been a research hotspot in the field of evolutionary computing, and it has been successfully applied to solve the problem of feature selection. Bing et al. proposed a multi-objective differential evolution algorithm to optimize the two objectives of reducing the number of selected features and the classification error rate [36]. Experimental results showed that the proposed algorithm can give a more trade-off solution set and improve the quality of the solution compared with single-objective optimization methods. Liam et al. proposed a binary multi-objective PSO algorithm for filter feature selection based on rough set theory [37]. The performance of this method is better than the traditional single-objective PSO. Nouri-Moghaddam et al. used a novel forest optimization algorithm (FOA) algorithm and designed multiple concepts to deal with the feature selection problem in a multi-objective optimization manner [38]. The experimental results showed that the proposed method performs better than other single-objective and multi-objective optimization methods.
Nowadays, compared with multi-label feature selection algorithms, there are more studies on single-label feature selection algorithms [39]. The investigation on multi-objective optimized multi-label feature selection algorithm has been proposed only recently, benefitting from the successful application of multi-objective optimization methods in single-label feature selection. In 2014, Yin et al. analyzed the contradictions between the two types of multi-label classification indicators and used the second generation of non-dominated genetic algorithm (NSGA II) to optimize Hamming loss and average precision [40]. The proposed algorithm performed better compared with other methods. In 2017, Zhang et al. presented a multi-objective particle swarm optimization method to cover the multi-label feature selection problem [41]. In order to enhance the multi-objective optimization algorithm’s performance, they proposed two operation operators. And the crowding distance mechanism was used for the maintenance of archives. The results showed the exploration ability of the proposed algorithm is better than that of NSGA II. In 2021, Bidgoli et al. proposed a discrete differential evolution method for multi-label feature selection and proposed a binary mutation operator to improve the multi-objective optimization’s global search capabilities [42]. The proposed method’s performance is verified from the assessment of the multi-objective algorithm and the accuracy for classification.
To assess features’ importance for classification, Shapley value in cooperative game theory is introduced. In 2005, Cohen et al. put forward a feature selection algorithm using the Shapley value. It utilized the Shapley value to iteratively calculate the validity of the feature, and features were selected through forward and backward elimination [43]. The forward elimination method achieved the highest accuracy among the experimental comparison algorithms. In 2016, Mokdad et al. designed a feature selection algorithm structure derived from the Shapley value [44]. First, the rank of N groups of features was obtained by N feature selection algorithms, and then the Borda Coun method was adopted to determine the ultimate feature rank. The experimental results showed the validity of this algorithm. In 2020, Chu et al. decomposed the Shapley value into high-order interactive components to reasonably evaluate features’ contribution and proposed to evaluate feature subsets by discarding unselected features [45]. The above methods are non-optimized algorithms. Some studies combine the Shapley value with evolutionary algorithms. Deng et al. put forward a feature selection method that combines the Shapley value and particle swarm optimization [46]. The Shapley value was utilized to remove useless features in the local search and select fewer features. Guha et al. proposed a cooperative genetic algorithm for feature selection [47]. The fitness function combined the classification result, the feature subset size and the Shapley value score in a multi-objective fashion. However, this method is essentially single-objective optimization and cannot obtain multiple non-dominated solutions.
As far as we know, Shapley value has not been introduced into the multi-label feature selection algorithm. We merge Shapley value and a multi-objective multi-label feature selection method. This combination possesses the following two advantages: First, the Shapley value method focuses on the contribution of each feature, and the multi-objective optimization method focuses on the combination of features. Appropriate fusion can prevent a feature from being eliminated due to its poor performance in the feature subset, but the feature is useful for classification. Second, due to the huge search space, the search process at the beginning of the evolutionary algorithm has some randomness. The Shapley value method is conducive to the algorithm to search for potential spaces and to improve the convergence speed.

3. Preliminaries

3.1. Multi-Objective Optimization

In reality, many optimization problems involve multiple objectives, moreover, there may be contradictions or other relationships between the objectives. It is hard for people to acquire the best solution for each objective and determine the importance of different objectives. The multi-objective algorithm can balance the relationship between multiple objectives so that the obtained solution can be approximately optimal on multiple objectives.
The multi-objective optimization problem with M optimization objectives is formally defined as Equation (1):
m i n i m i z e f   x   =   f 1 x , f 2 x , f M x
where x Ω , Ω represents the decision space, x   =   x 1 ,   x 2 , , x n m     Ω     n , x is a decision variable of length n m , f i x   i = 1 , , M is the i-th optimized function.
In a multi-objective optimization problem, for two solutions y , z Ω , y dominates z, donated by y   z , if k : f i y     f i z     k :   f i y   <   f i z ,   k     1 , M . x * Ω is defined as a Pareto optimal solution if there is no other solution x     x *     Ω dominates x * . A Pareto set consists of all the Pareto optimal solutions. The front obtained by mapping the Pareto set to the objective space is known as the Pareto front.

3.2. Shapley Value

Game theory mainly includes two types of cooperative games and non-cooperative games [48,49]. The main feature of cooperative games is that participants cooperate with each other and form alliances to maximize the overall benefits. The collective benefits are more important than the individual benefits. Cooperative games emphasize collective rationality [50], while non-cooperative games emphasize individual rationality [51]. Feature selection can be regarded as a cooperative game, which satisfies the forming conditions of the cooperative game:
The total personal income is less than the alliance’s income.
Compared with not joining the alliance, every participant is able to gain a higher profit.
The Shapley value calculates the weighted sum of the participants’ marginal contributions in the cooperative game [52]. It’s a fair and reasonable method of distributing benefits for participants. In feature selection, Shapley value can be utilized to calculate the feature’s contribution.
Suppose that the set of individuals participating in the cooperative game is P   =   p 1 ,   p 2 , , p n s , p i is the i-th participant, and n s is the number of individuals. S is the set of all subsets that do not contain p i in P . v is a real-valued function, which can map the alliance to the benefits obtained by the cooperation of participants in the alliance. The Shapley value of participant p i is calculated as follows:
ϕ i = S P \ p i S ! P S 1 ! P ! v S p i v S
Specifically, in the feature selection problem, P is the original feature set, p i is the i-th feature, S is all feature subsets that do not contain feature p i , and the function v is represented by the classification result of the selected features under the classifier.

3.3. Multi-Label Learning

In order to better illustrate the difference between multi-label learning and traditional single-label learning, we give an example of an image. As shown in Figure 1, (a) is the original image, and (b) is the annotated image of (a). First, the meaning of the traditional single-label classification is explained. When we judge that the water area in (a) is a sea, a lake or a river, we can see that it belongs to the sea, and it cannot belong to the lake and the river at the same time, because these three categories are in conflict. This problem is a single-label multi-classification problem, that is, there are three categories of sea, lake and river under the label of water description. However, an image often contains more than one object. As shown in (b), the image can be annotated with three labels including sky, sea and sand beach. Each label can be divided into two categories, 0 and 1, meaning including the object and not including the object. Obviously, these two categories are contradictory. However, the three labels can exist at the same time, and there is a certain connection between the three labels. For example, if an image contains sea water, then the image may also include sky and sand beach, which is in line with the objective laws of the world. Therefore, multi-label classification is close to real life. The classification in (b) can be regarded as a multi-label binary classification problem. The multi-label multi-class problem corresponds to multiple labels, and each label has multiple categories of problems.
The original single-label learning method cannot be directly used in multi-label learning [53], because every sample of multi-label data is labeled with one or more labels simultaneously. Moreover, the relationship between the labels may be related. The definition of multi-label learning is as follows:
Let X = x 1 , x 2 , , x N be the d-dimensional input variable on the real number field, and Y = y 1 , y 2 , , y q be the label space. D = x i , Y i , 1 i N is the training dataset, and Y i is the true label set of training data x i . During the training process, the algorithm learns the function h : X   2 q based on the training data. Given test data set H = { x j , y j | 1 j t } , when the test data x j X is input, the predicted labels closest to the proper labels of x j are obtained through the function h .
Multi-label classification has unique evaluation means to analyze the quality of the classification results, which is divided into examples-based criteria and labels-based criteria [54]. In this paper, we mainly introduce the six criteria involved in the experiment:
  • Ranking Loss: It evaluates the fraction that an irrelevant label is ranked before a related label. y i ¯ is the complementary set of y j .
    R L h , H = 1 t j = 1 t { 1 y j y i ¯ | k , l y j × y i ¯ , s . t . h x j , k     h x j , l }
  • Average Precision: It measures the average instances’ correlated labels, and these labels are ranked higher than the preset labels. r a n k h is the descending rank function.
      A P h , H = 1 t j = 1 t 1 | y j | y y j y | r a n k h x , y   r a n k h x j , y , y y j r a n k h x j , y
  • Coverage: It records the minimum number of steps that need to be moved to cover the true labels associated with the sample from the sample’s classification prediction labels list.
    C V h , H = 1 t j = 1 t m a x y y j r a n k h x j , y 1
  • Hamming loss: It measures the proportion of misclassified label pair.
    H L h , H = 1 t j = 1 t h x j y j
    where represents the symmetric difference of the predicted label set and true label set.
  • Macro-F1: It is a label-based index that takes into account the average F-measure of every label.
    M a F f , F = 1 q j = 1 q 2 i = 1 t y i j h j x i i = 1 t y i j + i = 1 t h j x i
  • Micro-F1: It is a label-based index that takes into account the average F-measure of the prediction matrix.
    M i F h , H = 2 j = 1 t h x j y j j = 1 t y j + j = 1 t h x j

4. The Proposed Method

4.1. Objective Function

Generally, the number of objectives for multi-objective optimization does not exceed three, and problems with more than three objectives can be defined as many-objective optimization problems. This paper sets three optimization objectives including AP, HL and the number of selected features. The calculation methods of AP and HL are shown in Equations (4) and (6). The larger the value of AP, the better, and the smaller the value of HL and the number selected features, the better. The relationship between three objectives is complicated. First, the classification index and the number of features are contradictory in most situations. It is difficult to obtain high classification accuracy with a small number of features. Second, the literature [40] pointed out that AP and HL are contradictory. The single-objective optimization method is difficult to deal with the complex relationship between the objectives, so the multi-objective optimization algorithm is adopted as the basic optimization method.

4.2. Mutation Operator

Like the traditional genetic algorithm, NSGA III requires the crossover and mutation operators to produce offspring. Traditional NSGA III suites for continuous optimization matters, which means that the algorithm uses the real number encoding method and operators such as simulate binary crossover and polynomial mutation. Feature selection is a discrete optimization problem, and every feature corresponds to a bit. In the genetic algorithm, for a dataset with d-dimensional features, the population is composed of multiple chromosomes composed of 0 and 1 with length d, and each chromosome represents a solution. 0 means the feature corresponding to the bit is not selected, 1 means selected.
In order to effectively identify the relevant features, redundant features and irrelevant features, the Shapley value is fused with the multi-objective optimization algorithm. A mutation operator and a crossover operator based on Shapley value are proposed so that the improved NSGA III algorithm can resolve the multi-label feature selection problem well.
In the real world, many problems involve with high dimensions. The high-dimensional characteristics of the data increase the decision variables of the evolutionary algorithm, resulting in a huge search space, and the algorithm’s optimization ability and speed are affected. At the beginning, the algorithm’s randomness may lead to the wrong search direction for certain features. For example, suppose that there are m individuals in a population, the length of each individual is d, the i-th feature is relevant, and the j-th feature is irrelevant. Then in the early stage of the iteration, the following two situations may occur:
The number of individuals who choose the i-th feature is u , m 2     u   <   m , and the number of individuals who do not choose the i-th feature is m u .
The number of individuals who choose the j-th feature is w , m 2     w   <   m , and the number of individuals who do not choose the j-th feature is m w .
When u and w are large, it indicates that most individuals have not selected the relevant feature i , or that most individuals have selected the irrelevant feature j . The offspring of similar chromosomes are close to the parent individuals. In this case, it is requisite to carry out mutation operations on the bits with the above-mentioned conditions with greater probability.
However, it is difficult to know the relevance or irrelevance of features. Therefore, we introduced the Shapley value to evaluate the feature’s contribution. Given that three objectives are optimized in this paper, including two multi-label evaluation criteria, the Shapley value of the feature is defined as follows:
ϕ i = S P \ p i S ! P S 1 ! P ! v 1 S p i v 1 S + v 2 S p i v 2 S
where v 1 is the HL value obtained by the feature subset S under the multi-label classifier, and v 2 is the AP value. HL and AP are calculated as shown in Equation (6) and Equation (4).
When the unbalanced search situation no longer occurs, it means that the relevant features have basically been selected, and the irrelevant features are discarded. In order to further judge the redundant features, more attention should be paid to the features near the Shapley value of 0. Because the contribution of such features hardly contribute to classification, whether to choose these features basically does not affect the classification result. The specific mutation procedure is shown in Algorithm 1.
Algorithm 1 Mutation probability calculation
Input: Population p o p ; Population scale N p ; Feature dimension d; Default mutation rate η ; Parameter maker t.
Output: Mutation probability P m u
1: for i = 1:   N p
2:   A i   = Fit( p o p i );//Calculate the fitness of every individual in p o p .
3: end
4: for i = 1:   d
5:   φ i   = Shapley( A );//Calculate the Shapley value of the i -th feature.
6: number1( i ) = Select( p o p (:, i ));//Record the number of individuals with selected the i -th feature.
7: number2( i ) = Unselect(pop(:, i ));//Record the number of individuals with unselected the i -th feature.
8: end
9: for i = 1: d
10: if t = 0
11: if φ i < 0 && (number1( i )/number2( i )) < 1/2
12: P m u i = η + η 1 number 1 i number 2 i ;
13: t = 1;
14: elseif φ i > 0 && (number2( i )/number1( i )) < 1/2
15: P m u i = η + η * 1 number 2 i number 1 i ;
16: t = 1;
17: else
18: P m u i = η ;
19: end
20: end
21: if t = 0
22: if abs ( φ i ) < max(abs( φ ))/2
23: P m u i = η + η 1 a b s φ i max a b s φ ;
24: else
25: P m u i = η ;
26: end
27: end
28: end
Algorithm 1 shows the process of determining the mutation probability of each gene. When a population is initialized, the fitness value of each individual can be calculated through the objective functions (lines 1–3). Then measure the Shapley value of each feature according to Equation (2) and record the number of individuals that selected and unselected a certain feature in the population (lines 4–8). Next, the calculation of mutation probability is divided into two cases. The first is to search for relevant features and irrelevant features under the condition { φ i < 0 & & number 1 i number 2 i   1 / 2 } | { φ i   >   0 & & number 2 i / number 1 i     1 / 2 } . In this case, the i-th feature may be wrongly selected or abandoned by most individuals. Therefore, it is necessary to increase the probability of mutation of the i-th feature. The more unbalanced the current feature’s search, the greater the probability of mutation. When the population no longer satisfies the above conditions in the later period of evolution, the search for redundant features is performed. The smaller the absolute Shapley value of the feature means the greater the probability that it is a redundant feature, thereby increasing the probability of mutation.
After obtaining the mutation probability of each gene, mutation operation will be performed in the population. The uniform mutation is adopted in this paper, and the specific procedure is shown in Algorithm 2.
Algorithm 2 Mutation operation
Input: Population p o p ; Population mutation ratio P m r ; Mutation probability P m ; Population scale N p ; Feature dimension d .
Output: Mutation population p o p m
1: n = N p P m r ;
2: for i   = 1:   n
3: for j   = 1:   d
4: if rand (1) > (1 − P m j )
5:   p o p m ( i , j ) = abs( p o p i , j − 1);
6: else
7:   p o p m i , j   =   p o p i , j ;
8: end
9: end
10: end

4.3. Crossover Operator

The uniform crossover operator is adopted in this paper. The exact process is shown in Algorithm 3.
Algorithm 3 Crossover operation
Input: Population p o p ; Population crossover ratio P c r ; Crossover probability P c ; Population scale N p ; Feature dimension d.
Output: Crossover population p o p c
1:   n = N p P c r ;
2:   m = 0 ;
3: for i = 1 : n
4:   i 1 = rand 1 ,   n ;//Generate a random number from 1 to n .
5: i2 =   rand 1 ,   n ;
6:   m = 2 i 1 ;
7: for j = 1: d
8: if rand (1) > rand (0.5,   P c )
9:   p o p c m , j = p o p i 2 , j ;
10:   p o p c m + 1 , j = p o p i 1 , j ;
11: else
12:   p o p c m , j = p o p i 1 , j ;
13:   p o p c m + 1 , j = p o p i 2 , j ;
14: end
15: end
16: end
In the mutation operation, the Shapley value represents a single feature’s contribution. In the crossover operation, we employ Shapley value to evaluate individuals. The average of the sum of the selected features’ Shapley values is defined as the individual’s Shapley value. Given that the Shapley value vector of all the features is φ , and the chromosome of individual i is p o p i , : , then the Shapley value ϕ i of individual i is ϕ i = p o p i , : × φ p o p i , : 1 . The calculation method of the crossover probability P c is as follows:
p c = 0.5   , i f   t = 1 | ϕ i 1 ϕ i 2 | max ϕ min ϕ   , i f   t = 0
Similar to the mutation probability, the calculation of the crossover probability is divided into two stages. When t = 1 , the algorithm is in the early stage of the iteration. At this time, the global search of the algorithm is necessary, so the probability of each individual crossover is equal. When t = 0 , the algorithm performs a local search for redundant features. If individuals with large fitness gaps are selected for crossover, it may affect the inheritance of excellent genes. The non-dominated individuals are at the same level and cannot be ranked in the multi-objective problem. Therefore, we use the individual’s Shapley value to approximately assess the individual’s quality. When t = 0, if the difference between the two individuals’ Shapley values is large, then the possibility of swapping genes is reduced.

4.4. The Improved Niche Preservation Mechanism

NSGA III employs a set of pre-set reference points to associate non-dominated solutions, and the niche preservation mechanism is employed to select non-dominated individuals from the critical front into the archive. Assume that there are ρ j individuals associated with the j-th reference point. The selection process is as follows:
First, randomly select a reference point j ¯ with the smallest ρ j from the set of reference points. When ρ j = 0, it means that there is no individual associated with it. Therefore, if there is an individual in the critical front that is associated with j ¯ , then the individual closest to the reference line of j ¯ is selected to the archive. If no individual is associated with j ¯ , then reconsider other reference points.
ρ j ¯ 0 indicates that there are one or more individuals associated with j ¯ . If the number of individuals associated with j ¯ is zero in the critical front, the next reference point is reconsidered. If the number of individuals associated with j ¯ is non-zero in the critical front, an individual is randomly selected to associate with j ¯ .
In this paper, we can obtain the Shapley value of each individual, which was introduced in Section 4.2. We sort the individuals in the critical front in descending order according to the individuals’ Shapley value. When ρ j ¯     0 , the first-ranked individual in the critical front is selected and added to the archive if there are individuals in the critical front that are associated with j ¯ . The individual with the highest Shapley value indicates that the features selected by the individual has a higher average contribution to the classification, which means that the individual may be a promising solution. Sorting instead of random selection helps to improve the convergence of the algorithm.

4.5. The Overall Flow of the Algorithm

In order to more clearly illustrate the specific process of the proposed algorithm, Figure 2 shows the flow chart of SHAPFS-ML.
First, the population is initialized. All individuals are binary-coded and the length of the chromosome is the feature dimension of the input data. Then, the fitness values of the individuals in the population are calculated. The multi-label classifier MLKNN is used to evaluate the AP and HL values of the feature subset, and AP, HL and the size of the feature subset are used as the fitness values of the individual. After obtaining the fitness values matrix of the population, the Shapley value of each feature is calculated. Then the crossover probability and mutation probability are determined, and the evolution operation is executed. The regenerated populations are stratified through non-dominated relations. Because the capacity of the archive set is limited, the archive set is maintained by an improved maintenance strategy and the non-dominated solutions in the archive set are updated. If the stop condition is not met, the above process is repeated, and all non-dominated solutions in the archive set are finally output.
We can analyze how the framework is improved from two perspectives. From the perspective of features, each feature acts as a participant to cooperate with other features. The Shapley value can feedback the benefits of the alliances the feature participates in and the alliances that the feature does not participate in. The way of feedback is to adjust the cross probability and mutation probability of the feature, which will affect the probability of the feature appearing in the next iteration, so that the algorithm heuristically searches for potential areas. From the perspective of the population, the fitness function quantifies the quality of each individual. The population evolves through reproduction so that excellent genes are retained in each iteration, disadvantaged individuals are eliminated, and the entire population evolves in a better direction.

5. Experiments

5.1. Experiment Settings

The experiments are conducted on seven multi-label datasets including flags, emotions, yeast, virus, Languagelog, genbase and medical. Flags, emotions, yeast, genbase and medical are available on MULAN. MULAN is an open-source library for multi-label learning [55]. Languagelog are chosen from MEKA [56], an extended version of WEKA [57] in multi-label learning and evaluation. Virus is available in [58]. Table 1 shows the summary of seven datasets. A classic multi-label classification algorithm ML-KNN [9] is employed as the multi-label classifier. K indicates the number of nearest neighbors, which is set to 10 as suggested in [9]. The parameter η is 0.3 and the size of population is 20. The experiments are conducted on a laptop equipped with an Intel(R) Core (TM) i7-9750H CPU and 16 GB memory.

5.2. Comparing Methods

In this section, six comparison methods are employed to demonstrate the usefulness of the proposed algorithm. The traditional NSGA III is compared with SHAPFS-ML to analyze the effectiveness of the improved NSGA III algorithm. Similarly, the coding method of NSGA III is modified to binary, uniform crossover and uniform mutation are adopted as the methods of crossover and mutation. SHAPFS-ML and NSGA III algorithms have been independently run 20 times on each data set. A non-dominated solution set is randomly selected from the running results, and the solution with the smallest sum of all objective function values is selected as the final result. MDFS constructs a low-dimensional embedding method to seek discriminative features [27]. MCLS is a manifold-based method that can transform the original label space and constrain the samples [26]. MIFS exploits the label correlations and decomposes the multi-label information [6]. MDDM maximizes the reliance of features and the associated labels, proj is the irrelated projection dimensionality reduction of MDDM, spc is an unrelated projection feature selection method [59].

5.3. Evaluation of Experimental Results on Multi-Label Classification

Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 show the comparison results on seven datasets under six multi-label learning criteria, which are introduced in Section 3.3. ↑ means the value should be the bigger the better. ↓ means the value should be the smaller the better. Generally speaking, the performance of SHAPFS-ML is the best (In bold). Avg.Rank is the average ranking value of each algorithm on all datasets. The smaller the Avg.Rank value, the better the performance of the algorithm. In detail, SHAPFS-ML has obtained the best results on three indicators average precision, coverage and Hamming loss. According to ranking loss, SHAPFS-ML obtained the optimal results in addition to the dataset Languagelog. SHAPFS-ML ranked second on Languagelog, but the difference between SHAPFS-ML and MCLS is small. On MicroF and MacroF indicators, MCLS is better than SHAPFS-ML on the data set genbase. Although SHAPFS-ML did not rank first on both indicators, SHAPFS-ML is significantly better than other non-optimized methods on other data sets except genbase. For example, on the emotions dataset, the value of SHAPFS-ML on the MicroF indicator is 0.6446, while the values of the other five non-optimized methods are 0.4636, 0.5114, 0.5156, 0.5829 and 0.5860, respectively, which are all behind SHAPFS-ML. Similarly, on the flags dataset, SHAPFS-ML obtained 0.6546 on the MacroF indicator, while in other methods, the lowest value of MDFS is 0.4777, and the highest value of MCLS is 0.5657. This observation reveals that SHAPFS-ML has a remarkable improvement. In terms of average ranking, SHAPFS-ML ranked first, followed by NSGA III. It is observed that the classification results of SHAPFS-ML are improved compared to the traditional NSGA III. Moreover, the multi-objective optimization-based method is more advantageous compared with other non-optimized algorithms, and the performance is relatively stable on different scale datasets. Among the non-optimized methods, MCLS performed best, and MDDM_proj performed worst.
Table 8 shows the ratio of the size of selected features by different methods to the size of the full feature set. It can be seen that SHAPFS-ML can remove at least 60% of the features on the flags, emotions and virus data sets, and can remove more than 50% of the features on other data sets. Although the size of features selected by SHAPFS-ML is not the least, it has the best performance on the classification results. Through the above discussion, we can draw a conclusion that SHAPFS-ML is competitive among the well-established comparison methods.
The average ranking of SHAPFS-ML is better than that of NSGA III under six multi-label learning criteria. The main difference between the two is the crossover and mutation operators. Crossover and mutation operators are related to the global search and local search capabilities of the algorithm. The two types of searches cooperate with each other to achieve a balanced state. There are two main advantages of SHAPFS-ML. First, the crossover and mutation operators proposed in this paper adaptively calculates the crossover probability and mutation probability of the gene locus corresponding to the feature according to the Shapley value during the evolution process. The two operators cooperate and compete with each other to enhance the exploitation of feature space and the ability to explore local features. Secondly, the multi-objective optimization algorithm can consider the combination effect of features, and the introduction of the Shapley value method can measure the effect of a single feature. Feature combinations involving well-performing features may be more competitive and potential. Therefore, in the problem of multi-label feature selection, the optimization ability of SHAPFS-ML is stronger than that of traditional NSGA III. Therefore, the optimization ability of SHAPFS-ML is stronger than that of traditional NSGA III in the problem of multi-label feature selection.

5.4. The Comparison on Hypervolume Indicator

To quantify the quality of the Pareto set obtained by SHAPFS-ML, Hypervolume (HV) index is introduced into the evaluation of experimental results. HV is a commonly used index for multi-objective algorithms [60]. The larger the value of HV, the better the multi-objective algorithm’s capability. HV calculates the volume of the hypercube formed by the reference points and the non-dominated solution set. HV value can reflect the distribution and convergence of the algorithm. Therefore, the obtained HV value is different if the non-dominated solution set is different. We have run the algorithms SHAPFS-ML and NSGA III 20 times to calculate the average, best and worst values of HV. As shown in Table 9, SHAPFS-ML can obtain a higher average HV value and the best HV value, which shows that the search ability of SHAPFS-ML’s multi-objective optimization is improved compared with the traditional NSGA III algorithm, and it can obtain a widely distributed and uniform Pareto solution set.

5.5. Shapley Value Analysis

To further analyze the validity of the application of Shapley value to multi-label feature selection, we sort the features’ Shapley values calculated in the last iteration of SHAPFS-ML, and gradually select the features for classification according to the order of contribution from the largest to the smallest. NSGA III is a global optimization algorithm, and the number of features cannot be determined arbitrarily. Therefore, NSGA III is not used as a comparison algorithm in this section. The non-optimized algorithms including MDFS, MCLS, MIFS, MDDM_proj and MDDM_spc as mentioned in Section 5.3 are compared for analysis.
The main purpose of this section is to verify whether the Shapley value method can reasonably analyze the contribution of features. This study is meaningful for feature selection. Because features with a high degree of contribution can be regarded as relevant features, which can help the sample to be correctly classified. And features with low contributions can be regarded as irrelevant features, which will interfere with the classification process and may even reduce the classification accuracy. The contribution degree of the feature also reflects the importance of the feature. Other non-optimized methods essentially use different measurement methods to quantify the importance of the feature. Therefore, it is reasonable to compare the feature results based on the Shapley value ranking with other non-optimized methods.
Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9 show the values of the six algorithms on the seven datasets as the number of features increases on different indicators. From the observation, we can see that SHAPFS-ML has a significant improvement on six indicators compared with other methods on emotions, virus and medical. As the number of features increases, SHAPFS-ML tends to stabilize after reaching a certain maximum value. On yeast and Languagelog, the performance of MCLS is closest to SHAPFS-ML, but SHAPFS-ML can basically obtain the optimal value. On the flags dataset, except for Coverage and MacroF, SHAPFS-ML is slightly inferior to MCLS. On the genbase data set, SHAPFS-ML can obtain the optimal value in addition to the two indicators of MicroF and MacroF.
Through the above analysis, it can be seen that the Shapley value of SHAPFS-ML can effectively evaluate features and identify effective features. In general, non-optimized multi-label feature selection algorithms are more difficult to determine the optimal value, especially for multi-label learning. Under different indicators, the number of features that can obtain the optimal value is different. For a dataset with d dimensions, it is essential to run the classification algorithm d times to obtain the value corresponding to the number of different features on an index. In contrast, the optimization algorithm has excellent global search capabilities and can obtain approximately optimal solutions in one run. According to the experimental results, the features selected by SHAPFS-ML perform better under different multi-label indicators and have better stability.

5.6. Complexity Analysis

In this section, the computational complexity of the proposed algorithm is analyzed. When the population size is set to N p and the number of objectives is M , population initialization, crossover operator, mutation operator, and individual fitness calculations all require O N p basic operations. The non-dominated sort requires O N p l o g M 2 N p basic operations. The selection operation of non-dominated individuals requires O M N p 2 , so the final computational complexity of the proposed algorithm is max O N p l o g M 2 N p ,   O M N p 2 .

5.7. Comparison of Running Time

The running time of the wrapper feature selection algorithm based on the evolutionary optimization depends on the evolutionary algorithm, the size of the data set and the classification algorithm, so it is difficult to measure the actual running time of the evolutionary algorithm [41]. Therefore, we compare the running time of SHAPFS-ML and NSGA III in this section. The running time is the average time of 20 independent runs of each algorithm. It can be seen from Table 10 that the running time of the two algorithms is relatively close, especially on the flags, emotions and virus data sets. SHAPFS-ML is improved on the basis of NSGA III. Both the improved and traditional crossover and mutation operators require linear time.

6. Conclusions

Multi-label classification problems are common in real life. In recent years, there have been more studies in the field of multi-label feature selection, but there are few methods based on multi-objective optimization. A wrapper multi-objective optimization feature selection algorithm for multi-label learning fused with Shapley value (SHAPFS-ML) is proposed in this work. This method has two notable properties. First, the idea of Shapley value in game theory is combined with feature selection. We regard feature selection as the process of a cooperative game between features, and an excellent combination of features is selected by evaluating both features and individuals. Secondly, the mutation operator and crossover operator based on Shapley value are proposed to balance the algorithm’s exploration capability and exploitation capability. The experimental results compared with other well-established multi-label feature selection methods on multi-label datasets prove the validity of SHAPFS-ML.
In future work, we will use the Shapley value method to realize feature visualization, and further distinguish relevant features, redundant features and irrelevant features. This research will improve the interpretability of multi-label feature selection algorithms. And we will apply the multi-label feature selection algorithm to a specific problem, such as image annotation.

Author Contributions

Methodology, J.S.; investigation, H.D. and X.S.; writing—original draft preparation, J.S.; writing—review and editing, H.D. and X.S. All authors have read and agreed to the published version of the manuscript.


This research is funded by the National Science Foundation of China (No. 61472095) and the Natural Science Foundation of Heilongjiang Province (LH2020F023).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.


  1. Bittencourt, M.M.; Silva, R.M.; Almeida, T.A. ML-MDLText: An efficient and lightweight multilabel text classifier with incremental learning. Appl. Soft Comput. 2020, 96, 106699. [Google Scholar] [CrossRef]
  2. Omar, A.; Mahmoud, T.M.; Abd-El-Hafeez, T.; Mahfouz, A. Multi-label Arabic text classification in Online Social Networks. Inf. Syst. 2021, 100, 101785. [Google Scholar] [CrossRef]
  3. Yun, S.; Oh, S.J.; Heo, B.; Han, D.; Choe, J.; Chun, S. Re-labeling ImageNet: From Single to Multi-Labels, from Global to Localized Labels. arXiv 2021, arXiv:2101.05022. [Google Scholar]
  4. Wang, H.; Ding, Y.; Tang, J.; Zou, Q.; Guo, F. Identify RNA-associated subcellular localizations based on multi-label learning using Chou’s 5-steps rule. BMC Genom. 2021, 22, 56. [Google Scholar]
  5. Chen, L.; Li, Z.; Zeng, T.; Zhang, Y.H.; Li, H.; Huang, T.; Cai, Y.D. Predicting gene phenotype by multi-label multi-class model based on essential functional features. Mol. Genet. Genom. 2021, 296, 905–918. [Google Scholar] [CrossRef]
  6. Jian, L.; Li, J.; Shu, K.; Liu, H. Multi-Label Informed Feature Selection. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 1627–1633. [Google Scholar]
  7. Zhang, M.; Zhou, Z. A Review on Multi-Label Learning Algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837. [Google Scholar] [CrossRef]
  8. Madjarov, G.; Kocev, D.; Gjorgjevikj, D.; Džeroski, S. An extensive experimental comparison of methods for multi-label learning. Pattern Recognit. 2012, 45, 3084–3104. [Google Scholar] [CrossRef]
  9. Zhang, M.L.; Zhou, Z.H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [Google Scholar] [CrossRef] [Green Version]
  10. Elisseeff, A.; Weston, J. A Kernel Method for Multi-Labelled Classification. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–8 December 2001; pp. 681–687. [Google Scholar]
  11. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  12. Xue, B.; Zhang, M.; Browne, W.N.; Yao, X. A Survey on Evolutionary Computation Approaches to Feature Selection. IEEE Trans. Evol. Comput. 2016, 20, 606–626. [Google Scholar] [CrossRef] [Green Version]
  13. Stai, E.; Kafetzoglou, S.; Tsiropoulou, E.E.; Papavassiliou, S. A holistic approach for personalization, relevance feedback & recommendation in enriched multimedia content. Multimed. Tools Appl. 2016, 77, 283–326. [Google Scholar]
  14. Herrera-Semenets, V.; Bustio-Martínez, L.; Hernández-León, R.; van den Berg, J. A multi-measure feature selection algorithm for efficacious intrusion detection. Knowl. Based Syst. 2021, 227, 107264. [Google Scholar] [CrossRef]
  15. Rauber, T.W.; De AB, F.; Varejao, F.M. Heterogeneous Feature Models and Feature Selection Applied to Bearing Fault Diagnosis. IEEE Trans. Ind. Electron. 2015, 62, 637–646. [Google Scholar] [CrossRef]
  16. Jaesung, L.; Dae-Won, K. Efficient Multi-Label Feature Selection Using Entropy-Based Label Selection. Entropy 2016, 18, 405. [Google Scholar]
  17. Lin, Y.; Hu, Q.; Zhang, J.; Wu, X. Multi-label feature selection with streaming labels. Inf. Sci. 2016, 372, 256–275. [Google Scholar] [CrossRef]
  18. Sechidis, K.; Spyromitros-Xioufis, E.; Vlahavas, I. Information Theoretic Multi-Target Feature Selection via Output Space Quantization. Entropy 2019, 21, 855. [Google Scholar] [CrossRef] [Green Version]
  19. Zhang, P.; Gao, W.; Hu, J.; Li, Y. Multi-Label Feature Selection Based on High-Order Label Correlation Assumption. Entropy 2020, 22, 797. [Google Scholar] [CrossRef]
  20. Chen, L.; Chen, D. Alignment Based Feature Selection for Multi-label Learning. Neural Process. Lett. 2019, 50, 2323–2344. [Google Scholar] [CrossRef]
  21. Lin, Y.; Hu, Q.; Liu, J.; Duan, J. Multi-label feature selection based on max-dependency and min-redundancy. Neurocomputing 2015, 168, 92–103. [Google Scholar] [CrossRef]
  22. Spolaôr, N.; Cherman, E.A.; Monard, M.C.; Lee, H.D. A Comparison of Multi-label Feature Selection Methods using the Problem Transformation Approach. Electron. Notes Theor. Comput. Sci. 2013, 292, 135–151. [Google Scholar] [CrossRef] [Green Version]
  23. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
  24. Shang, C.; Li, M.; Feng, S.; Jiang, Q.; Fan, J. Feature selection via maximizing global information gain for text classification. Knowl. Based Syst. 2013, 54, 298–309. [Google Scholar] [CrossRef]
  25. Yang, Y.; Pedersen, J.O. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML 1997), Nashville, TN, USA, 8–12 July 1997; pp. 412–420. [Google Scholar]
  26. Huang, R.; Jiang, W.; Sun, G. Manifold-based constraint Laplacian score for multi-label feature selection. Pattern Recognit. Lett. 2018, 112, 346–352. [Google Scholar] [CrossRef]
  27. Zhang, J.; Luo, Z.; Li, C.; Zhou, C.; Li, S. Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognit. 2019, 95, 136–150. [Google Scholar] [CrossRef]
  28. Zhang, M.L.; Pena, J.M.; Robles, V. Feature selection for multi-label naive Bayes classification. Inf. Sci. 2009, 179, 3218–3229. [Google Scholar] [CrossRef]
  29. Guo, Y.; Chung, F.L.; Li, G.; Zhang, L. Multi-Label Bioinformatics Data Classification with Ensemble Embedded Feature Selection. IEEE Access 2019, 7, 103863–103875. [Google Scholar] [CrossRef]
  30. Abdel-Basset, M.; El-Shahat, D.; El-henawy, I.; de Albuquerque, V.H.C.; Mirjalili, S. A new fusion of grey wolf optimizer algorithm with a two-phase mutation for feature selection. Expert Syst. Appl. 2020, 139, 112824. [Google Scholar] [CrossRef]
  31. Hua, Y.; Liu, Q.; Hao, K.; Jin, Y.A. Survey of Evolutionary Algorithms for Multi-Objective Optimization Problems with Irregular Pareto Fronts. IEEE/CAA J. Autom. Sin. 2021, 8, 303–318. [Google Scholar] [CrossRef]
  32. Deb, K.; Jain, H. An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point-Based Nondominated Sorting Approach, Part I: Solving Problems with Box Constraints. IEEE Trans. Evol. Comput. 2014, 18, 577–601. [Google Scholar] [CrossRef]
  33. Mnich, K.; Rudnicki, W.R. All-relevant feature selection using multidimensional filters with exhaustive search. Inf. Sci. 2020, 524, 277–297. [Google Scholar] [CrossRef] [Green Version]
  34. Hua, Z.; Zhou, J.; Hua, Y.; Zhang, W. Strong approximate Markov blanket and its application on filter-based feature selection. Appl. Soft Comput. 2019, 87, 105957. [Google Scholar] [CrossRef]
  35. Fa, A.; As, B. An effective feature selection method for web spam detection. Knowl. Based Syst. 2019, 166, 198–206. [Google Scholar]
  36. Bing, X.; Fu, W.; Zhang, M. Multi-Objective Feature Selection in Classification: A Differential Evolution Approach. In Proceedings of the International Conference on Simulated Evolution and Learning, Dunedin, New Zealand, 15–18 December 2014; Volume 8886, pp. 516–528. [Google Scholar]
  37. Cervante, L.; Xue, B.; Shang, L.; Zhang, M. A Multi-objective Feature Selection Approach Based on Binary Particle Swarm Optimisation (PSO) and Probabilistic Rough Set Theory. In Proceedings of the European Conference on Evolutionary Computation in Combinatorial Optimization, Vienna, Austria, 3–5 April 2013; Volume 7832. [Google Scholar]
  38. Nouri-Moghaddam, B.; Ghazanfari, M.; Fathian, M. A Novel Multi-Objective Forest Optimization Algorithm for Wrapper Feature Selection. Expert Syst. Appl. 2021, 175, 114737. [Google Scholar] [CrossRef]
  39. Dong, H.; Sun, J.; Li, T.; Ding, R.; Sun, X. A multi-objective algorithm for multi-label filter feature selection problem. Appl. Intell. 2020, 50, 3748–3774. [Google Scholar] [CrossRef]
  40. Yin, J.; Tao, T.; Xu, J. A Multi-Label Feature Selection Algorithm Based on Multi-Objective Optimization. In Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 12–17 July 2015; p. 7280373. [Google Scholar]
  41. Zhang, Y.; Gong, D.W.; Sun, X.Y.; Guo, Y.N. A PSO-based multi-objective multi-label feature selection method in classification. Sci. Rep. 2017, 7, 376. [Google Scholar] [CrossRef] [PubMed]
  42. Bidgoli, A.A.; Ebrahimpour-Komleh, H.; Rahnamayan, S. Reference-point-based multi-objective optimization algorithm with opposition-based voting scheme for multi-label feature selection. Inf. Sci. 2021, 547, 1–17. [Google Scholar] [CrossRef]
  43. Cohen, S.B.; Ruppin, E.; Dror, G. Feature Selection Based on the Shapley Value. In Proceedings of the International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA; pp. 665–670. [Google Scholar]
  44. Mokdad, F.; Bouchaffra, D.; Zerrouki, N.; Touazi, A. Determination of an Optimal Feature Selection Method Based on Maximum Shapley Value. In Proceedings of the International Conference on Intelligent Systems Design & Applications, Porto, Portugal, 14–16 December 2016; pp. 116–121. [Google Scholar]
  45. Chu, C.C.F.; Chan, D.P.K. Feature Selection Using Approximated High-Order Interaction Components of the Shapley Value for Boosted Tree Classifier. IEEE Access 2020, 8, 112742–112750. [Google Scholar] [CrossRef]
  46. Deng, X.; Wenzhou, L.I.; Jigang, W.U. Hybrid feature selection algorithm fused Shapley value and particle swarm optimization. J. Comput. Appl. 2018, 38, 1245–1249. [Google Scholar]
  47. Guha, R.; Khan, A.H.; Singh, P.K.; Sarkar, R.; Bhattacharjee, D. CGA: A new feature selection model for visual human action recognition. Neural Comput. Appl. 2020, 33, 5267–5286. [Google Scholar] [CrossRef]
  48. Albizuri, M.J.; Masuya, S.; Zarzuelo, J.M. An Extension of the Shapley Value for Partially Defined Cooperative Games. In Proceedings of the 29th International Conference on Game Theory, Stony Brook, NY, USA, 16–20 July 2018. [Google Scholar]
  49. Nash, J.F. Non-Cooperative Games. Ann. Math. 1951, 54, 286–295. [Google Scholar] [CrossRef]
  50. Peterson, M. Review of Paul Weirich, Collective Rationality: Equilibrium in Cooperative Games. Br. J. Surg. 2010, 44, 55–68. [Google Scholar]
  51. Hannesson, R. Individual Rationality and the “Zonal Attachment” Principle: Three Stock Migration Models. Environ. Resour. Econ. 2006, 34, 229–245. [Google Scholar] [CrossRef]
  52. Pang, J.; Dong, H.; He, J.; Feng, Q. Mixed Mutation Strategy Evolutionary Programming Based on Shapley Value. In Proceedings of the 2016 IEEE Congress on Evolutionary Computation (CEC), Vancouver, BC, Canada, 24–29 July 2016; pp. 2805–2812. [Google Scholar]
  53. Alalga, A.; Benabdeslem, K.; Taleb, N. Soft-Constrained Laplacian score for semi-supervised multi-label feature selection. Knowl. Inf. Syst. 2016, 47, 75–98. [Google Scholar] [CrossRef]
  54. Dong, H.; Sun, J.; Sun, X.; Ding, R. A many-objective feature selection for multi-label classification. Knowl. Based Syst. 2020, 208, 106456. [Google Scholar] [CrossRef]
  55. Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J.; Vlahavas, I. MULAN: A java library for multi-label learning. J. Mach. Learn. Res. 2011, 12, 2411–2414. [Google Scholar]
  56. Read, J.; Reutemann, P.; Pfahringer, B.; Holmes, G. MEKA: A multi-label/multi-target extension to WEKA. J. Mach. Learn. Res. 2016, 17, 667–671. [Google Scholar]
  57. Holmes, G.; Donkin, A.; Witten, I.H. WEKA: A Machine Learning Workbench. In Proceedings of the ANZIIS 94 Australian New Zealnd Intelligent Information Systems Conference, Brisbane, Australia, 29 November–2 December 1994; pp. 357–361. [Google Scholar]
  58. Available online: (accessed on 22 August 2021).
  59. Zhang, Y.; Zhou, Z.H. Multilabel Dimensionality Reduction via Dependence Maximization. In Proceedings of the National Conference on Artificial Intelligence, Chicago, IL, USA, 13 July 2008; pp. 1503–1505. [Google Scholar]
  60. Bader, J.; Zitzler, E. HypE: An Algorithm for Fast Hypervolume-Based Many-Objective Optimization. Evol. Comput. 2011, 19, 45–76. [Google Scholar] [CrossRef]
Figure 1. An example of multi-label classification.
Figure 1. An example of multi-label classification.
Entropy 23 01094 g001
Figure 2. Flow chart of the proposed method SHAPFS-ML.
Figure 2. Flow chart of the proposed method SHAPFS-ML.
Entropy 23 01094 g002
Figure 3. Comparison results with the increase of the number of features under the six indicators on flags dataset.
Figure 3. Comparison results with the increase of the number of features under the six indicators on flags dataset.
Entropy 23 01094 g003
Figure 4. Comparison results with the increase of the number of features under the six indicators on emotions dataset.
Figure 4. Comparison results with the increase of the number of features under the six indicators on emotions dataset.
Entropy 23 01094 g004
Figure 5. Comparison results with the increase of the number of features under the six indicators on yeast dataset.
Figure 5. Comparison results with the increase of the number of features under the six indicators on yeast dataset.
Entropy 23 01094 g005
Figure 6. Comparison results with the increase of the number of features under the six indicators on virus dataset.
Figure 6. Comparison results with the increase of the number of features under the six indicators on virus dataset.
Entropy 23 01094 g006
Figure 7. Comparison results with the increase of the number of features under the six indicators on Languagelog dataset.
Figure 7. Comparison results with the increase of the number of features under the six indicators on Languagelog dataset.
Entropy 23 01094 g007
Figure 8. Comparison results with the increase of the number of features under the six indicators on genbase dataset.
Figure 8. Comparison results with the increase of the number of features under the six indicators on genbase dataset.
Entropy 23 01094 g008
Figure 9. Comparison results with the increase of the number of features under the six indicators on medical dataset.
Figure 9. Comparison results with the increase of the number of features under the six indicators on medical dataset.
Entropy 23 01094 g009
Table 1. Description of datasets.
Table 1. Description of datasets.
Table 2. Comparison results of multi-label feature selection in terms of Ranking Loss ↓.
Table 2. Comparison results of multi-label feature selection in terms of Ranking Loss ↓.
MethodsFlagsEmotionsYeastVirusLanguagelogGenbaseMedicalAvg. Rank
NSGA III0.20560.18390.17510.01570.17200.20440.06772.8571
Table 3. Comparison results of multi-label feature selection in terms of Average precision ↑.
Table 3. Comparison results of multi-label feature selection in terms of Average precision ↑.
MethodsFlagsEmotionsYeastVirusLanguagelogGenbaseMedicalAvg. Rank
NSGA III0.82250.78480.75040.97030.31090.37490.81382.4286
Table 4. Comparison results of multi-label feature selection in terms of Coverage ↓.
Table 4. Comparison results of multi-label feature selection in terms of Coverage ↓.
MethodsFlagsEmotionsYeastVirusLanguagelogGenbaseMedicalAvg. Rank
NSGA III3.73852.02486.40460.289213.72956.24123.73642.9286
Table 5. Comparison results of multi-label feature selection in terms of Hamming Loss ↓.
Table 5. Comparison results of multi-label feature selection in terms of Hamming Loss ↓.
MethodsFlagsEmotionsYeastVirusLanguagelogGenbaseMedicalAvg. Rank
NSGA III0.26810.21700.19980.03210.01580.04930.01392.1429
Table 6. Comparison results of multi-label feature selection in terms of MicroF ↑.
Table 6. Comparison results of multi-label feature selection in terms of MicroF ↑.
MethodsFlagsEmotionsYeastVirusLanguagelogGenbaseMedicalAvg. Rank
NSGA III0.71500.63620.62400.91490.06490.00750.71832.8571
Table 7. Comparison results of multi-label feature selection in terms of MacroF ↑.
Table 7. Comparison results of multi-label feature selection in terms of MacroF ↑.
MethodsFlagsEmotionsYeastVirusLanguagelogGenbaseMedicalAvg. Rank
NSGA III0.52790.56740.34820.70620.22370.15000.27783.2857
Table 8. The proportion of selected features.
Table 8. The proportion of selected features.
NSGA III0.21050.38890.24270.37120.40940.42030.3823
Table 9. HV values of multi-objective algorithms.
Table 9. HV values of multi-objective algorithms.
NSGA III0.58350.62370.48000.59280.74530.5668
NSGA III0.69290.72460.61840.51240.54880.4392
NSGA III0.39710.42060.37340.37550.38340.3506
NSGA III0.39530.44100.3705
Table 10. The running time of SHAPFS-ML and NSGA III. (S).
Table 10. The running time of SHAPFS-ML and NSGA III. (S).
NSGA III89.6130422.07307375.0450117.2760
NSGA III5728.63901072.47301759.2850
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Dong, H.; Sun, J.; Sun, X. A Multi-Objective Multi-Label Feature Selection Algorithm Based on Shapley Value. Entropy 2021, 23, 1094.

AMA Style

Dong H, Sun J, Sun X. A Multi-Objective Multi-Label Feature Selection Algorithm Based on Shapley Value. Entropy. 2021; 23(8):1094.

Chicago/Turabian Style

Dong, Hongbin, Jing Sun, and Xiaohang Sun. 2021. "A Multi-Objective Multi-Label Feature Selection Algorithm Based on Shapley Value" Entropy 23, no. 8: 1094.

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop