A Multi-Objective Multi-Label Feature Selection Algorithm Based on Shapley Value

Multi-label learning is dedicated to learning a function that assigns each sample its true label set. As data knowledge accumulates, feature dimensionality keeps increasing. However, high-dimensional information may contain noisy data, making the process of multi-label learning difficult. Feature selection is a technical approach that can effectively reduce the data dimension. In the study of feature selection, multi-objective optimization algorithms have shown excellent global optimization performance, and the Pareto relationship can handle contradictory objectives in a multi-objective problem well. Therefore, a Shapley value-fused feature selection algorithm for multi-label learning (SHAPFS-ML) is proposed. The method takes multi-label criteria as the optimization objectives, and the proposed crossover and mutation operators based on the Shapley value are conducive to identifying relevant, redundant and irrelevant features. The comparison of experimental results on real-world datasets reveals that SHAPFS-ML is an effective feature selection method for multi-label classification, which can reduce the classification algorithm's computational complexity and improve the classification accuracy.


Introduction
Classification is an important technical task in pattern recognition. Traditional supervised learning mainly involves single-label classification. However, real-world problems are more complicated: a sample may carry multiple labels simultaneously. In order to study such objects, research on multi-label learning has emerged over the years. Multi-label learning methods were first used in text classification [1], and with the development of research they have been applied to new applications such as image annotation [2,3] and biological information [4,5].
The process of multi-label learning is difficult: the size of the label set is uncertain, and there is a certain correlation between the labels [6]. Multi-label learning methods are roughly divided into two categories: problem conversion and algorithm adaptation [7]. In the problem conversion method, the original data is converted into problems that can be solved with single-label classifiers, such as the label power-set method (LP) and the binary relevance method (BR) [8]. The algorithm adaptation method does not need to transform the original data but adapts single-label learning methods to multi-label data, such as a lazy learning approach (MLKNN) [9] and a kernel method (RankSVM) [10]. The algorithm adaptation method does not destroy the original data and better considers the correlation between labels.
The combination of features is critical to the quality of the classification results. The original feature set may contain redundant and irrelevant features. If the original features are directly input into the classifier, they may interfere with its classification decisions [11]. The objective of feature selection is to remove redundant and irrelevant features, thereby reducing the dimensionality of the data and improving classification performance. The main contributions of this paper are as follows:

• Shapley value and multi-objective multi-label feature selection are fused from two sides: the feature and the individual.
• Two improved operators are proposed, which adaptively adjust the crossover and mutation probabilities by evaluating the features' contributions and balance the algorithm's global and local search.
• An improved archive maintenance strategy is put forward to increase the convergence performance of the multi-objective optimization method.
• Experiments on datasets of different scales prove the validity and adaptability of the proposed algorithm.
The rest of the paper is organized as follows: Section 2 analyzes the research status of multi-label feature selection and of the Shapley value in feature selection. Section 3 describes the basic knowledge of multi-objective optimization, the Shapley value and multi-label learning. Section 4 introduces the objective functions, mutation operator, crossover operator, archive set maintenance strategy and the flow chart of the proposed algorithm SHAPFS-ML. Section 5 shows the experimental results on seven multi-label datasets. Section 6 gives a summary of this paper.

Related Works
The search ability of a feature selection algorithm is an important factor that determines the quality of the selected feature subsets. Exhaustive search selects the optimal feature subset by enumerating all possible combinations of features. Mnich et al. used multidimensional exhaustive analysis of the mutual information between features and labels [33]. This method requires a large number of samples and has a high computational cost. Although exhaustive search can find the global optimal solution, it is inefficient. Heuristic search uses heuristic information to reduce the search range of the feature space. Hua et al. proposed an improved modified strong approximate Markov blanket method to remove redundant features, and then used the sequential forward selection (SFS) method to remove irrelevant features [34]. Fa et al. proposed a sequential backward selection (SBS) approach to eliminate a set of features that are not helpful for classification [35]. Both SFS and SBS are greedy algorithms that select the current best solution and can easily fall into a local optimum. Evolutionary computing belongs to heuristic search. Such methods use a random search guided by heuristic information, which can obtain approximately optimal solutions. Common evolutionary computing methods include the genetic algorithm (GA), particle swarm optimization (PSO), differential evolution (DE), etc. Therefore, we use evolutionary computing theory as the methodology in this paper.
In recent years, multi-objective optimization has been a research hotspot in the field of evolutionary computing, and it has been successfully applied to solve the problem of feature selection. Bing et al. proposed a multi-objective differential evolution algorithm to optimize the two objectives of reducing the number of selected features and the classification error rate [36]. Experimental results showed that the proposed algorithm can give a more trade-off solution set and improve the quality of the solution compared with single-objective optimization methods. Liam et al. proposed a binary multi-objective PSO algorithm for filter feature selection based on rough set theory [37]. The performance of this method is better than the traditional single-objective PSO. Nouri-Moghaddam et al. used a novel forest optimization algorithm (FOA) algorithm and designed multiple concepts to deal with the feature selection problem in a multi-objective optimization manner [38]. The experimental results showed that the proposed method performs better than other single-objective and multi-objective optimization methods.
Nowadays, there are more studies on single-label feature selection algorithms than on multi-label ones [39]. Research on multi-objective optimized multi-label feature selection has emerged only recently, benefitting from the successful application of multi-objective optimization methods in single-label feature selection. In 2014, Yin et al. analyzed the contradictions between the two types of multi-label classification indicators and used the second-generation non-dominated sorting genetic algorithm (NSGA II) to optimize Hamming loss and average precision [40]. The proposed algorithm performed better compared with other methods. In 2017, Zhang et al. presented a multi-objective particle swarm optimization method to cover the multi-label feature selection problem [41]. In order to enhance the multi-objective optimization algorithm's performance, they proposed two operation operators, and the crowding distance mechanism was used for the maintenance of archives. The results showed that the exploration ability of the proposed algorithm is better than that of NSGA II. In 2021, Bidgoli et al. proposed a discrete differential evolution method for multi-label feature selection with a binary mutation operator to improve the multi-objective optimization's global search capabilities [42]. The proposed method's performance was verified from the assessment of the multi-objective algorithm and the accuracy for classification.
To assess features' importance for classification, the Shapley value in cooperative game theory is introduced. In 2005, Cohen et al. put forward a feature selection algorithm using the Shapley value. It utilized the Shapley value to iteratively calculate the validity of each feature, and features were selected through forward and backward elimination [43]. The forward elimination method achieved the highest accuracy among the experimental comparison algorithms. In 2016, Mokdad et al. designed a feature selection algorithm structure derived from the Shapley value [44]. First, the ranks of N groups of features were obtained by N feature selection algorithms, and then the Borda Count method was adopted to determine the ultimate feature rank. The experimental results showed the validity of this algorithm. In 2020, Chu et al. decomposed the Shapley value into high-order interactive components to reasonably evaluate features' contributions and proposed to evaluate feature subsets by discarding unselected features [45]. The above methods are non-optimized algorithms. Some studies combine the Shapley value with evolutionary algorithms. Deng et al. put forward a feature selection method that combines the Shapley value and particle swarm optimization [46]. The Shapley value was utilized to remove useless features in the local search and select fewer features. Guha et al. proposed a cooperative genetic algorithm for feature selection [47]. The fitness function combined the classification result, the feature subset size and the Shapley value score in a multi-objective fashion. However, this method is essentially single-objective optimization and cannot obtain multiple non-dominated solutions.
As far as we know, the Shapley value has not previously been introduced into multi-label feature selection. We merge the Shapley value and a multi-objective multi-label feature selection method. This combination has two advantages. First, the Shapley value method focuses on the contribution of each feature, and the multi-objective optimization method focuses on the combination of features. An appropriate fusion can prevent a feature that is useful for classification from being eliminated merely because it performs poorly within a particular feature subset. Second, due to the huge search space, the search process at the beginning of the evolutionary algorithm has some randomness. The Shapley value method helps the algorithm search for potential spaces and improves the convergence speed.

Multi-Objective Optimization
In reality, many optimization problems involve multiple objectives; moreover, there may be contradictions or other relationships between the objectives. It is hard to obtain the best solution for each objective or to determine the relative importance of different objectives. A multi-objective algorithm can balance the relationship between multiple objectives so that the obtained solutions are approximately optimal on all of them.
The multi-objective optimization problem with M optimization objectives is formally defined as Equation (1):

min F(x) = (f_1(x), f_2(x), . . . , f_M(x)), s.t. x ∈ Ω,    (1)

where Ω ⊆ R^n represents the decision space, x = (x_1, x_2, . . . , x_n) ∈ Ω is a decision variable of length n, and f_i(x) (i = 1, . . . , M) is the i-th objective function. For two solutions y, z ∈ Ω, y dominates z, denoted by y ≺ z, if ∀k ∈ {1, . . . , M}: f_k(y) ≤ f_k(z) and ∃k ∈ {1, . . . , M}: f_k(y) < f_k(z). A solution x* ∈ Ω is a Pareto optimal solution if no other solution x ≠ x* ∈ Ω dominates x*. The Pareto set consists of all Pareto optimal solutions, and the image of the Pareto set in the objective space is known as the Pareto front.
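The dominance relation above translates directly into code. A minimal sketch (assuming, as in the definition, that all objectives are minimized; the function names are illustrative):

```python
from typing import Sequence


def dominates(y: Sequence[float], z: Sequence[float]) -> bool:
    """y dominates z: no worse on every objective, strictly better on one."""
    return all(a <= b for a, b in zip(y, z)) and any(a < b for a, b in zip(y, z))


def pareto_front(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

For example, among the objective vectors (1, 2), (2, 1) and (2, 2), the first two form the Pareto front because (2, 2) is dominated by (1, 2).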

Shapley Value
Game theory mainly includes two types of games: cooperative and non-cooperative [48,49]. The main feature of cooperative games is that participants cooperate with each other and form alliances to maximize the overall benefits; the collective benefits are more important than the individual benefits. Cooperative games emphasize collective rationality [50], while non-cooperative games emphasize individual rationality [51]. Feature selection can be regarded as a cooperative game, since it satisfies the forming conditions of a cooperative game: (1) the total income of individuals acting alone is less than the alliance's income; (2) compared with not joining the alliance, every participant is able to gain a higher profit.
The Shapley value calculates the weighted sum of the participants' marginal contributions in the cooperative game [52]. It is a fair and reasonable method of distributing benefits among participants. In feature selection, the Shapley value can be utilized to calculate a feature's contribution.
Suppose that the set of individuals participating in the cooperative game is P = {p_1, p_2, . . . , p_{n_s}}, where p_i is the i-th participant and n_s is the number of individuals. Let S denote a subset of P that does not contain p_i. v is a real-valued function that maps an alliance to the benefits obtained by the cooperation of its participants. The Shapley value of participant p_i is calculated as Equation (2):

ϕ_i(v) = Σ_{S ⊆ P\{p_i}} (|S|! (n_s − |S| − 1)! / n_s!) [v(S ∪ {p_i}) − v(S)],    (2)

Specifically, in the feature selection problem, P is the original feature set, p_i is the i-th feature, S ranges over all feature subsets that do not contain feature p_i, and the function v is represented by the classification result of the selected features under the classifier.
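The exact Shapley value of Equation (2) can be computed by enumerating every coalition. A sketch follows; the toy additive payoff in the usage note stands in for the classifier-based payoff used in this paper:

```python
from itertools import combinations
from math import factorial


def shapley_value(i, players, v):
    """Exact Shapley value of player i: the weighted average of i's
    marginal contribution v(S ∪ {i}) - v(S) over all coalitions S ⊆ P\\{i}."""
    others = [p for p in players if p != i]
    n = len(players)
    phi = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            # Weight |S|!(n - |S| - 1)! / n! from Equation (2)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi += weight * (v(frozenset(S) | {i}) - v(frozenset(S)))
    return phi
```

With the additive game v(S) = |S|, every player's Shapley value is 1, and the values of all players sum to v(P), the efficiency property. Note the cost is exponential in the number of players, which is why practical feature selection uses sampled or approximate variants.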

Multi-Label Learning
In order to better illustrate the difference between multi-label learning and traditional single-label learning, we give an example of an image. As shown in Figure 1, (a) is the original image, and (b) is the annotated image of (a). First, the meaning of traditional single-label classification is explained. When we judge whether the water area in (a) is a sea, a lake or a river, we can see that it belongs to the sea, and it cannot belong to the lake and the river at the same time, because these three categories are in conflict. This is a single-label multi-class problem, that is, there are three categories of sea, lake and river under the label describing water. However, an image often contains more than one object. As shown in (b), the image can be annotated with three labels including sky, sea and sand beach. Each label can be divided into two categories, 0 and 1, meaning not including and including the object. Obviously, these two categories are contradictory. However, the three labels can exist at the same time, and there is a certain connection between them. For example, if an image contains sea water, then it may also include sky and sand beach, which is in line with the objective laws of the world. Therefore, multi-label classification is close to real life. The classification in (b) can be regarded as a multi-label binary classification problem. In the more general multi-label multi-class problem, each sample corresponds to multiple labels, and each label has multiple categories.
Figure 1. (a) The original image; (b) Image annotation.

The original single-label learning method cannot be directly used in multi-label learning [53], because every sample of multi-label data is labeled with one or more labels simultaneously. Moreover, the labels may be related to each other. The definition of multi-label learning is as follows: Let X = {x_1, x_2, . . . , x_N} be the d-dimensional input variable on the real number field, and Y = {y_1, y_2, . . . , y_q} be the label space.
D = {(x_i, Y_i) | 1 ≤ i ≤ N} is the training dataset, and Y_i is the true label set of training data x_i. During the training process, the algorithm learns the function h : X → 2^Y based on the training data. Given a test dataset H = {(x_j, y_j) | 1 ≤ j ≤ t}, when the test data x_j ∈ X is input, the predicted labels closest to the proper labels of x_j are obtained through the function h.
Multi-label classification has unique evaluation measures to analyze the quality of the classification results, which are divided into example-based criteria and label-based criteria [54]. In this paper, we mainly introduce the six criteria involved in the experiment:

• Ranking Loss: It evaluates the fraction of label pairs in which an irrelevant label is ranked before a relevant label. Ȳ_i is the complementary set of Y_i.
• Average Precision: It measures, averaged over instances, the fraction of relevant labels that are ranked higher than a particular relevant label. rank_h is the descending rank function.
• Coverage: It records the minimum number of steps needed, moving down the sample's predicted label ranking, to cover all true labels associated with the sample.
• Hamming Loss: It measures the proportion of misclassified label pairs, where Δ represents the symmetric difference between the predicted label set and the true label set.
• Macro-F1: It is a label-based index that takes the average F-measure over every label.
• Micro-F1: It is a label-based index that computes the F-measure globally over the whole prediction matrix.
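To make the distinction between these criteria concrete, here is a minimal sketch of Hamming loss, Micro-F1 and Macro-F1 on 0/1 label matrices (function names and the row-per-sample layout are illustrative, not the paper's code):

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of misclassified (sample, label) pairs."""
    n, q = len(Y_true), len(Y_true[0])
    return sum(t != p for rt, rp in zip(Y_true, Y_pred)
               for t, p in zip(rt, rp)) / (n * q)


def f1(tp, fp, fn):
    """F-measure from true-positive, false-positive and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0


def micro_macro_f1(Y_true, Y_pred):
    """Micro-F1 pools counts over all labels; Macro-F1 averages per-label F1."""
    q = len(Y_true[0])
    per_label = []
    TP = FP = FN = 0
    for j in range(q):
        tp = sum(t[j] and p[j] for t, p in zip(Y_true, Y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(Y_true, Y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(Y_true, Y_pred))
        TP, FP, FN = TP + tp, FP + fp, FN + fn
        per_label.append(f1(tp, fp, fn))
    return f1(TP, FP, FN), sum(per_label) / q
```

The two F1 variants diverge exactly when labels have unbalanced error rates, which is why both are reported in Section 5.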

Objective Function
Generally, the number of objectives in multi-objective optimization does not exceed three; problems with more than three objectives are defined as many-objective optimization problems. This paper sets three optimization objectives: AP, HL and the number of selected features. The calculation methods of AP and HL are shown in Equations (4) and (6). The larger the value of AP, the better; the smaller the values of HL and the number of selected features, the better. The relationship between the three objectives is complicated. First, the classification indicators and the number of features are contradictory in most situations: it is difficult to obtain high classification accuracy with a small number of features. Second, the literature [40] pointed out that AP and HL are contradictory. A single-objective optimization method can hardly handle such complex relationships between objectives, so a multi-objective optimization algorithm is adopted as the basic optimization method.
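A hedged sketch of how the three objectives could be packed into one fitness vector; the `evaluate` hook is hypothetical, and AP is negated so that all three objectives are treated as minimization, matching the dominance definition in Section 3:

```python
def fitness(mask, evaluate):
    """Three-objective fitness of a feature subset encoded as a 0/1 mask.
    `evaluate` (a placeholder hook) returns (AP, HL) of a classifier trained
    on the selected features. AP is maximized, so it is negated here."""
    ap, hl = evaluate(mask)
    return (-ap, hl, sum(mask))
```

With this convention, the `dominates` check from Section 3.1 applies unchanged to the three-objective vectors.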

Mutation Operator
Like the traditional genetic algorithm, NSGA III requires crossover and mutation operators to produce offspring. Traditional NSGA III suits continuous optimization problems, which means that the algorithm uses real-number encoding and operators such as simulated binary crossover and polynomial mutation. Feature selection is a discrete optimization problem, and every feature corresponds to a bit. In the genetic algorithm, for a dataset with d-dimensional features, the population is composed of multiple chromosomes of 0s and 1s with length d, and each chromosome represents a solution: 0 means the feature corresponding to the bit is not selected, and 1 means it is selected.
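The binary encoding described above can be sketched as follows (a minimal illustration, not the paper's implementation):

```python
import random


def init_population(pop_size, d, seed=None):
    """Random binary population: each chromosome is a 0/1 vector of length d;
    bit i = 1 means feature i is selected."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]


def decode(chromosome):
    """Indices of the selected features."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]
```

Decoding a chromosome such as [0, 1, 1, 0] yields the selected feature indices [1, 2], which are then fed to the classifier for fitness evaluation.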
In order to effectively identify the relevant features, redundant features and irrelevant features, the Shapley value is fused with the multi-objective optimization algorithm. A mutation operator and a crossover operator based on Shapley value are proposed so that the improved NSGA III algorithm can resolve the multi-label feature selection problem well.
In the real world, many problems involve high dimensions. The high-dimensional characteristics of the data increase the number of decision variables of the evolutionary algorithm, resulting in a huge search space, and the algorithm's optimization ability and speed are affected. At the beginning, the algorithm's randomness may lead to a wrong search direction for certain features. For example, suppose that there are m individuals in a population, the length of each individual is d, the i-th feature is relevant, and the j-th feature is irrelevant. Then in the early stage of the iteration, two situations may occur: most individuals discard the relevant i-th feature, or most individuals select the irrelevant j-th feature. It is therefore requisite to carry out mutation operations with greater probability on the bits in the above-mentioned situations.
However, it is difficult to know in advance whether a feature is relevant or irrelevant. Therefore, we introduce the Shapley value to evaluate each feature's contribution. Given that three objectives are optimized in this paper, including two multi-label evaluation criteria, the Shapley value of a feature is defined with v_1 as the HL value obtained by the feature subset S under the multi-label classifier and v_2 as the AP value; HL and AP are calculated as shown in Equations (6) and (4). When the unbalanced search situation no longer occurs, the relevant features have basically been selected and the irrelevant features discarded. In order to further judge the redundant features, more attention should be paid to features whose Shapley value is near 0: such features hardly contribute to classification, so whether they are chosen barely affects the classification result. The specific mutation procedure is shown in Algorithm 1. First, the Shapley value of each feature is calculated according to Equation (2), and the numbers of individuals that select and do not select each feature in the population are recorded (lines 4-8). Next, the calculation of the mutation probability is divided into two cases. The first searches for relevant and irrelevant features under the condition (ϕ_i < 0 && number1(i)/number2(i) ≥ 1/2) || (ϕ_i > 0 && number2(i)/number1(i) ≥ 1/2). In this case, the i-th feature may be wrongly selected or abandoned by most individuals, so it is necessary to increase its mutation probability: the more unbalanced the current feature's search, the greater the probability of mutation. When the population no longer satisfies the above conditions in the later period of evolution, the search for redundant features is performed: the smaller the absolute Shapley value of a feature, the greater the probability that it is redundant, and thus the greater its mutation probability.
After obtaining the mutation probability of each gene, the mutation operation is performed on the population. Uniform mutation is adopted in this paper, and the specific procedure is shown in Algorithm 2.
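A sketch of uniform mutation with per-gene probabilities; the Shapley-based schedule that would set `p_mut` is only summarized in the docstring, since the exact formula belongs to Algorithm 1 and is not reproduced here:

```python
import random


def mutate(chromosome, p_mut, rng=random):
    """Uniform mutation: flip bit i with its per-gene probability p_mut[i].
    In SHAPFS-ML, p_mut[i] would be raised when the sign of the feature's
    Shapley value disagrees with how often the population selects it (early
    search), or when |phi_i| is near 0 (late redundant-feature search);
    the values here are placeholders for that schedule."""
    return [1 - bit if rng.random() < p else bit
            for bit, p in zip(chromosome, p_mut)]
```

Per-gene probabilities of 1.0 or 0.0 make the operator deterministic, which is convenient for unit-testing the flipping logic.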

Crossover Operator
The uniform crossover operator is adopted in this paper; the exact process is shown in Algorithm 3. Similar to the mutation probability, the calculation of the crossover probability P_c is divided into two stages. When t = 1, the algorithm is in the early stage of the iteration; at this time, a global search is necessary, so the crossover probability of each individual is equal. When t = 0, the algorithm performs a local search for redundant features. If individuals with large fitness gaps are selected for crossover, the inheritance of excellent genes may be affected. Non-dominated individuals are at the same level and cannot be ranked in a multi-objective problem, so we use an individual's Shapley value to approximately assess its quality: when t = 0, if the difference between two individuals' Shapley values is large, the possibility of swapping genes is reduced.
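Uniform crossover itself can be sketched as below; how `p_swap` shrinks with the parents' Shapley-value gap in the late stage is a hypothetical schedule standing in for the paper's P_c formula:

```python
import random


def uniform_crossover(a, b, p_swap, rng=random):
    """Uniform crossover: swap each gene pair independently with probability
    p_swap. In SHAPFS-ML, p_swap would be uniform in the early stage (t = 1)
    and reduced when the two parents' Shapley scores differ greatly in the
    late stage (t = 0); that schedule is not reproduced here."""
    c1, c2 = list(a), list(b)
    for i in range(len(a)):
        if rng.random() < p_swap:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2
```

At p_swap = 1.0 the children are exact swaps of the parents, and at p_swap = 0.0 they are copies, bracketing the operator's behavior.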

The Improved Niche Preservation Mechanism
NSGA III employs a set of pre-set reference points to associate non-dominated solutions, and the niche preservation mechanism is employed to select non-dominated individuals from the critical front into the archive. Assume that there are ρ_j individuals associated with the j-th reference point. The selection process is as follows: First, a reference point j with the smallest ρ_j is randomly selected from the set of reference points. ρ_j = 0 means that no archived individual is associated with j; in this case, if there is an individual in the critical front associated with j, the individual closest to the reference line of j is selected into the archive, and otherwise other reference points are reconsidered. ρ_j ≠ 0 indicates that one or more individuals are already associated with j; in this case, if no individual in the critical front is associated with j, the next reference point is considered, and otherwise an individual associated with j is randomly selected from the critical front.
In this paper, we can obtain the Shapley value of each individual, which was introduced in Section 4.2. We sort the individuals in the critical front in descending order according to their Shapley values. When ρ_j ≠ 0, if there are individuals in the critical front associated with j, the first-ranked one among them is selected and added to the archive. The individual with the highest Shapley value indicates that the features selected by the individual have a higher average contribution to the classification, which means that the individual may be a promising solution. Sorting instead of random selection helps to improve the convergence of the algorithm.
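The improved selection step, picking the best-ranked associated individual rather than a random one, can be sketched as follows (names are illustrative, and only the ρ_j ≠ 0 branch is modeled):

```python
def pick_associated(associated, shapley_of):
    """Improved niche preservation for one reference point j with rho_j >= 1:
    `associated` lists the critical-front individuals associated with j, and
    `shapley_of` maps an individual to its Shapley score. The highest-scoring
    individual is chosen instead of a random one; None signals that another
    reference point should be considered."""
    if not associated:
        return None
    return max(associated, key=shapley_of)
```

Replacing random choice with this deterministic pick is the only change to the standard NSGA III niching step, so the archive size and association bookkeeping are unaffected.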

The Overall Flow of the Algorithm
In order to more clearly illustrate the specific process of the proposed algorithm, Figure 2 shows the flow chart of SHAPFS-ML.
First, the population is initialized. All individuals are binary-coded, and the length of the chromosome is the feature dimension of the input data. Then, the fitness values of the individuals in the population are calculated. The multi-label classifier MLKNN is used to evaluate the AP and HL values of the feature subset, and AP, HL and the size of the feature subset are used as the fitness values of the individual. After obtaining the fitness value matrix of the population, the Shapley value of each feature is calculated. Then the crossover probability and mutation probability are determined, and the evolution operation is executed. The regenerated populations are stratified through non-dominated relations. Because the capacity of the archive set is limited, the archive set is maintained by the improved maintenance strategy, and the non-dominated solutions in the archive set are updated. If the stop condition is not met, the above process is repeated; finally, all non-dominated solutions in the archive set are output.
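The overall loop just described can be condensed into a sketch in which every stage is abstracted as a placeholder callable (all four hooks are hypothetical stand-ins for the components described above):

```python
def shapfs_ml(init_pop, evaluate, evolve, maintain_archive, max_gen):
    """High-level loop of SHAPFS-ML: score the population (fitness plus
    Shapley values), apply the adaptive crossover/mutation operators, then
    maintain the bounded archive of non-dominated solutions."""
    population, archive = init_pop(), []
    for _ in range(max_gen):
        scored = evaluate(population)            # fitness + Shapley values
        population = evolve(population, scored)  # adaptive crossover/mutation
        archive = maintain_archive(archive, population)
    return archive
```

Keeping the stages as separate callables mirrors the flow chart: each box in Figure 2 corresponds to one hook, which makes the components testable in isolation.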

We can analyze how the framework is improved from two perspectives. From the perspective of features, each feature acts as a participant that cooperates with other features. The Shapley value can feed back the benefits of the alliances the feature participates in and of those it does not. The feedback adjusts the crossover probability and mutation probability of the feature, which affects the probability of the feature appearing in the next iteration, so that the algorithm heuristically searches promising areas. From the perspective of the population, the fitness function quantifies the quality of each individual. The population evolves through reproduction so that excellent genes are retained in each iteration, disadvantaged individuals are eliminated, and the entire population evolves in a better direction.

Experiment Settings
The experiments are conducted on seven multi-label datasets: flags, emotions, yeast, virus, Languagelog, genbase and medical. Flags, emotions, yeast, genbase and medical are available on MULAN, an open-source library for multi-label learning [55]. Languagelog is chosen from MEKA [56], an extended version of WEKA [57] for multi-label learning and evaluation. Virus is available in [58]. Table 1 shows the summary of the seven datasets. The classic multi-label classification algorithm ML-KNN [9] is employed as the multi-label classifier. K indicates the number of nearest neighbors, which is set to 10 as suggested in [9]. The parameter η is 0.3, and the population size is 20. The experiments are conducted on a laptop equipped with an Intel(R) Core (TM) i7-9750H CPU and 16 GB memory.

Comparing Methods
In this section, six comparison methods are employed to demonstrate the usefulness of the proposed algorithm. The traditional NSGA III is compared with SHAPFS-ML to analyze the effectiveness of the improved NSGA III algorithm. Similarly, the coding method of NSGA III is modified to binary, and uniform crossover and uniform mutation are adopted as the crossover and mutation methods. SHAPFS-ML and NSGA III have been independently run 20 times on each dataset. A non-dominated solution set is randomly selected from the running results, and the solution with the smallest sum of all objective function values is selected as the final result. MDFS constructs a low-dimensional embedding method to seek discriminative features [27]. MCLS is a manifold-based method that can transform the original label space and constrain the samples [26]. MIFS exploits the label correlations and decomposes the multi-label information [6]. MDDM maximizes the dependence between features and the associated labels; MDDM_proj is the projection-based dimensionality reduction variant of MDDM, and MDDM_spc is an uncorrelated-projection feature selection variant [59]. Tables 2-7 show the comparison results on the seven datasets under the six multi-label learning criteria introduced in Section 3.3. ↑ indicates that larger values are better; ↓ indicates that smaller values are better. Generally speaking, the performance of SHAPFS-ML is the best (in bold). Avg.Rank is the average ranking of each algorithm over all datasets; the smaller the Avg.Rank value, the better the performance of the algorithm. In detail, SHAPFS-ML obtained the best results on three indicators: average precision, coverage and Hamming loss. On ranking loss, SHAPFS-ML obtained the optimal results on all datasets except Languagelog, where it ranked second, although the difference between SHAPFS-ML and MCLS is small. On the Micro-F1 and Macro-F1 indicators, MCLS is better than SHAPFS-ML on the dataset genbase.
Although SHAPFS-ML does not rank first on these two indicators, it is significantly better than the other non-optimized methods on all datasets except genbase. For example, on the emotions dataset, SHAPFS-ML achieves 0.6446 on the MicroF indicator, while the other five non-optimized methods achieve 0.4636, 0.5114, 0.5156, 0.5829 and 0.5860, respectively, all below SHAPFS-ML. Similarly, on the flags dataset, SHAPFS-ML obtains 0.6546 on the MacroF indicator, whereas among the other methods the lowest value, from MDFS, is 0.4777, and the highest, from MCLS, is 0.5657. This observation reveals that SHAPFS-ML yields a remarkable improvement. In terms of average ranking, SHAPFS-ML ranks first, followed by NSGA III, which indicates that the classification results of SHAPFS-ML improve on those of the traditional NSGA III. Moreover, the multi-objective optimization-based methods are more advantageous than the other non-optimized algorithms, and their performance is relatively stable on datasets of different scales. Among the non-optimized methods, MCLS performs best and MDDM_proj performs worst.

Table 8 shows the ratio of the number of features selected by each method to the size of the full feature set. SHAPFS-ML removes at least 60% of the features on the flags, emotions and virus datasets, and more than 50% of the features on the remaining datasets. Although SHAPFS-ML does not select the fewest features, it achieves the best classification results. From the above discussion, we can conclude that SHAPFS-ML is competitive among the well-established comparison methods. The average ranking of SHAPFS-ML is better than that of NSGA III under all six multi-label learning criteria. The main difference between the two algorithms lies in the crossover and mutation operators, which govern the global and local search capabilities of the algorithm.
The two types of searches cooperate with each other to achieve a balanced state. SHAPFS-ML has two main advantages. First, the crossover and mutation operators proposed in this paper adaptively calculate the crossover and mutation probabilities of the gene loci corresponding to the features according to their Shapley values during the evolution process. The two operators cooperate and compete with each other to enhance the exploitation of the feature space and the ability to explore local features. Second, the multi-objective optimization algorithm considers the combined effect of features, while the introduction of the Shapley value measures the effect of each single feature; feature combinations that include well-performing features tend to be more competitive. Therefore, in the problem of multi-label feature selection, the optimization ability of SHAPFS-ML is stronger than that of the traditional NSGA III.
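The adaptive behaviour described above can be illustrated with a small sketch. The code below is not the paper's exact operator: the linear shaping of the flip probability by the normalized Shapley value and the cap `p_max` are assumptions made purely for illustration.

```python
# Illustrative sketch of Shapley-guided mutation (NOT the paper's exact
# operator): each gene's flip probability is shaped by its feature's
# normalized Shapley value, so weak features tend to be dropped and strong
# features tend to be switched on.  `p_max` and the linear shaping are
# assumptions made for this sketch.
import random

def shapley_guided_mutation(individual, shapley, p_max=0.3, seed=None):
    """individual: binary gene list; shapley: one contribution per feature."""
    rng = random.Random(seed)
    lo, hi = min(shapley), max(shapley)
    span = (hi - lo) or 1.0
    child = individual[:]
    for i, s in enumerate(shapley):
        w = (s - lo) / span                # normalized contribution in [0, 1]
        # A selected weak feature is likely dropped; an unselected strong
        # feature is likely added.
        p_flip = p_max * (1.0 - w) if child[i] == 1 else p_max * w
        if rng.random() < p_flip:
            child[i] = 1 - child[i]
    return child
```

With `p_max` at its extreme value of 1.0, an unselected feature with the highest Shapley value is always switched on and a selected feature with the lowest value is always dropped, which makes the bias of the operator easy to verify.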

The Comparison on Hypervolume Indicator
To quantify the quality of the Pareto set obtained by SHAPFS-ML, the Hypervolume (HV) indicator is introduced into the evaluation of the experimental results. HV is a commonly used indicator for multi-objective algorithms [60]; the larger the HV value, the better the algorithm's capability. HV calculates the volume of the hypercube formed by a reference point and the non-dominated solution set, and thus reflects both the distribution and the convergence of the algorithm. Consequently, different non-dominated solution sets yield different HV values. We have run SHAPFS-ML and NSGA III 20 times each to calculate the average, best and worst HV values. As shown in Table 9, SHAPFS-ML obtains higher average and best HV values, which shows that its multi-objective search ability is improved compared with the traditional NSGA III algorithm, and that it obtains a widely distributed and uniform Pareto solution set.

Table 9. HV values of multi-objective algorithms.
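As a concrete illustration of the HV computation described above, the sketch below handles the bi-objective minimization case with a sweep over the sorted front; the function name and the example front are illustrative and are not taken from the experiments.

```python
# Minimal hypervolume (HV) sketch for a bi-objective MINIMIZATION problem:
# the area dominated by the front and bounded by a reference point.

def hypervolume_2d(front, ref):
    """front: list of (f1, f2) non-dominated points (both minimized).
    ref: (r1, r2) reference point, worse than every point in `front`."""
    # Keep only points that strictly dominate the reference point.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:          # sweep from best f1 to worst f1
        if f2 < prev_f2:        # skip points dominated within the sweep
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

# A front closer to the origin dominates more area:
print(hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], (4.0, 4.0)))  # 6.0
```

In more than two dimensions HV is usually computed with a dedicated library routine, since exact computation grows expensive with the number of objectives.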

Shapley Value Analysis
To further analyze the validity of applying the Shapley value to multi-label feature selection, we sort the features' Shapley values calculated in the last iteration of SHAPFS-ML and gradually select features for classification in descending order of contribution. NSGA III is a global optimization algorithm, and the number of features it selects cannot be fixed arbitrarily; therefore, NSGA III is not used as a comparison algorithm in this section. The non-optimized algorithms MDFS, MCLS, MIFS, MDDM_proj and MDDM_spc, as mentioned in Section 5.3, are compared for analysis.
The main purpose of this section is to verify whether the Shapley value method can reasonably analyze the contribution of features. This study is meaningful for feature selection: features with a high contribution can be regarded as relevant features, which help samples to be correctly classified, while features with low contributions can be regarded as irrelevant features, which interfere with the classification process and may even reduce the classification accuracy. The contribution of a feature thus reflects its importance. The other non-optimized methods essentially use different measures to quantify feature importance, so it is reasonable to compare the feature rankings based on Shapley values with those methods. Figures 3-9 show the values of the six algorithms on the seven datasets under the different indicators as the number of features increases. We can see that SHAPFS-ML achieves a significant improvement on all six indicators compared with the other methods on emotions, virus and medical. As the number of features increases, SHAPFS-ML tends to stabilize after reaching a maximum. On yeast and Languagelog, the performance of MCLS is closest to SHAPFS-ML, but SHAPFS-ML still obtains essentially optimal values. On the flags dataset, SHAPFS-ML is slightly inferior to MCLS except on Coverage and MacroF. On the genbase dataset, SHAPFS-ML obtains the optimal value on all indicators except MicroF and MacroF.
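To make the contribution computation concrete, the sketch below approximates per-feature Shapley values by permutation sampling and then ranks features in descending order, matching the gradual selection procedure above. The value function here is a toy stand-in (in SHAPFS-ML it would be a multi-label classification score on the candidate feature subset), and all names are illustrative rather than the paper's implementation.

```python
# Permutation-sampling (Monte Carlo) approximation of per-feature Shapley
# values: average each feature's marginal contribution over random join orders.
import random

def shapley_values(features, v, n_perms=2000, seed=0):
    rng = random.Random(seed)
    phi = {f: 0.0 for f in features}
    for _ in range(n_perms):
        order = features[:]
        rng.shuffle(order)
        coalition, prev = frozenset(), v(frozenset())
        for f in order:
            coalition = coalition | {f}
            cur = v(coalition)
            phi[f] += cur - prev    # marginal contribution of f in this order
            prev = cur
    return {f: total / n_perms for f, total in phi.items()}

# Toy value function: feature 0 adds 0.3, feature 1 adds 0.1, feature 2 nothing.
def toy_score(subset):
    return 0.3 * (0 in subset) + 0.1 * (1 in subset)

phi = shapley_values([0, 1, 2], toy_score)
ranked = sorted(phi, key=phi.get, reverse=True)  # descending contribution
```

Because the toy value function is additive, the estimated values coincide with each feature's own contribution, so the ranking `[0, 1, 2]` separates relevant features (0 and 1) from the irrelevant feature 2, which is exactly the distinction the Shapley analysis is meant to support.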

Through the above analysis, it can be seen that the Shapley value in SHAPFS-ML can effectively evaluate features and identify effective ones. In general, it is difficult for non-optimized multi-label feature selection algorithms to determine the optimal number of features, especially in multi-label learning: under different indicators, the number of features that yields the optimal value differs. For a dataset with d dimensions, the classification algorithm must be run d times to obtain the value corresponding to each number of features on one indicator. In contrast, the optimization algorithm has excellent global search capabilities and can obtain approximately optimal solutions in one run. According to the experimental results, the features selected by SHAPFS-ML perform better under different multi-label indicators and have better stability.

Complexity Analysis
In this section, the computational complexity of the proposed algorithm is analyzed. When the population size is set to N_p and the number of objectives is M, population initialization, the crossover operator, the mutation operator and the individual fitness calculations all require time linear in N_p. The fast non-dominated sort requires O(M N_p^2) basic operations, and the reference-point-based selection of non-dominated individuals requires no more than this, so the per-generation complexity of SHAPFS-ML is dominated by the non-dominated sort, O(M N_p^2), the same order as the traditional NSGA III.

Comparison of Running Time
The running time of a wrapper feature selection algorithm based on evolutionary optimization depends on the evolutionary algorithm, the size of the dataset and the classification algorithm, so it is difficult to compare the actual running time across heterogeneous methods [41]. Therefore, we compare the running times of SHAPFS-ML and NSGA III in this section. The reported running time is the average over 20 independent runs of each algorithm. As Table 10 shows, the running times of the two algorithms are close, especially on the flags, emotions and virus datasets. SHAPFS-ML is built on NSGA III, and both the improved and the traditional crossover and mutation operators require linear time.
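The timing protocol above (averaging wall-clock time over 20 independent runs) can be sketched as follows; `algorithm` is a hypothetical stand-in for one complete run of SHAPFS-ML or NSGA III on a dataset.

```python
# Sketch of the timing protocol: average wall-clock time over independent
# runs.  `algorithm` is a hypothetical stand-in for one complete run of
# the evolutionary feature selection algorithm.
import time

def average_runtime(algorithm, n_runs=20):
    total = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        algorithm()
        total += time.perf_counter() - start
    return total / n_runs
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic clock with the highest available resolution, which matters when individual runs are short.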

Conclusions
Multi-label classification problems are common in real life. In recent years, there have been many studies in the field of multi-label feature selection, but few methods are based on multi-objective optimization. A wrapper multi-objective optimization feature selection algorithm for multi-label learning fused with the Shapley value (SHAPFS-ML) is proposed in this work. This method has two notable properties. First, the idea of the Shapley value in game theory is combined with feature selection: we regard feature selection as a cooperative game between features, and an excellent combination of features is selected by evaluating both features and individuals. Second, a mutation operator and a crossover operator based on the Shapley value are proposed to balance the algorithm's exploration and exploitation capabilities. Experimental comparisons with other well-established multi-label feature selection methods on multi-label datasets demonstrate the validity of SHAPFS-ML.
In future work, we will use the Shapley value method to realize feature visualization and further distinguish relevant, redundant and irrelevant features. This research will improve the interpretability of multi-label feature selection algorithms. We will also apply the multi-label feature selection algorithm to specific problems, such as image annotation.