Multi-Label Feature Selection Based on High-Order Label Correlation Assumption

Multi-label data often involve features with high dimensionality and complicated label correlations, which poses a great challenge for multi-label learning. Feature selection plays an important role in multi-label learning to address such data, and exploring label correlations is crucial for multi-label feature selection. Previous information-theoretical-based methods employ a cumulative summation approximation to evaluate candidate features, which considers only low-order label correlations. In fact, high-order label correlations exist in the label set: labels naturally cluster into several groups, similar labels tend to fall into the same group, and dissimilar labels belong to different groups. However, the cumulative summation approximation tends to select features related to the groups containing more labels while ignoring the classification information of groups containing fewer labels. As a result, many features related to similar labels are selected, which leads to poor classification performance. To this end, a Max-Correlation term considering high-order label correlations is proposed. Additionally, we combine the Max-Correlation term with a feature redundancy term to ensure that the selected features are relevant to different label groups. Finally, a new method named Multi-label Feature Selection considering Max-Correlation (MCMFS) is proposed. Experimental results demonstrate the classification superiority of MCMFS in comparison to eight state-of-the-art multi-label feature selection methods.


The Background of Multi-Label Feature Selection
During the past decade, multi-label learning has gradually attracted significant attention and has been widely utilized in diverse real-world applications, such as text categorization [1,2], information retrieval [3,4] and gene function classification [5,6]. In multi-label data sets, each instance is related to multiple class labels simultaneously. For example, in text categorization tasks, a news document may be associated with several topics simultaneously, such as "society", "economy" and "legality". Let X = R^d denote the d-dimensional instance space and L = {l_1, l_2, ..., l_q} denote the label space including q possible class labels. The task of multi-label learning is to obtain the set of labels related to an unseen instance x ∈ X by learning a classification model from the training data set D = {(x_1, L_1), (x_2, L_2), ..., (x_n, L_n)}, where L_i ⊆ L is the set of labels associated with x_i and x_i ∈ X (1 ≤ i ≤ n) is a d-dimensional vector (x_{i1}, x_{i2}, ..., x_{id}) [7][8][9]. The classification performance of multi-label learning is closely related to the quality of the input features. Like traditional single-label learning algorithms, multi-label learning often faces the curse of dimensionality [10].
The high-dimensional multi-label data set often contains a large number of irrelevant and redundant features that bring many disadvantages to the multi-label learning such as the computational burden and over-fitting [10][11][12]. To address this problem, many multi-label feature selection techniques have been proposed to select the informative feature subset from the original feature set and to discard irrelevant and redundant features [13][14][15]. Feature selection techniques not only reduce the computing costs but also improve the classification performance effectively [16].
Multi-label feature selection methods are usually categorized into three groups: filter methods, wrapper methods and embedded methods [12,[17][18][19][20]. Filter methods are classifier-independent: they rank features according to their relevance to the label set without considering any learning algorithm. Wrapper methods evaluate the importance of feature subsets based on the classification performance of a specific classifier, searching over possible feature combinations, so the subset they select is tailored to that learning algorithm. Embedded methods integrate feature selection into the training process of the classifier. Filter methods have the advantage of low computational cost, but their classification performance is often inferior to that of wrapper methods, especially in multi-label feature selection. In this paper, we focus on filter-based multi-label feature selection and design a new method that considers high-order label correlations and selects the most informative features to improve the prediction performance of filter methods.

Information-Theoretical-Based Multi-Label Feature Selection Methods
Different from single-label feature selection methods that evaluate the relevancy between features and only one class label (binary or multiclass), multi-label feature selection methods consider the correlations between features and a set of labels [21,22]. Moreover, the labels in multi-label data are usually not independent, and the internal correlations among labels are often very complicated [23,24]. Many filter-based feature selection methods have been proposed to take label correlations into account in the evaluation of features, among which information-theoretical-based measures have been shown to be adequate [25][26][27][28]. The purpose of information-theoretical-based multi-label feature selection is to obtain an optimal feature subset by employing information measures, where mutual information is widely utilized to evaluate the correlation between features and the label set. Suppose that S = {f_1, f_2, ..., f_k} is a feature subset and L = {l_1, l_2, ..., l_q} is the target label set; the mutual information I(S; L) can be written as:

I(S; L) = Σ_s Σ_l p(s, l) log (p(s, l) / (p(s) p(l))),  (1)

where s and l range over the joint value assignments of the features in S and the labels in L. The feature subset maximizing Equation (1) provides the maximal information about the label set and can be considered the optimal feature subset. However, according to Equation (1), an inevitable problem is that the joint probability p(·) is difficult to estimate accurately. Therefore, many feature selection methods based on low-order label correlations have emerged to obtain an approximately optimal feature subset. Some multi-label feature selection methods [29][30][31] use the accumulated mutual information between candidate features and each label to evaluate feature relevance; these methods consider first-order label correlations, assuming that the labels are independent of each other.
Additionally, some methods [32,33] employ the accumulated conditional mutual information or the interaction information to measure the impact of a candidate feature with each pair of labels, considering second-order label correlations. These methods have been proved to be effective in addressing the curse of dimensionality. In fact, there always exist high-order label correlations: the label set can be abstracted into several semantic groups, in which the same semantic group consists of similar labels and different semantic groups have low dependency. Thus, the cumulative summation approximation over the whole label set may lead to the following issues:
1. Overestimating the significance of features that have strong correlations with one semantic group containing many labels while being almost independent of the other labels, especially in data with a large collection of labels.
2. Ignoring key features that are highly correlated with semantic groups containing fewer labels.
3. Selecting more redundant features, which are often associated with labels in the same semantic group.
In order to address the issues above, we propose a new feature selection method. The main contributions are as follows:
• A new term named Max-Correlation (MC) is designed based on the assumption that labels cluster into several groups and that labels in the same group possess similar semantic meanings. The MC term employs the maximum operation to select the most informative feature. Additionally, the MC term is not limited by the number of labels in a semantic group, which effectively addresses issues 1 and 2 above.
• We propose a novel feature selection method for multi-label learning based on the Max-Correlation, named Multi-label Feature Selection considering the Max-Correlation (MCMFS), which not only maximizes the feature correlation between candidate features and the label set but also minimizes the feature redundancy within the already-selected feature subset. As a result, our method tends to select features from different semantic groups.
• The effectiveness of the proposed MCMFS method is validated on one artificial data set and twelve real-world multi-label data sets. The experimental results demonstrate that the proposed method can select compact feature subsets and achieve better classification performance in terms of multiple evaluation criteria.
The remainder of this paper is organized as follows. Section 2 introduces some basic concepts of information theory and four evaluation criteria for multi-label classification performance. Section 3 briefly reviews the related work. In Section 4, we propose the new multi-label feature selection method MCMFS. Section 5 presents the experimental results to verify the effectiveness of the proposed method. In Section 6, we draw conclusions and give the directions of our future research.

The Basic Concepts of Information Theory
In this subsection, we introduce some basic concepts of information theory, which are used to measure the correlations among random variables [34,35]. Let X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_m} be two discrete random variables. The mutual information measures the amount of information shared by two variables and is defined as:

I(X; Y) = Σ_i Σ_j p(x_i, y_j) log (p(x_i, y_j) / (p(x_i) p(y_j))),

where p(x_i, y_j) is the joint probability of (x_i, y_j), p(x_i) is the probability of x_i, p(y_j) is the probability of y_j and the base of the log is 2. Mutual information can equivalently be expressed as the reduction of uncertainty about variable X given Y:

I(X; Y) = H(X) - H(X|Y),

where H(X) is the entropy of the variable X, which measures the uncertainty of X, and H(X|Y) is the conditional entropy of X given Y, which measures the uncertainty left in X under the condition of Y. H(X) and H(X|Y) are defined as:

H(X) = -Σ_i p(x_i) log p(x_i),

H(X|Y) = -Σ_i Σ_j p(x_i, y_j) log p(x_i|y_j),

where p(x_i|y_j) is the conditional probability of x_i given y_j. Conditional mutual information measures the mutual information between two random variables under the condition of another random variable Z:

I(X; Y|Z) = H(X|Z) - H(X|Y, Z),

where Z is a discrete random variable and H(X|Z) and H(X|Y, Z) are two conditional entropies. The joint mutual information between the pair (X, Y) and Z can be defined as:

I(X, Y; Z) = I(X; Z) + I(Y; Z|X).

Interaction information measures the amount of information shared by three variables:

I(X; Y; Z) = I(X; Y) - I(X; Y|Z).
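In practice, these quantities are estimated from data by plugging in empirical frequencies. The following sketch (our own illustration, not part of the original method) computes entropy, mutual information and conditional mutual information for discrete samples:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical entropy H(X) of a discrete sample, in bits."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Empirical I(X; Y) via the identity I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mutual_information(xs, ys, zs):
    """Empirical I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))
```

For example, two identical binary columns share exactly H(X) bits of information, while two independent balanced columns share none.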

Multi-Label Evaluation Metrics
To evaluate the classification performance of different multi-label feature selection methods, four evaluation metrics widely used in multi-label learning are adopted in this paper: Hamming Loss, Zero-One Loss, Macro-average and Micro-average [36].
Let D = {(x_1, L_1), (x_2, L_2), ..., (x_n, L_n)} be a multi-label test set and L = {l_1, l_2, ..., l_q} be the label set, where n is the number of instances and L_i ⊆ L is the set of true labels of instance x_i. Let L'_i denote the predicted label set for instance x_i obtained by the multi-label classifier.
Hamming Loss (HL) calculates the average fraction of misclassified labels:

HL = (1/n) Σ_{i=1}^{n} |L_i ⊕ L'_i| / q,

where ⊕ denotes the symmetric difference between the true label set L_i and the predicted label set L'_i. For example, let L = {l_1, l_2, l_3, l_4, l_5}, L_i = {l_1, l_3, l_5} and L'_i = {l_1, l_2, l_5}. L_i corresponds to the vector v = (1, 0, 1, 0, 1), where v_j = 1 or 0 (j = 1, 2, ..., 5) means that l_j is or is not included in L_i, and L'_i corresponds to the vector v' = (1, 1, 0, 0, 1). Then |L_i ⊕ L'_i| = |{l_2, l_3}| = 2, so this instance contributes 2/5 to HL.

Zero-One Loss (ZOL) calculates the average fraction of instances whose most confident label is not in the relevant label set:

ZOL = (1/n) Σ_{i=1}^{n} δ(x_i),

where δ(x_i) = 1 if argmax_{l∈L} h(x_i, l) ∉ L_i and 0 otherwise. h(x_i, l) is the real-valued function produced by the multi-label classifier, which returns the confidence that label l is a proper label of x_i; argmax_{l∈L} h(x_i, l) is therefore the most confident label for x_i.
Macro-average (Macro-F1) and Micro-average (Micro-F1) based on the F1 score are two widely adopted evaluation criteria for multi-label learning. Macro-F1 is the arithmetic average of the F1 scores of all q labels:

Macro-F1 = (1/q) Σ_{i=1}^{q} 2TP_i / (2TP_i + FP_i + FN_i),

where TP_i, FP_i and FN_i denote the number of true positives, false positives and false negatives for the i-th label, respectively. Micro-F1 can be considered a weighted average of the F1 score over all q labels, obtained by pooling the counts:

Micro-F1 = 2 Σ_{i=1}^{q} TP_i / (2 Σ_{i=1}^{q} TP_i + Σ_{i=1}^{q} FP_i + Σ_{i=1}^{q} FN_i).

The multi-label classification performance can be measured using the evaluation criteria mentioned above. Lower values of HL and ZOL indicate better classification performance, whereas higher Macro-F1 and Micro-F1 values indicate better classification performance.
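As an illustration, three of these metrics can be computed directly from binary indicator vectors (a minimal sketch with function names of our own choosing; ZOL is omitted because it additionally requires the classifier's confidence scores):

```python
def hamming_loss(Y_true, Y_pred):
    """Average fraction of misclassified labels over n instances and q labels."""
    n, q = len(Y_true), len(Y_true[0])
    return sum(yt != yp for rt, rp in zip(Y_true, Y_pred)
               for yt, yp in zip(rt, rp)) / (n * q)

def macro_f1(Y_true, Y_pred):
    """Arithmetic mean of per-label F1 scores (0 for labels with empty counts)."""
    q = len(Y_true[0])
    f1s = []
    for j in range(q):
        tp = sum(t[j] and p[j] for t, p in zip(Y_true, Y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(Y_true, Y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(Y_true, Y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return sum(f1s) / q

def micro_f1(Y_true, Y_pred):
    """F1 computed from TP/FP/FN counts pooled over all labels."""
    tp = fp = fn = 0
    for t, p in zip(Y_true, Y_pred):
        for tj, pj in zip(t, p):
            tp += tj and pj
            fp += (not tj) and pj
            fn += tj and (not pj)
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
```

On the worked example above (v = (1, 0, 1, 0, 1) vs. v' = (1, 1, 0, 0, 1)), `hamming_loss` returns 2/5, consistent with the hand computation.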

Related Work
Conventional multi-label feature selection methods can be divided into two groups: problem transformation and algorithm adaptation [37,38]. Problem transformation methods include two steps: (1) transform the multi-label data set into one or more single-label data sets; (2) select the relevant features from the transformed data sets. Binary Relevance (BR) [39], Label Powerset (LP) [40] and Pruned Problem Transformation (PPT) [41] are common problem transformation methods. BR decomposes the multi-label data set into several independent binary classification data sets, while LP maps each instance's label set to a single new class. Spolaôr et al. [42] propose four multi-label feature selection methods based on BR and LP that employ ReliefF (RF) [43] and Information Gain (IG) [44] as the feature evaluation criteria on the transformed data (RF-BR, RF-LP, IG-BR and IG-LP). However, BR ignores the label correlations, and LP may create too many classes, causing over-fitting and imbalance problems. PPT removes the instances with rarely occurring label sets, using a predefined minimal number of occurrences, to improve the effectiveness of LP. Doquire and Verleysen [45] propose a multi-label feature selection method based on mutual information using PPT (PPT + MI). In addition, the χ² statistic is used to select effective features (PPT + CHI) [41]. However, the problem transformation-based multi-label feature selection methods usually ignore the correlations among labels or lose label information.
In recent years, many algorithm adaptation-based multi-label feature selection methods that directly select features from the multi-label data set have been proposed. Kashef and Nezamabadi-pour [15] propose a multi-label feature selection algorithm based on the Pareto dominance concept that intends to select label-specific features by solving a multi-objective optimization problem. Sun et al. [26] propose a novel Mutual-Information-based feature selection method via constrained Convex Optimization (MICO), which obtains discriminative features while considering the label correlations. Multi-label Informed Feature Selection (MIFS) [46] is an embedded feature selection method that decomposes the multi-label information into a low-dimensional label space using Latent Semantic Indexing (LSI) and then employs the reduced label space to steer the feature selection process via a regression model. Lee and Kim [32] propose a multi-label feature selection method based on information theory named Pairwise Multi-label Utility (PMU), whose evaluation function J(f_k) is defined over a candidate feature f_k, an already-selected feature subset S with members f_j, and each pair of labels l_i and l_j from the label set L; the PMU method selects the feature f_k with the largest value of J(f_k). The multi-label feature selection method using interaction information (D2F) [29] measures the feature correlation between candidate features and each label in the label set. In addition, the Scalable Criterion for a Large Label Set (SCLS) [30] designs a multi-label feature selection criterion based on scalable relevance evaluation. Lin et al. [31] propose a multi-label feature selection method based on Max-Dependency and Min-Redundancy (MDMR) that maximizes the feature dependency between candidate features and each label using mutual information and minimizes the feature redundancy between the candidate feature and each already-selected feature, where |S| denotes the number of features in the already-selected feature subset S. Furthermore, multi-label Feature Selection based on Label Redundancy (LRFS) [33] employs the conditional mutual information between candidate features and each label given other labels to measure feature relevancy. From the above introduction, we can find that previous information-theoretical-based multi-label feature selection methods employ the cumulative summation approximation to take first-order and second-order label correlations into account. In fact, high-order label correlations exist in real-world multi-label data sets; naturally, labels cluster into several groups. The common limitation of these methods is that the cumulative summation may overestimate the significance of candidate features that are related to the groups containing more labels while ignoring the classification information of the groups containing fewer labels. To accurately explore and exploit high-order correlations among labels, we first design a Max-Correlation (MC) term based on the assumption that similar labels cluster into the same group and dissimilar labels belong to different groups. Then, we propose a novel method named Multi-label Feature Selection considering the Max-Correlation (MCMFS).

Proposed Method
Many information-theoretical-based multi-label feature selection methods apply various low-order approximations to evaluate candidate features. The D2F, SCLS and MDMR methods [29][30][31] employ the accumulated mutual information to quantify the contribution of features to the label set:

J(f_k) = Σ_{i=1}^{q} I(f_k; l_i),  (17)

where f_k is a candidate feature and l_i ∈ L (i = 1, 2, ..., q) is one label. Equation (17) assumes that the labels are independent of each other in the design of the feature relevancy term, which can be described as shown in Figure 1a. In addition, conditional mutual information and interaction information are also used to consider the impact of candidate features with each pair of labels (l_i, l_j), such as in PMU [32] and LRFS [33], which can be described as shown in Figure 1b.
Figure 1. The correlation between feature f_k and the label set under the first-order (a) and second-order (b) label correlation assumptions.

Figure 1 displays the first-order and second-order correlations among labels. However, label correlations are complicated and of a high-order nature in real-world data sets: the labels naturally cluster into several groups with abstracted semantic meanings. For example, in text categorization, the topics "Athletics", "Gymnastics" and "Swimming" can be abstracted into the semantic meaning "Sports", and the topics "Beach", "Sea" and "Mountain" can be abstracted into the semantic meaning "Nature". Labels in the same semantic group have larger dependency, while labels in different semantic groups are more distinctive. In the literature [25], the 45 labels in the medical data set, which was used in the Computational Medicine Center's 2007 Medical Natural Language Processing Challenge, are divided into 4 main groups according to statistical information about the labels. Different groups are almost independent of each other, and the numbers of labels in different groups are not equal. Therefore, we expect to select features that are highly discriminative for each semantic group, thereby obtaining representative features for the different semantic meanings.
As in Equations (17)-(19), the cumulative summation of information terms tends to select features that are related to a single semantic group, which leads to overestimating the significance of some features, especially when the number of labels in that semantic group is large. As a result, many redundant features are selected. For example, suppose that the total number of labels is 100 and there are two main semantic groups in the label set, C_1 and C_2. If the number of labels in C_1 is 90 and the number of labels in C_2 is 10, then the cumulative summation criterion prefers to select features that are associated with the labels in C_1, while reducing the selection possibility of features from C_2. In such a situation, the critical features that are highly related to the semantic groups containing few labels are neglected, because the value of the cumulative summation is small when these features are independent of most other labels. Additionally, the selection possibility of redundant features increases due to the overestimation of feature significance when these features are associated with the same semantic group containing many labels. However, an effective and compact feature subset should contain features from different semantic groups, which has been proved to be effective [47].
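A toy numeric example makes the bias concrete. The I(f; l) scores below are hypothetical values invented for illustration only: feature f_a is moderately relevant to every label of the large group, while f_b is the single key feature of the small group. Cumulative summation ranks f_a far above f_b, whereas a per-group maximum keeps f_b's value visible:

```python
# Hypothetical mutual-information scores I(f; l), grouped by semantic group
# (values invented for illustration, not taken from the paper's data).
# C1 holds 9 similar labels; C2 holds a single label.
f_a = {"C1": [0.30] * 9, "C2": [0.05]}   # moderately relevant to C1 only
f_b = {"C1": [0.05] * 9, "C2": [0.90]}   # the key feature for C2

def cumulative(scores):
    """Cumulative summation criterion: sum of I(f; l) over all labels."""
    return sum(sum(group) for group in scores.values())

def group_max(scores):
    """Per-group maximum relevancy, mirroring the per-group view of the MC term."""
    return {g: max(vals) for g, vals in scores.items()}

# cumulative(f_a) ≈ 2.75 beats cumulative(f_b) ≈ 1.35, even though f_b is the
# only feature carrying information about C2; group_max() exposes this.
```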
To address this issue, we propose a new multi-label feature selection method that selects features from different semantic groups. Suppose that the label set L = {l_1, l_2, ..., l_q} can be divided into m semantic groups, that is, L = C_1 ∪ C_2 ∪ ... ∪ C_m. Our aim is to select the critical features of each semantic group, as shown in Figure 2. In order to avoid the overestimation problem caused by the different numbers of labels in the semantic groups, we employ the maximum operation to measure the mutual information between the candidate feature f_k and each semantic group C_i (i = 1, 2, ..., m):

Cor(f_k; C_i) = max_{l_j ∈ C_i} I(f_k; l_j).  (20)

Equation (20) measures the relevancy between the candidate feature and the labels in one semantic group: the larger the value of Equation (20), the more important the candidate feature is for that semantic group. Equation (20) is an upper bound on the relevancy between the candidate feature and each individual label in the semantic group; conversely, a small value of Equation (20) means that the relevancy between the candidate feature and all labels in the semantic group is weak. Finally, Equation (20) effectively avoids the overestimation caused by accumulation, even if many labels are in the same semantic group.
Thereafter, according to Equation (20), an m-dimensional vector Cor(f_k; L) for feature f_k and the label set L is obtained, that is,

Cor(f_k; L) = (Cor(f_k; C_1), Cor(f_k; C_2), ..., Cor(f_k; C_m)).

We select the maximum value of Cor(f_k; L) as the feature relevancy between the candidate feature and the entire label set L, which is named Max-Correlation (MC):

MC(f_k; L) = max_{1 ≤ i ≤ m} Cor(f_k; C_i).  (21)

Combining the MC term with a feature redundancy term yields the evaluation criterion of MCMFS:

J(f_k) = MC(f_k; L) - (1/|S|) Σ_{f_j ∈ S} I(f_k; f_j),  (22)

where I(f_k; f_j) measures the feature redundancy between the candidate feature f_k and each already-selected feature f_j, and the factor 1/|S| balances the magnitude between the Max-Correlation term and the feature redundancy term. Therefore, Equation (22) uses MC(f_k; L) to maximize the feature relevancy between candidate features and the label set, while using the mutual information I(f_k; f_j) to minimize the feature redundancy within the already-selected feature subset, so as to choose features from different semantic groups. The sequential forward search strategy is used in the process of feature selection: at each step, we select the feature f_k that achieves the maximum value of J(f_k) as the next already-selected feature. The pseudo code of MCMFS is presented in Algorithm 1.

Algorithm 1 MCMFS
Input: A training sample D with the full feature set F = {f_1, f_2, ..., f_d} and the label set L = {l_1, l_2, ..., l_q}; the number of selected features b.
Output: The already-selected feature subset S.
1: S ← ∅; a ← 0;
2: for i = 1 to d do
3:   calculate MC(f_i; L) according to Equation (21);
4: end for
5: while a < b do
6:   if a == 0 then
7:     select the feature f_j with the largest MC(f_i; L);
8:   else
9:     for each candidate feature f_i ∈ F do
10:      calculate the mutual information I(f_i; f_j) for each f_j ∈ S;
11:      calculate J(f_i) according to Equation (22);
12:    end for
13:    select the feature f_j with the largest J(f_i);
14:  end if
15:  F ← F \ {f_j}; S ← S ∪ {f_j}; a ← a + 1;
16: end while
17: return S.

The minimal-redundancy-maximum-relevance (mRMR) method [48] is a well-known single-label feature selection method that uses the mutual information between candidate features and class labels to evaluate feature relevance and adopts the same feature redundancy term as our method. The resemblance between mRMR and MCMFS is that both methods consider the relationship between candidate features and already-selected features to minimize feature redundancy. The difference is that mRMR does not consider the effects of label correlations, whereas in multi-label feature selection the proposed MCMFS method employs the Max-Correlation term to consider high-order label correlations.
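Algorithm 1 can be sketched compactly in Python. This is our own illustration, not the authors' implementation: it assumes a helper `mi(x, y)` returning the empirical mutual information between two discrete columns, and a partition `groups` of label indices into semantic groups.

```python
def mcmfs(X_cols, Y_cols, groups, b, mi):
    """Greedy MCMFS sketch: select b feature indices via Equations (21)-(22)."""
    d = len(X_cols)
    # Max-Correlation of each feature: max over groups of its best MI with any
    # label in the group (a max of maxes; the per-group view mirrors Eq. (20)).
    mc = [max(max(mi(X_cols[k], Y_cols[l]) for l in g) for g in groups)
          for k in range(d)]
    selected, candidates = [], list(range(d))
    while len(selected) < b and candidates:
        if not selected:          # first feature: largest Max-Correlation
            best = max(candidates, key=lambda k: mc[k])
        else:                     # J(f_k) = MC(f_k; L) - (1/|S|) sum_j I(f_k; f_j)
            best = max(candidates,
                       key=lambda k: mc[k] - sum(mi(X_cols[k], X_cols[j])
                                                 for j in selected) / len(selected))
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a duplicated feature column, the redundancy term drives the second pick toward a non-redundant feature, matching the intent of Equation (22).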

Complexity Analysis
We provide the complexity analysis for the MCMFS method and five other information-theoretical-based feature selection methods (D2F, PMU, SCLS, MDMR and LRFS). Suppose that the number of instances is n, the number of features is d and the number of labels is q. Computing the mutual information, conditional mutual information or interaction information requires O(n) time, since all instances need to be visited for probability estimation. Suppose that the number of selected features is b; then the time complexity of MCMFS and SCLS is O(ndq + bnd). The time complexity of D2F and MDMR is O(ndq + bndq). PMU and LRFS design evaluation criteria that consider second-order label correlations: the time complexity of PMU is O(ndq + bndq + ndq²) and that of LRFS is O(ndq² + bnd). Table 1 lists the time complexity of these methods. As shown in Table 1, MCMFS achieves the same time complexity as SCLS, and its time complexity is lower than that of D2F, MDMR, PMU and LRFS. Therefore, the proposed method is more computationally efficient than these four methods.

Experimental Results and Analysis
In this section, we evaluate the classification performance of the proposed MCMFS method and present the experimental results. MCMFS is compared to one embedded method (MIFS [46]), two problem transformation-based methods (PPT + MI [45] and PPT + CHI [41]) and five information-theoretical-based methods (D2F [29], MDMR [31], PMU [32], SCLS [30] and LRFS [33]). First, we introduce the experimental settings and describe the evaluation framework in Figure 3. Second, MCMFS is compared on an artificial data set to the five information-theoretical-based methods that employ the cumulative summation approximation to evaluate candidate features. Finally, the MCMFS method is compared to the eight representative methods on 12 real-world multi-label data sets in terms of four evaluation metrics to verify its effectiveness. All experiments are executed on an Intel Core (TM) i7-6700 with 3.4 GHz processing speed.

Experimental Setting
The experimental setting is as follows. First, the continuous features are discretized into three bins using an equal-width strategy, as recommended in the literature [14,29]. Second, the number of already-selected features b varies from 1 to M with a step size of 1, where M is 20% of the total number of features (M = 17% for the medical data set used in the experiments). Third, we employ MLKNN [47] as the multi-label classifier to evaluate the classification performance of the MCMFS method and the eight compared feature selection methods in terms of Hamming Loss and Zero-One Loss; the number of nearest neighbors K is set to 10. Finally, k-Nearest Neighbors (kNN) and a Liblinear-based Support Vector Machine (SVM) are used to evaluate the classification performance in terms of Macro-F1 and Micro-F1. The kNN is a non-linear neighborhood-based classifier, while the SVM is a linear classifier; we adopt two different classifiers to display the different classification behaviors of these methods. In addition, kNN and SVM are two popular classifiers for information-theoretical-based feature selection methods and are widely applied in the literature [49][50][51][52][53]. Different k values of the kNN classifier appear to have little effect on the classification performance of filter methods [53]; in these references, k is set to 3 as an empirical setting, so we also set k to 3 in this paper. We use the scikit-learn package in Python 2.7 to implement the classifiers. The multi-label data sets used in the experiments are from the Mulan Library [54], where the training set and test set are already separated in the data source. Therefore, as shown in Figure 3, the result of feature selection on the training set is applied to the test set directly.
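The equal-width discretization step can be sketched as follows (our own minimal implementation for illustration, not the authors' preprocessing code):

```python
def equal_width_bins(values, n_bins=3):
    """Discretize a continuous feature column into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:                       # constant feature: single bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # clamp so that the maximum value falls into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

Applied per feature before computing any mutual-information quantities, this yields the three-level discrete features used throughout the experiments.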

Experiment and Analysis on an Artificial Data Set
We apply an artificial data set to visually compare MCMFS to the five information-theoretical-based methods (D2F, LRFS, MDMR, PMU and SCLS) that employ the cumulative summation approximation to evaluate the importance of candidate features. The artificial data set is described in Table 2. We present the feature ranking results and the classification performance obtained by the six feature selection methods in Table 3. Five-fold cross-validation is employed to evaluate the classification performance on the artificial data. The values in bold font represent the best classification performance in Table 3. It can be seen that MCMFS obtains better experimental results in terms of HL, ZOL, Macro-F1 and Micro-F1. According to the feature ranking results, the five compared methods rank certain critical features lower. For example, compared to D2F, LRFS, MDMR and SCLS, the rank of f_8 is higher in MCMFS; in fact, f_8 is the most relevant feature to the label l_5 (f_8 = argmax_{f_i∈F} I(f_i; l_5)). Compared to D2F, LRFS and PMU, the rank of f_3 is higher in MCMFS, where f_3 is the most relevant feature to the label l_4 (f_3 = argmax_{f_i∈F} I(f_i; l_4)). In other words, f_8 and f_3 are critical features of the semantic groups C_3 and C_2, respectively, while f_2 is the key feature of the semantic group C_1, which is selected by most methods. The proposed method accurately finds the key features that belong to different semantic groups.

Experimental Results on the Real-Word Data Sets
The experiments are conducted on 12 real-world multi-label data sets from the Mulan Library [54]. The description of the data sets is presented in Table 4. These data sets contain different numbers of instances, features and labels, and they cover two different application areas: the scene data set is collected for semantic image categorization, while the remaining data sets are widely applied to text categorization. Tables 5 and 6 record the average classification results and standard deviations of the proposed method and the eight compared methods on the 12 data sets in terms of Hamming Loss and Zero-One Loss, respectively. The values in bold font represent the best classification performance achieved by the corresponding method.
In Table 5, MCMFS obtains the best Hamming Loss performance on 11 data sets. The MIFS method provides better results on the Business data set, which means that the label-set decomposition process of MIFS is helpful for feature selection on the Business data set. As shown in Table 6, PPT + CHI obtains better Zero-One Loss performance than the proposed MCMFS method and the other compared methods on the Reference data set, while MCMFS obtains the best Zero-One Loss performance on 11 data sets. On the whole, MCMFS provides better classification performance than the other competitive feature selection methods in terms of Hamming Loss and Zero-One Loss with the MLKNN classifier. Tables 7-10 record the classification performance of the proposed method and the eight compared methods in terms of Macro-F1 and Micro-F1. Tables 7 and 8 present the Macro-F1 metric with the SVM classifier and the 3NN classifier, respectively. As the results indicate, D2F obtains the best Macro-F1 performance on the enron data set with the SVM classifier in Table 7. Our method outperforms the compared methods in terms of Macro-F1 on 11 data sets with the SVM classifier and on all 12 data sets with the 3NN classifier. Tables 9 and 10 show the Micro-F1 performance with the SVM classifier and the 3NN classifier, respectively. Compared to the eight methods, MCMFS obtains the best Micro-F1 performance on 11 data sets with the SVM classifier and, as shown in Table 10, on 9 data sets with the 3NN classifier. Overall, our method achieves the best classification performance in terms of Macro-F1 and Micro-F1 on these data sets with both the SVM and 3NN classifiers.

Table 5. Experimental results of multi-label feature selection methods in terms of Hamming Loss (HL) (mean ± std).

Table 6. Experimental results of multi-label feature selection methods in terms of Zero-One Loss (ZOL) (mean ± std).
[Per-data-set values of Table 6 for the enron, Arts, Business, Education, Entertain, Health and Recreation data sets omitted here; see the original table.]

Observing these results, PPT + CHI provides better classification performance on the Reference data set in terms of Zero-One Loss, Macro-F1 on the SVM and 3NN classifiers, and Micro-F1 on the SVM classifier. The χ² statistic is effective in evaluating the features of the Reference data set after the label set is transformed into a single label by PPT. Among the information-theoretical-based methods, the classification performance of MCMFS is the best, followed by LRFS, MDMR, D2F, PMU and SCLS, which verifies the effectiveness of using the maximum operation instead of the cumulative summation approximation to account for high-order label relationships.
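The distinction between the two F1 variants matters for imbalanced label sets: Macro-F1 averages per-label F1 scores, so rare labels weigh as much as frequent ones, while Micro-F1 pools the true-positive, false-positive and false-negative counts over all labels. A minimal sketch with illustrative matrices:

```python
import numpy as np

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(Y_true, Y_pred):
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)  # per-label true positives
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0)
    macro = float(np.mean([f1(t, p, n) for t, p, n in zip(tp, fp, fn)]))
    micro = f1(tp.sum(), fp.sum(), fn.sum())  # pool counts across labels
    return macro, micro

Y_true = np.array([[1, 0], [1, 1], [0, 1]])
Y_pred = np.array([[1, 0], [0, 1], [0, 1]])
macro, micro = macro_micro_f1(Y_true, Y_pred)
print(macro, micro)  # label F1s are 2/3 and 1.0 -> macro 5/6; pooled -> micro 6/7
```

Both variants are "higher is better," so the best values in Tables 7-10 are the largest ones.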

To clearly show the classification performance of the different feature selection methods, Figures 4-6 present the experimental results on three data sets (Arts, medical and scene). In these figures, the X-axis represents the number of selected features, varied as {1%, 2%, . . . , 20%} of the total number of features ({1%, 2%, . . . , 17%} for the medical data set), and the Y-axis represents the results under the different evaluation criteria. Different colors and shapes indicate different multi-label feature selection methods.
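Such a sweep can be sketched as follows. The ranking criterion here is a simple correlation filter standing in for any of the compared selectors, and the data are synthetic; the names `X`, `Y` and `ranking` are ours, not from the paper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import hamming_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))           # toy data: 200 instances, 100 features
Y = (X[:, :3] > 0).astype(int)            # 3 labels driven by the first 3 features

# Rank features by their strongest absolute correlation with any label
# (a stand-in criterion; the compared methods use their own scores).
corr = np.corrcoef(X.T, Y.T)[:100, 100:]  # (100 features) x (3 labels)
ranking = np.argsort(-np.abs(corr).max(axis=1))

X_tr, X_te, Y_tr, Y_te = X[:150], X[150:], Y[:150], Y[150:]
for pct in (1, 5, 10, 20):
    k = max(1, X.shape[1] * pct // 100)
    idx = ranking[:k]                     # top pct% of the ranked features
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr[:, idx], Y_tr)
    print(f"{pct}% of features: HL = {hamming_loss(Y_te, clf.predict(X_te[:, idx])):.3f}")
```

Evaluating the classifier at each percentage produces one curve per method, which is exactly what Figures 4-6 plot.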
As Figures 4-6 show, MCMFS obtains better classification performance than the other compared feature selection methods. Compared to the five information-theoretical-based methods D2F, MDMR, PMU, SCLS and LRFS, the results demonstrate that the maximum operation is more effective than the cumulative summation approximation. In addition, MCMFS outperforms the other three multi-label feature selection methods, PPT + MI, PPT + CHI and MIFS, on these data sets.
Finally, Table 11 reports the running time of MCMFS and the eight compared methods. PPT + MI and PPT + CHI are the fastest, because they need only one pass over the transformed single label to complete feature selection. Although SCLS and MIFS run faster than our method, the proposed method outperforms both in terms of multiple evaluation criteria. Compared to D2F, MDMR, PMU and LRFS, our method is more computationally efficient, so the running time of MCMFS is generally acceptable. Additionally, Figure 7 presents the minimum and maximum running time of each method on the different data sets, with the X-axis representing the data sets and the Y-axis the running time. To show the faster methods clearly, Figure 7b displays the running time of MCMFS, PPT + MI, PPT + CHI, MIFS and SCLS separately. As Figure 7 shows, PMU is the slowest among all methods and PPT + MI the fastest, while the running time of MCMFS remains acceptable. In addition, the running time of all methods increases with the size of the data sets.
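A timing comparison of this kind can be reproduced with a small wall-clock harness; `time_selector` is our helper name, and the variance-ranking selector is only a trivial stand-in for the actual methods:

```python
import time
import numpy as np

def time_selector(select_fn, X, Y, repeats=3):
    # Median wall-clock time over a few runs smooths out scheduler noise.
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        select_fn(X, Y)
        times.append(time.perf_counter() - start)
    return float(np.median(times))

# Example with a trivial stand-in selector (variance ranking).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))
Y = rng.integers(0, 2, size=(500, 5))
t = time_selector(lambda X, Y: np.argsort(-X.var(axis=0)), X, Y)
print(f"variance ranking: {t:.4f} s")
```

Running each selector through the same harness on each data set yields the kind of per-method, per-data-set timings summarized in Table 11 and Figure 7.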

Conclusions
In this paper, a novel multi-label feature selection method named Multi-label Feature Selection considering Max-Correlation (MCMFS) is proposed. The Max-Correlation (MC) term is designed based on high-order label correlations and the assumption that labels naturally cluster into several groups. The combination of the maximum operation and the feature redundancy term helps select features that are relevant to different label groups.
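To make the idea concrete, the following sketch shows one plausible form of such a criterion: relevance is the maximum mutual information between a candidate feature and any single label (the Max-Correlation idea), penalized by average redundancy with already-selected features. The exact terms and weights used by MCMFS may differ from this illustration; `mc_score` and `greedy_select` are our names, and the discrete toy data are illustrative:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mc_score(f, selected, X, Y):
    # Max-Correlation idea: take the MAXIMUM mutual information between
    # candidate feature f and any single label, instead of summing over labels.
    relevance = max(mutual_info_score(X[:, f], Y[:, l]) for l in range(Y.shape[1]))
    if not selected:
        return relevance
    # Penalize redundancy with already-selected features (illustrative weighting).
    redundancy = np.mean([mutual_info_score(X[:, f], X[:, s]) for s in selected])
    return relevance - redundancy

def greedy_select(X, Y, k):
    # Standard greedy forward selection driven by the score above.
    selected, candidates = [], set(range(X.shape[1]))
    for _ in range(k):
        best = max(candidates, key=lambda f: mc_score(f, selected, X, Y))
        selected.append(best)
        candidates.discard(best)
    return selected

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
Y = np.column_stack([X[:, 0], rng.integers(0, 2, size=100)])  # label 0 copies feature 0
print(greedy_select(X, Y, 2))  # feature 0 should be picked first
```

Because the relevance term is a maximum rather than a sum, a feature that is strongly informative for even a single small label group can win over features weakly related to many labels of a large group.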
To demonstrate the effectiveness of our method, MCMFS is first compared on an artificial data set to five information-theoretical-based multi-label feature selection methods (D2F, MDMR, PMU, SCLS and LRFS), which employ the cumulative summation approximation to select features. Furthermore, MCMFS is compared to eight state-of-the-art multi-label feature selection methods (PPT + MI, PPT + CHI, MIFS, D2F, MDMR, PMU, SCLS and LRFS) using MLKNN on 12 real-world multi-label data sets in terms of Hamming Loss and Zero-One Loss. Additionally, the 3NN and SVM classifiers are used to evaluate the nine feature selection methods in terms of Macro-F1 and Micro-F1. The experimental results demonstrate that MCMFS obtains better classification results than the compared methods and can effectively select a compact feature subset for classification.
Finally, in future work, we intend to explore high-order label correlations and sparse learning for multi-label feature selection. We also intend to develop a method that can automatically determine the appropriate size of the selected feature subset for each data set.