Multi-Label Feature Selection Combining Three Types of Conditional Relevance

With the rapid growth of the Internet, the curse of dimensionality caused by massive multi-label data has attracted extensive attention. Feature selection plays an indispensable role in dimensionality reduction, and many researchers have studied it from an information-theoretic perspective. Here, to evaluate feature relevance, we design a novel feature relevance term (FR) that employs three incremental information terms to comprehensively consider three key aspects: candidate features, selected features, and label correlations. A thorough examination of these three aspects makes FR better suited to capturing the optimal features. Moreover, we employ label-related feature redundancy as the label-related feature redundancy term (LR) to reduce unnecessary redundancy. We therefore propose a multi-label feature selection method that integrates FR with LR, namely, Feature Selection combining three types of Conditional Relevance (TCRFS). Numerous experiments indicate that TCRFS outperforms 6 state-of-the-art multi-label approaches on 13 multi-label benchmark data sets from 4 domains.


Introduction
In recent years, multi-label learning [1][2][3][4] has become increasingly popular in applications such as text categorization [5], image annotation [6], protein function prediction [7], etc. Additionally, feature selection is of great significance to solving industrial application problems. Some researchers monitor the wind speed in the wake region to detect wind farm faults based on feature selection [8]. In signal processing applications, feature selection is effective for chatter vibration diagnosis in CNC machines [9]. Feature selection is adopted to classify cutting stabilities based on the selected features [10]. The most crucial task in diverse multi-label applications is to classify each sample and its corresponding labels accurately. Multi-label learning, like traditional classification approaches, is vulnerable to high dimensionality. The number of features in text multi-label data is frequently in the tens of thousands, which means that there are many redundant or irrelevant features [11,12]. This can easily lead to the "curse of dimensionality", which dramatically increases model complexity and computation time [13]. Feature selection is the process of selecting a feature subset with distinguishing features from the original data set according to specific evaluation criteria. Redundant or irrelevant features can be eliminated to improve model accuracy and reduce feature dimensions, feature space, and running time [14,15]. Simultaneously, the selected features are more conducive to model understanding and data analysis.
In traditional machine learning problems, feature selection approaches include wrapper, embedded, and filter approaches [16][17][18][19]. Among them, wrapper feature selection approaches use classifier performance to weigh the pros and cons of a feature subset, which has high computational complexity and a large memory footprint [20,21]. The processes of feature selection and learner training are combined in embedded approaches [22,23]: feature selection is automatically conducted during the learner training procedure, since the two are completed in the same optimization procedure. Filter feature selection approaches weigh the pros and cons of feature subsets using specific evaluation criteria [24,25]. They are independent of the classifier, and the calculation is fast and straightforward. As a result, filter approaches are generally used for feature selection.
The above-mentioned three feature selection approaches also exist in multi-label feature selection, with filter feature selection being the most popular. Information theory is a standard mathematical tool for filter feature selection [26]. Based on information theory, this paper mainly focuses on three key aspects that affect feature relevance: candidate features, selected features, and label correlations. The method proposed in this paper examines the amount of information shared between the selected feature subset and the total label set to evaluate feature relevance and denotes it as ∆I for the time being. Once any candidate feature is added to the current selected feature subset, the subset is updated, and ∆I is altered accordingly. Moreover, the original label correlations in the total label set also affect ∆I as new candidate features are added to the current selected feature subset. Hence, three incremental information terms that combine candidate features, selected features, and label correlations are used to design a novel feature relevance term. Furthermore, we employ label-related feature redundancy as the feature redundancy term to reduce unnecessary redundancy. Table 1 lists the three abbreviations mentioned above and their meanings; we explain them in detail in Section 4. The major contributions of this paper are as follows:
1. We analyze and discuss the indispensability of the three key aspects (candidate features, selected features, and label correlations) for feature relevance evaluation;
2. Three incremental information terms taking the three key aspects into account are used to express three types of conditional relevance; FR is then designed by combining these three incremental information terms;
3. A multi-label feature selection method that integrates FR with LR, namely TCRFS, is proposed;
4. TCRFS is compared to 6 state-of-the-art multi-label feature selection methods on 13 benchmark multi-label data sets using 4 evaluation criteria, and its efficacy is certified in numerous experiments.
The rest of this paper is structured as follows. Section 2 introduces the preliminary theoretical knowledge of this paper: information theory and the four evaluation criteria used in our experiments. Related works are reviewed in Section 3. Section 4 combines three types of conditional relevance to design FR and proposes TCRFS, which integrates FR with LR. The efficacy of TCRFS is proven by comparing it with 6 multi-label methods on 13 benchmark data sets applying 4 evaluation criteria in Section 5. Section 6 concludes our work in this paper.

Information Theory for Multi-Label Feature Selection
Information theory is a popular and effective means to tackle the problem of multi-label feature selection [27][28][29]. It is used to measure the correlation between random variables [30], and its fundamentals are covered in this subsection.
Assume that the selected feature subset S = { f 1 , f 2 , ..., f n } and the label set L = {l 1 , l 2 , ..., l m }. To convey feature relevance, we typically employ I(S; L), the mutual information between the selected feature subset and the total label set. Mutual information is a measure in information theory; it can be seen as the amount of information contained in one random variable about another random variable. Assume two discrete random variables X = {x 1 , x 2 , ..., x n } and Y = {y 1 , y 2 , ..., y m }; then the mutual information between X and Y can be represented as I(X; Y). Its expansion formula is as follows:

I(X; Y) = H(X) − H(X|Y), (1)

where H(X) denotes the information entropy of X, and H(X|Y) denotes the conditional entropy of X given Y. Information entropy is a concept used to measure the amount of information in information theory. H(X) is defined as:

H(X) = −Σ_{i=1}^{n} p(x_i) log_2 p(x_i), (2)

where p(x_i) represents the probability distribution of x_i, and the base of the logarithm is 2.
The conditional entropy H(X|Y) is defined as the mathematical expectation, over Y, of the entropy of the conditional probability distribution of X under the given condition Y:

H(X|Y) = −Σ_i Σ_j p(x_i, y_j) log_2 p(x_i | y_j), (3)

where p(x_i, y_j) and p(x_i | y_j) represent the joint probability distribution of (x_i, y_j) and the conditional probability distribution of x_i given y_j, respectively. H(X|Y) can also be represented as follows:

H(X|Y) = H(X, Y) − H(Y), (4)

where H(X, Y) is another measure in information theory, namely, the joint entropy. Its definition is as follows:

H(X, Y) = −Σ_i Σ_j p(x_i, y_j) log_2 p(x_i, y_j), (5)

According to Equation (4), combining the relationship between the three different measures of the amount of information, the mutual information I(X; Y) can alternatively be written as follows:

I(X; Y) = H(X) + H(Y) − H(X, Y), (6)

It is common in multi-label feature selection to have more than two random variables; assume another discrete random variable Z = {z 1 , z 2 , ..., z q }. The conditional mutual information I(X; Y|Z) expresses the expected value of the mutual information of the two discrete random variables X and Y given the value of the third discrete variable Z. It is represented as follows:

I(X; Y|Z) = I(X, Z; Y) − I(Y; Z), (7)

where I(X, Z; Y) is the joint mutual information and I(X; Y; Z) is the interaction information. Their expansion formulas are as follows:

I(X, Z; Y) = I(X; Y|Z) + I(Y; Z) = I(Y; Z|X) + I(X; Y), (8)
I(X; Y; Z) = I(X; Y) − I(X; Y|Z). (9)
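As a concrete illustration (our own minimal estimator, not code from the paper), the quantities above can be estimated from empirical frequencies; the toy columns below are purely illustrative:

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) = -sum_i p(x_i) log2 p(x_i), from sample frequencies."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def joint_entropy(x, y):
    """H(X, Y): entropy of the empirical joint distribution."""
    return entropy(list(zip(x, y)))

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), as in Equation (6)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)

x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 0, 0]   # copies x except one flip: informative
z = [0, 0, 0, 0, 1, 1, 1, 1]   # exactly independent of x: I(X; Z) = 0
print(mutual_information(x, y))  # positive, about 0.55 bits
print(mutual_information(x, z))  # 0.0 bits
```

A variable that nearly copies X shares substantial information with it, while an independent one shares none, which is the intuition the later relevance terms build on.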

Evaluation Criteria for Multi-Label Feature Selection
In our experiments, we employ four distinct evaluation criteria to confirm the efficacy of TCRFS. The four evaluation criteria fall into two categories: label-based evaluation criteria and example-based evaluation criteria [31]. The label-based evaluation criteria include Macro-F 1 and Micro-F 1 [32]. The higher the values of these two indicators, the better the classification effect. Macro-F 1 first calculates the F 1 -score of each of the q categories and then averages them as follows:

Macro-F 1 = (1/q) Σ_{i=1}^{q} 2TP_i / (2TP_i + FP_i + FN_i), (10)

where TP_i, FP_i, and FN_i represent the true positives, false positives, and false negatives in the i-th category, respectively. Micro-F 1 calculates the confusion matrix of each category, adds the confusion matrices to obtain a multi-category confusion matrix, and then calculates the F 1 -score as follows:

Micro-F 1 = 2 Σ_{i=1}^{q} TP_i / Σ_{i=1}^{q} (2TP_i + FP_i + FN_i), (11)

The example-based evaluation criteria include the Hamming Loss (HL) and Zero One Loss (ZOL) [33]. The lower the values of these two indicators, the better the classification effect. HL is a metric for the number of times a label is misclassified; that is, a label belonging to a sample is not predicted, or a label not belonging to the sample is predicted to belong to it. Suppose that D = {(x_i, Y_i) | 1 ≤ i ≤ m} is a label test set and Y_i ⊆ Y is the set of class labels corresponding to x_i, where Y is the label space with q categories. The definition of HL is as follows:

HL = (1/m) Σ_{i=1}^{m} (1/q) |Y_i ⊕ Ŷ_i|, (12)

where ⊕ means the XOR operation and Ŷ_i denotes the predicted label set corresponding to x_i. The other example-based criterion, ZOL, is defined as follows:

ZOL = (1/m) Σ_{i=1}^{m} δ(Y_i, Ŷ_i), (13)

where δ = 1 if the predicted label subset and the true label subset do not match exactly, and δ = 0 if there is no error.
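The four criteria can be sketched in a few lines of numpy (a hedged illustration; the matrix names and toy values are ours, with labels encoded as m × q binary indicator matrices):

```python
import numpy as np

def macro_f1(Y_true, Y_pred):
    """Per-label F1 averaged over the q labels."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0)
    denom = 2 * tp + fp + fn
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f1.mean()

def micro_f1(Y_true, Y_pred):
    """F1 computed on the pooled multi-category confusion counts."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum()
    fp = ((Y_true == 0) & (Y_pred == 1)).sum()
    fn = ((Y_true == 1) & (Y_pred == 0)).sum()
    return 2 * tp / (2 * tp + fp + fn)

def hamming_loss(Y_true, Y_pred):
    # fraction of label slots where prediction and truth disagree (XOR)
    return np.mean(Y_true != Y_pred)

def zero_one_loss(Y_true, Y_pred):
    # delta = 1 whenever the predicted label set differs from the true set
    return np.mean(np.any(Y_true != Y_pred, axis=1))

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(hamming_loss(Y_true, Y_pred))   # 1 wrong slot out of 6
print(zero_one_loss(Y_true, Y_pred))  # 1 mismatching row out of 2
```

Note how the same single error costs 1/6 under HL but a full 0.5 under ZOL, which is why the two example-based criteria complement each other.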

Related Work
There have been many multi-label learning algorithms proposed so far. These algorithms can be divided into problem transformation and algorithm adaptation [34,35]. Problem transformation converts multi-label learning into traditional single-label learning, as in Binary Relevance (BR) [36], Pruned Problem Transformation (PPT) [37], and Label Powerset (LP) [38]. BR treats the prediction of each label as an independent single classification issue and trains an individual classifier for each label with all of the training data [33]. However, it ignores the relationships between the labels, so it is possible to end up with imbalanced data. PPT removes the labels with a low frequency by considering the label sets with a predetermined minimum number of occurrences. However, this irreversible conversion results in the loss of class information [39].
In contrast to problem transformation, algorithm adaptation directly enhances existing single-label learning algorithms to adapt to multi-label data processing, which alleviates the issues caused by problem transformation. Cai et al. [40] propose Robust and Pragmatic Multi-class Feature Selection (RALM-FS) based on an augmented Lagrangian method, where there is just one ℓ2,1-norm loss term in RALM-FS, with an explicit ℓ2,0-norm equality constraint. Lee and Kim [41] propose the D2F method, which makes use of interaction information based on mutual information. It is capable of measuring multiple-variable dependencies by default, and its definition is as follows:

J(f_k) = Σ_{l_i∈L} I(f_k; l_i) − Σ_{f_j∈S} Σ_{l_i∈L} I(f_k; f_j; l_i), (14)

where the first and second sums are regarded as the feature relevance term and the feature redundancy term, respectively. The feature relevance of D2F only considers the candidate features, ignoring selected features and label correlations. Lee and Kim [42] propose the Pairwise Multi-label Utility (PMU), which is derived from I(S; L) as follows:

J(f_k) = Σ_{l_i∈L} I(f_k; l_i) − Σ_{f_j∈S} Σ_{l_i∈L} I(f_k; f_j; l_i) − Σ_{l_i∈L} Σ_{l_j∈L} I(f_k; l_i; l_j), (15)

where the interaction-information sums measure the feature redundancy. Afterward, Lee and Kim [43] propose multi-label feature selection based on a scalable criterion for large-scale data (SCLS). SCLS uses a scalable relevance evaluation approach to assess conditional relevance more correctly:

J(f_k) = Σ_{l_i∈L} I(f_k; l_i) − Σ_{f_j∈S} (I(f_k; f_j) / H(f_k)) Σ_{l_i∈L} I(f_k; l_i), (16)

In fact, the scalable relevance in SCLS considers both candidate features and selected features but ignores label correlations. Liu et al. [44] propose feature selection for multi-label learning with streaming labels (FSSL), in which label-specific features are learned for each newly received label and then fused for all currently received labels. Lin et al. [45] apply a multi-label feature selection method based on fuzzy mutual information (MUCO) to the redundancy and correlation analysis strategies. The next feature that enters S can be selected by the following:

J(f_k) = FMI(f_k; L) − (1/|S|) Σ_{f_j∈S} FMI(f_k; f_j), (17)

where FMI(f_k; L) denotes the fuzzy mutual information.
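All of the information-theoretic criteria surveyed above share the same forward greedy skeleton: at each step, add the candidate feature maximizing a score J that trades relevance against redundancy with the already-selected features. A minimal sketch of that skeleton (the score function and numbers below are purely illustrative, not any specific method):

```python
def greedy_select(features, score_fn, k):
    """Forward greedy selection: repeatedly add the best-scoring candidate."""
    selected = []
    candidates = list(features)
    for _ in range(k):
        best = max(candidates, key=lambda f: score_fn(f, selected))
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy criterion: a pre-assigned relevance minus a 0.6 penalty per
# already-selected "conflicting" feature (illustrative numbers only).
relevance = {"f0": 0.9, "f1": 0.8, "f2": 0.3}
conflict = {("f0", "f1"), ("f1", "f0")}

def toy_score(f, selected):
    return relevance[f] - 0.6 * sum((f, s) in conflict for s in selected)

print(greedy_select(["f0", "f1", "f2"], toy_score, 2))  # ['f0', 'f2']
```

Although f1 is the second most relevant feature in isolation, its redundancy with f0 demotes it below f2 once f0 is selected, which is exactly the behavior the redundancy terms above aim for.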
When we try to add a new candidate feature f k to the current selected feature subset S, the feature f k , the selected features f j in S, and the label correlations in the total label set all impact feature relevance. To this end, FR is devised by merging the three types of conditional relevance, and a multi-label feature selection method, TCRFS, that integrates FR with LR is proposed.

TCRFS: Feature Selection Combining Three Types of Conditional Relevance
Past multi-label feature selection methods do not take all three key aspects influencing feature relevance into account; that is, the key aspects that influence feature relevance are not comprehensively examined. Here, we utilize three incremental information terms to depict three types of conditional relevance that consider candidate features, selected features, and label correlations comprehensively. The reasons for our consideration are as follows. We evaluate each candidate feature according to specific criteria.

Candidate Features

When a candidate feature f k attempts to enter the current selected feature subset S as a new selected feature to generate a new selected feature subset, it affects the amount of information provided by the current selected feature subset to the label set. The influence of candidate features is represented by a Venn diagram, as shown in Figure 1. In Figure 1, we assume that f k 1 and f k 2 are two candidate features, f j is a selected feature in S, and l i is a label in the total label set L. f k 1 is irrelevant to f j , and f k 2 is redundant with f j . The amount of information provided by f j to l i is the mutual information I( f j ; l i ), that is, the area {2, 3}. If f k 1 is selected, then the amount of information provided by f j to l i will be I( f j ; l i | f k 1 ), which still corresponds to the area {2, 3}. If f k 2 is selected, then the amount of information provided by f j to l i will be I( f j ; l i | f k 2 ), which corresponds to the area {2}. Since the area {2} is less than the area {2, 3}, I( f j ; l i | f k 2 ) < I( f j ; l i | f k 1 ). Therefore, the higher the label-related redundancy between the candidate feature and the selected features in the current selected feature subset, the more the amount of information between f j and l i is reduced. In other words, the label-related redundancy between the candidate feature and the selected features should be kept to a minimum.
From this point of view, f k 1 takes precedence over f k 2 .
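The Figure 1 intuition can be checked numerically. In the toy columns below (all names and data are our own illustration), conditioning on a candidate that duplicates f j wipes out the information f j carries about the label, while conditioning on an irrelevant candidate leaves it unchanged:

```python
import numpy as np
from collections import Counter

def H(*vars_):
    """Empirical joint entropy of one or more aligned columns."""
    cols = list(zip(*vars_))
    n = len(cols)
    return -sum((c / n) * np.log2(c / n) for c in Counter(cols).values())

def cmi(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

lab  = [0, 0, 1, 1, 0, 1, 0, 1]     # label l_i
f_j  = lab[:]                        # selected feature, fully informative
f_k2 = f_j[:]                        # candidate redundant with f_j
f_k1 = [0, 0, 0, 0, 1, 1, 1, 1]      # candidate independent of f_j and lab

print(cmi(f_j, lab, f_k2))  # 0.0: the area {2} shrinks to nothing
print(cmi(f_j, lab, f_k1))  # 1.0: equals I(f_j; lab), unchanged
```

Here I( f j ; l i | f k 2 ) collapses to zero while I( f j ; l i | f k 1 ) stays at the full one bit, matching the area comparison in the text.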

Selected Features
The influence of selected features is represented by a Venn diagram, as shown in Figure 2. As shown in Figure 2, f k 1 and f k 2 are both redundant with f j . Without considering selected features, the information that f k 1 and f k 2 share with the label l i is I( f k 1 ; l i ) and I( f k 2 ; l i ), respectively. The area {1, 2} denotes I( f k 1 ; l i ), and the area {5, 6} denotes I( f k 2 ; l i ). We assume that the area {1, 2} is less than the area {5, 6}, the area {2} is less than the area {5}, but the area {1} is larger than the area {6}. With the selected features taken into account, the information shared by f k 1 and l i is I( f k 1 ; l i | f j ) (i.e., the area {1}), and the information shared by f k 2 and l i is I( f k 2 ; l i | f j ) (i.e., the area {6}). Hence, I( f k 1 ; l i ) < I( f k 2 ; l i ) but I( f k 1 ; l i | f j ) > I( f k 2 ; l i | f j ), so f k 1 takes precedence over f k 2 . There are two causes for this situation: the first is that the amount of information provided to l i by f k 2 itself is insufficient once f j is known, and the second is that the label-related redundancy between f k 2 and f j is excessive. Now, in the hypothesis, replace the condition that the area {1} is larger than the area {6} with the condition that the area {1} is less than the area {6}, and we obtain the opposite result: I( f k 1 ; l i | f j ) < I( f k 2 ; l i | f j ), so f k 2 takes precedence. Therefore, considering the influence of the selected features on feature relevance is necessary.

Label Correlations
If the labels are independent, they have no influence on the amount of information between candidate features and each label. The influence of label correlations is represented by a Venn diagram, as shown in Figure 3. In Figure 3, l i and l j are two redundant labels; that is, there exists a correlation between l i and l j . Without the consideration of label correlations, the amount of information provided to l i by f k 1 is I( f k 1 ; l i ) (the area {1, 2}) and the amount of information provided to l i by f k 2 is I( f k 2 ; l i ) (the area {4, 5}). Then, while taking label correlations into consideration, the amount of information provided to l i by f k 1 is I( f k 1 ; l i |l j ) (the area {1}) and the amount of information provided to l i by f k 2 is I( f k 2 ; l i |l j ) (the area {4}). Now, provide the first hypothesis: the area {1, 2} is larger than the area {4, 5}, the area {2} is larger than the area {5}, but the area {1} is less than the area {4}. Hence, I( f k 1 ; l i ) > I( f k 2 ; l i ) but I( f k 1 ; l i |l j ) < I( f k 2 ; l i |l j ). The second hypothesis modifies the last condition in the first hypothesis: the area {1} is larger than the area {4}. Hence, I( f k 1 ; l i ) > I( f k 2 ; l i ) and I( f k 1 ; l i |l j ) > I( f k 2 ; l i |l j ). We call the area {2} and the area {5} feature-related label redundancy. Therefore, both the original amount of information between candidate features and labels and the feature-related label redundancy can affect the selection of features. Merely using the accumulation of mutual information as the feature relevance will cause the redundant recalculation of feature-related label redundancy.
As described above, the three key aspects of feature relevance are indispensable. As a result, we devise FR as the feature relevance term of TCRFS.

Definitions of FR and LR
Regarding the feature relevance evaluation, we distinguish the importance of features based on the closeness of the relationship between features and labels. According to Section 4.1, candidate features, selected features, and label correlations are three key aspects in evaluating feature relevance. In order to perform better in multi-label classification, we utilize three types of conditional relevance (I( f k ; l i | f j ), I( f j ; l i | f k ), and I( f k ; l i |l j )) to represent the feature relevance term in the proposed method. By using three incremental information terms to summarize the three key aspects of feature relevance, FR is devised. The three incremental information terms represent the three respective types of conditional relevance.
Definition 1. (FR). Assume that F = { f 1 , f 2 , ..., f n } and L = {l 1 , l 2 , ..., l m } are the total feature set and the total label set, respectively. Let S be the selected feature set excluding candidate features, that is, f k ∈ F − S. FR is depicted as follows:

FR(f_k) = Σ_{f_j∈S} Σ_{l_i∈L} [I(f_k; l_i | f_j) + I(f_j; l_i | f_k)] + Σ_{l_i∈L} Σ_{l_j∈L, l_j≠l_i} I(f_k; l_i | l_j), (18)

The comprehensive evaluation of the above-mentioned three key aspects of feature relevance is more conducive to capturing the optimal features. Furthermore, FR can be expanded as follows:

FR(f_k) = Σ_{f_j∈S} Σ_{l_i∈L} [I(f_k; l_i) + I(f_j; l_i) − 2I(f_k; f_j; l_i)] + Σ_{l_i∈L} Σ_{l_j∈L, l_j≠l_i} [I(f_k, l_j; l_i) − I(l_i; l_j)], (19)

where I( f j ; l i ) and I(l i ; l j ) are considered to be two constants in feature selection.
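Under one plausible empirical reading (our own estimator and toy data, not the authors' code), FR accumulates the three conditional mutual information terms over selected features and over ordered label pairs:

```python
import numpy as np
from collections import Counter

def H(*vars_):
    """Empirical joint entropy of one or more aligned columns."""
    cols = list(zip(*vars_))
    n = len(cols)
    return -sum((c / n) * np.log2(c / n) for c in Counter(cols).values())

def cmi(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def FR(f_k, S, L):
    """S: list of selected feature columns; L: list of label columns."""
    rel = sum(cmi(f_k, l_i, f_j) + cmi(f_j, l_i, f_k)
              for f_j in S for l_i in L)
    rel += sum(cmi(f_k, l_i, l_j)
               for i, l_i in enumerate(L)
               for j, l_j in enumerate(L) if i != j)
    return rel

# illustrative columns: two partially correlated labels, one selected
# feature, and a candidate that copies the first label
l1  = [0, 0, 1, 1, 0, 1, 0, 1]
l2  = [0, 1, 1, 1, 0, 1, 0, 0]
f_j = [0, 0, 0, 1, 0, 1, 1, 1]
f_k = l1[:]
print(FR(f_k, [f_j], [l1, l2]))  # a nonnegative sum of conditional MIs
```

Since every summand is a conditional mutual information, the score is nonnegative by construction; larger values indicate candidates that add information about the labels beyond what S and the other labels already provide.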

Definition 2. (LR).
In the initial analysis of the three key aspects of feature relevance, it was mentioned that the label-related feature redundancy is repeatedly calculated in previous methods, which impacts the capture of the optimal features. Here, LR is devised as follows:

LR(f_k) = Σ_{f_j∈S} Σ_{l_i∈L} I(f_k; f_j; l_i), (20)

As indicated in Table 2, we have compiled the feature relevance terms and feature redundancy terms of TCRFS and the contrasted methods based on information theory. Table 2. Feature relevance terms and feature redundancy terms of multi-label feature selection methods.

Methods | Feature Relevance Terms | Feature Redundancy Terms
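The interaction information that LR accumulates can be estimated empirically; the sketch below (our own estimator and toy columns, not the authors' code) shows LR rewarding a novel candidate over a duplicated one:

```python
import numpy as np
from collections import Counter

def H(*vars_):
    """Empirical joint entropy of one or more aligned columns."""
    cols = list(zip(*vars_))
    n = len(cols)
    return -sum((c / n) * np.log2(c / n) for c in Counter(cols).values())

def interaction(x, y, z):
    """Interaction information I(X; Y; Z) = I(X; Y) - I(X; Y | Z)."""
    mi_xy = H(x) + H(y) - H(x, y)
    cmi_xy_z = H(x, z) + H(y, z) - H(x, y, z) - H(z)
    return mi_xy - cmi_xy_z

def LR(f_k, S, L):
    """Label-related feature redundancy of candidate f_k."""
    return sum(interaction(f_k, f_j, l_i) for f_j in S for l_i in L)

lab   = [0, 0, 1, 1, 0, 1, 0, 1]
f_j   = lab[:]                      # selected feature carrying the label
f_dup = f_j[:]                      # candidate duplicating f_j: maximal LR
f_new = [0, 0, 0, 0, 1, 1, 1, 1]    # candidate independent of f_j and lab
print(LR(f_dup, [f_j], [lab]))  # 1.0
print(LR(f_new, [f_j], [lab]))  # 0.0
```

A candidate that duplicates a selected feature incurs one full bit of label-related redundancy per label here, while the independent candidate incurs none, so subtracting LR in the final criterion penalizes exactly the overlap discussed in Figure 1.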

Proposed Method
We designed FR and LR to analyze and discuss feature relevance and feature redundancy, respectively, in Section 4.2.1. Subsequently, TCRFS, a multi-label feature selection method that integrates FR with LR, is proposed. The definition of TCRFS is as follows:

J(f_k) = (1/(|L||S|)) Σ_{f_j∈S} Σ_{l_i∈L} [I(f_k; l_i | f_j) + I(f_j; l_i | f_k) − I(f_k; f_j; l_i)] + (1/(|L|(|L|−1))) Σ_{l_i∈L} Σ_{l_j∈L, l_j≠l_i} I(f_k; l_i | l_j), (21)

where |L| and |S| represent the sizes of the total label set and the selected subset, respectively, and their reciprocals are 1/|L| and 1/|S|. The feature relevance term and the feature redundancy term can be balanced using the two balance parameters 1/(|L||S|) and 1/(|L|(|L|−1)). According to Formula (19), Formula (21) can be rewritten as follows:

J(f_k) = (1/(|L||S|)) Σ_{f_j∈S} Σ_{l_i∈L} [I(f_k; l_i) + I(f_j; l_i) − 3I(f_k; f_j; l_i)] + (1/(|L|(|L|−1))) Σ_{l_i∈L} Σ_{l_j∈L, l_j≠l_i} [I(f_k, l_j; l_i) − I(l_i; l_j)], (22)

The pseudo-code of TCRFS (Algorithm 1) is as follows: Algorithm 1. TCRFS.

Input:
A training sample D with a full feature set F = { f 1 , f 2 , ..., f n } and the label set L = {l 1 , l 2 , ..., l m }; User-specified threshold K.

Output:
The selected feature subset S.
1: S ← ∅;
2: k ← 0;
3: for each f i ∈ F do
4:   add f i to the candidate set;
5: end for
6: if k = 0 then
7:   for each candidate feature f i ∈ F do
8:     calculate Σ_{l i ∈L} Σ_{l j ∈L, l j ≠l i } I( f i ; l i |l j );
9:   end for
10:  select the feature f j with the largest value, S ← S ∪ { f j }, F ← F − { f j };
11:  k ← k + 1;
12: end if
13: for each candidate feature f i ∈ F do
14:   calculate J( f i ) according to Formula (21);
15: end for
16: select the feature f j with the largest J( f i );
17: S ← S ∪ { f j }, F ← F − { f j };
18: k ← k + 1;
19: if k < K, go to line 13;
20: return S.
First, in lines 1-5, the selected feature subset S and the number of selected features k in the proposed method are initialized. To capture the initial feature, we calculate the incremental information I( f i ; l i |l j ) and select the first feature (lines 6-12). Then, until the procedure is complete, the following features are calculated and captured (lines 13-20).
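Under our reading of the criterion (estimators, toy data, and names below are ours, not the authors' code), the greedy loop scores each candidate by the three conditional-relevance terms minus the interaction-information redundancy, weighted by the 1/(|L||S|) and 1/(|L|(|L|−1)) balance parameters stated in the text:

```python
import numpy as np
from collections import Counter

def H(*vars_):
    """Empirical joint entropy of one or more aligned columns."""
    cols = list(zip(*vars_))
    n = len(cols)
    return -sum((c / n) * np.log2(c / n) for c in Counter(cols).values())

def cmi(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def ii(x, y, z):
    """Interaction information I(X; Y; Z) = I(X; Y) - I(X; Y | Z)."""
    return (H(x) + H(y) - H(x, y)) - cmi(x, y, z)

def J(f_k, S, L):
    """Relevance minus redundancy with the two balance factors."""
    q, s = len(L), max(len(S), 1)
    rel = sum(cmi(f_k, l_i, f_j) + cmi(f_j, l_i, f_k) - ii(f_k, f_j, l_i)
              for f_j in S for l_i in L) / (q * s)
    lbl = sum(cmi(f_k, l_i, l_j)
              for i, l_i in enumerate(L)
              for j, l_j in enumerate(L) if i != j)
    return rel + lbl / (q * max(q - 1, 1))

def tcrfs(features, L, K):
    """features: dict name -> column; greedily returns K feature names."""
    S, chosen = [], []
    cand = dict(features)
    for _ in range(K):
        best = max(cand, key=lambda name: J(cand[name], S, L))
        S.append(cand.pop(best))
        chosen.append(best)
    return chosen

l1 = [0, 0, 1, 1, 0, 0, 1, 1]
l2 = [0, 0, 1, 0, 0, 0, 1, 0]          # partially redundant with l1
f_good = l1[:]                          # informative candidate
f_noise = [0, 0, 0, 0, 1, 1, 1, 1]      # independent of (l1, l2)
print(tcrfs({"f_good": f_good, "f_noise": f_noise}, [l1, l2], 2))
```

With S empty, only the label-correlation term contributes, mirroring the first-feature step in lines 6-12; the informative candidate is selected first and the noise feature last.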

Time Complexity
Time complexity is also one of the criteria for evaluating the pros and cons of methods. The time complexity of each contrasted method and TCRFS is computed here. Assume that there are n, p, and q instances, features, and labels, respectively. The computational complexity of mutual information and conditional mutual information is O(n), since all instances have to be visited to estimate probabilities. Each iteration of RALM-FS requires O(p^3). Assume that k denotes the number of selected features. The time complexity of TCRFS is O(npq^2 + knpq), as three incremental information terms and one label-related feature redundancy term are calculated. Similarly, D2F, PMU, and SCLS have time complexities of O(npq + knpq), O(npq + knpq + npq^2), and O(npq + knp), respectively. FSSL has a time complexity of O(knpq). The time complexity of MUCO is O(n^2 + p(p − k)), since it constructs a fuzzy matrix and performs an incremental search.

Experimental Evaluation
To demonstrate the efficacy of TCRFS, we compare it with 6 advanced multi-label feature selection approaches (RALM-FS [40], D2F [41], PMU [42], SCLS [43], FSSL [44], and MUCO [45]) on 13 benchmark data sets in this section. We conduct numerous experiments based on four different criteria using three classifiers: Support Vector Machine (SVM), 3-Nearest Neighbor (3NN), and Multi-Label k-Nearest Neighbor (ML-kNN) [46,47]. The 13 multi-label benchmark data sets utilized in the experiments are depicted first. Following that, the findings of the experiments are discussed and examined. The four evaluation metrics employed in the experiments were presented in Section 2.2. The overall experimental framework is depicted in Figure 4.

Multi-Label Data Sets
A total of 13 multi-label benchmark data sets from 4 different domains are depicted in Table 3; they are collected from the Mulan repository [48]. Among them, the Birds data set classifies the birds in Audio [49], the Emotions data set is gathered for Music [38], the Genbase and Yeast data sets are primarily concerned with the Biology category [34], and the remaining 9 data sets are categorized as Text. The 13 data sets we chose have an abundant number of instances, which are split into two parts: a training set and a test set [48]. Ueda and Saito [50] attempted to classify real Web pages linked from the "yahoo.com" domain, which comprises 14 top-level categories, each of which is split into many second-level subcategories. They tested 11 of the 14 independent text classification problems by focusing on the second-level categories. For each problem, the training set includes 2000 documents and the test set includes 3000 documents, as in the Arts and Health data sets, and so on [51]. The number of labels and the number of features both vary substantially. Previous research demonstrates that maintaining 10% of the features results in no loss, while retaining 1% of the features results in a slight loss, depending on document frequency [3]. For example, the Arts and Social data sets have more than 20,000 and 50,000 features, respectively, and they retain about 2% of the features with the highest document frequency. The continuous features of the 13 data sets are discretized into 3 equal-interval bins as indicated in the literature [38,52].

No. | Data Set   | Domain  | Labels | Features | Training | Test | Instances
1   | Birds      | Audio   | 19     | 260      | 322      | 323  | 645
2   | Emotions   | Music   | 6      | 72       | 391      | 202  | 593
3   | Genbase    | Biology | 27     | 1185     | 463      | 199  | 662
4   | Yeast      | Biology | 14     | 103      | 1500     | 917  | 2417
5   | Medical    | Text    | 45     | 1449     | 333      | 645  | 978
6   | Entertain  | Text    | 21     | 640      | 2000     | 3000 | 5000
7   | Recreation | Text    | 22     | 606      | 2000     | 3000 | 5000
8   | Arts       | Text    | 26     | 462      | 2000     | 3000 | 5000
9   | Health     | Text    |        |          |          |      |

The Theoretical Justification of TCRFS on an Artificial Data Set
To further justify the indispensability of the three key aspects (candidate features, selected features, and label correlations) for feature relevance evaluation, we employ an artificial data set to compare the classification performance of five information-theory-based methods (D2F, PMU, SCLS, MUCO, and TCRFS) that use distinct feature relevance terms. With respect to the feature relevance terms, D2F and PMU employ the amount of information between candidate features and labels; SCLS employs a scalable relevance evaluation, which takes feature redundancy into account in feature relevance; MUCO employs fuzzy mutual information; and TCRFS comprehensively considers the three types of conditional relevance we mentioned to design FR. Tables 4 and 5 display the training set and the test set, respectively. Table 6 shows the experimental results and the feature ranking of each approach on the artificial data set. As shown in Table 6, the first feature selected by TCRFS is f 5 . Different from D2F and PMU, f 2 is regarded as the least essential feature. In TCRFS, the rankings of features f 0 , f 8 , and f 4 are higher than their rankings in SCLS, whereas MUCO selects f 4 as the first feature. TCRFS achieves the best classification performance overall. Therefore, TCRFS, which considers the three key aspects (candidate features, selected features, and label correlations), is justified. Table 6. Experimental results on the artificial data set.

Methods | Feature Ranking | SVM | ML-kNN

Analysis and Discussion of the Experimental Findings
The experiments, which run on a 3.70 GHz Intel Core i9-10900K processor with 32 GB of main memory, cover four different evaluation criteria and three classifiers. The proposed method is implemented in Python [53]. Hamming Loss is conducted on the ML-kNN (k = 10) classifier, and the Macro-F 1 and Micro-F 1 measures are conducted on the SVM and 3NN classifiers. The number of selected features on 12 of the data sets is set to {1%, 2%, ..., 20%} of the total number of features with a step size of 1, whereas the number of selected features on the Medical data set is set to {1%, 2%, ..., 17%}. Tables 7-12 present the classification performance of the 6 contrasted approaches and TCRFS on the 13 data sets. The classification performance is expressed as average classification results and standard deviations. The average classification results of each method on all data sets are reported in the row "Average". The best-performing classification results in Tables 7-12 are bolded.
Table 8. Classification performance of each method regarding Micro-F 1 on SVM classifier (mean ± std).
Table 9. Classification performance of each method regarding Macro-F 1 on 3NN classifier (mean ± std).
Table 10. Classification performance of each method regarding Micro-F 1 on 3NN classifier (mean ± std).
Observing Tables 7 and 8, TCRFS delivers the optimum classification performance on the SVM classifier regarding the Macro-F 1 and Micro-F 1 measures, since the higher the values of the two measures, the more superior the classification performance. In Table 9, except for the Yeast data set, TCRFS beats the 6 other contrasted approaches on 12 data sets using the 3NN classifier for Macro-F 1 . TCRFS surpasses the other 6 contrasted approaches on 11 data sets using the 3NN classifier for Micro-F 1 in Table 10. According to the properties of the HL and ZOL measures, lower values of the two measures mean more excellent classification performance.
In Tables 11 and 12, TCRFS exhibits the best performance on 11 data sets on the ML-kNN classifier for the HL and ZOL criteria. In some cases, comprehensive consideration of the three key aspects to assess feature relevance does not achieve the best classification effect. The classification results of D2F take the first position on the Yeast data set regarding Macro-F 1 on the 3NN classifier. PMU and RALM-FS possess the optimal classification performance on the Yeast and Education data sets, respectively. In terms of HL (Table 11), RALM-FS and SCLS surpass the other approaches on the Birds and Emotions data sets, respectively. In terms of ZOL (Table 12), FSSL and D2F surpass the other approaches on the Birds and Emotions data sets, respectively. Despite the fact that D2F, PMU, RALM-FS, SCLS, and FSSL achieve the greatest performance on individual data sets, the overall optimal classification performance still belongs to TCRFS. The average values of each method for the different evaluation criteria are illustrated in Figure 5. The abscissa and the different colored bars represent the different feature selection methods, while the ordinate represents the average value.

Observing the trend of the bar graphs in Figure 5a,b, the Macro-F 1 and Micro-F 1 results achieved on the SVM classifier and the 3NN classifier reach similar classification performance.
The average results of TCRFS in terms of Macro-F1 are roughly 0.2 or above, and its average results in terms of Micro-F1 are roughly 0.4 or above, both clearly greater than the averages of the other approaches. The average result of TCRFS is less than 0.074 in Figure 5c and less than 0.74 in Figure 5d, clearly lower than the averages of the other approaches. Intuitively, TCRFS presents the best average values in terms of all four evaluation criteria. To further examine the classification performance of the seven methods on the data sets, we draw Figures 6-9, which indicate that TCRFS delivers superior classification performance on the Arts, Recreation, Entertain, and Health data sets under the four evaluation criteria. As shown in Figure 6, the classification performance of our method is significantly better than that of the other six contrasted methods. On the Recreation data set (Figure 7), classification performance is not constantly improved by increasing the number of selected features; for example, TCRFS obtains its best results on the ZOL measure when the number of selected features is set at 8% or 11% of the total number of features. On the Entertain data set (Figure 8), TCRFS is clearly in the lead in terms of Macro-F1 when the percentage of selected features is larger than 1%, and it also holds significant advantages among the seven methods in terms of HL and ZOL; the proposed method obtains the optimal classification performance on each metric when the percentage of selected features is set to 6%. In Figure 9, our method outperforms the other six contrasted methods on the Health data set under all four metrics.
Although in most cases the performance of the feature selection methods improves as the number of selected features increases, beyond a certain number of features the improvement in classification performance flattens out. When the percentage of features increases to about 16% on the Arts data set (Figure 6a-d) and to about 19% on the Entertain data set (Figure 8a-d), the classification performance has already reached a relatively high level. That is, an optimal feature subset selects a smaller number of features while achieving better classification performance. Some methods appear to match the classification performance of TCRFS in Figures 8d and 9e, but TCRFS is superior on average and better overall. Consequently, it is critical to consider the three types of conditional relevance for multi-label feature selection.
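For concreteness, the four evaluation criteria discussed above can be computed from scratch on a toy label matrix. The following sketch uses only their standard definitions; the example matrices are illustrative and are not taken from the paper's data sets.

```python
def macro_f1(y_true, y_pred):
    """Mean of per-label F1 scores (each label weighted equally)."""
    n_labels = len(y_true[0])
    f1s = []
    for j in range(n_labels):
        tp = sum(t[j] and p[j] for t, p in zip(y_true, y_pred))
        fp = sum((not t[j]) and p[j] for t, p in zip(y_true, y_pred))
        fn = sum(t[j] and (not p[j]) for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return sum(f1s) / n_labels

def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled across all labels."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        for tj, pj in zip(t, p):
            tp += tj and pj
            fp += (not tj) and pj
            fn += tj and (not pj)
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def hamming_loss(y_true, y_pred):
    """Fraction of individual label assignments that are wrong (lower is better)."""
    errors = sum(tj != pj for t, p in zip(y_true, y_pred) for tj, pj in zip(t, p))
    return errors / (len(y_true) * len(y_true[0]))

def zero_one_loss(y_true, y_pred):
    """Fraction of samples whose predicted label set is not an exact match."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy binary label matrices: rows are samples, columns are labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

On this toy example, Macro-F1 averages the per-label F1 scores (1, 2/3, 2/3), whereas Micro-F1 pools the counts first, which is why the two measures generally differ on imbalanced label sets.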
We build the final feature subset by starting from an empty subset and adding one feature after each evaluation round of the proposed method. According to the TCRFS evaluation function, each candidate feature is scored and the candidates are ranked. Because TCRFS uses three incremental information terms as the evaluation criteria for feature relevance, the incremental information of the remaining candidate features changes after every selection, so their scores must be recalculated. Therefore, the better classification performance comes at the cost of more computation time.
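The forward greedy procedure just described can be sketched as follows. Note that `greedy_forward_selection` and the toy `score_fn` below are illustrative assumptions, not the paper's implementation: in TCRFS the score function would combine the FR and LR terms, and re-scoring inside the loop is exactly the recalculation step that drives the extra running time.

```python
def greedy_forward_selection(features, k, score_fn):
    """Greedy forward search: start from an empty subset and, at each step,
    re-score every remaining candidate against the current selection
    (the incremental terms change after each pick), then add the best one."""
    selected = []
    candidates = list(features)
    while candidates and len(selected) < k:
        # Re-evaluate all remaining candidates given the selected subset.
        scores = {f: score_fn(f, selected) for f in candidates}
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy score: reward the feature's own value, penalize similarity (here,
# shared parity) with already-selected features, mimicking relevance - redundancy.
toy_score = lambda f, sel: f - 2 * sum(1 for s in sel if s % 2 == f % 2)
subset = greedy_forward_selection([4, 3, 2, 1], 3, toy_score)  # -> [4, 3, 2]
```

Because the score of each remaining candidate depends on the subset selected so far, selecting k of n features costs O(k * n) score evaluations rather than a single ranking pass, which matches the time cost noted above.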

Conclusions
In this paper, TCRFS, which combines FR and LR, is proposed to capture the optimal selected feature subset. FR fuses three incremental information terms that take the three key aspects into consideration to convey three types of conditional relevance. TCRFS is then compared with 1 embedded approach (RALM-FS) and 5 information-theory-based approaches (D2F, PMU, SCLS, FSSL, and MUCO) on 13 multi-label benchmark data sets to demonstrate its efficacy. The classification performance of the seven multi-label feature selection methods is evaluated through four multi-label metrics (Macro-F1, Micro-F1, Hamming Loss, and Zero-One Loss) on three classifiers (SVM, 3NN, and ML-kNN). Finally, the classification results verify that TCRFS outperforms the other six contrasted approaches. Therefore, candidate features, selected features, and label correlations are critical for feature relevance evaluation, and they aid in selecting a more suitable feature subset. Our current research is based on a fixed label set for multi-label feature selection; in future work, we intend to explore multi-label feature selection that integrates information theory with the streaming-label problem.