Article

Multi-Label Feature Selection Method Based on Maximum Label Complexity Ratio

1 School of Artificial Intelligence, Hebei University of Technology, Tianjin 300131, China
2 Hebei Province Key Laboratory of Big Data Calculation, Tianjin 300131, China
3 Aviation University of Airforce, Changchun 130022, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 525; https://doi.org/10.3390/electronics15030525
Submission received: 24 December 2025 / Revised: 18 January 2026 / Accepted: 23 January 2026 / Published: 26 January 2026
(This article belongs to the Special Issue New Trends for Feature Selection Applied in Data Mining)

Abstract

Multi-label feature selection, which aims to select reliable and information-rich feature subsets from high-dimensional multi-label data, plays a critical role in data mining and pattern recognition. Conventional information-theoretic methods approximate the high-order correlation between candidate features and the multi-dimensional label set by aggregating low-order mutual information between features and individual labels. However, this strategy inherently assumes all labels are equally significant, thereby overlooking their intricate distributions. To address this limitation, we first define a novel label complexity ratio based on information entropy and mutual information. We then quantify and dynamically update this ratio for each label, accounting for varying label correlations and the differential influence of selected features. Finally, we propose a new feature selection method that jointly considers the correlation with the currently most complex label, the redundancy between candidate and already-selected features, and the interaction information among these three elements to identify a high-quality feature subset. Comprehensive experiments on nine benchmark multi-label datasets demonstrate that the proposed method achieves superior performance compared to eight state-of-the-art multi-label feature selection methods.

1. Introduction

With the rapid advancement of big data and artificial intelligence technologies, vast amounts of data are being generated and stored in numerous applications. This data exhibits growing trends toward complexity and diversity. Increasingly, data objects are characterized by high-dimensional features and are associated with multiple semantic labels simultaneously. For instance, a news document may be represented by tens of thousands of word features and annotated with topics such as “economy,” “culture,” and “sports”. By analyzing the distribution patterns of multi-label training data, multi-label learning algorithms can perform multi-label classification for unseen instances [1]. This capability has led to broad applications in areas such as sentiment analysis, functional genomics classification, and image annotation [2].
However, the “curse of dimensionality” inherent in high-dimensional data significantly increases the complexity and computational burden of learning algorithms. Such high-dimensional multi-label datasets not only contain features relevant to the label set, but also include a substantial number of irrelevant and redundant features. The presence of these irrelevant and redundant features can lead to overfitting in learning models [3,4,5], substantially compromising algorithmic effectiveness. Consequently, selecting a compact yet informative feature subset that is closely related to the label set from high-dimensional multi-label data has become a critical and challenging task [6]. To address this issue, researchers have developed various multi-label feature selection methods [7,8]. These methods aim to identify a relevant feature subset from the original high-dimensional feature space while eliminating those that are irrelevant or redundant [9,10].
Based on the employed selection strategy, multi-label feature selection methods are generally categorized into three types: filter, wrapper, and embedded methods [11]. Filter methods operate independently of any specific learning algorithm and do not interfere with subsequent model training [12]. Wrapper methods evaluate feature subsets by directly assessing the classification performance of a designated predictor [13]. Embedded methods integrate the feature selection process into the training phase of the learning algorithm itself. Unlike wrapper and embedded approaches [14,15], filter methods are classifier-agnostic, offering advantages such as high computational efficiency and strong scalability [16,17,18]. In this work, we introduce a novel feature evaluation criterion following the filter-based paradigm.
In filter-based approaches, information theory provides a widely adopted evaluation criterion capable of capturing both linear and nonlinear feature relationships, thereby offering a quantitative measure of feature importance. Numerous multi-label feature selection methods founded on information theory have been developed. Broadly speaking, these methods assess features from two main perspectives: the relevance between features and the label space, and the redundancy among features [19,20,21,22]. Unlike single-label feature selection, multi-label scenarios must consider correlations between features and multiple labels. To address the challenge of evaluating feature relationships with the high-dimensional label set, many existing methods approximate this relationship by accumulating low-order mutual information between candidate features and individual labels. Other methods employ the conditional mutual information between features and labels, conditioned on other labels, to estimate the correlation with the label space [23,24,25]. This accumulation strategy is fundamentally premised on the assumption that all labels are equally significant, which introduces several key limitations when assessing candidate feature relevance: (1) it fails to differentiate the information distributions associated with different labels; (2) it neglects the dynamic interrelationships among labels; and (3) it overlooks the varying influence that selected features may exert on different labels. Specifically, the information distribution of each label exhibits varying degrees of complexity. Labels with more intricate distributions contain richer information and are therefore relatively more significant, implying that more relevant features are needed for their adequate representation. Beyond the inherent complexity of individual labels, the relationships among different labels must also be considered. While some prior improvements have acknowledged label relationships, they have not effectively distinguished or quantified label complexity. Moreover, during the feature selection process, as selected features progressively capture label information, the complexity of the corresponding labels changes dynamically. It is therefore essential to holistically consider label complexity under multiple influencing factors, ensuring the final selected feature subset sufficiently represents the intricate distribution of label information.
To quantify the complexity of various label distributions, we introduce a novel criterion termed the label complexity ratio (LCratio), derived from entropy and mutual information. Guided by this measure and rooted in information-theoretic principles, we propose a multi-label feature selection method called Multi-label Complexity Feature Selection (MLCFS). MLCFS selects features by dynamically focusing on the currently most complex label, while jointly evaluating feature-label correlation, feature–feature redundancy, and the interaction among features and labels. The goal is to select a compact yet highly informative feature subset. The detailed methodology is presented in Section 4. The main contributions of this work are summarized as follows:
(1)
We systematically investigate how dynamic changes in label complexity influence feature relevance assessment. To quantify this effect, we introduce a dynamic label complexity ratio derived from label information entropy and mutual information.
(2)
A novel multi-label feature selection method named MLCFS is proposed. This method comprehensively addresses the correlation and redundancy among features, as well as the interaction information between features and labels. Additionally, it takes into account the variations in label complexity.
(3)
To verify the effectiveness of MLCFS, experiments are conducted on nine publicly available multi-label datasets. This study compares the proposed method with eight established multi-label feature selection methods. The experimental results demonstrate that MLCFS outperforms the other comparative methods across multiple evaluation metrics, effectively reducing data dimensionality and enhancing the classification performance.
The remainder of this paper is organized as follows. Section 2 introduces basic concepts of information theory. Section 3 briefly reviews related work. Section 4 describes the proposed multi-label feature selection method, MLCFS, in detail. Section 5 presents and analyzes the experimental results to verify the effectiveness of the proposed method. Section 6 concludes the paper.

2. Preliminaries

The Basic Concepts of Information Theory

In this section, we introduce two fundamental information-theoretic concepts central to our feature selection framework: mutual information and conditional mutual information [26,27]. Mutual information measures the amount of information shared between two variables, reflecting their degree of correlation. Formally, mutual information is defined as follows:
$$I(X;Y) = H(X) - H(X|Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)} \quad (1)$$
where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are two discrete random variables, $p(x_i)$ and $p(y_j)$ are the marginal probability density functions, i.e., $p(x_i) = \frac{count(X = x_i)}{n}$ and $p(y_j) = \frac{count(Y = y_j)}{m}$, and $p(x_i, y_j)$ is the joint probability density function, computed by $p(x_i, y_j) = \frac{count(X = x_i \wedge Y = y_j)}{nm}$, where $count(\cdot)$ denotes the number of occurrences of the given values. $H(X)$ is the entropy used to measure the uncertainty of $X$, computed by $H(X) = -\sum_{x_i \in X} p(x_i)\log p(x_i)$. $H(X|Y)$ is the conditional entropy measuring the remaining uncertainty of $X$ given $Y$, computed by $H(X|Y) = -\sum_{x_i \in X}\sum_{y_j \in Y} p(x_i, y_j)\log p(x_i | y_j)$. The larger the mutual information, the more information the two random variables share, and the greater the correlation between them.
Conditional mutual information quantifies the interdependence between two random variables when a third variable is known. Let Z = { z 1 , z 2 , , z k } be another discrete random variable. The definition of the conditional mutual information between the random variables X and Y given the random variable Z is as follows:
$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = \sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{l=1}^{k} p(x_i, y_j, z_l)\log\frac{p(x_i, y_j | z_l)}{p(x_i | z_l)\,p(y_j | z_l)} \quad (2)$$
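To make these definitions concrete, the following minimal sketch estimates entropy, mutual information, and conditional mutual information from discrete sequences using plug-in frequency estimates. This is our own illustration rather than code from the paper, and the function names are ours; later sketches in Section 4 reuse these helpers.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Plug-in Shannon entropy H(X) in bits of a discrete sequence."""
    n = len(x)
    probs = np.array([c / n for c in Counter(x).values()])
    return float(-np.sum(probs * np.log2(probs)))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(z) - entropy(list(zip(x, y, z))))

# Quick check: x fully determines y, while z is independent noise.
x, y, z = [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]
print(mutual_information(x, y))                 # 1.0 bit
print(conditional_mutual_information(x, y, z))  # 1.0 bit
```

The identities used here (e.g., $I(X;Y) = H(X) + H(Y) - H(X,Y)$) are algebraically equivalent to Formulas (1) and (2).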

3. Related Work

The primary goal of multi-label feature selection is to identify a subset of features that are highly relevant to the label space from a high-dimensional candidate set. In recent years, researchers have proposed numerous methods for this task, which generally fall into two main categories based on how they handle multi-label data: problem transformation-based methods and algorithm adaptation-based methods. Problem transformation-based approaches first convert multi-label data into one or more single-label problems, using techniques such as Binary Relevance (BR) [28], Label Powerset (LP) [29], or Pruned Problem Transformation (PPT) [30]. A conventional single-label feature selection method is then applied to the transformed data. A key limitation of this paradigm is that it often fails to capture correlations among labels. In contrast, algorithm adaptation-based methods operate directly on the original multi-label dataset to select an optimal feature subset. A growing number of such methods have been developed [31,32], which explicitly leverage inter-label relationships to guide the selection of more informative features.
In recent years, algorithm adaptation-based multi-label feature selection methods have gained considerable attention. Jian et al. [33] proposed MIFS, an embedding-based method that uncovers label correlations through latent semantic analysis while jointly performing label decomposition and feature selection via a regression model. Fan et al. [34] introduced LCIFS, a method that incorporates label relationships by jointly modeling label correlations and feature redundancy. Specifically, LCIFS employs adaptive spectral graph learning to capture label structural correlations and fits the feature–label relationship using manifold-based regression, while also utilizing feature correlations to reduce redundancy in the selected subset. Dai et al. [35] presented an approach that evaluates features from a global correlation perspective and integrates this prior knowledge into an orthogonal regression optimization framework. Yin et al. [36] proposed LEFMIFS, which embeds label enhancement into feature selection. LEFMIFS first converts logical labels into real-valued label distributions and then incorporates them into a fuzzy mutual information-based feature evaluation function for multi-label feature assessment.
Information theory has been widely adopted in multi-label feature selection for quantifying nonlinear feature relationships. Sun et al. [37] developed a method that combines mutual information with constrained convex optimization to fully capture feature-label correlations. Gonzalez-Lopez et al. [38] introduced a Gaussian Mixture Model (GMM) approach that selects the optimal feature subset by maximizing the geometric mean of the mutual information between features and each label. Lee et al. devised a series of feature-evaluation measures, including D2F [39], PMU [40], and SCLS [41], which assess feature relevance and redundancy from different angles. Zhang et al. [42] incorporated conditional mutual information and proposed a label-redundancy-aware method termed LRFS. Pan et al. [43] presented an approximation of three-way interaction information, referred to as IDA in this paper, for evaluating feature correlation and redundancy. The FIMF method [44] is a fast information-theoretic technique that accelerates correlation measurement by omitting redundant entropy computations while emphasizing high-entropy labels. Zhang et al. [45] proposed MFSJMI, which embeds joint mutual information and interaction weights into the evaluation function by decomposing joint mutual information and considering multi-label correlations. By examining the feature-evaluation criteria used in the information-theoretic methods above, we can summarize them into the following unified framework:
$$J(f_k) = \mathrm{Relevance}(f_k; L) - \mathrm{Redundancy}(f_k; S) \quad (3)$$
where $f_k$ denotes the candidate feature, $L$ denotes the label set, and $S$ is the selected feature subset. $J(f_k)$ represents the evaluation criterion for the candidate feature $f_k$, with larger values indicating greater importance. $\mathrm{Relevance}(f_k; L)$ represents the correlation between $f_k$ and the label set $L$, while $\mathrm{Redundancy}(f_k; S)$ represents the redundancy between $f_k$ and the selected feature set $S$. Symbol explanations can be found in Appendix A Table A1. Table 1 summarizes the feature evaluation criteria proposed by the above representative methods based on the evaluation framework in Formula (3). Specifically, in the series of feature selection evaluation functions proposed by Lee et al., such as D2F, PMU, and SCLS, $\mathrm{Relevance}(f_k; L)$ is calculated as the sum of mutual information between the candidate feature and each label, that is, $\sum_{l_i \in L} I(f_k; l_i)$. In addition, Zhang et al. measured $\mathrm{Relevance}(f_k; L)$ using the accumulated conditional mutual information or joint mutual information of candidate features and paired labels, that is, $\sum_{l_i \neq l_j,\, l_j \in L} I(f_k; l_j \mid l_i)$. This analysis shows that existing methods commonly adopt an accumulation strategy to quantify the correlation of candidate features. However, this strategy computes the correlation between features and all labels uniformly, without fine-grained differentiation or measurement of label information. To address this limitation, this paper proposes a novel label importance measure, based on which a precise assessment of candidate feature correlation is achieved.
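For illustration, the accumulation strategy discussed above drops directly into the framework of Formula (3). The sketch below is ours; it reuses the mutual_information helper from Section 2, assumes features and labels are discrete columns, and mirrors the sum-of-MI relevance and averaged-redundancy terms found in Table 1.

```python
def relevance_accumulated(f_k, labels):
    """Relevance(f_k; L): sum of I(f_k; l_i) over all labels, treating every
    label as equally significant -- the assumption this paper challenges."""
    return sum(mutual_information(f_k, l_i) for l_i in labels)

def redundancy_average(f_k, selected):
    """Redundancy(f_k; S): average I(f_k; f_s) over the selected features."""
    if not selected:
        return 0.0
    return sum(mutual_information(f_k, f_s) for f_s in selected) / len(selected)

def j_framework(f_k, labels, selected):
    """Formula (3): J(f_k) = Relevance(f_k; L) - Redundancy(f_k; S)."""
    return relevance_accumulated(f_k, labels) - redundancy_average(f_k, selected)
```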

4. Proposed Feature Selection Method

In Section 4.1, we analyze the dynamic changes in the information carried by labels during the feature selection process. In Section 4.2, we define a new term, Label Complexity Ratio (LCratio), and analyze the cases in which the selected feature set is empty and non-empty. In Section 4.3, we propose a new feature selection method called Multi-label Complexity Feature Selection (MLCFS) and provide its pseudo-code.

4.1. The Dynamic Changes in the Label Information

In this section, we present a detailed description of the feature selection process for analyzing dynamic changes in the distribution of label information.
Information-theoretic multi-label feature selection methods typically employ a sequential forward strategy, iteratively adding one feature at a time to construct the optimal subset. To clarify this process, we present a schematic illustration in Figure 1.
Let $\{l_i, l_j, l_k, l_p\}$ denote the label space. $S_{ii}$ and $S_{jj}$ represent feature subsets selected at two different stages of the feature selection process, with $S_{ii} \subset S_{jj}$, indicating that $S_{ii}$ corresponds to an earlier stage than $S_{jj}$. Specifically, when the selected feature subset is empty, the labels' intrinsic information and the structural relationships among labels are illustrated in Figure 1a, where the size of each region represents the amount of information carried by a label, and the intersection between the information regions of two labels reflects their correlation. In the initial stage, labels $l_i$ and $l_j$ each carry substantial information, and $l_i$ shows stronger correlations with $l_k$ and $l_p$. Therefore, during the initial stage of feature selection, prioritizing features highly relevant to the label $l_i$ helps capture more of the semantic information within the label space.
Subsequently, the selected feature subset $S_{ii}$ gradually incorporates relevant features. As the selected features increasingly reflect label semantics, the remaining uncertainty in the label space decreases, leading to a corresponding change in label information complexity. As shown in Figure 1b, $S_{ii}$ captures more information about $l_i$ and $l_p$, while label $l_j$ still contains a significant portion of unrepresented information. Thus, in subsequent iterations, selecting features strongly associated with $l_j$ enhances the overall representational capacity of the final feature subset. Further, Figure 1c illustrates that after an additional round of feature selection, $S_{ii}$ is updated to a new subset $S_{jj}$. At this point, $l_k$ becomes the label with the highest remaining information content. Through the iterative sequential forward selection process, the distribution of label information changes dynamically. At each step, selecting features that are strongly correlated with the currently most informative label effectively maximizes the coverage of label information, thereby constructing a semantically richer and more representative feature subset.

4.2. Quantification of the Complexity of Labels

This section introduces a novel term to capture the dynamic variations in the complexity of label information, offering more effective guidance for evaluating feature relevance. This term incorporates the influence of both label relationships and the currently selected features, enabling its calculation and dynamic update [46] throughout the selection process.
Definition 1. 
Let $L = \{l_1, l_2, \ldots, l_q\}$ be the set of labels, where $q$ is the number of labels and $l_i$ denotes the i-th label, $1 \le i \le q$. When the selected feature set $S$ is empty, the complexity ratio of the label $l_i$ is defined as follows:
$$LCratio(l_i) = \frac{1}{2} \times \left\{ H(l_i) + \frac{1}{|L|-1} \sum_{l_j \in L \setminus l_i} \frac{2\, I(l_i; l_j)}{H(l_i) + H(l_j)} \right\} \quad (4)$$
where $H(l_i)$ quantifies the uncertainty of the information distribution of the label $l_i$ itself, and $I(l_i; l_j)$ measures the relationship between the label $l_i$ and the other labels in the label set $L$. $LCratio(l_i)$ captures the assessment of label complexity by integrating label entropy and mutual information among labels. A higher value of $LCratio(l_i)$ indicates a more complex distribution of label $l_i$ and stronger inter-label correlations, implying that the information carried by label $l_i$ is more important and thus requires more features during the feature selection process for adequate representation.
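A direct transcription of Formula (4) might look as follows. This sketch reuses the entropy and mutual_information helpers from Section 2, represents the label space Y as a list of discrete label columns, and assumes at least two labels with non-zero entropy; the names are ours.

```python
def lc_ratio(i, Y):
    """LCratio(l_i) of Formula (4): half of the label's own entropy plus its
    mean normalized mutual information with every other label."""
    l_i = Y[i]
    others = [Y[j] for j in range(len(Y)) if j != i]
    nmi = [2 * mutual_information(l_i, l_j) / (entropy(l_i) + entropy(l_j))
           for l_j in others]
    return 0.5 * (entropy(l_i) + sum(nmi) / len(others))
```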
As the forward sequential search progresses, more relevant features are iteratively added to the selected subset, which in turn captures the information of different labels to varying degrees. Therefore, the complexity of each label must be dynamically updated to reflect the influence of the features already selected. The specific update formula is given below:
$$LCratio(l_i, S) = \frac{1}{2} \times \left\{ \min_{f_{\max} \in S} H(l_i \mid f_{\max}) + \frac{1}{|L|-1} \sum_{l_j \in L \setminus l_i} \frac{2\, I(l_i; l_j \mid f_{\max})}{H(l_i \mid f_{\max}) + H(l_j \mid f_{\max})} \right\} \quad (5)$$
where the first term represents the remaining uncertainty of the label $l_i$ given the feature $f_{\max}$. Based on the principle that a smaller conditional entropy indicates that the known variable provides more information, the strategy selects the already-selected feature $f_{\max}$ from $S$ that has the greatest influence on $l_i$, which also reduces computational complexity. Since $l_i$ exhibits the least remaining uncertainty under the influence of $f_{\max}$, $f_{\max}$ captures the most informative content from $l_i$, thereby reducing the label complexity ratio. The second term accounts for the dynamic changes in the relationship between the label $l_i$ and the other labels in the label set $L$ conditioned on $f_{\max}$.
Furthermore, based on the definition of the label complexity ratio, both $LCratio(l_i)$ and $LCratio(l_i, S)$ are bounded within the interval [0, 1], with the scaling factor of 1/2 ensuring the values remain within this range. Consequently, the value of $LCratio(l_i)$ intuitively reflects the relative importance of the label $l_i$ within the entire label space, rather than merely representing an absolute amount of information. It can also be regarded as the probability that this label is selected as the most descriptive label at the current stage. The proof is as follows:
Proof. 
According to the definition and properties of entropy and mutual information in information theory, the information entropy of the (binary) label $l_i$ satisfies $0 \le H(l_i) \le 1$. Since $0 \le I(l_i; l_j) \le H(l_i)$ and $0 \le I(l_i; l_j) \le H(l_j)$, it follows that $0 \le 2I(l_i; l_j) \le H(l_i) + H(l_j)$, and further $0 \le \frac{2I(l_i; l_j)}{H(l_i) + H(l_j)} \le 1$. Averaging over the label set, $0 \le \frac{1}{|L|-1}\sum_{l_j \in L \setminus l_i} \frac{2I(l_i; l_j)}{H(l_i) + H(l_j)} \le 1$ is satisfied. In conclusion, $0 \le \frac{1}{2}\{H(l_i) + \frac{1}{|L|-1}\sum_{l_j \in L \setminus l_i} \frac{2I(l_i; l_j)}{H(l_i) + H(l_j)}\} \le 1$, that is, $0 \le LCratio(l_i) \le 1$. □
The bound for $LCratio(l_i, S)$ follows by the same reasoning.
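Formula (5) can be sketched analogously. Conditional entropies follow from the chain rule $H(A \mid F) = H(A, F) - H(F)$; the helpers from Section 2 are reused, the variable names are ours, and guards against zero-entropy denominators are omitted for brevity.

```python
def cond_entropy(a, f):
    """H(A|F) = H(A,F) - H(F) for discrete sequences."""
    return entropy(list(zip(a, f))) - entropy(f)

def lc_ratio_dynamic(i, Y, selected):
    """LCratio(l_i, S) of Formula (5): label complexity conditioned on the
    selected feature f_max that explains l_i best (minimum H(l_i | f))."""
    l_i = Y[i]
    f_max = min(selected, key=lambda f: cond_entropy(l_i, f))
    others = [Y[j] for j in range(len(Y)) if j != i]
    nmi = [2 * conditional_mutual_information(l_i, l_j, f_max)
           / (cond_entropy(l_i, f_max) + cond_entropy(l_j, f_max))
           for l_j in others]
    return 0.5 * (cond_entropy(l_i, f_max) + sum(nmi) / len(others))
```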

4.3. Proposed Method

This section proposes a new multi-label feature selection method, MLCFS, based on the label complexity ratio presented in Section 4.2, which adopts an information-theoretic measure to evaluate features through a two-stage interactive iterative strategy. In the first stage, the specific label $l_{\max}$, characterized by a complex information distribution and strong inter-label relationships within the label space, is identified using the label complexity ratio, which is calculated as follows:
$$l_{\max} = \begin{cases} \arg\max_{l_i \in L} \{LCratio(l_i)\}, & S = \emptyset \\ \arg\max_{l_i \in L} \{LCratio(l_i, S)\}, & \text{otherwise} \end{cases} \quad (6)$$
When the selected feature set $S$ is empty, $l_{\max}$ is the label corresponding to the maximum $LCratio(l_i)$ value; when $S$ is non-empty, $l_{\max}$ is the label corresponding to the maximum $LCratio(l_i, S)$ value.
In the second stage, a novel feature evaluation criterion is proposed. This criterion measures the correlation between candidate features and the label $l_{\max}$, while also accounting for redundancy relative to the already-selected features. Additionally, it incorporates the interaction information among these three components. The specific formulation is as follows:
$$J(f_k) = I(f_k; l_{\max}) - \frac{1}{|S|}\sum_{f_s \in S} I(f_k; f_s) - \frac{1}{|S|}\sum_{f_s \in S} \left\{ I(f_k; l_{\max}) - I(f_s; l_{\max} \mid f_k) \right\} \quad (7)$$
where $I(f_k; l_{\max})$ represents the correlation between the candidate feature $f_k$ and the label $l_{\max}$, $I(f_k; f_s)$ quantifies the redundancy between the candidate feature $f_k$ and the feature $f_s$ in the selected feature set $S$, and $I(f_k; l_{\max}) - I(f_s; l_{\max} \mid f_k)$ captures the interaction information among the candidate feature $f_k$, the selected feature $f_s$, and the label $l_{\max}$. The interaction term may be positive or negative. A negative value indicates that, conditioned on the feature $f_k$, $f_s$ provides more classification information for the label $l_{\max}$ than $f_k$ itself; a negative interaction term therefore translates into a positive contribution for the candidate feature $f_k$. This encourages the algorithm to favor features that complement the already-selected feature set rather than merely being redundant. A higher value of $J(f_k)$ indicates that $f_k$ provides greater representational and descriptive information about the label $l_{\max}$, implying a stronger correlation, while lower redundancy between $f_k$ and the selected features suggests that $f_k$ contributes more complementary information, thereby enhancing joint interactions within the feature set.
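A sketch of Formula (7), again reusing the helpers above (the empty-subset shortcut is our own simplification; with S empty the criterion reduces to the plain mutual information used in the first iteration of Algorithm 1):

```python
def j_mlcfs(f_k, l_max, selected):
    """Formula (7): relevance to the most complex label, minus average
    redundancy with S, minus the average interaction term."""
    rel = mutual_information(f_k, l_max)
    if not selected:
        return rel
    red = sum(mutual_information(f_k, f_s) for f_s in selected) / len(selected)
    # I(f_k; l_max) - I(f_s; l_max | f_k): negative when f_s adds more
    # information about l_max given f_k, which rewards complementary features.
    inter = sum(rel - conditional_mutual_information(f_s, l_max, f_k)
                for f_s in selected) / len(selected)
    return rel - red - inter
```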
After the current optimal candidate feature $f_{\max}$ is identified by maximizing $J(f_k)$, it is added to the selected feature set $S$. Following the forward sequential search strategy, the selection of the next optimal candidate feature begins with obtaining the corresponding specific label through $LCratio(l_i, S)$. At this stage, the inclusion of $f_{\max}$ influences the update of $LCratio(l_i, S)$, thereby reducing the complexity ratio of the label $l_{\max}$. Since the first term in $J(f_k)$ is maximized and $H(l_{\max})$ can be treated as approximately constant when evaluating all candidate features, $I(f_{\max}; l_{\max}) = H(l_{\max}) - H(l_{\max} \mid f_{\max})$ implies a smaller $H(l_{\max} \mid f_{\max})$ value. Consequently, according to Formula (5), the overall value of $LCratio(l_{\max}, S)$ decreases. This mechanism ensures that a label with a distinct complexity ratio is selected at each iteration, guiding the final feature set $S$ toward capturing more comprehensive and richer label-descriptive information. The algorithm workflow is illustrated in Figure 2. It contains two stages: selecting the specific label $l_{\max}$ carrying the most complex information distribution, and selecting the feature with the largest $J(f_k)$.
Based on the preceding analysis, feature selection necessitates not only a holistic consideration of feature correlation, redundancy, and feature–label interaction, but also an explicit awareness that distinct label distributions impose differing demands on the descriptive capacity of features. The pseudo-code of MLCFS is presented in Algorithm 1. According to the pseudo-code, the algorithm consists of three steps. Step 1 (lines 1–5) initializes the parameters and computes the complexity ratio of all labels. Step 2 (lines 7–13) selects the label with the highest label complexity ratio and adds the feature with the maximum mutual information for that label to the feature subset. Step 3 (lines 14–25) iteratively updates the label complexity ratio, selects the label with the highest updated complexity ratio, and adds the feature that maximizes Formula (7), repeating until the stopping condition is met. The third step consists of two stages: Stage A updates the label complexity ratio, and Stage B evaluates feature performance and selects candidate features.
Algorithm 1 MLCFS
Input: A training sample D with a full feature set F = {f_1, f_2, ..., f_n} and a label set L = {l_1, l_2, ..., l_q}; user-specified threshold K.
Output: The selected feature subset S.
  //Step 1: Compute initial label complexity ratios for all labels
  1: Initialize S ← Ø;
  2: Initialize k ← 0;
  3: For i = 1 to q do
  4:   Calculate the complexity of the label l_i based on LCratio(l_i);
  5: End for
  6: While k < K do
  //Step 2: First iteration (when S is empty)
  7:   If k = 0 then
  8:     Select the label l_max with the largest LCratio(l_i);
  9:     Select the feature f_max with the largest I(f_m; l_max);
  10:    F = F \ {f_max};
  11:    S = S ∪ {f_max};
  12:    k = k + 1;
  13:  End if
  //Step 3: Subsequent iterations
  //Stage A: Update label complexity ratios considering selected features
  14:  For i = 1 to q do
  15:    Update LCratio(l_i, S) according to Formula (5);
  16:  End for
  17:  Select the label l_max with the largest LCratio(l_i, S);
  //Stage B: Comprehensive feature evaluation
  18:  For each candidate feature f_m ∈ F do
  19:    Calculate J(f_m) according to Formula (7);
  20:  End for
  21:  Select the feature f_max with the largest J(f_m);
  22:  F = F \ {f_max};
  23:  S = S ∪ {f_max};
  24:  k = k + 1;
  25: End while
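Putting the pieces together, a compact transcription of Algorithm 1 might read as follows. It builds on the lc_ratio, lc_ratio_dynamic, j_mlcfs, and mutual_information sketches above; the column-list data layout and the names are our own assumptions.

```python
def mlcfs(X_cols, Y_cols, K):
    """Greedy MLCFS selection following Algorithm 1.
    X_cols / Y_cols: lists of discrete feature / label columns.
    K: number of features to select. Returns selected feature indices."""
    remaining = set(range(len(X_cols)))
    selected_idx = []
    # Steps 1-2: initial LCratio for every label, then the first feature is
    # the one with maximum mutual information for the most complex label.
    i_max = max(range(len(Y_cols)), key=lambda i: lc_ratio(i, Y_cols))
    best = max(remaining,
               key=lambda m: mutual_information(X_cols[m], Y_cols[i_max]))
    remaining.discard(best)
    selected_idx.append(best)
    # Step 3: alternate Stage A (update LCratio(l_i, S)) and Stage B (score J).
    while len(selected_idx) < K and remaining:
        S = [X_cols[i] for i in selected_idx]
        i_max = max(range(len(Y_cols)),
                    key=lambda i: lc_ratio_dynamic(i, Y_cols, S))
        best = max(remaining,
                   key=lambda m: j_mlcfs(X_cols[m], Y_cols[i_max], S))
        remaining.discard(best)
        selected_idx.append(best)
    return selected_idx
```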

4.4. Theoretical and Time Complexity Analysis

Most information-theoretic methods (e.g., D2F, PMU, MFSJMI) treat all labels as equally important. They aggregate mutual information across labels or consider pairwise label interactions, which is based on an implicit assumption of uniform label importance. Furthermore, although FIMF introduces label weights, these weights are not updated during the feature selection process. Overall, these methods are unable to differentiate and quantify the complexity of different label distributions. In contrast, the MLCFS algorithm dynamically focuses on the currently most complex label, thereby ensuring balanced coverage of the entire label space and overcoming the limitations of existing feature selection strategies in handling label information distribution.
We present a time-complexity analysis for the proposed MLCFS method and eight representative information-theoretic feature selection methods (D2F, PMU, SCLS, LRFS, FIMF, IDA, MFSJMI, and MIFS). Let $n$ denote the number of instances, $d$ the number of features, $q$ the number of labels, and $w$ the size of the selected feature subset. Since probability estimation requires scanning all instances, computing mutual information, conditional mutual information, and interaction information typically incurs a time complexity of $O(n)$. The time complexities of all compared methods are summarized in Table 2.
The analysis shows that MLCFS achieves the same time complexity as the SCLS method. Furthermore, the time complexity of MLCFS is lower than that of D2F, PMU, LRFS, FIMF, and IDA. Specifically, compared to D2F, FIMF, IDA, and MFSJMI, MLCFS saves a factor of $q$ in the second term by focusing on a single label $l_{\max}$ per iteration instead of summing over all $q$ labels when computing feature redundancy and interaction. Compared to PMU and LRFS, which incur an $O(ndq^2)$ cost for second-order label correlations, MLCFS uses the precomputed and dynamically updated LCratio to guide label selection without repeatedly calculating pairwise label interactions during feature evaluation.
Therefore, while introducing a novel dynamic label complexity awareness, the proposed MLCFS method maintains competitive computational efficiency compared to simpler methods and is more efficient than several state-of-the-art methods that account for label correlations.

5. Experimental Results and Analysis

In this section, experiments are conducted on nine publicly available multi-label datasets to evaluate the effectiveness of the proposed multi-label feature selection method MLCFS. The proposed method is compared with eight representative and widely used multi-label feature selection methods. The evaluation metrics are described in Section 5.1. The dataset descriptions and experimental settings are provided in Section 5.2. The experimental results and detailed analysis are presented in Section 5.3. The significance tests of the experimental results are provided in Section 5.4.

5.1. Evaluation Metrics for Multi-Label Feature Selection

To comprehensively evaluate the performance of multi-label algorithms, we employ four widely recognized metrics commonly used in multi-label learning [47]. Suppose that $D = \{(x_i, l_i) \mid x_i \in X, l_i \subseteq L\}$ is a multi-label training dataset, $U = \{x_1, x_2, \ldots, x_n\}$ is a set containing $n$ samples, $F = \{f_1, f_2, \ldots, f_d\}$ denotes the set of features, and $L = \{l_1, l_2, \ldots, l_q\}$ is the set of labels. For $x_i \in U$, $L(x_i)$ and $L'(x_i)$ denote the true label set and the predicted label set, respectively. The specific definitions of the evaluation metrics are as follows.
(1)
Hamming Loss (HL): HL evaluates how frequently instance-label pairs are misclassified.
$$HL = \frac{1}{n}\sum_{i=1}^{n} \frac{\left|L(x_i)\, \Delta\, L'(x_i)\right|}{q} \quad (8)$$
where $\Delta$ computes the symmetric difference between $L(x_i)$ and $L'(x_i)$.
(2)
Average Precision (AP): AP evaluates the average fraction of relevant labels that are ranked no lower than each relevant label.
$$AP = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|L(x_i)|} \sum_{l \in L(x_i)} \frac{\left|\{l' \mid rank(g(x_i, l')) \le rank(g(x_i, l)),\ l' \in L(x_i)\}\right|}{rank(g(x_i, l))} \quad (9)$$
where $rank(g(x_i, l))$ records the position of label $l$ when all labels are ranked in descending order of their scores according to $g(\cdot)$.
(3)
Ranking Loss (RL): RL evaluates the average fraction of label pairs that are reversely ordered for a sample, i.e., an irrelevant label is ranked above a relevant one.
$$RL = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|L(x_i)||\bar{L}(x_i)|} \left|\{(l', l'') \mid f(x_i, l') \le f(x_i, l''),\ (l', l'') \in L(x_i) \times \bar{L}(x_i)\}\right| \quad (10)$$
where $\bar{L}(x_i)$ represents the complement of the true label set $L(x_i)$, and $L(x_i) \times \bar{L}(x_i)$ represents the Cartesian product of $L(x_i)$ and $\bar{L}(x_i)$.
(4)
Coverage Error (CE): CE evaluates how many steps it takes on average to move down the list of ranked labels, covering all relevant labels for the sample.
$$CE = \frac{1}{n}\sum_{i=1}^{n} \max_{l \in L(x_i)} rank(g(x_i, l)) - 1 \quad (11)$$
The evaluation metrics adopted in the experiment follow the principle that smaller HL, RL, and CE values indicate better performance. Conversely, the higher the value of AP, the better the classification performance.
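For reference, scikit-learn ships implementations of all four metrics; a minimal sketch on toy arrays is shown below (the variable names are ours). Note that sklearn's coverage_error counts the deepest relevant rank itself, so one is subtracted to match Formula (11).

```python
import numpy as np
from sklearn.metrics import (hamming_loss, coverage_error, label_ranking_loss,
                             label_ranking_average_precision_score)

Y_true = np.array([[1, 0, 1], [0, 1, 0]])               # ground-truth label sets
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])               # binary predictions, for HL
scores = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]])   # label scores g(x, l)

hl = hamming_loss(Y_true, Y_pred)
ap = label_ranking_average_precision_score(Y_true, scores)
rl = label_ranking_loss(Y_true, scores)
ce = coverage_error(Y_true, scores) - 1  # sklearn counts the last relevant rank
print(hl, ap, rl, ce)
```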

5.2. Description of Multi-Label Benchmark Datasets and Experimental Settings

We evaluate the performance of our method on nine publicly available multi-label benchmark datasets from the Mulan repository [48], datasets that are well-established in prior work [49,50,51]. Table 3 summarizes their key characteristics, including the number of instances, features, and labels. To comprehensively assess the effectiveness of the proposed approach, we compare it with eight representative multi-label feature selection methods: MIFS, D2F, PMU, SCLS, LRFS, FIMF, IDA, and MFSJMI. For each dataset, the top 20% of features selected by each method are used to compute average performance scores and standard deviations. The classification performance is evaluated using the MLKNN classifier [52] with four standard metrics: Hamming Loss (HL), Average Precision (AP), Ranking Loss (RL), and Coverage Error (CE). Following common practice, we set the number of neighbors K = 10 and the smoothing factor to 1.
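Under these settings, the evaluation protocol can be sketched as follows, assuming the scikit-multilearn implementation of ML-KNN (its MLkNN class exposes the neighbor count k and smoothing factor s); the synthetic data and the placeholder index list selected_idx stand in for a real dataset and a real feature ranking.

```python
import numpy as np
from skmultilearn.adapt import MLkNN

rng = np.random.default_rng(0)
X_train, X_test = rng.random((100, 50)), rng.random((30, 50))
Y_train = rng.integers(0, 2, (100, 5))

selected_idx = list(range(10))  # placeholder: top 20% of 50 features

clf = MLkNN(k=10, s=1.0)  # K = 10 neighbors, smoothing factor 1
clf.fit(X_train[:, selected_idx], Y_train)
Y_pred = clf.predict(X_test[:, selected_idx]).toarray()        # binary predictions
scores = clf.predict_proba(X_test[:, selected_idx]).toarray()  # scores for AP/RL/CE
```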

5.3. Classification Results and Analysis

Table 4, Table 5, Table 6 and Table 7 present the experimental results of the nine multi-label feature selection methods on the nine benchmark datasets. The best value in each row is highlighted in bold; the last row reports the average performance of each method over all datasets. By comparing four evaluation metrics among MIFS, D2F, PMU, SCLS, LRFS, FIMF, IDA, MFSJMI, and the proposed MLCFS, the effectiveness of MLCFS is comprehensively confirmed. As shown in Table 4, MLCFS achieves the lowest Hamming Loss (HL) on five datasets and obtains competitive results on the scene and medical datasets, reflecting its overall strength in minimizing label-wise misclassification. Table 5 shows that MLCFS outperforms all other methods on the Coverage Error (CE) metric, attaining the best score on eight of the nine datasets.
In Table 6, MLCFS delivers better Ranking Loss (RL) values than all compared methods on six datasets; its average RL is also the lowest among all methods, indicating superior ranking consistency. Regarding Average Precision (AP) in Table 7, MLCFS exhibits outstanding performance on eight datasets and achieves the highest average AP, confirming its advantage in retrieving relevant labels early. Overall, the results demonstrate that MLCFS performs consistently well across all evaluation metrics. These findings highlight the importance of accounting for label complexity distribution and the dynamic interactions between labels and selected features during the feature-selection process.
Figure 3 graphically summarizes the average rank results derived from Table 4, Table 5, Table 6 and Table 7. In each subfigure, the horizontal axis corresponds to the nine compared methods, while the vertical axis shows the average rank of each method across all experimental datasets. MLCFS consistently achieves the best average rank on all four metrics, demonstrating its superior ranking consistency and overall effectiveness relative to the eight benchmark methods.
To evaluate the sensitivity of MLCFS to the number of selected features, Figure 4, Figure 5, Figure 6 and Figure 7 present the performance evolution curves of all methods as the feature subset size increases from 1% to 20%, with a 1% increment. A comprehensive analysis reveals that MLCFS generally reaches performance saturation with fewer features; even at very low feature proportions (<5%), it maintains leading performance, indicating that the initial critical features it selects are of high quality. Moreover, the curves of MLCFS exhibit smooth and steady trends across all datasets, without abnormal fluctuations, demonstrating the robustness of its selection process.
To evaluate the sensitivity of the features selected by different feature selection methods to classifier parameters, we conduct additional experiments. Specifically, we fix the feature subset selected by each method (top 20%), vary the neighborhood parameter K of the ML-KNN classifier (taking values from {5, 10, 15}), and observe the corresponding changes in classification performance.
Figure 8, Figure 9 and Figure 10 illustrate the variation in the AP metric with respect to the parameter K on three representative datasets. As shown, the curve corresponding to MLCFS consistently outperforms those of other methods across all values of K. Furthermore, the MLCFS curve exhibits smaller fluctuations, indicating that its performance is less sensitive to changes in K. Similar trends are observed on other datasets and across different evaluation metrics. These results demonstrate that the feature subset selected by MLCFS provides a more stable and robust representation for the classifier, enabling consistently strong performance under varying parameter configurations.

5.4. Statistical Tests

To further investigate whether there are significant differences in the classification performance of the proposed algorithm, MLCFS, and the eight comparative feature selection algorithms across the four evaluation metrics, the Friedman test and the Bonferroni–Dunn test [53,54] were employed. Table 8 presents the average ranking results of the MLCFS algorithm and all comparative algorithms on the nine experimental datasets under the four evaluation metrics. The results show that the MLCFS algorithm achieved the best ranking on every metric. For $K$ algorithms and $N$ datasets, $r_j^i$ denotes the rank of the i-th algorithm on the j-th dataset, and $R_i = \frac{1}{N}\sum_{j=1}^{N} r_j^i$ denotes the average rank of the i-th algorithm. The Friedman statistic $F_F$ follows an F-distribution with $(K-1)$ degrees of freedom in the numerator and $(K-1)(N-1)$ in the denominator.
$$F_F = \frac{(N-1)\,\chi_F^2}{N(K-1) - \chi_F^2}, \quad \text{where} \quad \chi_F^2 = \frac{12N}{K(K+1)}\left(\sum_{i=1}^{K} R_i^2 - \frac{K(K+1)^2}{4}\right) \quad (12)$$
Table 9 summarizes the value of $F_F$ and the corresponding critical value. If the $F_F$ value is greater than the critical value, the null hypothesis, which states that the classification performance of all compared algorithms is equal, is rejected. As shown in Table 9, the null hypothesis was clearly rejected on each evaluation metric at the significance level $\alpha = 0.05$.
Therefore, the Bonferroni–Dunn test was subsequently employed to further analyze the relative performance between the proposed algorithm and the other comparative algorithms. If the average ranks of the proposed algorithm MLCFS and a compared algorithm across all datasets fall within one critical difference (CD), they are considered statistically similar. Conversely, if the difference in average ranks exceeds the CD, the two algorithms differ significantly in classification performance.
With $K = 9$ and $N = 9$, $CD = q_\alpha \sqrt{\frac{K(K+1)}{6N}}$, where $q_\alpha = 2.724$ at $\alpha = 0.05$; thus, $CD = 3.516$. Figure 11 presents the critical difference diagrams for each classification evaluation metric, with the average ranks of the nine feature selection algorithms plotted along the axis. The rank increases from right to left.
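For concreteness, the statistic of Formula (12) and the critical difference can be reproduced in a few lines (a sketch using the constants stated above; scipy.stats.friedmanchisquare offers the raw chi-square variant if per-dataset scores rather than average ranks are available):

```python
import numpy as np

K, N, q_alpha = 9, 9, 2.724  # algorithms, datasets, Bonferroni-Dunn q at alpha = 0.05

def friedman_ff(avg_ranks):
    """F_F statistic of Formula (12) from the K average ranks R_i."""
    chi2 = 12 * N / (K * (K + 1)) * (np.sum(np.square(avg_ranks))
                                     - K * (K + 1) ** 2 / 4)
    return (N - 1) * chi2 / (N * (K - 1) - chi2)

cd = q_alpha * np.sqrt(K * (K + 1) / (6 * N))
print(round(cd, 3))  # 3.517; matches the reported CD = 3.516 up to rounding
```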
In Figure 11, any compared method whose average rank falls within one critical difference (CD) of the best-performing algorithm is connected to it by a thick red line; methods not connected by such a line are considered to exhibit significantly different performance. Overall, the proposed method MLCFS significantly outperforms most compared methods across the evaluation metrics.

6. Conclusions and Future Work

In this paper, we propose a novel feature selection method for multi-label learning, designed to identify a compact and informative feature subset. We first define a label complexity ratio based on information entropy and mutual information, which quantifies the varying complexity across different label distributions. This ratio is then dynamically updated via conditional mutual information to reflect the influence of already selected features. Building on this foundation, we introduce a new feature evaluation criterion that maximizes the label complexity ratio while holistically accounting for feature correlation, redundancy, and interaction. Finally, we validate the proposed method, termed MLCFS, on multiple multi-label benchmark datasets using four standard evaluation metrics. Experimental results confirm that MLCFS outperforms several representative feature selection methods.
The proposed MLCFS method dynamically selects features based on the currently most complex label. While this method has demonstrated effectiveness in our experiments, it remains subject to certain limitations. For instance, with respect to feature redundancy, the influence of label complexity on pairwise feature redundancy has not been incorporated into the current framework. In future work, we will conduct an in-depth investigation into the different roles of feature redundancy in information-theoretic feature selection to consider the dynamic changes in redundancy within the selected feature subset and the label space.

Author Contributions

Conceptualization, P.Z. and Y.C.; methodology, P.Z.; software, Y.C.; validation, P.Z. and Y.C.; formal analysis, P.Z.; investigation, P.Z. and L.W.; resources, Y.C.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, P.Z. and L.W.; visualization, Y.C.; supervision, P.Z. and L.W.; project administration, P.Z.; funding acquisition, P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 62206085, and Grant No. 62376088; The National Natural Science Foundation of Hebei Province under Grants No. F2025202050 and No. F202420204; The Hebei Province Yanzhao Golden Platform Talent Gathering Program Key Talent Project (Postdoctoral Platform) (No. B2024005001).

Data Availability Statement

The original data presented in the study are openly available in [Mulan] at [https://mulan.sourceforge.net/] (accessed on 1 May 2025) or reference [Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J. Mulan: a Java library for multi-label learning. Journal of Machine Learning Research. 2011, 12, 2411–2414 [48]].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Symbols and Notations.

| Symbols | Notations |
|---|---|
| $U = \{x_1, x_2, \ldots, x_n\}$ | set of samples |
| $F = \{f_1, f_2, \ldots, f_d\}$ | set of features |
| $L = \{l_1, l_2, \ldots, l_q\}$ | set of labels |
| $f_k$ | the k-th candidate feature (general notation) |
| $l_i \in L$ | the i-th label |
| $f_s \in S$ | a feature in the selected feature subset |
| $S \subseteq F$ | the selected feature subset |

References

1. Huang, R.; Wu, Z. Multi-label feature selection via manifold regularization and dependence maximization. Pattern Recognit. 2021, 120, 108149.
2. Wu, J.S.; Huang, S.J.; Zhou, Z.H. Genome-wide protein function prediction through multi-instance multi-label learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2014, 11, 891–902.
3. Spolaôr, N.; Monard, M.C.; Tsoumakas, G.; Lee, H.D. A systematic review of multi-label feature selection and a new method based on label construction. Neurocomputing 2016, 180, 3–15.
4. Gao, W.; Hu, L.; Zhang, P. Class-specific mutual information variation for feature selection. Pattern Recognit. 2018, 79, 328–339.
5. Lin, Y.; Hu, Q.; Liu, J.; Chen, J.; Duan, J. Multi-label feature selection based on neighborhood mutual information. Appl. Soft Comput. 2016, 38, 244–256.
6. Deng, W.; Xu, H.; Guan, Z.; Sun, Y.; Ran, X.; Ma, H.; Zhou, X.; Zhao, H. PSO-K-Means Clustering-Based NSGA-III for Delay Recovery. IEEE Trans. Consum. Electron. 2025, 71, 10084–10095.
7. Huang, R.; Jiang, W.; Sun, G. Manifold-based constraint Laplacian score for multi-label feature selection. Pattern Recognit. Lett. 2018, 112, 346–352.
8. Dai, J.; Chen, J.; Liu, Y.; Hu, H. Novel multi-label feature selection via label symmetric uncertainty correlation learning and feature redundancy evaluation. Knowl.-Based Syst. 2020, 207, 106342.
9. Lee, J.; Kim, D.W. Memetic feature selection algorithm for multi-label classification. Inf. Sci. 2015, 293, 80–96.
10. Kashef, S.; Nezamabadi-pour, H. A label-specific multi-label feature selection algorithm based on the Pareto dominance concept. Pattern Recognit. 2019, 88, 654–667.
11. Pereira, R.B.; Plastino, A.; Zadrozny, B.; Merschmann, L.H. Categorizing feature selection methods for multi-label classification. Artif. Intell. Rev. 2018, 49, 57–78.
12. Lee, J.; Kim, D.W. Efficient multi-label feature selection using entropy-based label selection. Entropy 2016, 18, 405.
13. Hall, M.A. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, CA, USA, 29 June–2 July 2000; pp. 359–366.
14. Guyon, I.; Weston, J.; Barnhill, S. Gene Selection for Cancer Classification using Support Vector Machines. Mach. Learn. 2002, 46, 389–422.
15. Mejia-Lavalle, M.; Sucar, E.; Arroyo, G. Feature selection with a perceptron neural net. In Proceedings of the International Workshop on Feature Selection for Data Mining, Bethesda, MD, USA, 22 April 2006; pp. 131–135.
16. Yu, L.; Liu, H. Efficient Feature Selection via Analysis of Relevance and Redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224.
17. Liu, H.; Setiono, R. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 5–8 November 1995; pp. 388–391.
18. Liu, H.; Yu, L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 2005, 17, 491–502.
19. Li, Y.H.; Hu, L.; Gao, W.F. Multi-label feature selection based on sparse coefficient matrix reconstruction. Chin. J. Comput. 2022, 45, 1827–1841. (In Chinese with English abstract)
20. Wu, J.S.; Li, Y.L.; Huang, C. Recent Advances in Unsupervised Multi-view Feature Selection. J. Softw. 2025, 36, 886–914.
21. Li, Y.H.; Hu, L.; Zhang, P. Multi-label feature selection based on dynamic graph Laplacian. J. Commun. 2020, 41, 47–59.
22. Sechidis, K.; Spyromitros-Xioufis, E.; Vlahavas, I. Information theoretic multi-target feature selection via output space quantization. Entropy 2019, 21, 855.
23. Liu, J.; Lin, Y.; Ding, W.; Zhang, H.; Du, J. Fuzzy mutual information-based multilabel feature selection with label dependency and streaming labels. IEEE Trans. Fuzzy Syst. 2023, 31, 77–91.
24. Zhang, L.; Wang, C. Multi-label feature selection algorithm based on joint mutual information of max-relevance and min-redundancy. J. Commun. 2018, 39, 111–122.
25. Wang, G.Y.; Yu, H.; Yang, D.C. Decision table reduction based on conditional information entropy. Chin. J. Comput. 2002, 25, 759–766. (In Chinese with English abstract)
26. Liu, J.; Li, Y.; Weng, W. Feature selection for multi-label learning with streaming label. Neurocomputing 2020, 387, 268–278.
27. Sun, L.; Wang, L.; Ding, W.; Qian, Y.; Xu, J. Feature Selection Using Fuzzy Neighborhood Entropy-Based Uncertainty Measures for Fuzzy Neighborhood Multigranulation Rough Sets. IEEE Trans. Fuzzy Syst. 2021, 29, 19–33.
28. Boutell, M.R.; Luo, J.; Shen, X. Learning multi-label scene classification. Pattern Recognit. 2004, 37, 1757–1771.
29. Trohidis, K.; Tsoumakas, G.; Kalliris, G. Multi-label classification of music by emotion. EURASIP J. Audio Speech Music Process. 2011, 2011, 4.
30. Read, J. A pruned problem transformation method for multi-label classification. In Proceedings of the 2008 New Zealand Computer Science Research Student Conference, Christchurch, New Zealand, 14–18 April 2008; pp. 143–150.
31. Yin, T.; Chen, H.; Wan, J.; Zhang, P.; Horng, S.J.; Li, T. Exploiting feature multi-correlations for multilabel feature selection in robust multi-neighborhood fuzzy β covering space. Inf. Fusion 2024, 104, 102150.
32. Zhang, Y.; Huo, W.; Tang, J. Multi-label feature selection via latent representation learning and dynamic graph constraints. Pattern Recognit. 2024, 151, 110411.
33. Jian, L.; Li, J.; Shu, K.; Liu, H. Multi-label informed feature selection. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York, NY, USA, 9–15 July 2016; pp. 1627–1633.
34. Fan, Y.; Liu, J.; Tang, J. Learning correlation information for multi-label feature selection. Pattern Recognit. 2024, 145, 109899.
35. Dai, J.; Liu, Q.; Chen, W. Multi-label feature selection based on fuzzy mutual information and orthogonal regression. IEEE Trans. Fuzzy Syst. 2024, 32, 5136–5148.
36. Yin, T.; Chen, H.; Yuan, Z. LEFMIFS: Label enhancement and fuzzy mutual information for robust multilabel feature selection. Eng. Appl. Artif. Intell. 2024, 133, 108108.
37. Sun, Z.; Zhang, J.; Dai, L.; Li, C.; Zhou, C.; Xin, J.; Li, S. Mutual information based multi-label feature selection via constrained convex optimization. Neurocomputing 2019, 329, 447–456.
38. Gonzalez-Lopez, J.; Ventura, S.; Cano, A. Distributed multi-label feature selection using individual mutual information measures. Knowl.-Based Syst. 2020, 188, 105052.
39. Lee, J.; Kim, D.W. Mutual information-based multi-label feature selection using interaction information. Expert Syst. Appl. 2015, 42, 2013–2025.
40. Lee, J.; Kim, D.W. Feature selection for multi-label classification using multivariate mutual information. Pattern Recognit. Lett. 2013, 34, 349–357.
41. Lee, J.; Kim, D.W. SCLS: Multi-label feature selection based on scalable criterion for large label set. Pattern Recognit. 2017, 66, 342–352.
42. Zhang, P.; Liu, G.; Gao, W. Distinguishing two types of labels for multi-label feature selection. Pattern Recognit. 2019, 95, 72–82.
43. Pan, M.; Sun, Z.; Wang, C.; Cao, G. A multi-label feature selection method based on an approximation of interaction information. Intell. Data Anal. 2022, 26, 823–840.
44. Lee, J.; Kim, D.W. Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognit. 2015, 48, 2761–2771.
45. Zhang, P.; Liu, G.; Song, J. MFSJMI: Multi-label feature selection considering join mutual information and interaction weight. Pattern Recognit. 2023, 138, 109378.
46. Guo, D.; Zhang, J.; Yang, B.; Lin, Y. Multi-modal intelligent situation awareness in real-time air traffic control: Control intent understanding and flight trajectory prediction. Chin. J. Aeronaut. 2025, 38, 103376.
47. Zhao, J.; Yang, C.; Gao, W.; Park, J.H. ADP-based optimal control of linear singularly perturbed systems with uncertain dynamics: A two-stage value iteration method. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 4399–4403.
48. Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J. Mulan: A Java library for multi-label learning. J. Mach. Learn. Res. 2011, 12, 2411–2414.
49. Cai, Z.; Zhu, W. Multi-label feature selection via feature manifold learning and sparsity regularization. Int. J. Mach. Learn. Cybern. 2018, 9, 1321–1334.
50. Rodrigues, D.; Pereira, L.; Nakamura, R. A wrapper approach for feature selection based on bat algorithm and optimum-path forest. Expert Syst. Appl. 2014, 41, 2250–2258.
51. Zhang, J.; Luo, Z.; Li, C. Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognit. 2019, 95, 136–150.
52. Zhang, L.; Wang, Z. Multi-label Feature Selection Algorithm Based on Maximum Correlation and Minimum Redundancy Joint Mutual Information. J. Commun. 2018, 39, 111–122.
53. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92.
54. Dunn, O.J. Multiple comparisons among means. J. Am. Stat. Assoc. 1961, 56, 52–64.
Figure 1. The dynamic changes in the label information during the feature selection process: (a) original label space; (b) label space based on $S_{ii}$; (c) label space based on $S_{jj}$. The white part is the selected feature subset that is relevant to the shown label.
Figure 2. Two-stage interactive iterative strategy of the proposed method.
Figure 3. The statistical results of Avgrank: (a) hamming loss, (b) coverage error, (c) ranking loss, (d) average precision.
Figure 4. Classification performance comparisons in terms of the HL metric.
Figure 5. Classification performance comparisons in terms of the CE metric.
Figure 6. Classification performance comparisons in terms of the RL metric.
Figure 7. Classification performance comparisons in terms of the AP metric.
Figure 8. Classification performance comparisons in terms of the AP metric on computers: (a) K = 5, (b) K = 10, (c) K = 15.
Figure 9. Classification performance comparisons in terms of the AP metric on reference: (a) K = 5, (b) K = 10, (c) K = 15.
Figure 10. Classification performance comparisons in terms of the AP metric on social: (a) K = 5, (b) K = 10, (c) K = 15.
Figure 11. The CD diagrams using the Bonferroni–Dunn test: (a) hamming loss, (b) coverage error, (c) ranking loss, (d) average precision.
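The critical difference underlying these diagrams follows the standard Bonferroni–Dunn formulation; as a quick reference, the following is our restatement of that standard formula with $k = 9$ methods and $N = 9$ datasets (the exact $q_{\alpha}$ value depends on the significance level chosen):

```latex
% Critical difference for the Bonferroni--Dunn test: two methods differ
% significantly when their average ranks (Table 8) differ by more than CD.
% With k = 9 methods and N = 9 datasets, sqrt(9*10 / (6*9)) ~ 1.291.
\[
  \mathrm{CD} = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}
              = q_{\alpha}\sqrt{\frac{9 \times 10}{6 \times 9}}
              \approx 1.291\, q_{\alpha}
\]
```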
Table 1. Summary of evaluation criteria for representative feature selection methods.

| Methods | Relevance $(f_k; L)$ | Redundancy $(f_k; S)$ |
|---|---|---|
| D2F | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S}\sum_{l_i \in L} I(f_k; f_j; l_i)$ |
| PMU | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S}\sum_{l_i \in L} I(f_k; f_j; l_i)$ |
| SCLS | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S}\dfrac{I(f_k; f_j)}{H(f_k)}\sum_{l_i \in L} I(f_k; l_i)$ |
| FIMF | $\sum_{l_i \in L} I(f_k; l_i)$ | / |
| LRFS | $\sum_{l_i \in L} I(f_k; l_i)$ | $\dfrac{1}{|S|}\sum_{f_j \in S} I(f_k; f_j)$ |
| IDA | $\dfrac{1}{|L|}\Big\{\sum_{l_i \in L} I(f_k; l_i) - \dfrac{1}{2}\sum_{l_q \in L}\sum_{l_j \in L,\, q \neq j} I(f_k; l_q; l_j)\Big\}$ | $\dfrac{1}{|S|}\Big\{\sum_{f_i \in S} I(f_k; f_i) - \dfrac{1}{2}\sum_{f_q \in S}\sum_{f_j \in S,\, f_q \neq f_j} I(f_k; f_q; f_j)\Big\}$ |
| MFSJMI | $\sum_{l_i \in L}\sum_{l_j \in L \setminus \{l_i\}} \big[ I(f_k; l_i \mid l_j) + I(f_k; l_i) - I(l_i; l_j; f_k) \big]$ | $\sum_{f_j \in S} I(f_k; f_j)$ |
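Several criteria in Table 1 rely on third-order interaction information terms such as $I(f_k; f_j; l_i)$. The sketch below shows one way such a term can be estimated from discretized data, assuming the sign convention $I(f; g; l) = I(f; g \mid l) - I(f; g)$; the function name and the scikit-learn-based estimator are our illustrative choices, not the authors' code:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def interaction_information(f, g, l):
    """Estimate I(f; g; l) = I(f; g | l) - I(f; g) from discrete samples.

    f, g, l are 1-D integer arrays of equal length. Under this sign
    convention, positive values indicate synergy between f and g about l,
    and negative values indicate redundancy.
    """
    # Unconditional mutual information I(f; g).
    mi = mutual_info_score(f, g)

    # Conditional mutual information I(f; g | l): average I(f; g) within
    # each stratum of l, weighted by the stratum's empirical probability.
    n = len(l)
    cmi = 0.0
    for v in np.unique(l):
        mask = (l == v)
        cmi += mask.sum() / n * mutual_info_score(f[mask], g[mask])

    return cmi - mi
```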
Table 2. The time complexity of the nine methods.

| Methods | Time Complexity |
|---|---|
| MLCFS | $O(ndq + wnd)$ |
| D2F | $O(ndq + wndq)$ |
| PMU | $O(ndq + wndq + ndq^2)$ |
| SCLS | $O(ndq + wnd)$ |
| LRFS | $O(ndq^2 + wnd)$ |
| FIMF | $O(ndq + wndq)$ |
| IDA | $O(ndq + wndq)$ |
| MFSJMI | $O(ndq + wndq)$ |
| MIFS | $O(n(d^2 + dq) + wnd^2)$ |

Here $n$, $d$, and $q$ denote the numbers of instances, features, and labels, respectively, and $w$ denotes the number of selected features.
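To make these entries concrete, the following is a minimal sketch of a generic greedy, mutual-information-based forward selector; it is our illustration of where the two dominant terms come from, not the authors' MLCFS implementation, and the function and variable names are assumptions. Precomputing the feature–label relevance matrix costs $O(ndq)$, and updating redundancy against only the newest selected feature contributes $O(wnd)$:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_mi_selection(X, Y, w):
    """Greedy forward selection sketch over discretized data.

    X: (n, d) matrix of discretized features; Y: (n, q) binary label
    matrix; w: number of features to select.
    """
    n, d = X.shape
    q = Y.shape[1]

    # Relevance of every feature to every label: d*q mutual-information
    # estimates, each over n instances -> the O(ndq) term in Table 2.
    relevance = np.array([[mutual_info_score(X[:, k], Y[:, i])
                           for i in range(q)] for k in range(d)])

    selected, candidates = [], list(range(d))
    redundancy = np.zeros(d)  # accumulated redundancy w.r.t. selected set

    for _ in range(w):
        # Score = aggregated relevance minus accumulated redundancy.
        scores = relevance.sum(axis=1) - redundancy
        best = max(candidates, key=lambda k: scores[k])
        selected.append(best)
        candidates.remove(best)

        # Update redundancy against the newly selected feature only:
        # d mutual-information estimates per round -> O(wnd) overall.
        for k in candidates:
            redundancy[k] += mutual_info_score(X[:, k], X[:, best])

    return selected
```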
Table 3. Description of the multi-label datasets.

| Datasets | Instances | Train | Test | Features | Labels | Label Cardinality | Label Density | Domain |
|---|---|---|---|---|---|---|---|---|
| scene | 2407 | 1211 | 1196 | 294 | 6 | 1.074 | 0.179 | Image |
| yeast | 2417 | 1500 | 917 | 103 | 14 | 4.237 | 0.303 | Biology |
| computers | 5000 | 2000 | 3000 | 681 | 33 | 1.508 | 0.046 | Yahoo |
| health | 5000 | 2000 | 3000 | 612 | 32 | 1.662 | 0.052 | Text |
| reference | 5000 | 2000 | 3000 | 793 | 33 | 1.169 | 0.035 | Yahoo |
| social | 5000 | 2000 | 3000 | 1047 | 39 | 1.282 | 0.033 | Text |
| medical | 978 | 333 | 645 | 1449 | 45 | 1.245 | 0.028 | Text |
| entertain | 5000 | 2000 | 3000 | 640 | 21 | 1.420 | 0.068 | Text |
| society | 5000 | 2000 | 3000 | 636 | 27 | 1.692 | 0.063 | Text |
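The last two statistics in Table 3 are related by a simple identity: label density is label cardinality normalized by the number of labels. Using the standard definitions:

```latex
% Label cardinality is the mean number of labels per instance; label
% density normalizes it by the number of labels q.
\[
  \mathrm{LCard}(D) = \frac{1}{n}\sum_{x=1}^{n}\lvert Y_x \rvert ,
  \qquad
  \mathrm{LDen}(D) = \frac{\mathrm{LCard}(D)}{q}
\]
% Sanity check on the scene row of Table 3: 1.074 / 6 ~ 0.179.
```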
Table 4. Performance comparison results of nine methods on the HL metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 0.1413 ± 0.0206 | 0.1704 ± 0.0097 | 0.1492 ± 0.0064 | 0.1473 ± 0.0066 | 0.1734 ± 0.003 | 0.1419 ± 0.0099 | 0.1663 ± 0.0063 | 0.1458 ± 0.0102 | 0.1411 ± 0.019 |
| yeast | 0.2257 ± 0.0126 | 0.2302 ± 0.0041 | 0.2278 ± 0.0029 | 0.2279 ± 0.0037 | 0.2332 ± 0.0044 | 0.2263 ± 0.0035 | 0.2319 ± 0.0042 | 0.2303 ± 0.0026 | 0.2305 ± 0.0028 |
| computers | 0.0407 ± 0.0008 | 0.0449 ± 0.0002 | 0.044 ± 0.0005 | 0.0441 ± 0.0005 | 0.0434 ± 0.0005 | 0.0429 ± 0.0007 | 0.0433 ± 0.0006 | 0.0426 ± 0.0012 | 0.0432 ± 0.0007 |
| health | 0.0441 ± 0.0026 | 0.0502 ± 0.001 | 0.0483 ± 0.0005 | 0.0493 ± 0.0006 | 0.0485 ± 0.0011 | 0.0452 ± 0.0011 | 0.0442 ± 0.0013 | 0.0471 ± 0.0009 | 0.0473 ± 0.0015 |
| reference | 0.0305 ± 0.0014 | 0.0313 ± 0.0012 | 0.0322 ± 0.0012 | 0.0336 ± 0.001 | 0.0329 ± 0.0002 | 0.0312 ± 0.0007 | 0.0321 ± 0.0009 | 0.0315 ± 0.0006 | 0.0314 ± 0.0009 |
| social | 0.0266 ± 0.0021 | 0.0317 ± 0.0013 | 0.0303 ± 0.0005 | 0.0309 ± 0.0003 | 0.0287 ± 0.0007 | 0.0274 ± 0.0007 | 0.0282 ± 0.0006 | 0.0266 ± 0.0012 | 0.0281 ± 0.0009 |
| medical | 0.0171 ± 0.0008 | 0.0165 ± 0.0021 | 0.0196 ± 0.001 | 0.0197 ± 0.0011 | 0.0233 ± 0.0002 | 0.0175 ± 0.001 | 0.0174 ± 0.001 | 0.0218 ± 0.0001 | 0.0177 ± 0.0015 |
| entertain | 0.0637 ± 0.0017 | 0.0658 ± 0.0008 | 0.0657 ± 0.0013 | 0.0671 ± 0.0011 | 0.0659 ± 0.0014 | 0.0631 ± 0.0014 | 0.0654 ± 0.0011 | 0.0615 ± 0.0012 | 0.0641 ± 0.0011 |
| society | 0.0587 ± 0.0007 | 0.0596 ± 0.0009 | 0.0587 ± 0.0004 | 0.0597 ± 0.0009 | 0.0594 ± 0.0003 | 0.058 ± 0.0006 | 0.0586 ± 0.0007 | 0.0582 ± 0.0005 | 0.0589 ± 0.001 |
| average | 0.0722 | 0.0778 | 0.0751 | 0.0755 | 0.0788 | 0.0726 | 0.0764 | 0.074 | 0.0739 |

Bold indicates the best classification performance.
Table 5. Performance comparison results of nine methods on the CE metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 1.9849 ± 0.2984 | 2.9801 ± 0.434 | 2.3015 ± 0.2357 | 2.3129 ± 0.2443 | 2.7828 ± 0.1086 | 2.2297 ± 0.2913 | 2.6974 ± 0.4179 | 2.3396 ± 0.3783 | 2.3442 ± 0.4845 |
| yeast | 7.9796 ± 0.2589 | 9.0812 ± 0.506 | 8.7833 ± 0.2726 | 8.9352 ± 0.3673 | 9.0711 ± 0.3446 | 8.9035 ± 0.3516 | 8.9928 ± 0.3234 | 9.2325 ± 0.3927 | 8.9751 ± 0.332 |
| computers | 6.3673 ± 0.1149 | 7.5371 ± 0.5373 | 7.2455 ± 0.2641 | 7.1926 ± 0.2168 | 7.1822 ± 0.2216 | 7.1585 ± 0.2369 | 6.9352 ± 0.196 | 7.0722 ± 0.2511 | 7.1399 ± 0.2279 |
| health | 4.8104 ± 0.3185 | 6.228 ± 0.376 | 5.7394 ± 0.1555 | 5.7229 ± 0.1426 | 5.8251 ± 0.1802 | 5.7664 ± 0.1544 | 4.7298 ± 0.1364 | 5.8252 ± 0.1703 | 5.8838 ± 0.1783 |
| reference | 5.0328 ± 0.1062 | 5.949 ± 0.3156 | 5.6561 ± 0.1973 | 5.6117 ± 0.1147 | 5.6353 ± 0.1623 | 5.7063 ± 0.3418 | 5.6452 ± 0.3112 | 5.9714 ± 0.332 | 5.6892 ± 0.2235 |
| social | 5.4449 ± 0.1685 | 6.955 ± 0.4051 | 6.1474 ± 0.191 | 6.2101 ± 0.1976 | 6.0108 ± 0.3113 | 5.8175 ± 0.3354 | 5.9043 ± 0.2704 | 6.0501 ± 0.3354 | 6.028 ± 0.302 |
| medical | 5.1658 ± 0.4137 | 6.1604 ± 0.4141 | 6.3598 ± 0.4012 | 6.4201 ± 0.4025 | 8.3118 ± 0.1098 | 5.8078 ± 0.2927 | 5.7868 ± 0.2699 | 7.1824 ± 0.0631 | 5.8668 ± 0.6055 |
| entertain | 5.1016 ± 0.1243 | 5.9338 ± 0.5407 | 5.7088 ± 0.2277 | 5.6683 ± 0.2167 | 5.7602 ± 0.1751 | 5.5795 ± 0.2664 | 5.6386 ± 0.2137 | 5.7098 ± 0.2397 | 5.6899 ± 0.2259 |
| society | 7.8775 ± 0.2345 | 8.6349 ± 0.4791 | 8.4876 ± 0.2665 | 8.4146 ± 0.2669 | 8.5074 ± 0.2349 | 8.3791 ± 0.3111 | 8.3525 ± 0.3782 | 8.3738 ± 0.3093 | 8.4163 ± 0.3102 |
| average | 5.5294 | 6.6066 | 6.2699 | 6.2765 | 6.5652 | 6.1498 | 6.0758 | 6.4174 | 6.2259 |

Bold indicates the best classification performance.
Table 6. Performance comparison results of nine methods on the RL metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 0.1763 ± 0.0596 | 0.3751 ± 0.0865 | 0.2395 ± 0.0478 | 0.2415 ± 0.0493 | 0.3366 ± 0.0216 | 0.2249 ± 0.059 | 0.318 ± 0.0842 | 0.2467 ± 0.0759 | 0.248 ± 0.0968 |
| yeast | 0.2053 ± 0.0168 | 0.2703 ± 0.0302 | 0.2454 ± 0.0096 | 0.2548 ± 0.0183 | 0.2653 ± 0.0146 | 0.2564 ± 0.0149 | 0.2586 ± 0.0158 | 0.2678 ± 0.0204 | 0.2596 ± 0.0222 |
| computers | 0.1186 ± 0.0023 | 0.1515 ± 0.0146 | 0.1389 ± 0.006 | 0.1367 ± 0.0045 | 0.1398 ± 0.0061 | 0.1364 ± 0.0057 | 0.1307 ± 0.005 | 0.1364 ± 0.0067 | 0.1367 ± 0.0058 |
| health | 0.0729 ± 0.0083 | 0.1148 ± 0.0106 | 0.0979 ± 0.0043 | 0.0983 ± 0.0042 | 0.1013 ± 0.0051 | 0.0979 ± 0.0041 | 0.2005 ± 0.0062 | 0.1001 ± 0.0044 | 0.1028 ± 0.0049 |
| reference | 0.1069 ± 0.0032 | 0.0313 ± 0.0012 | 0.1254 ± 0.0065 | 0.124 ± 0.0039 | 0.1251 ± 0.005 | 0.1278 ± 0.0107 | 0.1256 ± 0.0097 | 0.1355 ± 0.0105 | 0.127 ± 0.0069 |
| social | 0.0886 ± 0.0039 | 0.0317 ± 0.0013 | 0.1043 ± 0.0041 | 0.1058 ± 0.0044 | 0.1025 ± 0.0075 | 0.0976 ± 0.0064 | 0.0983 ± 0.0061 | 0.1022 ± 0.0079 | 0.1011 ± 0.0068 |
| medical | 0.0726 ± 0.008 | 0.0897 ± 0.0093 | 0.0951 ± 0.0092 | 0.0963 ± 0.0091 | 0.1398 ± 0.0024 | 0.0833 ± 0.0064 | 0.0829 ± 0.006 | 0.1139 ± 0.0012 | 0.0848 ± 0.0131 |
| entertain | 0.1598 ± 0.006 | 0.2004 ± 0.0265 | 0.1885 ± 0.011 | 0.186 ± 0.0101 | 0.1896 ± 0.0082 | 0.1826 ± 0.0128 | 0.1855 ± 0.0102 | 0.1888 ± 0.0113 | 0.188 ± 0.0108 |
| society | 0.1845 ± 0.006 | 0.0596 ± 0.0009 | 0.2068 ± 0.0081 | 0.203 ± 0.0085 | 0.2075 ± 0.0081 | 0.2034 ± 0.0105 | 0.2032 ± 0.0134 | 0.203 ± 0.0102 | 0.2054 ± 0.0112 |
| average | 0.1317 | 0.1472 | 0.1602 | 0.1607 | 0.1786 | 0.1567 | 0.1782 | 0.1485 | 0.1647 |

Bold indicates the best classification performance.
Table 7. Performance comparison results of nine methods on the AP metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 0.7331 ± 0.07 | 0.4978 ± 0.0654 | 0.6169 ± 0.0503 | 0.6197 ± 0.0522 | 0.5129 ± 0.0171 | 0.6362 ± 0.0658 | 0.5443 ± 0.0803 | 0.6165 ± 0.0825 | 0.6258 ± 0.0873 |
| yeast | 0.7199 ± 0.0198 | 0.6441 ± 0.0408 | 0.6791 ± 0.0134 | 0.6728 ± 0.0199 | 0.653 ± 0.0197 | 0.6648 ± 0.019 | 0.6683 ± 0.0193 | 0.6524 ± 0.0253 | 0.6615 ± 0.0267 |
| computers | 0.6015 ± 0.0068 | 0.514 ± 0.0364 | 0.5407 ± 0.013 | 0.5402 ± 0.0158 | 0.5256 ± 0.0178 | 0.5416 ± 0.0164 | 0.5494 ± 0.0163 | 0.5481 ± 0.0155 | 0.5399 ± 0.017 |
| health | 0.6734 ± 0.0272 | 0.5407 ± 0.0308 | 0.5617 ± 0.0201 | 0.5583 ± 0.0138 | 0.5566 ± 0.0163 | 0.5594 ± 0.0203 | 0.6692 ± 0.0122 | 0.5578 ± 0.023 | 0.5539 ± 0.0226 |
| reference | 0.5824 ± 0.0099 | 0.5089 ± 0.0273 | 0.5204 ± 0.0186 | 0.505 ± 0.028 | 0.5111 ± 0.0176 | 0.5182 ± 0.0188 | 0.5233 ± 0.0205 | 0.5062 ± 0.0195 | 0.5209 ± 0.017 |
| social | 0.6428 ± 0.024 | 0.5183 ± 0.0254 | 0.5671 ± 0.0121 | 0.5628 ± 0.0134 | 0.5423 ± 0.0237 | 0.5731 ± 0.0173 | 0.5674 ± 0.0245 | 0.571 ± 0.0221 | 0.5671 ± 0.0246 |
| medical | 0.7295 ± 0.0368 | 0.6599 ± 0.0576 | 0.6056 ± 0.032 | 0.591 ± 0.026 | 0.4482 ± 0.0067 | 0.6532 ± 0.0268 | 0.6515 ± 0.0248 | 0.5055 ± 0.0021 | 0.6493 ± 0.0465 |
| entertain | 0.5279 ± 0.0246 | 0.4182 ± 0.0304 | 0.4319 ± 0.0179 | 0.4473 ± 0.0128 | 0.4361 ± 0.0094 | 0.4382 ± 0.018 | 0.4388 ± 0.0158 | 0.4229 ± 0.0152 | 0.4298 ± 0.0176 |
| society | 0.5256 ± 0.0071 | 0.4441 ± 0.0327 | 0.4865 ± 0.0092 | 0.4911 ± 0.0139 | 0.4684 ± 0.0149 | 0.4792 ± 0.0158 | 0.474 ± 0.0199 | 0.571 ± 0.0221 | 0.481 ± 0.0213 |
| average | 0.6373 | 0.5273 | 0.5567 | 0.5542 | 0.5171 | 0.5627 | 0.5651 | 0.5502 | 0.5588 |

Bold indicates the best classification performance.
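For reference, all four metrics reported in Tables 4–7 (HL, CE, RL, and AP) are available in scikit-learn. Below is a minimal, self-contained sketch; the binary-relevance KNN stand-in, the synthetic data shapes, and the variable names are our illustrative assumptions rather than the paper's experimental setup, and note that scikit-learn's coverage_error is the 1-based count of labels needed to cover all true labels:

```python
import numpy as np
from sklearn.metrics import (hamming_loss, coverage_error,
                             label_ranking_loss,
                             label_ranking_average_precision_score)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier

# Illustrative shapes: X_* are (n, d) arrays restricted to the selected
# feature subset; Y_* are (n, q) binary label matrices.
rng = np.random.default_rng(0)
X_train, Y_train = rng.random((100, 20)), rng.integers(0, 2, (100, 5))
X_test, Y_test = rng.random((40, 20)), rng.integers(0, 2, (40, 5))

# Binary-relevance KNN as a simple multi-label stand-in classifier.
clf = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=10))
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)
# Per-label probability of the positive class, used by the ranking metrics.
Y_score = np.column_stack([p[:, 1] for p in clf.predict_proba(X_test)])

print("HL:", hamming_loss(Y_test, Y_pred))
print("CE:", coverage_error(Y_test, Y_score))   # sklearn's CE is 1-based
print("RL:", label_ranking_loss(Y_test, Y_score))
print("AP:", label_ranking_average_precision_score(Y_test, Y_score))
```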
Table 8. Average ranks of the nine feature selection methods on the four evaluation metrics.

| Methods | Hamming Loss | Coverage Error | Ranking Loss | Average Precision |
|---|---|---|---|---|
| MLCFS | 1.78 | 1.11 | 1.33 | 1.11 |
| MIFS | 6.56 | 8.33 | 5.78 | 8 |
| D2F | 5.78 | 5.33 | 5.22 | 4.44 |
| PMU | 7.44 | 4.89 | 4.67 | 5.11 |
| SCLS | 7.67 | 6.56 | 7.33 | 7.33 |
| LRFS | 2.56 | 3.67 | 3.67 | 3.89 |
| FIMF | 4.89 | 3.33 | 4.78 | 3.89 |
| IDA | 3.67 | 6.33 | 6 | 5.67 |
| MFSJMI | 4.44 | 5.44 | 5.78 | 5.44 |
Table 9. Friedman statistics ($\chi_F^2$ and $F_F$) and the critical value.

| Evaluation Metrics | $\chi_F^2$ | $F_F$ | Critical Value |
|---|---|---|---|
| Hamming Loss | 38.9293 | 9.4172 | 2.102 |
| Coverage Error | 42.2351 | 11.3516 | 2.102 |
| Ranking Loss | 22.4270 | 3.6192 | 2.102 |
| Average Precision | 38.1521 | 9.0173 | 2.102 |
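As a quick arithmetic check of Table 9, the $F_F$ values follow from $\chi_F^2$ via the standard Iman–Davenport correction of the Friedman test, with $N = 9$ datasets and $k = 9$ methods:

```latex
% Iman--Davenport correction: N(k-1) = 72 and (N-1) = 8.
\[
  F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}
      = \frac{8 \times 38.9293}{72 - 38.9293}
      \approx 9.4172
\]
% This matches the Hamming Loss row of Table 9; F_F is then compared
% against the critical value of the F(k-1, (k-1)(N-1)) = F(8, 64)
% distribution.
```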