Article

Multi-Label Feature Selection Method Based on Maximum Label Complexity Ratio

1 School of Artificial Intelligence, Hebei University of Technology, Tianjin 300131, China
2 Hebei Province Key Laboratory of Big Data Calculation, Tianjin 300131, China
3 Aviation University of Airforce, Changchun 130022, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 525; https://doi.org/10.3390/electronics15030525
Submission received: 24 December 2025 / Revised: 18 January 2026 / Accepted: 23 January 2026 / Published: 26 January 2026
(This article belongs to the Special Issue New Trends for Feature Selection Applied in Data Mining)

Abstract

Multi-label feature selection, which aims to select reliable and information-rich feature subsets from high-dimensional multi-label data, plays a critical role in data mining and pattern recognition. Conventional information-theoretic methods approximate the high-order correlation between candidate features and the multi-dimensional label set by aggregating low-order mutual information between features and individual labels. However, this strategy inherently assumes all labels are equally significant, thereby overlooking their intricate distributions. To address this limitation, we first define a novel label complexity ratio based on information entropy and mutual information. We then quantify and dynamically update this ratio for each label, accounting for varying label correlations and the differential influence of selected features. Finally, we propose a new feature selection method that jointly considers the correlation with the currently most complex label, the redundancy between candidate and already-selected features, and the interaction information among these three elements to identify a high-quality feature subset. Comprehensive experiments on nine benchmark multi-label datasets demonstrate that the proposed method achieves superior performance compared to eight state-of-the-art multi-label feature selection methods.

1. Introduction

With the rapid advancement of big data and artificial intelligence technologies, vast amounts of data are being generated and stored in numerous applications. This data exhibits growing trends toward complexity and diversity. Increasingly, data objects are characterized by high-dimensional features and are associated with multiple semantic labels simultaneously. For instance, a news document may be represented by tens of thousands of word features and annotated with topics such as “economy,” “culture,” and “sports”. By analyzing the distribution patterns of multi-label training data, multi-label learning algorithms can perform multi-label classification for unseen instances [1]. This capability has led to broad applications in areas such as sentiment analysis, functional genomics classification, and image annotation [2].
However, the “curse of dimensionality” inherent in high-dimensional data significantly increases the complexity and computational burden of learning algorithms. Such high-dimensional multi-label datasets not only contain features relevant to the label set, but also include a substantial number of irrelevant and redundant features. The presence of these irrelevant and redundant features can lead to overfitting in learning models [3,4,5], substantially compromising algorithmic effectiveness. Consequently, selecting a compact yet informative feature subset that is closely related to the label set from high-dimensional multi-label data has become a critical and challenging task [6]. To address this issue, researchers have developed various multi-label feature selection methods [7,8]. These methods aim to identify a relevant feature subset from the original high-dimensional feature space while eliminating those that are irrelevant or redundant [9,10].
Based on the employed selection strategy, multi-label feature selection methods are generally categorized into three types: filter, wrapper, and embedded methods [11]. Filter methods operate independently of any specific learning algorithm and do not interfere with subsequent model training [12]. Wrapper methods evaluate feature subsets by directly assessing the classification performance of a designated predictor [13]. Embedded methods integrate the feature selection process into the training phase of the learning algorithm itself. Unlike wrapper and embedded approaches [14,15], filter methods are classifier-agnostic, offering advantages such as high computational efficiency and strong scalability [16,17,18]. In this work, we introduce a novel feature evaluation criterion following the filter-based paradigm.
In filter-based approaches, information theory provides a widely adopted evaluation criterion capable of capturing both linear and nonlinear feature relationships, thereby offering a quantitative measure of feature importance. Numerous multi-label feature selection methods founded on information theory have been developed. Broadly speaking, these methods assess features from two main perspectives: the relevance between features and the label space, and the redundancy among features [19,20,21,22]. Unlike single-label feature selection, multi-label scenarios must consider correlations between features and multiple labels. To address the challenge of evaluating feature relationships with the high-dimensional label set, many existing methods approximate this relationship by accumulating low-order mutual information between candidate features and individual labels. Other methods employ the conditional mutual information between features and labels, conditioned on other labels, to estimate the correlation with the label space [23,24,25]. This accumulation strategy is fundamentally premised on the assumption that all labels are equally significant, which introduces several key limitations when assessing candidate feature relevance: (1) it fails to differentiate the information distributions associated with different labels; (2) it neglects the dynamic interrelationships among labels; and (3) it overlooks the varying influence that selected features may exert on different labels. Specifically, the information distribution of each label exhibits varying degrees of complexity. Labels with more intricate distributions contain richer information and are therefore relatively more significant, implying that more relevant features are needed for their adequate representation. Beyond the inherent complexity of individual labels, the relationships among different labels must also be considered. While some prior improvements have acknowledged label relationships, they have not effectively distinguished or quantified label complexity. Moreover, during the feature selection process, as selected features progressively capture label information, the complexity of the corresponding labels changes dynamically. It is therefore essential to holistically consider label complexity under multiple influencing factors, ensuring the final selected feature subset sufficiently represents the intricate distribution of label information.
To quantify the complexity of various label distributions, we introduce a novel criterion termed the label complexity ratio (LCratio), derived from entropy and mutual information. Guided by this measure and rooted in information-theoretic principles, we propose a multi-label feature selection method called Multi-label Complexity Feature Selection (MLCFS). MLCFS selects features by dynamically focusing on the currently most complex label, while jointly evaluating feature-label correlation, feature–feature redundancy, and the interaction among features and labels. The goal is to select a compact yet highly informative feature subset. The detailed methodology is presented in Section 4. The main contributions of this work are summarized as follows:
(1)
We systematically investigate how dynamic changes in label complexity influence feature relevance assessment. To quantify this effect, we introduce a dynamic label complexity ratio derived from label information entropy and mutual information.
(2)
A novel multi-label feature selection method named MLCFS is proposed. This method comprehensively addresses the correlation and redundancy among features, as well as the interaction information between features and labels. Additionally, it takes into account the variations in label complexity.
(3)
To verify the effectiveness of MLCFS, experiments are conducted on nine publicly available multi-label datasets. This study compares the proposed method with eight established multi-label feature selection methods. The experimental results demonstrate that MLCFS outperforms the other comparative methods across multiple evaluation metrics, effectively reducing data dimensionality and enhancing the classification performance.
The remainder of this paper is organized as follows. Section 2 introduces basic concepts of information theory. Section 3 briefly reviews related work. Section 4 describes the proposed multi-label feature selection method, MLCFS, in detail. Section 5 presents and analyzes the experimental results to verify the effectiveness of the proposed method. Section 6 concludes the paper.

2. Preliminaries

The Basic Concepts of Information Theory

In this section, we introduce two fundamental information-theoretic concepts central to our feature selection framework: mutual information and conditional mutual information [26,27]. Mutual information measures the amount of information shared between two variables, reflecting their degree of correlation. Formally, mutual information is defined as follows:
$$I(X;Y) = H(X) - H(X|Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i, y_j)\log\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)} \quad (1)$$
where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are two discrete random variables, $p(x_i)$ and $p(y_j)$ are the marginal probability density functions, i.e., $p(x_i) = \frac{count(X = x_i)}{n}$ and $p(y_j) = \frac{count(Y = y_j)}{m}$, and $p(x_i, y_j)$ is the joint probability density function, computed by $p(x_i, y_j) = \frac{count(X = x_i \wedge Y = y_j)}{nm}$, where $count(\cdot)$ denotes the number of occurrences of the given values. $H(X)$ is the entropy used to measure the uncertainty of $X$, computed by $H(X) = -\sum_{x_i \in X} p(x_i)\log p(x_i)$. $H(X|Y)$ is the conditional entropy measuring the remaining uncertainty of $X$ given $Y$, computed by $H(X|Y) = -\sum_{x_i \in X}\sum_{y_j \in Y} p(x_i, y_j)\log p(x_i | y_j)$. The larger the mutual information, the more information the two random variables share, and the greater the correlation between them.
Conditional mutual information quantifies the interdependence between two random variables when a third variable is known. Let Z = { z 1 , z 2 , , z k } be another discrete random variable. The definition of the conditional mutual information between the random variables X and Y given the random variable Z is as follows:
$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = \sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{l=1}^{k} p(x_i, y_j, z_l)\log\frac{p(x_i, y_j | z_l)}{p(x_i | z_l)\,p(y_j | z_l)} \quad (2)$$
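To make these definitions concrete, the following minimal sketch estimates entropy, mutual information, and conditional mutual information from discrete sequences using plug-in frequency estimates. This is our own illustration rather than code from the paper, and the function names are ours; later sketches in Section 4 reuse these helpers.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Plug-in Shannon entropy H(X) in bits of a discrete sequence."""
    n = len(x)
    probs = np.array([c / n for c in Counter(x).values()])
    return float(-np.sum(probs * np.log2(probs)))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(z) - entropy(list(zip(x, y, z))))

# Quick check: x fully determines y, while z is independent noise.
x, y, z = [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]
print(mutual_information(x, y))                 # 1.0 bit
print(conditional_mutual_information(x, y, z))  # 1.0 bit
```

The identities used here (e.g., $I(X;Y) = H(X) + H(Y) - H(X,Y)$) are algebraically equivalent to Formulas (1) and (2).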

3. Related Work

The primary goal of multi-label feature selection is to identify a subset of features that are highly relevant to the label space from a high-dimensional candidate set. In recent years, researchers have proposed numerous methods for this task, which generally fall into two main categories based on how they handle multi-label data: problem transformation-based methods and algorithm adaptation-based methods. Problem transformation-based approaches first convert multi-label data into one or more single-label problems, using techniques such as Binary Relevance (BR) [28], Label Powerset (LP) [29], or Pruned Problem Transformation (PPT) [30]. A conventional single-label feature selection method is then applied to the transformed data. A key limitation of this paradigm is that it often fails to capture correlations among labels. In contrast, algorithm adaptation-based methods operate directly on the original multi-label dataset to select an optimal feature subset. A growing number of such methods have been developed [31,32], which explicitly leverage inter-label relationships to guide the selection of more informative features.
In recent years, algorithm adaptation-based multi-label feature selection methods have gained considerable attention. Jian et al. [33] proposed MIFS, an embedding-based method that uncovers label correlations through latent semantic analysis while jointly performing label decomposition and feature selection via a regression model. Fan et al. [34] introduced LCIFS, a method that incorporates label relationships by jointly modeling label correlations and feature redundancy. Specifically, LCIFS employs adaptive spectral graph learning to capture label structural correlations and fits the feature–label relationship using manifold-based regression, while also utilizing feature correlations to reduce redundancy in the selected subset. Dai et al. [35] presented an approach that evaluates features from a global correlation perspective and integrates this prior knowledge into an orthogonal regression optimization framework. Yin et al. [36] proposed LEFMIFS, which embeds label enhancement into feature selection. LEFMIFS first converts logical labels into real-valued label distributions and then incorporates them into a fuzzy mutual information-based feature evaluation function for multi-label feature assessment.
Information theory has been widely adopted in multi-label feature selection for quantifying nonlinear feature relationships. Sun et al. [37] developed a method that combines mutual information with constrained convex optimization to fully capture feature-label correlations. Gonzalez-Lopez et al. [38] introduced a Gaussian Mixture Model (GMM) approach that selects the optimal feature subset by maximizing the geometric mean of the mutual information between features and each label. Lee et al. devised a series of feature-evaluation measures, including D2F [39], PMU [40], and SCLS [41], which assess feature relevance and redundancy from different angles. Zhang et al. [42] incorporated conditional mutual information and proposed a label-redundancy-aware method termed LRFS. Pan et al. [43] presented an approximation of three-way interaction information, referred to as IDA in this paper, for evaluating feature correlation and redundancy. The FIMF method [44] is a fast information-theoretic technique that accelerates correlation measurement by omitting redundant entropy computations while emphasizing high-entropy labels. Zhang et al. [45] proposed MFSJMI, which embeds joint mutual information and interaction weights into the evaluation function by decomposing joint mutual information and considering multi-label correlations. By examining the feature-evaluation criteria used in the information-theoretic methods above, we can summarize them into the following unified framework:
$$J(f_k) = \mathrm{Relevance}(f_k; L) - \mathrm{Redundancy}(f_k; S) \quad (3)$$
where $f_k$ denotes the candidate feature, $L$ denotes the label set, and $S$ is the selected feature subset. $J(f_k)$ represents the evaluation criterion for the candidate feature $f_k$, with larger values indicating greater importance. $\mathrm{Relevance}(f_k; L)$ represents the correlation between $f_k$ and the label set $L$, while $\mathrm{Redundancy}(f_k; S)$ represents the redundancy between $f_k$ and the selected feature set $S$. Symbol explanations can be found in Appendix A Table A1. Table 1 summarizes the feature evaluation criteria proposed by the above representative methods based on the evaluation framework in Formula (3). Specifically, in the series of feature selection evaluation functions proposed by Lee et al., such as D2F, PMU, and SCLS, $\mathrm{Relevance}(f_k; L)$ is calculated as the sum of mutual information between the candidate feature and each label, that is, $\sum_{l_i \in L} I(f_k; l_i)$. In addition, Zhang et al. measured $\mathrm{Relevance}(f_k; L)$ using the accumulated conditional mutual information or joint mutual information of candidate features and paired labels, that is, $\sum_{l_i \neq l_j,\, l_j \in L} I(f_k; l_j \mid l_i)$. This analysis shows that existing methods commonly adopt an accumulation strategy to quantify the correlation of candidate features. However, this strategy computes the correlation between features and all labels uniformly, without fine-grained differentiation or measurement of label information. To address this limitation, this paper proposes a novel label importance measure, based on which a precise assessment of candidate feature correlation is achieved.
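For illustration, the accumulation strategy discussed above drops directly into the framework of Formula (3). The sketch below is ours; it reuses the mutual_information helper from Section 2, assumes features and labels are discrete columns, and mirrors the sum-of-MI relevance and averaged-redundancy terms found in Table 1.

```python
def relevance_accumulated(f_k, labels):
    """Relevance(f_k; L): sum of I(f_k; l_i) over all labels, treating every
    label as equally significant -- the assumption this paper challenges."""
    return sum(mutual_information(f_k, l_i) for l_i in labels)

def redundancy_average(f_k, selected):
    """Redundancy(f_k; S): average I(f_k; f_s) over the selected features."""
    if not selected:
        return 0.0
    return sum(mutual_information(f_k, f_s) for f_s in selected) / len(selected)

def j_framework(f_k, labels, selected):
    """Formula (3): J(f_k) = Relevance(f_k; L) - Redundancy(f_k; S)."""
    return relevance_accumulated(f_k, labels) - redundancy_average(f_k, selected)
```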

4. Proposed Feature Selection Method

In Section 4.1, we analyze the dynamic changes in the information carried by labels during the feature selection process. In Section 4.2, we define a new term, Label Complexity Ratio (LCratio), and analyze the cases in which the selected feature set is empty and non-empty. In Section 4.3, we propose a new feature selection method called Multi-label Complexity Feature Selection (MLCFS) and provide its pseudo-code.

4.1. The Dynamic Changes in the Label Information

In this section, we present a detailed description of the feature selection process for analyzing dynamic changes in the distribution of label information.
Information-theoretic multi-label feature selection methods typically employ a sequential forward strategy, iteratively adding one feature at a time to construct the optimal subset. To clarify this process, we present a schematic illustration in Figure 1.
Let $\{l_i, l_j, l_k, l_p\}$ denote the label space. $S_{ii}$ and $S_{jj}$ represent feature subsets selected at two different stages of the feature selection process, with $S_{ii} \subset S_{jj}$, indicating that $S_{ii}$ corresponds to an earlier stage than $S_{jj}$. Specifically, when the selected feature subset is empty, the labels' intrinsic information and the structural relationships among labels are illustrated in Figure 1a, where the size of each region represents the amount of information carried by a label, and the intersection between the information regions of two labels reflects their correlation. In the initial stage, labels $l_i$ and $l_j$ each carry substantial information, and $l_i$ shows stronger correlations with $l_k$ and $l_p$. Therefore, during the initial stage of feature selection, prioritizing features highly relevant to the label $l_i$ helps capture more of the semantic information within the label space.
Subsequently, the selected feature subset $S_{ii}$ gradually incorporates relevant features. As the selected features increasingly reflect label semantics, the remaining uncertainty in the label space decreases, leading to a corresponding change in label information complexity. As shown in Figure 1b, $S_{ii}$ captures more information about $l_i$ and $l_p$, while label $l_j$ still contains a significant portion of unrepresented information. Thus, in subsequent iterations, selecting features strongly associated with $l_j$ enhances the overall representational capacity of the final feature subset. Further, Figure 1c illustrates that after an additional round of feature selection, $S_{ii}$ is updated to a new subset $S_{jj}$. At this point, $l_k$ becomes the label with the highest remaining information content. Through the iterative sequential forward selection process, the distribution of label information changes dynamically. At each step, selecting features that are strongly correlated with the currently most informative label effectively maximizes the coverage of label information, thereby constructing a semantically richer and more representative feature subset.

4.2. Quantification of the Complexity of Labels

This section introduces a novel term to capture the dynamic variations in the complexity of label information, offering more effective guidance for evaluating feature relevance. This term incorporates the influence of both label relationships and the currently selected features, enabling its calculation and dynamic update [46] throughout the selection process.
Definition 1. 
Let $L = \{l_1, l_2, \ldots, l_q\}$ be the set of labels, where $q$ is the number of labels and $l_i$ denotes the i-th label, $1 \le i \le q$. When the selected feature set $S$ is empty, the complexity ratio of the label $l_i$ is defined as follows:
$$LCratio(l_i) = \frac{1}{2} \times \left\{ H(l_i) + \frac{1}{|L|-1} \sum_{l_j \in L \setminus l_i} \frac{2\, I(l_i; l_j)}{H(l_i) + H(l_j)} \right\} \quad (4)$$
where $H(l_i)$ quantifies the uncertainty of the information distribution of the label $l_i$ itself, and $I(l_i; l_j)$ measures the relationship between the label $l_i$ and the other labels in the label set $L$. $LCratio(l_i)$ captures the assessment of label complexity by integrating label entropy and mutual information among labels. A higher value of $LCratio(l_i)$ indicates a more complex distribution of label $l_i$ and stronger inter-label correlations, implying that the information carried by label $l_i$ is more important and thus requires more features during the feature selection process for adequate representation.
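A direct transcription of Formula (4) might look as follows. This sketch reuses the entropy and mutual_information helpers from Section 2, represents the label space Y as a list of discrete label columns, and assumes at least two labels with non-zero entropy; the names are ours.

```python
def lc_ratio(i, Y):
    """LCratio(l_i) of Formula (4): half of the label's own entropy plus its
    mean normalized mutual information with every other label."""
    l_i = Y[i]
    others = [Y[j] for j in range(len(Y)) if j != i]
    nmi = [2 * mutual_information(l_i, l_j) / (entropy(l_i) + entropy(l_j))
           for l_j in others]
    return 0.5 * (entropy(l_i) + sum(nmi) / len(others))
```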
As the forward sequential search progresses, more relevant features are iteratively added to the selected subset, which in turn captures the information of different labels to varying degrees. Therefore, the complexity of each label must be dynamically updated to reflect the influence of the features already selected. The specific update formula is given below:
$$LCratio(l_i, S) = \frac{1}{2} \times \left\{ \min_{f_{\max} \in S} H(l_i \mid f_{\max}) + \frac{1}{|L|-1} \sum_{l_j \in L \setminus l_i} \frac{2\, I(l_i; l_j \mid f_{\max})}{H(l_i \mid f_{\max}) + H(l_j \mid f_{\max})} \right\} \quad (5)$$
where the first term represents the remaining uncertainty of the label $l_i$ given the feature $f_{\max}$. Based on the principle that a smaller conditional entropy indicates that the known variable provides more information, the strategy selects the already-selected feature $f_{\max}$ from $S$ that has the greatest influence on $l_i$, which also reduces computational complexity. Since $l_i$ exhibits the least remaining uncertainty under the influence of $f_{\max}$, $f_{\max}$ captures the most informative content from $l_i$, thereby reducing the label complexity ratio. The second term accounts for the dynamic changes in the relationship between the label $l_i$ and the other labels in the label set $L$ conditioned on $f_{\max}$.
Furthermore, based on the definition of the label complexity ratio, both $LCratio(l_i)$ and $LCratio(l_i, S)$ are bounded within the interval [0, 1], with the scaling factor of 1/2 ensuring the values remain within this range. Consequently, the value of $LCratio(l_i)$ intuitively reflects the relative importance of the label $l_i$ within the entire label space, rather than merely representing an absolute amount of information. It can also be regarded as the probability that this label is selected as the most descriptive label at the current stage. The proof is as follows:
Proof. 
According to the definition and properties of entropy and mutual information in information theory, the information entropy of the (binary) label $l_i$ satisfies $0 \le H(l_i) \le 1$. Since $0 \le I(l_i; l_j) \le H(l_i)$ and $0 \le I(l_i; l_j) \le H(l_j)$, it follows that $0 \le 2I(l_i; l_j) \le H(l_i) + H(l_j)$, and further $0 \le \frac{2I(l_i; l_j)}{H(l_i) + H(l_j)} \le 1$. Averaging over the label set, $0 \le \frac{1}{|L|-1}\sum_{l_j \in L \setminus l_i} \frac{2I(l_i; l_j)}{H(l_i) + H(l_j)} \le 1$ is satisfied. In conclusion, $0 \le \frac{1}{2}\{H(l_i) + \frac{1}{|L|-1}\sum_{l_j \in L \setminus l_i} \frac{2I(l_i; l_j)}{H(l_i) + H(l_j)}\} \le 1$, that is, $0 \le LCratio(l_i) \le 1$. □
The bound for $LCratio(l_i, S)$ follows by the same reasoning.
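Formula (5) can be sketched analogously. Conditional entropies follow from the chain rule $H(A \mid F) = H(A, F) - H(F)$; the helpers from Section 2 are reused, the variable names are ours, and guards against zero-entropy denominators are omitted for brevity.

```python
def cond_entropy(a, f):
    """H(A|F) = H(A,F) - H(F) for discrete sequences."""
    return entropy(list(zip(a, f))) - entropy(f)

def lc_ratio_dynamic(i, Y, selected):
    """LCratio(l_i, S) of Formula (5): label complexity conditioned on the
    selected feature f_max that explains l_i best (minimum H(l_i | f))."""
    l_i = Y[i]
    f_max = min(selected, key=lambda f: cond_entropy(l_i, f))
    others = [Y[j] for j in range(len(Y)) if j != i]
    nmi = [2 * conditional_mutual_information(l_i, l_j, f_max)
           / (cond_entropy(l_i, f_max) + cond_entropy(l_j, f_max))
           for l_j in others]
    return 0.5 * (cond_entropy(l_i, f_max) + sum(nmi) / len(others))
```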

4.3. Proposed Method

This section proposes a new multi-label feature selection method, MLCFS, based on the label complexity ratio presented in Section 4.2, which adopts an information-theoretic measure to evaluate features through a two-stage interactive iterative strategy. In the first stage, the specific label $l_{\max}$, characterized by a complex information distribution and strong inter-label relationships within the label space, is identified using the label complexity ratio, which is calculated as follows:
$$l_{\max} = \begin{cases} \arg\max_{l_i \in L} \{LCratio(l_i)\}, & S = \emptyset \\ \arg\max_{l_i \in L} \{LCratio(l_i, S)\}, & \text{otherwise} \end{cases} \quad (6)$$
When the selected feature set $S$ is empty, $l_{\max}$ is the label corresponding to the maximum $LCratio(l_i)$ value; when $S$ is non-empty, $l_{\max}$ is the label corresponding to the maximum $LCratio(l_i, S)$ value.
In the second stage, a novel feature evaluation criterion is proposed. This criterion measures the correlation between candidate features and the label $l_{\max}$, while also accounting for redundancy relative to the already-selected features. Additionally, it incorporates the interaction information among these three components. The specific formulation is as follows:
$$J(f_k) = I(f_k; l_{\max}) - \frac{1}{|S|}\sum_{f_s \in S} I(f_k; f_s) - \frac{1}{|S|}\sum_{f_s \in S} \left\{ I(f_k; l_{\max}) - I(f_s; l_{\max} \mid f_k) \right\} \quad (7)$$
where $I(f_k; l_{\max})$ represents the correlation between the candidate feature $f_k$ and the label $l_{\max}$, $I(f_k; f_s)$ quantifies the redundancy between the candidate feature $f_k$ and the feature $f_s$ in the selected feature set $S$, and $I(f_k; l_{\max}) - I(f_s; l_{\max} \mid f_k)$ captures the interaction information among the candidate feature $f_k$, the selected feature $f_s$, and the label $l_{\max}$. The interaction term may be positive or negative. A negative value indicates that, conditioned on the feature $f_k$, $f_s$ provides more classification information for the label $l_{\max}$ than $f_k$ itself; a negative interaction term therefore translates into a positive contribution for the candidate feature $f_k$. This encourages the algorithm to favor features that complement the already-selected feature set rather than merely being redundant. A higher value of $J(f_k)$ indicates that $f_k$ provides greater representational and descriptive information about the label $l_{\max}$, implying a stronger correlation, while lower redundancy between $f_k$ and the selected features suggests that $f_k$ contributes more complementary information, thereby enhancing joint interactions within the feature set.
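A sketch of Formula (7), again reusing the helpers above (the empty-subset shortcut is our own simplification; with S empty the criterion reduces to the plain mutual information used in the first iteration of Algorithm 1):

```python
def j_mlcfs(f_k, l_max, selected):
    """Formula (7): relevance to the most complex label, minus average
    redundancy with S, minus the average interaction term."""
    rel = mutual_information(f_k, l_max)
    if not selected:
        return rel
    red = sum(mutual_information(f_k, f_s) for f_s in selected) / len(selected)
    # I(f_k; l_max) - I(f_s; l_max | f_k): negative when f_s adds more
    # information about l_max given f_k, which rewards complementary features.
    inter = sum(rel - conditional_mutual_information(f_s, l_max, f_k)
                for f_s in selected) / len(selected)
    return rel - red - inter
```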
After the current optimal candidate feature $f_{\max}$ is identified by maximizing $J(f_k)$, it is added to the selected feature set $S$. Following the forward sequential search strategy, the selection of the next optimal candidate feature begins with obtaining the corresponding specific label through $LCratio(l_i, S)$. At this stage, the inclusion of $f_{\max}$ influences the update of $LCratio(l_i, S)$, thereby reducing the complexity ratio of the label $l_{\max}$. Since the first term in $J(f_k)$ is maximized and $H(l_{\max})$ can be treated as approximately constant when evaluating all candidate features, $I(f_{\max}; l_{\max}) = H(l_{\max}) - H(l_{\max} \mid f_{\max})$ implies a smaller $H(l_{\max} \mid f_{\max})$ value. Consequently, according to Formula (5), the overall value of $LCratio(l_{\max}, S)$ decreases. This mechanism ensures that a label with a distinct complexity ratio is selected at each iteration, guiding the final feature set $S$ toward capturing more comprehensive and richer label-descriptive information. The algorithm workflow is illustrated in Figure 2. It contains two stages: selecting the specific label $l_{\max}$ carrying the most complex information distribution, and selecting the feature with the largest $J(f_k)$.
Based on the preceding analysis, feature selection necessitates not only a holistic consideration of feature correlation, redundancy, and feature–label interaction, but also an explicit awareness that distinct label distributions impose differing demands on the descriptive capacity of features. The pseudo-code of MLCFS is presented in Algorithm 1. According to the pseudo-code, the algorithm consists of three steps. Step 1 (lines 1–5) initializes the parameters and computes the complexity ratio of all labels. Step 2 (lines 7–13) selects the label with the highest label complexity ratio and adds the feature with the maximum mutual information for that label to the feature subset. Step 3 (lines 14–25) iteratively updates the label complexity ratio, selects the label with the highest updated complexity ratio, and adds the feature that maximizes Formula (7), repeating until the stopping condition is met. The third step consists of two stages: Stage A updates the label complexity ratio, and Stage B evaluates feature performance and selects candidate features.
Algorithm 1 MLCFS
Input: A training sample D with a full feature set F = {f_1, f_2, ..., f_n} and a label set L = {l_1, l_2, ..., l_q}; user-specified threshold K.
Output: The selected feature subset S.
  //Step 1: Compute initial label complexity ratios for all labels
  1: Initialize S ← Ø;
  2: Initialize k ← 0;
  3: For i = 1 to q do
  4:   Calculate the complexity of the label l_i based on LCratio(l_i);
  5: End for
  6: While k < K do
  //Step 2: First iteration (when S is empty)
  7:   If k = 0 then
  8:     Select the label l_max with the largest LCratio(l_i);
  9:     Select the feature f_max with the largest I(f_m; l_max);
  10:    F = F \ {f_max};
  11:    S = S ∪ {f_max};
  12:    k = k + 1;
  13:  End if
  //Step 3: Subsequent iterations
  //Stage A: Update label complexity ratios considering selected features
  14:  For i = 1 to q do
  15:    Update LCratio(l_i, S) according to Formula (5);
  16:  End for
  17:  Select the label l_max with the largest LCratio(l_i, S);
  //Stage B: Comprehensive feature evaluation
  18:  For each candidate feature f_m ∈ F do
  19:    Calculate J(f_m) according to Formula (7);
  20:  End for
  21:  Select the feature f_max with the largest J(f_m);
  22:  F = F \ {f_max};
  23:  S = S ∪ {f_max};
  24:  k = k + 1;
  25: End while
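Putting the pieces together, a compact transcription of Algorithm 1 might read as follows. It builds on the lc_ratio, lc_ratio_dynamic, j_mlcfs, and mutual_information sketches above; the column-list data layout and the names are our own assumptions.

```python
def mlcfs(X_cols, Y_cols, K):
    """Greedy MLCFS selection following Algorithm 1.
    X_cols / Y_cols: lists of discrete feature / label columns.
    K: number of features to select. Returns selected feature indices."""
    remaining = set(range(len(X_cols)))
    selected_idx = []
    # Steps 1-2: initial LCratio for every label, then the first feature is
    # the one with maximum mutual information for the most complex label.
    i_max = max(range(len(Y_cols)), key=lambda i: lc_ratio(i, Y_cols))
    best = max(remaining,
               key=lambda m: mutual_information(X_cols[m], Y_cols[i_max]))
    remaining.discard(best)
    selected_idx.append(best)
    # Step 3: alternate Stage A (update LCratio(l_i, S)) and Stage B (score J).
    while len(selected_idx) < K and remaining:
        S = [X_cols[i] for i in selected_idx]
        i_max = max(range(len(Y_cols)),
                    key=lambda i: lc_ratio_dynamic(i, Y_cols, S))
        best = max(remaining,
                   key=lambda m: j_mlcfs(X_cols[m], Y_cols[i_max], S))
        remaining.discard(best)
        selected_idx.append(best)
    return selected_idx
```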

4.4. Theoretical and Time Complexity Analysis

Most information-theoretic methods (e.g., D2F, PMU, MFSJMI) treat all labels as equally important. They aggregate mutual information across labels or consider pairwise label interactions, which is based on an implicit assumption of uniform label importance. Furthermore, although FIMF introduces label weights, these weights are not updated during the feature selection process. Overall, these methods are unable to differentiate and quantify the complexity of different label distributions. In contrast, the MLCFS algorithm dynamically focuses on the currently most complex label, thereby ensuring balanced coverage of the entire label space and overcoming the limitations of existing feature selection strategies in handling label information distribution.
We present a time-complexity analysis for the proposed MLCFS method and eight representative information-theoretic feature selection methods (D2F, PMU, SCLS, LRFS, FIMF, IDA, MFSJMI, and MIFS). Let $n$ denote the number of instances, $d$ the number of features, $q$ the number of labels, and $w$ the size of the selected feature subset. Since probability estimation requires scanning all instances, computing mutual information, conditional mutual information, and interaction information typically incurs a time complexity of $O(n)$. The time complexities of all compared methods are summarized in Table 2.
The analysis shows that MLCFS achieves the same time complexity as the SCLS method. Furthermore, the time complexity of MLCFS is lower than that of D2F, PMU, LRFS, FIMF, and IDA. Specifically, compared to D2F, FIMF, IDA, and MFSJMI, MLCFS saves a factor of $q$ in the second term by focusing on a single label $l_{\max}$ per iteration instead of summing over all $q$ labels when computing feature redundancy and interaction. Compared to PMU and LRFS, which incur an $O(ndq^2)$ cost for second-order label correlations, MLCFS uses the precomputed and dynamically updated LCratio to guide label selection without repeatedly calculating pairwise label interactions during feature evaluation.
Therefore, while introducing a novel dynamic label complexity awareness, the proposed MLCFS method maintains competitive computational efficiency compared to simpler methods and is more efficient than several state-of-the-art methods that account for label correlations.

5. Experimental Results and Analysis

In this section, experiments are conducted on nine publicly available multi-label datasets to evaluate the effectiveness of the proposed multi-label feature selection method MLCFS. The proposed method is compared with eight representative and widely used multi-label feature selection methods. The evaluation metrics are described in Section 5.1. The dataset descriptions and experimental settings are provided in Section 5.2. The experimental results and detailed analysis are presented in Section 5.3. The significance tests of the experimental results are provided in Section 5.4.

5.1. Evaluation Metrics for Multi-Label Feature Selection

To comprehensively evaluate the performance of multi-label algorithms, we employ four widely recognized metrics commonly used in multi-label learning [47]. Suppose that $D = \{(x_i, l_i) \mid x_i \in X, l_i \subseteq L\}$ is a multi-label training dataset, $U = \{x_1, x_2, \ldots, x_n\}$ is a set containing $n$ samples, $F = \{f_1, f_2, \ldots, f_d\}$ denotes the set of features, and $L = \{l_1, l_2, \ldots, l_q\}$ is the set of labels. For $x_i \in U$, $L(x_i)$ and $L'(x_i)$ denote the true label set and the predicted label set, respectively. The specific definitions of the evaluation metrics are as follows.
(1)
Hamming Loss (HL): HL evaluates how frequently instance-label pairs are misclassified.
$$HL = \frac{1}{n}\sum_{i=1}^{n} \frac{\left|L(x_i)\, \Delta\, L'(x_i)\right|}{q} \quad (8)$$
where $\Delta$ computes the symmetric difference between $L(x_i)$ and $L'(x_i)$.
(2)
Average Precision (AP): AP evaluates the average fraction of relevant labels that are ranked no lower than each relevant label.
$$AP = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|L(x_i)|} \sum_{l \in L(x_i)} \frac{\left|\{l' \mid rank(g(x_i, l')) \le rank(g(x_i, l)),\ l' \in L(x_i)\}\right|}{rank(g(x_i, l))} \quad (9)$$
where $rank(g(x_i, l))$ records the position of label $l$ when all labels are ranked in descending order of their scores according to $g(\cdot)$.
(3)
Ranking Loss (RL): RL evaluates the average fraction of label pairs that are reversely ordered for a sample, i.e., an irrelevant label is ranked above a relevant one.
$$RL = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{|L(x_i)||\bar{L}(x_i)|} \left|\{(l', l'') \mid f(x_i, l') \le f(x_i, l''),\ (l', l'') \in L(x_i) \times \bar{L}(x_i)\}\right| \quad (10)$$
where $\bar{L}(x_i)$ represents the complement of the true label set $L(x_i)$, and $L(x_i) \times \bar{L}(x_i)$ represents the Cartesian product of $L(x_i)$ and $\bar{L}(x_i)$.
(4)
Coverage Error (CE): CE evaluates how many steps it takes on average to move down the list of ranked labels, covering all relevant labels for the sample.
$$CE = \frac{1}{n}\sum_{i=1}^{n} \max_{l \in L(x_i)} rank(g(x_i, l)) - 1 \quad (11)$$
The evaluation metrics adopted in the experiment follow the principle that smaller HL, RL, and CE values indicate better performance. Conversely, the higher the value of AP, the better the classification performance.
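For reference, scikit-learn ships implementations of all four metrics; a minimal sketch on toy arrays is shown below (the variable names are ours). Note that sklearn's coverage_error counts the deepest relevant rank itself, so one is subtracted to match Formula (11).

```python
import numpy as np
from sklearn.metrics import (hamming_loss, coverage_error, label_ranking_loss,
                             label_ranking_average_precision_score)

Y_true = np.array([[1, 0, 1], [0, 1, 0]])               # ground-truth label sets
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])               # binary predictions, for HL
scores = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]])   # label scores g(x, l)

hl = hamming_loss(Y_true, Y_pred)
ap = label_ranking_average_precision_score(Y_true, scores)
rl = label_ranking_loss(Y_true, scores)
ce = coverage_error(Y_true, scores) - 1  # sklearn counts the last relevant rank
print(hl, ap, rl, ce)
```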

5.2. Description of Multi-Label Benchmark Datasets and Experimental Settings

We evaluate the performance of our method on nine publicly available multi-label benchmark datasets from the Mulan repository [48], datasets that are well-established in prior work [49,50,51]. Table 3 summarizes their key characteristics, including the number of instances, features, and labels. To comprehensively assess the effectiveness of the proposed approach, we compare it with eight representative multi-label feature selection methods: MIFS, D2F, PMU, SCLS, LRFS, FIMF, IDA, and MFSJMI. For each dataset, the top 20% of features selected by each method are used to compute average performance scores and standard deviations. The classification performance is evaluated using the MLKNN classifier [52] with four standard metrics: Hamming Loss (HL), Average Precision (AP), Ranking Loss (RL), and Coverage Error (CE). Following common practice, we set the number of neighbors K = 10 and the smoothing factor to 1.
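Under these settings, the evaluation protocol can be sketched as follows, assuming the scikit-multilearn implementation of ML-KNN (its MLkNN class exposes the neighbor count k and smoothing factor s); the synthetic data and the placeholder index list selected_idx stand in for a real dataset and a real feature ranking.

```python
import numpy as np
from skmultilearn.adapt import MLkNN

rng = np.random.default_rng(0)
X_train, X_test = rng.random((100, 50)), rng.random((30, 50))
Y_train = rng.integers(0, 2, (100, 5))

selected_idx = list(range(10))  # placeholder: top 20% of 50 features

clf = MLkNN(k=10, s=1.0)  # K = 10 neighbors, smoothing factor 1
clf.fit(X_train[:, selected_idx], Y_train)
Y_pred = clf.predict(X_test[:, selected_idx]).toarray()        # binary predictions
scores = clf.predict_proba(X_test[:, selected_idx]).toarray()  # scores for AP/RL/CE
```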

5.3. Classification Results and Analysis

Table 4, Table 5, Table 6 and Table 7 present the experimental results of the nine multi-label feature selection methods on the nine benchmark datasets. The best value in each row is highlighted in bold; the last row reports the average performance of each method over all datasets. By comparing four evaluation metrics among MIFS, D2F, PMU, SCLS, LRFS, FIMF, IDA, MFSJMI, and the proposed MLCFS, the effectiveness of MLCFS is comprehensively confirmed. As shown in Table 4, MLCFS achieves the lowest Hamming Loss (HL) on five datasets and obtains competitive results on the scene and medical datasets, reflecting its overall strength in minimizing label-wise misclassification. Table 5 shows that MLCFS outperforms all other methods on the Coverage Error (CE) metric, attaining the best score on eight of the nine datasets.
In Table 6, MLCFS delivers better Ranking Loss (RL) values than all compared methods on six datasets; its average RL is also the lowest among all methods, indicating superior ranking consistency. Regarding Average Precision (AP) in Table 7, MLCFS exhibits outstanding performance on eight datasets and achieves the highest average AP, confirming its advantage in retrieving relevant labels early. Overall, the results demonstrate that MLCFS performs consistently well across all evaluation metrics. These findings highlight the importance of accounting for label complexity distribution and the dynamic interactions between labels and selected features during the feature-selection process.
Figure 3 graphically summarizes the average rank results derived from Table 4, Table 5, Table 6 and Table 7. In each subfigure, the horizontal axis corresponds to the nine compared methods, while the vertical axis shows the average rank of each method across all experimental datasets. MLCFS consistently achieves the best average rank on all four metrics, demonstrating its superior ranking consistency and overall effectiveness relative to the eight benchmark methods.
To evaluate the sensitivity of MLCFS to the number of selected features, Figure 4, Figure 5, Figure 6 and Figure 7 present the performance evolution curves of all methods as the feature subset size increases from 1% to 20%, with a 1% increment. A comprehensive analysis reveals that MLCFS generally reaches performance saturation with fewer features; even at very low feature proportions (<5%), it maintains leading performance, indicating that the initial critical features it selects are of high quality. Moreover, the curves of MLCFS exhibit smooth and steady trends across all datasets, without abnormal fluctuations, demonstrating the robustness of its selection process.
To evaluate the sensitivity of the features selected by different feature selection methods to classifier parameters, we conduct additional experiments. Specifically, we fix the feature subset selected by each method (top 20%), vary the neighborhood parameter K of the ML-KNN classifier (taking values from {5, 10, 15}), and observe the corresponding changes in classification performance.
Figure 8, Figure 9 and Figure 10 illustrate the variation in the AP metric with respect to the parameter K on three representative datasets. As shown, the curve corresponding to MLCFS consistently outperforms those of other methods across all values of K. Furthermore, the MLCFS curve exhibits smaller fluctuations, indicating that its performance is less sensitive to changes in K. Similar trends are observed on other datasets and across different evaluation metrics. These results demonstrate that the feature subset selected by MLCFS provides a more stable and robust representation for the classifier, enabling consistently strong performance under varying parameter configurations.

5.4. Statistical Tests

To further investigate whether there are significant differences in the classification performance of the proposed algorithm, MLCFS, and the eight comparative feature selection algorithms across the four evaluation metrics, the Friedman test and the Bonferroni–Dunn test [53,54] were employed. Table 8 presents the average ranking results of the MLCFS algorithm and all comparative algorithms on the nine experimental datasets under the four evaluation metrics. The results show that the MLCFS algorithm achieved the best ranking on every metric. For $K$ algorithms and $N$ datasets, $r_j^i$ denotes the rank of the i-th algorithm on the j-th dataset, and $R_i = \frac{1}{N}\sum_{j=1}^{N} r_j^i$ denotes the average rank of the i-th algorithm. The Friedman statistic $F_F$ follows an F-distribution with $(K-1)$ degrees of freedom in the numerator and $(K-1)(N-1)$ in the denominator.
$$F_F = \frac{(N-1)\,\chi_F^2}{N(K-1) - \chi_F^2}, \quad \text{where} \quad \chi_F^2 = \frac{12N}{K(K+1)}\left(\sum_{i=1}^{K} R_i^2 - \frac{K(K+1)^2}{4}\right) \quad (12)$$
Table 9 summarizes the value of $F_F$ and the corresponding critical value. If the $F_F$ value is greater than the critical value, the null hypothesis, which states that the classification performance of all compared algorithms is equal, is rejected. As shown in Table 9, the null hypothesis was clearly rejected on each evaluation metric at the significance level $\alpha = 0.05$.
Therefore, the Bonferroni–Dunn test was subsequently employed to further analyze the relative performance between the proposed algorithm and the other comparative algorithms. If the average ranks of the proposed algorithm MLCFS and a compared algorithm across all datasets fall within one critical difference (CD), they are considered statistically similar. Conversely, if the difference in average ranks exceeds the CD, the two algorithms differ significantly in classification performance.
With $K = 9$ and $N = 9$, $CD = q_\alpha \sqrt{\frac{K(K+1)}{6N}}$, where $q_\alpha = 2.724$ at $\alpha = 0.05$; thus, $CD = 3.516$. Figure 11 presents the critical difference diagrams for each classification evaluation metric, with the average ranks of the nine feature selection algorithms plotted along the axis. The rank increases from right to left.
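For concreteness, the statistic of Formula (12) and the critical difference can be reproduced in a few lines (a sketch using the constants stated above; scipy.stats.friedmanchisquare offers the raw chi-square variant if per-dataset scores rather than average ranks are available):

```python
import numpy as np

K, N, q_alpha = 9, 9, 2.724  # algorithms, datasets, Bonferroni-Dunn q at alpha = 0.05

def friedman_ff(avg_ranks):
    """F_F statistic of Formula (12) from the K average ranks R_i."""
    chi2 = 12 * N / (K * (K + 1)) * (np.sum(np.square(avg_ranks))
                                     - K * (K + 1) ** 2 / 4)
    return (N - 1) * chi2 / (N * (K - 1) - chi2)

cd = q_alpha * np.sqrt(K * (K + 1) / (6 * N))
print(round(cd, 3))  # 3.517; matches the reported CD = 3.516 up to rounding
```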
In Figure 11, any compared method whose average rank falls within one critical difference (CD) of the best-performing algorithm is connected to it by a thick red line; methods not connected by such a line are considered to exhibit significantly different performance. Overall, the proposed method MLCFS significantly outperforms most compared methods across the evaluation metrics.

6. Conclusions and Future Work

In this paper, we propose a novel feature selection method for multi-label learning, designed to identify a compact and informative feature subset. We first define a label complexity ratio based on information entropy and mutual information, which quantifies the varying complexity across different label distributions. This ratio is then dynamically updated via conditional mutual information to reflect the influence of already selected features. Building on this foundation, we introduce a new feature evaluation criterion that maximizes the label complexity ratio while holistically accounting for feature correlation, redundancy, and interaction. Finally, we validate the proposed method, termed MLCFS, on multiple multi-label benchmark datasets using four standard evaluation metrics. Experimental results confirm that MLCFS outperforms several representative feature selection methods.
The proposed MLCFS method dynamically selects features based on the currently most complex label. While this method has demonstrated effectiveness in our experiments, it remains subject to certain limitations. For instance, with respect to feature redundancy, the influence of label complexity on pairwise feature redundancy has not been incorporated into the current framework. In future work, we will conduct an in-depth investigation into the different roles of feature redundancy in information-theoretic feature selection to consider the dynamic changes in redundancy within the selected feature subset and the label space.

Author Contributions

Conceptualization, P.Z. and Y.C.; methodology, P.Z.; software, Y.C.; validation, P.Z. and Y.C.; formal analysis, P.Z.; investigation, P.Z. and L.W.; resources, Y.C.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, P.Z. and L.W.; visualization, Y.C.; supervision, P.Z. and L.W.; project administration, P.Z.; funding acquisition, P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 62206085, and Grant No. 62376088; The National Natural Science Foundation of Hebei Province under Grants No. F2025202050 and No. F202420204; The Hebei Province Yanzhao Golden Platform Talent Gathering Program Key Talent Project (Postdoctoral Platform) (No. B2024005001).

Data Availability Statement

The original data presented in the study are openly available in [Mulan] at [https://mulan.sourceforge.net/] (accessed on 1 May 2025) or reference [Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J. Mulan: a Java library for multi-label learning. Journal of Machine Learning Research. 2011, 12, 2411–2414 [48]].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Symbols and Notations.

| Symbols | Notations |
|---|---|
| $U = \{x_1, x_2, \ldots, x_n\}$ | set of samples |
| $F = \{f_1, f_2, \ldots, f_d\}$ | set of features |
| $L = \{l_1, l_2, \ldots, l_q\}$ | set of labels |
| $f_k$ | the k-th candidate feature (general notation) |
| $l_i \in L$ | the i-th label |
| $f_s \in S$ | a feature in the selected feature subset |
| $S \subseteq F$ | the selected feature subset |

References

1. Huang, R.; Wu, Z. Multi-label feature selection via manifold regularization and dependence maximization. Pattern Recognit. 2021, 120, 108149.
2. Wu, J.S.; Huang, S.J.; Zhou, Z.H. Genome-wide protein function prediction through multi-instance multi-label learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2014, 11, 891–902.
3. Spolaôr, N.; Monard, M.C.; Tsoumakas, G.; Lee, H.D. A systematic review of multi-label feature selection and a new method based on label construction. Neurocomputing 2016, 180, 3–15.
4. Gao, W.; Hu, L.; Zhang, P. Class-specific mutual information variation for feature selection. Pattern Recognit. 2018, 79, 328–339.
5. Lin, Y.; Hu, Q.; Liu, J.; Chen, J.; Duan, J. Multi-label feature selection based on neighborhood mutual information. Appl. Soft Comput. 2016, 38, 244–256.
6. Deng, W.; Xu, H.; Guan, Z.; Sun, Y.; Ran, X.; Ma, H.; Zhou, X.; Zhao, H. PSO-K-Means Clustering-Based NSGA-III for Delay Recovery. IEEE Trans. Consum. Electron. 2025, 71, 10084–10095.
7. Huang, R.; Jiang, W.; Sun, G. Manifold-based constraint Laplacian score for multi-label feature selection. Pattern Recognit. Lett. 2018, 112, 346–352.
8. Dai, J.; Chen, J.; Liu, Y.; Hu, H. Novel multi-label feature selection via label symmetric uncertainty correlation learning and feature redundancy evaluation. Knowl.-Based Syst. 2020, 207, 106342.
9. Lee, J.; Kim, D.W. Memetic feature selection algorithm for multi-label classification. Inf. Sci. 2015, 293, 80–96.
10. Kashef, S.; Nezamabadi-pour, H. A label-specific multi-label feature selection algorithm based on the Pareto dominance concept. Pattern Recognit. 2019, 88, 654–667.
11. Pereira, R.B.; Plastino, A.; Zadrozny, B.; Merschmann, L.H. Categorizing feature selection methods for multi-label classification. Artif. Intell. Rev. 2018, 49, 57–78.
12. Lee, J.; Kim, D.W. Efficient multi-label feature selection using entropy-based label selection. Entropy 2016, 18, 405.
13. Hall, M.A. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, CA, USA, 29 June–2 July 2000; pp. 359–366.
14. Guyon, I.; Weston, J.; Barnhill, S. Gene Selection for Cancer Classification using Support Vector Machines. Mach. Learn. 2002, 46, 389–422.
15. Mejia-Lavalle, M.; Sucar, E.; Arroyo, G. Feature selection with a perceptron neural net. In Proceedings of the International Workshop on Feature Selection for Data Mining, Bethesda, MD, USA, 22 April 2006; pp. 131–135.
16. Yu, L.; Liu, H. Efficient Feature Selection via Analysis of Relevance and Redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224.
17. Liu, H.; Setiono, R. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 5–8 November 1995; pp. 388–391.
18. Liu, H.; Yu, L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 2005, 17, 491–502.
19. Li, Y.H.; Hu, L.; Gao, W.F. Multi-label feature selection based on sparse coefficient matrix reconstruction. Chin. J. Comput. 2022, 45, 1827–1841. (In Chinese with English abstract)
20. Wu, J.S.; Li, Y.L.; Huang, C. Recent Advances in Unsupervised Multi-view Feature Selection. J. Softw. 2025, 36, 886–914.
21. Li, Y.H.; Hu, L.; Zhang, P. Multi-label feature selection based on dynamic graph Laplacian. J. Commun. 2020, 41, 47–59.
22. Sechidis, K.; Spyromitros-Xioufis, E.; Vlahavas, I. Information theoretic multi-target feature selection via output space quantization. Entropy 2019, 21, 855.
23. Liu, J.; Lin, Y.; Ding, W.; Zhang, H.; Du, J. Fuzzy mutual information-based multilabel feature selection with label dependency and streaming labels. IEEE Trans. Fuzzy Syst. 2023, 31, 77–91.
24. Zhang, L.; Wang, C. Multi-label feature selection algorithm based on joint mutual information of max-relevance and min-redundancy. J. Commun. 2018, 39, 111–122.
25. Wang, G.Y.; Yu, H.; Yang, D.C. Decision table reduction based on conditional information entropy. Chin. J. Comput. 2002, 25, 759–766. (In Chinese with English abstract)
26. Liu, J.; Li, Y.; Weng, W. Feature selection for multi-label learning with streaming label. Neurocomputing 2020, 387, 268–278.
27. Sun, L.; Wang, L.; Ding, W.; Qian, Y.; Xu, J. Feature Selection Using Fuzzy Neighborhood Entropy-Based Uncertainty Measures for Fuzzy Neighborhood Multigranulation Rough Sets. IEEE Trans. Fuzzy Syst. 2021, 29, 19–33.
28. Boutell, M.R.; Luo, J.; Shen, X. Learning multi-label scene classification. Pattern Recognit. 2004, 37, 1757–1771.
29. Trohidis, K.; Tsoumakas, G.; Kalliris, G. Multi-label classification of music by emotion. EURASIP J. Audio Speech Music Process. 2011, 2011, 4.
30. Read, J. A pruned problem transformation method for multi-label classification. In Proceedings of the 2008 New Zealand Computer Science Research Student Conference, Christchurch, New Zealand, 14–18 April 2008; pp. 143–150.
31. Yin, T.; Chen, H.; Wan, J.; Zhang, P.; Horng, S.J.; Li, T. Exploiting feature multi-correlations for multilabel feature selection in robust multi-neighborhood fuzzy β covering space. Inf. Fusion 2024, 104, 102150.
32. Zhang, Y.; Huo, W.; Tang, J. Multi-label feature selection via latent representation learning and dynamic graph constraints. Pattern Recognit. 2024, 151, 110411.
33. Jian, L.; Li, J.; Shu, K.; Liu, H. Multi-label informed feature selection. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York, NY, USA, 9–15 July 2016; pp. 1627–1633.
34. Fan, Y.; Liu, J.; Tang, J. Learning correlation information for multi-label feature selection. Pattern Recognit. 2024, 145, 109899.
35. Dai, J.; Liu, Q.; Chen, W. Multi-label feature selection based on fuzzy mutual information and orthogonal regression. IEEE Trans. Fuzzy Syst. 2024, 32, 5136–5148.
36. Yin, T.; Chen, H.; Yuan, Z. LEFMIFS: Label enhancement and fuzzy mutual information for robust multilabel feature selection. Eng. Appl. Artif. Intell. 2024, 133, 108108.
37. Sun, Z.; Zhang, J.; Dai, L.; Li, C.; Zhou, C.; Xin, J.; Li, S. Mutual information based multi-label feature selection via constrained convex optimization. Neurocomputing 2019, 329, 447–456.
38. Gonzalez-Lopez, J.; Ventura, S.; Cano, A. Distributed multi-label feature selection using individual mutual information measures. Knowl.-Based Syst. 2020, 188, 105052.
39. Lee, J.; Kim, D.W. Mutual information-based multi-label feature selection using interaction information. Expert Syst. Appl. 2015, 42, 2013–2025.
40. Lee, J.; Kim, D.W. Feature selection for multi-label classification using multivariate mutual information. Pattern Recognit. Lett. 2013, 34, 349–357.
41. Lee, J.; Kim, D.W. SCLS: Multi-label feature selection based on scalable criterion for large label set. Pattern Recognit. 2017, 66, 342–352.
42. Zhang, P.; Liu, G.; Gao, W. Distinguishing two types of labels for multi-label feature selection. Pattern Recognit. 2019, 95, 72–82.
43. Pan, M.; Sun, Z.; Wang, C.; Cao, G. A multi-label feature selection method based on an approximation of interaction information. Intell. Data Anal. 2022, 26, 823–840.
44. Lee, J.; Kim, D.W. Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognit. 2015, 48, 2761–2771.
45. Zhang, P.; Liu, G.; Song, J. MFSJMI: Multi-label feature selection considering join mutual information and interaction weight. Pattern Recognit. 2023, 138, 109378.
46. Guo, D.; Zhang, J.; Yang, B.; Lin, Y. Multi-modal intelligent situation awareness in real-time air traffic control: Control intent understanding and flight trajectory prediction. Chin. J. Aeronaut. 2025, 38, 103376.
47. Zhao, J.; Yang, C.; Gao, W.; Park, J.H. ADP-based optimal control of linear singularly perturbed systems with uncertain dynamics: A two-stage value iteration method. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 4399–4403.
48. Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J. Mulan: A Java library for multi-label learning. J. Mach. Learn. Res. 2011, 12, 2411–2414.
49. Cai, Z.; Zhu, W. Multi-label feature selection via feature manifold learning and sparsity regularization. Int. J. Mach. Learn. Cybern. 2018, 9, 1321–1334.
50. Rodrigues, D.; Pereira, L.; Nakamura, R. A wrapper approach for feature selection based on bat algorithm and optimum-path forest. Expert Syst. Appl. 2014, 41, 2250–2258.
51. Zhang, J.; Luo, Z.; Li, C. Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognit. 2019, 95, 136–150.
52. Zhang, L.; Wang, Z. Multi-label Feature Selection Algorithm Based on Maximum Correlation and Minimum Redundancy Joint Mutual Information. J. Commun. 2018, 39, 111–122.
53. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92.
54. Dunn, O.J. Multiple comparisons among means. J. Am. Stat. Assoc. 1961, 56, 52–64.
Figure 1. The dynamic changes in the label information during the feature selection process: (a) original label space; (b) label space based on $S_{ii}$; (c) label space based on $S_{jj}$. The white part is the selected feature subset that is relevant to the shown label.
Figure 2. Two-stage interactive iterative strategy of the proposed method.
Figure 3. The statistical results of Avgrank: (a) hamming loss, (b) coverage error, (c) ranking loss, (d) average precision.
Figure 4. Classification performance comparisons in terms of the HL metric.
Figure 5. Classification performance comparisons in terms of the CE metric.
Figure 6. Classification performance comparisons in terms of the RL metric.
Figure 7. Classification performance comparisons in terms of the AP metric.
Figure 8. Classification performance comparisons in terms of the AP metric on computers: (a) K = 5, (b) K = 10, (c) K = 15.
Figure 9. Classification performance comparisons in terms of the AP metric on reference: (a) K = 5, (b) K = 10, (c) K = 15.
Figure 10. Classification performance comparisons in terms of the AP metric on social: (a) K = 5, (b) K = 10, (c) K = 15.
Figure 11. The CD diagrams using the Bonferroni–Dunn test: (a) hamming loss, (b) coverage error, (c) ranking loss, (d) average precision.
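The critical difference underlying these diagrams follows the standard Bonferroni–Dunn formulation; as a quick reference, the following is our restatement of that standard formula with $k = 9$ methods and $N = 9$ datasets (the exact $q_{\alpha}$ value depends on the significance level chosen):

```latex
% Critical difference for the Bonferroni--Dunn test: two methods differ
% significantly when their average ranks (Table 8) differ by more than CD.
% With k = 9 methods and N = 9 datasets, sqrt(9*10 / (6*9)) ~ 1.291.
\[
  \mathrm{CD} = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}
              = q_{\alpha}\sqrt{\frac{9 \times 10}{6 \times 9}}
              \approx 1.291\, q_{\alpha}
\]
```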
Table 1. Summary of evaluation criteria for representative feature selection methods.

| Methods | Relevance $(f_k; L)$ | Redundancy $(f_k; S)$ |
|---|---|---|
| D2F | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S}\sum_{l_i \in L} I(f_k; f_j; l_i)$ |
| PMU | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S}\sum_{l_i \in L} I(f_k; f_j; l_i)$ |
| SCLS | $\sum_{l_i \in L} I(f_k; l_i)$ | $\sum_{f_j \in S}\dfrac{I(f_k; f_j)}{H(f_k)}\sum_{l_i \in L} I(f_k; l_i)$ |
| FIMF | $\sum_{l_i \in L} I(f_k; l_i)$ | / |
| LRFS | $\sum_{l_i \in L} I(f_k; l_i)$ | $\dfrac{1}{|S|}\sum_{f_j \in S} I(f_k; f_j)$ |
| IDA | $\dfrac{1}{|L|}\Big\{\sum_{l_i \in L} I(f_k; l_i) - \dfrac{1}{2}\sum_{l_q \in L}\sum_{l_j \in L,\, q \neq j} I(f_k; l_q; l_j)\Big\}$ | $\dfrac{1}{|S|}\Big\{\sum_{f_i \in S} I(f_k; f_i) - \dfrac{1}{2}\sum_{f_q \in S}\sum_{f_j \in S,\, f_q \neq f_j} I(f_k; f_q; f_j)\Big\}$ |
| MFSJMI | $\sum_{l_i \in L}\sum_{l_j \in L \setminus \{l_i\}} \big[ I(f_k; l_i \mid l_j) + I(f_k; l_i) - I(l_i; l_j; f_k) \big]$ | $\sum_{f_j \in S} I(f_k; f_j)$ |
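Several criteria in Table 1 rely on third-order interaction information terms such as $I(f_k; f_j; l_i)$. The sketch below shows one way such a term can be estimated from discretized data, assuming the sign convention $I(f; g; l) = I(f; g \mid l) - I(f; g)$; the function name and the scikit-learn-based estimator are our illustrative choices, not the authors' code:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def interaction_information(f, g, l):
    """Estimate I(f; g; l) = I(f; g | l) - I(f; g) from discrete samples.

    f, g, l are 1-D integer arrays of equal length. Under this sign
    convention, positive values indicate synergy between f and g about l,
    and negative values indicate redundancy.
    """
    # Unconditional mutual information I(f; g).
    mi = mutual_info_score(f, g)

    # Conditional mutual information I(f; g | l): average I(f; g) within
    # each stratum of l, weighted by the stratum's empirical probability.
    n = len(l)
    cmi = 0.0
    for v in np.unique(l):
        mask = (l == v)
        cmi += mask.sum() / n * mutual_info_score(f[mask], g[mask])

    return cmi - mi
```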
Table 2. The time complexity of the nine methods.

| Methods | Time Complexity |
|---|---|
| MLCFS | $O(ndq + wnd)$ |
| D2F | $O(ndq + wndq)$ |
| PMU | $O(ndq + wndq + ndq^2)$ |
| SCLS | $O(ndq + wnd)$ |
| LRFS | $O(ndq^2 + wnd)$ |
| FIMF | $O(ndq + wndq)$ |
| IDA | $O(ndq + wndq)$ |
| MFSJMI | $O(ndq + wndq)$ |
| MIFS | $O(n(d^2 + dq) + wnd^2)$ |

Here $n$, $d$, and $q$ denote the numbers of instances, features, and labels, respectively, and $w$ denotes the number of selected features.
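To make these entries concrete, the following is a minimal sketch of a generic greedy, mutual-information-based forward selector; it is our illustration of where the two dominant terms come from, not the authors' MLCFS implementation, and the function and variable names are assumptions. Precomputing the feature–label relevance matrix costs $O(ndq)$, and updating redundancy against only the newest selected feature contributes $O(wnd)$:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_mi_selection(X, Y, w):
    """Greedy forward selection sketch over discretized data.

    X: (n, d) matrix of discretized features; Y: (n, q) binary label
    matrix; w: number of features to select.
    """
    n, d = X.shape
    q = Y.shape[1]

    # Relevance of every feature to every label: d*q mutual-information
    # estimates, each over n instances -> the O(ndq) term in Table 2.
    relevance = np.array([[mutual_info_score(X[:, k], Y[:, i])
                           for i in range(q)] for k in range(d)])

    selected, candidates = [], list(range(d))
    redundancy = np.zeros(d)  # accumulated redundancy w.r.t. selected set

    for _ in range(w):
        # Score = aggregated relevance minus accumulated redundancy.
        scores = relevance.sum(axis=1) - redundancy
        best = max(candidates, key=lambda k: scores[k])
        selected.append(best)
        candidates.remove(best)

        # Update redundancy against the newly selected feature only:
        # d mutual-information estimates per round -> O(wnd) overall.
        for k in candidates:
            redundancy[k] += mutual_info_score(X[:, k], X[:, best])

    return selected
```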
Table 3. Description of the multi-label datasets.

| Datasets | Instances | Train | Test | Features | Labels | Label Cardinality | Label Density | Domain |
|---|---|---|---|---|---|---|---|---|
| scene | 2407 | 1211 | 1196 | 294 | 6 | 1.074 | 0.179 | Image |
| yeast | 2417 | 1500 | 917 | 103 | 14 | 4.237 | 0.303 | Biology |
| computers | 5000 | 2000 | 3000 | 681 | 33 | 1.508 | 0.046 | Yahoo |
| health | 5000 | 2000 | 3000 | 612 | 32 | 1.662 | 0.052 | Text |
| reference | 5000 | 2000 | 3000 | 793 | 33 | 1.169 | 0.035 | Yahoo |
| social | 5000 | 2000 | 3000 | 1047 | 39 | 1.282 | 0.033 | Text |
| medical | 978 | 333 | 645 | 1449 | 45 | 1.245 | 0.028 | Text |
| entertain | 5000 | 2000 | 3000 | 640 | 21 | 1.420 | 0.068 | Text |
| society | 5000 | 2000 | 3000 | 636 | 27 | 1.692 | 0.063 | Text |
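The last two statistics in Table 3 are related by a simple identity: label density is label cardinality normalized by the number of labels. Using the standard definitions:

```latex
% Label cardinality is the mean number of labels per instance; label
% density normalizes it by the number of labels q.
\[
  \mathrm{LCard}(D) = \frac{1}{n}\sum_{x=1}^{n}\lvert Y_x \rvert ,
  \qquad
  \mathrm{LDen}(D) = \frac{\mathrm{LCard}(D)}{q}
\]
% Sanity check on the scene row of Table 3: 1.074 / 6 ~ 0.179.
```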
Table 4. Performance comparison results of nine methods on the HL metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 0.1413 ± 0.0206 | 0.1704 ± 0.0097 | 0.1492 ± 0.0064 | 0.1473 ± 0.0066 | 0.1734 ± 0.003 | 0.1419 ± 0.0099 | 0.1663 ± 0.0063 | 0.1458 ± 0.0102 | 0.1411 ± 0.019 |
| yeast | 0.2257 ± 0.0126 | 0.2302 ± 0.0041 | 0.2278 ± 0.0029 | 0.2279 ± 0.0037 | 0.2332 ± 0.0044 | 0.2263 ± 0.0035 | 0.2319 ± 0.0042 | 0.2303 ± 0.0026 | 0.2305 ± 0.0028 |
| computers | 0.0407 ± 0.0008 | 0.0449 ± 0.0002 | 0.044 ± 0.0005 | 0.0441 ± 0.0005 | 0.0434 ± 0.0005 | 0.0429 ± 0.0007 | 0.0433 ± 0.0006 | 0.0426 ± 0.0012 | 0.0432 ± 0.0007 |
| health | 0.0441 ± 0.0026 | 0.0502 ± 0.001 | 0.0483 ± 0.0005 | 0.0493 ± 0.0006 | 0.0485 ± 0.0011 | 0.0452 ± 0.0011 | 0.0442 ± 0.0013 | 0.0471 ± 0.0009 | 0.0473 ± 0.0015 |
| reference | 0.0305 ± 0.0014 | 0.0313 ± 0.0012 | 0.0322 ± 0.0012 | 0.0336 ± 0.001 | 0.0329 ± 0.0002 | 0.0312 ± 0.0007 | 0.0321 ± 0.0009 | 0.0315 ± 0.0006 | 0.0314 ± 0.0009 |
| social | 0.0266 ± 0.0021 | 0.0317 ± 0.0013 | 0.0303 ± 0.0005 | 0.0309 ± 0.0003 | 0.0287 ± 0.0007 | 0.0274 ± 0.0007 | 0.0282 ± 0.0006 | 0.0266 ± 0.0012 | 0.0281 ± 0.0009 |
| medical | 0.0171 ± 0.0008 | 0.0165 ± 0.0021 | 0.0196 ± 0.001 | 0.0197 ± 0.0011 | 0.0233 ± 0.0002 | 0.0175 ± 0.001 | 0.0174 ± 0.001 | 0.0218 ± 0.0001 | 0.0177 ± 0.0015 |
| entertain | 0.0637 ± 0.0017 | 0.0658 ± 0.0008 | 0.0657 ± 0.0013 | 0.0671 ± 0.0011 | 0.0659 ± 0.0014 | 0.0631 ± 0.0014 | 0.0654 ± 0.0011 | 0.0615 ± 0.0012 | 0.0641 ± 0.0011 |
| society | 0.0587 ± 0.0007 | 0.0596 ± 0.0009 | 0.0587 ± 0.0004 | 0.0597 ± 0.0009 | 0.0594 ± 0.0003 | 0.058 ± 0.0006 | 0.0586 ± 0.0007 | 0.0582 ± 0.0005 | 0.0589 ± 0.001 |
| average | 0.0722 | 0.0778 | 0.0751 | 0.0755 | 0.0788 | 0.0726 | 0.0764 | 0.074 | 0.0739 |

Bold indicates the best classification performance.
Table 5. Performance comparison results of nine methods on the CE metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 1.9849 ± 0.2984 | 2.9801 ± 0.434 | 2.3015 ± 0.2357 | 2.3129 ± 0.2443 | 2.7828 ± 0.1086 | 2.2297 ± 0.2913 | 2.6974 ± 0.4179 | 2.3396 ± 0.3783 | 2.3442 ± 0.4845 |
| yeast | 7.9796 ± 0.2589 | 9.0812 ± 0.506 | 8.7833 ± 0.2726 | 8.9352 ± 0.3673 | 9.0711 ± 0.3446 | 8.9035 ± 0.3516 | 8.9928 ± 0.3234 | 9.2325 ± 0.3927 | 8.9751 ± 0.332 |
| computers | 6.3673 ± 0.1149 | 7.5371 ± 0.5373 | 7.2455 ± 0.2641 | 7.1926 ± 0.2168 | 7.1822 ± 0.2216 | 7.1585 ± 0.2369 | 6.9352 ± 0.196 | 7.0722 ± 0.2511 | 7.1399 ± 0.2279 |
| health | 4.8104 ± 0.3185 | 6.228 ± 0.376 | 5.7394 ± 0.1555 | 5.7229 ± 0.1426 | 5.8251 ± 0.1802 | 5.7664 ± 0.1544 | 4.7298 ± 0.1364 | 5.8252 ± 0.1703 | 5.8838 ± 0.1783 |
| reference | 5.0328 ± 0.1062 | 5.949 ± 0.3156 | 5.6561 ± 0.1973 | 5.6117 ± 0.1147 | 5.6353 ± 0.1623 | 5.7063 ± 0.3418 | 5.6452 ± 0.3112 | 5.9714 ± 0.332 | 5.6892 ± 0.2235 |
| social | 5.4449 ± 0.1685 | 6.955 ± 0.4051 | 6.1474 ± 0.191 | 6.2101 ± 0.1976 | 6.0108 ± 0.3113 | 5.8175 ± 0.3354 | 5.9043 ± 0.2704 | 6.0501 ± 0.3354 | 6.028 ± 0.302 |
| medical | 5.1658 ± 0.4137 | 6.1604 ± 0.4141 | 6.3598 ± 0.4012 | 6.4201 ± 0.4025 | 8.3118 ± 0.1098 | 5.8078 ± 0.2927 | 5.7868 ± 0.2699 | 7.1824 ± 0.0631 | 5.8668 ± 0.6055 |
| entertain | 5.1016 ± 0.1243 | 5.9338 ± 0.5407 | 5.7088 ± 0.2277 | 5.6683 ± 0.2167 | 5.7602 ± 0.1751 | 5.5795 ± 0.2664 | 5.6386 ± 0.2137 | 5.7098 ± 0.2397 | 5.6899 ± 0.2259 |
| society | 7.8775 ± 0.2345 | 8.6349 ± 0.4791 | 8.4876 ± 0.2665 | 8.4146 ± 0.2669 | 8.5074 ± 0.2349 | 8.3791 ± 0.3111 | 8.3525 ± 0.3782 | 8.3738 ± 0.3093 | 8.4163 ± 0.3102 |
| average | 5.5294 | 6.6066 | 6.2699 | 6.2765 | 6.5652 | 6.1498 | 6.0758 | 6.4174 | 6.2259 |

Bold indicates the best classification performance.
Table 6. Performance comparison results of nine methods on the RL metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 0.1763 ± 0.0596 | 0.3751 ± 0.0865 | 0.2395 ± 0.0478 | 0.2415 ± 0.0493 | 0.3366 ± 0.0216 | 0.2249 ± 0.059 | 0.318 ± 0.0842 | 0.2467 ± 0.0759 | 0.248 ± 0.0968 |
| yeast | 0.2053 ± 0.0168 | 0.2703 ± 0.0302 | 0.2454 ± 0.0096 | 0.2548 ± 0.0183 | 0.2653 ± 0.0146 | 0.2564 ± 0.0149 | 0.2586 ± 0.0158 | 0.2678 ± 0.0204 | 0.2596 ± 0.0222 |
| computers | 0.1186 ± 0.0023 | 0.1515 ± 0.0146 | 0.1389 ± 0.006 | 0.1367 ± 0.0045 | 0.1398 ± 0.0061 | 0.1364 ± 0.0057 | 0.1307 ± 0.005 | 0.1364 ± 0.0067 | 0.1367 ± 0.0058 |
| health | 0.0729 ± 0.0083 | 0.1148 ± 0.0106 | 0.0979 ± 0.0043 | 0.0983 ± 0.0042 | 0.1013 ± 0.0051 | 0.0979 ± 0.0041 | 0.2005 ± 0.0062 | 0.1001 ± 0.0044 | 0.1028 ± 0.0049 |
| reference | 0.1069 ± 0.0032 | 0.0313 ± 0.0012 | 0.1254 ± 0.0065 | 0.124 ± 0.0039 | 0.1251 ± 0.005 | 0.1278 ± 0.0107 | 0.1256 ± 0.0097 | 0.1355 ± 0.0105 | 0.127 ± 0.0069 |
| social | 0.0886 ± 0.0039 | 0.0317 ± 0.0013 | 0.1043 ± 0.0041 | 0.1058 ± 0.0044 | 0.1025 ± 0.0075 | 0.0976 ± 0.0064 | 0.0983 ± 0.0061 | 0.1022 ± 0.0079 | 0.1011 ± 0.0068 |
| medical | 0.0726 ± 0.008 | 0.0897 ± 0.0093 | 0.0951 ± 0.0092 | 0.0963 ± 0.0091 | 0.1398 ± 0.0024 | 0.0833 ± 0.0064 | 0.0829 ± 0.006 | 0.1139 ± 0.0012 | 0.0848 ± 0.0131 |
| entertain | 0.1598 ± 0.006 | 0.2004 ± 0.0265 | 0.1885 ± 0.011 | 0.186 ± 0.0101 | 0.1896 ± 0.0082 | 0.1826 ± 0.0128 | 0.1855 ± 0.0102 | 0.1888 ± 0.0113 | 0.188 ± 0.0108 |
| society | 0.1845 ± 0.006 | 0.0596 ± 0.0009 | 0.2068 ± 0.0081 | 0.203 ± 0.0085 | 0.2075 ± 0.0081 | 0.2034 ± 0.0105 | 0.2032 ± 0.0134 | 0.203 ± 0.0102 | 0.2054 ± 0.0112 |
| average | 0.1317 | 0.1472 | 0.1602 | 0.1607 | 0.1786 | 0.1567 | 0.1782 | 0.1485 | 0.1647 |

Bold indicates the best classification performance.
Table 7. Performance comparison results of nine methods on the AP metric.

| Datasets | MLCFS | MIFS | D2F | PMU | SCLS | LRFS | FIMF | IDA | MFSJMI |
|---|---|---|---|---|---|---|---|---|---|
| scene | 0.7331 ± 0.07 | 0.4978 ± 0.0654 | 0.6169 ± 0.0503 | 0.6197 ± 0.0522 | 0.5129 ± 0.0171 | 0.6362 ± 0.0658 | 0.5443 ± 0.0803 | 0.6165 ± 0.0825 | 0.6258 ± 0.0873 |
| yeast | 0.7199 ± 0.0198 | 0.6441 ± 0.0408 | 0.6791 ± 0.0134 | 0.6728 ± 0.0199 | 0.653 ± 0.0197 | 0.6648 ± 0.019 | 0.6683 ± 0.0193 | 0.6524 ± 0.0253 | 0.6615 ± 0.0267 |
| computers | 0.6015 ± 0.0068 | 0.514 ± 0.0364 | 0.5407 ± 0.013 | 0.5402 ± 0.0158 | 0.5256 ± 0.0178 | 0.5416 ± 0.0164 | 0.5494 ± 0.0163 | 0.5481 ± 0.0155 | 0.5399 ± 0.017 |
| health | 0.6734 ± 0.0272 | 0.5407 ± 0.0308 | 0.5617 ± 0.0201 | 0.5583 ± 0.0138 | 0.5566 ± 0.0163 | 0.5594 ± 0.0203 | 0.6692 ± 0.0122 | 0.5578 ± 0.023 | 0.5539 ± 0.0226 |
| reference | 0.5824 ± 0.0099 | 0.5089 ± 0.0273 | 0.5204 ± 0.0186 | 0.505 ± 0.028 | 0.5111 ± 0.0176 | 0.5182 ± 0.0188 | 0.5233 ± 0.0205 | 0.5062 ± 0.0195 | 0.5209 ± 0.017 |
| social | 0.6428 ± 0.024 | 0.5183 ± 0.0254 | 0.5671 ± 0.0121 | 0.5628 ± 0.0134 | 0.5423 ± 0.0237 | 0.5731 ± 0.0173 | 0.5674 ± 0.0245 | 0.571 ± 0.0221 | 0.5671 ± 0.0246 |
| medical | 0.7295 ± 0.0368 | 0.6599 ± 0.0576 | 0.6056 ± 0.032 | 0.591 ± 0.026 | 0.4482 ± 0.0067 | 0.6532 ± 0.0268 | 0.6515 ± 0.0248 | 0.5055 ± 0.0021 | 0.6493 ± 0.0465 |
| entertain | 0.5279 ± 0.0246 | 0.4182 ± 0.0304 | 0.4319 ± 0.0179 | 0.4473 ± 0.0128 | 0.4361 ± 0.0094 | 0.4382 ± 0.018 | 0.4388 ± 0.0158 | 0.4229 ± 0.0152 | 0.4298 ± 0.0176 |
| society | 0.5256 ± 0.0071 | 0.4441 ± 0.0327 | 0.4865 ± 0.0092 | 0.4911 ± 0.0139 | 0.4684 ± 0.0149 | 0.4792 ± 0.0158 | 0.474 ± 0.0199 | 0.571 ± 0.0221 | 0.481 ± 0.0213 |
| average | 0.6373 | 0.5273 | 0.5567 | 0.5542 | 0.5171 | 0.5627 | 0.5651 | 0.5502 | 0.5588 |

Bold indicates the best classification performance.
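For reference, all four metrics reported in Tables 4–7 (HL, CE, RL, and AP) are available in scikit-learn. Below is a minimal, self-contained sketch; the binary-relevance KNN stand-in, the synthetic data shapes, and the variable names are our illustrative assumptions rather than the paper's experimental setup, and note that scikit-learn's coverage_error is the 1-based count of labels needed to cover all true labels:

```python
import numpy as np
from sklearn.metrics import (hamming_loss, coverage_error,
                             label_ranking_loss,
                             label_ranking_average_precision_score)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier

# Illustrative shapes: X_* are (n, d) arrays restricted to the selected
# feature subset; Y_* are (n, q) binary label matrices.
rng = np.random.default_rng(0)
X_train, Y_train = rng.random((100, 20)), rng.integers(0, 2, (100, 5))
X_test, Y_test = rng.random((40, 20)), rng.integers(0, 2, (40, 5))

# Binary-relevance KNN as a simple multi-label stand-in classifier.
clf = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=10))
clf.fit(X_train, Y_train)

Y_pred = clf.predict(X_test)
# Per-label probability of the positive class, used by the ranking metrics.
Y_score = np.column_stack([p[:, 1] for p in clf.predict_proba(X_test)])

print("HL:", hamming_loss(Y_test, Y_pred))
print("CE:", coverage_error(Y_test, Y_score))   # sklearn's CE is 1-based
print("RL:", label_ranking_loss(Y_test, Y_score))
print("AP:", label_ranking_average_precision_score(Y_test, Y_score))
```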
Table 8. Average ranks of the nine feature selection methods on the four evaluation metrics.

| Methods | Hamming Loss | Coverage Error | Ranking Loss | Average Precision |
|---|---|---|---|---|
| MLCFS | 1.78 | 1.11 | 1.33 | 1.11 |
| MIFS | 6.56 | 8.33 | 5.78 | 8 |
| D2F | 5.78 | 5.33 | 5.22 | 4.44 |
| PMU | 7.44 | 4.89 | 4.67 | 5.11 |
| SCLS | 7.67 | 6.56 | 7.33 | 7.33 |
| LRFS | 2.56 | 3.67 | 3.67 | 3.89 |
| FIMF | 4.89 | 3.33 | 4.78 | 3.89 |
| IDA | 3.67 | 6.33 | 6 | 5.67 |
| MFSJMI | 4.44 | 5.44 | 5.78 | 5.44 |
Table 9. Friedman statistics ($\chi_F^2$ and $F_F$) and the critical value.

| Evaluation Metrics | $\chi_F^2$ | $F_F$ | Critical Value |
|---|---|---|---|
| Hamming Loss | 38.9293 | 9.4172 | 2.102 |
| Coverage Error | 42.2351 | 11.3516 | 2.102 |
| Ranking Loss | 22.4270 | 3.6192 | 2.102 |
| Average Precision | 38.1521 | 9.0173 | 2.102 |
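As a quick arithmetic check of Table 9, the $F_F$ values follow from $\chi_F^2$ via the standard Iman–Davenport correction of the Friedman test, with $N = 9$ datasets and $k = 9$ methods:

```latex
% Iman--Davenport correction: N(k-1) = 72 and (N-1) = 8.
\[
  F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}
      = \frac{8 \times 38.9293}{72 - 38.9293}
      \approx 9.4172
\]
% This matches the Hamming Loss row of Table 9; F_F is then compared
% against the critical value of the F(k-1, (k-1)(N-1)) = F(8, 64)
% distribution.
```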