Article

Dynamic Online Label Distribution Feature Selection Based on Label Importance and Label Correlation

by Weiliang Chen 1, Xiao Sun 2 and Fuji Ren 3,*
1 Multimodal Affective Computing Lab, Hefei University of Technology, Hefei 230001, China
2 School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230001, China
3 The College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1466; https://doi.org/10.3390/app15031466
Submission received: 26 September 2024 / Revised: 11 December 2024 / Accepted: 19 December 2024 / Published: 31 January 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: Existing feature selection methods mainly target single-label learning and multi-label learning, and only a few algorithms are optimized for label distribution learning. In label distribution learning, the labels associated with each sample have different levels of importance; therefore, multi-label feature selection algorithms cannot be directly applied to label distribution learning, and discretizing label distribution data into multi-label data causes part of the supervision information to be lost. Moreover, in most practical applications of label distribution learning, the feature space is not defined in advance, and the features arrive as streaming features. To address these problems, this paper applies fuzzy rough set theory within a streaming feature framework and proposes a dynamic label distribution feature selection algorithm that handles streaming features. Experimental results show that the proposed method is more effective than six state-of-the-art feature selection algorithms on 12 datasets with respect to six representative evaluation metrics.

1. Introduction

In the field of machine learning and data mining [1], the problem of label ambiguity is currently a popular research topic. There are currently two relatively mature machine learning paradigms, namely, single-label learning (SLL) and multi-label learning (MLL) [2]. SLL achieves good results in the field of machine learning when the target instance has a clear single-class label. MLL is an extension of SLL. In real life, an object may be associated with several class labels. To intuitively reflect the characteristics of polysemous objects, the most obvious approach is to assign multiple category labels to each example of a polysemous object, that is, to use MLL [3,4].
Because it is constrained by label semantics, traditional multi-label classification assumes that the relevant labels all have the same degree of relevance to the description of the example, which obviously does not meet the complex cognitive needs of humans [5]. In layman's terms, MLL only resolves the ambiguity of which labels are associated with an instance. However, in real life, we need to solve the more common and more ambiguous question of to what degree each label describes the instance. For example, in facial-expression emotion recognition, it is more meaningful to understand the degree to which each emotion describes the instance. To solve these problems, a new learning paradigm has been proposed in recent years: label distribution learning (LDL) [6], which is an extension of MLL. In LDL, the degree to which each label describes an instance is represented by the corresponding value in the label distribution, called the description degree. This value explicitly indicates the relative importance of the label. Both SLL and MLL can be regarded as special cases of LDL. Figure 1 shows the three learning paradigms within the LDL framework.
Figure 2 shows an image described using a label distribution over the labels "HAP", "SAD", "SUR", "ANG", "DIS", and "FER". LDL has richer label information and clearly describes the importance of each label [7]. At present, it has been applied in multiple domains. For example, the adaptive label distribution learning method proposed by Geng et al. achieved good results in facial age estimation [8]; Zhang et al. used LDL for crowd counting in public video surveillance [9]; and Geng and Hou used LDL to predict crowd opinion on movies before their release [10]. LDL has also been widely used in fields such as emotion recognition and multilingual learning.
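To make the notion of a description degree concrete, the following minimal Python sketch (the numeric degrees are illustrative, not the values in Figure 2) represents one instance's label distribution over these six emotion labels and checks the LDL constraint that the degrees are non-negative and sum to one.

```python
import numpy as np

# Hypothetical description degrees for one image over six emotion labels.
labels = ["HAP", "SAD", "SUR", "ANG", "DIS", "FER"]
distribution = np.array([0.45, 0.05, 0.30, 0.05, 0.05, 0.10])

# LDL constraints: every degree lies in [0, 1] and the degrees sum to 1.
assert np.all((distribution >= 0) & (distribution <= 1))
assert np.isclose(distribution.sum(), 1.0)

# SLL is the special case with a single degree equal to 1; MLL spreads the
# mass (e.g., uniformly) over the relevant labels only.
for name, degree in zip(labels, distribution):
    print(f"{name}: {degree:.2f}")
```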
Similar to traditional SLL and MLL, LDL also faces many challenges. In terms of the data structure of label instances, the problems faced by LDL tasks include a large number of feature dimensions [11], large number of labels [12], label imbalance [13], and streaming features [14]. In LDL tasks, the dimensionality of the data is very large, with usually thousands or tens of thousands of dimensions [15,16]. This “curse of dimensionality” leads to a series of problems such as reduced classification accuracy and generalization ability in high-dimensional space, and it increases the computational cost of many learning algorithms. In addition, too many features may contain redundant or irrelevant features, which may lead to more problems such as the excessive consumption of computing resources and degradation of model performance. To this end, many dimensionality reduction methods have been proposed [17,18]. However, they are mainly designed for SLL and MLL algorithms, and are rarely suitable for LDL algorithms. Therefore, there is an urgent need to develop new and effective LDL dimensionality reduction methods.
There are two main methods for reducing the dimensionality of label distribution data: feature extraction and feature selection. Feature extraction methods use space mapping or space transformation techniques to reduce the dimensionality of the feature space. However, these methods destroy the structural information of the original feature space, blur the physical meaning of the features, and lack semantic interpretation. In contrast, feature selection methods do not perform any feature space transformation or mapping; instead, they retain the original spatial structure. Feature selection methods sort the features in the original feature space by importance, select the subspace that best represents the semantic features of that space, and use this subspace to maximally represent the original feature space [19]. Therefore, feature selection methods well preserve the physical meaning of the feature space, which is an advantage over feature extraction methods [20]. From the perspective of interaction with learning algorithms, existing MLL feature selection algorithms are mainly divided into three types: filter, wrapper, and embedded methods [21]. Each type has different relative advantages. Filter methods are typically more efficient, less computationally expensive, and more general than embedded and wrapper methods; therefore, this work focuses on filter methods.
In the past few decades, many feature selection algorithms have been proposed; for label distribution learning, related studies [22,23] consider the correlation between features and labels based on fuzzy rough sets. In addition, because labels are correlated, the performance of label distribution models depends largely on label correlation. For example, the method in [24] exploits label correlation to reduce data dimensionality under the assumption that all samples share the same label correlation. However, in practical applications, different sample groups may exhibit different label correlations. Alternatively, when dealing with problems related to uncertainty, rough set theory [25,26] naturally has many advantages and has been widely used in SLL and MLL feature selection algorithms [27,28]. Chen et al. [29] designed a parallel feature selection method based on neighborhood rough sets (NRSs), which considers the partial order between features and values. Yuan et al. [30] proposed a generalized unsupervised feature selection model based on fuzzy rough sets. However, to the best of our knowledge, there are relatively few feature selection methods based on rough set theory for label distribution data. The main problem rough sets face in label distribution feature selection is how to deal with distributed labels.
In most practical applications of LDL, the feature space is usually uncertain, and features arrive gradually over time, like a stream of feature vectors. Such features are called streaming features. For example, on the social networking platform Twitter, hot topics change dynamically over time. When a hot topic appears, it is always accompanied by a set of new keywords. These fresh keywords can be used as key features to distinguish hot topics. Streaming feature selection assumes that features arrive dynamically over time [31] and performs feature selection when each feature arrives so as to always maintain the optimal subset of features [32,33]. So far, many methods for processing streaming features have been proposed. For example, Chen et al. [34] proposed using global features to process streaming feature data, theoretically analyzed the pairwise correlation between features in the currently selected feature subset, and solved the streaming feature problem using online pairwise comparison techniques. NRSs can handle mixed types of data without destroying the neighborhood and ordering structure of the data [35]. Moreover, feature selection based on NRSs does not require any prior knowledge of the feature space structure, and hence it is an ideal tool for online streaming feature selection.
This paper proposes a dynamic online label distribution feature selection model based on label correlation and NRS, and proposes a dynamic label distribution feature selection algorithm for processing stream features. Mutual information is widely used to measure the degree of dependence between random variables and is an effective indicator for evaluating the discriminative ability of candidate features in the feature selection process. The proposed method uses the mutual information method to process the label space and obtain the correlation between labels through mutual information and graph characteristics. In addition, a new label space neighborhood relationship is proposed, in which the neighborhood class of the instance is constructed in the label space, replacing the calculation of the traditional logical label equivalence class. Simultaneously, the nearest neighbor distribution of the surrounding instances is used to calculate the neighborhood of the instance in the feature space, avoiding the problem of neighborhood granularity selection faced by the traditional NRS model. On this basis, the NRS model is extended to fit the label distribution data, and the corresponding feature dependency is redefined. Combining label correlation and NRS, a dynamic feature selection algorithm framework is proposed. The main contributions of this paper are as follows:
(1)
The average nearest neighbor method is used to calculate a new form of neighborhood granularity, and the mutual information method is used to calculate the label weight to obtain the label correlation. By combining the neighborhood granularity and label correlation, a new NRS relationship and feature importance model is constructed.
(2)
The traditional NRS model is generalized to adapt it to LDL.
(3)
Using the above model, a new label distribution stream feature selection algorithm is proposed that combines the new NRS model with the stream feature online importance update framework. As a result, it can better handle the label distribution stream feature problem.
The rest of this paper is organized as follows. Section 2 reviews related work on LDL, label correlation, and NRS. Section 3 introduces the preliminary concepts, including the LDL framework and NRS theory. In Section 4, we propose a dynamic feature selection algorithm based on label correlation and NRS. We report our experimental results in Section 5. Finally, Section 6 summarizes this paper and discusses future work.

2. Related Work

2.1. LDL

At present, the mainstream LDL algorithms are mainly divided into three categories. The first is the problem transformation method, which converts the label distribution data into weighted single-label data or multi-label data and processes the data with traditional single-label or multi-label algorithms; for example, Geng proposed two representative problem transformation algorithms, PT-SVM and PT-Bayes [36], which transform the LDL problem into an SLL problem based on a support vector machine and Bayes' theorem, respectively. The second category is the algorithm adaptation method, which naturally extends an existing learning model to directly process the label distribution; AA-kNN and AA-BP [37], which adapt k-nearest neighbors and the BP neural network, together with Duo-LDL [38] and LALOTP [39], are representative methods of this kind. The third category consists of algorithms specifically designed to solve the LDL problem; representative algorithms include IIS-LLD and CPNN [39], which were applied to facial age estimation using a label distribution.

2.2. Label Correlation

In the past few years, researchers have proposed various methods to improve the recognition rate of classifiers by incorporating label correlation, and a large number of studies have convincingly demonstrated that exploiting label correlation can significantly improve classifier performance. For example, in the Movie dataset, we often observe that "science fiction" movies tend to be associated with the "adventure" label, while some movies also have the "humor" label. This pattern of co-occurrence between "science fiction" and "adventure" signals the global label correlation between them. In addition, the frequent co-occurrence of "adventure" and "humor" in some movies indicates the local correlation between them. A deeper understanding of the relationships among labels can provide guidance for building more accurate multi-label classifiers. The method proposed by Zhu et al. [40] simultaneously handles complete and missing labels, exploiting global and local label correlations by learning latent label representations and optimizing label manifolds. However, these methods may still have a limited ability to handle missing labels.

2.3. NRS

Rough set-based methods are another popular class of feature selection algorithm that can maintain the consistency of the original features using the smallest feature subset size. Few methods extend rough sets to NRS to deal with label distribution. The primary difficulty is the selection of the neighborhood granularity. Wang et al. [23] combined information theory with fuzzy rough sets to construct a label distribution feature selection method with maximum relevance and minimum redundancy. Qian et al. [41] constructed a label enhancement algorithm using probability and rough set models, and designed the MFS algorithm based on the degree of feature dependency.

3. Preliminaries

In this section, we discuss some basic definitions of LDL and NRS. Based on NRS theory, we construct a neighborhood rough set decision system by partitioning the neighborhoods of the sample set and constructing the corresponding lower approximation.

3.1. LDL

3.1.1. LDL Framework

We define X = R^N to represent the N-dimensional sample space, U = {x_1, x_2, …, x_i, …, x_n} the instance set, and L = {l_1, l_2, …, l_m} the m-dimensional label space. For an instance x_i ∈ U, x_i = (F_i1, F_i2, …, F_id) is a d-dimensional feature vector. We denote the label vector Y = (y_1, y_2, …, y_j, …, y_m), where y_j = (d_{x_1}^{y_j}, d_{x_2}^{y_j}, …, d_{x_n}^{y_j}) and, for each instance x_i, the description degrees satisfy Σ_j d_{x_i}^{y_j} = 1. According to the definition of LDL, both SLL and MLL can be regarded as special cases of LDL. The task of LDL is to find a mapping f: X → Y. In the logical-label special case, when x_i contains the label l_j, the corresponding label value is 1; otherwise, it is 0 [42].

3.1.2. NRS

Assume a decision system NDT = (U, C, D), where U = {x_1, x_2, …, x_n} represents a non-empty set of instances (that is, the set composed of all samples), C = {a_1, …, a_N} represents the attribute set corresponding to the samples, and D represents the set of decision attributes.
For a given parameter δ and feature set C, the δ-neighborhood relation on U can be determined. We call such a decision system a neighborhood decision system, denoted as NDS = (U, C ∪ D, δ).
Definition 1. 
Given an N-dimensional real space Ω and Δ: R^N × R^N → R, we say that Δ is a metric on R^N if Δ satisfies the following constraints [43]:
(1) Δ(x_1, x_2) ≥ 0, and Δ(x_1, x_2) = 0 if and only if x_1 = x_2, ∀ x_1, x_2 ∈ R^N;
(2) Δ(x_1, x_2) = Δ(x_2, x_1), ∀ x_1, x_2 ∈ R^N;
(3) Δ(x_1, x_3) ≤ Δ(x_1, x_2) + Δ(x_2, x_3), ∀ x_1, x_2, x_3 ∈ R^N.
Definition 2. 
For x_i ∈ U and a feature subset B ⊆ C, we define the δ-neighborhood of x_i with respect to B as follows [43]:
δ_B(x_i) = {x_j | x_j ∈ U, Δ_B(x_i, x_j) ≤ δ}
Here, x_i and x_j are two different elements of the set U, δ > 0, and the instance set is granulated by Δ_B(x_i, x_j). We call (Ω, Δ_B) the metric space and δ_B(x_i) the δ-neighborhood information particle generated by x_i. In two-dimensional real space, the neighborhoods based on the 1-norm, 2-norm, and infinity norm are the diamond, circular, and square regions shown in Figure 3, respectively.
In this manner, we granulate the neighborhood of all objects in the universal space.
From the neighborhood information particle clusters {δ(x_i) | i = 1, 2, …, n}, we obtain a neighborhood relation N on the universal space U. This relation can be represented by a relation matrix M(N) = (r_ij)_{n×n}: if x_j ∈ δ(x_i), then r_ij = 1; otherwise, r_ij = 0. For neighborhood relations with δ_1 ≤ δ_2, we have
(1) ∀ x_i ∈ U: δ_1(x_i) ⊆ δ_2(x_i);
(2) N_1 ⊆ N_2.
The neighborhood information particle clusters defined in this manner constitute the basic concept system in the universal space.
Definition 3. 
Given a non-empty finite set U = {x_1, x_2, …, x_n} in the real space and a neighborhood relation N on U, we call the two-tuple NAS = (U, N) a neighborhood approximation space [43].
For example, let U = {x_1, x_2, …, x_5}, let a be an attribute of U, and let f(x, a) denote the value of sample x on attribute a, with f(x_1, a) = 1.1, f(x_2, a) = 1.2, f(x_3, a) = 1.6, f(x_4, a) = 1.8, and f(x_5, a) = 1.9. When the specified neighborhood size is 0.2, since |f(x_1, a) − f(x_2, a)| ≤ 0.2, we have x_2 ∈ δ(x_1) and x_1 ∈ δ(x_2).
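A small sketch of this worked example, assuming the absolute difference on the single attribute as the metric Δ; the neighborhood size δ = 0.2 and the attribute values are taken from the text.

```python
import numpy as np

values = np.array([1.1, 1.2, 1.6, 1.8, 1.9])  # f(x_1..x_5, a) from the example
delta = 0.2

def delta_neighborhood(i, values, delta):
    """Indices j with |f(x_i, a) - f(x_j, a)| <= delta (Definition 2, one-dimensional case)."""
    return np.where(np.abs(values - values[i]) <= delta)[0]

for i in range(len(values)):
    members = [f"x_{j + 1}" for j in delta_neighborhood(i, values, delta)]
    print(f"delta(x_{i + 1}) =", members)
# e.g. delta(x_1) contains x_1 and x_2, matching the example above.
```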
Definition 4.  
For a given decision system NDT = (U, C, D) and X ⊆ U, the lower and upper approximations of X in the neighborhood approximation space NAS = (U, N) are defined as follows [44]:
(1) N̲X = {x_i | δ(x_i) ⊆ X, x_i ∈ U}
(2) N̄X = {x_i | δ(x_i) ∩ X ≠ ∅, x_i ∈ U}
where N̲X is also referred to as the positive region of X in the approximation space NAS = (U, N) and is the largest union of neighborhood information particles that are completely contained in X.
Definition 5. 
For a neighborhood decision system NDT = (U, A, D, δ), D partitions U into N equivalence classes X_1, X_2, …, X_N. For B ⊆ A, the lower and upper approximations of the decision attribute D with respect to B are defined as [45]
N_B̲(D) = ∪_{i=1}^{N} N_B̲(X_i)
N̄_B(D) = ∪_{i=1}^{N} N̄_B(X_i)
where δ_B(x_i) is the neighborhood information particle generated by attribute subset B and metric Δ.
Figure 4 provides a geometric illustration of the classical rough set model, where the equivalence classes marked in red belong entirely to X and form its lower approximation. In this respect, the classical and neighborhood rough set models are consistent.
The lower approximation of decision attribute D, also called the decision-positive region, is denoted by POS ( D ) .
The size of the positive region reflects the degree to which the classification problem is separable in the given attribute space: a larger positive region indicates less overlap (i.e., a smaller boundary region) between the categories. We can describe such classification problems in more detail using the following set of attributes:
POS_B(D) = {x_i | δ_B(x_i) ⊆ D, x_i ∈ U}
Definition 6. 
Suppose that A and B are two sets. We define the degree to which A is contained in B, denoted I(A, B), as follows [46]:
I(A, B) = Card(A ∩ B) / Card(A)
When A = ∅ or B = ∅, we define I(A, B) = 0. Here, I(A, B) reflects the importance of B to A.
The dependency of decision attribute D on condition attribute B is defined as follows [47]:
γ_B(D) = Card(N_B̲(D)) / Card(U)
where γ_B(D) denotes the proportion of samples in the sample set whose decision can be determined with certainty according to the description of condition attribute B.
The positive region of the decision is larger if decision attribute D is more dependent on condition attribute B.
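The following minimal sketch, assuming numeric features, a Euclidean metric, and a crisp decision attribute, illustrates how the δ-neighborhoods, the decision-positive region, and the dependency γ_B(D) = Card(POS_B(D)) / Card(U) could be computed; the data and names are illustrative, not from the paper.

```python
import numpy as np

def neighborhoods(X, delta):
    """delta-neighborhood of every sample under the Euclidean metric (Definition 2)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return [np.where(dist[i] <= delta)[0] for i in range(len(X))]

def positive_region(X, y, delta):
    """Samples whose whole neighborhood shares their decision class (lower approximation)."""
    nbrs = neighborhoods(X, delta)
    return [i for i, nb in enumerate(nbrs) if np.all(y[nb] == y[i])]

def dependency(X, y, delta):
    """gamma_B(D) = |POS_B(D)| / |U|."""
    return len(positive_region(X, y, delta)) / len(X)

# Toy data: two well-separated classes give a dependency close to 1.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y = np.array([0, 0, 0, 1, 1])
print(dependency(X, y, delta=0.15))
```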

4. Proposed Method

4.1. Improvement of NRS

Given a decision system NDT = (U, C, D), U = {x_1, x_2, …, x_n} represents a non-empty set of instances, C represents the feature set corresponding to the instance set, and D represents the decision attribute set DL = {DL_1, DL_2, …, DL_m}. Traditional single-label and multi-label neighborhood information particle partitioning methods are not suitable for label distribution data. For general data, a group of instances with the same attribute value or label value is called an equivalence class. Similarly, for mixed data, a group of instances with similar attribute values or label values is called a neighborhood class. In the method proposed in this paper, the sample margin is used to determine the neighborhood granularity.
Definition 7. 
Given a sample x, the margin of x relative to a set of samples U is defined as follows [12]:
m(x) = Δ(x, NS(x)) − Δ(x, NT(x))
where NS(x) denotes the instance from U that has the shortest distance from x and whose label class is different from that of x, and NT(x) denotes the instance from U that has the shortest distance from x and has the same label class as x. We call these instances the nearest miss and the nearest hit, respectively. Moreover, Δ(x, NS(x)) denotes the distance between x and NS(x), and Δ(x, NT(x)) denotes the distance between x and NT(x). We call δ(x) = {y | Δ(x, y) ≤ m(x)} the neighborhood particle of x. To facilitate the setting of neighborhood information particles, we set m(x) = 0 when m(x) < 0.
A sample may have a positive or negative effect on different labels. Thus, for a given sample, the degree of granularity depends on the label used.
Definition 8. 
For a sample x and label l_k ∈ L, the margin of x with respect to l_k is m_{l_k}(x) = Δ_{l_k}(x, NS_{l_k}(x)) − Δ_{l_k}(x, NT_{l_k}(x)).
As noted above, each sample has a different label and, correspondingly, a different granularity. Depending on the different decision views, we need to combine all the single-label granularities of a given sample to form a label distribution granularity [44]. Therefore, in this paper, we choose the average granularity (i.e., the average nearest neighborhood, also known as the neutral view) to represent the label distribution granularity of a sample [12]:
m_neu(x) = (1/|L|) Σ_{k=1}^{|L|} m_{l_k}(x)
To solve the problem of selecting the granularity δ, combining Equations (1) and (9), the new neighborhood of a sample is defined as
δ_B(x_i) = {x_j | x_j ∈ U, Δ_B(x_i, x_j) ≤ m_neu(x_i)}
We define a new neighborhood information particle to solve the problem of selecting the neighborhood granularity, which is caused by label distribution data. In addition, the average nearest neighbor reflects the relationship between features in an instance. This new neighborhood model considers the relationships between features and is based on improved neighborhood information.
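A minimal sketch of the margin-based granularity of Definitions 7 and 8 and the neutral-view average of Equation (9), assuming that each label provides a crisp class for determining the nearest hit and nearest miss; negative margins are clipped to zero as in Definition 7, and all names are illustrative.

```python
import numpy as np

def label_margin(X, y_label, i):
    """m_l(x_i): distance to the nearest miss minus distance to the nearest hit
    for one label (Definition 8), clipped at zero (Definition 7)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                              # exclude the sample itself
    same = (y_label == y_label[i])
    hit = d[same].min() if same.sum() > 1 else 0.0
    miss = d[~same].min() if (~same).any() else 0.0
    return max(miss - hit, 0.0)

def neutral_margin(X, Y_classes, i):
    """m_neu(x_i): average of the per-label margins (the neutral view, Equation (9))."""
    return np.mean([label_margin(X, Y_classes[:, k], i) for k in range(Y_classes.shape[1])])

def neighborhood(X, Y_classes, i):
    """delta_B(x_i) = { x_j : Delta_B(x_i, x_j) <= m_neu(x_i) }  (Equation (10))."""
    d = np.linalg.norm(X - X[i], axis=1)
    return np.where(d <= neutral_margin(X, Y_classes, i))[0]
```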
Definition 9. 
Given x_i ∈ U and a label space DL = {DL_1, DL_2, …, DL_m} with DL_j ∈ DL, the label space granularity on label DL_j is defined as follows:
Θ_{DL_j}(x_i) = {x_j | Δ_{DL_j}(x_i, x_j) ≤ θ_{DL_j}, x_j ∈ U}, if d_{x_i}^{L_j} ≠ 0; Θ_{DL_j}(x_i) = {x_j | Δ_{DL_j}(x_i, x_j) = 0, x_j ∈ U}, if d_{x_i}^{L_j} = 0,
where θ_{DL_j} = Σ_{j=1}^{m} Σ_{i=1}^{n} d_{x_i}^{L_j} / (m × n). Here, Θ_{DL_j}(x_i) denotes the label-space neighborhood centered on x_i, and θ_{DL_j} represents its radius: a larger θ_{DL_j} value admits samples farther from x_i. When d_{x_i}^{L_j} = 0, the neighborhood granularity degenerates to the equivalence class.
Example 1. 
Continuing with Table 1, assume the 2-norm is applied to Δ_{DL_j}(x_i, x_j). Taking DL_1 as an example, we can compute Θ_{DL_1}(x_i) as: Θ_{DL_1}(x_1) = {x_1, x_2, x_9}, Θ_{DL_1}(x_2) = {x_1, x_2, x_3, x_5, x_9}, Θ_{DL_1}(x_3) = {x_2, x_3, x_4, x_5, x_7}, Θ_{DL_1}(x_4) = {x_3, x_4, x_5, x_7}, Θ_{DL_1}(x_5) = {x_2, x_3, x_4, x_5, x_7}, Θ_{DL_1}(x_6) = {x_6, x_8, x_10}, Θ_{DL_1}(x_7) = {x_3, x_4, x_5, x_7}, Θ_{DL_1}(x_8) = {x_6, x_8, x_10}, Θ_{DL_1}(x_9) = {x_1, x_2, x_9}, Θ_{DL_1}(x_10) = {x_6, x_8, x_10}.
Based on the above definition, the neighborhood granules of the instances in the label space form a granularity system that covers the instance set. We can then summarize the following properties:
(1) Θ_{DL_j}(x_i) ≠ ∅, since x_i ∈ Θ_{DL_j}(x_i);
(2) ∪_{i=1}^{n} Θ_{DL_j}(x_i) = U.
Definition 10. 
In the neighborhood decision system NDS = (U, C ∪ D), U = {x_1, x_2, …, x_n} represents the sample space and DL = {DL_1, DL_2, …, DL_m} represents the label space of the label distribution. By defining the multi-label decision space, we can expand the decision-positive region of the single-label decision in Equation (4). For a feature subset B ⊆ C, the lower approximation of the decision attribute DL_j with respect to B is as follows:
POS_B(DL_j) = N_B̲(DL_j) = {x_i | δ_B(x_i) ⊆ Θ_{DL_j}(x_i), x_i ∈ U}
In this equation, the neighborhood particle δ_B(x_i) is obtained from Equation (10), and the label particle Θ_{DL_j}(x_i) is obtained from Definition 9. Through this method, we extend the NRS model and solve the problem of selecting the neighborhood granularity of rough sets in label distribution learning.
Example 2. 
Continuing with Table 1, take L_1 as an example and consider the feature set B = {f_1, f_2}. We can then compute δ_B(x_i) according to Equation (10): δ_B(x_1) = {x_1, x_4, x_8}, δ_B(x_2) = {x_2, x_3, x_5}, δ_B(x_3) = {x_3, x_5}, δ_B(x_4) = {x_4, x_5, x_8}, δ_B(x_5) = {x_1, x_5, x_3}, δ_B(x_6) = {x_6, x_9}, δ_B(x_7) = {x_7, x_3}, δ_B(x_8) = {x_8, x_1}, δ_B(x_9) = {x_9, x_6}, δ_B(x_10) = {x_10, x_9}.
According to Definition 10, using Examples 1 and 2, we can calculate POS_B(DL_1) = {x_2, x_3, x_7}.
In LDL, because the labels of each instance are always related in some way, it is necessary to consider the importance of labels and the correlation between them.
Definition 11. 
For a sample x_i and the corresponding label vector Y_i, that is, D = {(x_i, Y_i) | 1 ≤ i ≤ N}, x_i ∈ U, Y_i ∈ L, N is the number of instances in the training set, and l_i, l_j ∈ L (1 ≤ i, j ≤ k) are any two labels in the label space L. The correlation between l_i and l_j is calculated using the mutual information as follows:
MI(l_i, l_j) = Σ_{k=1}^{M} Σ_{q=1}^{M} P(l_i^k, l_j^q) log [P(l_i^k, l_j^q) / (P(l_i^k) P(l_j^q))]
A weighted undirected label graph WUG = (V, E, W) can be constructed by applying Equation (12). Here, V = L = {l_1, l_2, …, l_m} represents the set of nodes of the undirected graph, E = {(l_i, l_j) | l_i, l_j ∈ L} represents its set of edges, and w(l_i, l_j) = MI(l_i, l_j) represents the weight of each edge [48]. The importance of each node in this undirected graph is defined as follows:
LW(l_i) = (1 − d) + d Σ_{l_j ∈ SN(l_i)} [w(l_i, l_j) / SW(l_j)] LW(l_j)
SW(l_j) = Σ_{l_k ∈ SN(l_j)} w(l_j, l_k)
In the above equations, LW(l_i) and LW(l_j) represent the weights of nodes l_i and l_j, respectively; SN(l_i) is the set of nodes with edges to label l_i; and w(l_i, l_j) = MI(l_i, l_j) represents the correlation between nodes. Equation (14) is used to calculate SW(l_j), which denotes the sum of the correlations over all edges starting from l_j. In addition, d is the damping coefficient, with d = 0.85 the recommended setting according to the method in Ref. [49]. For ease of calculation, an initial weight can be set for all nodes; this is usually 1/|L|, where |L| is the total number of nodes, that is, the total number of labels. Using this algorithm, we can calculate the correlation between node l_i (i.e., label l_i) and the other nodes l_j related to it, as well as the structure of the graph WUG. Through label correlation, we obtain the weight of each label in the label space, and this completes the exploration of label correlation.
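The following sketch illustrates this label-weighting step, assuming the label distribution columns are first discretized so that the discrete mutual information of Equation (12) can be estimated, and using a damped iterative update of the form of Equation (13) with d = 0.85 and an initial weight of 1/|L|; the equal-width binning and helper names are assumptions, not part of the original method.

```python
import numpy as np

def mutual_information(a, b, bins=5):
    """Discrete MI between two label columns after equal-width binning (an assumed
    discretization; Equation (12) is defined over discrete label values)."""
    edges_a = np.histogram_bin_edges(a, bins)[1:-1]
    edges_b = np.histogram_bin_edges(b, bins)[1:-1]
    a_d, b_d = np.digitize(a, edges_a), np.digitize(b, edges_b)
    joint = np.histogram2d(a_d, b_d, bins=bins)[0] / len(a)
    pa, pb = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])).sum())

def label_weights(Y, d=0.85, iters=50):
    """Iterative label importance LW on the weighted undirected label graph (Equation (13))."""
    m = Y.shape[1]
    W = np.array([[mutual_information(Y[:, i], Y[:, j]) if i != j else 0.0
                   for j in range(m)] for i in range(m)])
    SW = W.sum(axis=1) + 1e-12            # SW(l_j): sum of edge weights attached to l_j
    LW = np.full(m, 1.0 / m)              # initial weight 1/|L| for every label
    for _ in range(iters):
        LW = (1 - d) + d * (W / SW) @ LW  # LW_i = (1-d) + d * sum_j w_ij * LW_j / SW_j
    return LW
```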

4.2. Feature Selection Based on NRSs

The processing method for the label distribution neighborhood decision system is similar to that of the multi-label decision system. By extending the rough set importance theory to label distribution data (Equation (6)), and combining the positive-region theory of the label distribution decision (Equation (12)) with the label correlation (Equation (14)), we obtain the importance of the decision attribute D = L = {l_1, l_2, …, l_m} with respect to the feature subset B (B ⊆ C):
γ_B(DL) = Σ_{L_j ∈ L} [Card(POS_B(DL_j)) · LW(l_j)] / Card(U)
The above equation reflects the importance of the decision attributes corresponding to the decision positive domain and feature subset. It solves the problem of granularity selection and feature association of label distribution NRS.
In the neighborhood decision system NDS = (U, C ∪ D, δ), for a feature subset B ⊆ C and a ∈ C − B, the importance of a with respect to B is defined as follows:
SIG(a, B, D) = γ_{B∪{a}}(D) − γ_B(D)
In the new importance model, we add label importance and label relevance to the NRS model. The new NRS model thus reflects the fusion of feature information and label relevance. Based on this model, we construct a method to compute a reduced feature set (reduct) using a greedy forward search.
In this method, steps 1 and 2 perform the preprocessing when the label distribution data arrive. The reduced set starts from the empty set, and the label weights LW(L) are calculated for the entire label space. This step requires traversing the entire label space and constructing an undirected graph. Assuming that the number of labels in the label space is |L|, the time complexity of calculating the correlation between each pair of labels is O(|L|²), and the time complexity of calculating the weight of each label is O(1). Therefore, the time complexity of steps 1 and 2 is O(|L|² + 1) = O(|L|²). Steps 3–6 are divided into two parts: calculating the neighborhood of each instance, and analyzing whether the instance and its neighborhood are important. First, the nearest hit and nearest miss are determined for each instance, with the average nearest neighborhood selected as the neighborhood granularity criterion. Assuming the instance space is U, the time complexity of this step is O(|U|²). Next, the neighborhood corresponding to each instance and the label granularity of the label space are determined, and then the decision-positive region and attribute importance are calculated. The time complexity of determining the neighborhood of each instance is O(|U| log |U|); the time complexity of determining the label granularity of the label space is O(|U|²); and the time complexity of determining the decision-positive region and calculating the importance for each sample is O(1). Therefore, the overall time complexity of the instance neighborhood calculation is O(|U|² + |U|² + |U| log |U| + 1 + 1) = O(|U|²). Consequently, the time complexity of Algorithm 1 is O(|L|² + |U|²).
Algorithm 1 Calculate S I G ( a , r e d , D ) .
Input: 
neighborhood decision system U , C D .
Output: 
SIG ( a , r e d , D )
 1:
use Equation (5) to calculate the weight matrix L W ( L ) ;
 2:
Initialize reduct ;
 3:
if reduct =  then
 4:
     POS ( D ) = 0 ;
 5:
end if
 6:
if reduct  then
 7:
     x i U , the average approximate neighbor d r e d x i under a is calculated by Equation (2);
 8:
    For x i U , calculate the neighborhood δ red x i using Equation (3);
 9:
    For x i U , the label space granularity Θ D L j x i on label D L j is calculated by Equation (7);
 10:
   For the label l j of x i , POS B D L j is calculated by Equation (7);
 11:
   Calculate γ B D L by Equation (8);
 12:
end if
 13:
For a ∈ C − reduct, repeat steps 3.1–3.5 to calculate γ_{red∪a}(D);
 14:
Calculate SIG(a, red, D) = γ_{red∪a}(D) − γ_{red}(D);
 15:
Output SIG ( a , red, D )
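As a rough illustration of how the importance computation of Algorithm 1 could be realized, the following sketch computes γ_B(DL) from Equation (15) and SIG(a, red, DL); the fixed-radius neighborhood is a simplification of the margin-based neighborhood of Equation (10), and Theta and LW are assumed to be the precomputed label-space granules and label weights from the sketches above.

```python
import numpy as np

def delta_nbr(XB, i, radius):
    """Fixed-radius Euclidean neighborhood on feature subset B (a simplification of
    the margin-based neighborhood delta_B(x_i) in Equation (10))."""
    d = np.linalg.norm(XB - XB[i], axis=1)
    return np.where(d <= radius)[0]

def gamma(B, X, Theta, LW, radius=0.3):
    """gamma_B(DL): label-weighted fraction of samples whose feature-space neighborhood
    is contained in their label-space granule Theta[j][i] (Equation (15))."""
    if not B:
        return 0.0
    n, m = len(X), len(LW)
    total = 0.0
    for j in range(m):
        pos = [i for i in range(n)
               if set(delta_nbr(X[:, B], i, radius)) <= set(Theta[j][i])]
        total += len(pos) * LW[j]
    return total / n

def sig(a, red, X, Theta, LW):
    """SIG(a, red, DL) = gamma_{red + [a]}(DL) - gamma_red(DL), as in Algorithm 1."""
    return gamma(red + [a], X, Theta, LW) - gamma(red, X, Theta, LW)
```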

4.3. Dynamic Algorithm for Online Label Distribution Feature Selection

Most feature selection algorithms assume that all candidate features are available to the algorithm before feature selection. In contrast, for streaming features, all the candidates cannot be collected before learning begins because they arrive dynamically and incrementally over time. Therefore, based on Algorithm 1, we incorporate the online streaming feature selection framework [50] and propose an online label distribution streaming feature selection algorithm to solve the label distribution streaming feature selection problem.
In the label distribution streaming feature decision system LFDS = (U, C, L, t), U = {x_1, x_2, …, x_n} represents the set of all non-empty instances, C represents the candidate feature set, L represents the label space, t represents the time at which a feature arrives, F_t represents the new feature arriving at time t, and S_{t−1} represents the feature subset selected up to time t − 1.

4.3.1. Importance Analysis

Let F_i be the new feature arriving at the i-th moment, and let S_{t_{i−1}} be the reduced set (reduct) at the (i − 1)-th moment. For each newly arrived feature, the first step is to perform importance analysis on F_i. The purpose of importance analysis is to determine whether the newly arrived feature F_i is beneficial to the label set L, that is, to determine the importance of F_i to the entire label set L.
Using Equation (17), SIG(F_i, S_{t_{i−1}}, DL) = γ_{F_i ∪ S_{t_{i−1}}}(DL) − γ_{S_{t_{i−1}}}(DL) is used to measure the importance of F_i. If SIG(F_i, S_{t_{i−1}}, DL) < 0, we consider F_i to be redundant and unimportant for the current label set L, and hence F_i can be discarded.

4.3.2. Significance Analysis

The purpose of significance analysis is to evaluate the significance of the newly arrived feature relative to the features that have already been selected.
Define the significance of label L as follows:
ς(L_j) = Σ_{i=1}^{n} d_{x_i}^{L_j} / (Σ_{j=1}^{m} Σ_{i=1}^{n} d_{x_i}^{L_j})
For the reduct red_j of label L_j, we design a mapping Φ_j: red_j → F that maps red_j to the following d-dimensional vector:
Φ_j = (Φ_j(f_1), Φ_j(f_2), …, Φ_j(f_d))
where Φ_j(f_i) = 1 if f_i ∈ red_j; otherwise, Φ_j(f_i) = 0.
We then define the significance of feature f_i relative to DL on A as follows:
SIG*(f_i, A, DL) = Σ_{j=1}^{m} ς(L_j) · Φ_j(f_i)
For a new feature F_i, we calculate SIG*(F_i, S_{t_{i−1}}, DL); when SIG*(F_i, S_{t_{i−1}}, DL) > 0, we consider the significance of the new feature F_i for label set DL to be greater than or equal to the average significance of the features that have already been processed for label set DL. Therefore, F_i is considered a significant feature and should be retained.

4.3.3. Redundancy Analysis

The purpose of redundancy analysis is to determine whether there is a feature F_k in the current reduced set S_{t_{i−1}} whose contribution to label set L is the same as that of F_i (F_i ∉ S_{t_{i−1}}). When their contributions are the same, it is necessary to choose between the two features F_k and F_i.
For the two features F_i and F_k, if SIG(F_i, F_k, D) = 0, then F_i and F_k make the same contribution to L. We thus compare SIG*(F_i, S_{t_{i−1}}, DL) and SIG*(F_k, S_{t_{i−1}}, DL). If SIG*(F_i, S_{t_{i−1}}, DL) ≥ SIG*(F_k, S_{t_{i−1}}, DL), we keep F_k and discard F_i; if SIG*(F_i, S_{t_{i−1}}, DL) < SIG*(F_k, S_{t_{i−1}}, DL), we keep F_i and discard F_k.
The streaming feature selection framework, illustrated in Figure 5, is based on online importance analysis, significance analysis, and redundancy analysis. In this framework, a training set with a known number of features is used to simulate streaming features, and each streaming feature is generated from the candidate feature set. Within this framework, we propose a dynamic online label distribution feature selection algorithm that considers label importance and label correlation (Algorithm 2), which incorporates the above three types of analysis.
The main part of Algorithm 2 is the calculation of the dependency relationships between features. Here, |S_{t−1}| is the number of features in the feature set selected up to time t − 1. Algorithm 2 evaluates whether a newly arriving feature needs to be retained at its time of arrival and decides how to retain it. The whole process is an online selection problem and includes three main parts: importance analysis, significance analysis, and redundancy analysis. We have made our code publicly available.
Algorithm 2 Dynamic online label distribution feature selection (DLILC-LDL).
Input: 
New feature F_i arriving at the i-th moment, the reduced set reduct S_{t_{i−1}} obtained at the (i − 1)-th moment, label set L, and redundancy weight δ.
Output: 
Reduced set reduct.
 1:
Use Equation (5) to calculate weight matrix W ( L ) ;
 2:
Initialize reduct ;
 3:
If new features F i arrive, calculate γ F i ( D ) according to Equation (16);
 4:
/importance analysis/
 5:
if red =  then
 6:
    Calculate SIG F i , S t i 1 , D = γ F i S t i 1 ( D ) γ S t i 1 ( D ) ;
 7:
    if  SIG F i , S t i 1 , D < 0  then
 8:
          discard F i , to  5
 9:
    else
 10:
         reduct = reduct F i , to  5 ;
 11:
    end if
 12:
    /significance analysis/
 13:
else
 14:
    According to Equation (11), calculate SIG * F i , S t i 1 , D L ;
 15:
end if
 16:
if SIG * F i , S t i 1 , D L > 0 then
 17:
    reduct = reduct F i , to  5 ;
 18:
else
 19:
    /Redundancy Analysis/
 20:
    while  F k reduct do
 21:
        if  SIG F i , F k , D = 0 and SIG * F i , S t i 1 , D L SIG * F k , S t i 1 , D L  then
 22:
              discard F i , to 5;
 23:
        else if  SIG * F i , S t i 1 , D L < SIG * F k , S t i 1 , D L  then
 24:
              reduct = reduct F k
 25:
              reduct = reduct F i , to  5
 26:
        end if
 27:
    end while
 28:
end if
 29:
No new features F i arrive;
 30:
Output reduced set reduct.
The feature computation performed by Algorithm 2 is taken from Algorithm 1. Hence, the time complexity of a single feature selection is O(|L|² + |U| log |U|). However, in most cases the situation is not this simple, and online updates are required. Because the time complexity of an update depends on the calculation of feature dependencies, in the worst case it is necessary to traverse all selected features, and hence the worst-case time complexity is O(|L|² + |S_{t−1}| · |U| log |U|).
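A high-level sketch of the online control flow of Algorithm 2, with evaluate_sig and evaluate_sig_star supplied by the caller (for example, thin wrappers around the sig sketch above and the sig_star helper below); the decision rules mirror the importance, significance, and redundancy analyses of Sections 4.3.1 to 4.3.3, and all names are illustrative.

```python
def sig_star(f, per_label_reducts, label_significance):
    """SIG*(f, ., DL) = sum_j zeta(L_j) * Phi_j(f)  (Equations (18)-(20))."""
    return sum(z for red_j, z in zip(per_label_reducts, label_significance) if f in red_j)

def online_select(feature_stream, evaluate_sig, evaluate_sig_star):
    """Online streaming label distribution feature selection (Algorithm 2 control flow)."""
    reduct = []
    for f in feature_stream:                       # features arrive one at a time
        if not reduct:                             # importance analysis (Section 4.3.1)
            if evaluate_sig(f, reduct) >= 0:
                reduct.append(f)
            continue
        if evaluate_sig_star(f, reduct) > 0:       # significance analysis (Section 4.3.2)
            reduct.append(f)
            continue
        for g in list(reduct):                     # redundancy analysis (Section 4.3.3)
            if evaluate_sig(f, [g]) == 0:
                if evaluate_sig_star(f, reduct) < evaluate_sig_star(g, reduct):
                    reduct.remove(g)               # f replaces the redundant feature g
                    reduct.append(f)
                break                              # otherwise f is discarded
    return reduct
```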

5. Experimental Data

5.1. Datasets

To verify the efficiency and accuracy of our algorithm, we conducted experiments on 10 label distribution datasets, including one facial expression dataset (SJAFFE [51]), seven biological experiment datasets (Yeast [24]), one natural scene dataset (Natural_Scene [52]), and one large-scale biomedical research dataset (Human_Gene [53]), as summarized in Table 2.

5.2. Evaluation Indices

In the experiments, one indicator of algorithm performance is the average distance or similarity between the predicted and true label distributions. To verify the efficiency of our algorithm, we selected six evaluation measures widely used in the label distribution community: the Chebyshev distance, Clark distance, Canberra distance, Kullback–Leibler (KL) divergence, cosine similarity, and intersection similarity. Table 3 summarizes the mathematical equations of these six measures, where D = {d_1, d_2, …, d_L} represents the predicted label distribution and D̄ = {d̄_1, d̄_2, …, d̄_L} represents the actual label distribution. In the table, for each evaluation measure, the downward arrow (↓) indicates that smaller values signify better performance, whereas the upward arrow (↑) indicates that larger values signify better performance.
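For reference, the following is a minimal sketch of the six measures for a single instance, following their standard definitions in the LDL literature and assuming both distributions are strictly positive and sum to one (in practice a small epsilon is usually added before taking logarithms).

```python
import numpy as np

def ldl_metrics(d_pred, d_true):
    """Chebyshev, Clark, Canberra, KL divergence, cosine and intersection similarity."""
    d_pred, d_true = np.asarray(d_pred, float), np.asarray(d_true, float)
    return {
        "chebyshev":    np.max(np.abs(d_pred - d_true)),
        "clark":        np.sqrt(np.sum((d_pred - d_true) ** 2 / (d_pred + d_true) ** 2)),
        "canberra":     np.sum(np.abs(d_pred - d_true) / (d_pred + d_true)),
        "kl":           np.sum(d_true * np.log(d_true / d_pred)),
        "cosine":       np.dot(d_pred, d_true) / (np.linalg.norm(d_pred) * np.linalg.norm(d_true)),
        "intersection": np.sum(np.minimum(d_pred, d_true)),
    }

print(ldl_metrics([0.4, 0.3, 0.3], [0.5, 0.25, 0.25]))
```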

5.3. Experimental Setup

To evaluate the effectiveness and efficiency of the proposed algorithm, we compared it with six state-of-the-art feature selection algorithms: MDFS [54], GRRO [55,56], MDDMproj and MDDMspc [57], FSFL [23], and LDFS [58].
When reducing the dimensionality of the feature space, the experimental datasets need to be adapted for the different feature selection methods. For the label distribution feature selection methods, a multi-label dataset must be converted into a label distribution dataset through label enhancement; specifically, the experiments in this study used a label enhancement method to convert such data. Conversely, multi-label feature selection methods cannot directly use real-world label distribution datasets. Therefore, we used the equal-frequency strategy to discretize the label distribution datasets before dimensionality reduction.
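A small sketch of one plausible realization of the equal-frequency strategy mentioned above, in which each label's description degrees are split into bins containing roughly the same number of samples so that multi-label feature selection methods can be applied; the bin count and the quantile-based implementation are illustrative assumptions.

```python
import numpy as np

def equal_frequency_discretize(Y, n_bins=2):
    """Discretize each label column of a label distribution matrix by sample quantiles."""
    Y = np.asarray(Y, float)
    out = np.zeros_like(Y, dtype=int)
    for j in range(Y.shape[1]):
        # Interior quantile cut points give bins with (approximately) equal frequency.
        edges = np.quantile(Y[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        out[:, j] = np.digitize(Y[:, j], edges)
    return out

# Example: with n_bins=2, degrees above the per-label median become relevant labels (1).
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5],
              [0.6, 0.3, 0.1]])
print(equal_frequency_discretize(Y))
```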
The experiments were conducted using a 10-fold cross-validation strategy. To obtain the output, the learner was trained on the above datasets using SA-BFGS after feature selection, and the prediction performance was measured on the test folds. Following [59], the number of training iterations of SA-BFGS was set to 10, and the regularization factor was set to 1.

5.4. Experimental Results

To demonstrate the effectiveness of our algorithm, we compared the proposed DLILC-LDL algorithm with MDDMproj, MDDMspc, MDFS, LDFS, GRRO, and FSFL in terms of predictive classification performance. To ensure the results were comparable, the features returned by each comparison algorithm were ranked, the number of features selected by the proposed dynamic feature selection method was used as the final feature subset size, and the same number of top-ranked features from each comparison algorithm was used as its feature subset.
Because all the comparison algorithms output a feature ranking as the result of feature selection, we give the detailed experimental results of all algorithms on the various classification datasets in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9. For each evaluation criterion, "↓" indicates that smaller values are better and "↑" indicates that larger values are better. In addition, the best predictive classification performance for each evaluation measure is shown in bold, and the average predictive classification performance of each algorithm is shown in italics.
In addition, in our experiments, to study the performance of the streaming feature selection method when the entire feature set is unknown in advance, we used the 12 benchmark datasets listed in Table 2 as our testbed and simulated the streaming feature situation by observing the features on the training data that arrived each time.
From the experimental results in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9, we can infer the following:
(1)
For all evaluation indicators, DLILC-LDL is generally better than the comparison algorithms. The advantages of Algorithm 1 are basically suitable for all the labeled distribution datasets in the experiment.
(2)
In terms of the Chebyshev distance and cosine similarity, DLILC-LDL achieved the best performance on nine of the label distribution datasets. Note that DLILC-LDL achieved second-best or near-second-best performance on the remaining datasets, where its prediction and classification performance is very close to the best performance achieved by the comparison algorithms. In particular, in terms of the Chebyshev distance, the DLILC-LDL algorithm generally achieved the best performance.
(3)
In terms of the Clark and Canberra distances, the prediction and classification performance results of DLILC-LDL are significantly better than those of the compared algorithms on seven label distribution datasets. Note that even on the datasets where the best performance was not achieved, the results of DLILC-LDL are still quite good, and it ranked second on the remaining datasets.
(4)
In terms of different evaluation indicators, the average classification performance of DLILC-LDL is significantly better than that of all the compared algorithms. These experimental results reveal that the DLILC-LDL has better performance than the other compared algorithms.
Prediction performance is expected to vary across datasets and evaluation measures. To clearly evaluate the differences among the algorithms, the prediction performance was normalized to [0.1, 0.5] following [23]. Figure 6 shows the normalized stability indices for the Chebyshev distance, Clark distance, Canberra distance, KL divergence, cosine similarity, and intersection similarity. Each corner of the spider-web chart in Figure 6 represents a different dataset, and each colored line represents a different algorithm.
If the area of the graph enclosed by a specific line is large and its shape is similar to that of a regular dodecagon, the performance and stability of the corresponding algorithm are good. A stability value of approximately 0.5 is considered to be a good value. Figure 6 reveals the following:
(1)
For all the evaluation indices, DLILC-LDL has the best stability because its shape is very close to that of a regular dodecagon, and it has the largest enclosed area.
(2)
For the Chebyshev distance, cosine similarity, and intersection similarity, DLILC-LDL maintains stability on at least ten datasets.
(3)
For the Clark distance, KL divergence, and Canberra distance, the values of DLILC-LDL are similar to those of some comparison algorithms. Therefore, its performance advantage over the existing algorithms is not as obvious as for the other evaluation criteria.
(4)
For all evaluation criteria, the area enclosed by DLILC-LDL is larger than or similar to those of the existing algorithms, and its shape is closer to a regular dodecagon. In fact, comprehensive Bonferroni–Dunn tests comparing DLILC-LDL with the existing algorithms reveal that the performance and stability of DLILC-LDL are the best.

5.5. Statistical Tests

Because some of the experimental results were similar, statistical tests were used to evaluate whether there are significant differences among them. To systematically explore the statistical significance of the comparison algorithms, we used the Friedman test to further analyze the differences between their performance results; this is a widely accepted method for statistically comparing multiple algorithms over many datasets [60]. The specific method is as follows: given k comparison algorithms and N datasets, R_j = (1/N) Σ_{i=1}^{N} r_i^j represents the average rank of the j-th algorithm over all datasets, where r_i^j is the rank of algorithm j on the i-th dataset. Under the null hypothesis (it is assumed that the classification performance of each algorithm under each evaluation measure is equal, that is, the average ranks of all algorithms are equal), the Friedman statistic can be defined as follows:
F_F = (N − 1) χ_F² / (N(k − 1) − χ_F²), where χ_F² = [12N / (k(k + 1))] [Σ_{j=1}^{k} R_j² − k(k + 1)² / 4]
where F_F follows an F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom. Table 10 summarizes the F_F values and the corresponding critical values for each evaluation measure after the Friedman test statistics were calculated [61].
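A short sketch of this Friedman statistic computation following the formula above, where the score matrix has one row per dataset and one column per algorithm; the example data are illustrative, and the sketch relies on SciPy for ranking and the F-distribution.

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist

def friedman_test(scores, larger_is_better=False):
    """F_F statistic and p-value from a (datasets x algorithms) score matrix."""
    N, k = scores.shape
    ranks = np.apply_along_axis(rankdata, 1, -scores if larger_is_better else scores)
    R = ranks.mean(axis=0)                                     # average rank per algorithm
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
    F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)
    p = 1 - f_dist.cdf(F_F, k - 1, (k - 1) * (N - 1))
    return F_F, p

# Example: error scores (smaller is better) for 4 datasets and 3 algorithms.
scores = np.array([[0.21, 0.25, 0.23],
                   [0.18, 0.22, 0.20],
                   [0.30, 0.31, 0.33],
                   [0.15, 0.19, 0.17]])
print(friedman_test(scores))
```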
As Table 10 reveals, for all evaluation measures at a significance level of α = 0.10, the null hypothesis is clearly rejected. Next, we used post hoc tests to further determine the differences in performance among the comparison algorithms. Because our focus was on comparing the proposed method with the other algorithms, the Bonferroni–Dunn test was used [62]. The Bonferroni–Dunn test is a method used in multiple comparison problems to control the probability of a Type I error (that is, incorrectly rejecting a null hypothesis that is true). When conducting multiple statistical tests, the probability of rejecting one or more null hypotheses (i.e., the probability of making a Type I error) increases with the number of tests, even if all of the null hypotheses are true. The Bonferroni–Dunn test reduces the overall risk of this error by reducing the significance level of each individual test. Its disadvantage is that it is relatively conservative and may increase the risk of Type II errors (i.e., incorrectly accepting a false null hypothesis) when the amount of data is large. Therefore, any corrected p value (the probability of observing the current result or a more extreme one under the premise that the null hypothesis is true) greater than 1 was reported as 1, in order to maintain the accuracy of the statistical analysis and the legitimacy of the probabilities. When the distance between the average ranks exceeds the critical difference (CD), we consider the performance of DLILC-LDL to differ significantly from that of the corresponding comparison algorithm.
From Figure 7, we can observe the following:
(1)
DLILC-LDL shows significant improvement in several evaluation measures and significantly outperforms the other comparison algorithms; its relative advantage is obvious.
(2)
Although the classification performance results of FSFL and GRRO are comparable, the average classification performance of DLILC-LDL in Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9 is significantly better than that of the comparison algorithms. In summary, DLILC-LDL is better than the other state-of-the-art multi-label feature selection algorithms, and its advantage is statistically significant.

6. Summary

This paper proposed a label distribution feature selection model based on label correlation and NRS, and it further proposed, based on this model, a dynamic label distribution feature selection algorithm for processing streaming features. In the proposed approach, we first defined a new neighborhood particle using the average nearest neighbor method to better connect the information between the features of the label distribution data. Then, by calculating the mutual information between the labels, we obtained the label correlation weights and combined the lower approximation of the new neighborhood with these weights to obtain a new feature subset importance model. On this basis, we generalized the traditional NRS model, defined a new label space granularity, and granulated the label space of the label distribution data to adapt it to LDL. Incorporating rough set theory, we defined the label distribution neighborhood decision system and proposed a new importance model. Moreover, based on this model, we proposed a dynamic feature selection algorithm that solves the label distribution streaming feature problem by testing the features arriving over time using importance, significance, and redundancy analyses. The experimental results show that our algorithm is highly competitive compared with other commonly used algorithms. However, in highly dynamic data environments, the practical application of this method remains challenging because of its high time complexity. This complexity mainly comes from computing the sample neighborhoods and performing neighborhood-based feature selection; in the label distribution setting, the label granularity of each sample must also be computed. To reduce the running time, the samples could be pre-sorted according to the structure of the label space, which may also further improve the accuracy of feature selection; this is the direction of our next research and optimization steps. In future work, we therefore hope to further optimize the algorithm and reduce its running time, to further improve the label space granules and propose a better label distribution neighborhood decision system, and to combine this method with practical applications such as affective computing and multimodal emotion recognition.

Author Contributions

Conceptualization, W.C.; Methodology, W.C.; Software, W.C.; Validation, W.C.; Writing—original draft, W.C.; Writing—review & editing, X.S. and F.R.; Supervision, F.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in [LDL data sets] [https://palm.seu.edu.cn/xgeng/LDL/index.htm#data] accessed on 18 December 2024.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, S.; Yang, X.; Yu, H.; Yu, D.J.; Yang, J.; Tsang, E.C. Multi-label learning with label-specific feature reduction. Knowl.-Based Syst. 2016, 104, 52–61. [Google Scholar] [CrossRef]
  2. Tsoumakas, G.; Katakis, I. Multi-label classification: An overview. Int. J. Data Warehous. Min. (IJDWM) 2007, 3, 1–13. [Google Scholar] [CrossRef]
  3. Fakhari, A.; Moghadam, A.M.E. Combination of classification and regression in decision tree for multi-labeling image annotation and retrieval. Appl. Soft Comput. 2013, 13, 1292–1302. [Google Scholar] [CrossRef]
  4. Liu, W.; Wang, H.; Shen, X.; Tsang, I.W. The emerging trends of multi-label learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7955–7974. [Google Scholar] [CrossRef]
  5. Yuanjian, Z.; Tianna, Z.; Duoqian, M. Granule-based label enhancement in label distribution learning. CAAI Trans. Intell. Syst. 2022, 18, 390–398. [Google Scholar]
  6. Geng, X. Label distribution learning. IEEE Trans. Knowl. Data Eng. 2016, 28, 1734–1748. [Google Scholar] [CrossRef]
  7. Gao, B.B.; Xing, C.; Xie, C.W.; Wu, J.; Geng, X. Deep label distribution learning with label ambiguity. IEEE Trans. Image Process. 2017, 26, 2825–2838. [Google Scholar] [CrossRef]
  8. Geng, X.; Wang, Q.; Xia, Y. Facial age estimation by adaptive label distribution learning. In Proceedings of the IEEE 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 4465–4470. [Google Scholar]
  9. Zhang, Z.; Wang, M.; Geng, X. Crowd counting in public video surveillance by label distribution learning. Neurocomputing 2015, 166, 151–163. [Google Scholar] [CrossRef]
  10. Geng, X.; Hou, P. Pre-release prediction of crowd opinion on movies by label distribution learning. In Proceedings of the 24th International Conference on Artificial Intelligen (IJCAI’15), Buenos Aires, Argentina, 25–31 July 2015; pp. 3511–3517. [Google Scholar]
  11. Lee, J.; Kim, D.W. Fast multi-label feature selection based on information-theoretic feature ranking. Pattern Recognit. 2015, 48, 2761–2771. [Google Scholar] [CrossRef]
  12. Lin, Y.; Hu, Q.; Liu, J.; Chen, J.; Duan, J. Multi-label feature selection based on neighborhood mutual information. Appl. Soft Comput. 2016, 38, 244–256. [Google Scholar] [CrossRef]
  13. Yu, Y.; Pedrycz, W.; Miao, D. Multi-label classification by exploiting label correlations. Expert Syst. Appl. 2014, 41, 2989–3004. [Google Scholar] [CrossRef]
  14. Wu, X.; Yu, K.; Wang, H.; Ding, W. Online streaming feature selection. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 1159–1166. [Google Scholar]
  15. Chen, H.; Li, T.; Luo, C.; Horng, S.J.; Wang, G. A decision-theoretic rough set approach for dynamic data mining. IEEE Trans. Fuzzy Syst. 2015, 23, 1958–1970. [Google Scholar] [CrossRef]
  16. Chen, D.; Yang, Y. Attribute reduction for heterogeneous data based on the combination of classical and fuzzy rough set models. IEEE Trans. Fuzzy Syst. 2013, 22, 1325–1334. [Google Scholar] [CrossRef]
  17. Li, S.; Zhang, K.; Li, Y.; Wang, S.; Zhang, S. Online streaming feature selection based on neighborhood rough set. Appl. Soft Comput. 2021, 113, 108025. [Google Scholar] [CrossRef]
  18. Liu, K.; Li, T.; Yang, X.; Yang, X.; Liu, D.; Zhang, P.; Wang, J. Granular cabin: An efficient solution to neighborhood learning in big data. Inf. Sci. 2022, 583, 189–201. [Google Scholar] [CrossRef]
  19. Arslan, S.; Ozturk, C. Multi hive artificial bee colony programming for high dimensional symbolic regression with feature selection. Appl. Soft Comput. 2019, 78, 515–527. [Google Scholar] [CrossRef]
  20. Jiang, Z.; Liu, K.; Yang, X.; Yu, H.; Fujita, H.; Qian, Y. Accelerator for supervised neighborhood based attribute reduction. Int. J. Approx. Reason. 2020, 119, 122–150. [Google Scholar] [CrossRef]
  21. Kashef, S.; Nezamabadi-pour, H.; Nikpour, B. Multilabel feature selection: A comprehensive review and guiding experiments. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1240. [Google Scholar] [CrossRef]
  22. Zhai, Y.; Dai, J. Label Distribution Data Feature Reduction Based on Fuzzy Rough Set Model. Aust. J. Intell. Inf. Process. Syst. 2019, 16, 27–35. [Google Scholar]
  23. Wang, Y.; Dai, J. Label distribution feature selection based on mutual information in fuzzy rough set theory. In Proceedings of the IEEE 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–2. [Google Scholar]
  24. Qian, W.; Xiong, Y.; Yang, J.; Shu, W. Feature selection for label distribution learning via feature similarity and label correlation. Inf. Sci. 2022, 582, 38–59. [Google Scholar] [CrossRef]
  25. Pawlak, Z.; Skowron, A. Rudiments of rough sets. Inf. Sci. 2007, 177, 3–27. [Google Scholar] [CrossRef]
  26. Zhang, P.; Li, T.; Wang, G.; Luo, C.; Chen, H.; Zhang, J.; Wang, D.; Yu, Z. Multi-source information fusion based on rough set theory: A review. Inf. Fusion 2021, 68, 85–117. [Google Scholar] [CrossRef]
  27. Wang, P.; Yao, Y. CE3: A three-way clustering method based on mathematical morphology. Knowl.-Based Syst. 2018, 155, 54–65. [Google Scholar] [CrossRef]
  28. Fan, J.; Wang, P.; Jiang, C.; Yang, X.; Song, J. Ensemble learning using three-way density-sensitive spectral clustering. Int. J. Approx. Reason. 2022, 149, 70–84. [Google Scholar] [CrossRef]
  29. Chen, H.; Li, T.; Cai, Y.; Luo, C.; Fujita, H. Parallel attribute reduction in dominance-based neighborhood rough set. Inf. Sci. 2016, 373, 351–368. [Google Scholar] [CrossRef]
  30. Yuan, Z.; Chen, H.; Li, T.; Yu, Z.; Sang, B.; Luo, C. Unsupervised attribute reduction for mixed data based on fuzzy rough sets. Inf. Sci. 2021, 572, 67–87. [Google Scholar] [CrossRef]
  31. Yu, K.; Wu, X.; Ding, W.; Pei, J. Scalable and accurate online feature selection for big data. ACM Trans. Knowl. Discov. Data (TKDD) 2016, 11, 1–39. [Google Scholar] [CrossRef]
  32. Paul, D.; Jain, A.; Saha, S.; Mathew, J. Multi-objective PSO based online feature selection for multi-label classification. Knowl.-Based Syst. 2021, 222, 106966. [Google Scholar] [CrossRef]
  33. Lin, Y.; Hu, Q.; Liu, J.; Li, J.; Wu, X. Streaming feature selection for multilabel learning based on fuzzy mutual information. IEEE Trans. Fuzzy Syst. 2017, 25, 1491–1507. [Google Scholar] [CrossRef]
  34. Chen, W.; Sun, X. Dynamic multi-label feature selection algorithm based on label importance and label correlation. Int. J. Mach. Learn. Cybern. 2024, 15, 3379–3396. [Google Scholar] [CrossRef]
  35. Yu, K.; Wu, X.; Ding, W.; Pei, J. Towards scalable and accurate online feature selection for big data. In Proceedings of the 2014 IEEE International Conference on Data Mining, Shenzhen, China, 14–17 December 2014; pp. 660–669. [Google Scholar]
  36. Xu, N.; Liu, Y.P.; Geng, X. Label enhancement for label distribution learning. IEEE Trans. Knowl. Data Eng. 2019, 33, 1632–1643. [Google Scholar] [CrossRef]
  37. Żychowski, A.; Mańdziuk, J. Duo-LDL method for Label Distribution Learning based on pairwise class dependencies. Appl. Soft Comput. 2021, 110, 107585. [Google Scholar] [CrossRef]
  38. Zhao, P.; Zhou, Z.H. Label distribution learning by optimal transport. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  39. Geng, X.; Yin, C.; Zhou, Z.H. Facial age estimation by learning from label distributions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2401–2412. [Google Scholar] [CrossRef] [PubMed]
  40. Zhu, Y.; Kwok, J.T.; Zhou, Z.H. Multi-label learning with global and local label correlation. IEEE Trans. Knowl. Data Eng. 2017, 30, 1081–1094. [Google Scholar] [CrossRef]
  41. Qian, W.; Huang, J.; Wang, Y.; Xie, Y. Label distribution feature selection for multi-label classification with rough set. Int. J. Approx. Reason. 2021, 128, 32–55. [Google Scholar] [CrossRef]
  42. Liu, J.; Lin, Y.; Ding, W.; Zhang, H.; Wang, C.; Du, J. Multi-label feature selection based on label distribution and neighborhood rough set. Neurocomputing 2023, 524, 142–157. [Google Scholar] [CrossRef]
  43. Hu, Q.; Yu, D.; Xu, Z. Numerical attribute reduction based on neighborhood granulation and rough approximation. J. Softw. 2008, 19, 640–649. [Google Scholar] [CrossRef]
  44. Fan, Y.; Chen, B.; Huang, W.; Liu, J.; Weng, W.; Lan, W. Multi-label feature selection based on label correlations and feature redundancy. Knowl.-Based Syst. 2022, 241, 108256. [Google Scholar] [CrossRef]
  45. Chen, P.; Lin, M.; Liu, J. Multi-label attribute reduction based on variable precision fuzzy neighborhood rough set. IEEE Access 2020, 8, 133565–133576. [Google Scholar] [CrossRef]
  46. Liu, J.; Lin, Y.; Du, J.; Zhang, H.; Chen, Z.; Zhang, J. ASFS: A novel streaming feature selection for multi-label data based on neighborhood rough set. Appl. Intell. 2023, 53, 1707–1724. [Google Scholar] [CrossRef]
  47. Wu, Y.; Liu, J.; Yu, X.; Lin, Y.; Li, S. Neighborhood rough set based multi-label feature selection with label correlation. Concurr. Comput. Pract. Exp. 2022, 34, e7162. [Google Scholar] [CrossRef]
  48. Qian, Y.; Liang, J.; Pedrycz, W.; Dang, C. Positive approximation: An accelerator for attribute reduction in rough set theory. Artif. Intell. 2010, 174, 597–618. [Google Scholar] [CrossRef]
  49. Liu, J.; Lin, Y.; Lin, M.; Wu, S.; Zhang, J. Feature selection based on quality of information. Neurocomputing 2017, 225, 11–22. [Google Scholar] [CrossRef]
  50. Hashemi, A.; Dowlatshahi, M.B.; Nezamabadi-Pour, H. MGFS: A multi-label graph-based feature selection algorithm via PageRank centrality. Expert Syst. Appl. 2020, 142, 113024. [Google Scholar] [CrossRef]
  51. Sen, T.; Chaudhary, D.K. Contrastive study of simple pagerank, hits and weighted pagerank algorithms. In Proceedings of the IEEE 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence, Noida, India, 12–13 January 2017; pp. 721–727. [Google Scholar]
  52. Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205. [Google Scholar]
  53. Eisen, M.B.; Spellman, P.T.; Brown, P.O.; Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 1998, 95, 14863–14868. [Google Scholar] [CrossRef]
  54. Geng, X.; Luo, L. Multilabel ranking with inconsistent rankers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3742–3747. [Google Scholar]
  55. Jia, X.; Shen, X.; Li, W.; Lu, Y.; Zhu, J. Label distribution learning by maintaining label ranking relation. IEEE Trans. Knowl. Data Eng. 2021, 35, 1695–1707. [Google Scholar] [CrossRef]
  56. Zhang, J.; Luo, Z.; Li, C.; Zhou, C.; Li, S. Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognit. 2019, 95, 136–150. [Google Scholar] [CrossRef]
  57. Zhang, J.; Lin, Y.; Jiang, M.; Li, S.; Tang, Y.; Long, J.; Weng, J.; Tan, K.C. Fast multilabel feature selection via global relevance and redundancy optimization. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5721–5734. [Google Scholar] [CrossRef]
  58. Zhang, Y.; Zhou, Z.H. Multilabel dimensionality reduction via dependence maximization. ACM Trans. Knowl. Discov. Data (TKDD) 2010, 4, 1–21. [Google Scholar] [CrossRef]
  59. Hu, L.; Gao, L.; Li, Y.; Zhang, P.; Gao, W. Feature-specific mutual information variation for multi-label feature selection. Inf. Sci. 2022, 593, 449–471. [Google Scholar] [CrossRef]
  60. Li, Y.; Lin, Y.; Liu, J.; Weng, W.; Shi, Z.; Wu, S. Feature selection for multi-label learning based on kernelized fuzzy rough sets. Neurocomputing 2018, 318, 271–286. [Google Scholar] [CrossRef]
  61. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92. [Google Scholar] [CrossRef]
  62. Dong, J.; Fu, J.; Zhou, P.; Li, H.; Wang, X. Improving Spoken Language Understanding with Cross-Modal Contrastive Learning. In Proceedings of the Interspeech, Incheon, Republic of Korea, 18–22 September 2022; pp. 2693–2697. [Google Scholar]
Figure 1. Three learning paradigms in the LDL framework.
Figure 2. Distribution of facial expressions and emotions.
Figure 3. Neighborhood granules in 2-D spaces.
Figure 4. Rough sets in discrete spaces.
Figure 5. Flow feature selection framework.
Figure 6. Spider-web charts for stability analysis.
Figure 7. Bonferroni–Dunn comparison of SLILC-LDL and DLILC-LDL with the comparison algorithms.
Table 1. Example of multi-label data.

No. | U | f1 | f2 | D_L1 | D_L2 | D_L3
1 | x1 | 0.89 | 0.08 | 0.33 | 0.67 | 0
2 | x2 | 0.21 | 0.48 | 0.64 | 0 | 0.36
3 | x3 | 0.11 | 0.06 | 0.79 | 0 | 0.21
4 | x4 | 0.43 | 0.20 | 1 | 0 | 0
5 | x5 | 0.15 | 0.14 | 0.82 | 0.15 | 0.03
6 | x6 | 0.82 | 0.31 | 0 | 0.48 | 0.52
7 | x7 | 0.03 | 0.15 | 1 | 0 | 0
8 | x8 | 0.73 | 0.09 | 0 | 0.6 | 0.4
9 | x9 | 0.86 | 0.50 | 0.41 | 0 | 0.59
10 | x10 | 0.66 | 0.78 | 0 | 1 | 0
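As an illustration of the structure of this data, the rows of Table 1 can be stored as a feature matrix and a label distribution matrix. The defining property of label distribution data is that each vector of description degrees is non-negative and sums to one; the snippet below is only a sketch, with the two feature columns named f1 and f2 as in the table.

```python
# Table 1 as arrays: two features per sample and a three-label distribution.
import numpy as np

features = np.array([[0.89, 0.08], [0.21, 0.48], [0.11, 0.06], [0.43, 0.20],
                     [0.15, 0.14], [0.82, 0.31], [0.03, 0.15], [0.73, 0.09],
                     [0.86, 0.50], [0.66, 0.78]])
dist = np.array([[0.33, 0.67, 0.00], [0.64, 0.00, 0.36], [0.79, 0.00, 0.21],
                 [1.00, 0.00, 0.00], [0.82, 0.15, 0.03], [0.00, 0.48, 0.52],
                 [1.00, 0.00, 0.00], [0.00, 0.60, 0.40], [0.41, 0.00, 0.59],
                 [0.00, 1.00, 0.00]])

# Each row of description degrees is non-negative and sums to 1, which is what
# distinguishes label distribution data from plain multi-label data.
assert np.all(dist >= 0) and np.allclose(dist.sum(axis=1), 1.0)
```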
Table 2. Main features of the 12 datasets.

No. | Dataset | Examples | Features | Labels
1 | S-JAFFE | 213 | 243 | 6
2 | Natural_Scene | 2000 | 294 | 9
3 | Human_Gene | 17,892 | 36 | 68
4 | Yeast-alpha (alpha) | 2465 | 24 | 18
5 | Yeast-diau (diau) | 2465 | 24 | 7
6 | Yeast-dtt (dtt) | 2465 | 24 | 4
7 | Yeast-elu (elu) | 2465 | 24 | 14
8 | Yeast-heat (heat) | 2465 | 24 | 6
9 | Yeast-spo (spo) | 2465 | 24 | 6
10 | Yeast-spo5 (spo5) | 2465 | 24 | 3
Table 3. Algorithm evaluation indicators, where $\bar{D} = (\bar{d}_1, \ldots, \bar{d}_L)$ and $D = (d_1, \ldots, d_L)$ are the two label distributions being compared over the $L$ labels.

Name | Formula
Chebyshev distance (Chebyshev) ↓ | $\mathrm{Dis}(\bar{D}, D) = \max_{j} \left| \bar{d}_j - d_j \right|$
Clark distance (Clark) ↓ | $\mathrm{Dis}(\bar{D}, D) = \sqrt{\sum_{j=1}^{L} \frac{(\bar{d}_j - d_j)^2}{(\bar{d}_j + d_j)^2}}$
Canberra distance (Canberra) ↓ | $\mathrm{Dis}(\bar{D}, D) = \sum_{j=1}^{L} \frac{\left| \bar{d}_j - d_j \right|}{\bar{d}_j + d_j}$
KL divergence (KL) ↓ | $\mathrm{Dis}(\bar{D}, D) = \sum_{j=1}^{L} \bar{d}_j \ln \frac{\bar{d}_j}{d_j}$
Cosine similarity (Cosine) ↑ | $\mathrm{Sim}(\bar{D}, D) = \frac{\sum_{j=1}^{L} \bar{d}_j d_j}{\sqrt{\sum_{j=1}^{L} \bar{d}_j^{2}} \sqrt{\sum_{j=1}^{L} d_j^{2}}}$
Intersection similarity (Intersection) ↑ | $\mathrm{Sim}(\bar{D}, D) = \sum_{j=1}^{L} \min(\bar{d}_j, d_j)$
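For reference, the six measures in Table 3 translate directly into code. The sketch below writes $\bar{D}$ as d_hat and $D$ as d, and adds a small epsilon to guard the divisions and the logarithm; it is an illustrative transcription rather than the evaluation code used in the experiments.

```python
# Per-sample versions of the six evaluation measures in Table 3.
import numpy as np

def chebyshev(d_hat, d):            # lower is better
    return np.max(np.abs(d_hat - d))

def clark(d_hat, d, eps=1e-12):     # lower is better
    return np.sqrt(np.sum((d_hat - d) ** 2 / ((d_hat + d) ** 2 + eps)))

def canberra(d_hat, d, eps=1e-12):  # lower is better
    return np.sum(np.abs(d_hat - d) / (d_hat + d + eps))

def kl(d_hat, d, eps=1e-12):        # lower is better
    return np.sum(d_hat * np.log((d_hat + eps) / (d + eps)))

def cosine(d_hat, d):               # higher is better
    return np.dot(d_hat, d) / (np.linalg.norm(d_hat) * np.linalg.norm(d))

def intersection(d_hat, d):         # higher is better
    return np.sum(np.minimum(d_hat, d))

# Identical distributions give 0 for the distance measures and 1 for the similarities.
d = np.array([0.5, 0.3, 0.2])
assert np.isclose(kl(d, d), 0.0) and np.isclose(intersection(d, d), 1.0)
```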
Table 4. Comparison of the Chebyshev distance results of different algorithms ↓.

Dataset | MDDMproj | MDDMspc | MDFS | LDFS | GRRO | FSFL | DLILC-LDL
S-JAFFE | 0.1029 | 0.1013 | 0.1087 | 0.1099 | 0.1054 | 0.0860 | 0.0832
Natural_Scene | 0.3431 | 0.3395 | 0.3336 | 0.3266 | 0.3323 | 0.3497 | 0.3199
Human_Gene | 0.06303 | 0.06310 | 0.06304 | 0.06312 | 0.06320 | 0.06313 | 0.06291
Yeast-alpha (alpha) | 0.009805 | 0.009863 | 0.009910 | 0.009883 | 0.009835 | 0.009878 | 0.009775
Yeast-diau (diau) | 0.03600 | 0.03604 | 0.03558 | 0.03564 | 0.03558 | 0.03576 | 0.03456
Yeast-dtt (dtt) | 0.04020 | 0.03998 | 0.03974 | 0.03815 | 0.03990 | 0.03973 | 0.03939
Yeast-elu (elu) | 0.01690 | 0.01700 | 0.01688 | 0.01680 | 0.01688 | 0.01690 | 0.01668
Yeast-heat (heat) | 0.04363 | 0.04331 | 0.04343 | 0.04338 | 0.04335 | 0.04298 | 0.04282
Yeast-spo (spo) | 0.05936 | 0.05948 | 0.05940 | 0.05946 | 0.05942 | 0.05918 | 0.05852
Yeast-spo5 (spo5) | 0.09360 | 0.09243 | 0.09338 | 0.09265 | 0.09245 | 0.09343 | 0.09168
Avg | 0.08085 | 0.08020 | 0.08036 | 0.07955 | 0.07983 | 0.07967 | 0.07595
Table 5. Comparison of the Clark distance results of different algorithms ↓.

Dataset | MDDMproj | MDDMspc | MDFS | LDFS | GRRO | FSFL | DLILC-LDL
S-JAFFE | 0.3776 | 0.3546 | 0.3827 | 0.3979 | 0.3795 | 0.3398 | 0.3369
Natural_Scene | 2.0478 | 2.0607 | 2.0347 | 2.0114 | 2.0491 | 2.0463 | 2.0216
Human_Gene | 1.4321 | 1.4602 | 1.4548 | 1.4397 | 1.4425 | 1.4423 | 1.4090
Yeast-alpha (alpha) | 0.2234 | 0.2213 | 0.2086 | 0.2087 | 0.2083 | 0.2094 | 0.2073
Yeast-diau (diau) | 0.1900 | 0.1908 | 0.1886 | 0.1892 | 0.1890 | 0.1899 | 0.1850
Yeast-dtt (dtt) | 0.1126 | 0.1125 | 0.1123 | 0.1416 | 0.1122 | 0.1118 | 0.1106
Yeast-elu (elu) | 0.2064 | 0.2060 | 0.2059 | 0.2066 | 0.2069 | 0.2082 | 0.2060
Yeast-heat (heat) | 0.1453 | 0.1454 | 0.1454 | 0.1458 | 0.1462 | 0.1440 | 0.1456
Yeast-spo (spo) | 0.1986 | 0.2101 | 0.2102 | 0.2104 | 0.2106 | 0.2104 | 0.2075
Yeast-spo5 (spo5) | 0.1809 | 0.1796 | 0.1798 | 0.1803 | 0.1804 | 0.1804 | 0.1775
Avg | 0.5115 | 0.5141 | 0.5123 | 0.5131 | 0.5125 | 0.5082 | 0.5007
Table 6. Comparison of the Canberra distance results of different algorithms ↓.

Dataset | MDDMproj | MDDMspc | MDFS | LDFS | GRRO | FSFL | DLILC-LDL
S-JAFFE | 0.8108 | 0.7718 | 0.8413 | 0.8359 | 0.8163 | 0.7368 | 0.7564
Natural_Scene | 5.7448 | 5.8124 | 5.7120 | 5.6357 | 5.7524 | 5.7997 | 5.6653
Human_Gene | 16.8239 | 16.3366 | 16.3188 | 16.3464 | 16.3496 | 16.3495 | 16.2999
Yeast-alpha (alpha) | 0.6781 | 0.6837 | 0.6836 | 0.6860 | 0.6847 | 0.6872 | 0.6841
Yeast-diau (diau) | 0.4135 | 0.4138 | 0.4087 | 0.4103 | 0.4103 | 0.4109 | 0.4014
Yeast-dtt (dtt) | 0.1887 | 0.1885 | 0.1885 | 0.1890 | 0.1888 | 0.1880 | 0.1866
Yeast-elu (elu) | 0.6104 | 0.6072 | 0.6106 | 0.6091 | 0.6115 | 0.6163 | 0.6055
Yeast-heat (heat) | 0.3742 | 0.3751 | 0.3758 | 0.3753 | 0.3770 | 0.3740 | 0.3720
Yeast-spo (spo) | 0.5251 | 0.5258 | 0.5250 | 0.5246 | 0.5259 | 0.5233 | 0.5176
Yeast-spo5 (spo5) | 0.2763 | 0.2757 | 0.2773 | 0.2766 | 0.2765 | 0.2738 | 0.2751
Avg | 2.6446 | 2.5991 | 2.5942 | 2.5889 | 2.5993 | 2.5959 | 2.5764
Table 7. Comparison of the KL divergence results of different algorithms ↓.

Dataset | MDDMproj | MDDMspc | MDFS | LDFS | GRRO | FSFL | DLILC-LDL
S-JAFFE | 0.06453 | 0.06130 | 0.06767 | 0.06957 | 0.06657 | 0.04857 | 0.04340
Natural_Scene | 0.7070 | 0.7387 | 0.6850 | 0.6489 | 0.7028 | 0.7058 | 0.6468
Human_Gene | 0.6394 | 0.6063 | 0.6071 | 0.6138 | 0.6014 | 0.6510 | 0.5862
Yeast-alpha (alpha) | 0.005350 | 0.005417 | 0.005413 | 0.005447 | 0.005413 | 0.005510 | 0.005403
Yeast-diau (diau) | 0.01168 | 0.01166 | 0.01166 | 0.01174 | 0.01176 | 0.01154 | 0.01168
Yeast-dtt (dtt) | 0.006878 | 0.006900 | 0.006882 | 0.006960 | 0.006900 | 0.006838 | 0.006716
Yeast-elu (elu) | 0.006867 | 0.006833 | 0.006700 | 0.006750 | 0.006775 | 0.006875 | 0.006625
Yeast-heat (heat) | 0.01323 | 0.01325 | 0.01320 | 0.01330 | 0.01335 | 0.01345 | 0.01305
Yeast-spo (spo) | 0.02470 | 0.02478 | 0.02456 | 0.02458 | 0.02472 | 0.02456 | 0.02396
Yeast-spo5 (spo5) | 0.02965 | 0.02960 | 0.02968 | 0.02988 | 0.02998 | 0.02993 | 0.02923
Avg | 0.1509 | 0.1505 | 0.1458 | 0.1431 | 0.1470 | 0.1504 | 0.1373
Table 8. Comparison of the cosine similarity results of different algorithms ↑.

Dataset | MDDMproj | MDDMspc | MDFS | LDFS | GRRO | FSFL | DLILC-LDL
S-JAFFE | 0.9421 | 0.9442 | 0.9372 | 0.9361 | 0.9389 | 0.9540 | 0.9614
Natural_Scene | 0.7586 | 0.7519 | 0.7667 | 0.7745 | 0.7627 | 0.7604 | 0.7753
Human_Gene | 0.7437 | 0.7412 | 0.7399 | 0.7347 | 0.7421 | 0.7440 | 0.7458
Yeast-alpha (alpha) | 0.9948 | 0.9948 | 0.9946 | 0.9947 | 0.9947 | 0.9946 | 0.9947
Yeast-diau (diau) | 0.9887 | 0.9887 | 0.9889 | 0.9888 | 0.9888 | 0.9887 | 0.9892
Yeast-dtt (dtt) | 0.9926 | 0.9926 | 0.9927 | 0.9926 | 0.9925 | 0.9928 | 0.9928
Yeast-elu (elu) | 0.9936 | 0.9936 | 0.9936 | 0.9935 | 0.9935 | 0.9937 | 0.9936
Yeast-heat (heat) | 0.9799 | 0.9873 | 0.9873 | 0.9872 | 0.9872 | 0.9874 | 0.9875
Yeast-spo (spo) | 0.9763 | 0.9766 | 0.9762 | 0.9762 | 0.9762 | 0.9764 | 0.9769
Yeast-spo5 (spo5) | 0.9742 | 0.9743 | 0.9744 | 0.9748 | 0.9739 | 0.9741 | 0.9746
Avg | 0.9344 | 0.9346 | 0.9351 | 0.9352 | 0.9350 | 0.9366 | 0.9392
Table 9. Comparison of the intersection similarity results of different algorithms ↑.

Dataset | MDDMproj | MDDMspc | MDFS | LDFS | GRRO | FSFL | DLILC-LDL
S-JAFFE | 0.8661 | 0.8680 | 0.8677 | 0.8626 | 0.8670 | 0.8804 | 0.8825
Natural_Scene | 0.5297 | 0.5195 | 0.5472 | 0.5679 | 0.5403 | 0.5391 | 0.5633
Human_Gene | 0.7323 | 0.7329 | 0.7329 | 0.7323 | 0.7321 | 0.7322 | 0.7330
Yeast-alpha (alpha) | 0.9621 | 0.9621 | 0.9621 | 0.9620 | 0.9621 | 0.9624 | 0.9622
Yeast-diau (diau) | 0.9423 | 0.9422 | 0.9427 | 0.9426 | 0.9426 | 0.9425 | 0.9438
Yeast-dtt (dtt) | 0.9542 | 0.9542 | 0.9544 | 0.9547 | 0.9544 | 0.9546 | 0.9550
Yeast-elu (elu) | 0.9568 | 0.9570 | 0.9568 | 0.9569 | 0.9567 | 0.9564 | 0.9572
Yeast-heat (heat) | 0.9383 | 0.9382 | 0.9371 | 0.9383 | 0.9379 | 0.9378 | 0.9388
Yeast-spo (spo) | 0.9135 | 0.9134 | 0.9135 | 0.9134 | 0.9132 | 0.9171 | 0.9146
Yeast-spo5 (spo5) | 0.9107 | 0.9109 | 0.9108 | 0.9106 | 0.9106 | 0.9104 | 0.9119
Avg | 0.8706 | 0.8699 | 0.8725 | 0.8741 | 0.87179 | 0.8732 | 0.8762
Table 10. Summary of the Friedman test ($k = 8$, $N = 10$), giving the $F_F$ value and the critical value for each evaluation metric at $\alpha = 0.10$.

Evaluation Metric | $F_F$ | Critical Value ($\alpha = 0.10$)
Chebyshev | 6.357 | 1.770
Clark | 6.294 | 1.770
Canberra | 5.286 | 1.770
KL | 8.450 | 1.770
Cosine | 5.269 | 1.770
Intersection | 6.683 | 1.770
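For completeness, the $F_F$ statistic in Table 10 follows the standard Iman–Davenport correction of the Friedman test, computed from the algorithms' average ranks over the $N$ datasets. The sketch below uses hypothetical average ranks for $k = 8$ algorithms and $N = 10$ datasets, not the ranks obtained in our experiments.

```python
# Iman–Davenport F_F statistic from average ranks (illustrative values only).
import numpy as np

def friedman_ff(avg_ranks, n_datasets):
    k = len(avg_ranks)
    # Friedman chi-square statistic
    chi2 = 12 * n_datasets / (k * (k + 1)) * (
        np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4)
    # Iman–Davenport correction, distributed as F with (k-1, (k-1)(N-1)) d.o.f.
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)

# Hypothetical average ranks of 8 algorithms over 10 datasets (they average to (k+1)/2 = 4.5).
ranks = np.array([5.6, 5.2, 4.9, 4.4, 4.6, 3.9, 4.4, 3.0])
print(round(friedman_ff(ranks, 10), 3))
```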
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
