1. Introduction
In the field of machine learning and data mining [
1], the problem of label ambiguity is currently a popular research topic. There are currently two relatively mature machine learning paradigms, namely, single-label learning (SLL) and multi-label learning (MLL) [
2]. SLL achieves good results in the field of machine learning when the target instance has a clear single-class label. MLL is an extension of SLL. In real life, an object may be associated with several class labels. To intuitively reflect the characteristics of polysemous objects, the most obvious approach is to assign multiple category labels to each example of a polysemous object, that is, to use MLL [
3,
4].
Because it is constrained by label semantics, traditional multi-label classification assumes that relevant and irrelevant labels have the same degree of relevance to the description of the example, which obviously does not meet the complex cognitive needs of humans [
5]. In layman’s terms, MLL only solves the label ambiguity problem of labels associated with the instance. However, in real life, we need to solve the more common and ambiguous scenario of “how to describe the instance”. For example, in facial-expression emotion recognition, it is more meaningful to understand the degree to which each emotion describes the instance. To solve these problems, a new learning paradigm has been proposed in recent years—label distribution learning (LDL) [
6], which is an extension of MLL. The degree of LDL to which each label describes an instance is represented by the corresponding value of the label distribution, called the descriptive degree. This value explicitly indicates the relative importance of the label. Both SLL and MLL can be regarded as special cases of LDL.
Figure 1 shows the three learning paradigms within the LDL framework.
Figure 2 shows an image described using a label distribution. The label descriptions of “HAP”, “SAD”, “SUR”, “ANG”, “DIS”, and “FER” in the image are shown in
Figure 2. LDL has richer label information and clearly describes the importance of each label [
7]. At present, it has been applied in multiple neighborhoods. For example, the adaptive facial recognition proposed by X. Geng et al., achieved good results by estimating the age of sample characters through LDL [
8]. Z. Zhang et al., by collecting surveillance video data, used LDL to implement crowd counting applications over a period of time [
9], and P. H. et al., used LDL to predict box office before movie release by collecting people’s opinions on movies [
10]. LDL has also been widely used in fields such as emotion recognition and multilingual learning.
Similar to traditional SLL and MLL, LDL also faces many challenges. In terms of the data structure of label instances, the problems faced by LDL tasks include a large number of feature dimensions [
11], large number of labels [
12], label imbalance [
13], and streaming features [
14]. In LDL tasks, the dimensionality of the data is very large, with usually thousands or tens of thousands of dimensions [
15,
16]. This “curse of dimensionality” leads to a series of problems such as reduced classification accuracy and generalization ability in high-dimensional space, and it increases the computational cost of many learning algorithms. In addition, too many features may contain redundant or irrelevant features, which may lead to more problems such as the excessive consumption of computing resources and degradation of model performance. To this end, many dimensionality reduction methods have been proposed [
17,
18]. However, they are mainly designed for SLL and MLL algorithms, and are rarely suitable for LDL algorithms. Therefore, there is an urgent need to develop new and effective LDL dimensionality reduction methods.
There are two main methods for reducing the dimensionality of label distribution: feature extraction and feature selection. Feature extraction methods use space mapping techniques or space transformations to reduce the dimensionality of the feature space. However, these methods destroy the structural information of the original feature space, blur the physical meaning of the features, and lack semantic interpretation. In contrast to feature extraction methods, feature selection methods do not perform any feature space transformation or mapping; instead, they retain the original spatial structure. Feature selection methods sort the features in the original feature space by importance, select the subspace that best represents the semantic features of that space, and use this subspace to maximally represent the original feature space [
19]. Therefore, feature selection methods well preserve the physical meaning of the feature space, which is an advantage of feature extraction methods [
20]. From the perspective of interaction with learning algorithms, existing MLL feature selection algorithms are mainly divided into three types: filtering, wrapping, and embedding [
21]. All types have different relative advantages. Filtering methods are typically more efficient, less computationally expensive, and more general than embedding and wrapping methods; therefore, this work focuses on filtering methods.
In the past few decades, many feature selection algorithms have been proposed for label distribution learning. In related studies [
22,
23], these methods were used to consider the correlation between features and labels based on fuzzy rough sets. In addition, because of the correlation between labels, the performance of label distribution models depends largely on label correlation. For example, the method in [
24] considers label correlation to reduce data dimensionality, that is, it leverages the assumption that all samples share the same label correlation. However, for different sample groups in practical applications, the corresponding label correlation between random variables is an effective indicator for evaluating the discriminative ability of candidate features in the feature selection process. Alternatively, when dealing with problems related to uncertainty, rough set theory [
25,
26] naturally has many advantages and has been widely used in SLL or MLL feature selection algorithms [
27,
28]. Chen et al. [
29] designed a parallel feature selection method based on neighborhood rough sets (NRSs), which considers the partial order between features and values. Yuan et al. [
30] proposed a generalized unsupervised feature selection model based on fuzzy rough sets. However, to the best of our knowledge, there are relatively few feature selection methods based on rough set theory for label distribution. The main problem faced by rough sets in dealing with label distribution feature selection is how to deal with distributed labels.
In most practical applications of LDL, the feature space is usually uncertain, and the number of samples for each feature increases gradually over time, just like a stream of feature vectors. Such features are called streaming features. For example, on the social networking platform Twitter, hot topics change dynamically over time. When a hot topic appears, it is always accompanied by a set of new keywords. These fresh keywords can be used as key features to distinguish hot topics. Streaming feature selection assumes that features arrive dynamically over time [
31] and performs feature selection when each feature arrives to always maintain the optimal subset of features [
32,
33]. So far, there have been many methods for processing streaming features. For example, Chen et al. [
34] proposed using global features to process streaming feature data, theoretically analyzed the pairwise correlation between features in the currently selected feature subset, and solved the streaming feature problem using online pairwise comparison techniques. NRS can handle mixed types of data without destroying the neighborhood and sequential structure of the data [
35]. Moreover, feature selection based on NRS does not require any prior knowledge of the feature space structure, and hence it is an ideal tool for online stream feature selection.
This paper proposes a dynamic online label distribution feature selection model based on label correlation and NRS, and proposes a dynamic label distribution feature selection algorithm for processing stream features. Mutual information is widely used to measure the degree of dependence between random variables and is an effective indicator for evaluating the discriminative ability of candidate features in the feature selection process. The proposed method uses the mutual information method to process the label space and obtain the correlation between labels through mutual information and graph characteristics. In addition, a new label space neighborhood relationship is proposed, in which the neighborhood class of the instance is constructed in the label space, replacing the calculation of the traditional logical label equivalence class. Simultaneously, the nearest neighbor distribution of the surrounding instances is used to calculate the neighborhood of the instance in the feature space, avoiding the problem of neighborhood granularity selection faced by the traditional NRS model. On this basis, the NRS model is extended to fit the label distribution data, and the corresponding feature dependency is redefined. Combining label correlation and NRS, a dynamic feature selection algorithm framework is proposed. The main contributions of this paper are as follows:
- (1)
The average nearest neighbor method is used to calculate a new form of neighborhood granularity, and the mutual information method is used to calculate the label weight to obtain the label correlation. By combining the neighborhood granularity and label correlation, a new NRS relationship and feature importance model is constructed.
- (2)
The traditional NRS model is generalized to adapt it to LDL.
- (3)
Using the above model, a new label distribution stream feature selection algorithm is proposed that combines the new NRS model with the stream feature online importance update framework. As a result, it can better handle the label distribution stream feature problem.
The rest of this paper is organized as follows. In
Section 2, we introduce the related concepts, including LDL, feature correlation, and NRS theory.
Section 3 introduces label correlation and the NRS models. In
Section 4, we propose a dynamic feature selection algorithm based on label correlation and NRS. We report our experimental results in
Section 5. Finally,
Section 6 summarizes this paper and discusses future work.
4. Proposed Method
4.1. Improvement of NRS
Given a decision system , represents a non-empty set of instances, C represents the feature set corresponding to the instance set, and D represents the decision attribute set Traditional single-label and multi-label neighborhood information particle partitioning methods are not suitable for labeled distribution data. For general data, a group of instances with the same attribute value or label value is called an equivalence class. Similarly, for mixed data, a group of instances with similar attribute values or label values is called a neighborhood class. In the method proposed in this paper, the margin of particles in the sample is used for granulating the neighborhood size.
Definition 7. Given a sample x, the margin of x relative to a set of samples U is defined as follows [
12]
:where denotes the instance from U that has the shortest distance from x and whose label class is different from that of x. In addition, denotes the instance from U that has the shortest distance from x and has the same label class as x. We call these instances the nearest miss and the nearest hit, respectively. Moreover, denotes the distance between x and , and denotes the distance between x and . We call the neighborhood particle about x. To facilitate the setting of neighborhood information particles, we set when . A sample may have a positive or negative effect on different labels. Thus, for a given sample, the degree of granularity depends on the label used.
Definition 8. For a sample x and label , the margin of x with respect to is .
As noted above, each sample has a different label and, correspondingly, a different granularity. Depending on the different decision views, we need to combine all the single-label granularities of a given sample to form a label distribution granularity [44]. Therefore, in this paper, we choose the average granularity (i.e., the average nearest neighborhood, also known as the neutral view) to represent the label distribution granularity of a sample [12]: To solve the problem of the granularity selection of δ, combining Equations (1) and (9), the new neighborhood of the sample is defined as We define a new neighborhood information particle to solve the problem of selecting the neighborhood granularity, which is caused by label distribution data. In addition, the average nearest neighbor reflects the relationship between features in an instance. This new neighborhood model considers the relationships between features and is based on improved neighborhood information.
Definition 9. Given , a label space , the label space granularity on label is defined as follows:where . Here, indicates whether the neighborhood is centered on , and represents the radius. A larger value indicates the sample is closer to . When , the neighborhood granularity is equivalent to the equivalence class. Example 1. Continue to Table 1. Assume 2-norm is applied to , we take for example, and then we can compute as: , , , , , , ,, , . Based on the above definition, we can find that the neighborhood granularity of instance in the label space forms the granularity system that can cover the instance set. Then, we can summarize the following properties:
(1) ;
(2) .
Definition 10. In the neighborhood decision system , represents the sample space, and represents the label distribution’s feature space. By defining the multi-label decision space, we can expand the decision positive domain of the single-label decision using Equation (4). For a certain feature subset , the lower approximation of the decision attribute with respect to B is as follows: In this equation, the neighborhood particle is obtained from Definition 9, and the label particle is obtained from Definition 10. Through this method, we extend the NRS and solve the problem of selecting the neighborhood granularity of rough sets in labeled distribution learning.
Example 2. Continue to Table 1. We take for example, for feature set , and then we can compute according to Definition 9: , ; , ,,; , , ,. According to Definition 11, using Examples 1 and 2, we can calculate .
In LDL, because the labels of each instance are always related in some way, it is necessary to consider the importance of labels and the correlation between them.
Definition 11. For a sample and the corresponding feature vector , that is, , N is the number of instances in the training set, and and are any two labels in label space L. The correlation between and is calculated using the mutual information as follows: An undirected label graph (WUG)
can be constructed by applying Equation (12). Here,
represents the set of nodes of the undirected graph,
represents its set of edges, and
represents the weight of each edge [
48]. The importance of each node in this undirected graph is defined as follows:
In the above equations,
and
represent the weight divisions of nodes
and
, respectively.
is the set of nodes with edges to label
, and
represents the correlation between nodes. Equation (
10) is used to calculate
, which denotes the sum for the correlation for all edges starting from
. In addition,
d is the damping coefficient, where
is the recommended setting according to the method in Ref. [
49]. For ease of calculation, an initial weight value can be set for all nodes; this is usually
, where
L is the total number of nodes, that is, the total number of labels. Using this algorithm, we can calculate the correlation between node
(i.e., label
) and other nodes
related to it, as well as the structure of the graph (WUG). Through label correlation, we obtain the weight of each label in the label space, and this completes the exploration of label correlation.
4.2. Feature Selection Based on NRSs
The processing method of the labeled distribution neighborhood decision system is similar to that of the multi-label decision system. By extending the rough set importance theory of labeled distribution data (Equation (
6)), combining the positive domain theory of labeled distribution decision (Equation (
12)) and label correlation (Equation (
14)), we obtain the importance of the decision attribute
on the set feature subset
:
The above equation reflects the importance of the decision attributes corresponding to the decision positive domain and feature subset. It solves the problem of granularity selection and feature association of label distribution NRS.
In the neighborhood decision system
, for feature subset
B, the importance of
a to
B is defined as follows:
In the new importance model, we add label importance and label relevance to the NRS model. The new NRS model reflects the fusion of feature information and label relevance. For the above NRS model, we construct a method to compute the most reducible feature sets based on a greedy forward search.
In this method, steps 1 and 2 perform the preprocessing when the labeled data arrive. Our reduced set starts from the empty set and calculates the label weights
for the entire label space. This step requires traversing the entire label space and constructing an undirected graph. Assuming that the number of labels in the label space is
L, the time complexity of calculating the correlation between each pair of labels is
, and the time complexity of calculating the weight of each label is
. Therefore, the time complexity of steps 1 and 2 is
. Steps
are divided into two parts: calculating the neighborhood of the instance, and analyzing whether the instance and the neighborhood are important. First, the nearest hit or miss is determined for each instance by selecting the average approximate neighborhood as the domain granularity criterion. Assuming the instance space is
U, the time complexity of this step is
. Next, the neighborhood corresponding to the instance and the labeling granularity of the labeling space are determined, and then the decision positive domain and attribute importance are calculated. The time complexity of determining the neighborhood of each instance is
; the time complexity of determining the marking granularity of the marking space is
; and the time complexity of determining the decision positive domain and importance calculation corresponding to each sample is
. Therefore, the overall time complexity of instance domain calculation is
. Therefore, the time complexity of Algorithm 1 is
.
Algorithm 1 Calculate |
- Input:
neighborhood decision system . - Output:
- 1:
use Equation ( 5) to calculate the weight matrix ; - 2:
Initialize reduct ; - 3:
if reduct then - 4:
; - 5:
end if - 6:
if reduct then - 7:
, the average approximate neighbor under a is calculated by Equation ( 2); - 8:
For , calculate the neighborhood using Equation ( 3); - 9:
For , the label space granularity on label is calculated by Equation ( 7); - 10:
For the label of is calculated by Equation ( 7); - 11:
Calculate by Equation ( 8); - 12:
end if - 13:
For , repeat steps 3.1–3.5 to calculate ; - 14:
Calculate , red,; - 15:
Output , red,
|
4.3. Dynamic Algorithm for Online Label Distribution Feature Selection
Most feature selection algorithms assume that all candidate features are available to the algorithm before feature selection. In contrast, for streaming features, all the candidates cannot be collected before learning begins because they arrive dynamically and incrementally over time. Therefore, based on Method 1, we incorporate the online streaming feature selection framework [
50] and propose an online labeled distributed streaming feature selection algorithm to solve the labeled distributed stream feature selection problem.
In the labeled distributed stream feature decision system , represents the set of all non-empty instances, C represents the candidate feature subset, L represents the label space, t represents the time when the feature arrives, represents the new feature of t, and represents the current candidate feature subset at time t.
4.3.1. Importance Analysis
Letting be the new feature arriving at the i-th moment, is the reduced set reduct at the i-th moment. For the new features that arrive, the first step is to perform redundancy analysis on . The purpose of redundancy analysis is to determine whether the newly arrived feature is beneficial to the label set L, that is, to determine the importance of to the entire label set L.
Using Equation (
17),
is set to measure the importance of
. If
, we consider
to be redundant and unimportant for the current label set
L, and hence
can be omitted.
4.3.2. Significance Analysis
The purpose of correlation analysis is to evaluate the correlation between the new currently arriving features relative to the currently arriving features.
Define the significance of label
L as follows:
For a reduced
of label
, we design a mapping
that maps
to the following d-dimensional vector:
where
, if
; otherwise,
.
We then define the significance of feature
relative to
on A as follows:
For a new feature , we calculate ; when 0, we believe that the significance of the new feature for label set is greater than or equal to the average significance of the features that have been processed in label set . Therefore, we believe that is a significant feature and should be retained.
4.3.3. Redundancy Analysis
The purpose of redundancy analysis is to determine whether there is a feature in the current reduction set and whether the contribution of to label set L is the same as that of . When their contributions are the same, it is necessary to choose between the two features and .
For the two features and , if , then and have the same contribution as L. We thus compare and . If , we keep and discard ; if , we keep and discard .
The streaming feature selection framework is based on online importance analysis, saliency analysis, and redundancy analysis. In this framework, a training set of known feature size is used to simulate the streaming features, and each streaming feature is generated from the candidate feature set. In the framework shown in
Figure 2, we propose the dynamic tag distribution feature selection algorithm (Algorithm 1), which considers tag importance and tag correlation, and combines the above three analyses.
The flow feature selection framework, illustrated in
Figure 5, is based on online importance analysis, significance analysis, and redundancy analysis. In this framework, a training set with known feature sizes is used to simulate flow features, and each flow feature is generated from the candidate feature set. In the framework shown in
Figure 2, we propose a dynamic online label distribution feature selection algorithm that considers label importance and label correlation (Algorithm 2), which incorporates the above three types of analysis.
The main part of Algorithm 2 is the calculation of the dependency relationship between features. Here,
is the number of features in the currently selected feature set at time
t. Algorithm 1 evaluates whether new features at their time of arrival need to be retained and decides how to retain them. The whole process is an online selection problem, and includes three main parts: importance analysis, significance analysis, and redundancy analysis. We have made our code publicly available.
Algorithm 2 Dynamic online label distribution feature selection (DLILC-LDL). |
- Input:
New feature that arrives at the i-th moment, the simplified set that arrives the i - 1-th moment, label set L, and redundancy weight . - Output:
Reduced tset reduct. - 1:
Use Equation ( 5) to calculate weight matrix ; - 2:
Initialize reduct ; - 3:
If new features arrive, calculate according to Equation ( 16); - 4:
/significance analysis/ - 5:
if red then - 6:
Calculate ; - 7:
if then - 8:
discard , to - 9:
else - 10:
reduct = reduct , to ; - 11:
end if - 12:
/importance Analysis/ - 13:
else - 14:
According to Equation ( 11), calculate ; - 15:
end if - 16:
ifthen - 17:
reduct = reduct , to ; - 18:
else - 19:
/Redundancy Analysis/ - 20:
while reduct do - 21:
if and then - 22:
discard , to 5; - 23:
else if then - 24:
reduct = reduct - 25:
reduct = reduct , to - 26:
end if - 27:
end while - 28:
end if - 29:
No new features arrive; - 30:
Output reduced set reduct.
|
The feature calculation performed by Algorithm 2 is taken from Method 1. Hence, the time complexity of a single feature selection is . However, in most cases, Algorithm 2 is neither simple nor optimistic, and it requires online updates. Because the time complexity of the update depends on the calculation of feature dependencies, in the worst case, it is necessary to traverse all selected features for processing, and hence the time complexity in the worst case is .
6. Summary
This paper proposed a label distribution feature selection model based on label correlation and NRS, and it further proposed a label distribution feature selection algorithm based on this model, namely, a dynamic algorithm for processing streaming features. In the proposed approach, we first defined a new neighborhood particle by averaging the nearest neighbor method to better connect the information between the features of the label distribution data. Then, by calculating the mutual information between the features, we obtained the feature correlation weight and combined the lower approximation of the new neighborhood with the feature weight to obtain a new feature subset importance model. On this basis, we generalized the traditional NRS model, defined a new label space granularity, and granulated the label space of the label distribution data to adapt it to LDL. Incorporating rough set theory, we defined the label distribution neighborhood decision system and proposed a new importance model. Moreover, based on this algorithm, we proposed a dynamic feature selection algorithm that can solve the label distribution streaming feature problem by testing the arriving features in the time series using importance, significance, and redundancy analyses. The experimental results show that our algorithm is highly competitive compared with other commonly used algorithms. However, in a large number of dynamic data environments, the practical application of this method is somewhat challenging due to the high time complexity of the algorithm. The time complexity of the algorithm mainly comes from the calculation of the sample neighborhood and the feature selection of the sample based on the neighborhood. In the labeled distribution environment, the label granularity of the sample also needs to be calculated. Therefore, in order to improve the time complexity, the samples can be pre-sorted according to the rules of the label space, and the correctness of the feature selection can be further improved. This will be the direction for research and optimization in the next step. Therefore, in future work, we hope to further optimize the algorithm and reduce the running time of the algorithm. At the same time, we hope to further improve the label space particles and propose a better label distribution neighborhood decision system. In addition, the next step is to combine this method with practical applications (such as sentiment computing and multimodal sentiment recognition) to practice this method.