Online Multi-Label Streaming Feature Selection Based on Label Group Correlation and Feature Interaction

Multi-label streaming feature selection has received widespread attention in recent years because the dynamic acquisition of features is more in line with the needs of practical application scenarios. Most previous methods either assume that labels are independent of each other or, although they explore label correlation, make the relationship between related labels and features difficult to understand or specify. In real applications, both situations may occur: labels are correlated, and features may be specific to certain labels. Moreover, these methods treat features individually, without considering the interaction between features. Motivated by this, we present a novel online streaming feature selection method based on label group correlation and feature interaction (OSLGC). In our design, we first divide labels into multiple groups with the help of graph theory. Then, we integrate label weights and mutual information to accurately quantify the relationships between features under different label groups. Subsequently, a novel feature selection framework using sliding windows is designed, including online feature relevance analysis and online feature interaction analysis. Experiments on ten datasets show that the proposed method outperforms several mature MFS algorithms in predictive performance, supported by statistical tests, stability analysis, and ablation studies.


Introduction
Multi-label feature selection (MFS) plays a crucial role in addressing the preprocessing of high-dimensional multi-label data. Numerous methods have been proposed and proven to be effective in improving prediction performance and model interpretability. However, traditional MFS methods assume that all features are collected and presented to the learning model beforehand [1][2][3][4]. This assumption does not align with many practical application scenarios where not all features are available in advance. In video recognition, for example, each frame may possess important features that become available over time. Hence, achieving real-time feature processing has emerged as a significant concern [5][6][7][8].
Online multi-label feature selection with streaming features is an essential branch of MFS that facilitates the efficient real-time management of streaming features. It provides significant advantages, such as low time and space consumption, particularly when dealing with extremely high-dimensional datasets. Some notable works in this area have attracted attention, including online multi-label streaming feature selection based on a neighborhood rough set (OM-NRS) [9], multi-label streaming feature selection (MSFS) [10], and a streaming feature selection method (ASFS) [11]. However, these methods primarily focus on eliminating irrelevant and/or redundant features. Beyond identifying irrelevant and/or redundant features, feature interaction is crucial but often overlooked. Feature interaction refers to features that have weak or independent correlations with the label but, when combined with other features, may exhibit a strong association with the predicted label. Streaming feature selection with feature interaction (SFS-FI) [12] is a representative approach that considers feature interaction dynamically. SFS-FI successfully identifies the impact of feature interaction; however, it cannot tackle the learning challenge in multi-label scenarios.
Another difficulty with online MFS is that labels are universally correlated, which is a distinctive property of multi-label data [13][14][15][16]. Intuitively, a known label can aid in learning an unknown one, and the co-occurrence of two labels may provide additional information to the model. For example, an image with 'grassland' and 'lion' labels is likely also to be marked as 'African'; similarly, 'sea' and 'ship' labels tend to appear together in a short video, while 'train' and 'sea' labels tend not to appear together. Some research has been carried out around label correlation. Representative work includes multi-label streaming feature selection (MSFS) and online multi-label streaming feature selection with label correlation (OMSFS_LC). MSFS captures label correlation by constructing a new data representation pattern for the label space and utilizes the constructed label relationship matrix to examine the merits of features. OMSFS_LC constructs label weights by calculating label correlation and, on this basis, integrates the label weights into the significance analysis and relevance analysis of streaming features. The methods mentioned above select features by evaluating the relationship between features and the global label space. This strategy may not be optimal, as it is challenging to comprehend and specify the relationship between relevant labels and features. Based on research by Li et al. [17], strongly correlated labels tend to share similar specific features, while weakly related labels typically have distinct features. In line with this observation, this paper groups related labels: strongly related labels are placed in the same group, while weakly related labels are separated into different groups.
Accordingly, a novel online multi-label streaming feature selection method based on label group correlation and feature interaction, namely OSLGC, is proposed to select pertinent and interactive features from streaming features. First, our method calculates the label correlation matrix and uses graph theory to group related labels: labels within the same group have strong correlation, while labels from different groups have weak correlation. Then, we define the feature relevance item and integrate the label weight and feature interaction weight into it. Subsequently, a framework based on sliding windows is established, which iteratively processes streaming features through two steps: online feature relevance analysis and online feature interaction analysis. Finally, extensive experiments demonstrate that OSLGC yields significant performance improvements compared with other mature MFS methods. The main contributions of OSLGC are as follows:
• By utilizing graph theory, label groups are constructed so that closely associated labels are grouped together. This provides an effective means of visualizing the relationships among labels.
• We provide a formal definition of feature interaction and quantify the impact of feature interaction under different label groups. On this basis, OSLGC is capable of selecting interactive features.
• A novel streaming feature selection framework using sliding windows is proposed, which resolves the online MFS problem by simultaneously taking feature interaction, label importance, and label group correlation into account.
• Experiments on ten datasets demonstrate that the proposed method is competitive with existing mature MFS algorithms in terms of predictive performance, statistical analysis, stability analysis, and ablation experiments.
The rest of this article is arranged as follows: In Section 2, we review previous research. Section 3 provides the relevant preparatory information. In Section 4, we present the detailed procedure for OSLGC. In Section 5, we report the empirical study. Finally, Section 6 sums up the work of this paper and considers the prospects, priorities and direction of future research.

Related Work
Multi-label feature selection (MFS), as a widely known data preprocessing method, has achieved promising results in different application fields, such as emotion classification [18], text classification [19], and gene detection [20]. Depending on whether the features are sufficiently captured in advance, existing MFS methods can be divided into batch and online methods.
The batch method assumes that the features presented to learning are pre-available. Generally speaking, it can be further subdivided into several types according to the characteristics provided by the complex label space, including missing labels [21,22], label distribution [23,24], label selection [25], label imbalance [26,27], streaming labels [28,29], partial labels [30,31], label-specific features [32][33][34], and label correlation [35][36][37]. Among them, investigating label correlation is considered to be a favorable strategy to promote the performance of learning. To date, many works have focused on this. For instance, label supplementation for multi-label feature selection (LSMFS) [38] evaluates the relationship between labels using mutual information provided by the features. Quadratically constrained linear programming (QCLP) [39] introduces a matrix-variate normal prior distribution to model label correlation. By minimizing the label ranking loss of label correlation regularization, QCLP is able to identify a feature subset. On the other hand, label-specific features emphasize that different labels may possess their own specific features. One of the most representative studies, Label specIfic FeaTures (LIFT) [32], has shown that using label-specific features to guide the MFS process can elevate the performance and interpretability of learning tasks. Recently, group-preserving label-specific feature selection (GLFS) [33] exploits label-specific features and common features with l_2,1-norm regularization to support the interpretability of the selected features.
The online method differs from the batch method in that features are generated on-the-fly and feature selection takes place in real-time as the features arrive. Based on the characteristics of the label space, it can be categorized into two groups: label independence and label correlation. For label independence, several methods have been proposed, such as the streaming feature selection algorithm with dynamic sliding windows and feature repulsion loss (SF-DSW-FRL) [40], multi-objective online streaming multi-label feature selection using mutual information (MOML) [41], streaming feature selection via class-imbalance aware rough set (SFSCI) [42], online multi-label group feature selection (OMGFS) [43], and multi-objective multi-label-based online feature selection (MMOFS) [44]. Similar to the static MFS methods, the online MFS approach also focuses on exploring label correlation. For instance, MSFS [10] uses the relationship between samples and labels to construct a new data representation model to measure label correlation, and implements feature selection by designing feature correlation and redundancy analysis. Multi-label online streaming feature selection with mutual information (ML-OSMI) [45] uses high-order methods to determine label correlation, and combines spectral granulation and mutual information to evaluate streaming features. Unfortunately, existing methods cannot exactly capture the impact of label relationships on the evaluation of streaming features and are hindered by a time-consuming calculation procedure. Thus, online multi-source streaming feature selection (OMSFS) [7] investigates label correlation by calculating mutual information and, on this basis, constructs a weight for each label and designs a significance analysis to improve computational efficiency.
Based on our review of previous studies, we find that with the arrival of each new feature, existing methods can be effective in dealing with streaming features. However, these methods pay more attention to the contribution of features to all labels, and do not explore the specific relationship between features and labels. To put it simply, they fail to consider that highly correlated labels may have common features, while weakly correlated labels may have distinct features. Additionally, it is important to mention that most previous works have focused on selecting the most relevant features to labels, but have ignored the potential contribution of feature interactions to labels. In contrast, our framework pays close attention to feature interactions and label group correlation, and seeks to explore the specific features and label group weights corresponding to the label group.

Multi-Label Learning
Given a multi-label information table MLS = <U, F, L>, where U = {x_1, x_2, ..., x_n} is a non-empty instance set, and F = {f_1, f_2, ..., f_d} and L = {l_1, l_2, ..., l_m} are the feature set and label set used to describe instances, respectively. l_i(x_k) represents the value of label l_i on instance x_k ∈ U, where l_i(x_k) = 1 if x_k possesses l_i, and 0 otherwise. The task of multi-label learning is to learn a function h : U → 2^L.

Basic Information-Theoretic Notions
This section introduces some basic information theory concepts that are commonly used in the evaluation of feature quality.

Definition 1. Let X = {x_1, x_2, ..., x_n} be a discrete random variable and P(x_i) be the probability of x_i; then the entropy of X is:

H(X) = −∑_{i=1}^{n} P(x_i) log P(x_i)    (1)

H(X) is a measure of the randomness or uncertainty in the distribution of X. It is at a maximum when all the possible values of X are equally probable, and at a minimum when X takes only one value with probability 1.

Definition 2.
Let Y = {y_1, y_2, ..., y_m} be another discrete random variable. Then, the joint entropy H(X, Y) of X and Y is:

H(X, Y) = −∑_{i=1}^{n} ∑_{j=1}^{m} P(x_i, y_j) log P(x_i, y_j)    (2)

where P(x_i, y_j) denotes the joint probability of x_i and y_j.
Definition 3. Given X and Y, when the variable Y is known, the residual uncertainty of X can be determined by the conditional entropy H(X|Y):

H(X|Y) = −∑_{i=1}^{n} ∑_{j=1}^{m} P(x_i, y_j) log P(x_i|y_j)    (3)

where P(x_i|y_j) is the conditional probability of x_i given y_j.
Definition 4. Given X and Y, the amount of information shared by the two variables can be determined by the mutual information I(X; Y):

I(X; Y) = ∑_{i=1}^{n} ∑_{j=1}^{m} P(x_i, y_j) log [ P(x_i, y_j) / (P(x_i) P(y_j)) ] = H(X) − H(X|Y)    (4)

The larger the value of I(X; Y), the stronger the correlation between the two variables. Conversely, the two variables are independent if I(X; Y) = 0.

Definition 5.
Given variables X, Y, and Z, when Z is given, the reduction in the uncertainty of X due to knowing Y can be measured by the conditional mutual information I(X; Y|Z):

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)    (5)
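As a concrete reference for Definitions 1-5, the five quantities can be estimated from discrete samples with plug-in (empirical-frequency) estimators. The short Python sketch below is illustrative only; the function names and the estimator choice are not part of the original method description:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum P(x) log2 P(x), estimated from a sample (Definition 1)."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    """H(X, Y) over paired samples (Definition 2)."""
    return entropy(list(zip(xs, ys)))

def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y) (Definition 3)."""
    return joint_entropy(xs, ys) - entropy(ys)

def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y) (Definition 4)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) = H(X | Z) - H(X | Y, Z) (Definition 5)."""
    yz = list(zip(ys, zs))
    return conditional_entropy(xs, zs) - conditional_entropy(xs, yz)
```

For example, `mutual_information([0, 0, 1, 1], [0, 1, 0, 1])` is 0 for these independent columns, while `mutual_information(x, x)` equals `entropy(x)`.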

Exploiting Label Group Correlation
To investigate label group correlation, in this subsection, we introduce a graph-based method to further distinguish the relevant labels, which can effectively differentiate relevant labels by grouping strongly related labels together and separating weakly related ones into different groups. The process involves two fundamental steps: (1) constructing an undirected graph of the labels, and (2) partitioning the graph to create distinct label groups.
In the first step, OSLGC aims to construct an undirected graph that effectively captures the correlation among all labels, thus providing additional information for streaming feature evaluation. For this purpose, it is necessary to investigate the correlation between labels.

Definition 6. Given <U, F, L>, x_k ∈ U, and l_i, l_j ∈ L, where l_i(x_k) represents the value of label l_i with respect to instance x_k, the correlation r_ij between the two labels is defined as their mutual information:

r_ij = I(l_i; l_j)    (6)

Obviously, if l_i and l_j are independent, then r_ij = 0; otherwise, r_ij > 0.
Using Equation (6), the label correlation matrix M(R_LC) = [r_ij]_{m×m} is obtained. Based on this matrix, the weighted undirected graph of label correlation can be constructed as Graph = (V, E), where V = {l_i | i ∈ [1, m]} and E = {(l_i, l_j) | i, j ∈ [1, m], i ≠ j} denote the vertices and edges of Graph, respectively. As M(R_LC) is symmetric, Graph is an undirected graph that reflects the correlation among all labels. Regrettably, however, Graph has m vertices and m(m − 1)/2 edges. For ultra-high-dimensional data, the density of the graph is considerable, which often leads to strong interweaving of edges with different weights. Moreover, partitioning a complete graph is an NP-hard problem. Therefore, it is necessary to reduce the edges of Graph.
In the second step, OSLGC aims to divide the graph and create label groups. With this intent, we first generate a minimum spanning tree (MST) through the Prim algorithm. The MST has the same vertices as Graph and a subset of its edges. The weight of each edge in the MST is expressed as W(l_i, l_j) and generally differs across edges. To divide strongly correlated labels into groups, we set a threshold and break the edges below the threshold in the MST.

Definition 7.
Given that W(l_i, l_j) represents the weight of an edge in the MST, the threshold for weak label correlation is defined as:

δ = (1 / (m − 1)) ∑_{(l_i, l_j) ∈ MST} W(l_i, l_j)    (7)

δ is the average of the MST edge weights and is used to divide the label groups, thereby putting the strongly related labels in the same group.
Concretely, if W(l_i, l_j) ≥ δ, the relationship between labels l_i and l_j is a strong label correlation, and we reserve the edge that connects l_i with l_j. If W(l_i, l_j) < δ, the relationship between labels l_i and l_j is a weak label correlation, and we remove the edge that connects l_i with l_j from the MST. Hence, the MST can be segmented into a forest by threshold segmentation. In the forest, the label nodes within each subtree are strongly correlated, while the label nodes in different subtrees are weakly correlated. Based on this, we can treat each subtree as a label group, denoted as L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p}.

Example 1. A multi-label dataset is presented in Table 1. First, the label correlation matrix is calculated using Equation (6). Then, we can create the label undirected graph using the label correlation matrix, as shown in Figure 1a. Immediately afterwards, the minimum spanning tree is generated by the Prim algorithm, as shown in Figure 1b. Finally, the threshold δ of the MST is calculated using Equation (7), and the edges that meet the condition W(l_i, l_j) < δ are removed, as shown in Figure 1c.

Table 1. Example of multi-label data.
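The two-step grouping procedure (spanning tree, then threshold cut) can be sketched as follows. One caveat: for the cut "remove edges with W < δ" to separate weakly correlated labels, the retained tree must favor high-correlation edges, so this sketch runs Prim's algorithm in its maximum-weight orientation over the correlation weights; reading the text's spanning tree this way, and the function names below, are assumptions of the sketch:

```python
def prim_spanning_tree(n, weight):
    """Prim's algorithm over vertices 0..n-1 of a complete weighted graph.
    Grows the tree by repeatedly attaching the heaviest edge leaving it
    (maximum-weight orientation, treating weights as label correlations)."""
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree:
                    w = weight(i, j)
                    if best is None or w > best[0]:
                        best = (w, i, j)
        w, i, j = best
        edges.append((i, j, w))
        in_tree.add(j)
    return edges

def label_groups(n, weight):
    """Cut tree edges below the mean edge weight (the threshold delta);
    the connected components that remain are the label groups."""
    edges = prim_spanning_tree(n, weight)
    delta = sum(w for _, _, w in edges) / len(edges)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, w in edges:
        if w >= delta:              # keep strongly correlated pairs
            parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

With four labels where the pairs (0, 1) and (2, 3) are strongly correlated and all cross-pair correlations are weak, `label_groups` returns the two expected groups.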

Analysis of Feature Interaction under Label Groups
As a rule, the related labels generally share some label-specific features [17,33], i.e., labels within the same label group may share the same label-specific features. Thus, to generate label-specific features for different label groups, in this subsection, we will further analyze feature relationships under different label groups, including feature independence, feature redundancy, and feature interaction. We also give the interaction weight factor to quantify the influence degree of the feature relationship under different label groups.
Definition 8 (Feature independence). Given a set of label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p}, LG_h ⊆ L, S_t = {f_1, f_2, ..., f_d*} denotes the selected features, and f_t is a new incoming feature. For ∀f_i ∈ S_t, f_i and f_t are referred to as feature independence under LG_h if, and only if:

I(f_i, f_t; LG_h) = I(f_i; LG_h) + I(f_t; LG_h)    (8)

According to Definition 8, Equation (8) suggests that the information provided by features f_i and f_t for the label group LG_h is non-interfering, i.e., the features are independent of each other under label group LG_h.

Theorem 1. If I(f_i; f_t | LG_h) = I(f_i; f_t), then f_i and f_t are independent under label group LG_h.

Proof. By the chain rule of mutual information, I(f_i, f_t; LG_h) = I(f_i; LG_h) + I(f_t; LG_h) + I(f_i; f_t | LG_h) − I(f_i; f_t). If I(f_i; f_t | LG_h) = I(f_i; f_t), then I(f_i, f_t; LG_h) = I(f_i; LG_h) + I(f_t; LG_h). Thus, f_i and f_t are independent under label group LG_h.

Theorem 2.
If f_i and f_t are independent, then, under the condition that label group LG_h is known, I(f_i; f_t | LG_h) = 0.

Proof. If f_i and f_t are independent, i.e., I(f_i; f_t) = 0, then, according to Definition 5, it can be shown that I(f_i; f_t | LG_h) = 0.

Definition 9 (Feature redundancy). Given a set of label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p}, LG_h ⊆ L, S_t = {f_1, f_2, ..., f_d*} denotes the selected features, and f_t is a new incoming feature. For ∀f_i ∈ S_t, f_i and f_t are referred to as feature redundancy under LG_h if, and only if:

I(f_i, f_t; LG_h) < I(f_i; LG_h) + I(f_t; LG_h)    (9)

Equation (9) suggests that there is partial duplication of the information provided by the two features; that is, the amount of information brought by features f_i and f_t together for label group LG_h is less than the sum of the information brought by the two features for LG_h separately.
Theorem 3. If I(f_i; f_t | LG_h) < I(f_i; f_t), then the relationship between f_i and f_t is a pair of feature redundancy under label group LG_h.

Proof. By the chain rule of mutual information, I(f_i, f_t; LG_h) = I(f_i; LG_h) + I(f_t; LG_h) + I(f_i; f_t | LG_h) − I(f_i; f_t). If I(f_i; f_t | LG_h) < I(f_i; f_t), then I(f_i, f_t; LG_h) < I(f_i; LG_h) + I(f_t; LG_h). Thus, the relationship between f_i and f_t is a pair of feature redundancy under label group LG_h.

Definition 10 (Feature interaction). Given a set of label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p}, LG_h ⊆ L, S_t = {f_1, f_2, ..., f_d*} denotes the selected features, and f_t is a new incoming feature. For ∀f_i ∈ S_t, f_i and f_t are referred to as feature interaction under LG_h if, and only if:

I(f_i, f_t; LG_h) > I(f_i; LG_h) + I(f_t; LG_h)    (10)

Equation (10) suggests that there is a synergy between features f_i and f_t for label group LG_h; that is, together they yield more information for label group LG_h than could be expected from the sum of I(f_i; LG_h) and I(f_t; LG_h).
Theorem 4. If I(f_i; f_t | LG_h) > I(f_i; f_t), then f_i and f_t are a pair of feature interaction under label group LG_h.

Proof. By the chain rule of mutual information, I(f_i, f_t; LG_h) = I(f_i; LG_h) + I(f_t; LG_h) + I(f_i; f_t | LG_h) − I(f_i; f_t). If I(f_i; f_t | LG_h) > I(f_i; f_t), then I(f_i, f_t; LG_h) > I(f_i; LG_h) + I(f_t; LG_h). Thus, f_i and f_t are a pair of feature interaction under label group LG_h.

Property 1.
If two features f_i and f_t are not independent, the correlations between f_i and f_t under different label groups LG_h may be distinct. This is easy to show with Example 2.

Example 2.
Continuing with the data in Table 1, and as shown in Table 2, I(f_1, f_2; LG_1) = 0.997 is less than I(f_1; LG_1) + I(f_2; LG_1) = 1.227; according to Definition 9, f_1 and f_2 are a feature redundancy under label group LG_1. However, for label group LG_3, it satisfies I(f_1, f_2; LG_3) > I(f_1; LG_3) + I(f_2; LG_3); that is, f_1 and f_2 are a feature interaction under label group LG_3. This finding suggests that the relationship between f_1 and f_2 changes dynamically under different label groups.
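By the chain rule of mutual information, the three relationships above can be checked empirically through the sign of the difference I(f_i; f_t | LG_h) − I(f_i; f_t): negative indicates redundancy, zero independence, and positive interaction under LG_h. A small plug-in sketch (the function names and the zero tolerance are our own choices):

```python
from collections import Counter
from math import log2

def _H(cols):
    """Empirical joint entropy of one or more aligned discrete columns."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(rows).values())

def pair_relationship(fi, ft, lg):
    """Classify the pair (fi, ft) under label group lg by the sign of the
    interaction gain I(fi; ft | lg) - I(fi; ft)."""
    mi = _H([fi]) + _H([ft]) - _H([fi, ft])                            # I(fi; ft)
    cmi = _H([fi, lg]) + _H([ft, lg]) - _H([fi, ft, lg]) - _H([lg])    # I(fi; ft | lg)
    gain = cmi - mi
    if abs(gain) < 1e-12:
        return "independent"
    return "interactive" if gain > 0 else "redundant"
```

For instance, two features whose XOR equals the label group score as interactive: each alone carries no information about the group, but together they determine it.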
Consequently, to evaluate features accurately, it is imperative to quantify the influence of the feature relationships on feature relevance. That is, if the inflow of a new feature f_t has a positive effect in predicting the labels, we should enlarge the weight of f_t; otherwise, the weight of f_t should be reduced. The feature interaction weight factor is defined to quantify the impact of the feature relationships as follows:

Definition 11. Given a set of label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p}, LG_h ⊆ L, S_t = {f_1, f_2, ..., f_d*} denotes the selected features, and f_t is a new incoming feature. For ∀f_i ∈ S_t, the feature interaction weight between f_i and f_t under LG_h is defined as:

FW(f_i, f_t, LG_h) = 2^{I(f_i; f_t | LG_h) − I(f_i; f_t)}    (11)

FW(f_i, f_t, LG_h) offers additional information for evaluating feature f_t. If feature f_t and a selected feature f_i ∈ S_t are independent or redundant, it holds that FW(f_i, f_t, LG_h) ≤ 1. However, if the feature relationship is interactive, it holds that FW(f_i, f_t, LG_h) > 1.

Streaming Feature Selection with Label Group Correlation and Feature Interaction
Streaming features refer to features acquired over time; however, in fact, not all features obtained dynamically are helpful for prediction. Therefore, it is necessary to extract valuable features from the streaming features' environment. To achieve this purpose, in this paper, we implement the analysis of streaming features in two stages: online feature relevance analysis and online feature interaction analysis.

Online Feature Relevance Analysis
The purpose of feature relevance analysis is to select features that are important to the label groups. Correspondingly, the feature relevance is defined as follows:

Definition 12 (Feature relevance). Given label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p} and a new incoming feature f_t, the feature relevance item is defined as:

γ(f_t) = ∑_{h=1}^{p} W(LG_h) · I(f_t; LG_h)    (12)

in which W(LG_h) denotes the weight assigned to each label group:

W(LG_h) = H(LG_h) / ∑_{k=1}^{p} H(LG_k)    (13)

where H(LG_h) is the information entropy of label group LG_h.
The higher the weight of the label group, the more important the label group is to other label groups. In other words, the corresponding label-specific features of the label group should have higher feature importance.
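Under one plausible reading of Definition 12, in which W(LG_h) is each group's entropy normalized over all groups (the normalization, like the function names below, is our assumption; the text states only that the weight is built from H(LG_h)), the relevance of an incoming feature can be sketched as:

```python
from collections import Counter
from math import log2

def _H(col):
    """Empirical entropy of one column of discrete values."""
    n = len(col)
    return -sum(c / n * log2(c / n) for c in Counter(col).values())

def group_weights(groups):
    """W(LG_h): each label group's entropy, normalized over all groups.
    Each group is given as one discrete column (the joint value of its labels)."""
    hs = [_H(g) for g in groups]
    total = sum(hs)
    return [h / total if total else 0.0 for h in hs]

def relevance(ft, groups):
    """gamma(f_t) = sum_h W(LG_h) * I(f_t; LG_h) -- one reading of Definition 12."""
    ws = group_weights(groups)
    mi = lambda a, b: _H(a) + _H(b) - _H(list(zip(a, b)))
    return sum(w * mi(ft, g) for w, g in zip(ws, groups))
```

A degenerate (constant) label group receives weight zero, so it contributes nothing to the relevance score, which matches the intuition that an uninformative group should not drive feature selection.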

Definition 13.
Given label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p}, f_t is a new incoming feature, and γ(f_t) is the feature relevance. With a pair of thresholds α and β (0 < α < β), we define: f_t is strongly relevant if γ(f_t) ≥ β; f_t is weakly relevant if α ≤ γ(f_t) < β; and f_t is irrelevant if γ(f_t) < α. In general, for a new incoming feature f_t, if f_t is strongly relevant, we select it; if f_t is irrelevant, we directly abandon it and no longer consider it later; if f_t is weakly relevant, making an immediate decision (selecting or abandoning) carries a greater risk of misjudgment, and the best approach is to obtain more information before deciding.

Online Feature Interaction Analysis
Definition 13 can be used to make intuitive judgments about strongly relevant and irrelevant features. However, Definition 13 does not provide a basis for selecting or abandoning weakly relevant features. Therefore, it is necessary to further determine whether to remove or retain the weakly relevant features.

Definition 14.
Given label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p}, S_t = {f_1, f_2, ..., f_d*} denotes the selected features, and f_t is a new incoming feature. The feature relevance when considering feature interaction, called the enhanced feature relevance, is defined as:

Γ(f_t) = ∑_{h=1}^{p} W(LG_h) · ( (1/|S_t|) ∑_{f_i ∈ S_t} FW(f_i, f_t, LG_h) ) · I(f_t; LG_h)    (14)

in which FW(f_i, f_t, LG_h) is the feature interaction weight between f_t and f_i ∈ S_t. Furthermore, to determine whether to retain a weakly relevant feature, we set the mean feature relevance of the selected features as the relevance threshold, as follows:

Definition 15. Given that S_t = {f_1, f_2, ..., f_d*} denotes the selected features and f_i ∈ S_t, at time t, the mean value of the feature relevance over the selected features is:

Mean_t = (1/|S_t|) ∑_{f_i ∈ S_t} γ(f_i)    (15)

Obviously, when Γ(f_t) > Mean_t, the weakly relevant feature f_t interacts with the selected features; in this case, f_t can enhance the prediction ability and is selected as an effective feature. Otherwise, when Γ(f_t) ≤ Mean_t, adding the weakly relevant feature f_t does not promote the prediction ability for the labels, and we can discard it.

Streaming Feature Selection Strategy Using Sliding Windows
According to Definition 13, two main issues need to be addressed: (1) how to design a streaming feature selection mechanism to discriminate the newly arrived features; (2) how to set proper thresholds for α and β.
(1) Streaming feature selection with sliding windows: To solve the first challenge, a sliding window mechanism is proposed to receive the arrived features in a timed sequence, which is consistent with the dynamic nature of the streaming features. The specific process can be illustrated using the example in Figure 2.
• First, the sliding window (SW) continuously receives and saves the arriving features. When the number of features in the sliding window reaches the preset size, the features in the window are discriminated, which includes decision-making with regard to selection, abandonment, or delay.
• Then, according to the feature relevance γ(f_t) (Definition 12), we can straightforwardly select the strongly relevant features (e.g., f_15 in Figure 2) and discard the irrelevant ones, while the weakly relevant features are held for online feature interaction analysis.
• This process is performed repeatedly. That is, when the features in the sliding window reach saturation, or no new features appear, the next round of feature analysis is performed.
(2) Threshold setting for α and β: To solve the second challenge, we assume that the experimental data follow a normal distribution and that the streaming features arrive randomly. Inspired by the 3σ principle of the normal distribution, we set α and β using the mean and standard deviation of the feature relevances in the sliding window.
Definition 16. Given a sliding window SW, f_t ∈ SW, and γ(f_t) the feature relevance, then, at time t, the mean value µ_t of the sliding window is:

µ_t = (1/|SW|) ∑_{f_t ∈ SW} γ(f_t)    (16)

Definition 17. Given a sliding window SW, f_t ∈ SW, and γ(f_t) the feature relevance, then, at time t, the standard deviation σ_t of the sliding window is:

σ_t = √( (1/|SW|) ∑_{f_t ∈ SW} (γ(f_t) − µ_t)² )    (17)

Therefore, we combine the 3σ principle of normally distributed data to redefine the three feature relationships.

Definition 18.
Given γ(f_t) the feature relevance and, at time t, µ_t and σ_t the mean and standard deviation of the feature relevances in the sliding window, we define the three feature relationships as: f_t is strongly relevant if γ(f_t) ≥ µ_t + σ_t; f_t is weakly relevant if µ_t − σ_t ≤ γ(f_t) < µ_t + σ_t; and f_t is irrelevant if γ(f_t) < µ_t − σ_t. Through the above analysis, we propose a novel algorithm, named OSLGC, as shown in Algorithm 1.
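The window statistics and the three-way split of Definitions 16-18 can be sketched as follows (the one-σ band edges follow our reading of Definition 18, and the function names are our own):

```python
from math import sqrt

def window_stats(relevances):
    """mu_t and sigma_t of the feature relevances currently in the window
    (Definitions 16 and 17, population standard deviation)."""
    n = len(relevances)
    mu = sum(relevances) / n
    sigma = sqrt(sum((r - mu) ** 2 for r in relevances) / n)
    return mu, sigma

def triage(gamma, mu, sigma):
    """Three-way split of Definition 18: one sigma above the window mean
    selects, one sigma below discards, and the band in between defers."""
    if gamma >= mu + sigma:
        return "strongly relevant"
    if gamma < mu - sigma:
        return "irrelevant"
    return "weakly relevant"
```

For a window of relevances [1, 2, 3, 4, 5], the mean is 3 and the standard deviation is √2, so 5 is strongly relevant, 3 falls in the deferred band, and 1 is irrelevant.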

Algorithm 1
The OSLGC algorithm
Input: SW: sliding window; f_t: predictive features; L: label set.
Output: S_t: the feature subset at time t.
1: Generate label groups L = {LG_1 ∪ LG_2 ∪ · · · ∪ LG_p} by Section 4.1;
2: repeat
3:   Get a new feature f_t at time t;
4:   Add feature f_t to the sliding window SW;
5:   while SW is full or no features are available do
6:     Compute µ_t, σ_t, and Mean_t;
7:     for each f_t ∈ SW do
8:       if γ(f_t) ≥ µ_t + σ_t then
9:         S_t = S_t ∪ f_t;
10:      else if µ_t − σ_t ≤ γ(f_t) < µ_t + σ_t then
11:        if Γ(f_t) > Mean_t then
12:          S_t = S_t ∪ f_t;
13:        end if
14:      else
15:        Discard f_t;
16:      end if
17:    end for
18:  end while
19: until no features are available;
20: Return S_t;

The major computation in OSLGC is the feature analysis in the sliding windows (Steps 5-18). Assuming |F_t| is the number of currently arrived features and |L| is the number of labels, in the best case, OSLGC obtains a feature subset after running only the online feature relevance analysis, and the time complexity is O(|F_t| · |L|). In many cases, however, the features are not simply strongly relevant or irrelevant but also include weakly relevant ones, so the online feature interaction analysis must be performed as well. The final time complexity is O(|F_t|² · |L|).
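Putting the pieces together, the control flow of Algorithm 1 can be sketched with the scoring functions abstracted away. Here `relevance` and `enhanced_relevance` are caller-supplied stand-ins for the relevance and interaction-aware scores, and routing weakly relevant features through the Mean_t check follows the text's description of the online feature interaction analysis; the rest is an illustrative skeleton, not the authors' implementation:

```python
from math import sqrt

def oslgc(stream, window_size, relevance, enhanced_relevance):
    """Control-flow sketch of Algorithm 1. `relevance(f)` stands in for
    gamma(f) (Definition 12) and `enhanced_relevance(f, selected)` for the
    interaction-aware score (Definition 14)."""
    selected, window = [], []

    def flush():
        gammas = {f: relevance(f) for f in window}
        n = len(window)
        mu = sum(gammas.values()) / n
        sigma = sqrt(sum((g - mu) ** 2 for g in gammas.values()) / n)
        for f in window:
            if gammas[f] >= mu + sigma:            # strongly relevant: keep
                selected.append(f)
            elif gammas[f] >= mu - sigma:          # weakly relevant: interaction check
                if selected:
                    mean_t = sum(relevance(s) for s in selected) / len(selected)
                    if enhanced_relevance(f, selected) > mean_t:
                        selected.append(f)
            # otherwise irrelevant: discard
        window.clear()

    for f in stream:
        window.append(f)
        if len(window) == window_size:
            flush()
    if window:                                     # leftover features at stream end
        flush()
    return selected
```

A feature that is only weakly relevant on its own is kept exactly when its interaction-aware score beats the mean relevance of the features already selected, mirroring Steps 10-13 of the listing.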

Data Sets
We conducted experiments on ten multi-label datasets drawn mainly from three different domains: text, audio, and images. Among them, the first eight datasets (i.e., Business, Computer, Education, Entertainment, Health, Recreation, Reference, and Society) were taken from Yahoo and were derived from real-world web text classification. For audio, Birds is an audio dataset that identifies 19 species of birds. For images, Scene includes 2407 images with up to six labels per image. These datasets are freely available for public download and have been widely used in research on multi-label learning.
Detailed information is provided in Table 3. For each dataset S, we use |S|, F(S), and L(S) to represent the number of instances, number of features, and number of labels, respectively. LCard(S) denotes the average number of labels per example, and LDen(S) standardizes LCard(S) according to the number of possible labels. In addition, it is worth noting that the number of instances and the number of labels in different datasets vary from 645 to 5000 and from 6 to 33, respectively. These datasets with varied properties provide a solid foundation for algorithm testing. Table 3. Detailed description of datasets.

Experimental Setting
To evaluate the performance of OSLGC, we compared it with several recent MFS algorithms. For a reasonable comparison, two different types of algorithms were selected as comparison algorithms: (1) two online multi-label streaming feature selection algorithms, and (2) five MFS methods based on information theory. Specifically, the two online multi-label streaming feature selection methods were multi-label streaming feature selection (MSFS) [10] and online multi-label feature selection based on a neighborhood rough set (OM-NRS) [9]. The five MFS methods based on information theory were multi-label feature selection with label dependency and streaming labels (MSDS) [16], multi-label feature selection with streaming labels (MLFSL) [28], label supplementation for multi-label feature selection (LSMFS) [38], maximum label supplementation for multi-label feature selection (MLSMFS) [38], and constraint regression and adaptive spectral graph (CSASG) [46]. Details of two of these algorithms are provided below.
• MLFSL: an MFS algorithm based on streaming labels, which fuses feature rankings by minimizing the overall weighted deviation.
• CSASG: a multi-label feature selection framework that incorporates a spectral graph term based on information entropy into the manifold framework.
For the proposed method, the size of the sliding window |SW| is set to 15 in this paper. For the algorithms that obtain a feature subset, e.g., MSDS, MSFS, and OM-NRS, we use the feature subset obtained by these algorithms to construct new data for prediction. For the algorithms that produce a feature ranking, e.g., MLFSL, LSMFS, MLSMFS, and CSASG, the top p features are selected, where p equals the dimension of the feature subset obtained by the OSLGC algorithm. Furthermore, we select average precision (AP), Hamming loss (HL), one error (OE), and macro-F1 (F1) as the evaluation metrics. Due to space limitations, these metrics are not described in detail here; the formulas and descriptions of all the evaluation metrics are provided in [47,48]. Finally, MLkNN (k = 10) is selected as the basic classifier.

Experimental Results
Tables 4-7 display the results for the different evaluation metrics, where the symbol "↓ (↑)" indicates "the smaller (larger), the better". Boldface highlights the best prediction performance on each dataset, and the penultimate row of each table shows the average value of each algorithm over all datasets. Furthermore, the Win/Draw/Loss record gives the number of datasets on which OSLGC outperforms, ties with, and underperforms each of the other algorithms, respectively. The experimental results indicate that OSLGC is highly competitive with the other algorithms; they also yield some interesting insights.

• For web text data, OSLGC achieves the best predictive performance on at least 7 out of the 8 datasets on all the evaluation metrics. This suggests that the proposed method is well suited to selecting features for web text data.
• For the Birds and Scene data, OSLGC achieves the best result on 3 out of 4 evaluation metrics. On the remaining metric, OSLGC ranks second, trailing by 0.96% and 1.51%, respectively. This indicates that OSLGC can also be applied to classification problems on other data types, such as images and audio.
• Considering the average prediction results over all datasets, OSLGC shows a clear performance advantage. Furthermore, the Win/Draw/Loss records clearly demonstrate that OSLGC outperforms the other algorithms.
• Although MSFS, OMNRS, and OSLGC are all designed to handle streaming features, the performance advantage of OSLGC confirms that label group correlation and feature interaction provide additional information for processing streaming features.
OSLGC is able to make use of label group correlation to guide feature selection, and adds online feature interaction analysis to provide hidden information for predictive labels. By combining the potential contributions of the feature space and the label space, OSLGC performs very competitively compared to other mature MFS methods.

Statistical Tests
To assess the statistical significance of the observed differences between the eight algorithms, we used the Friedman test [49]. The Friedman test ranks the prediction performance of the algorithms on each dataset: the best algorithm ranks first, the sub-optimal algorithm ranks second, and so on. For K algorithms and N datasets, let r_i^j denote the rank of the i-th algorithm on the j-th dataset, and let R_i = (1/N) Σ_{j=1}^{N} r_i^j denote the average rank of the i-th algorithm. Under the null hypothesis (i.e., all algorithms are equivalent), the Friedman statistic F_F = (N − 1)χ_F² / (N(K − 1) − χ_F²), where χ_F² = [12N / (K(K + 1))] (Σ_{i=1}^{K} R_i² − K(K + 1)²/4), follows the Fisher distribution with (K − 1) and (K − 1)(N − 1) degrees of freedom. Table 8 summarizes the value of F_F and the corresponding critical value. Based on the Friedman test, the null hypothesis is rejected at a significance level of 0.10. Consequently, a post hoc test is required to further analyze the relative performance of the algorithms. As the experiments focus on the performance difference between OSLGC and the other algorithms, we chose the Bonferroni-Dunn test [50] for this purpose. In this test, the performance difference between OSLGC and a comparison algorithm is assessed using the critical difference (CD), CD_α = q_α √(K(K + 1)/(6N)), where q_α = 2.450 at α = 0.10; thus, we can compute CD_0.1 = 2.6838. Figure 3 gives the CD diagrams, where the average rank of each algorithm is plotted on the coordinate axis: the best rank is on the rightmost side of the axis and, conversely, the worst rank is on the leftmost side. In each subfigure, if the average ranks of OSLGC and a comparison algorithm are connected by a CD line, the performance of the two algorithms is comparable and statistically indistinguishable; otherwise, if the average rank of a comparison algorithm falls outside one CD, its performance is considered significantly different from that of OSLGC.
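As a sanity check on these quantities (a sketch, not the authors' code; the tied-rank input in the test below is a placeholder), the Friedman statistic and the Bonferroni-Dunn critical difference can be computed as follows:

```python
import math

def friedman_statistic(avg_ranks, N):
    """Friedman chi-square and its F-distributed variant F_F for K algorithms on N datasets."""
    K = len(avg_ranks)
    chi2 = 12.0 * N / (K * (K + 1)) * (
        sum(R * R for R in avg_ranks) - K * (K + 1) ** 2 / 4.0
    )
    F_F = (N - 1) * chi2 / (N * (K - 1) - chi2)
    return chi2, F_F

def bonferroni_dunn_cd(q_alpha, K, N):
    """Critical difference CD_alpha = q_alpha * sqrt(K(K+1) / (6N))."""
    return q_alpha * math.sqrt(K * (K + 1) / (6.0 * N))

# Setting used in the paper: K = 8 algorithms, N = 10 datasets, q_0.10 = 2.450
print(round(bonferroni_dunn_cd(2.450, 8, 10), 4))  # 2.6838
```

This reproduces the CD value of 2.6838 quoted above for α = 0.10.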
From Figure 3, we can observe that: (1) OSLGC has obvious advantages over LSMFS, MLSMFS, MLFSL, and MSFS with respect to all the evaluation metrics; (2) OSLGC achieves comparable performance to CSASG on each evaluation metric; however, unlike CSASG, which assumes a known static feature space, OSLGC selects features under the assumption that features arrive dynamically, which entails selecting the best feature with only local feature information; (3) notably, although OSLGC cannot be significantly distinguished from every comparison algorithm, it exhibits significant advantages over most of the other feature selection algorithms. In summary, OSLGC exhibits a stronger statistical performance than LSMFS, MLSMFS, MLFSL, MSFS, MSDS, OMNRS, and CSASG.

Stability Analysis
In this subsection, we employ spiderweb plots to verify the stability of the algorithms. Because the results produced by the algorithms on different evaluation metrics vary considerably, we standardize the prediction results to the range [0.1, 0.5] for a reasonable comparison. The spiderweb diagram has the following characteristics: (1) the larger the area enclosed by a line of one color, the better the performance and stability of the corresponding algorithm; (2) the closer the normalized value is to 0.5, the better the performance; (3) the closer the shape of an algorithm's enclosing line is to a regular polygon, the better the stability of the algorithm. Figure 4 shows spider diagrams for all the evaluation metrics, where each corner denotes a dataset and differently colored lines represent different MFS algorithms.
By analyzing Figure 4, it is found that: (1) Among all the algorithms, the area surrounded by OSLGC is the largest, which indicates that OSLGC has the best performance; (2) The polygon enclosed by OSLGC is approximately a regular polygon with respect to the average precision and macro-F1. This indicates that the performance obtained by OSLGC is relatively stable on different datasets; (3) Furthermore, although the polygon enclosed by OSLGC is not a regular polygon with respect to the Hamming loss and one error metrics, the fluctuation range of OSLGC at each vertex is relatively small. In summary, compared with the other algorithms, the OSLGC algorithm has obvious advantages in terms of performance and stability.

Ablation Experiment
To evaluate the contribution of label group correlation, we conducted an empirical ablation study by removing the label group correlation component from Algorithm 1, yielding a variant of the OSLGC algorithm called OSLGC-RLC. Table 9 displays the results for OSLGC and OSLGC-RLC. Due to space limitations, we select three datasets for experimental verification: Recreation, Entertainment, and Social. From the results in Table 9, it is observed that OSLGC significantly outperforms OSLGC-RLC on all the evaluation metrics. In conclusion, these results suggest that considering label group correlation is an effective strategy in feature selection.

Conclusions
In this paper, we have presented a new online multi-label streaming feature selection method, called OSLGC, to select relevant or interactive features from streaming features. OSLGC first constructs a set of trees using graph theory, which groups strongly related labels into the same tree; it then applies a sliding-window streaming feature selection strategy that identifies relevant, interactive, and irrelevant features in an online manner. OSLGC can be divided into two parts: online feature relevance analysis and online feature interaction analysis. For online feature relevance analysis, we designed feature relevance terms that provide a basis for the selection, delay, and abandonment decisions. For online feature interaction analysis, we defined an enhanced feature relevance term that prefers to select a group of interactive features from the delayed decisions produced by the online relevance analysis. Our experiments showed that OSLGC achieves highly competitive performance against other advanced competitors.
In future work, we intend to combine label-specific features and common features to design streaming feature selection strategies. Furthermore, we are committed to building streaming feature selection strategies that are suitable for large-scale data.