Article

Multi-Label Learning with Distribution Matching Ensemble: An Adaptive and Just-In-Time Weighted Ensemble Learning Algorithm for Classifying a Nonstationary Online Multi-Label Data Stream

by Chao Shen 1,†, Bingyu Liu 1,†, Changbin Shao 1, Xibei Yang 1, Sen Xu 2, Changming Zhu 3 and Hualong Yu 1,*
1 School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China
2 School of Information Technology, Yancheng Institute of Technology, Yancheng 224051, China
3 Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing 100081, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Symmetry 2025, 17(2), 182; https://doi.org/10.3390/sym17020182
Submission received: 5 December 2024 / Revised: 21 January 2025 / Accepted: 22 January 2025 / Published: 24 January 2025
(This article belongs to the Section Computer)

Abstract:
Learning from a nonstationary data stream is challenging, as a data stream is generally considered to be endless and the learning model must be constantly amended to adapt to shifting data distributions. When the stream carries multi-label data, the challenge is further intensified. In this study, an adaptive online weighted multi-label ensemble learning algorithm called MLDME (multi-label learning with distribution matching ensemble) is proposed. It simultaneously calculates both the feature matching level and the label matching level between any reserved data block and the newly received data block, and then adaptively assigns decision weights to the ensemble classifiers based on these distribution similarities. Specifically, MLDME abandons the commonly used but not always correct underlying hypothesis that, in a data stream, each data block has the distribution most similar to that of the block emerging after it; thus, MLDME can provide a just-in-time decision for the newly received data block. In addition, to avoid an infinite extension of the ensemble, we use a fixed-size buffer to store the classifiers and design three different dynamic classifier updating rules. Experimental results on nine synthetic and three real-world multi-label nonstationary data streams indicate that the proposed MLDME algorithm is superior to several popular and state-of-the-art online learning paradigms and algorithms, including two specifically designed for classifying nonstationary multi-label data streams.

1. Introduction

In traditional machine learning, models are generally trained on a previously prepared dataset and then provide predictions for future unseen instances. However, in some real-world applications, such a paradigm is no longer applicable. In these specific applications, data arrive in the form of a continuous stream in which the data distribution may alter over time. In such a scenario, traditional static learning models become invalid, as they require both the training and testing data to satisfy the independent and identically distributed (i.i.d.) assumption. In other words, for a nonstationary data stream, learning models must update themselves constantly to adapt to the new data distribution. We call this learning paradigm online learning [1,2] and the phenomenon of the shifting data distribution concept drift [3,4,5]. Online learning has been frequently used in many real-world applications, including activity recognition [6], face recognition [7], recommendation systems [8], intrusion detection [9], fault diagnosis [10], and marketing analysis [11]. As an example of concept drift, an intrusion detection system requires persistent upgrading to cope with new invasion strategies developed by hackers. In this example, once the modes of intrusion vary, the feature distribution or feature–class associations are simultaneously altered, rendering the originally developed intrusion detection model ineffective (see Figure 1).
As we know, learning from a data stream is significantly more difficult than learning from static data, as two basic requirements must be satisfied. The first is one pass, i.e., when new data are received, we must immediately decide whether to use or abandon them, but cannot reserve and revisit them. The one pass requirement is based on the assumption that a real-world data stream is generally endless, so reserving all data would cause memory overflow. One pass means that once useful data have been removed, they cannot be revisited to update the current model. However, there is always a lag in estimating the usefulness of data, so the one pass requirement remains a critical challenge for an online learning model. The second requirement is that the model should adapt to concept drift emerging in the data stream, which requires detecting in time whether a concept drift exists and, if so, modifying the learning model so that it coincides with the new data distribution. The first requirement can generally be satisfied by one of the two following solutions: modifying a conventional static learning algorithm so that it can rapidly tune its parameters to adapt to newly arrived data without considering whether they are useful [12,13], or using ensemble learning to constantly train new models on newly received data and adopting weighted voting to alleviate the effect of models trained on useless data [14,15,16]. The second requirement can be satisfied either by simultaneously adding concept drift detectors and a forgetting mechanism to a single adaptive learning model [17], or by adopting concept drift detectors to decide which base learners in the ensemble are useful and should be reserved and how much weight should be assigned to each of them in voting, thereby enabling the ensemble learning model to track and adapt to concept drift [18,19,20]. Obviously, ensemble learning is more flexible and robust than a single learner in such a scenario.
Many algorithms have been proposed that can be deployed in a streaming environment. However, most of them focus only on single-label classification. In the context of multi-label problems, online learning may require considering more factors, since concept drift becomes more complex [21]. In multi-label learning, the model must simultaneously predict multiple class labels for an instance. For example, an image may contain several class labels, such as cloud, tree, mountain, grassland, and animal (see Figure 2). Meanwhile, the concept drift occurring in multi-label learning tasks may be associated with alterations in the label distribution or label associations; e.g., during the World Cup, most news simultaneously covers several topics (class labels) about sports, the economy, and tourism, but with the outbreak of a war, news attention shifts to a combination of other topics such as war, politics, humanitarianism, the economy, and munitions. In such a scenario, a shift in either the label correlations (which class labels tend to co-occur) or the label distribution (which class labels occur more frequently) can be regarded as the emergence of concept drift. Therefore, when online learning meets multi-label data, the challenge is significantly intensified.
Motivated by the issue mentioned above, we designed and developed an adaptive and just-in-time online multi-label learning algorithm named MLDME (multi-label learning with distribution matching ensemble). It partially inherits our previously proposed DME (distribution matching ensemble) algorithm [22]. DME first uses a Gaussian mixture model (GMM) [23] to accurately measure the data distribution of each received data chunk. Then, it adopts the Kullback–Leibler divergence [24] to detect the feature matching level (the feature distribution similarity between two data chunks, see Figure 3a) between each old data chunk and the newest received data chunk. Furthermore, it takes advantage of the feedback feature matching level to adaptively assign decision weights to the corresponding learners in the ensemble. DME can provide a just-in-time decision, as it fully exploits the feature distribution of the newly arrived data chunk and its similarity with that of each reserved old data chunk. In this study, we extended DME to multi-label data streams. In particular, in addition to the feature matching level, we also consider the degree of label distribution alteration, i.e., the label matching level (the label distribution similarity between two data chunks, see Figure 3b), using the label dependency drift detector (LD3) [21]. That is to say, in our algorithm, the weight is adaptively allocated to each base learner in the ensemble by simultaneously considering the feature matching level and label matching level between each reserved data chunk and the newly received data chunk. This amendment satisfies the requirement of just-in-time classification of a nonstationary multi-label data stream. Additionally, in comparison with DME, the proposed MLDME adds a weight allocation threshold to prevent base learners trained on data chunks with very low distribution similarity to the new data chunk from participating in decision making. Furthermore, to avoid an infinite extension of the ensemble, we use a fixed-size buffer to store the classifiers and design three different dynamic classifier updating rules. We conducted experiments on nine synthetic and three real-world multi-label nonstationary data streams, comparing MLDME with several popular and state-of-the-art online learning paradigms and algorithms, including two specifically designed for classifying nonstationary multi-label data streams. The results indicate the effectiveness and superiority of the proposed algorithm.
The three main contributions of this study are summarized as follows:
(1)
To meet the multi-label online learning requirement, this study designs a concept drift detector (integrating both DME [22] and LD3 [21]) that simultaneously considers drifts happening at the feature level and the label level;
(2)
To meet the just-in-time decision requirement, this study collects both the data distribution information and the pseudo-label information of a newly received unlabeled data chunk, and then adaptively assigns a decision weight to each base learner according to the similarity between each old data chunk and the newly received one;
(3)
Three different ensemble updating rules are designed for the proposed algorithm to accommodate the various drift types encountered in practical applications.
The rest of the paper is organized as follows: Section 2 summarizes related work. In Section 3, the proposed method is described in detail. Section 4 provides the experimental setup and results, together with the corresponding analysis. Finally, Section 5 concludes the paper with its findings and contributions.

2. Related Work

In this section, we review several benchmark and state-of-the-art online learning algorithms related to our proposed work in this study.
The Hoeffding Tree (HT) [25] has been used as an incremental classifier for massive data streams. Unlike traditional decision tree algorithms, which adopt a given split evaluation function to select the best attribute, the HT uses the Hoeffding bound to calculate the number of samples necessary to select the right split node with a user-specified probability. It implements incremental learning without needing to store instances after they have been used. The drawback of the HT is that it cannot adapt to concept drift, as old and new knowledge are both fully retained.
In the context of online weighted ensemble learning, dynamic weighted majority (DWM) [26] is a benchmark algorithm. It maintains a buffer of base classifiers to make decisions, and the weight of each base learner is decayed whenever it provides a wrong prediction for a newly received instance. If the weight of a base classifier falls below a given threshold, it is removed from the ensemble buffer. DWM adapts to concept drift by both dynamically removing bad classifiers and adding new classifiers trained on new data.
Another popular online weighted ensemble learning algorithm is the accuracy updated ensemble (AUE2) [27]. Unlike DWM, which is constructed for the instance-by-instance scenario, AUE2 assumes that new data are received in the form of data chunks. After a new data chunk has been labeled, AUE2 lets each reserved classifier provide a prediction on that chunk and acquires the corresponding error rates. It then updates the weight of each classifier according to the feedback of these error rates. In each round, the classifier with the lowest accuracy is replaced by one trained on the new data chunk, and the newly added classifier is given the largest weight, as it is generally regarded as a 'perfect' classifier under the underlying hypothesis that each data block always has the distribution most similar to those of the blocks close to it in time. In addition, AUE2 differs from DWM in that it makes each base classifier incrementally learn from the newly received data chunk, which helps it adapt to concept drift.
We note that most existing online weighted ensemble learning algorithms are based on the underlying hypothesis that each data block always has the distribution most similar to that of the block emerging after it. This is obviously wrong when concept drift happens between two adjacent data chunks. That means that when concept drift happens, most ensemble learning algorithms can only provide a delayed decision: even if a 'perfect' classifier exists in the ensemble buffer, it will not be assigned a higher weight to predict the newly received unlabeled data block. To solve this issue, in our recent work, an algorithm called distribution matching ensemble (DME) [22] was proposed to provide an adaptive and just-in-time prediction for the data stream. DME assumes that if two data blocks have approximate distributions in feature space, then they share similar experience for concept prediction. Therefore, DME uses a Gaussian mixture model (GMM) [23] to accurately measure the data distribution of each received data chunk and the Kullback–Leibler (KL) divergence [24] to detect the feature matching level between each old data chunk and the newest received data chunk. Furthermore, it takes advantage of the feedback feature matching level to adaptively assign a decision weight to each corresponding classifier in the ensemble. It not only abides by the one pass rule, as only the GMM parameters of each data chunk in the ensemble are reserved, but also satisfies the requirement of a just-in-time adaptive prediction, since the experience of the newly received data block is used.
Learning from a multi-label data stream is more complex than learning from a single-label data stream, as concept drift must be re-defined. Multi-label data frequently exist in real-world applications, and in such scenarios, label correlations must be considered [28,29,30,31]. That means that even if two data blocks have very similar feature distributions, concept drift can still happen if they have vastly different label correlation distributions [21]. Roseberry et al. [32] proposed a punitive k nearest neighbors algorithm with a self-adjusting memory (MLSAMP) to classify drifting multi-label data streams. It maintains a sliding window to reserve recent instances and uses a traditional majority-voting kNN classifier within the window to provide a prediction for each newly received instance. It also adopts a variable window size technique to adapt to various concept drifts, i.e., maintaining a large window when incremental or gradual drifts exist and a small window if a sudden drift has been detected. In addition, a punitive removal mechanism is designed to immediately remove instances in the window that have contributed to errors. The MLSAMP algorithm has low time complexity and adapts well to various potential concept drifts. In [33], the authors focused on the negative impact of using a fixed k value in MLSAMP and hence proposed an improved algorithm called the multi-label self-adjusting k nearest neighbors algorithm (MLSA). It adds a label-specific adaptive k module to MLSAMP to adaptively assign an optimal k value to each label in decision making.
It is clear that although several online multi-label learning algorithms for classifying drifting data streams have been developed, they have focused on the instance-by-instance scenario and ignored the chunk-by-chunk scenario. In addition, they customarily use a single model with an adaptive forgetting mechanism, neglecting the more flexible and accurate ensemble learning paradigm. Therefore, in this study, we present a more effective and efficient algorithm for handling chunk-by-chunk multi-label drifting streams.

3. Methods

In this section, we first briefly provide the preliminary knowledge related to this study, including the types of concept drift and the basic paradigm of online weighted ensemble learning. Then, we describe how to measure the similarity matching level between two multi-label data blocks in feature space and label space. Finally, we describe the workflow of the MLDME algorithm and discuss its time complexity.

3.1. Preliminary Knowledge

3.1.1. Types of Concept Drift

According to Webb et al. [3], concept drift is very complex because it is associated not only with variations in the feature space, label distribution, and their dependencies but also with the variation frequency over time. Suppose $P_t(X, Y)$ denotes the joint distribution of the received data block $B_t$, where $X = \{x_i \mid x_i \in \mathbb{R}^d, i = 1, 2, \dots, N\}$, in which $d$ represents the dimension of the feature space and $N$ represents the number of instances in $B_t$, and $Y = \{y_i \mid y_i \in \{1, 2, \dots, C\}, i = 1, 2, \dots, N\}$, in which $C$ represents the number of classes. If, after a time increment $\Delta$, we observe $P_t(X, Y) \neq P_{t+\Delta}(X, Y)$, then concept drift has happened. The drift may be incurred by $P_t(X) \neq P_{t+\Delta}(X)$, $P_t(Y) \neq P_{t+\Delta}(Y)$, or $P_t(Y \mid X) \neq P_{t+\Delta}(Y \mid X)$. Based on the time increment and the difference level between the two distributions, concept drift can be divided into four types (see Figure 4) as follows:
Sudden drift generally corresponds to a large difference between two consecutively received data blocks, i.e., $P_t(X, Y) \neq P_{t+1}(X, Y)$, after which the distribution tends to remain stable for the subsequently received data blocks;
Incremental drift generally denotes that the data frequently oscillate between the old distribution and the new distribution over a long period, and then stabilize in the form of the new data distribution;
Gradual drift introduces a slight variation of the old data distribution that gradually changes into a new data distribution, which then becomes stable;
Reoccurring drift suddenly switches from the old distribution to the new one and then, after several data blocks, returns to the old distribution again, meaning that the drift happens repeatedly.

3.1.2. The Basic Framework of Online Weighted Ensemble Learning

As indicated in Section 1, for nonstationary data streams, online weighted ensemble learning is a more flexible and robust paradigm than an incremental single learning model with a forgetting mechanism. Therefore, in this study, we consider the use of online weighted ensemble learning for dealing with drifting multi-label data streams.
Suppose $\{B_1, B_2, \dots, B_t, \dots\}$ is an endless multi-label data stream, where $B_t$ denotes the data block received at time $t$. Each data block generally contains an equal number of instances, i.e., $B_t = \{(x_i, Y_i) \mid i = 1, 2, \dots, N\}$, where $x_i$ denotes the feature vector of the $i$th instance in $B_t$, $Y_i = \{y_i^1, \dots, y_i^j, \dots, y_i^{|L|}\}$ with $y_i^j \in \{0, 1\}$, $|L|$ denotes the number of labels, and $N$ represents the number of instances in a data block. For an online data stream, it is generally supposed that when a data block $B_t$ is just received, its label set is unknown; when and only when the next data block $B_{t+1}$ is received, the real label set of $B_t$ can be obtained.
In the online weighted ensemble learning framework, a fixed-size ensemble buffer is generally maintained to store some base classifiers $\mathcal{E} = \{C_1, C_2, \dots, C_M\}$, where $C_i$ represents the classifier trained on a data block and $M$ denotes the size of the ensemble buffer. When a new data block $B_t$ is received, it can be predicted by $\hat{Y}_t = w_1 C_1(X_t) + \dots + w_M C_M(X_t)$, where $X_t$ and $\hat{Y}_t$, respectively, denote the feature vectors and predicted labels of $B_t$, and $\{w_1, \dots, w_M\}$ represents the decision weights corresponding to the classifiers in $\mathcal{E}$. In general, the weights are adaptively assigned based on the performance of each classifier on the data chunk $B_{t-1}$, as in DWM [26] and AUE2 [27], or on its distribution similarity with the data chunk $B_t$, as in DME [22]. Then, when the new data chunk $B_{t+1}$ arrives, a new classifier trained on $B_t$ is added to $\mathcal{E}$, and an old classifier in $\mathcal{E}$ is removed based on a preset rule. Next, a new iteration begins.
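To make the framework concrete, the following minimal Python sketch implements the chunk-by-chunk loop described above. The `predict_scores` interface, the `assign_weights` callback, and the FIFO replacement are hypothetical simplifications, not the exact mechanics of DWM, AUE2, or DME.

```python
import numpy as np

def weighted_ensemble_predict(X_t, classifiers, weights, threshold=0.5):
    """Weighted vote over a chunk: Y_hat = w_1*C_1(X_t) + ... + w_M*C_M(X_t).

    Each classifier is assumed to expose predict_scores(X) returning an
    (n, |L|) array of per-label scores in [0, 1] (a hypothetical interface).
    """
    scores = sum(w * clf.predict_scores(X_t)
                 for clf, w in zip(classifiers, weights))
    return (scores >= threshold).astype(int)

def process_stream(stream, train_fn, assign_weights, M=10):
    """Generic chunk-based loop: predict B_t first, then learn it once labeled."""
    ensemble = []
    for X_t, Y_t in stream:              # Y_t only becomes available afterwards
        if ensemble:
            w = assign_weights(ensemble, X_t)        # adaptive decision weights
            Y_hat = weighted_ensemble_predict(X_t, ensemble, w)  # just-in-time
        if len(ensemble) == M:           # buffer full: apply an update rule
            ensemble.pop(0)              # simplest choice: drop the oldest (FIFO)
        ensemble.append(train_fn(X_t, Y_t))
    return ensemble
```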

3.2. Feature Matching Level

As indicated above, data drift may happen in a variety of forms, i.e., $P_t(X, Y) \neq P_{t+\Delta}(X, Y)$ may be incurred by $P_t(X) \neq P_{t+\Delta}(X)$, $P_t(Y) \neq P_{t+\Delta}(Y)$, or $P_t(Y \mid X) \neq P_{t+\Delta}(Y \mid X)$. To provide just-in-time adaptive weighting, it is necessary to first detect the feature matching level between each old data chunk and the newly received data chunk. Here, we inherit the idea of DME [22]: we use a GMM [23] to extract the distribution information of each data chunk and the KL divergence [24] to detect the feature matching level between two data chunks.
The GMM [23] has been widely used to estimate the probability density function (pdf) of any kind of data distribution. To approximate an arbitrary distribution, the GMM assumes that the unknown pdf is composed of $s$ known Gaussian pdfs and can be calculated as the weighted sum of these pdfs, i.e.,
$$f(x) = \sum_{i=1}^{s} \lambda_i f_i(x) = \sum_{i=1}^{s} \lambda_i \, \mathcal{N}(x; \mu_i, \Sigma_i) \tag{1}$$
where $\lambda_i$ represents the weight of each Gaussian pdf, with $\sum_{i=1}^{s} \lambda_i = 1$, and $\mathcal{N}(x; \mu_i, \Sigma_i)$ corresponds to a Gaussian pdf in $x$ with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The EM algorithm is generally used to iteratively estimate the GMM parameters. At the E step, for each cluster $i$, we calculate the probability (responsibility) of a sample $x$ belonging to that cluster by Equation (2), and then assign the example to the cluster providing the highest probability.
$$\xi_i = \frac{\lambda_i \, \mathcal{N}(x; \mu_i, \Sigma_i)}{\sum_{k=1}^{s} \lambda_k \, \mathcal{N}(x; \mu_k, \Sigma_k)} \tag{2}$$
At the M step, the following equations are used to update the mixture parameters:
$$\mu_i = \frac{\sum_{j=1}^{N} \xi_{ij} x_j}{\sum_{j=1}^{N} \xi_{ij}} \tag{3}$$
$$\Sigma_i = \frac{\sum_{j=1}^{N} \xi_{ij} (x_j - \mu_i)(x_j - \mu_i)^T}{\sum_{j=1}^{N} \xi_{ij}} \tag{4}$$
$$\lambda_i = \frac{\sum_{j=1}^{N} \xi_{ij}}{N} \tag{5}$$
Specifically, when a suitable $s$ is designated, the GMM can, in theory, approximate any distribution accurately.
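As a concrete illustration, the distribution information of a data chunk can be extracted with scikit-learn's EM implementation, as sketched below; the helper name `extract_gmm_info` is ours, and $s = 5$ mirrors the experimental setting reported in Section 4.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_gmm_info(X_chunk, s=5, seed=0):
    """Fit a GMM to one data chunk via EM (Equations (1)-(5)) and return its
    parameters: mixture weights lambda_i, mean vectors mu_i, covariances Sigma_i."""
    gmm = GaussianMixture(n_components=s, covariance_type="full",
                          random_state=seed).fit(X_chunk)
    return gmm.weights_, gmm.means_, gmm.covariances_

# usage sketch: X = np.random.randn(400, 10); lam, mu, Sigma = extract_gmm_info(X)
```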
As for the KL divergence, it is commonly used to estimate the similarity and/or dissimilarity between two pdfs. Considering two different pdfs $f$ and $g$ defined on $\mathbb{R}^z$, where $z$ denotes the dimension of the observed vectors, their KL divergence is defined as follows:
$$D_{KL}(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx \tag{6}$$
If both $f$ and $g$ are Gaussian pdfs, then the KL divergence has the following closed-form expression:
$$D_{KL}(f \| g) = \frac{1}{2} \left[ \log \frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}\left(\Sigma_g^{-1} \Sigma_f\right) + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) - d \right] \tag{7}$$
However, for GMM pdfs, no closed-form expression exists. In this case, the KL divergence can be approximated by other functions that can be computed efficiently. In this study, we adopted the variational approximation strategy proposed by Hershey and Olsen [34] to address this problem.
Let $L_f(g) = E_{X \sim f}[\log g(X)]$. The KL divergence can then be decomposed as
$$D_{KL}(f \| g) = L_f(f) - L_f(g) \tag{8}$$
Lower bounds for $L_f(g)$ and $L_f(f)$ can be obtained using Jensen's inequality:
$$L_f(g) \geq \sum_{a} \lambda_a^f \log \sum_{b} \lambda_b^g \, e^{-D_{KL}(f_a \| g_b)} - \sum_{a} \lambda_a^f H(f_a) \tag{9}$$
$$L_f(f) \geq \sum_{a} \lambda_a^f \log \sum_{a'} \lambda_{a'}^f \, e^{-D_{KL}(f_a \| f_{a'})} - \sum_{a} \lambda_a^f H(f_a) \tag{10}$$
where $\lambda_a^f$ and $f_a$ denote the weight and pdf of the $a$th Gaussian component of $f$ (and analogously for $g$), and $H(f_a)$ denotes the entropy of $f_a$.
These lower bounds can be used as approximations of the corresponding quantities, yielding the following approximation of the KL divergence:
$$D_{var}(f \| g) = \sum_{a} \lambda_a^f \log \frac{\sum_{a'} \lambda_{a'}^f \, e^{-D_{KL}(f_a \| f_{a'})}}{\sum_{b} \lambda_b^g \, e^{-D_{KL}(f_a \| g_b)}} \tag{11}$$
However, the KL divergence is asymmetric, i.e., $D_{var}(f \| g) \neq D_{var}(g \| f)$. To address this issue, it can be transformed into the Jensen–Shannon (JS) divergence [35], which is calculated as follows:
$$JSD(f \| g) = \frac{1}{2} D_{var}(f \| Q) + \frac{1}{2} D_{var}(g \| Q) \tag{12}$$
where $Q = \frac{f + g}{2}$ denotes the medium distribution between $f$ and $g$. It is clear that $JSD(f \| g) \in [0, 1]$, and a larger $JSD$ corresponds to a smaller similarity between the two data distributions. In other words, the JS divergence is a symmetric metric that provides a stable measure of the similarity between two different distributions.
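Assuming each GMM is represented as a (weights, means, covariances) tuple like the one returned by the `extract_gmm_info` helper above, the feature matching machinery can be assembled as in the following sketch. Note that the medium distribution $Q$ of two GMMs is itself a GMM formed by the union of their components with halved weights, and that the variational approximation is not guaranteed to stay within $[0, 1]$, so the final clipping is our own safeguard.

```python
import numpy as np

def gaussian_kl(mu_f, S_f, mu_g, S_g):
    """Closed-form KL divergence between two Gaussians (Equation (7))."""
    d = mu_f.shape[0]
    S_g_inv = np.linalg.inv(S_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(S_g) / np.linalg.det(S_f))
                  + np.trace(S_g_inv @ S_f)
                  + diff @ S_g_inv @ diff - d)

def d_var(f, g):
    """Variational KL approximation between two GMMs (Equation (11))."""
    lam_f, mu_f, S_f = f
    lam_g, mu_g, S_g = g
    total = 0.0
    for a in range(len(lam_f)):
        num = sum(lam_f[a2] * np.exp(-gaussian_kl(mu_f[a], S_f[a], mu_f[a2], S_f[a2]))
                  for a2 in range(len(lam_f)))
        den = sum(lam_g[b] * np.exp(-gaussian_kl(mu_f[a], S_f[a], mu_g[b], S_g[b]))
                  for b in range(len(lam_g)))
        total += lam_f[a] * np.log(num / den)
    return total

def js_divergence(f, g):
    """JS divergence via the medium distribution Q = (f + g)/2 (Equation (12))."""
    Q = (np.concatenate([0.5 * f[0], 0.5 * g[0]]),   # halved mixture weights
         np.concatenate([f[1], g[1]]),               # union of component means
         np.concatenate([f[2], g[2]]))               # union of covariances
    jsd = 0.5 * d_var(f, Q) + 0.5 * d_var(g, Q)
    return float(np.clip(jsd, 0.0, 1.0))             # safeguard: keep in [0, 1]
```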

3.3. Label Matching Level

As we know, the feature matching level only detects the similarity between $P_t(X)$ and $P_{t+\Delta}(X)$, which cannot perfectly determine whether concept drift has occurred. In multi-label data streams, a variation in the label dependencies can also be seen as concept drift. In this study, we use the label dependency drift detector (LD3) [21] to calculate the label matching level between two data chunks in a multi-label data stream. The result of LD3 can partially reflect the similarity between $P_t(Y)$ and $P_{t+\Delta}(Y)$.
In LD3, each data block generates a co-occurrence matrix obtained by counting the number of times each class label occurs as "1" alongside each other label. The generated matrix is then ranked within each row (called local ranking) by creating a ranking of the labels based on their co-occurrence frequencies. After acquiring the local ranking, the ranks are further aggregated into a global ranking as follows:
$$r_i = \left( \sum_{j=1, j \neq i}^{|L|} \frac{1}{r_{ij}} \right)^{-1} \tag{13}$$
where $r_i$ denotes the ranking score of the label $l_i$, and $r_{ij}$ represents the ranking of the label $l_i$ in the $j$th row of the local ranking matrix. Next, we obtain the global ranking position sequence $R$ by sorting all $r_i$ in ascending order. Suppose $R_p$ and $R_q$, respectively, denote the global ranking sequences of two different data blocks $B_p$ and $B_q$; then, the label matching correlation between these two data blocks can be calculated as follows:
$$Corr = 1 - \sum_{i=1}^{|L|} \frac{2^{-R_{p,i}} \cdot |R_{p,i} - R_{q,i}|}{\max\{R_{p,i} - 1, \, |L| - R_{p,i}\}} \tag{14}$$
where $2^{-R_{p,i}}$ denotes the weight of the label $l_i$, which ensures that higher-ranking labels have more influence. The numerator $|R_{p,i} - R_{q,i}|$ is the ranking distance of $l_i$ between the two rankings, and the denominator scales the distance. It is clear that the higher the $Corr$, the more similar the label dependency orders of the two multi-label data blocks.
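The sketch below follows our reconstruction of Equations (13) and (14); the authoritative formulation is given in the LD3 paper [21], so the ordinal tie-breaking choice and the guard on the denominator should be treated as assumptions.

```python
import numpy as np
from scipy.stats import rankdata

def global_ranking(Y):
    """Global ranking position sequence R for a 0/1 label matrix Y (n x |L|)."""
    L = Y.shape[1]
    co = Y.T @ Y                                   # label co-occurrence counts
    np.fill_diagonal(co, 0)
    # local ranking: within each row, rank 1 = most frequent co-occurrence
    local = np.array([rankdata(-row, method="ordinal") for row in co])
    r = np.array([1.0 / np.sum(1.0 / np.delete(local[:, i], i))   # Eq. (13)
                  for i in range(L)])
    return rankdata(r, method="ordinal")           # positions of r_i, ascending

def label_corr(R_p, R_q):
    """Label matching correlation between two global rankings (Eq. (14))."""
    L = len(R_p)
    weights = 2.0 ** (-R_p)                        # higher rank => more weight
    dist = np.abs(R_p - R_q)
    scale = np.maximum(np.maximum(R_p - 1, L - R_p), 1)   # guard against /0
    return 1.0 - np.sum(weights * dist / scale)
```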

3.4. The Proposed MLDME Algorithm

By integrating the feature matching level and label matching level, it is easy to estimate the similarity between two multi-label data blocks. However, to achieve a just-in-time prediction for a newly received unlabeled data chunk by taking advantage of prior information, only its feature information can be acquired directly, while its real labels are unknown. In this study, we use the pseudo-labels predicted by each classifier $C_i$ in $\mathcal{E}$ to calculate its label dependency order. Although the predicted results may not be accurate, they still reflect the difference between two data blocks sufficiently well.
Suppose $f_i$ denotes the pdf of the data block corresponding to the classifier $C_i$ in the current ensemble buffer $\mathcal{E}$, and $g$ denotes the pdf of the newly received data block; then, the feature matching level between these two data blocks can be calculated as follows:
$$simF_i = 1 - JSD(f_i \| g) \tag{15}$$
In addition, suppose $R_i$ denotes the label global ranking sequence of the data block corresponding to the classifier $C_i$, and $\hat{R}_i$ denotes that computed from the labels predicted by $C_i$ for the newly received data block; then, the label matching level between these two data blocks can be calculated as follows:
$$simL_i = 1 - \sum_{j=1}^{|L|} \frac{2^{-R_{i,j}} \cdot |R_{i,j} - \hat{R}_{i,j}|}{\max\{R_{i,j} - 1, \, |L| - R_{i,j}\}} \tag{16}$$
Next, the similarity level between the two data blocks can be integrated as follows:
$$simFL_i = \gamma \cdot simF_i + (1 - \gamma) \cdot simL_i \tag{17}$$
where $\gamma$ denotes a regulatory factor that tunes the relative significance of the feature level and the label level. In this study, $\gamma$ was empirically set to 0.5, i.e., the feature level has the same significance as the label level.
Moreover, the voting weight of each classifier $C_i$ can be calculated from its similarity level. To prevent classifiers constructed on data blocks that differ significantly from the newly received data block from participating, a threshold $\tau$ is predesignated. If $simFL_i < \tau$, then the classifier $C_i$ is excluded from the prediction. That means that for each newly received data chunk, only $M'$ classifiers are assigned weights for the ensemble prediction, where $M' \leq M$. In this study, $\tau$ was empirically set to 0.2. The weight of each classifier participating in the ensemble decision is calculated as follows:
$$w_i = \frac{simFL_i}{\sum_{j=1}^{M'} simFL_j} \tag{18}$$
It is clear that the more similar a reserved data chunk's feature and label distributions are to those of the newly received data chunk, the more its classifier contributes to the prediction. In addition, to avoid a decision deficiency when the similarity level of every reserved data chunk with the newly received one is lower than $\tau$, e.g., when a sudden drift happens in the new data chunk, we designate the classifier with the highest similarity to provide the prediction in such a scenario.
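A compact sketch of this weighting scheme follows; the defaults $\gamma = 0.5$ and $\tau = 0.2$ are the settings stated above, and the vectorized fallback encodes the highest-similarity rule just described.

```python
import numpy as np

def decision_weights(simF, simL, gamma=0.5, tau=0.2):
    """Combine feature- and label-matching levels (Equation (17)), gate them
    by the threshold tau, and normalize the survivors (Equation (18)).
    simF and simL are arrays with one entry per reserved classifier."""
    simFL = gamma * np.asarray(simF) + (1.0 - gamma) * np.asarray(simL)
    w = np.where(simFL >= tau, simFL, 0.0)       # exclude dissimilar chunks
    if w.sum() == 0.0:                           # every chunk below tau, e.g.,
        w[np.argmax(simFL)] = 1.0                # a sudden drift: fall back to
                                                 # the single most similar one
    return w / w.sum()
```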
The procedure of the proposed MLDME is described as follows (Algorithm 1):
Algorithm 1 MLDME
Input: a multi-label chunk-based data stream $\{B_1, B_2, \dots\}$ in which each data chunk has $n$ instances and $|L|$ labels; the number of GMM components $s$; the regulatory factor $\gamma$; the decision threshold $\tau$; the size of the ensemble buffer $M$
Output: $\mathcal{E} = \{C_1, C_2, \dots, C_M\}$ with the GMM distribution information $P_i$ and the label global ranking sequence $R_i$ of each corresponding data block
Procedure:
1. $\mathcal{E} = \emptyset$;
2. While a new data chunk $B_i$ is received
3.   If $\mathcal{E} = \emptyset$
4.     Wait for the real labels of the data chunk $B_i$;
5.     Extract the GMM distribution information $P_i$;
6.     Calculate the label global ranking sequence $R_i$;
7.     Train a classifier $C_i$ on $B_i$;
8.     Put $C_i$, $P_i$, and $R_i$ into $\mathcal{E}$;
9.   Else
10.    For $j = 1$ to $|\mathcal{E}|$
11.      Extract the GMM distribution information $P_i$;
12.      Calculate $JSD_{i,j}$ by Equation (12);
13.      Calculate $simF_{i,j}$ by Equation (15);
14.      Predict pseudo-labels for $B_i$ by $C_j$;
15.      Calculate the label global ranking sequence $\hat{R}_i$ from the pseudo-labels;
16.      Calculate $simL_{i,j}$ by Equation (16);
17.      Calculate $simFL_{i,j}$ by Equation (17);
18.      If $simFL_{i,j} < \tau$
19.        $simFL_{i,j} = 0$;
20.      End If
21.    End For
22.    Calculate the decision weight of each classifier in $\mathcal{E}$ by Equation (18);
23.    Provide a prediction for $B_i$ by weighted ensemble voting;
24.    If $|\mathcal{E}| = M$
25.      Remove a classifier from $\mathcal{E}$ by a pre-set ensemble update rule;
26.    End If
27.    Wait for the real labels of the data chunk $B_i$;
28.    Train a classifier $C_i$ on $B_i$;
29.    Put $C_i$, $P_i$, and $R_i$ into $\mathcal{E}$;
30.  End If
31. End While
In the algorithm description, we note that when the first data chunk is received, there is no prior knowledge available to provide a prediction for it; hence, the first data chunk cannot be predicted but must wait for its real label set. In addition, when and only when the ensemble buffer is full, the classifier update rule is used to replace the most useless classifier with the newly trained one. Here, we designed three different ensemble update rules. First follows the FIFO principle and replaces the earliest constructed classifier in $\mathcal{E}$ with the new one. AVE maintains a score for each classifier, namely the ratio between the number of ensemble decisions in which the classifier has participated (i.e., its $simFL$ reached the decision threshold $\tau$) and the number of data chunks it has undergone; the score is also compulsorily decayed over time. This rule reflects the significance of each classifier in the ensemble and always removes the most useless one. Rec modifies the AVE rule by searching for the lowest-scoring classifier only among the older half of the ensemble, guaranteeing that newly added classifiers are retained longer. It can be seen as a trade-off between the first two rules, and we used Rec as the default updating rule in this study; a sketch of all three rules is given below. The performance difference among these three rules is discussed in Section 4. To further clarify the algorithm procedure, a flow chart of the MLDME algorithm is presented in Figure 5.
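The following sketch contrasts the three rules. The bookkeeping (the `age` and `participations` attributes and the decay factor of 0.9) is our assumption, as the paper does not report these details.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Entry:
    classifier: object
    age: int             # data chunks survived in the buffer
    participations: int  # decisions in which simFL reached tau

def select_removal(ensemble, rule="Rec", decay=0.9):
    """Pick the index of the entry to remove when the buffer is full."""
    ages = np.array([e.age for e in ensemble])
    if rule == "First":                        # FIFO: drop the oldest classifier
        return int(np.argmax(ages))
    scores = np.array([(e.participations / max(e.age, 1)) * decay ** e.age
                       for e in ensemble])     # AVE utilization score, decayed
    if rule == "AVE":
        return int(np.argmin(scores))          # least useful overall
    # Rec: search only the older half, so newly added classifiers survive longer
    older_half = np.argsort(-ages)[: max(1, len(ensemble) // 2)]
    return int(older_half[np.argmin(scores[older_half])])
```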
Since MLDME reserves only $M$ sets of GMM distribution information, label global ranking sequences, and classifiers, it satisfies the one pass requirement of online learning. However, owing to the complex calculation of the feature matching level and label matching level, MLDME has a relatively high time complexity. This is nevertheless acceptable in real-world applications, as the time interval between two consecutive data chunks is always much longer than that between two consecutive instances. The specific running times of the compared algorithms are given in the next section.

4. Experiments

4.1. The Datasets Used in the Experiments

In our experiments, we used three real-world multi-label datasets from a multi-label classification dataset repository, available at http://www.uco.es/kdis/mllresources/ (accessed on 15 October 2024), and nine synthetic multi-label data streams that cover all drift types.
Among the three real-world datasets, 20 ng collects documents covering 20 different newsgroups, Ohsumed collects medical data associated with 23 different cardiovascular diseases, and Yelp collects user reviews and ratings reflecting good or bad evaluations of various services, commodities, transactions, and prices. For these three datasets, each data block except the last one contains 400 instances.
The synthetic datasets were generated by the approach used in [21]. Each synthetic dataset contains 20,000 instances that are evenly divided into 50 data blocks, meaning that each data block holds 400 instances. Reoccurk denotes reoccurring drift that happens after every k continuous data blocks. Gradual denotes a data stream that slowly drifts from one distribution to another. Both Increment1 and Increment2 simulate incremental drift, with the distinction that Increment1 drifts faster than Increment2. Similarly, both Suddens and Suddenf simulate sudden drift; they differ in that sudden drift happens only once in Suddens, while Suddenf activates sudden drift every 10 data blocks.
The detailed information about these datasets is described in Table 1.

4.2. Experimental Settings

In the experiments, we compared the proposed MLDME algorithm with several popular and state-of-the-art algorithms, including HT [25], DWM [26], AUE2 [27], DME [22], MLSAMP [32], and MLSA [33]. The experiments were performed using Scikit-Multiflow [36]. DWM, AUE2, DME, and MLDME share the common parameter $M = 10$, and DME and MLDME share the common number of GMM components $s = 5$. For MLSAMP and MLSA, the maximum window size was set to 1200. As for the classification algorithm, we designated the extreme learning machine (ELM) [37,38] as the base learning algorithm, where the number of hidden-layer nodes was empirically set to 100 and the penalty factor was set to 10.
We adopted four popular multi-label learning performance evaluation metrics to compare the algorithms: Subset Accuracy, Hamming Score, Macro-averaged F1, and Micro-averaged F1. Suppose there are $N$ testing instances; $Y_i$ ($i \in \{1, 2, \dots, N\}$) represents the real labels of the $i$th instance, and $\hat{Y}_i$ denotes the predicted labels of that instance. Then, Subset Accuracy can be calculated as follows:
$$\text{Subset Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ Y_i = \hat{Y}_i \right] \tag{19}$$
The Hamming Score can be calculated as follows:
$$\text{Hamming Score} = \frac{1}{N |L|} \sum_{i=1}^{N} \sum_{j=1}^{|L|} \mathbb{1}\left[ y_i^j = \hat{y}_i^j \right] \tag{20}$$
where $\mathbb{1}[\cdot]$ is the indicator function, and $y_i^j$ and $\hat{y}_i^j$ denote the real and predicted $j$th label of the $i$th instance, respectively. It is clear that Subset Accuracy corresponds to the ratio between the number of testing instances accurately predicted across all labels and the total number of testing instances, while the Hamming Score denotes the ratio between the number of accurately predicted example–label pairs and the total number of example–label pairs. The two other metrics are calculated from several base counts: the number of true positive instances (TP), true negative instances (TN), false positive instances (FP), and false negative instances (FN). They can be calculated as follows:
$$TP_j = \sum_{i=1}^{N} \mathbb{1}\left[ y_i^j = \hat{y}_i^j = 1 \right] \tag{21}$$
$$TN_j = \sum_{i=1}^{N} \mathbb{1}\left[ y_i^j = 0, \, \hat{y}_i^j = 0 \right] \tag{22}$$
$$FP_j = \sum_{i=1}^{N} \mathbb{1}\left[ y_i^j = 0, \, \hat{y}_i^j = 1 \right] \tag{23}$$
$$FN_j = \sum_{i=1}^{N} \mathbb{1}\left[ y_i^j = 1, \, \hat{y}_i^j = 0 \right] \tag{24}$$
Specifically, $TP_j$ and $TN_j$, respectively, record the number of instances whose real and predicted labels are both positive and both negative on the $j$th label, while $FP_j$ and $FN_j$ record the number of instances that belong to the negative (positive) class but have been predicted as positive (negative) for that label. Furthermore, these counts can be used to calculate several evaluation metrics as follows:
$$\text{Macro-Precision} = \frac{1}{|L|} \sum_{j=1}^{|L|} \frac{TP_j}{TP_j + FP_j} \tag{25}$$
$$\text{Macro-Recall} = \frac{1}{|L|} \sum_{j=1}^{|L|} \frac{TP_j}{TP_j + FN_j} \tag{26}$$
$$\text{Micro-Precision} = \frac{\sum_{j=1}^{|L|} TP_j}{\sum_{j=1}^{|L|} TP_j + \sum_{j=1}^{|L|} FP_j} \tag{27}$$
$$\text{Micro-Recall} = \frac{\sum_{j=1}^{|L|} TP_j}{\sum_{j=1}^{|L|} TP_j + \sum_{j=1}^{|L|} FN_j} \tag{28}$$
Precision measures the proportion of instances predicted as positive that are truly positive, while Recall measures the proportion of truly positive instances that are correctly predicted as positive. The difference between Macro and Micro averaging lies in that the former first calculates the performance on each label and then averages over labels, while the latter calculates the performance directly across all labels. Further, Macro-averaged F1 and Micro-averaged F1 can be calculated as follows:
$$\text{Macro-averaged } F1 = \frac{2 \times \text{Macro-Precision} \times \text{Macro-Recall}}{\text{Macro-Precision} + \text{Macro-Recall}} \tag{29}$$
$$\text{Micro-averaged } F1 = \frac{2 \times \text{Micro-Precision} \times \text{Micro-Recall}}{\text{Micro-Precision} + \text{Micro-Recall}} \tag{30}$$
where Macro-averaged F1 estimates the tradeoff between Macro-Precision and Macro-Recall, while Micro-averaged F1 evaluates the tradeoff between Micro-Precision and Micro-Recall.
Finally, the average predicted performance throughout all data blocks is used to compare the quality of various algorithms.
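For reference, all four metrics can be computed from 0/1 label matrices as in the following sketch; the small epsilon guard against labels with no positive predictions is our addition (equivalent results can be obtained with scikit-learn's metric functions).

```python
import numpy as np

def multilabel_metrics(Y, Y_hat):
    """Subset Accuracy, Hamming Score, and Macro/Micro-averaged F1
    (Equations (19)-(30)) for 0/1 label matrices of shape (N, |L|)."""
    subset_acc = np.mean(np.all(Y == Y_hat, axis=1))     # exact match per row
    hamming = np.mean(Y == Y_hat)                        # per example-label pair
    tp = np.sum((Y == 1) & (Y_hat == 1), axis=0).astype(float)
    fp = np.sum((Y == 0) & (Y_hat == 1), axis=0).astype(float)
    fn = np.sum((Y == 1) & (Y_hat == 0), axis=0).astype(float)
    eps = 1e-12                                          # avoid division by zero
    macro_p = np.mean(tp / (tp + fp + eps))              # average over labels
    macro_r = np.mean(tp / (tp + fn + eps))
    micro_p = tp.sum() / (tp.sum() + fp.sum() + eps)     # pooled over labels
    micro_r = tp.sum() / (tp.sum() + fn.sum() + eps)
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r + eps)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r + eps)
    return subset_acc, hamming, macro_f1, micro_f1
```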

4.3. Experimental Results

The experimental results are presented in Table 2, Table 3, Table 4 and Table 5, where the best result for each dataset is highlighted in bold. From these results, we can draw several conclusions:
(1)
For each performance evaluation metric except Macro-averaged F1, the proposed MLDME performed better than each of the other compared algorithms, indicating its effectiveness and superiority. Specifically, it performed best on seven, six, five, and seven datasets in terms of the Subset Accuracy, Hamming Score, Macro-averaged F1, and Micro-averaged F1, respectively. We believe that this mainly profits from the just-in-time mechanism that takes full advantage of the information acquired from the newly received unlabeled data chunk, enabling MLDME to adapt to concept drift in a more timely manner.
(2)
We also note that, in addition to MLDME, both MLSAMP and MLSA outperform several other competitors, and their superiority is especially significant on data streams with gradual and sudden drifts. In our opinion, their superiority comes from their instance-by-instance processing manner, which enables them to adapt faster to suddenly occurring concept drifts.
(3)
Although DME performs worse than MLDME, it significantly outperforms several traditional online weighted ensemble learning algorithms, including HT, DWM, and AUE2, on most data streams. These results again illustrate the necessity of exploiting the distribution information of the newly received unlabeled data chunk to make a just-in-time decision. In addition, the fact that DME performs worse than MLDME shows the necessity of considering the label matching level, even though some pseudo-labels assigned to the newly received data chunk might be wrong.
(4)
Regarding the different types of concept drift, the proposed MLDME seems to perform better on reoccurring drift and incremental drift. This is first because of the adoption of both the feature matching level and label matching level, and additionally relates to its weight assignment mechanism and classifier updating rule. However, when reoccurring drift happens across a large data block interval, e.g., Reoccur15 and Reoccur20, MLDME tends to fail, as the classifiers closely associated with the recurring data blocks have already been removed from the ensemble buffer.
We also present performance variation curves of the compared algorithms for several representative datasets, including Reoccur5, Reoccur15, Suddens, and 20 ng, in Figure 6, Figure 7, Figure 8 and Figure 9.
The results in Figure 6 and Figure 7 show that when reoccurring drift happens at short intervals, the proposed MLDME algorithm can adaptively provide an excellent prediction for drifting data chunks, but when reoccurring drift happens at a low frequency, the performance of MLDME decreases significantly. This is consistent with our analysis above; a possible solution is to enlarge the ensemble buffer so that it retains prior knowledge over a longer period. The results in Figure 8 illustrate that, like other online learning algorithms, MLDME cannot immediately adapt to an unseen sudden drift, but it fits the new data distribution faster than HT and DWM through its adaptive weight assignment mechanism. In Figure 9, we observe that MLDME performs well on real-world nonstationary multi-label data streams and can rapidly track and adapt to data distribution variations. We believe this profits from making full use of the information in the newly received data chunk, which provides a just-in-time decision and an adaptive response to data variations.

4.4. Significance Analysis Using Statistics

Furthermore, the Nemenyi test is adopted as a post hoc test for the Friedman test [39,40] to investigate the performance significance of the proposed MLDME algorithm. Here, we focus on the relative performance between our proposed MLDME algorithm and the other compared algorithms. If the average rank of MLDME and that of a compared algorithm differ by at least one critical difference (CD) unit, then we consider their performance to be significantly different. The CD is calculated as follows:
$$CD = q_\alpha \sqrt{\frac{U(U+1)}{6H}}$$
where $q_\alpha$ denotes the critical value at significance level $\alpha$, $U$ indicates the number of compared algorithms in the experiments, and $H$ represents the number of datasets used in the experiments. In a CD diagram, if the average ranks of two algorithms differ by less than one CD unit, they are connected by a thick line to indicate that their difference is not significant at the given significance level. Figure 10 presents the CD diagrams for the four metrics.
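As a worked example of this formula under the settings of this study (assuming $U = 7$ compared algorithms and $H = 12$ data streams, with the Nemenyi critical value $q_{0.05} \approx 2.949$ for seven algorithms taken from the table in [39]):

```python
import math

U, H = 7, 12            # 7 compared algorithms, 12 data streams (assumed)
q_alpha = 2.949         # Nemenyi critical value for 7 algorithms, alpha = 0.05
CD = q_alpha * math.sqrt(U * (U + 1) / (6 * H))
print(f"CD = {CD:.3f}")  # CD = 2.601: rank gaps above this are significant
```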
In Figure 10, we observe that MLDME performs best on the Subset Accuracy, Hamming Score, and Micro-averaged F1 metrics and is the runner-up on the Macro-averaged F1 metric, indicating its superiority. On the Subset Accuracy metric, MLDME significantly outperforms the DWM, HT, and AUE2 algorithms, while on the three other metrics, MLDME significantly outperforms only the DWM algorithm. Although the superiority of MLDME over the majority of the algorithms is not statistically significant, it is still the best performer, as it acquires the lowest average rankings.

4.5. Ablation Experiments

Further ablation experiments were conducted to investigate whether both the feature matching level and the label matching level contribute to the success of MLDME. Specifically, we compared the performance of MLDME when adopting $simFL$ with variants using only $simF$ or only $simL$ as the similarity metric. The results are presented in Table 6.
From the results in Table 6, we observe that integrating $simF$ and $simL$, i.e., $simFL$, produced the best results on seven, eight, seven, and seven datasets in terms of the Subset Accuracy, Hamming Score, Macro-averaged F1, and Micro-averaged F1 metrics, respectively. The results indicate the necessity of integrating the two matching levels to estimate the similarity between two multi-label data chunks, as this provides more comprehensive information than independently considering either the feature level or the label level. In addition, an interesting phenomenon is observed: matching the feature-level similarity seems to help more in improving the accuracy on individual instances, while matching the label-level similarity helps to promote learning performance in terms of equilibrium across labels.

4.6. Comparison of Three Ensemble Updating Rules

Next, we compared the performance of the three different ensemble updating rules presented in Section 3, i.e., First, AVE, and Rec. The results are presented in Table 7, where we observe that MLDME with the Rec rule produced the best results on ten, nine, five, and six datasets in terms of the Subset Accuracy, Hamming Score, Macro-averaged F1, and Micro-averaged F1 metrics, respectively. The results illustrate that adopting the Rec rule is better than using the other two rules; at the least, using the Rec rule is safer. We believe this is because the Rec rule simultaneously considers the utilization rate and survival time of each classifier in the ensemble buffer, which preserves the most useful classifiers to a large extent. In contrast, AVE tends to remove new but potentially useful classifiers from the ensemble buffer, resulting in a loss of decision information and performance degeneration. As for the First rule, although it is simple, it can still acquire excellent performance in several specific drifting scenarios, e.g., frequent reoccurring drifts and slow gradual drifts. In practical applications, we recommend using the Rec ensemble updating rule as a safe setting.

4.7. Comparison of the Running Time

Finally, we compared the running times of the algorithms when treating one data chunk and provide the results in Table 8. We observe that both DME and MLDME are obviously more time-consuming than the other compared algorithms on almost all data streams. For DME, most of its running time is spent calculating the GMM of the new data chunk and the KL divergence between $M$ block pairs. MLDME additionally spends time estimating the label distribution and calculating the label matching level between each reserved data chunk and the newly received data chunk. As indicated in Section 3, although MLDME is much more time-consuming than the simpler algorithms, it is still acceptable in real-world applications, as there always exists a long time interval between two consecutive data chunks.

5. Conclusions

In this study, an adaptive and just-in-time online weighted ensemble multi-label learning algorithm called MLDME is proposed. It simultaneously considers the feature distribution and pseudo-label distribution of a newly received unlabeled data chunk, and adaptively assigns a decision weight to each classifier according to the similarity between its corresponding labeled historical data chunk and the new chunk. Specifically, MLDME adopts the JS divergence based on a Gaussian mixture model to estimate the feature distribution similarity and the LD3 approach to calculate the label distribution similarity between two data blocks. In addition, three different ensemble updating rules are proposed. The experimental results on nine synthetic and three real-world multi-label data streams indicate the necessity of taking full advantage of the newly received unlabeled data chunk, which effectively avoids decision delay. Moreover, the results show that MLDME outperforms several traditional online weighted ensemble learning paradigms and state-of-the-art multi-label online learning algorithms.
Future work will modify MLDME to adapt it to the instance-by-instance streaming scenario. Additionally, how to dynamically and adaptively select the ensemble updating rule according to the feedback of the distribution matching information will also be investigated.

Author Contributions

Conceptualization, C.S. (Chao Shen) and H.Y.; methodology, C.S. (Chao Shen), B.L. and H.Y.; validation, C.S. (Chao Shen) and B.L.; formal analysis, B.L. and C.S. (Changbin Shao); investigation, X.Y. and S.X.; resources, H.Y.; data curation, C.S. (Chao Shen) and B.L.; writing—original draft preparation, C.S. (Chao Shen), B.L. and C.S. (Changbin Shao); writing—review and editing, X.Y., C.Z. and H.Y.; visualization, B.L. and C.Z.; supervision, H.Y.; project administration, X.Y., S.X. and H.Y.; funding acquisition, X.Y., S.X. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the National Natural Science Foundation of China under grant No. 62176107, No. 62076111, and No. 62076215.

Data Availability Statement

The three real multi-label datasets are from the multi-label classification dataset repository that is available at http://www.uco.es/kdis/mllresources/ (accessed on 15 October 2024). The nine synthetic datasets can be generated according to the approach used in [21].

Conflicts of Interest

The authors declare no conflicts of interest, and the funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviation

Symbol: Meaning of the symbol
$\{B_1, B_2, \dots, B_t, \dots\}$: an endless data stream
$B_t$: the data chunk received at time $t$
$P_t(X, Y)$: the joint distribution of the received data block $B_t$
$x_i$: the feature vector of the $i$th instance
$Y_i$: the real label set of the $i$th instance
$y_i^j$: the real value of the $j$th label of the $i$th instance
$\hat{Y}_i$: the predicted label set of the $i$th instance
$\hat{y}_i^j$: the predicted value of the $j$th label of the $i$th instance
$N$: the number of instances in a data block
$|L|$: the number of labels
$\mathcal{E}$: the ensemble buffer
$C_i$: the $i$th classifier in the ensemble buffer
$M$: the size of the ensemble buffer
$M'$: the number of base classifiers participating in the weighted decision
$\mu_i$: the mean vector of the $i$th Gaussian pdf
$\Sigma_i$: the covariance matrix of the $i$th Gaussian pdf
$\lambda_i$: the weight of the $i$th Gaussian pdf
$s$: the number of Gaussian components in the GMM
$JSD(f \| g)$: the Jensen–Shannon divergence between two data distributions $f$ and $g$
$r_i$: the ranking score of the $i$th label
$R_i$: the global ranking sequence of the $i$th data block
$simF_i$: the feature matching level
$Corr$, $simL_i$: the label matching level
$simFL_i$: the integrated matching level
$\gamma$: the regulatory factor tuning the relative significance of the two matching levels
$\tau$: the decision threshold deciding whether a classifier participates in the decision
$w_i$: the normalized decision weight of the $i$th base classifier
$U$: the number of compared algorithms
$H$: the number of datasets used in the experiments

References

  1. Zhang, Q.; Zhang, P.; Long, G.; Ding, W.; Zhang, C.; Wu, X. Online learning from trapezoidal data streams. IEEE Trans. Knowl. Data Eng. 2016, 28, 2709–2723. [Google Scholar] [CrossRef]
  2. Wang, Y.; Ding, Y.; He, X.; Fan, X.; Lin, C.; Li, F.; Wang, T.; Luo, Z.; Luo, J. Novelty detection and online learning for chunk data streams. IEEE Trans. Knowl. Data Eng. 2020, 43, 2400–2412. [Google Scholar] [CrossRef] [PubMed]
  3. Webb, G.I.; Hyde, R.; Cao, H.; Nguyen, H.L.; Petitjean, F. Characterizing concept drift. Data Min. Knowl. Discov. 2016, 30, 964–994. [Google Scholar] [CrossRef]
  4. Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. 2018, 31, 2346–2363. [Google Scholar] [CrossRef]
  5. Webb, G.I.; Lee, L.K.; Goethals, B.; Petitjean, F. Analyzing concept drift and shift from sample data. Data Min. Knowl. Discov. 2018, 32, 1179–1199. [Google Scholar] [CrossRef]
  6. Abdallah, Z.S.; Gaber, M.M.; Srinivasan, B.; Krishnaswamy, S. Activity recognition with evolving data streams: A review. ACM Comput. Surv. 2018, 51, 1–36. [Google Scholar] [CrossRef]
  7. Lopez-Lopez, E.; Pardo, X.M.; Regueiro, C.V. Incremental learning from low-labelled stream data in open-set video face recognition. Pattern Recognit. 2022, 131, 108885. [Google Scholar] [CrossRef]
  8. Al-Ghossein, M.; Abdessalem, T.; Barré, A. A survey on stream-based recommender systems. ACM Comput. Surv. 2021, 54, 1–36. [Google Scholar] [CrossRef]
  9. Faisal, M.A.; Aung, Z.; Williams, J.R.; Sanchez, A. Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study. IEEE Syst. J. 2014, 9, 31–44. [Google Scholar] [CrossRef]
  10. Yang, Z.; Yan, F.; Shen, Y.; Yang, L.; Su, L.; Hu, W.; Le, J. On-Line Fault Diagnosis Model of Distribution Transformer Based on Parallel Big Data Stream and Transfer Learning. IEEJ Trans. Electr. Electron. Eng. 2023, 18, 332–340. [Google Scholar] [CrossRef]
  11. Lin, C.C.; Chen, C.S.; Chen, A.P. Using intelligent computing and data stream mining for behavioral finance associated with market profile and financial physics. Appl. Soft Comput. 2018, 68, 756–764. [Google Scholar] [CrossRef]
  12. Cheng, W.Y.; Juang, C.F. A fuzzy model with online incremental SVM and margin-selective gradient descent learning for classification problems. IEEE Trans. Fuzzy Syst. 2013, 22, 324–337. [Google Scholar] [CrossRef]
  13. Wen, Y.; Liu, X.; Yu, H. Adaptive tree-like neural network: Overcoming catastrophic forgetting to classify streaming data with concept drifts. Knowl. Based Syst. 2024, 293, 111636. [Google Scholar] [CrossRef]
  14. Du, H.; Zhang, Y.; Gang, K.; Zhang, L.; Chen, Y.C. Online ensemble learning algorithm for imbalanced data stream. Appl. Soft Comput. 2021, 107, 107378. [Google Scholar] [CrossRef]
  15. Gomes, H.M.; Barddal, J.P.; Enembreck, F.; Bifet, A. A survey on ensemble learning for data stream classification. ACM Comput. Surv. 2017, 50, 1–36. [Google Scholar] [CrossRef]
  16. Idrees, M.M.; Minku, L.L.; Stahl, F.; Badii, A. A heterogeneous online learning ensemble for non-stationary environments. Knowl. Based Syst. 2020, 188, 104983. [Google Scholar] [CrossRef]
  17. Yu, H.; Webb, G.I. Adaptive online extreme learning machine by regulating forgetting factor by concept drift map. Neurocomputing 2019, 343, 141–153. [Google Scholar] [CrossRef]
  18. Abbasi, A.; Javed, A.R.; Chakraborty, C.; Nebhen, J.; Zehra, W.; Jalil, Z. ElStream: An ensemble learning approach for concept drift detection in dynamic social big data stream learning. IEEE Access 2021, 9, 66408–66419. [Google Scholar] [CrossRef]
  19. de Barros, R.S.M.; de Carvalho Santos, S.G.T. An overview and comprehensive comparison of ensembles for concept drift. Inf. Fusion 2019, 52, 213–244. [Google Scholar] [CrossRef]
  20. Jiao, B.; Guo, Y.; Gong, D.; Chen, Q. Dynamic ensemble selection for imbalanced data streams with concept drift. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 1278–1291. [Google Scholar] [CrossRef]
  21. Gulcan, E.B.; Can, F. Unsupervised concept drift detection for multi-label data streams. Artif. Intell. Rev. 2023, 56, 2401–2434. [Google Scholar] [CrossRef]
  22. Feng, B.; Gu, Y.; Yu, H.; Yang, X.; Gao, S. DME: An adaptive and just-in-time weighted ensemble learning method for classifying block-based concept drift steam. IEEE Access 2022, 10, 120578–120591. [Google Scholar] [CrossRef]
  23. Reynolds, D.A. Gaussian mixture models. Encycl. Biom. 2009, 741, 659–663. [Google Scholar]
  24. Adler, A.; Tang, J.; Polyanskiy, Y. Quantization of Random Distributions Under KL Divergence. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 12–20 July 2021; pp. 2762–2767. [Google Scholar]
  25. Domingos, P.; Hulten, G. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 1 August 2000; pp. 71–80. [Google Scholar]
  26. Kolter, J.Z.; Maloof, M.A. Dynamic weighted majority: An ensemble method for drifting concepts. J. Mach. Learn. Res. 2007, 8, 2755–2790. [Google Scholar]
  27. Brzezinski, D.; Stefanowski, J. Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Trans. Neural Netw. Learn. Syst. 2013, 25, 81–94. [Google Scholar] [CrossRef]
  28. Zhu, Y.; Kwok, J.T.; Zhou, Z.H. Multi-label learning with global and local label correlation. IEEE Trans. Knowl. Data Eng. 2017, 30, 1081–1094. [Google Scholar] [CrossRef]
  29. Nguyen, T.T.; Nguyen, T.T.T.; Luong, A.V.; Nguyen, Q.V.H.; Liew, A.W.C.; Stantic, B. Multi-label classification via label correlation and first order feature dependance in a data stream. Pattern Recognit. 2019, 90, 35–51. [Google Scholar] [CrossRef]
  30. Duan, J.; Gu, Y.; Yu, H.; Yang, X.; Gao, E. CC++: An algorithm family based on ensemble of classifier chains for classifying imbalanced multi-label data. Expert Syst. Appl. 2024, 236, 121366. [Google Scholar] [CrossRef]
  31. Duan, J.; Yang, X.; Gao, S.; Yu, H. A partition-based problem transformation algorithm for classifying imbalanced multi-label data. Eng. Appl. Artif. Intell. 2024, 128, 107506. [Google Scholar] [CrossRef]
  32. Roseberry, M.; Krawczyk, B.; Cano, A. Multi-label punitive kNN with self-adjusting memory for drifting data streams. ACM Trans. Knowl. Discov. Data (TKDD) 2019, 13, 1–31. [Google Scholar] [CrossRef]
  33. Roseberry, M.; Krawczyk, B.; Djenouri, Y.; Cano, A. Self-adjusting k nearest neighbors for continual learning from multi-label drifting data streams. Neurocomputing 2021, 442, 10–25. [Google Scholar] [CrossRef]
  34. Hershey, J.R.; Olsen, P.A. Approximating the Kullback–Leibler Divergence Between Gaussian Mixture Models. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), Honolulu, HI, USA, 15–20 April 2007; Volume 4, pp. 317–320. [Google Scholar]
  35. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  36. Montiel, J.; Read, J.; Bifet, A.; Abdessalem, T. Scikit-multiflow: A multi-output streaming framework. J. Mach. Learn. Res. 2018, 19, 1–5. [Google Scholar]
  37. Huang, G.B.; Zhou, H.; Ding, X.; Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. 2011, 42, 513–529. [Google Scholar] [CrossRef]
  38. Yu, H.; Sun, C.; Yang, X.; Zheng, S.; Wang, Q.; Xi, X. LW-ELM: A fast and flexible cost-sensitive learning framework for classifying imbalanced data. IEEE Access 2018, 6, 28488–28500. [Google Scholar] [CrossRef]
  39. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  40. García, S.; Fernández, A.; Luengo, J.; Herrera, F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 2010, 180, 2044–2064. [Google Scholar] [CrossRef]
Figure 1. Concept drift caused by feature distribution variance in an online intrusion detection system, where two different point markers denote safe and intrusive instances, respectively. Here, a change in invasion strategies alters the feature distribution, rendering the previously trained classification model invalid.
Figure 2. An example of a multi-label learning instance.
Figure 3. An intuitive illustration of the feature matching level and label matching level, where each left subgraph has a higher matching level with the middle subgraph than with the right one. Here, two different point markers denote instances belonging to two different classes, while ■ denotes a specific instance containing the corresponding class label.
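As the references to Gaussian mixture models [23] and KL divergence [24,34,35] suggest, the feature matching level in Figure 3 amounts to comparing the distributions of two data blocks. The sketch below is a minimal illustration of this idea, not the authors' implementation: it fits a GMM to each block and approximates their KL divergence by Monte Carlo sampling in the spirit of Hershey and Olsen [34]; the block data and the function name gmm_kl are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_kl(gmm_p, gmm_q, n_samples=10_000):
    # Monte Carlo approximation of KL(p || q) between two fitted GMMs:
    # sample from p and average the log-density ratio [34].
    # A smaller divergence indicates a higher feature matching level.
    X, _ = gmm_p.sample(n_samples)
    return float(np.mean(gmm_p.score_samples(X) - gmm_q.score_samples(X)))

# Hypothetical blocks: a reserved block and a newly received, shifted block.
rng = np.random.default_rng(42)
reserved_block = rng.normal(0.0, 1.0, size=(500, 5))
new_block = rng.normal(0.5, 1.2, size=(500, 5))

gmm_reserved = GaussianMixture(n_components=3, random_state=0).fit(reserved_block)
gmm_new = GaussianMixture(n_components=3, random_state=0).fit(new_block)
print(gmm_kl(gmm_new, gmm_reserved))
```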
Figure 4. Illustration of different concept drifts.
Figure 5. The flow chart of the proposed MLDME algorithm.
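To make the flow chart concrete, the following sketch shows one plausible form of the weighted decision step: each reserved classifier receives a weight that decays with the divergence between its originating block and the newly received block, and the ensemble predicts each label by weighted voting. The exponential weighting rule, the parameter lam, and the 0.5 voting threshold are illustrative assumptions rather than MLDME's exact formulas.

```python
import numpy as np

def similarity_weights(feature_divs, label_divs, lam=1.0):
    # Hypothetical weighting rule: a reserved block whose feature and label
    # distributions diverge less from the new block yields a larger
    # decision weight for the classifier trained on it.
    d = np.asarray(feature_divs, float) + lam * np.asarray(label_divs, float)
    w = np.exp(-d)
    return w / w.sum()

def ensemble_predict(classifiers, weights, X):
    # Weighted per-label voting over binary indicator predictions: each
    # clf.predict(X) is assumed to return an (n_samples, n_labels) 0/1
    # matrix; a label is output when the weighted vote reaches 0.5.
    votes = sum(w * clf.predict(X) for w, clf in zip(weights, classifiers))
    return (votes >= 0.5).astype(int)
```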
Figure 6. Performance variation curves of the compared algorithms on the Reoccur5 dataset: (a) Subset Accuracy; (b) Hamming Score; (c) Macro-Averaged F1; (d) Micro-Averaged F1.
Figure 7. Performance variation curves of the compared algorithms on the Reoccur15 dataset: (a) Subset Accuracy; (b) Hamming Score; (c) Macro-Averaged F1; (d) Micro-Averaged F1.
Figure 8. Performance variation curves of the compared algorithms on the Suddens dataset: (a) Subset Accuracy; (b) Hamming Score; (c) Macro-Averaged F1; (d) Micro-Averaged F1.
Figure 9. Performance variation curves of the compared algorithms on the 20 ng dataset: (a) Subset Accuracy; (b) Hamming Score; (c) Macro-Averaged F1; (d) Micro-Averaged F1.
Figure 10. Critical difference (CD) diagrams for the four performance metrics.
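The CD diagrams follow the Friedman test with the Nemenyi post hoc procedure of [39,40]. Assuming the standard Nemenyi test at α = 0.05 with k = 7 compared algorithms and N = 12 data streams (tabulated q₀.₀₅ ≈ 2.949), the critical difference evaluates to

$$\mathrm{CD} = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}} = 2.949\sqrt{\frac{7 \times 8}{6 \times 12}} \approx 2.60,$$

so two algorithms whose average ranks differ by less than about 2.60 are not significantly different under this setting; the paper's exact α and q value may differ.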
Table 1. Datasets used in this study.

| Dataset | #Instances | #Features | #Labels | Label Cardinality |
|---|---|---|---|---|
| 20 ng | 19,300 | 1006 | 20 | 1.03 |
| Ohsumed | 13,929 | 1002 | 23 | 1.66 |
| Yelp | 10,806 | 671 | 5 | 1.64 |
| Reoccur5 | 20,000 | 50 | 10 | 3.00 |
| Reoccur10 | 20,000 | 50 | 10 | 2.80 |
| Reoccur15 | 20,000 | 50 | 10 | 2.80 |
| Reoccur20 | 20,000 | 50 | 10 | 2.80 |
| Gradual | 20,000 | 50 | 10 | 4.26 |
| Suddenf | 20,000 | 50 | 10 | 3.65 |
| Suddens | 20,000 | 50 | 10 | 3.00 |
| Incremental1 | 20,000 | 50 | 10 | 3.60 |
| Incremental2 | 20,000 | 50 | 10 | 3.62 |
Table 2. Results of the compared algorithms for the Subset Accuracy metric, where the best result on each data stream is highlighted in bold.

| Dataset | DME | MLSAMP | MLSA | DWM | HT | AUE2 | MLDME |
|---|---|---|---|---|---|---|---|
| 20 ng | 0.4218 | 0.3417 | 0.3615 | 0.3655 | 0.3755 | 0.3932 | **0.4234** |
| Ohsumed | 0.4803 | 0.4779 | 0.4725 | 0.4756 | 0.4743 | 0.4678 | **0.4804** |
| Yelp | 0.3309 | 0.3079 | 0.3437 | 0.2894 | 0.3022 | 0.3418 | **0.3558** |
| Reoccur5 | 0.5163 | 0.3072 | 0.4879 | 0.0697 | 0.3148 | 0.1402 | **0.5702** |
| Reoccur10 | 0.3680 | 0.2616 | 0.3591 | 0.1042 | 0.1977 | 0.1302 | **0.3970** |
| Reoccur15 | 0.4134 | 0.2769 | **0.5390** | 0.2166 | 0.2741 | 0.2337 | 0.4592 |
| Reoccur20 | 0.3970 | 0.3252 | **0.5907** | 0.1761 | 0.2536 | 0.2682 | 0.4602 |
| Gradual | 0.6850 | **0.7626** | 0.7208 | 0.4835 | 0.5961 | 0.7044 | 0.7030 |
| Suddenf | 0.0710 | 0.0471 | 0.0577 | 0.0559 | 0.0749 | 0.0659 | **0.0829** |
| Suddens | 0.3789 | 0.3102 | **0.5424** | 0.1551 | 0.3102 | 0.2010 | 0.4601 |
| Incremental1 | 0.8842 | **0.8900** | 0.8752 | 0.2468 | 0.7644 | 0.8882 | 0.8795 |
| Incremental2 | 0.7933 | 0.8216 | 0.7952 | 0.3495 | 0.6694 | 0.8102 | **0.8229** |
Table 3. Results of the compared algorithms for the Hamming Score metric, where the best result on each data stream is highlighted in bold.

| Dataset | DME | MLSAMP | MLSA | DWM | HT | AUE2 | MLDME |
|---|---|---|---|---|---|---|---|
| 20 ng | 0.9471 | 0.9400 | 0.9432 | 0.9407 | 0.9423 | 0.9436 | **0.9474** |
| Ohsumed | 0.9658 | 0.9656 | 0.9650 | 0.9655 | 0.9650 | 0.9650 | **0.9660** |
| Yelp | 0.8302 | 0.7808 | 0.8108 | 0.8226 | 0.8225 | 0.8306 | **0.8337** |
| Reoccur5 | 0.8115 | 0.8053 | 0.8915 | 0.7362 | 0.8750 | 0.7482 | **0.8929** |
| Reoccur10 | 0.8259 | 0.8512 | **0.8877** | 0.7621 | 0.8488 | 0.7754 | 0.8532 |
| Reoccur15 | 0.9042 | 0.8181 | 0.9051 | 0.7991 | 0.8670 | 0.8176 | **0.9060** |
| Reoccur20 | 0.8294 | 0.8323 | **0.9187** | 0.7939 | 0.8683 | 0.8571 | 0.9021 |
| Gradual | 0.9514 | **0.9564** | 0.9555 | 0.8588 | 0.9402 | 0.9546 | 0.9509 |
| Suddenf | 0.6523 | 0.6874 | 0.7061 | 0.6538 | **0.7116** | 0.6970 | 0.6688 |
| Suddens | 0.8958 | 0.8281 | **0.9059** | 0.7705 | 0.8847 | 0.8292 | 0.9052 |
| Incremental1 | 0.9802 | 0.9767 | 0.9777 | 0.8522 | 0.9648 | **0.9804** | 0.9801 |
| Incremental2 | 0.9657 | 0.9605 | 0.9619 | 0.8334 | 0.9483 | 0.9663 | **0.9713** |
Table 4. Results of the compared algorithms for the Macro-Averaged F1 metric, where the best result on each data stream is highlighted in bold.

| Dataset | DME | MLSAMP | MLSA | DWM | HT | AUE2 | MLDME |
|---|---|---|---|---|---|---|---|
| 20 ng | 0.0527 | 0.0895 | **0.2032** | 0.0972 | 0.1033 | 0.0785 | 0.0538 |
| Ohsumed | 0.0226 | 0.0205 | **0.0456** | 0.0335 | 0.0256 | 0.0233 | 0.0279 |
| Yelp | 0.2164 | 0.2612 | 0.2647 | 0.1223 | 0.1813 | 0.2619 | **0.2917** |
| Reoccur5 | 0.7879 | 0.6118 | 0.7766 | 0.3183 | 0.7149 | 0.3057 | **0.8579** |
| Reoccur10 | 0.6572 | 0.5963 | 0.6976 | 0.3602 | 0.5842 | 0.3572 | **0.7268** |
| Reoccur15 | 0.7170 | 0.5536 | 0.7443 | 0.3715 | 0.6116 | 0.4428 | **0.7581** |
| Reoccur20 | 0.6675 | 0.5907 | **0.7824** | 0.3626 | 0.5913 | 0.5589 | 0.7534 |
| Gradual | 0.9411 | **0.9456** | 0.9446 | 0.8407 | 0.9244 | 0.9445 | 0.9402 |
| Suddenf | 0.2708 | **0.3729** | 0.3532 | 0.3258 | 0.3585 | 0.3635 | 0.3327 |
| Suddens | 0.6694 | 0.6163 | **0.7695** | 0.3266 | 0.7022 | 0.4696 | 0.7299 |
| Incremental1 | 0.9721 | 0.9662 | 0.9689 | 0.7323 | 0.9475 | **0.9726** | 0.9718 |
| Incremental2 | 0.9546 | 0.9494 | 0.9503 | 0.7631 | 0.9129 | 0.9559 | **0.9614** |
Table 5. Results of the compared algorithms for the Micro-Averaged F1 metric, where the best result on each data stream is highlighted in bold.

| Dataset | DME | MLSAMP | MLSA | DWM | HT | AUE2 | MLDME |
|---|---|---|---|---|---|---|---|
| 20 ng | 0.4523 | 0.2742 | 0.3710 | 0.4277 | 0.4457 | 0.4478 | **0.4535** |
| Ohsumed | 0.0628 | 0.0571 | 0.0852 | **0.1003** | 0.0514 | 0.0739 | 0.0834 |
| Yelp | 0.2787 | **0.3816** | 0.3596 | 0.1658 | 0.2132 | 0.3252 | 0.3479 |
| Reoccur5 | 0.7911 | 0.6652 | 0.7988 | 0.4425 | 0.7569 | 0.4007 | **0.8701** |
| Reoccur10 | 0.7186 | 0.6590 | 0.7418 | 0.4798 | 0.6647 | 0.4921 | **0.7682** |
| Reoccur15 | 0.8283 | 0.6277 | 0.7972 | 0.5756 | 0.7217 | 0.6372 | **0.8362** |
| Reoccur20 | 0.7619 | 0.6681 | 0.8275 | 0.5808 | 0.6913 | 0.7215 | **0.8327** |
| Gradual | 0.9439 | **0.9489** | 0.9479 | 0.8447 | 0.9301 | 0.9472 | 0.9436 |
| Suddenf | 0.3574 | 0.4723 | 0.4606 | 0.4026 | **0.5045** | 0.4632 | 0.4222 |
| Suddens | 0.8123 | 0.6809 | 0.8152 | 0.5560 | 0.7695 | 0.6763 | **0.8286** |
| Incremental1 | 0.9743 | 0.9690 | 0.9713 | 0.7807 | 0.9507 | **0.9746** | 0.9741 |
| Incremental2 | 0.9573 | 0.9524 | 0.9534 | 0.7632 | 0.9247 | 0.9585 | **0.9637** |
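For reference, the four metrics reported in Tables 2–5 can be computed with scikit-learn, assuming predictions and ground truth are given as binary label-indicator matrices; the toy data below are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score

# Hypothetical predictions for a 5-instance, 4-label block (1 = label present).
y_true = np.array([[1,0,1,0],[0,1,0,0],[1,1,0,1],[0,0,1,0],[1,0,0,1]])
y_pred = np.array([[1,0,1,0],[0,1,1,0],[1,0,0,1],[0,0,1,0],[1,0,1,1]])

subset_accuracy = accuracy_score(y_true, y_pred)      # exact-match ratio
hamming_score = 1.0 - hamming_loss(y_true, y_pred)    # per-label accuracy
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average='micro')
print(subset_accuracy, hamming_score, macro_f1, micro_f1)
```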
Table 6. Results of the ablation experiments, where the best result on each data stream in terms of each performance metric is highlighted in bold (SA = Subset Accuracy, HS = Hamming Score, MaF1 = Macro-Averaged F1, MiF1 = Micro-Averaged F1).

| Dataset | SA simFL | SA simF | SA simL | HS simFL | HS simF | HS simL | MaF1 simFL | MaF1 simF | MaF1 simL | MiF1 simFL | MiF1 simF | MiF1 simL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 ng | **0.4234** | 0.3853 | 0.4050 | **0.9474** | 0.9425 | 0.9456 | 0.0538 | **0.0821** | 0.0752 | 0.4535 | 0.4492 | **0.4583** |
| Ohsumed | **0.4804** | 0.4603 | 0.4738 | **0.9660** | 0.9641 | 0.9655 | 0.0279 | **0.0515** | 0.0455 | 0.0834 | **0.1600** | 0.1361 |
| Yelp | 0.3558 | 0.3671 | **0.4067** | 0.8337 | 0.8191 | **0.8401** | 0.2917 | **0.3992** | 0.3938 | 0.3479 | **0.4521** | 0.4491 |
| Reoccur5 | **0.5702** | 0.3901 | 0.4538 | 0.8929 | 0.8592 | **0.8954** | **0.8579** | 0.7444 | 0.7936 | **0.8701** | 0.7710 | 0.8183 |
| Reoccur10 | 0.3970 | 0.3591 | **0.4211** | **0.8532** | 0.7789 | 0.7729 | **0.7268** | 0.5452 | 0.5333 | **0.7682** | 0.5989 | 0.5872 |
| Reoccur15 | 0.4592 | **0.5181** | 0.5155 | 0.9060 | **0.9119** | **0.9119** | **0.7581** | 0.6112 | 0.6102 | **0.8362** | 0.6984 | 0.6912 |
| Reoccur20 | **0.4602** | 0.2813 | 0.4505 | **0.9021** | 0.8375 | 0.8676 | **0.7534** | 0.5453 | 0.7021 | **0.8327** | 0.6960 | 0.7730 |
| Gradual | **0.7030** | 0.6967 | 0.6962 | **0.9509** | 0.9498 | 0.9492 | **0.9402** | 0.9393 | 0.9385 | **0.9436** | 0.9425 | 0.9418 |
| Suddenf | **0.0829** | 0.0291 | 0.0270 | **0.6688** | 0.6290 | 0.6201 | 0.3327 | **0.3859** | 0.3810 | 0.4222 | **0.4473** | 0.4393 |
| Suddens | 0.4601 | 0.2910 | **0.5114** | 0.9052 | 0.8682 | **0.9100** | 0.7299 | 0.5946 | **0.7813** | 0.8286 | 0.7537 | **0.8445** |
| Incremental1 | **0.8795** | 0.7377 | 0.7297 | **0.9801** | 0.9010 | 0.8960 | **0.9718** | 0.8582 | 0.8519 | **0.9741** | 0.8692 | 0.8625 |
| Incremental2 | **0.8229** | 0.8152 | 0.8160 | **0.9713** | 0.9687 | 0.9689 | **0.9614** | 0.9586 | 0.9588 | **0.9637** | 0.9610 | 0.9612 |
Table 7. Experimental comparison of the three ensemble updating rules, where the best result on each data stream in terms of each performance metric is highlighted in bold (SA, HS, MaF1, and MiF1 abbreviate the four metrics as in Table 6).

| Dataset | SA Rec | SA Ave | SA First | HS Rec | HS Ave | HS First | MaF1 Rec | MaF1 Ave | MaF1 First | MiF1 Rec | MiF1 Ave | MiF1 First |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 ng | **0.4234** | 0.3776 | 0.3873 | **0.9474** | 0.9456 | 0.9465 | 0.0538 | **0.0849** | 0.0791 | **0.4535** | 0.4465 | 0.4450 |
| Ohsumed | **0.4804** | 0.4401 | 0.4565 | **0.9660** | 0.9653 | 0.9647 | 0.0279 | **0.0550** | 0.0514 | 0.0834 | **0.1634** | 0.1564 |
| Yelp | **0.3558** | 0.2880 | 0.3222 | **0.8337** | 0.8319 | 0.8279 | 0.2917 | 0.3697 | **0.3896** | 0.3479 | 0.4024 | **0.4207** |
| Reoccur5 | 0.5702 | 0.5586 | **0.6472** | 0.8929 | 0.8929 | **0.9449** | **0.8579** | 0.8234 | 0.8336 | **0.8701** | 0.8461 | 0.8496 |
| Reoccur10 | **0.3970** | 0.3756 | 0.3889 | **0.8532** | 0.8466 | 0.8491 | **0.7268** | 0.6882 | 0.7103 | **0.7682** | 0.7329 | 0.7472 |
| Reoccur15 | **0.4592** | 0.4505 | 0.4472 | **0.9060** | 0.9022 | 0.8967 | 0.7581 | 0.7957 | **0.8027** | 0.8362 | **0.8415** | 0.8394 |
| Reoccur20 | **0.4602** | 0.4020 | 0.4342 | **0.9021** | 0.8693 | 0.8729 | **0.7534** | 0.7380 | 0.6835 | **0.8327** | 0.8161 | 0.7728 |
| Gradual | **0.7030** | 0.6929 | 0.6959 | 0.9509 | 0.9534 | **0.9544** | 0.9402 | 0.9431 | **0.9441** | 0.9436 | 0.9460 | **0.9471** |
| Suddenf | **0.0829** | 0.0202 | 0.0224 | **0.6688** | 0.5894 | 0.6151 | 0.3327 | 0.3300 | **0.3804** | 0.4222 | 0.3788 | **0.4378** |
| Suddens | 0.4601 | **0.6022** | 0.5177 | 0.9052 | 0.9248 | **0.9382** | 0.7299 | **0.8702** | 0.8122 | 0.8286 | **0.9006** | 0.8677 |
| Incremental1 | **0.8795** | 0.8276 | 0.8271 | **0.9801** | 0.9475 | 0.9478 | **0.9718** | 0.9202 | 0.9258 | **0.9741** | 0.9288 | 0.9310 |
| Incremental2 | **0.8229** | 0.6793 | 0.6457 | **0.9713** | 0.8899 | 0.8736 | **0.9614** | 0.8143 | 0.7822 | **0.9637** | 0.8186 | 0.7878 |
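The fixed-size buffer that bounds the ensemble can be sketched generically as below. This is only an illustration of the mechanism: choose_victim stands for whichever concrete updating rule is adopted, and the example rule evict_oldest is hypothetical and need not coincide with any of the three rules (Rec, Ave, First) compared in Table 7, whose precise definitions are given in the paper.

```python
def update_buffer(buffer, new_classifier, max_size, choose_victim):
    # Fixed-size classifier buffer: when the buffer is full, one reserved
    # classifier is discarded according to a pluggable updating rule
    # before the newly trained classifier is appended.
    if len(buffer) >= max_size:
        buffer.pop(choose_victim(buffer))
    buffer.append(new_classifier)
    return buffer

# Example rule (illustrative only): always evict the oldest member.
evict_oldest = lambda buf: 0
```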
Table 8. Experimental results for the running time (seconds).

| Dataset | DME | MLSAMP | MLSA | DWM | HT | AUE2 | MLDME |
|---|---|---|---|---|---|---|---|
| 20 ng | 27.2 | 0.527 | 0.516 | 2.37 | 0.362 | 2.52 | 0.292 |
| Ohsumed | 19.2 | 0.574 | 0.564 | 2.78 | 0.381 | 3.03 | 0.218 |
| Yelp | 11.6 | 0.962 | 0.911 | 1.51 | 0.153 | 1.63 | 0.158 |
| Reoccur5 | 2.23 | 0.343 | 0.352 | 0.889 | 0.129 | 0.902 | 2.85 |
| Reoccur10 | 2.45 | 0.351 | 0.344 | 0.789 | 0.132 | 0.822 | 2.86 |
| Reoccur15 | 3.36 | 0.364 | 0.355 | 0.735 | 0.140 | 0.802 | 2.57 |
| Reoccur20 | 3.12 | 0.119 | 0.364 | 0.829 | 0.139 | 0.781 | 4.20 |
| Gradual | 1.59 | 0.363 | 0.351 | 0.898 | 0.134 | 0.885 | 2.07 |
| Suddenf | 4.18 | 0.394 | 0.386 | 0.736 | 0.134 | 0.751 | 5.80 |
| Suddens | 1.90 | 0.365 | 0.359 | 0.697 | 0.141 | 0.717 | 1.99 |
| Incremental1 | 0.91 | 0.362 | 0.385 | 0.899 | 0.238 | 0.781 | 1.04 |
| Incremental2 | 0.80 | 0.307 | 0.319 | 0.897 | 0.243 | 0.877 | 1.03 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
