NLOCL: Noise-Labeled Online Continual Learning

Continual learning (CL) from infinite data streams has become a challenge for neural network models in real-world scenarios. Catastrophic forgetting of previous knowledge occurs in this learning setting, and existing supervised CL methods rely excessively on accurately labeled samples. However, real-world data labels are often corrupted by noise, which misleads CL agents and aggravates forgetting. To address this problem, we propose a method named noise-labeled online continual learning (NLOCL), which implements the online CL model on noise-labeled data streams. NLOCL uses an experience replay strategy to retain crucial examples, separates data streams by small-loss criteria, and includes semi-supervised fine-tuning for labeled and unlabeled samples. In addition, NLOCL combines small loss with class diversity measures and eliminates online memory partitioning. Furthermore, we optimize the experience replay stage to enhance model performance by retaining significant clean-labeled examples and carefully selecting suitable samples. In the experiments, we construct noise-labeled data streams by injecting noisy labels into multiple datasets and partitioning them into tasks to simulate infinite data streams realistically. The experimental results demonstrate the superior performance and robust learning capabilities of the proposed method.


Introduction
Neural network models have demonstrated significant advantages across various research domains such as computer vision, natural language processing, and target detection. They excel in single-task learning and often outperform humans in tasks such as Atari games and object recognition [1,2]. Nevertheless, these models are difficult to adapt to dynamic data streams and multitask learning scenarios. As new tasks arrive continuously, neural networks encounter a notorious problem called "catastrophic forgetting", which is closely tied to the "plasticity-stability" trade-off [3] and leads to a considerable performance decline on past tasks. Overcoming catastrophic forgetting of previous tasks is crucial for achieving continuous task learning [4].
Continual learning (CL) aims to address the forgetting issue in never-ending data streams. It focuses on adapting to changing data distributions in real-world scenarios and retaining performance on old tasks while learning new tasks [5]. Researchers have extensively investigated CL to alleviate catastrophic forgetting. Compared with other types of methods, replay-based methods, in which the neural network model relearns previous task data from a limited memory buffer storing a small amount of observed data, have proven highly effective in combating forgetting. Most CL methods learn in a supervised mode and depend heavily on the accuracy of the labels. Recently, SPR [6] argued that significant performance gains can be achieved if a pure replay buffer is maintained. Nevertheless, SPR relies on clean samples and discards the potentially noisy samples [7].
However, in real-world scenarios, data come from various sources and data quality cannot be guaranteed, leading to the presence of erroneous labels, or noisy labels (NLs), in data streams. As NLs increase, supervised learning becomes misleading, which impairs model performance. This issue is exacerbated in replay-based CL methods, since NLs are replayed from the memory buffer, which leads to more severe catastrophic forgetting.
To address the challenges associated with noise-labeled data in CL, we propose a novel method called noise-labeled online continual learning (NLOCL), which extends the online CL model to effectively handle data streams with NLs. NLOCL employs an experience replay strategy to retain and prioritize important examples, ensuring the model focuses on high-quality data during training. NLOCL incorporates semi-supervised fine-tuning, leveraging both labeled and unlabeled samples to improve model robustness. A key innovation of NLOCL is the integration of small-loss criteria with class diversity measures, which allows the method to maintain diverse and representative samples. Unlike traditional methods, NLOCL eliminates the need for online memory partitioning, simplifying the implementation and reducing computational overhead. Moreover, we have optimized the experience replay stage to improve model performance. By preserving significant clean-labeled examples and carefully selecting appropriate samples for replay, NLOCL ensures that the model continuously learns from the most relevant data, reducing the negative impact of noisy labels.
In the experiments, we simulated noise-labeled data streams by introducing noisy labels into multitask data streams. This setup allows us to rigorously evaluate the effectiveness of NLOCL under conditions that closely resemble real-world scenarios. The experimental results demonstrate the superior performance and robust learning capabilities of our proposed method, especially showcasing its potential for practical applications with noisy and dynamic data streams.
The rest of this paper is organized as follows: Section 2 presents a review of related work. Section 3 is devoted to our proposed NLOCL method. Extensive experiments are conducted in Section 4 to support our claims. We finish the paper with a conclusion in Section 5.

Online Continual Learning
Continual learning settings can be categorized into the offline setting and the online setting [6,8]. In the offline setting, unrestricted use of all current task data is infeasible in real-world scenarios, which imposes limits on continual learning. Models are constrained by limited memory space for training assistance. Offline memory space should ideally match the size of the current task dataset. For the online setting, researchers have proposed differing definitions. Some define it by the fact that each stream sample can be used only once in model training [9,10], whereas the research in [11] considers that only one or a few samples pass through the data stream at any given moment. The former has memory buffer requirements similar to those of the offline setting, resulting in unnecessary memory overhead.

Catastrophic Forgetting
Catastrophic forgetting is a common challenge in CL models. The primary objective is to prevent the neural network from losing information learned from previous tasks, ensuring stable classification performance across continual tasks. Understanding the causes of catastrophic forgetting is crucial for designing CL algorithms, and early studies examined this phenomenon in detail. The work in [12] examines catastrophic forgetting in continual tasks, identifies its causes, and introduces the EWC algorithm. EWC belongs to the regularization-based approaches in CL and prevents parameter drift by penalizing updates that affect old task parameters, ensuring alignment with optimal parameter spaces [12]. When a model learns two tasks sequentially, the abrupt shift in the data distribution upon the arrival of the subsequent task can lead to catastrophic forgetting. In real-world scenarios with complex and large datasets, as task turnover increases, the network tends to adjust parameters crucial for previous tasks to accommodate new ones, prioritizing "plasticity" over "stability". The trade-off increasingly leans towards sacrificing stability for plasticity as the need to adapt to new tasks grows.

Replay-Based Methods
The primary methods for studying CL models are replay-based, regularization-based, and parameter-isolation-based methods. Among these, replay-based methods are versatile and adaptable for CL scenarios. They involve either storing or generating samples of previous tasks for replay. This section focuses on the former approach. Although storing samples requires more storage space and memory overhead than generating pseudo-samples, the latter introduces challenges in training the generative model. In particular, as complex tasks continue to accumulate, model robustness is affected. Replay-based methods represent a general learning paradigm, exemplified by the basic algorithm in [13].
In addition to storing and replaying real samples, some methods opt to replay relevant information from previous tasks instead of actual samples. For example, GEM uses previous task samples held in episodic memory to compute the feasible domain of the previous tasks. It then projects the gradient of the current task into this domain to prevent the gradient descent direction from exceeding the bounds of the previous tasks' feasible domain [14]. A-GEM [15] simplified GEM by converting the optimization problem into estimating a projection onto a single direction, obtained through random selection from the memory buffer. Recent methods such as GPM, TRGP, and GCR advocate avoiding direct access to old data to address potential user data privacy concerns associated with traditional experience replay-based methods [16][17][18]. These methods store each task's input space during forward propagation, representing the features at each layer of the network as a set of orthogonal bases derived from previous task inputs. Leveraging the relationship between the task gradient descent space and the input space, they constrain the direction of subsequent task gradient descent to safeguard important parameter information from significant changes and preserve old knowledge.
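To make the single-direction projection concrete, the following is a minimal sketch of an A-GEM-style update on flattened gradient vectors. The function name `agem_project` and the plain-list representation are illustrative assumptions, not code from [15]: when the current gradient conflicts with the reference gradient computed on a random memory batch (negative inner product), the conflicting component is removed.

```python
def agem_project(grad, ref_grad):
    """Return grad, projected so it no longer conflicts with ref_grad.

    grad: gradient of the current-task loss (flattened list of floats).
    ref_grad: gradient computed on a random batch from the memory buffer.
    """
    dot = sum(g * r for g, r in zip(grad, ref_grad))
    if dot >= 0:
        # No conflict with the memory batch: keep the gradient unchanged.
        return list(grad)
    ref_sq = sum(r * r for r in ref_grad)
    scale = dot / ref_sq
    # Remove the component of grad that points against ref_grad.
    return [g - scale * r for g, r in zip(grad, ref_grad)]


projected = agem_project([1.0, -2.0], [0.0, 1.0])
```

In this toy case the second gradient coordinate opposes the memory gradient, so it is zeroed while the first coordinate is kept.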
In our work, we propose a replay-based method for data streams with labeled and unlabeled samples, which matches the scenario of real-world data. We overcome catastrophic forgetting in the online continual learning setting and enhance model performance through an online memory update strategy that retains significant clean-labeled examples in the memory buffer.

Problem Description
The main task studied in this paper is online task-free CL. Many replay-based CL paradigms have shown good performance in the typical scenario of CL classification tasks, and this paper follows them. Assume a set of $n$ different tasks $\{T_1, T_2, \ldots, T_n\}$, where the data distribution of task $i$ is denoted by $D_i$ and $i$ is the task index. The model observes a data stream that can change abruptly from $D_k$ to $D_{k+1}$, but the model is unaware of when the abrupt change in the data distribution occurs. The data seen by the model at a certain time $t$ from the data stream are represented as $X_t$, and in a noisy environment, $Y_t$ is the noisy label. Notably, the representation of samples differs significantly between the task-free and task-aware settings. In the task-aware setting, the general form of a sample at any given time is $(X_t^i, Y_t^i)$ or $(X_t, Y_t, i)$, where the task identifier $i$ is visible to the neural network. This information helps predict the model output. Training minimizes an optimization objective defined by the corresponding loss function $L$. Because this paper adopts a replay-based approach, a replay buffer $B$ is maintained for the model. $\{X_D, Y_D\}$ and $\{X_B, Y_B\}$ denote mini-batches drawn during fine-tuning from the online data stream and from the replay buffer $B$, respectively. The loss function integrates values from both the memory buffer and the online data stream, forming the basis for learning with noisy labels. This highlights the model's capacity to assimilate new knowledge while preserving old task information. Replay-based methods naturally excel in retaining previous model knowledge, as this formulation shows.
CL algorithms can be categorized by training mode into class-incremental and task-incremental learning [19], also known as task-agnostic and task-aware learning. The distinction between these lies in the use of task identifiers. In task-aware learning, task identifiers are visible to the model during training and testing, which clarifies task boundaries. During training, only the classifier head corresponding to the current task identifier is selected and updated, while during testing, only the relevant classifier head is activated for prediction. In contrast, task-agnostic learning requires all classifier heads to participate in training and prediction output. The difference between them is reflected in the structure of the model output layer. In task-based CL [20][21][22], each new task requires an additional output head, extending the model's output layer to accommodate task-specific predictions. In contrast, task-agnostic learning keeps all classifier heads fixed during both the training and testing stages [23]. Figure 1 illustrates the differences between these two settings.
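For orientation, a replay-based objective of the kind described in this section is commonly written in the generic form below. This is only a sketch under the notation above ($f(\cdot;\theta)$ for the model, $L$ for the loss); the exact objective used in this paper may differ in weighting or normalization.

```latex
\min_{\theta} \; L\big(f(X_D; \theta),\, Y_D\big) \;+\; L\big(f(X_B; \theta),\, Y_B\big)
```

The first term fits the current stream mini-batch and the second term replays a mini-batch from the buffer, which is what couples new learning to the retention of old knowledge.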
In [24], it was explicitly pointed out that noisy labels can lead to retrogressive forgetting consequences [25]. In the context of CL, the negative impact of noisy labels lies in degrading the quality of the memory buffer, leading to more severe catastrophic forgetting [26,27], which hinders subsequent task learning. The study [24] finds that a clean buffer can help improve performance. Conducting CL in a noisy environment on the basis of replay-based methods is therefore reasonable.

Overview of Our Method
Figure 1 succinctly outlines our approach, depicting the model's online learning process on the current task data stream. The upper section illustrates the scenario of online noisy label learning tasks. Data arrive continually as time progresses, potentially containing noisy labels for each task. At task boundaries, the data distribution may change abruptly. The online data stream first fills a finite-capacity online buffer. Subsequently, samples from the buffer are categorized into clean-labeled and noise-labeled samples, and eligible samples are selected for inclusion in the global replay buffer. These replay samples, along with online samples from future tasks, are then used for fine-tuning. To maximize the utilization of the online data stream and enhance learning by capturing crucial information, our approach forgoes the constraint of exclusively using clean samples for training.

Online Buffer Separation and Important Example Retaining
The process of dividing the online data stream determines the quality of the labeled and unlabeled data involved in fine-tuning. Following the paradigm of online CL, a fixed number of samples is taken from the data stream each time, which constitutes an online iteration. The model is represented as $f(\cdot; \theta)$. During each iteration, the network engages in supervised learning with a minimal learning rate. Initially, the DNN tends to grasp simple patterns and only gradually memorizes noisy labels. Samples with clean labels, which incur small losses, are memorized first by the neural network. To differentiate between high-confidence samples and noisy ones, threshold hyperparameters can be configured.
Due to the task-agnostic setting, assuming a total of $c$ classes need to be learned, the model always maintains $c$ classifier heads, and all classifier heads participate in the prediction output of each sample. The prediction vector of sample $x_i$ can be represented as $p_i = [p_i^1, p_i^2, \ldots, p_i^c]$. Correspondingly, the label can be represented as $y_i = [y_i^1, y_i^2, \ldots, y_i^c]$. This paper computes the distance between them to describe their difference. Due to the characteristics of class-incremental learning, $p_i$ can point to classes that the model has not yet learned, which is incorrect. It is necessary to filter out of $p_i$ the classes that will only appear in the future, to help the model judge noise as accurately as possible. The prediction accuracy can be enhanced by employing a class-specific binary mask. In this approach, the mask distinguishes between previously learned and newly introduced classes. Specifically, it assigns a value of 1 to positions corresponding to both old and current classes, signifying the recognition of these classes, while positions corresponding to yet-to-be-learned classes are set to 0. Setting only the bits related to the current task's labels to 1 is insufficient. Consider a common scenario: the classes contained in task $t$ may coincide with the labels of previous tasks, yet the mask bits corresponding to previous task labels would be set to 0.
In our approach, we allow predictions of samples to point to labels contained in previous tasks, avoiding the erroneous auxiliary effect of the mask. The mask bit for class $j$ of $x_i$ is represented as $m_i^j$. The distance of a sample with the corrected mask applied is a mean-squared error loss between $p_i$ and $y_i$ in which the term for class $j$ is weighted by $m_i^j$. In each iteration, the distances of all samples are compared: samples with distances smaller than the threshold are marked as clean-labeled samples and fill the online buffer of clean samples, while the remaining samples, with larger distances, fill the online buffer of noisy samples. The threshold is the average distance of all samples in the batch. From the online buffer of clean samples, the $N_1$ samples with the smallest distances are selected and added to the replay buffer of clean samples. Correspondingly, the $N_2$ samples with the largest distances are selected and added to the replay buffer of noisy samples. $N_1$ and $N_2$ are treated as hyperparameters.
The above describes the process of online sample partitioning and sample selection. Existing methods for noisy label learning have limitations; for example, SPR [6] advocates learning feature representations on reliable samples only. However, in noisy environments, the limited number of clean samples prevents the model from fully learning the feature representations of each class. Subsequent experiments also demonstrate the necessity of replaying noisy samples.
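The partitioning and selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the helper names `masked_distance` and `separate_batch` are hypothetical, a single mask is shared by the whole batch, the masked squared error is normalized by the number of classes $c$, and the $N_2$ noisy candidates are drawn from the noisy partition.

```python
def masked_distance(pred, label, mask):
    """Mean-squared error restricted to classes whose mask bit is 1."""
    c = len(pred)
    return sum(m * (p - y) ** 2 for p, y, m in zip(pred, label, mask)) / c


def separate_batch(preds, labels, mask, n_clean, n_noisy):
    """Split one online iteration into clean and noisy sample indices."""
    dists = [masked_distance(p, y, mask) for p, y in zip(preds, labels)]
    threshold = sum(dists) / len(dists)  # batch-average distance
    clean = [i for i, d in enumerate(dists) if d < threshold]
    noisy = [i for i, d in enumerate(dists) if d >= threshold]
    # Smallest-distance clean samples and largest-distance noisy samples
    # are the candidates for the two replay buffers.
    replay_clean = sorted(clean, key=lambda i: dists[i])[:n_clean]
    replay_noisy = sorted(noisy, key=lambda i: dists[i], reverse=True)[:n_noisy]
    return clean, noisy, replay_clean, replay_noisy


mask = [1, 1, 1, 0]  # class 3 belongs to a future task
preds = [[0.9, 0.05, 0.05, 0.0], [0.1, 0.8, 0.1, 0.0],
         [0.2, 0.7, 0.1, 0.0], [0.2, 0.2, 0.6, 0.0]]
labels = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
clean, noisy, replay_clean, replay_noisy = separate_batch(preds, labels, mask, 2, 1)
```

In this toy batch, only the second sample (labeled class 0 but predicted as class 1) exceeds the batch-average distance, so it is the one routed to the noisy partition.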

Clean Label Learning
Clean-labeled samples are limited. Despite filtering out the most reliable samples from the online data stream, the network may memorize some noisy samples due to the memorization effect of deep neural networks, leading to inevitable interference. To enhance the generalization ability and robustness of deep neural networks, the learning of labeled samples introduces a modified version of the data augmentation method MixUp [27], which generates new examples and labels through linear interpolation. Assume $\{x_{ai}, y_{ai}\}$ and $\{x_{bi}, y_{bi}\}$ are two data augmentations of an example $\{x_i, y_i\}$, along with their one-hot encoded labels. Interpolating between them results in the MixUp augmentation $\{\tilde{x}_i, \tilde{y}_i\}$. This method of data augmentation effectively introduces prior knowledge [11] to the network, enhancing its generalization ability, as shown in Figure 2.
Figure 2. Supervised learning process for clean-labeled samples.
To learn from clean-labeled images, no additional auxiliary learning measures are needed. The augmented samples mentioned above are fed into Net 1, and the corresponding cross-entropy loss, referred to as the loss on labeled samples, is computed using the prediction $p$. In Equation (9), $N$ represents the number of input clean-labeled samples, $y_i$ represents the label of the $i$th sample, and $p_i$ represents the prediction of Net 1 for the $i$th sample.
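As a rough illustration of the clean-label branch, the sketch below interpolates two augmented views of one example and evaluates a cross-entropy loss on the mixed label. It is an assumption-laden example: the function names are hypothetical, the mixing coefficient is drawn from a Beta(1, 1) distribution (the paper does not state its choice here), and the loss is averaged over the $N$ samples.

```python
import math
import random


def mixup(a, b, lam):
    """Linear interpolation between two vectors (inputs or one-hot labels)."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]


def cross_entropy(labels, preds, eps=1e-12):
    """Average cross-entropy between (soft) labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, preds):
        total -= sum(yc * math.log(pc + eps) for yc, pc in zip(y, p))
    return total / len(labels)


rng = random.Random(0)
lam = rng.betavariate(1.0, 1.0)  # assumed Beta(alpha, alpha) with alpha = 1

# Two augmented views of the same example share the same one-hot label.
x_mix = mixup([0.0, 1.0], [1.0, 0.0], lam)
y_mix = mixup([1.0, 0.0], [1.0, 0.0], lam)
loss = cross_entropy([y_mix], [[0.5, 0.5]])
```

Because both views come from the same example, the mixed label stays one-hot up to rounding, while the mixed input lies between the two augmentations.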

Noisy Label Learning
This section presents the fine-tuning framework for learning from noisy labels, which aims to extract more information from the noise-labeled partition, where a relatively high proportion of samples is correctly partitioned, while minimizing the influence of mislabeled data. Drawing inspiration from the supervised contrastive learning framework SCL2 [28], we optimized its components and introduced a finite memory buffer for caching historical feature representations. Subsequently, we devised a composite loss for unlabeled samples, which integrates an unsupervised contrastive learning loss, a supervised contrastive learning loss based on soft labels, a supervised prototype contrastive learning loss based on soft labels, and a corrected soft label loss.
In handling examples with noisy labels, two networks are used, denoted in this paper as Net 1 and Net 2. Both networks have identical architectures, and Net 1's parameters are periodically used to perform momentum updates on Net 2's weights. These momentum updates keep the parameters of the two networks approximately equal while ensuring that the weights of Net 2 change more slowly than those of Net 1, thus preventing Net 1 from being heavily influenced by noise and maintaining model robustness.
For each data example $\{x_j, \hat{y}_j\}$ in the unlabeled sample set $U$, two different data augmentations are obtained, labeled $aug(x_j)$ and $aug'(x_j)$. These augmentations are fed into Net 1 and Net 2, respectively. Net 1 outputs the low-dimensional feature embedding of $aug(x_j)$, denoted as $f(aug(x_j); \theta)$. In addition to this feature embedding, another important output of Net 1 is the prediction for the first data augmentation $aug(x_j)$, represented as $p_j$, which is generated through the fully connected layer of Net 1. The second data augmentation $aug'(x_j)$ is input into Net 2, which only needs to output the low-dimensional feature embedding of this augmented sample, represented as $f(aug'(x_j); \theta')$.
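The momentum update that ties Net 2 to Net 1 can be illustrated with a small sketch on flat parameter lists. The function name and the list representation are assumptions for illustration; the coefficient 0.999 is the value reported later in the training details.

```python
def momentum_update(theta_slow, theta_fast, m=0.999):
    """Move the slow (Net 2) parameters a small step toward the fast (Net 1) ones."""
    return [m * s + (1.0 - m) * f for s, f in zip(theta_slow, theta_fast)]


net1 = [1.0, -2.0]   # parameters updated by gradient descent
net2 = [0.0, 0.0]    # parameters updated only through the moving average
net2 = momentum_update(net2, net1, m=0.9)  # a smaller m makes the effect visible
```

With `m` close to 1, Net 2 changes much more slowly than Net 1, which is the property the text relies on for robustness to noisy updates.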

Loss Function Analysis
The first component of the loss is called the soft-label-corrected loss ($L_{slcor}$). Research on noisy label correction [16] proposed correcting labels using linear interpolation between the original noisy labels and soft labels. The corrected label is represented as $\beta \hat{y}_j + (1 - \beta) p_j$, where $p_j$ represents the network's output prediction. Equation (10), the cross-entropy loss that uses the corrected label as the supervision signal, introduces the recent network predictions as supervision signals. The short-term memory effect of neural networks prevents using the original noisy labels as supervision signals, as the network quickly memorizes some noisy labels, leading to performance degradation. As the number of iterations increases, the influence of the original labels on the network should decrease. The supervision signals are therefore updated over iterations to generate more accurate soft targets. The epoch is represented by $k$. At any epoch $k > 0$, historical supervision is incorporated into the current supervision update, so the update of supervision amounts to an exponential moving average of historical supervision. The loss over all epochs is divided into a term for the first epoch and a term for the other epochs, in which $p_j^{[s]}$ represents the model's prediction for $x_j$ in the $s$-th epoch. Since there are no historical soft labels in the first epoch, the original labels are temporarily used as the supervision. In the first term of the loss, the original label is weighted by $\alpha^k$. $\alpha$ is a real number in the interval $(0, 1)$, ensuring that, as the epochs progress, the weight of the original labels in the supervision decreases gradually. The weights of the updated $y^k$ increase, especially for the recent $y^k$. As the iterations progress, the model's predictions become more accurate and are in turn used to correct the soft labels more accurately.
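The exponential-moving-average view of the supervision update can be sketched as below. This is an illustrative reading of the description above, not the paper's code: the soft label starts from the original noisy label and, at each epoch, is blended with the current prediction using a coefficient `alpha`, so that after $k$ epochs the original label carries weight $\alpha^k$. The function name is hypothetical.

```python
def update_soft_label(prev_soft, pred, alpha=0.9):
    """One EMA step: keep a fraction alpha of history, add (1 - alpha) of the prediction."""
    return [alpha * s + (1.0 - alpha) * p for s, p in zip(prev_soft, pred)]


noisy_label = [0.0, 1.0]          # original (possibly wrong) one-hot label
soft = list(noisy_label)          # epoch 0: no history, use the original label
for pred in ([0.8, 0.2], [0.9, 0.1], [0.9, 0.1]):
    soft = update_soft_label(soft, pred, alpha=0.5)
```

After three epochs with `alpha=0.5`, the original label contributes only $0.5^3 = 0.125$ of the supervision, and the soft label has moved toward the class the network consistently predicts.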
The second part of the loss is the supervised contrastive learning loss, denoted as $L_{supcl}$. Given $N$ sample-label pairs $\{x_j, y_j\}$, $j = 1, 2, \ldots, N$, in a mini-batch, we obtain $2N$ augmented sample pairs $\{\tilde{x}_s, \tilde{y}_s\}$, $s = 1, \ldots, 2N$, through feature extraction on the two networks. Here, $\tilde{x}_{2s-1}$ and $\tilde{x}_{2s}$ represent two different augmented versions of $x_s$, and correspondingly, $y_s = \tilde{y}_{2s-1} = \tilde{y}_{2s}$. Within these $2N$ augmented samples, for any sample index $s \in \{1, \ldots, 2N\}$, we denote by $t$ the index of a sample that shares the same class label as the $s$-th sample. These two augmented samples are considered a positive pair. In the loss for this part, $v_s$ denotes the feature of the $s$-th augmented sample, and $v_t$ represents the features of other samples in the same batch that have the same category as $v_s$. Soft labels are used as supervised information for sample category discrimination. $N_{y_s}$ denotes the number of augmented samples in the batch that have the same category as $v_s$. $I$ is the indicator function: its value is 1 when the subscript condition is satisfied; otherwise, it is 0.
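A supervised contrastive loss of this kind can be sketched in a few lines. The version below follows the widely used SupCon form and is offered as an assumption, not as the exact $L_{supcl}$ of this paper: features are assumed to be L2-normalized, hard class indices stand in for the categories derived from soft labels, and each anchor is normalized by its number of positives.

```python
import math


def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss over a batch of (normalized) feature vectors."""
    n = len(features)
    total = 0.0
    for s in range(n):
        # Temperature-scaled similarities between anchor s and every sample.
        sims = [sum(a * b for a, b in zip(features[s], features[k])) / tau
                for k in range(n)]
        denom = sum(math.exp(sims[k]) for k in range(n) if k != s)
        positives = [t for t in range(n) if t != s and labels[t] == labels[s]]
        if not positives:
            continue  # anchors without positives contribute nothing
        total += -sum(sims[t] - math.log(denom) for t in positives) / len(positives)
    return total / n


feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_consistent = supcon_loss(feats, [0, 0, 1, 1], tau=0.5)
loss_inconsistent = supcon_loss(feats, [0, 1, 0, 1], tau=0.5)
```

When the labels agree with the feature clusters, the loss is small; when samples with the same label sit in different clusters, it is large, which is the pressure that pulls same-class features together.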
In $L_{supcl}$, $v_t$ serves as the positive sample for $v_s$, and $v_k$ serves as the negative sample; $\tau$ is the temperature coefficient. Computing $L_{supcl}$ ensures that positive and negative samples are better distinguished, prompting the network to learn information from the features and labels of samples in the same category.
The third part of the loss concerns global prototype learning. Maintaining a global prototype matrix provides prototype vectors for class centers, assisting the network in discrimination. The objective is to bring the low-dimensional feature representations output by the training network closer to the prototypes of their respective classes. In each subsequent epoch, if the update condition is met, the current prototypes are updated with the exponential moving average of historical prototype vectors and low-dimensional features. This enables the prototypes to evolve as the model sees more samples, better characterizing the center of each class. In the update rule, $k$ represents the class label and $m$ denotes the coefficient of the exponential moving average. Here, it is necessary to use supervised information to determine the class of a prototype, ensuring that the correct prototype is selected for updating. To maintain the stability of prototypes in describing class features, prototype updates can only be performed when two conditions are met. In the prototype learning loss $L_{proto}$, $y_j$ represents the soft label and $\tau$ is the temperature coefficient. The two prerequisites for prototype updates are as follows: first, given a threshold $\beta$, $L_{proto}$ must be less than $\beta$; second, the soft label of the sample must agree with the network prediction. Meeting these conditions restricts the speed of prototype updates and maintains the stability of the network.
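The gated prototype update can be sketched as follows. This is a hedged illustration: the softmax-over-prototypes form of the loss, the function names, and the choice to take the class from the largest soft-label entry are assumptions consistent with the description above, not the paper's exact formulas. The threshold `beta` here is the prototype-update threshold, distinct from the label-interpolation weight that uses the same symbol.

```python
import math


def proto_loss(feature, soft_label, prototypes, tau=0.1):
    """Cross-entropy between the soft label and a softmax over prototype similarities."""
    logits = [sum(f * c for f, c in zip(feature, proto)) / tau for proto in prototypes]
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return -sum(y * (l - log_z) for y, l in zip(soft_label, logits))


def maybe_update_prototype(prototypes, feature, soft_label, pred, beta, m=0.999, tau=0.1):
    """EMA-update the prototype of the sample's class only if both conditions hold."""
    k = max(range(len(soft_label)), key=soft_label.__getitem__)
    k_pred = max(range(len(pred)), key=pred.__getitem__)
    if k != k_pred or proto_loss(feature, soft_label, prototypes, tau) >= beta:
        return False  # soft label and prediction disagree, or the loss is too large
    prototypes[k] = [m * c + (1.0 - m) * f for c, f in zip(prototypes[k], feature)]
    return True


prototypes = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.8, 0.6]
updated = maybe_update_prototype(prototypes, feature, [0.9, 0.1], [0.7, 0.3],
                                 beta=1.0, m=0.5, tau=1.0)
rejected = maybe_update_prototype(prototypes, feature, [0.9, 0.1], [0.2, 0.8],
                                  beta=1.0, m=0.5, tau=1.0)
```

The first call passes both checks and moves the class-0 prototype halfway toward the feature; the second is rejected because the network prediction disagrees with the soft label.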
Finally, the completely unsupervised loss function is denoted as $L_{unsup}$. As shown in Figure 3, a small buffer with a queue structure is used to store the outputs of Net 2 over a period of time, serving as negative samples. The feature buffer is represented as $M$, with a capacity limit of $|M|$. The low-dimensional features of recent online samples in Net 2 are stored in the queue as $v_j'$, and because the queue has limited capacity, the oldest features are discarded as new ones arrive. In this part of the loss, $v_j'$ represents the feature representation obtained from Net 2, $\tau_{unsup}$ represents the temperature coefficient, and $v_m'$ denotes the historical features in the queue $M$.
The loss function computed on the complete set of noise-labeled samples is the weighted sum of the four components, where each $\lambda$ is the weight coefficient of the corresponding component:
$$L_u = \lambda_{slcor} L_{slcor} + \lambda_{sup} L_{supcl} + \lambda_{proto} L_{proto} + \lambda_{unsup} L_{unsup}$$
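A queue-based unsupervised contrastive loss of this kind can be sketched with a bounded deque. The InfoNCE form below (positive pair from the two networks, negatives from the queue) is an assumption in the spirit of the description, and the function name and toy features are illustrative only.

```python
import math
from collections import deque


def unsup_contrastive_loss(v, v_prime, queue, tau=0.2):
    """InfoNCE-style loss: v vs. its Net 2 counterpart, with queued features as negatives."""
    pos = sum(a * b for a, b in zip(v, v_prime)) / tau
    negs = [sum(a * b for a, b in zip(v, q)) / tau for q in queue]
    logits = [pos] + negs
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return -(pos - log_z)


queue = deque(maxlen=2)  # feature buffer M with capacity |M| = 2
for feat in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(feat)   # the oldest feature is dropped once the queue is full

loss_close = unsup_contrastive_loss([1.0, 0.0], [1.0, 0.0], queue)
loss_far = unsup_contrastive_loss([1.0, 0.0], [0.0, 1.0], queue)
```

The loss is small when the two views of a sample agree and large when they do not, and no label information enters the computation.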

Related Algorithmic Process
The entire processing flow is summarized in Algorithms 1 and 2, unifying the loss functions across the two data partitions. Through stochastic gradient descent, the model aims to minimize a semi-supervised loss function that combines the loss on labeled samples with the loss on unlabeled samples $L_u$, the latter weighted by the coefficient $\lambda_u$.
Algorithm 1. Fine-tuning stage processing flow.
Steps listed in Algorithms 1 and 2 include: setting $\theta' = \theta$; computing data augmentations for $\{x_j, y_j\}$ to obtain two augmented versions of each pair; updating the feature embedding queue $B_{feat}$; for each iteration in task $t$, sampling an iteration from the data stream; and, in the function Test, classifying $X_{Test}$ with the network $\phi$.

Datasets
We conduct experiments on three popular datasets and organize them as per the SPR [6] setting. The datasets are detailed in Table 1.

MNIST
The MNIST [29] dataset consists of grayscale images of handwritten digits 0-9. It includes 60,000 training instances and 10,000 test instances. The dimensions of the grayscale images are 28 × 28. Each image is labeled with the digit it contains.

CIFAR-10
This is a color image dataset containing 10 classes such as "airplane" and "frog". Each class comprises 5000 training images and 1000 test images. The dimensions of the color images are 32 × 32. The image below shows some sample images from CIFAR-10 [30].

CIFAR-100
The dataset consists of color images belonging to 100 classes, with each image having dimensions of 32 × 32 pixels. The 100 classes in this dataset can be grouped into 20 superclasses. Each class comprises 500 training images and 100 test images [30].

Training Details and Baselines
In this work, we used a primary updating network and an auxiliary network with the same structure, employing a two-hidden-layer MLP for the 5-class classification task on the MNIST dataset and ResNet18 on the CIFAR datasets. We used 0.999 as the momentum coefficient both for copying the feature extractor parameters from the primary network to the auxiliary network and for the global prototype momentum update, and 0.9 as the momentum coefficient for updating soft labels during the fine-tuning stage. $N_1$ and $N_2$ are set to be consistent with CNLL. The coefficient $\lambda_u$ of the loss function for the unlabeled example part is set to 0.5, and the coefficients $\lambda_{slcor}$, $\lambda_{proto}$, $\lambda_{sup}$, and $\lambda_{unsup}$ for the individual parts of the loss function are each set to 0.25. Each iteration encompasses 500 examples extracted from the online data stream. Within each iteration, the model first goes through a warm-up learning phase spanning 15 epochs with a learning rate of 0.002. Subsequently, it transitions to the fine-tuning stage, comprising 20 epochs of training at a learning rate of 0.025. During fine-tuning, the SGD optimizer is employed with a weight decay of 0.0005. The learning rate is dynamically adjusted using CosineAnnealingLR [31], with 0.001 serving as the lower bound. Mini-batch sizes are uniformly set to 64 across all operations.
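To show how the fine-tuning learning rate evolves, the sketch below evaluates the closed-form cosine annealing schedule with the reported upper value 0.025 and lower bound 0.001. Treating the 20 fine-tuning epochs as the annealing period, and stepping once per epoch, are assumptions; the function name is illustrative.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=0.025, lr_min=0.001):
    """Closed-form cosine annealing between lr_max (step 0) and lr_min (last step)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / total_steps))


# Learning rate at the start of each of the 20 fine-tuning epochs, plus the end point.
schedule = [cosine_annealing_lr(epoch, 20) for epoch in range(21)]
```

The rate starts at 0.025, passes through the midpoint 0.013 after ten epochs, and reaches the lower bound 0.001 at the end of fine-tuning.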

Accuracy Performance
Table 2 compares model performance on the MNIST dataset with injected symmetric noisy labels, and Table 3 reports the results with injected asymmetric noisy labels. The proposed method outperforms the previous state-of-the-art method CNLL, especially under asymmetric noise settings of 20% and 40%. Multitask [39], trained offline using all training data with a 0% noise rate, serves as an upper bound for all algorithms and is not considered a baseline. Specifically, our method achieves higher average accuracy than all baselines under the five noise rate settings.
Tables 4 and 5 compare our method with various baselines on the CIFAR-10 dataset with injected symmetric and asymmetric noise. Although our method does not achieve the best performance in every experiment, it outperforms the previous best baseline by 0.9% to 7.8% in classification accuracy under different noise rate settings. It exhibits even stronger performance in asymmetric scenarios, further highlighting the robustness of our method to noisy labels. We attribute this resilience to the model's ability to judge the reliability of sample labels during online sample separation. During the subsequent semi-supervised fine-tuning, clean and noisy samples are learned separately to avoid mutual interference in the label-correction operations. In addition, our approach makes full use of reliable supervised information when learning from clean-labeled samples, integrates global, periodically updated prototype matrices, and relies on entirely unsupervised techniques to keep erroneous information from noisy labels out of the network's soft-label-correction process. Through these strategies, our algorithm emerges as the most competitive among all methods.
Tables 6 and 7 compare our method with the CL method GDumb and various versions of GDumb combined with noisy label learning methods on the CIFAR-100 dataset with injected random noise and superclass noise. In the experiment with a symmetric noise rate of 60%, our method performs 2.2% lower than CNLL. However, in the other five experiments, our method achieves better results than CNLL; in particular, under 40% random noise, NLOCL surpasses CNLL by 2.4% in average accuracy, making it the most competitive of the compared methods.
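The comparisons above rely on label noise injected into clean datasets. As a minimal sketch of how symmetric and asymmetric noise are commonly injected (not necessarily the exact protocol used in these experiments), the following uses only the Python standard library. The function names, the convention that a symmetric flip always lands on a different class, and the CIFAR-10 style class mapping in the demo are illustrative assumptions.

```python
import random

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """With probability noise_rate, replace a label by a different class chosen uniformly."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # An offset in [1, num_classes - 1] modulo num_classes never maps y to itself.
            y = (y + rng.randrange(1, num_classes)) % num_classes
        noisy.append(y)
    return noisy

def inject_asymmetric_noise(labels, noise_rate, transition, seed=0):
    """With probability noise_rate, flip a label to its mapped class; unmapped classes stay clean."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if y in transition and rng.random() < noise_rate:
            y = transition[y]
        noisy.append(y)
    return noisy

# Demo on synthetic labels for a 10-class problem.
clean = [i % 10 for i in range(10_000)]
sym = inject_symmetric_noise(clean, noise_rate=0.4, num_classes=10)
# Hypothetical mapping between visually similar classes (assumed, not taken from the paper).
asym = inject_asymmetric_noise(clean, noise_rate=0.4, transition={9: 1, 2: 0, 3: 5, 5: 3, 4: 7})

sym_rate = sum(a != b for a, b in zip(clean, sym)) / len(clean)
asym_rate = sum(a != b for a, b in zip(clean, asym)) / len(clean)
print(f"symmetric flip rate:  {sym_rate:.3f}")
print(f"asymmetric flip rate: {asym_rate:.3f}")
```

Because asymmetric noise only affects the classes listed in the mapping, the overall fraction of corrupted labels is lower than the nominal rate, while the corrupted labels are concentrated on pairs of similar classes.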

Ablation Experiment
In this section, to further demonstrate the effectiveness of the proposed method for CL models in noisy label learning tasks, we conduct ablation experiments that analyze the contributions of the individual components of the noisy-label part of the objective function and the necessity of learning from noisy samples.

Effectiveness of Individual Components of the Noisy Label Objective Function
To verify the effectiveness of the proposed measures for learning from noise-labeled samples, we conduct ablation experiments on CIFAR-10 with symmetric and asymmetric noise rates set at 20%. We combine the components of L_u in various ways and assess their effectiveness and rationality based on the resulting model performance. The experimental results are summarized in Table 8. At a symmetric noise rate of 20%, each component of L_u contributes to the model's adaptability to noise-labeled data. Global prototype learning, with periodic updates to the prototype matrices based on sample features, is pivotal in keeping the model aligned with the centers of previously learned classes as tasks are learned continually. Furthermore, the fully unsupervised contrastive method, which uses historical features, mitigates interference from noisy labels when refining soft labels and thereby strengthens robust learning on noise-labeled data streams. When each loss term is removed individually, the model's classification accuracy drops by over 10%. This outcome also holds at the same asymmetric noise rate.

Ratio Coefficient between Labeled Samples and Unlabeled Samples
In this section, we focus on the ratio between the losses on labeled and unlabeled data for the two types of noise at the same noise rate. We explore different values of λ_u in the loss function. As shown in Table 9, the best classification performance is achieved when λ_u is set to 0.25.
The preceding sections discussed the practical importance of handling noise-labeled data. Given their abundance and lower cost, noise-labeled samples hold greater relevance in real-world applications. While numerous studies opt to exclude noise-labeled samples during supervised fine-tuning, in theory, incorporating a broader range of samples in fine-tuning yields better representation learning. Building on this premise, we show that integrating a portion of noise-labeled samples into CL replay, alongside clean-labeled samples, outperforms using an equivalent number of solely clean-labeled samples.
Two settings were tested in this part of the ablation study. First, we kept N_1 unchanged while setting N_2 to 0, which removes the original noise replay samples while keeping the online noise buffer unchanged. Second, we additionally removed the online noise buffer and used only high-confidence samples for replay in the online setting. Both settings again use the CIFAR-10 dataset with symmetric and asymmetric noise rates set at 20%. The results of this ablation experiment are reported in Table 10.
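The three configurations compared in this ablation can be sketched as follows. This is an illustrative sketch only: it assumes that N_1 is the number of clean-labeled replay samples and N_2 the number of noise-labeled replay samples, and the function name, argument names, and buffer representation are hypothetical rather than taken from the NLOCL implementation.

```python
import random

def build_finetune_sets(clean_memory, noisy_memory, online_noisy_buffer,
                        n1, n2, use_online_noisy_buffer=True, seed=0):
    """Return (labeled, unlabeled) sample lists for one fine-tuning round.

    Assumption: n1 clean-labeled and n2 noise-labeled samples are drawn from
    replay memory; the online noisy buffer is appended to the unlabeled set.
    """
    rng = random.Random(seed)
    labeled = rng.sample(clean_memory, min(n1, len(clean_memory)))
    unlabeled = rng.sample(noisy_memory, min(n2, len(noisy_memory)))
    if use_online_noisy_buffer:
        unlabeled = unlabeled + list(online_noisy_buffer)
    return labeled, unlabeled

# Toy sample identifiers standing in for stored examples.
clean_memory = [f"clean_{i}" for i in range(500)]
noisy_memory = [f"noisy_{i}" for i in range(500)]
online_noisy_buffer = [f"online_{i}" for i in range(50)]

# Full configuration: clean replay, noisy replay, and the online noisy buffer.
full = build_finetune_sets(clean_memory, noisy_memory, online_noisy_buffer, n1=200, n2=100)
# First ablation: N_2 = 0, online noisy buffer kept.
no_noisy_replay = build_finetune_sets(clean_memory, noisy_memory, online_noisy_buffer, n1=200, n2=0)
# Second ablation: N_2 = 0 and no online noisy buffer, so only clean samples remain.
clean_only = build_finetune_sets(clean_memory, noisy_memory, online_noisy_buffer,
                                 n1=200, n2=0, use_online_noisy_buffer=False)

for name, (lab, unlab) in [("full", full), ("N_2 = 0", no_noisy_replay), ("clean only", clean_only)]:
    print(f"{name}: {len(lab)} labeled, {len(unlab)} unlabeled")
```

Under these assumptions, the two ablations differ from the full configuration only in the size of the unlabeled set, which isolates the contribution of noise-labeled samples to fine-tuning.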

Conclusions
This paper proposes a method named noise-labeled online continual learning (NLOCL), which enhances the practical significance of CL models by constructing realistic data stream scenarios. It addresses the problems that replay-based CL methods face with online data streams, task-free boundaries, and noisy label learning. In our method, clean- and noise-labeled data are partitioned from the online data. NLOCL uses reliable supervision for accurate learning and corrects unreliable supervision by combining supervised contrastive learning with unsupervised contrastive learning. Our experiments include detailed comparative and ablation experiments, which demonstrate the superiority of NLOCL in noisy-label CL scenarios. In future work, we will consider overcoming catastrophic forgetting in spiking neural networks [40][41][42], which use discrete spikes to compute and transmit information.

Figure 1. The structure of the overall method for noise-labeled online continual learning (NLOCL) based on sample separation replay.

Figure 3. Description of fully unsupervised feature-based contrastive learning using a feature buffer.

Figure 4 summarizes the learning process of unlabeled samples. It depicts the dual-model structure and indicates momentum updates. The main components of the loss function and the information contained in the model output within the loss function are also annotated in the figure.

Table 8. Ablation experiments on the loss function for learning from unlabeled samples.

Table 9. Ablation experiments on learning ratio coefficients for labeled and unlabeled samples.

Table 10. The necessity for noise-labeled samples in fine-tuning.