Continual Learning Strategy in One-Stage Object Detection Framework Based on Experience Replay for Autonomous Driving Vehicle

Object detection is a key capability for autonomous driving vehicles (ADV), which may comprise a machine learning model that detects a range of classes. As the deployment of ADV widens globally, the variety of objects to be detected may grow beyond the designated range of classes. Continual learning for object detection ensures that a model can robustly adapt to detect additional classes on the fly. This study proposes a novel continual learning method for object detection that learns new object class(es) along with a cumulative memory of classes from prior learning rounds to avoid catastrophic forgetting. Results on PASCAL VOC 2007 suggest that the proposed ER method incurs only a 4.3% mAP drop compared with all-classes learning, the lowest among prior methods.


Introduction
A deep learning algorithm is expected to learn from data that are constantly changing. Consequently, a deep learning model must be retrained from the ground up repeatedly to stay relevant. Humans are naturally capable of learning new things continuously while retaining their previous knowledge; this ability is known as continual or life-long learning. In contrast, for a machine, and especially for a deep learning model, naïvely learning new things means overwriting previous knowledge. Instead of acquiring additional knowledge, the model loses its ability to detect the previously learned class(es). Naïvely forcing an already-trained model to learn additional classes without any strategy leads to catastrophic forgetting, a phenomenon in which a model forgets previously learned knowledge. This forgetting of previous knowledge is a major shortcoming of convolutional neural networks (CNN). To preserve performance on previous knowledge, a training strategy is needed that overcomes the shortcomings of naive training while learning additional classes [1]. The continual learning challenge has persisted for decades in the deep learning field. Moreover, as new data with more classes have become easier to obtain in recent years, continual learning has gained more attention from the research community [2,3]. Numerous methods have been proposed to deal with catastrophic forgetting. However, most of those solutions focus only on the classification problem, while catastrophic forgetting in object detection remains largely untouched [11-14]. Even though one can attach an object proposal algorithm before a classification network to create an object detection framework, the resulting framework would be a two-stage detector.
However, such a framework is not suitable for real-time applications because of its high inference time [1]. In gradient episodic memory (GEM) [15], previous data are stored in an episodic memory to avoid forgetting previous knowledge during the current continual training. GEM offers both forward and backward transfer of knowledge; however, the backward transfer to previous tasks increases the computation time. Moreover, GEM uses a gradient constraint approach, which limits the gradient to prevent the weights from changing drastically. While such a strategy preserves the weights to some degree, it also limits the network's ability to learn new classes effectively. Consequently, this strategy often faces a dilemma between preserving the weights of previously learned classes and radically changing the weights to learn new classes effectively. In recent years, a new strategy employing knowledge distillation (KD) [16], originally used for model compression, has been proposed as an alternative [1,17]. While this strategy is generally more stable and yields better results than gradient constraints, it demands a large amount of additional data that should be similar to the previous classes. This additional data requirement strongly affects the performance of the model and defeats the original purpose of continual learning, which is to add scarce and unique classes incrementally.
To overcome the shortcomings of prior works, a method that mimics the biological ability to replay past experiences is proposed. A similar strategy has been implemented successfully in image classification [15] and reinforcement learning [18]. Nevertheless, object detection is a substantially different domain, and several modifications and adjustments have to be made for the proposed method to work properly, mainly concerning memory integration during the l [1] training phase. Our research demonstrates that replay can be an effective strategy for continual learning. In this paper, the proposed method is demonstrated on You-Only-Look-Once (YOLO), since it is one of the most popular one-stage detection frameworks [19]. However, the proposed method is flexible and can be implemented in other object detection frameworks without modifying the network architecture.

Related Work
Although there has been plenty of work addressing continual learning, it typically focuses only on the classification problem. This research targets the more practical setting of object detection, for which continual learning has received far less attention. Amidst those few works is the continual learning scheme applied to the faster region-based convolutional neural network (Faster R-CNN) [1,20] object detection framework. It avoids catastrophic forgetting of previously learned knowledge by distilling knowledge from the previous model: the external distilled proposals of the prior network are saved and used as pseudo-data in continual learning. The previous network layers are frozen while the continual one is trained, and the whole network is used at the inference stage. However, Faster R-CNN is a two-stage object detector that relies on an external network for extracting proposals, which results in higher computational complexity. Another recent approach to overcoming catastrophic forgetting is proposed in [21], which improves the learning procedure of elastic weight consolidation (EWC) for object detection. To remember previous knowledge, it uses pseudo-annotation of previously learned classes, and a Laplace approximation [22] is applied so that the likelihood of each task is diagonal.
Other research that focuses on continual learning for object detection is deep model consolidation (DMC) [17]. In DMC, a double distillation loss is proposed to combine two models that specialize in different classes into one compact model that can detect all the preceding models' classes. First, two networks are trained on different data. Then, both models are consolidated into a single model using the double distillation loss through training on unlabeled auxiliary data. Although DMC is fast at the inference stage, its training time is extremely long compared with baseline object detectors, and it demands higher computational power since the consolidation phase requires three models to run simultaneously. Moreover, the additional data needed for the consolidation phase are massive, since they are unlabeled. Furthermore, the auxiliary data strongly affect the performance of the consolidated model, so results vary with the number of images and the domain similarity between the original and auxiliary data. Therefore, in cases involving many unique and rare classes, DMC may not perform very well.
Other prior continual learning methods for object detection are inspired by KD [16], where previous knowledge is saved in a frozen copy of the previous model. The object detection frameworks considered for knowledge distillation are proposal-generation-based methods that utilize stored proposals from previous tasks [1]. Some models rely on auxiliary data and multi-model training, as proposed in DMC [17], which consumes more computational power and resources. In contrast, the proposed approach is more general and can be used in any conventional object detection model, and it does not require any auxiliary data while learning the current data. A comprehensive comparison between the advantages and disadvantages of the proposed method and previous methods is given in Section 4.

Proposed Methodology
Suppose that a model l [0] is trained normally on n classes from the first task dataset, which will be referred to as the l [0] dataset, and that the model l [0] is then required to detect t additional classes without using the whole l [0] dataset. To avoid catastrophic forgetting, where the model forgets features of the classes in the l [0] dataset, a method that utilizes memory is proposed. The proposed method, referred to as Experience Replay (ER), works by saving a portion of the l [0] dataset into memory. The images in the memory are then concatenated with the second task dataset, denoted as the l [1] dataset, in every iteration of the l [1] training phase, as shown in Figure 2. Furthermore, dynamic omission is implemented to ensure that the memory and the l [1] dataset can be adequately combined during the l [1] training phase.
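The memory mechanism described above can be sketched as a small buffer plus a batch builder. This is an illustrative sketch, not the paper's implementation: the names `ReplayMemory` and `build_mixed_batch`, the sampling policy, and the data layout are our own assumptions.

```python
import random
import torch

class ReplayMemory:
    """Fixed-size buffer holding (image, target) pairs saved from the l[0] dataset."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []

    def store(self, image, target):
        # Keep at most `capacity` old-task samples (a portion of the l[0] dataset).
        if len(self.buffer) < self.capacity:
            self.buffer.append((image, target))

    def sample(self, k):
        # Draw k old-task samples to mix into the current mini-batch.
        return random.sample(self.buffer, k)

def build_mixed_batch(new_images, new_targets, memory, batch_size):
    """Concatenate c new-task (l[1]) images with (batch_size - c) replayed l[0] images."""
    c = len(new_images)
    replayed = memory.sample(batch_size - c)
    mem_images = torch.stack([img for img, _ in replayed])
    mem_targets = [tgt for _, tgt in replayed]
    images = torch.cat([torch.stack(new_images), mem_images], dim=0)
    targets = new_targets + mem_targets
    return images, targets
```

In every l [1] iteration, the training loop would call `build_mixed_batch` before the forward pass, so each mini-batch always contains both tasks.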

YOLO Architecture
The proposed ER method utilizing YOLO is suitable for real-time applications. Nevertheless, ER is a flexible strategy and is easy to implement in other frameworks without modifying the architecture. The YOLO architecture used here is shown in Figure 3. To predict objects' locations, YOLO utilizes N × N grid cells at three different scales, located at layers 89, 101, and 113. Each grid cell has a fixed number of predictions depending on the number of anchor boxes specified. To update YOLO's weights, the loss between the target and the model prediction needs to be calculated. First, in each grid cell, the bounding box regression loss (L bb ) is calculated using the generalized intersection over union (GIOU) [23]. Suppose a target bounding box is denoted as Tx and a predicted bounding box as Ty, both containing the coordinates and size of the bounding box. The algorithm computes C as the minimum rectangular area that encloses both bounding boxes, and then obtains the GIOU using Equation (1). The resulting value is used as the input for Equations (3) and (5).
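Equation (1) follows directly from the definitions above: GIOU equals the IoU minus the fraction of the enclosing box C that is not covered by the union of the two boxes. A minimal sketch, assuming corner-format (x1, y1, x2, y2) boxes:

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # C: the smallest rectangle enclosing both boxes
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    area_c = (cx2 - cx1) * (cy2 - cy1)
    # GIOU subtracts the fraction of C not covered by the union
    return iou - (area_c - union) / area_c

def giou_loss(box_a, box_b):
    # Bounding-box regression loss: L_bb = 1 - GIOU, ranging over [0, 2]
    return 1.0 - giou(box_a, box_b)
```

Identical boxes give GIOU = 1 (zero loss), while disjoint boxes drive GIOU negative, which is what makes the loss informative even when the boxes do not overlap.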
Aside from coordinates and sizes, each bounding box also contains an objectness score, denoted as Xo and Yo for the predicted and target bounding boxes, respectively. The objectness score of a predicted bounding box lies between 0 and 1, whereas for target bounding boxes the score is either 0 or 1, depending on the target object's presence in the corresponding grid cell. Meanwhile, the classification scores of the predicted and target bounding boxes are denoted as Xc and Yc, respectively. Unlike the objectness score, which is a single value per bounding box, the classification score is represented as a one-hot vector with a length equal to the number of the model's classes. The confidence score of each bounding box is the multiplication of the objectness and classification scores, as shown in Equations (4) and (5), respectively. Both objectness and classification losses are calculated using binary cross-entropy (BCE) as written in Equation (2), where Cx and Cy represent the input and output of the function, respectively. It is important to note that the input of BCE should be normalized using the sigmoid function (σ) to avoid an exploding gradient. In Equation (2), w denotes the positive weight, which should be used when there is a drastic imbalance between precision and recall; otherwise, the default value of w = 1 is used. Lastly, the total loss, as shown in Equation (7), is calculated as α×L conf + β×L bb , where α and β denote the weights of the respective losses.
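Under the description above, the loss composition might look as follows in PyTorch. This is a hedged sketch: Equations (2)-(7) themselves are not reproduced in this excerpt, so the reduction, the per-box bookkeeping, and the placeholder values α = β = 1 are assumptions. `BCEWithLogitsLoss` is used because it fuses the sigmoid σ with BCE for numerical stability, matching the note about exploding gradients, and its `pos_weight` argument plays the role of w.

```python
import torch
import torch.nn as nn

# Positive weight w defaults to 1 when precision and recall are balanced.
w = torch.tensor(1.0)
bce = nn.BCEWithLogitsLoss(pos_weight=w)  # applies sigmoid internally

def total_loss(obj_logits, obj_targets, cls_logits, cls_targets, l_bb,
               alpha=1.0, beta=1.0):
    """Sketch of the combined loss: L_total = alpha * L_conf + beta * L_bb.

    alpha and beta are the loss weights; the values here are placeholders,
    not the paper's tuned settings.
    """
    l_obj = bce(obj_logits, obj_targets)   # objectness: BCE on sigmoid logits
    l_cls = bce(cls_logits, cls_targets)   # classification: BCE vs. one-hot targets
    l_conf = l_obj * l_cls                 # confidence = objectness x classification
    return alpha * l_conf + beta * l_bb
```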

Task Distribution and Data Augmentation
The Pascal VOC 2007 [24] dataset is used for training the proposed ER method. For experimental purposes, the dataset is divided into the l [0] and l [1] tasks. The l [0] dataset is used to train YOLO normally, whereas the l [1] dataset is used for continual learning. Training on the l [1] dataset simulates a real-world condition where the model needs to learn additional classes; the l [1] dataset contains entirely different classes from the l [0] dataset. After the model l [0] is trained normally on the l [0] dataset, it is used as the pre-trained weight for continual learning on the l [1] task. It is also important to increase the variance of the training data: by utilizing data augmentation correctly, the accuracy of the model can be further improved, as elaborated in Section 4.2.
The most recent YOLO utilizes CutMix [25] for data augmentation. In the ER implementation for task l [1] , CutMix should be applied after images are loaded from memory and concatenated with the new dataset. Saving augmented images into memory instead of non-augmented ones slightly decreases the accuracy, as shown in our experiment in Section 4. Another important point is the effect of the imbalanced number of images between the l [0] and l [1] datasets, shown in Table 1. In the 19 + 1 scheme, the purpose of switching between the TV and person classes as the l [1] task is to observe this particular issue. The data imbalance between the l [0] and l [1] tasks may provide a better understanding of how to implement ER correctly.
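The ordering above (store raw images in memory, augment only after loading and concatenation) can be illustrated with a toy CutMix step. The fixed half-size patch and the crude center-inside box filter below are simplifications for illustration, not the paper's exact CutMix settings:

```python
import random
import torch

def cutmix_pair(img_a, boxes_a, img_b, boxes_b):
    """Toy CutMix for detection: paste a random patch of img_b into img_a
    and keep the boxes of img_b whose centers fall inside the patch.

    Applied on the fly to the concatenated batch, so the memory itself
    always holds non-augmented images.
    """
    _, h, w = img_a.shape
    # Random half-size patch (a simplification of CutMix's sampled ratio)
    x1, y1 = random.randint(0, w // 2), random.randint(0, h // 2)
    x2, y2 = x1 + w // 2, y1 + h // 2
    mixed = img_a.clone()
    mixed[:, y1:y2, x1:x2] = img_b[:, y1:y2, x1:x2]
    # Keep boxes_b whose center lies inside the pasted region
    kept = [b for b in boxes_b
            if x1 <= (b[0] + b[2]) / 2 <= x2 and y1 <= (b[1] + b[3]) / 2 <= y2]
    return mixed, list(boxes_a) + kept
```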

Memory Replay on Continual Learning
Given training with batch size b, each iteration contains c training images from the l [1] dataset and (b − c) images from the l [0] dataset, which are then concatenated. By mixing in images from the l [0] dataset, the model is able to avoid catastrophic forgetting. However, this strategy causes the model to train longer, since each iteration contains only c images from the l [1] task dataset. Therefore, one training epoch has a number of iterations equal to the number of images in the new dataset divided by c, regardless of the batch size. Details of the proposed continual learning algorithm are described in Algorithm 1, specifically in lines 12-23. When training the model l [1] in the continual learning scheme, the loss function is the same as the one used in normal training. L conf is the multiplication of L cls and L obj , which are calculated using BCE in Equation (6), whereas L bb is the GIOU loss shown in Equation (3). During the l [1] training phase, L obj is affected by dynamic omission, described in Algorithm 1, lines 21-27. Suppose a bounding box is predicted as a class that belongs to the l [0] dataset; the L obj of the corresponding bounding box is then invalidated, since the prediction may correspond to an unlabeled target object. This rule prevents reckless training, which is illustrated in Figure 4. Reckless training occurs because the pre-trained model used in the l [1] training phase is already capable of predicting objects that belong to the l [0] dataset. Thus, when training the l [1] model, it may detect an object belonging to the l [0] dataset in an l [1] image. Conversely, the model may also detect objects belonging to the l [1] dataset in memory images.
However, since labels of l [0] objects are not present in the l [1] dataset and vice versa, calculating losses from these predictions is inadvisable. It is important to note that this strategy is also highly affected by the number of images from the l [0] dataset, denoted as m, that are stored in the memory. It is not compulsory to store all images from the l [0] dataset in the memory, but having more images in the memory leads to better accuracy on the l [0] classes.
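The dynamic omission rule above (invalidate L obj for old-class predictions on l [1] images and for new-class predictions on memory images) can be sketched as a per-box mask. Algorithm 1 itself is not reproduced in this excerpt, so the tensor layout and function name here are assumptions:

```python
import torch

def objectness_mask(pred_classes, image_is_memory, old_class_ids):
    """Dynamic omission sketch: zero the objectness-loss weight for predictions
    whose class belongs to the *other* task's label set.

    pred_classes:    (N,) predicted class id per bounding box
    image_is_memory: (N,) bool, True if the box comes from a replayed l[0] image
    old_class_ids:   set of class ids trained in the l[0] phase
    """
    mask = torch.ones_like(pred_classes, dtype=torch.float32)
    for i, (cls_id, from_mem) in enumerate(zip(pred_classes.tolist(),
                                               image_is_memory.tolist())):
        pred_is_old = cls_id in old_class_ids
        # On l[1] images, old-class predictions may hit unlabeled l[0] objects;
        # on memory images, new-class predictions may hit unlabeled l[1] objects.
        if (not from_mem and pred_is_old) or (from_mem and not pred_is_old):
            mask[i] = 0.0
    return mask
```

Multiplying this mask into the per-box objectness loss leaves the classification and box-regression terms untouched, which matches the description that only L obj is invalidated.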

Results
In this section, the proposed ER method is extensively evaluated on the Pascal VOC 2007 dataset. Results are provided for three variants of the proposed ER method: ER with a memory size of 2500, ER with a memory size of 2500 and data augmentation, and ER with a memory size of 1000 and data augmentation. The proposed ER method is implemented on an NVIDIA 1080 GPU and trained using stochastic gradient descent (SGD) [26] for 100 iterations. All experiments are written in Python using PyTorch [27] as the machine learning framework. The parameters for training the model are described in Table 2.
The distribution of the trainval and test sets for training and testing, respectively, is unchanged from the official Pascal VOC 2007 release. Following the experiments presented in [1,17], the proposed continual learning method is evaluated on two tasks: ten classes for the l [0] task and ten classes for the l [1] (continual) task. The other experiment uses 19 classes for the l [0] task and 1 class for the l [1] task. The distribution of objects and images for the 10 + 10 and 19 + 1 class schemes is presented in Table 1 in Section 3. A few images of the Pascal VOC 2007 dataset are visualized in Figure 5. The graphs of loss, classification, recall, precision, and mean average precision (mAP) for both the previous task and the current task are shown in Figure 6.

Performance Evaluation Parameters
The mAP is calculated independently for the l [0] and l [1] tasks. To calculate the mAP, the precision and recall scores are first obtained from the discrepancy between ground truth and prediction. Precision is the ratio of true positives to the sum of true positives and false positives. Recall, also known as sensitivity, is the ratio of true positives to the sum of true positives and false negatives. The formulas for precision and recall are given in Equations (8) and (9), respectively. The AP and mAP are calculated in Equations (10) and (11). A Jaccard index (J) is used to calculate the overlap between the predicted and ground-truth bounding boxes A and B, as shown in Equation (12).
where AP, TP, and FP denote average precision, true positive, and false positive, respectively, and p(r) represents the precision as a function of recall, i.e., the precision-recall curve. The mAP is the mean of the AP over N classes. The proposed ER method is compared with prior state-of-the-art continual learning methods for object detection [1,15,17]. These methods were implemented on different object detection frameworks by their respective authors; for a fair comparison, they are adapted to the YOLO framework, and the results of these adapted methods are compared with the proposed ER method.
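The metrics in Equations (8), (9), (11), and (12) are simple enough to state directly in code. This sketch assumes corner-format (x1, y1, x2, y2) boxes and scalar TP/FP/FN counts; the AP integration over the precision-recall curve (Equation (10)) is omitted because its interpolation scheme is not given in this excerpt:

```python
def precision(tp, fp):
    # Equation (8): precision = TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation (9): recall = TP / (TP + FN)
    return tp / (tp + fn)

def jaccard(box_a, box_b):
    """Equation (12): J(A, B) = |A intersect B| / |A union B| for corner-format boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def mean_ap(ap_per_class):
    # Equation (11): mAP is the mean of the per-class AP values over N classes
    return sum(ap_per_class) / len(ap_per_class)
```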

Addition of Classes Incrementally
In the first experiment, the first 19 classes in alphabetical order in the Pascal VOC 2007 dataset are considered the l [0] task, and the remaining one the l [1] task. The model l [0] is trained normally on the first 19 classes (1-19) on the trainval subset, and the model l [1] is trained using continual strategies on the remaining class, which is the TV class. A summary of the comparison between the proposed ER method and state-of-the-art methods is shown in Table 3. The baseline approach for continual learning is to store some of the images from the l [0] task in memory and replay that memory repeatedly while training on the l [1] task. By replaying those memories, ER is able to preserve the features of the classes in the l [0] dataset while maintaining the same accuracy on the l [1] classes. In the l [0] training phase, 19 classes are trained first; then, during the l [1] training phase, only one class is trained incrementally. This scheme is performed to observe the effect of class distribution during continual learning. In comparison with previous methods [1,15,17], the proposed ER method has significantly higher mAP for all classes: it improves the mAP by 8.9%, 8.8%, and 30.5% compared with GEM [15], DMC [17], and KD [1], respectively. Specifically, the proposed ER method with data augmentation achieves 68.9% mAP for all 20 classes.
The second experiment is performed on the 10 + 10 class scheme. The first ten classes (1-10) are trained as the l [0] task, and the remaining ten (11-20) are trained as the l [1] task for continual learning. As presented in Table 4, GEM is the only method that achieves a higher result than the proposed ER method, and only for the chair class; the reason is the imbalanced data in the Pascal VOC 2007 dataset. The proposed ER method with a memory size of 1000 has slightly higher results than the 5000 memory size for the l [1] classes (11-20). However, the proposed ER method with a memory size of 5000 and augmentation improves the mAP for all classes up to 65.5%. In the 10 + 10 class scenario, the proposed ER method increases the mAP by up to 54.3%, 30%, and 14.6% compared with GEM [15], KD [1], and DMC [17], respectively. To better investigate the continual learning behaviour of the proposed ER method, an experiment is performed on the 19 + 1 scheme with the person class as the incremental class instead of TV. Table 5 presents experiments performed with normal training on classes (1-19), normal training on class (20), GEM, and the proposed ER method. As shown in Table 5, the mAP of normal training is 68.8%, while the proposed ER method achieves 67.1%. In comparison with GEM, the proposed ER method increases the mAP for all classes by 15.2%. The results presented in Table 5 indicate that the proposed ER method performs better at preserving the features of the l [0] task.

Visualization and Effect of Different Memory Size
To provide insight into the performance of the proposed ER method compared with prior methods, several prediction results from the Pascal VOC 2007 dataset are visualized. The images are obtained from the test set and contain both l [0] and l [1] classes. The performance of each method is visualized in Figure 7. Figure 7a-d presents the object detection performance of KD [1], DMC [17], GEM [15], and the proposed ER method, respectively. KD [1] in Figure 7a and GEM [15] in Figure 7c are shown to produce many false negatives. Generally, KD [1] performs comparatively better on the l [1] task than on the l [0] task. Note that the person class belongs to the l [1] task, whereas the car class belongs to the l [0] task. The same holds for GEM [15], since it fails to recognise the bicycle class, which belongs to the l [0] task. Meanwhile, the performance of DMC [17], shown in Figure 7b, is considerably better than that of KD [1] and GEM [15]. However, DMC [17] produces more false positives than the other methods, visualized as black bounding boxes, such as the train and dog in the first row of Figure 7b. These false positives occur because the mean of DMC's confidence score is higher than that of the other methods. In comparison with all these prior methods, the proposed ER method, which utilizes a memory size of 2500 as shown in Figure 7d, correctly localizes and classifies all objects. However, as shown in the first row of Figure 7d, distant objects are not detected by the proposed ER method; these false negatives on distant objects are a limitation of this work.
In Figure 8, the four images presented in Figure 7 are used for prediction with ER at various memory sizes to provide a better understanding of the effect of memory size. To evaluate this effect, experiments are performed with memory sizes of 500, 1000, 2500, and 5000 in the 10 + 10 class scenario. As shown in Table 6, the difference in mAP between the memory sizes 5000 and 2500 is 0.1%; however, there is a 4.3% accuracy drop for the 1000 memory size. The proposed ER method has comparable results across all memory sizes. As the memory size increases, the capability of the model to predict many objects also increases, but the results for memory sizes 2500 and 5000 are relatively similar, with only a 0.1% difference in average mAP. A smaller memory size requires less training time and hardware memory but reduces the performance on the l [0] classes. ER* denotes the method without augmentation before uploading to memory (2500), ER** the method with augmentation after uploading to memory (2500), and ER*** the method with augmentation after uploading to memory (1000). Figure 7. Object detection performance of KD [1], DMC [17], GEM [15], and the proposed ER method on the test dataset, shown in (a-d), respectively.
Figure 8. Prediction of the proposed ER method with different memory sizes: (a) memory size of 500 frames, (b) memory size of 1000 frames, (c) memory size of 2500 frames, and (d) memory size of 5000 frames.

Performance Evaluation on the ITRI-DriveNet-60 Dataset
The proposed ER model is also trained on the private ITRI-DriveNet-60 dataset, which was collected on the highways of Taiwan. It has similar characteristics to well-known autonomous driving object detection datasets such as KITTI [9] and Cityscapes [10]. The numbers of images and objects in the train and test sets are shown in Table 7. Four classes (four-wheel vehicle, rider, two-wheel vehicle, and person) are introduced in this dataset. The task distribution strategy for continual learning on the ITRI-DriveNet-60 dataset is presented in Table 8. The model is trained on a 1 + 3 class scheme: the four-wheel vehicle class is considered in l [0] training, while the other three classes, represented by dashes in the l [0] column of Table 8, are considered in l [1] training. While training the l [1] classes, the l [0] class labels are ignored and not utilized for training. The AP values for normal training, l [0] training, and l [1] training are given in Table 8. Notably, the single-class accuracy is higher because it is easier for the model to classify a single class. In the l [1] training, the four-wheel vehicle accuracy drops by 4.5% but remains the highest among the classes, reaching 85.2%. The mAP of the proposed method is almost identical to that of normal training: the proposed method obtains 77.1% mAP for all four classes, a drop of only 0.1% compared with normal training on the YOLOv3 object detection framework. This indicates the effectiveness of the proposed ER method for continual learning; conclusively, the proposed ER method achieves comparably the same accuracy as normal training. The proposed ER method has excellent detection results when using the larger memory sizes (2500 and 5000 frames), as shown in Figure 9c,d, respectively.
On the other hand, the smaller memory sizes (500 and 1000 frames) tend to detect one of the tasks better than the other. As illustrated in Figure 9a, the model with a memory size of 500 frames detects classes belonging to the l [1] dataset better than classes belonging to the l [0] dataset, whereas the opposite is true for the model with a memory size of 1000 frames. This occurs because a memory size of 1000 frames preserves more features from the l [0] dataset than a memory size of 500 frames. Using a larger memory results in better performance on both the l [0] and l [1] datasets, since more images in memory can be shuffled, yielding more variance during training. It is notable from Figure 9d that the memory size of 5000 frames has the best results. However, similar to the prediction results in Figure 8, the detection results for memory sizes of 2500 and 5000 frames are comparable. This indicates that even ER with a smaller memory size can achieve acceptable performance regardless of the dataset, as shown in Tables 6 and 8, which cover two different datasets.
Continual learning strategies tend to have a higher training time than normal training. In normal training, after the model is run in inference to obtain predictions, backpropagation is performed to update the weights based on the loss between predictions and ground truth. However, prior works typically run the l [0] (previous) model alongside the l [1] model, using the l [0] model's predictions to learn previous features (KD [1] and DMC [17]) or comparing the gradients of both models (GEM [15]). These approaches are rather cumbersome and take longer than normal training. Therefore, the proposed ER method takes a straightforward approach that reduces the training time. In the proposed ER method, the noticeable addition to each iteration's training time is caused by the integration and augmentation of the memory in the l [1] training phase, which averages 200 ms per iteration, as described in Table 9. Meanwhile, the dynamic omission added to the algorithm is very simple, so its time consumption is negligible. Table 9 also presents the training time of each process in the prior works; the training time for a single iteration is calculated as the sum of these processes. The average training time of the proposed ER method is the lowest among the continual strategies. Table 10 lists the advantages and disadvantages of the proposed ER method in comparison with the previous methods. The proposed ER method achieves better detection results than the other continual learning methods while having the lowest time complexity. Figure 9. Object detection results on the ITRI-DriveNet-60 private dataset with various memory sizes: (a) memory size of 500 frames, (b) memory size of 1000 frames, (c) memory size of 2500 frames, and (d) memory size of 5000 frames. Table 9. Training time for a single iteration of the proposed ER method and prior works.

Method     Advantages                                    Disadvantages
KD [1]     No memory required and faster training time   Auxiliary data required
DMC [17]   No memory required                            Three models required for training and auxiliary data required
GEM [15]   No memory and no auxiliary data required      Higher training time and lower performance
ER         No auxiliary data and faster training time    Memory required

Conclusions
In this research, a novel method to address the problem of catastrophic forgetting is proposed. Using the YOLO architecture as the benchmark framework, ER can preserve the features of the l [0] task while training on the l [1] task. A guide to implementing the data augmentation technique is added to the proposed ER method to improve learning in the current task with varying memory sizes. Experimental results on Pascal VOC show that the proposed ER method provides acceptable results for continual learning. Specifically, the proposed ER method achieves mAP of 65.5% and 68.9% in the 10 + 10 and 19 + 1 class scenarios, respectively, higher than the state-of-the-art methods. Nevertheless, continual learning for object detection still requires further improvement, and a further evaluation with thoroughly extensive experiments on improving the continual learning process is planned for future study. Although the proposed ER method alleviates the problem of catastrophic forgetting, further work is needed to reduce the memory size to a minimum while maintaining the accuracy of the classes.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: