Transfer Incremental Learning using Data Augmentation

Deep learning-based methods have reached state-of-the-art performance, relying on large quantities of available data and computational power. Such methods remain poorly suited to a major open machine learning problem: learning new classes and examples incrementally over time. Combining the outstanding performance of Deep Neural Networks (DNNs) with the flexibility of incremental learning techniques is a promising avenue of research. In this contribution, we introduce Transfer Incremental Learning using Data Augmentation (TILDA). TILDA is based on pre-trained DNNs as feature extractors, robust selection of feature vectors in subspaces using a nearest-class-mean based technique, majority votes, and data augmentation at both the training and the prediction stages. Experiments on challenging vision datasets demonstrate the ability of the proposed method to perform low-complexity incremental learning while achieving significantly better accuracy than existing incremental counterparts.


Introduction
Humans have the ability to incrementally learn new pieces of information through time, building on previously acquired knowledge. This process is most of the time nondestructive, and results in what is often referred to as "curriculum learning" in the literature (Bengio et al. 2009). On the contrary, it has been known for decades that neural network learning procedures, despite originally being proposed as a simplified model of brain mechanisms, suffer from "catastrophic forgetting" (Kasabov 2013; French 1999): previously learned knowledge is destroyed when new knowledge is learned.
In recent years, deep learning has become the gold standard in many supervised learning challenges, especially in the field of computer vision (Iandola et al. 2016; Simonyan and Zisserman 2014; Szegedy et al. 2015). Deep learning relies on a large number of trainable parameters that are carefully adjusted using stochastic gradient descent based algorithms. Learning novel data using the same set of parameters inevitably leads to the loss of the previously acquired knowledge. This is why many techniques have proposed to learn distinct deep learning systems over the course of time, letting another algorithm decide which one to use at the prediction stage (Girshick et al. 2014; Pan and Yang 2010). Such methods can quickly result in very complex systems that are likely to fail in adversarial conditions.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Formally, an incremental learning approach would satisfy the following criteria (Rebuffi et al. 2017):
1. An ability to learn data using one (or a few) example(s) at a time, in any order, without needing to reconsider or store previous ones.
2. An ability to sustain a classification accuracy comparable to state-of-the-art methods while traversing successive incremental learning stages, thus avoiding catastrophic forgetting.
3. Low computation and memory footprints, during both the training and classification phases, that should remain sublinear in both the number of examples and their dimension.
Satisfying these three criteria while keeping the accuracy of the proposed systems competitive has remained a key open challenge.
A promising avenue of research lies in "transfer learning" methods (Girshick et al. 2014), which make use of very efficient pre-trained deep neural networks previously obtained using huge datasets of signals related to the tasks at hand. As a result, very high quality feature vectors can be used to feed the incremental learning techniques, which can then achieve reasonable performance despite using simplistic mechanisms (Pan and Yang 2010).
In this paper, we introduce Transfer Incremental Learning using Data Augmentation (TILDA), an incremental learning method that provides a) a robust selection of feature vectors in subspaces, and b) prediction procedures making use of data augmentation. We stress the method using challenging vision datasets, namely CIFAR10, CIFAR100 and ImageNet ILSVRC 2012. As a result, the proposed method allows us to:
• Perform incremental learning following the above-mentioned definition,
• Approach state-of-the-art performance on vision datasets,
• Reduce memory usage and computation time by several orders of magnitude compared to other incremental approaches.

Related Work
There has been interest in incremental learning for a long time (Schlimmer and Fisher 1986; Thrun 1996; Zhou and Chen 2002). For example, methods have been proposed (Syed et al. 1999; Poggio and Cauwenberghs 2001; Zheng et al. 2013) to address this problem with the aim of bounding the memory footprint (c.f. criterion 3). These approaches perform learning one subset at a time using Support Vector Machines (SVMs). A related method is Learn++ (Muhlbaier, Topalis, and Polikar 2009). This algorithm adds weak one-vs-all classifiers to accommodate new classes. Therefore, it may result in excessive computational complexity and memory usage, violating criterion 3. It also needs training data for all classes to occur repeatedly, which contradicts criterion 1.
Research has also shown the possibility of learning data sequentially (Pentina, Sharmanska, and Lampert 2015). However, this requires choosing a correct ordering of the whole dataset, which does not fulfil criterion 1. In (Mensink et al. 2013), the authors proposed to use a pre-trained and unchanged DNN as feature extractor followed by the Nearest Class Mean classifier (NCM). NCM summarises each class using the average feature vector of all examples observed for the class so far. Classification proceeds by assigning the class of the most similar average vector, using a metric that can be learned from data. Compared to other parametric classifiers (Mensink et al. 2012; Mensink et al. 2013; Ristin et al. 2014), NCM showed better performance in incremental learning scenarios. However, NCM gives a lower accuracy than state-of-the-art methods even when it uses the whole dataset, and hence does not fulfil criterion 2.
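The NCM rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `NearestClassMean` and its methods are hypothetical, and plain Euclidean distance stands in for the learned metric mentioned in the text.

```python
import math


class NearestClassMean:
    """Running-mean NCM classifier (Euclidean distance is an assumption)."""

    def __init__(self):
        self.means = {}   # class label -> mean feature vector seen so far
        self.counts = {}  # class label -> number of examples seen so far

    def learn(self, x, label):
        """Fold one feature vector into the running mean of its class."""
        if label not in self.means:
            self.means[label] = list(x)
            self.counts[label] = 1
            return
        n = self.counts[label]
        old = self.means[label]
        # Incremental mean: no previous example needs to be stored.
        self.means[label] = [(n * m + xi) / (n + 1) for m, xi in zip(old, x)]
        self.counts[label] = n + 1

    def predict(self, x):
        """Return the label whose class mean is closest to x."""
        return min(self.means, key=lambda c: math.dist(self.means[c], x))
```

Because only one mean and one counter are kept per class, memory grows with the number of classes and not with the number of examples.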
In (Hacene et al. 2017), a quite different incremental method called Budget Restricted Incremental Learning (BRIL) was proposed. BRIL combines "transfer learning" (Girshick et al. 2014; Pan and Yang 2010) with binary associative memories. A pre-trained DNN is used as feature extractor, as in (Mensink et al. 2013), while binary associative memories act as a classifier. A product random sampling is performed as an intermediate step between the pre-trained DNN and the classifier. Despite being compliant with criteria 1 and 3, its accuracy remains significantly lower than that of existing counterparts, which violates criterion 2. Kuzborskij et al. (Kuzborskij, Orabona, and Caputo 2013) showed that new classes can be added to a multi-class classifier with limited impact on accuracy when the classifiers can be retrained from at least a small amount of data belonging to all classes. Using this, in (Rebuffi et al. 2017) the authors proposed an incremental learning method called "Incremental Classifier and Representation Learning" (iCaRL), based on a trainable DNN feature extractor followed by a single classification layer. The classification process is inspired by NCM: it computes the mean of feature vectors for each class, and assigns the label of the nearest prototype. However, memory usage can easily increase, especially when the dataset is made of high resolution images such as ImageNet, which may violate criterion 3. Moreover, the iCaRL method, when trained on data streams containing only a few classes at a time, provides low accuracy as shown in (Rebuffi et al. 2017), hence iCaRL does not respect criterion 2. To reach good performance and an accuracy comparable to state-of-the-art methods, iCaRL thus needs to be trained over batches of data containing a large part of the dataset, which does not correspond to an incremental learning scenario and infringes criterion 1.
In this paper, we introduce TILDA, which builds upon previously proposed work and attempts to cover all three criteria for efficient incremental learning. As in iCaRL and BRIL, TILDA uses a pre-trained DNN as feature extractor. TILDA also uses an NCM-inspired classifier over the feature vectors obtained from the pre-trained DNN. Data augmentation is performed on both training and classification datasets, aiming to improve accuracy. Consequently, there is no need to retrain the system with previous data, nor to perform computationally intensive processing when new data comes in. In addition, learning new data does not damage previously learned information.

Proposed Method
In this section, we describe the TILDA method. We start by giving a high level overview of the process, and then we explain the details.

Overview of the Proposed Method
TILDA is built upon four main steps: 1) a pre-trained DNN to perform feature extraction, 2) a technique to project features into low dimensional subspaces, 3) an assembly of NCM-inspired Classifiers (NCMC) applied independently in each subspace (c.f. Figure 1) and 4) a data augmentation inspired scheme to increase the accuracy of the classification process. We develop these steps in the following paragraphs.
The first step consists of using the internal layers of a pre-trained DNN (Krizhevsky, Sutskever, and Hinton 2012) as a generic feature extractor on which subsequent learning is performed. This process has become increasingly popular in the past few years and is often referred to as "Transfer Learning" (Oquab et al. 2014). The aim is to transfer knowledge acquired on one dataset to another related problem (Pan and Yang 2010).
In the following step, we project feature vectors into multiple low dimensional subspaces. More precisely, we split feature vectors into P subvectors. For each class and each subspace, we produce k anchor vectors conveying robust statistical properties about the corresponding feature subvectors.
Then, in each subspace, anchor vectors are exploited to perform weak classification of the input data. We use here an NCM-inspired method. A majority vote is then performed to obtain an aggregate decision.
Finally, we perform data augmentation on the input signals, be they training or testing inputs, thus obtaining multiple decisions for each input as well as more robust classifiers in each subspace. A second majority vote is performed using these decisions to generate a global prediction.

Details of the Proposed Method
Pre-Trained Deep Neural Networks To obtain features from an input signal, TILDA relies on DNNs that are pre-trained on a large number of examples. The pre-trained inner layers of the DNN then act as a generic feature extractor (Oquab et al. 2014; Hong et al. 2015; Pan and Yang 2010). As a matter of fact, inner layers of a DNN offer a good generic description of an input image, even when it does not belong to the learning domain (Oquab et al. 2014).
Following "Transfer Learning" ideas, we are not interested in this work in the details of the network's architecture, as we simply use the appropriate layers to extract features from a given input.
In the remainder of this paper, we denote by $s^m$ the $m$-th input training signal and by $x^m$ its corresponding feature vector, where $1 \le m \le M$ and $M$ is the total number of training signals.
Projection to Low Dimensional Subspaces Feature extraction allows us to consider the feature vector $x^m$ instead of the input signal $s^m$. Formally, let us denote by $x^m_c$ the fact that feature vector $x^m$ belongs to class $c$. We split each $x^m_c$ into $P$ parts, denoted $(x^m_{c,p})_{1 \le p \le P}$. For each class and each subspace, we create $k$ anchor vectors initialised with 0s, each of them associated with a counter, also initialised to 0. Considering the $p$-th subspace and the $c$-th class, we denote by $Y_{c,p} = [y_{c,p,1}, \dots, y_{c,p,k}]$ the corresponding anchor vectors and by $N_{c,p} = [n_{c,p,1}, \dots, n_{c,p,k}]$ their associated counters.
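As an illustration of the splitting and initialisation just described, the following Python sketch builds the per-class structures. The function names `split_features` and `init_anchors` are hypothetical, and the sketch assumes that the feature dimension is divisible by P so that all parts have equal length.

```python
def split_features(x, P):
    """Split feature vector x into P equal-length subvectors."""
    if len(x) % P != 0:
        # Assumption of this sketch: equal parts require divisibility.
        raise ValueError("feature dimension must be divisible by P")
    size = len(x) // P
    return [x[p * size:(p + 1) * size] for p in range(P)]


def init_anchors(d, P, k):
    """For one class, create k zero anchors and k zero counters per subspace.

    Returns (anchors, counters) where anchors[p][i] is the i-th anchor of
    subspace p (length d // P) and counters[p][i] is its counter.
    """
    size = d // P
    anchors = [[[0.0] * size for _ in range(k)] for _ in range(P)]
    counters = [[0] * k for _ in range(P)]
    return anchors, counters
```

For example, `split_features([1, 2, 3, 4, 5, 6], 3)` yields three subvectors of length 2, and `init_anchors(6, 3, 2)` allocates two anchors of length 2 in each of the three subspaces.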
For each $c$ and $p$, we aim to use the corresponding anchor vectors as centroids of a clustering of $\{x^m_{c,p}\}$. To this end, at each step of the training process, we ensure that each anchor vector is a centroid of a cluster of already processed input subvectors, and that the associated counter accounts for the cardinality of the corresponding cluster.
Then, each time an input training vector is processed, we identify an anchor vector to be updated. The update simply consists of computing a new anchor vector as the barycenter of the old one, with weight given by its counter, and the input subvector, with weight 1, and then incrementing the counter. This procedure is detailed in Algorithm 1. Simply associating the new subvector with the closest anchor vector would inevitably lead to unbalanced counters and thus poor prediction performance, so we instead take counters into account when performing this association. More precisely, we linearly penalise anchor vectors that are already the combination of many subvectors. Note that when two or more anchor vectors give the same result (distance multiplied by counter), we choose one of them uniformly at random.
Algorithm 1: Incremental Learning of Anchor Subvectors
Input: streaming feature vector $x^m_c$
for p := 1 to P do
  for i := 1 to k do

Note that the learning process is independent of the order of the streaming data, and is performed one example at a time, thus enforcing criterion 1 of incremental learning methods described in the introduction.
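The anchor update described in the text (select the anchor minimising distance multiplied by counter, break ties at random, then move it to the weighted barycenter) might be sketched as follows for a single class and subspace. This is a sketch of the prose description, not the authors' exact listing; the function name `update_anchors` and the use of Euclidean distance at training time are assumptions.

```python
import math
import random


def update_anchors(anchors, counters, x_sub, rng=random):
    """Update the k anchors of one (class, subspace) pair with subvector x_sub.

    anchors:  list of k vectors; counters: list of k ints (modified in place).
    Returns the index of the anchor that was updated.
    """
    # Score = distance * counter: anchors built from many subvectors are
    # linearly penalised. Unused anchors (counter 0) score 0.
    scores = [n * math.dist(y, x_sub) for y, n in zip(anchors, counters)]
    best = min(scores)
    # Ties are broken uniformly at random.
    candidates = [i for i, s in enumerate(scores) if s == best]
    i = rng.choice(candidates)
    n = counters[i]
    # Barycenter of the old anchor (weight n) and the new subvector (weight 1).
    anchors[i] = [(n * yi + xi) / (n + 1) for yi, xi in zip(anchors[i], x_sub)]
    counters[i] = n + 1
    return i
```

With this scoring, every unused anchor is filled before any anchor receives a second subvector, which keeps the counters balanced early in training.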
Aggregation of subspaces weak classifiers At prediction stage, consider an input signal $s$ and the associated feature vector $x$. We split $x$ into the corresponding $P$ parts and obtain $(x_p)_{1 \le p \le P}$. We compute Euclidean distances between each $x_p$ and all anchor subvectors $y_{c,p,i}$ for which the counter is not 0. Note that there are at most $kC$ such distances, where $C$ is the number of classes seen so far. The class of the closest anchor subvector is considered as the decision for the $p$-th subspace. Finally, we apply a majority vote over all subspaces to achieve an aggregate decision (c.f. Algorithm 2). Note that more elaborate strategies can result in higher accuracy but may require more computation during the learning phase as well as memorisation of previously seen examples.

Data Augmentation during Training
To improve the accuracy without increasing memory usage, data augmentation is applied to the training dataset. We generate multiple versions of each training input signal, and we use the resulting dataset as input to train the model.

Data Augmentation during Classification
In addition, we propose to obtain multiple predictions for each input signal $s$ using data augmentation (Ciresan et al.). The idea is to generate multiple versions of the input signal $s$, which we denote $(s_r)_{1 \le r \le R}$. We perform a prediction of the class associated with each $s_r$ independently, and then perform a majority vote to obtain the final prediction.
From these facts we derive that TILDA is compliant with criteria 1 and 3 defined in the introduction. In the next section, we devise a set of experiments to evaluate the classification accuracy of the proposed method on challenging datasets (criterion 2).
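The prediction-time vote over augmented versions of a signal can be sketched as below. The callables `augment` and `classify` are hypothetical placeholders for the augmentation procedure and the per-signal classifier; the tie-breaking rule (first label encountered wins) is an assumption, since the text does not specify one.

```python
from collections import Counter


def predict_with_augmentation(signal, augment, classify):
    """Classify every augmented version of `signal`, return the majority class.

    augment:  hypothetical callable returning the list (s_r), r = 1..R
    classify: hypothetical callable mapping one signal to a class label
    """
    decisions = [classify(s_r) for s_r in augment(signal)]
    # most_common(1) returns the most frequent label; on ties, the label
    # encountered first is returned (assumption of this sketch).
    return Counter(decisions).most_common(1)[0][0]
```

Each of the R predictions is computed independently, so the cost of this step is R times the cost of a single classification.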

Experiments
In this section we describe the protocol used to test the proposed method and compare its accuracy and memory usage with other incremental learning methods.

Benchmark Protocol
We propose an incremental learning scenario in which we have streaming data containing new classes/examples. We test and compare Budget Restricted Incremental Learning (BRIL), Nearest Neighbour search (NN), Nearest Class Mean classifier (NCM), Learn++, incremental Classifier and Representation Learning (iCaRL), and finally the proposed method (TILDA). Learn++ uses Classification And Regression Trees (CART) as weak classifiers.
We evaluate the different methods using CIFAR10, CIFAR100 and ImageNet ILSVRC 2012 (Russakovsky et al. 2015). We also use 50 ImageNet classes which have not been used to train the CNN (denoted ImageNet50), and which contain 900/100 training/test images per class. All methods take the same feature vectors extracted from Inception V3 (Szegedy et al. 2015) as input. This requires modifying the iCaRL method by replacing its CNN with a fully connected network. In the following, for the iCaRL method, we use a MultiLayer Perceptron (MLP) with one hidden layer containing 1024 neurons and an output layer containing C neurons, where C is the number of classes.
The non-incremental learning methods (NI) used are denoted by TMLP and TSVM. TMLP uses transfer learning to compute feature vectors of input data through Inception V3, and then trains an MLP over these feature vectors, using the hyperparameters previously described for iCaRL. The TSVM method also uses Inception V3 to obtain feature vectors, and uses them to train an SVM with a Radial Basis Function kernel.
Data augmentation used in TILDA generates a horizontal flip of the original image, and shifts the pixels of the image by one pixel at a time (to the left, right, top, bottom, and along the four diagonals). Thus we generate R = 10 images (8 generated by shifting pixels of the image, one generated by horizontal flip, and the original one).
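A minimal sketch of this augmentation scheme is given below for a single-channel image stored as a list of rows. The function names are hypothetical, and filling the vacated border with a constant (0 here) is an assumption, as the text does not say how border pixels are handled after a shift.

```python
def shift(image, dy, dx, fill=0):
    """Shift a 2-D image by dy rows and dx columns, padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dy, c + dx
            if 0 <= rr < h and 0 <= cc < w:
                out[rr][cc] = image[r][c]
    return out


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return R = 10 versions: original, horizontal flip, 8 one-pixel shifts."""
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]
    shifted = [shift(image, dy, dx) for dy, dx in offsets]
    return [image, horizontal_flip(image)] + shifted
```

The eight offsets cover the four axis-aligned directions and the four diagonals, which together with the flip and the original image gives the ten versions.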

Results
As a preliminary experiment, we aim to show that replacing the last layers of Inception V3 by the proposed method does not compromise the performance obtained on ImageNet ILSVRC 2012. The top-5 accuracy is 94.4% when we use TILDA with P = 16 and k = 30, and 96.5% when we use the last layers of Inception V3 to classify data. The accuracy obtained by TILDA approaches the one obtained by Inception V3, thus our method does not bring a considerable decrease in performance.
The second experiment is performed on CIFAR10/100, ImageNet50 and ImageNet ILSVRC 2012, and shows the contribution of data augmentation, NCM-inspired classification, and subspace division to classification accuracy. To this end, we define three variants: TILDA-DA does not use data augmentation and classifies only the original image, TILDA-NCM disregards NCM-inspired classification and uses k feature vectors randomly chosen per class, and TILDA-P is the TILDA method with no splitting of vectors. Table 1 summarises the accuracy of TILDA, TILDA-DA, TILDA-NCM and TILDA-P when performing one-shot learning (learning one example at a time). We notice that TILDA-DA, TILDA-NCM and TILDA-P reach lower accuracy than TILDA, which confirms that the combination of data augmentation with NCM-inspired classification and subspace division can achieve good performance.
In the third experiment, we study the effect of both quantization parameters P and k on the accuracy of TILDA (c.f. Figure 2). This experiment demonstrates that TILDA reaches its best performance for P = 16. In the following, we perform experiments using TILDA with P = 16 and k = 30.
Note that in order to be fair in comparison with other techniques, we do not perform data-augmentation during training or prediction in TILDA in the upcoming experiments.
The fourth experiment stresses the effect of class-incremental learning. We adopt a class-incremental scenario (CI), in which methods are trained over streaming data providing all examples from one class simultaneously, one class at a time. We test and compare TILDA-DA, NCM, Learn++ and iCaRL on CIFAR10/100 and ImageNet50 (c.f. Figure 3). Learn++ adds one weak classifier each time a novel class is introduced, and iCaRL stores 30 feature vectors per class. We can see that TILDA-DA outperforms the other methods in this setting.
The next experiment illustrates the behaviour of the accuracy when trying to obtain incremental information from new examples of the same class. We adopt an example-incremental scenario (EI), in which we train the method over streaming data providing new examples without introducing new classes. We test and compare TILDA-DA, NCM, NN, Learn++ and BRIL on CIFAR10/100 and ImageNet50. We divide these datasets into 10 equal size parts, each part containing 5000 examples (500/50 examples per class) for CIFAR10/100 and 4500 examples (90 per class) for ImageNet50, and learn one part at a time. Learn++ adds one weak classifier each time a new part is learned. Figure 4 shows that all methods handle example-incremental learning and improve their accuracy each time they learn new information provided by new examples. TILDA-DA consistently obtains higher accuracy than Learn++, NCM, NN and BRIL regardless of the quantity of provided data. Note that Learn++ needs a large number of examples to perform well, and obtains a low accuracy when only a few examples are provided.
Table 2 presents the different incremental learning methods, the accuracies obtained and the memory footprints. Learn++ uses either the class-incremental scenario (CI) or the example-incremental scenario (EI). iCaRL performs its learning process using CI. TILDA, NN, NCM, and BRIL use one-shot learning to process one example at a time, providing a novel class or additional information, thus they handle both class-incremental and example-incremental learning at the same time. TILDA outperforms all other incremental learning methods on both accuracy and memory usage.
The last evaluation we perform aims to compare TILDA with non-incremental learning methods such as TMLP and TSVM. To do this, we store the whole dataset and train these methods on it. The parameters used for TILDA are P = 16 and k = 30 for CIFAR10/100 and ImageNet50. Table 3 shows that TILDA reaches an accuracy comparable to state-of-the-art methods.
As shown by the different evaluations, the TILDA method can at any instant classify data with good accuracy (c.f. Figure 3 and Figure 4), outperforms other incremental learning methods (c.f. Table 2), and approaches state-of-the-art accuracy (c.f. Table 3). Consequently, TILDA fulfils criterion 2.

Conclusion
In this paper, we have introduced TILDA, a new incremental learning approach inspired by recently proposed methods. TILDA relies on a pre-trained DNN to process data, a projection technique that defines low-dimensional subspaces, NCM-inspired classifiers, and data augmentation at both the training and prediction phases. This addresses the limitations of previous methods: a) iCaRL, as TILDA reaches good accuracy when the data stream contains one class at a time, b) BRIL, as TILDA provides an accuracy comparable to state-of-the-art methods, c) NCM, as TILDA uses k anchor vectors instead of one, together with other mechanisms, to increase the accuracy, and d) Learn++, as TILDA still performs well even if the data stream does not contain examples of all classes each time. Experiments on challenging datasets show that: a) TILDA does not suffer from catastrophic forgetting, in that we get the same accuracy and model representation in both incremental learning and offline learning, b) TILDA approaches state-of-the-art accuracy, c) TILDA uses much less memory than nearest neighbour search while reaching the same accuracy or better, d) TILDA still gives good accuracy even when only one class is provided at a time, and e) to our knowledge, TILDA is the incremental method that reaches the best accuracy. This method is also promising for embedded devices, since it is not necessary to train a DNN or to perform extensive computations for learning.
In future work, we plan to explore further the methods for splitting feature vectors, data augmentation strategies and a weighted majority vote to improve the accuracy.We also plan to propose a hardware architecture of TILDA for incremental learning on chip.

Figure 1: Overview of the proposed method. Given an input signal $s$, we first use data augmentation to generate multiple versions of the input signal $(s_r)_{1 \le r \le R}$. Then we use a pre-trained DNN for feature extraction and obtain the corresponding feature vectors $(x_r)_{1 \le r \le R}$. Subsequently, we split each feature vector $x_r$ into $P$ equal parts $(x_{r,p})_{1 \le p \le P}$, and classify each part $x_{r,p}$ using an NCM-inspired classifier (NCMC) containing anchor vectors $(Y_{c,p})_{1 \le c \le C}$. We obtain a class $c_{r,p}$ for each part and perform a majority vote to get the class of $x_r$. Finally, a second majority vote over the classes $(c_r)_{1 \le r \le R}$ of all generated signals yields the class $\hat{c}$ assigned to the original input signal $s$.

Algorithm 2: Predicting the Class of a Test Input Signal
Input: input signal $s$
Compute the feature vector $x$ associated with $s$
Initialize the vote vector $v$ as the 0 vector with dimension $C$
for p := 1 to P do
  $v_p = \arg\min_c \min_i \| x_p - y_{c,p,i} \|_2$
  $v_{v_p} = v_{v_p} + 1$
end for
$\hat{c} = \arg\max_c (v_c)$
Output: class $\hat{c}$ attributed to $s$

Data Augmentation We use two data augmentation methods to improve the accuracy and robustness during training and classification.
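Algorithm 2 can be sketched in Python as follows, taking the feature vector directly as input. The data layout (`anchors[c][p]` and `counters[c][p]` as lists of k entries) and the function name `predict` are assumptions made for illustration; ties in the final vote are resolved in favour of the first class encountered, which the algorithm does not specify.

```python
import math


def predict(x, anchors, counters, P):
    """Return the majority-vote class for feature vector x.

    anchors[c][p]:  list of k anchor subvectors for class c, subspace p
    counters[c][p]: list of k counters; anchors with counter 0 are ignored
    """
    size = len(x) // P
    votes = {c: 0 for c in anchors}
    for p in range(P):
        x_p = x[p * size:(p + 1) * size]
        best_class, best_dist = None, math.inf
        # Nearest used anchor over all classes in subspace p.
        for c in anchors:
            for y, n in zip(anchors[c][p], counters[c][p]):
                if n == 0:
                    continue
                dist = math.dist(x_p, y)
                if dist < best_dist:
                    best_class, best_dist = c, dist
        if best_class is not None:
            votes[best_class] += 1
    # Majority vote over the P subspace decisions.
    return max(votes, key=votes.get)
```

Each subspace contributes exactly one vote, so a class wins when it is the nearest match in the largest number of subspaces.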
We point out multiple facts about the proposed method: a) The learning procedure processes one example at a time, b) The learning procedure is computationally light, as it only requires on the order of d operations, where d is the dimension of feature vectors, c) The learning procedure has a small memory footprint, as it only stores averages of feature vectors, d) The learning procedure is such that adding new examples can only increase the robustness of the method, so that there is no catastrophic forgetting, e) During the prediction stage, memory usage is of the order of kCd and thus is independent of the number of examples and grows linearly with the number of classes, f) During the prediction stage, computations are of the order of kCdR elementary operations.
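To make the orders of magnitude in e) and f) concrete, here is a short worked example. The values k = 30 and R = 10 come from the experiments, C = 10 corresponds to CIFAR10, and d = 2048 is an assumption (the feature dimension is not stated in the text).

```latex
% Each class stores k anchors of dimension d/P in each of the P subspaces.
\begin{align*}
\text{memory} &\approx C \cdot P \cdot k \cdot \frac{d}{P} = kCd
  = 30 \times 10 \times 2048 = 614{,}400 \text{ stored values},\\
\text{prediction cost} &\approx kCdR
  = 614{,}400 \times 10 = 6{,}144{,}000 \text{ elementary operations}.
\end{align*}
```

Neither quantity depends on the number of training examples M, which is what makes the footprint constant as more examples of known classes are learned.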

Figure 2: Evolution of the accuracy as a function of P and k for CIFAR10 (left), CIFAR100 (middle) and ImageNet50 (right).

Figure 3: Evolution of the accuracy as a function of the number of classes for CIFAR10 (left), CIFAR100 (middle) and ImageNet50 (right).

Figure 4: Evolution of the accuracy as a function of the number of learning examples for CIFAR10 (left), CIFAR100 (middle) and ImageNet50 (right).

Table 2: Comparison of accuracy (Acc) and memory usage (M) relative to the full dataset (corresponding to 100%) for the different methods. Note that the memory usage of the Learn++ method represents the size of the weak classifiers, and for iCaRL it represents the stored feature vectors and the size of the trainable neural network.

Table 3: Comparison of TILDA with non-incremental learning methods in a non-incremental learning scenario.