Brick Assembly Networks: An Effective Network for Incremental Learning Problems

Deep neural networks have achieved high performance in image classification, image generation, voice recognition, natural language processing, etc.; however, they still confront several open challenges, such as the incremental learning problem, overfitting, hyperparameter optimization, and lack of flexibility and multitasking. In this paper, we focus on the incremental learning problem, which concerns machine learning methodologies that continuously train an existing model with additional knowledge. To the best of our knowledge, a simple and direct solution to this challenge is to retrain the entire neural network after adding the new labels to the output layer. Alternatively, transfer learning can be applied, but only if the domain of the new labels is related to the domain of the labels on which the neural network has already been trained. In this paper, we propose a novel network architecture, namely the Brick Assembly Network (BAN), which allows a trained network to assemble (or dismantle) a new label to (or from) a trained neural network without retraining the entire network. In BAN, we train each label individually with a sub-network (i.e., a simple neural network) and then assemble the converged single-label sub-networks into a full neural network. For each label to be trained in a sub-network of BAN, we introduce a new loss function that minimizes the loss of the network using data from only one class. Applying one loss function per class label is unique and differs from standard neural network architectures (e.g., AlexNet, ResNet, InceptionV3, etc.), which use the values of a loss function computed over multiple labels to minimize the error of the network.
The difference between the loss functions of previous approaches and the one we introduce is that we compute the loss values from the node values of the penultimate layer (which we name the characteristic layer) instead of the output layer, where loss values are computed between true labels and predicted labels. From experimental results on several benchmark datasets, we show that BAN has a strong capability of adding (and removing) a new label to a trained network compared with a standard neural network and other previous work.


Introduction
Deep neural networks [1] have played an important role in many areas of the artificial intelligence field, such as image classification and object detection [2][3][4][5], image generation [6][7][8][9], speech recognition [10][11][12], text generation [13,14], etc. Although deep neural networks produce remarkable results compared with other machine learning algorithms, some open challenges remain for researchers to investigate further. These challenges include the incremental learning problem, overfitting, hyperparameter optimization, and lack of flexibility and multitasking [15]. In this paper, we focus on the issue that a neural network lacks flexibility in terms of adding an extra label to its output layer after the network has converged. This is essentially one of the incremental learning problems, which are related to machine learning methodologies that continuously train an existing model with additional knowledge. The incremental learning problem is worth exploring because most neural network systems have a poor capability of adding new labels to their output layer after they have converged. To the best of our knowledge, there are two common solutions to this issue: 'retraining' and 'transfer learning' [16]. In the first solution, we add the new data to the training dataset and repeat the training procedure on a newly initialized neural network. However, this naïve solution has a drawback: it is very time consuming, because we have to retrain the entire neural network every time we add a new label. To apply transfer learning to the image classification problem, we retain the convolutional layers of the neural network and retrain only its fully connected layers.
Although this solution is more effective than the first one, it has the restriction that the new label must come from a domain similar to that of the labels on which the neural network has already been trained.
To address these problems (i.e., the time-consuming nature of the retraining method and the domain restriction of the transfer learning method), we propose a novel network architecture, namely the brick assembly network (BAN). Given a dataset, regardless of its domain, we train each label with its own neural network. From now on, for a clear and concise description of our proposed method, we limit our discussion to a widely used convolutional neural network (CNN) consisting of an input layer, a convolutional layer, and a fully connected layer to train a label. We denote this neural network as a sub-network in the remainder of the paper. After the sub-networks have converged, they are merged into one full neural network, which is called a BAN. In short, BAN provides the capability to train labels in separate sub-networks, to assemble the converged sub-networks into the BAN, and to dismantle sub-networks from the BAN at any time. We explain the capabilities of BAN in more detail in Section 3.2.
In this study, we summarize our research contributions as follows:
• BAN is the first network architecture that provides a trained neural network with the capability to assemble (add) and dismantle (remove) labels without retraining the entire network.
• We introduce a loss function that requires no ground-truth labels to train a network.
• We propose a way to train a network with data from only one label. In other words, we can train BAN with a single label at a time.
• BAN does not require the datasets of old labels during the training phase when we add or remove a label from the network.
To promote reproducible research, we release the implementation of our network architecture (our scripts are available at https://github.com/canboy123/ban).

Related Work
Roy et al. [17] proposed a hierarchical deep convolutional neural network (TreeCNN) for solving the incremental learning problem by growing a trained network structure when new labels are added to the network. Their experimental results show that TreeCNN requires less training effort than a standard neural network while maintaining competitive accuracy. However, when new labels are added to a trained TreeCNN, it still requires the old data to retrain the network. In addition, TreeCNN consumes more time to train than our BAN, as shown in Section 5.
Castro et al. [18] proposed an end-to-end incremental learning model composed of a feature extractor and a classification layer. Their experimental results show that the model can perform incremental learning by growing its classification layer when a new label is added. Unlike BAN, their model is trained on both new and old data during incremental learning, whereas we use only the new data to train the corresponding sub-network of BAN. Moreover, their model cannot remove a trained label, while we show that BAN can dismantle a label from a trained network.
Rosenblatt [19] introduced the perceptron algorithm, which is used for supervised learning of binary classifiers, and laid the groundwork for gradient-based weight-update procedures such as back-propagation. In these procedures, the weights are updated based on the gradient of the loss function with respect to the weights, where the loss function computes the difference between a true label and a label predicted by the trained network. In this paper, however, we propose a new loss function that updates the weights without using any label, as discussed in Equation (1) of Section 3.2.
Oza and Patel [20] proposed a novel one-class convolutional neural network consisting of a feature extractor and a classifier. The feature extractor is paired with pseudo-negative class data generated from a zero-centered Gaussian distribution and is used to embed an input image into a feature space. The classifier produces a confidence score (i.e., 1 or 0) for a given input image. Although the authors showed that their network outperforms other statistical and deep learning-based one-class classification methods, it is limited to the one-class case, i.e., deciding whether an input is abnormal (1) or normal (0). In BAN, on the other hand, even though the training method of a sub-network is similar to that of their model (i.e., training one label per network), BAN produces multiple outputs (discussed in Equation (3) of Section 3.2) instead of a binary output.
Generally, researchers have proposed network architectures (e.g., AlexNet [21], GoogLeNet [22], VGGNet [23], etc.) in which training covers everything from initial random weight assignment to full convergence. As a consequence, it is difficult for such a network to add a new label or remove a trained label. Unlike those architectures, our BAN allows the network to assemble a newly trained sub-network into the network, or dismantle a trained sub-network from it, without retraining the entire network.

Preliminaries
Let X be an input image for training neural networks. If the size of the image is w × h × c, then X = {x_1, x_2, . . . , x_{w×h×c}}, where each x_i is a pixel of the image and w, h, and c refer to the width, height, and number of channels, respectively. A classifier F(·) is a function that produces a predicted label Ŷ for X. Let Y be the true label of image X. To optimize a neural network, we minimize a loss function L(·) that computes the difference between the true label Y and the predicted label Ŷ for X. The derivative of the loss function is commonly used to update the parameters of the neural network, such as the weights W and biases B.
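For concreteness, the standard supervised setup above can be sketched in a few lines of NumPy. The linear-softmax classifier below is a hypothetical stand-in for F(·), not an architecture used in this paper; it only illustrates the notation X, Y, Ŷ, W, and B.

```python
import numpy as np

rng = np.random.default_rng(4)
w, h, c = 28, 28, 1  # image width, height, and number of channels

X = rng.uniform(0, 1, size=(w * h * c,))  # one flattened image, pixels x_1..x_{w*h*c}
Y = np.array([0.0, 1.0, 0.0])             # one-hot true label over 3 classes

# a hypothetical linear classifier F(X) = softmax(WX + B)
W = rng.uniform(-0.5, 0.5, size=(3, w * h * c))
B = np.zeros(3)
logits = W @ X + B
Y_hat = np.exp(logits) / np.exp(logits).sum()  # predicted label distribution

# a standard loss: difference between the true label Y and the prediction Y_hat
mse = np.mean((Y - Y_hat) ** 2)
```

Minimizing such a loss with gradient descent on W and B is the standard supervised procedure that BAN departs from.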

Brick Assembly Network
BAN is a novel network architecture with innovative "retrain-less" features for assembling and dismantling trained sub-networks. In other words, we can add trained sub-networks to a BAN, and also remove trained sub-networks from a BAN, without retraining the BAN. Note that a sub-network refers to a network trained by feeding data of only one label as its training dataset. To compare the performance of BAN with other cutting-edge algorithms, we optimize the latter by computing the derivatives of their loss functions over datasets with several labels. Training with one label's data in BAN is inspired by the observation that the classification step of each label activates different neural nodes on the penultimate layer (i.e., the fully connected layer before the output layer) of a standard neural network through an activation function. In other words, our observation is that images of a particular class label produce a unique pattern on the penultimate layer, so that the neural network can generate a distinguishable output. To avoid confusion when addressing the penultimate layer, we call it the characteristic layer in the remainder of this paper; it is composed of j nodes, C ∈ R^j, where j > 1, and C = (c_1, c_2, . . . , c_j), where c_i refers to the value of the i-th node of the characteristic layer. From the characteristic layer, we discover that each label can be trained separately by minimizing the loss function L_C^l(X^l) defined in Equation (1):

L_C^l(X^l) = (1/j) Σ_{i=1}^{j} (c_i^l − ĉ_i^l)²,   (1)

where L_C(·) is a loss function, C is a user-defined characteristic layer composed of j nodes, Ĉ is the predicted characteristic layer composed of j nodes, l is a label index, and X^l is an image with a specific label. Note that C is a vector consisting of j values initialized with random values satisfying −ε ≤ c_i ≤ ε, where ε is a user-specified threshold. We set ε = 5 in our experiments.
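The per-label training with the loss in Equation (1) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it replaces the convolutional sub-network with a single linear layer (an identity activation stands in for a(·)) and uses synthetic one-label data, but it shows the key idea of regressing one class's images onto a fixed, randomly drawn characteristic vector C without any ground-truth labels.

```python
import numpy as np

rng = np.random.default_rng(0)
J = 16          # number of nodes in the characteristic layer
EPS = 5.0       # user-specified threshold for initializing C
D_IN = 28 * 28  # flattened image size

def make_subnetwork():
    """One sub-network: a linear map plus a fixed random target vector C."""
    W = rng.uniform(-0.5, 0.5, size=(D_IN, J))
    b = np.zeros(J)
    C = rng.uniform(-EPS, EPS, size=J)  # user-defined characteristic layer
    return {"W": W, "b": b, "C": C}

def forward(net, X):
    """Predicted characteristic layer, Eq. (2), with identity activation."""
    return X @ net["W"] + net["b"]

def loss(net, X):
    """Eq. (1): MSE between the fixed C and the prediction -- no labels used."""
    return np.mean((forward(net, X) - net["C"]) ** 2)

def train_subnetwork(net, X, lr=1e-3, epochs=300):
    """Plain gradient descent on the label-less loss of Eq. (1)."""
    for _ in range(epochs):
        err = forward(net, X) - net["C"]      # shape (n, J)
        net["W"] -= lr * X.T @ err / len(X)
        net["b"] -= lr * err.mean(axis=0)
    return net

# synthetic stand-in for "all images of one class"
X0 = rng.uniform(0, 1, size=(64, D_IN))
net0 = make_subnetwork()
loss_before = loss(net0, X0)
train_subnetwork(net0, X0)
loss_after = loss(net0, X0)  # substantially smaller than loss_before
```

Nothing in the loop ever sees a class label: the "target" is the sub-network's own random vector C, which is the point of Equation (1).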
We define a simple predicted characteristic layer composed of j nodes as shown in Equation (2):

Ĉ = a(WX + B),   (2)

where a(·) is an activation function, W is a weight matrix, X is the image, and B is a bias vector. We illustrate the training and testing phases of BAN on MNIST (Modified National Institute of Standards and Technology) [24] examples in Figure 1. During the training phase, we train two sub-networks with labeled data (see the blue box in Figure 1), "0" images and "7" images, respectively, by computing the loss function defined in Equation (1). After the sub-networks have converged, we assemble them to form a BAN, as depicted by the red box in the testing phase. To test a new image, we calculate the distances between the user-defined values of the characteristic layers, C, and the predicted values of the characteristic layers, Ĉ, in BAN. Then, we classify the image into the label corresponding to the lowest distance, as defined in Equation (3):

Ŷ = argmin_l D(C^l, Ĉ^l),   (3)

where D(·) is the Euclidean distance [25], C^l is the vector of user-defined characteristic layer values for a specific label l, Ĉ^l is the vector of predicted characteristic layer values for label l, l is a label index, and X is an image. In summary, to train a network with data of only one label, we optimize the weights of the network by minimizing a loss function based on the difference between the values of the user-defined characteristic layer C and the values of the predicted characteristic layer Ĉ, as shown in Equation (1). This differs from common loss functions, which minimize the mean squared error (MSE) between true labels and predicted labels over data with multiple labels. Hence, we emphasize that our BAN uses a label-less loss function to train a network.
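Assembling the converged sub-networks and classifying by the smallest characteristic-layer distance, as in Equation (3), can be sketched as follows. Again this is a toy NumPy illustration under assumed simplifications (linear sub-networks with identity activation, synthetic two-class data) rather than the paper's convolutional BAN.

```python
import numpy as np

rng = np.random.default_rng(1)
J, D_IN, EPS = 16, 64, 5.0

def train_subnetwork(X, lr=1e-3, epochs=500):
    """Train one linear sub-network so its output matches a fixed random C."""
    W = rng.uniform(-0.5, 0.5, size=(D_IN, J))
    b = np.zeros(J)
    C = rng.uniform(-EPS, EPS, size=J)  # user-defined characteristic layer
    for _ in range(epochs):
        err = X @ W + b - C
        W -= lr * X.T @ err / len(X)
        b -= lr * err.mean(axis=0)
    return W, b, C

def ban_predict(ban, x):
    """Eq. (3): choose the label whose sub-network prediction is closest to its C."""
    dists = [np.linalg.norm(x @ W + b - C) for (W, b, C) in ban]
    return int(np.argmin(dists))

# two synthetic "labels": images clustered around different mean patterns
mean0, mean1 = rng.uniform(0, 1, D_IN), rng.uniform(0, 1, D_IN)
X0 = mean0 + 0.05 * rng.standard_normal((100, D_IN))
X1 = mean1 + 0.05 * rng.standard_normal((100, D_IN))

# train each label in its own sub-network, then assemble them into a BAN
ban = [train_subnetwork(X0), train_subnetwork(X1)]
pred0 = ban_predict(ban, mean0)  # image near label 0's pattern
pred1 = ban_predict(ban, mean1)  # image near label 1's pattern
```

Adding a third label here would simply mean appending another trained (W, b, C) triple to the `ban` list, which is the assemble/dismantle property the section describes.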
Although BAN needs more parameters (i.e., weights) than a standard neural network, it improves the network's capability of adding and classifying new class labels, whether from the same or different domains, without retraining the entire network.

Pseudo-Code of the Brick Assembly Network
We provide Algorithm 1 to explain BAN in pseudo-code. First, we initialize all weights of the convolutional layers and all nodes of the characteristic layers to random numbers in the ranges [−0.5, 0.5] and [−5, 5], respectively. For each sub-network, we calculate the node values of the predicted characteristic layer and its loss function during the feed-forward procedure (lines 6 to 9 of Algorithm 1). After that, we update the weights W (and biases B) by computing the gradient of the loss function with respect to the weights (lines 11 to 14 of Algorithm 1). We repeat the feed-forward and back-propagation procedures until the sub-network has converged. Finally, we assemble the converged sub-networks to form a BAN.

Algorithm 1
The pseudo-code of the Brick Assembly Network
Input: Image dataset D, distributed into l sub-datasets, where each sub-dataset consists of data of only one label, X^l ⊂ D
Output: A converged BAN.
1: Initialization:
2: Initialize the learning rate α.
3: Set initial weights w_1, w_2, . . . , w_n ∈ W to random numbers in the range [−0.5, 0.5].
4: Set initial node values of the characteristic layer c_1, c_2, . . . , c_j ∈ C to random numbers −5 ≤ c_i ≤ 5.
5: Feed-forward Procedure:
6: for each sub-network do
7:   Compute the predicted node values of the characteristic layer of the sub-network, where h(x) can be a nested function.

Parametric Characteristic Layer
In this paper, we also introduce the parametric characteristic layer, which refers to a defined characteristic layer C whose node values change dynamically. In other words, the final node values of the defined characteristic layer differ from their initialized values. The purpose of the parametric characteristic layer is to obtain proper node values of the characteristic layer, instead of fixed node values, by using gradient descent with a given parameter vector β. To produce a parametric characteristic layer, we multiply the characteristic layer C element-wise with a parameter vector β, as defined in Equation (4):

C_p = β ⊙ C,   (4)

where β is a vector consisting of j parametric variables that control the values of C. Note that we use the parametric characteristic layer C_p instead of the fixed characteristic layer C in the experiments. Therefore, we modify Equation (1) into Equation (5) and Equation (3) into Equation (6):

L_{C_p}^l(X^l) = (1/j) Σ_{i=1}^{j} (β_i c_i^l − ĉ_i^l)²,   (5)

Ŷ = argmin_l D(C_p^l, Ĉ^l).   (6)

We also provide Algorithm 2, which presents the pseudo-code for updating the parametric vector β.
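A minimal NumPy sketch of the parametric characteristic layer follows, again under simplifying assumptions (a single linear layer in place of the convolutional sub-network): both the weights and the parameter vector β descend the gradient of the loss in Equation (5), so the effective target β ⊙ C moves during training instead of staying fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
J, D_IN = 8, 32

C = rng.uniform(-5, 5, size=J)  # fixed base characteristic layer
beta = np.ones(J)               # parametric vector; C_p = beta * C (Eq. 4)
W = rng.uniform(-0.5, 0.5, size=(D_IN, J))
b = np.zeros(J)

X = rng.uniform(0, 1, size=(50, D_IN))  # synthetic one-label data
lr_w, lr_beta = 1e-3, 1e-2

def loss():
    """Eq. (5): MSE between the parametric target beta*C and the prediction."""
    return np.mean((X @ W + b - beta * C) ** 2)

loss_before = loss()
for _ in range(300):
    err = X @ W + b - beta * C  # shape (n, J)
    W -= lr_w * X.T @ err / len(X)
    b -= lr_w * err.mean(axis=0)
    # gradient of Eq. (5) w.r.t. beta (constant factors folded into lr_beta)
    beta -= lr_beta * (-C) * err.mean(axis=0)
loss_after = loss()
```

Because β and the weights are updated jointly, the characteristic layer settles on values that are easier for the sub-network to reach than the original random initialization.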

Algorithm 2
The pseudo-code of the parametric characteristic layer
Input: Image dataset D, distributed into l sub-datasets, where each sub-dataset consists of data of only one label, X^l ⊂ D
Output: A converged parametric characteristic layer.

Dataset
For the experimental analysis of our proposed methods, we use three public benchmark datasets: MNIST [24], Fashion MNIST [26], and Kuzushiji-MNIST [27]. Each of these datasets has 60,000 training images and 10,000 test images associated with labels from ten classes, and each image is a 28 × 28 grayscale image. Note that, in these experiments, we normalize the pixel values of the images to the range (0, 1) instead of using the original range (0, 255).
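The normalization step amounts to a single array operation; a sketch assuming the images are stored as 8-bit unsigned integers:

```python
import numpy as np

# toy stand-in for a batch of four 28x28 grayscale images with 8-bit pixels
images = np.random.default_rng(3).integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# rescale pixel values from [0, 255] to [0, 1] before feeding the network
normalized = images.astype(np.float32) / 255.0
```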
We use a basic convolutional neural network (CNN) [21] as the sub-network architecture for each label in the three datasets. The sub-network architecture is shown in Figure 1: one convolutional layer and one fully connected layer (i.e., the characteristic layer). The convolutional layer is followed by a Leaky Rectified Linear Unit (LeakyReLU) [28] activation function.
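LeakyReLU itself is a one-line function; the sketch below uses a negative slope of 0.01, which is a common default — the paper does not state the slope it uses.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """LeakyReLU: keep positive values, scale negatives by a small slope alpha."""
    return np.where(x > 0, x, alpha * x)

out = leaky_relu(np.array([-2.0, 0.0, 3.0]))
# out is [-0.02, 0.0, 3.0]: negatives are damped rather than zeroed, so
# gradients can still flow through inactive units (unlike plain ReLU)
```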

Experiment Results and Discussion
In this study, we perform several experiments to test our proposed network architecture, BAN.
In the experiments, we denote N_mnist, N_fmnist, and N_kmnist as the number of labels for the MNIST, Fashion MNIST, and Kuzushiji-MNIST datasets, respectively. Note that we perform only 50 epochs during the training phase to prevent the network from overfitting the training dataset.

Single Dataset
The objective of this experiment is to demonstrate the classification performance of the classifiers (i.e., a standard neural network, BAN, and TreeCNN [17]) when one new label at a time is incrementally added to each classifier from a given dataset. Each classifier is trained with two labels at the beginning of the experiment. After that, a new label is incrementally added to each classifier until the classifier has been trained with ten labels. We illustrate the experiment results in Figure 2. Figure 2a shows the accuracy of the standard neural network, BAN, and TreeCNN trained with different numbers of labels at the 50th epoch, while Figure 2b displays the total time used to train the classifiers when incrementally adding one new label to each classifier over 50 epochs. Although BAN yields lower accuracy than the standard neural network and TreeCNN, it has a strong capability of adding (or removing) new labels to a trained network: the total time BAN uses to train a new label is significantly less than that of the other two networks, as shown in Figure 2b. BAN used less than ten seconds to train each label in MNIST, Fashion MNIST, and Kuzushiji-MNIST. This is because, unlike the standard neural network and TreeCNN, BAN only has to train a sub-network with the new data, without retraining the entire network. Therefore, BAN needs less time to reach a converged state.
In standard image prediction, we apply a softmax function in the output layer of a neural network to produce a probability for each neural node, and the node with the highest probability is chosen as the final class. In BAN, instead, we compute the distances between the user-defined values of the parametric characteristic layers, βC, and the predicted values of the characteristic layers, Ĉ, as discussed in Equation (6). The lowest distance produced among the sub-networks determines the final class. We perform this experiment to evaluate whether the distance from Equation (6) can be used for prediction. We show the average distances for the MNIST, Fashion MNIST, and Kuzushiji-MNIST datasets in Tables 1-3, respectively. In each table, the first column shows the true label of the test images, and the remaining columns show the average distances for the predicted labels generated by the sub-networks in BAN. The shortest average distance is in bold. The overall results indicate that BAN predicts most images correctly, except for one case on the Kuzushiji-MNIST dataset in Table 3, where BAN predicted most images of label 7 as label 4. We believe this issue can be addressed by training the sub-network with different loss functions, adding regularization to the loss function, or applying different activation functions in the layers. We will pursue this in our future research.

Multiple Datasets
The aim of this experiment is to demonstrate the classification performance of the three classifiers (i.e., a standard neural network, BAN, and TreeCNN) when incrementally adding one new label each from the MNIST, Fashion MNIST, and Kuzushiji-MNIST datasets. Initially, each classifier is trained with three labels, one from each dataset mentioned above. Following that, three new labels are incrementally added to the classifier until it has been trained with a total of 30 labels. In other words, we increment the number of labels of each dataset from one to ten, where N_mnist, N_fmnist, N_kmnist = {1, 2, 3, . . . , 10}. We depict the experiment results in Figure 3. Figure 3a presents the accuracy of the standard neural network, BAN, and TreeCNN trained with different numbers of labels at the 50th epoch, whereas Figure 3b shows the total time used to train the classifiers when incrementally adding one new label from each of the three datasets over 50 epochs. Since BAN already trained ten labels from each dataset in Section 5.1.1, we can reuse the trained sub-networks directly without any training procedure in this subsection. Therefore, the total time used to train BAN is 0 for all cases in Figure 3b. Although the accuracy of BAN is lower than that of the standard neural network and TreeCNN in Figure 3a, we conjecture that it can be increased with different settings of the sub-network, such as the activation function, the number of neural nodes in the characteristic layer C, the user-specified threshold ε, etc. We will address the fine-tuning of the sub-network in future research.
In summary, BAN requires less time to train a new label added to a trained network (Section 5.1.1). We can also reuse sub-networks trained on different datasets to form a different network structure without any training procedure (Section 5.1.2).

The Capability of Changing Different Labels on a Network with a Mixture Dataset
The intention of this experiment is to examine the capability of the classifiers (i.e., a standard neural network, BAN, and TreeCNN) to handle changes of labels while keeping the number of labels in the classifier fixed. We perform the experiment in three cases, each using two different datasets (i.e., MNIST & Fashion MNIST, MNIST & Kuzushiji-MNIST, and Fashion MNIST & Kuzushiji-MNIST). In each case, we make sure the total number of labels chosen from both datasets equals ten (e.g., N_mnist + N_fmnist = 10). For example, we select N_mnist = {1, 2, . . . , 9} and N_fmnist = {9, 8, . . . , 1}, respectively, in Table 4 (or the first row of images in Figure 4). We show the experiment results of the three cases in Tables 4-6, respectively, and also illustrate the results of the three tables in Figure 4 for easier observation. Although BAN achieves lower accuracy on a mixture of two datasets than the standard neural network and TreeCNN, it demonstrates the retrain-less capability of its sub-networks, which were already trained in Section 5.1.1. This also shows that data with only one label can be trained to a unique pattern on the characteristic layer of a neural network. In summary, BAN has the best capability of assembling or dismantling any label to or from the network while maintaining a fixed number of labels at any time, without retraining the network.

Summary
We provide a qualitative comparison between a standard neural network, BAN, and TreeCNN in Table 7. The 'number of parameters' in Table 7 refers to the parameters used in the neural network, such as the weights, biases, hyper-parameters of an activation function, etc. Although BAN requires more memory to store a larger number of parameters, it has more advantages than a standard neural network and TreeCNN. For instance, BAN can train a new label with an individual sub-network and then assemble it into a trained BAN. Moreover, the training time of BAN is far less than that of the standard neural network and TreeCNN, because BAN does not require retraining the entire network when new data are added to the dataset. Furthermore, BAN requires no true labels to train on a dataset.