Ensemble Learning of Lightweight Deep Learning Models Using Knowledge Distillation for Image Classification

In recent years, deep learning models have been used successfully in almost every field, both in industry and academia, especially for computer vision tasks. However, these models are huge in size, with millions (or even billions) of parameters, and thus cannot be deployed on systems and devices with limited resources (e.g., embedded systems and mobile phones). To tackle this, several techniques for model compression and acceleration have been proposed. As a representative technique, knowledge distillation offers a way to effectively learn a small student model from one or more large teacher models, and it has attracted increasing attention thanks to its promising performance. In this work, we propose an ensemble model that combines feature-based, response-based, and relation-based lightweight knowledge distillation models for simple image classification tasks. In our knowledge distillation framework, we use ResNet-20 as the student network and ResNet-110 as the teacher network. Experimental results demonstrate that our proposed ensemble model outperforms other knowledge distillation models as well as the large teacher model for image classification tasks, while requiring less computational power than the teacher model.


Introduction
During the last few years, deep learning models have been used successfully in numerous industrial and academic fields, including computer vision [1,2], reinforcement learning [3], and natural language processing [4]. However, in many applications, the available training data are not sufficient to train deep learning models effectively. It is therefore necessary to develop compact networks that generalize well without requiring large-scale datasets. Furthermore, most deep learning models are too computationally expensive to run on resource-limited devices such as mobile phones and embedded devices.
To overcome these limitations, several model compression techniques (e.g., low-rank factorization [5][6][7], parameter sharing and pruning [8][9][10][11][12][13][14][15][16][17][18][19], and transferred/compact convolutional filters [20][21][22][23][24][25]) have been proposed to reduce the size of models while still providing similar performance. One of the most effective techniques is knowledge distillation in a teacher-student setting, in which a larger pretrained network (the teacher network) produces output probabilities that are then used to train a smaller compact network (the student network). Hence, it provides greater architectural flexibility, since structural differences between the teacher and student are allowed. Additionally, instead of training with one-hot-encoded labels, where the classes are mutually exclusive, using the relative probabilities of secondary classes or relational information between data examples provides more information about the similarities of the samples; this is the core idea of knowledge distillation. Despite its simplicity, knowledge distillation shows promising results in various fields, including image classification.
Generally, knowledge distillation methods can be divided into (1) response-based, (2) feature-based, and (3) relation-based distillation methods, depending on which form of knowledge is transferred to the student network. In the response-based distillation method, the neural response of the last output layer of the teacher network is used. For instance, Ba and Caruana [26] used logits as knowledge, and Hinton et al. [27] adopted the category distribution (soft target) as the distilled knowledge. In the feature-based distillation method, the outputs of the intermediate layers (i.e., feature maps) and the last layer are all utilized as knowledge for supervising the training of the student model. For instance, FitNet [28] uses intermediate feature maps of the teacher network to improve performance, and the work in [29] proposed Attention Transfer (AT) to transfer attention maps, which represent the relative importance of layer activations. In the relation-based distillation method, the relationships between data samples and between different layers are further utilized. For instance, Park et al. proposed Relational Knowledge Distillation (RKD) [30] to transfer relations of data examples based on distance-wise and angle-wise distillation losses, which measure structural differences in relations. Peng et al. [31] proposed a method based on correlation congruence (CC), in which the distilled knowledge contains information about instances as well as the correlations between instances.
In this work, we propose an ensemble model that combines three lightweight models learned by three different knowledge distillation strategies (feature-based, response-based, and relation-based) on the CIFAR-10 and CIFAR-100 datasets, which are widely used as benchmarks for the image classification task. For knowledge distillation, we adopt ResNet-20, containing only 0.27 million parameters, as the lightweight student network and ResNet-110, containing 1.7 million parameters, as the teacher network.
In our experiments, we provide an extensive evaluation of 20 different knowledge distillation methods and our proposed ensemble method. For a fair comparison, all experiments are conducted under the same conditions for all knowledge distillation methods. Our experimental results demonstrate that our proposed ensemble model outperforms not only the other knowledge distillation methods but also the teacher network, with less computational power.
In summary, our contributions are as follows:
• We designed and implemented an ensemble model that combines feature-based, response-based, and relation-based lightweight knowledge distillation models.
• We conducted extensive experiments on various knowledge distillation models and our proposed ensemble models under the same conditions for a fair comparison.
• We showed that our proposed ensemble model outperforms other state-of-the-art distillation models as well as large teacher networks on two different datasets (CIFAR-10 and CIFAR-100) with less computational power.
The rest of this work is organized as follows. In Section 2, we briefly present related works on model compression and knowledge distillation. Then, we describe our proposed ensemble method in Section 3. The experimental results are shown in Section 4. Finally, we summarize and conclude this work in Section 5.

Related Work
In this section, we discuss related literature in model compression and knowledge distillation.

Low-Rank Factorization
Low-rank factorization identifies informative parameters of deep neural networks by employing matrix and tensor decomposition. In convolutional neural networks (CNNs), convolution kernels are viewed as four-dimensional (4D) tensors. The main idea of tensor decomposition is that these 4D tensors contain a large amount of redundancy that can be removed to compress the CNN. In addition, a fully connected layer can be viewed as a 2D matrix whose low-rankness can likewise be exploited for model compression. The overall framework of the low-rank factorization method is illustrated in Figure 1. Lebedev et al. [5] proposed Canonical Polyadic (CP) decomposition, computed by nonlinear least squares, for kernel tensors. For training CNNs that are low-rank constrained, Tai et al. [6] proposed a new method for computing the low-rank tensor decomposition, in which the activations of the intermediate hidden layers are transformed by Batch Normalization (BN) [7]. However, extensive model retraining is needed for low-rank factorization to achieve performance similar to that of the original model. Another issue is that the implementation is difficult, as it involves computationally expensive decomposition operations.
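To make the idea concrete, the following sketch factorizes a fully connected layer's weight matrix with a truncated SVD; the layer sizes and rank are illustrative assumptions, not values from any cited method.

```python
import numpy as np

# Illustrative sketch: compress a fully connected layer W (m x n) by keeping
# only the top-k singular vectors, replacing W with U_k (m x k) and V_k (k x n).
# The parameter count drops from m*n to k*(m + n).
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))

def low_rank_factorize(W, k):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * s[:k]   # absorb the singular values into U
    V_k = Vt[:k, :]
    return U_k, V_k

U_k, V_k = low_rank_factorize(W, k=32)
params_full = W.size               # 256 * 512 = 131072
params_low = U_k.size + V_k.size   # 32 * (256 + 512) = 24576
print(params_full, params_low)
```

In practice the factorized layer is fine-tuned afterwards, which is exactly the retraining cost noted above.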


Parameter Sharing and Pruning
Parameter sharing and pruning-based methods focus on removing noncritical and redundant parameters from deep neural networks without any significant effect on performance, using (1) quantization/binarization, (2) parameter pruning/sharing, and (3) structural matrix design.
The original network can be compressed by network quantization, in which the number of bits required to represent each weight is reduced. Gupta et al. [8] proposed a 16-bit fixed-point representation for training CNNs with a stochastic rounding scheme. It was shown in [9] that linear eight-bit quantization of the parameters can significantly accelerate training with minimal loss of accuracy. Han et al. [10] proposed a method that first prunes the small-weight connections and retrains the sparsely connected network; the weights of the remaining links are then quantized by weight sharing, and finally Huffman coding is applied to both the quantized weights and the codebook to further reduce the rate. Concretely, connectivity is first learned by training a normal network; the connections with small weights are then pruned, as illustrated in Figure 2; lastly, the remaining sparse connections are fine-tuned by retraining the network. There are also many works that train CNNs with binary weights directly (e.g., BinaryNet [11], BinaryConnect [12], and XNOR-Networks [13]). However, when dealing with very large CNNs (e.g., GoogLeNet [32]), the accuracy of these binarized neural networks is significantly reduced.
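The pruning stage of this pipeline can be sketched as simple magnitude-based thresholding; the layer shape and sparsity level below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of magnitude pruning: zero the fraction `sparsity` of
# weights with the smallest absolute value, keeping a binary mask so the
# surviving sparse connections can be retrained afterwards.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64))

def magnitude_prune(w, sparsity):
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask, mask

pruned, mask = magnitude_prune(weights, sparsity=0.9)
print(mask.mean())  # roughly 0.1 of the weights survive
```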
On the other hand, several methods propose network sharing and pruning to reduce network complexity. The work in [14] proposed HashedNets, which utilizes a lightweight hash function to randomly group weights into hash buckets for weight sharing. In [15], a structured sparsity regularizer is added on each layer to reduce channels, trivial filters, and layers. The work in [16] proposed Neuron Importance Score Propagation (NISP) to propagate an importance score to each node and optimize the reconstruction error of the last response layer. However, all pruning criteria can be cumbersome for some applications, since they require the sensitivity of each layer to be set up manually.

The structural matrix approach can reduce memory and also significantly accelerate the training and inference stages through fast matrix-vector multiplication and gradient computations. The work in [17] demonstrates the effectiveness of this new notion of parsimony based on the theory of structured matrices. Additionally, it can be extended to other matrices, such as the block-Toeplitz matrices [18] that relate to multidimensional convolutions [19].

Figure 2. A summary of the three stages of the compression pipeline proposed in [10]: pruning, quantization, and Huffman encoding. The input is the original network and the output is the compressed network.

Transferred/Compact Convolutional Filters
Transferred/compact convolutional filters focus on designing structured, special convolutional filters to reduce the number of parameters and save storage. According to the equivariant group theory introduced in [20], whole network models can be compressed by applying a transform to filters or layers. Moreover, the computational cost can be reduced by using compact convolutional filters. The main idea is that replacing overparameterized, loose filters with compact blocks improves speed, which accelerates the training of CNN models on several benchmarks. Iandola et al. [21] proposed a compact neural network called SqueezeNet, which stacks a series of fire modules. A fire module contains a squeeze convolutional layer with only 1 × 1 convolution filters, which feeds into an expand layer with both 1 × 1 and 3 × 3 convolution filters, as shown in Figure 3. Szegedy et al. [22] proposed decomposing 3 × 3 convolutions into two 1 × 1 convolutions, resulting in state-of-the-art acceleration performance on the object recognition task. Other techniques include efficient and lightweight network architectures such as ShuffleNet [23], CondenseNet [24], and MobileNet [25]. These methods work well for flat/wide architectures (e.g., VGGNet [33]) but not for special/narrow ones (e.g., ResNet [34] and GoogLeNet [32]), since the transfer assumptions are too strong, which leads to unstable results on some datasets.
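A back-of-envelope comparison illustrates why a fire module is compact; the channel sizes below are illustrative assumptions rather than SqueezeNet's actual configuration, and biases are ignored.

```python
# Parameters of a plain 3x3 convolution vs. a fire module with the same
# input/output channel counts (illustrative sizes, biases ignored).
c_in, c_out = 128, 128
plain_3x3 = c_in * c_out * 3 * 3               # 147456 parameters

s = 16                    # squeeze layer: 1x1 filters
e1 = e3 = c_out // 2      # expand layer: half 1x1, half 3x3 filters
fire = c_in * s + s * e1 + s * e3 * 3 * 3      # 2048 + 1024 + 9216 = 12288
print(plain_3x3 / fire)   # 12.0, i.e., 12x fewer parameters
```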


Response-Based Knowledge
The neural response of the final output layer of the teacher network is used in the response-based model. The basic concept is to directly mimic the final output of the teacher network. Distillation methods using response-based knowledge have been widely used in different applications and tasks due to their simplicity and effectiveness for model compression. The work in [26] shows that the student network is able to learn the complex functions of the teacher network by training the student network on the logits (the output of the last layer). Hinton et al. [27] propose to match the classifier outputs of the student and teacher networks by minimizing the KL-divergence of the category distribution (soft target). Recently, Meng et al. [35] proposed conditional teacher-student learning, in which a student network selectively learns from either conditioned ground-truth labels or the soft target of a teacher network.
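As a minimal sketch of the soft-target idea of Hinton et al. [27] (the temperature value and the logit vectors below are illustrative):

```python
import numpy as np

# Soften both networks' logits with a temperature T, then penalize the
# KL divergence between the teacher's and student's class distributions.
def softmax(z, T=1.0):
    e = np.exp(z / T - (z / T).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_kl(student_logits, teacher_logits, T=4.0):
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student), scaled by T^2 as in the original formulation
    return float((T ** 2) * np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean())

z_t = np.array([[5.0, 1.0, -2.0]])  # teacher logits (illustrative)
z_s = np.array([[4.0, 2.0, -1.0]])  # student logits (illustrative)
print(soft_target_kl(z_s, z_t))
```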



Feature-Based Knowledge
In feature-based knowledge distillation, the outputs of the intermediate layers (i.e., feature maps) and the last layer are all utilized as knowledge for supervising the training of the student model. Romero et al. proposed FitNet [28], which mimics the intermediate feature maps of a teacher network to improve performance. However, FitNet may adversely affect convergence and performance due to the capacity gap between the student and teacher networks. The work in [29] proposed AT to transfer attention maps, which represent the relative importance of layer activations. Huang and Wang proposed Neural Selective Transfer (NST) [36] to imitate the distribution of neuron activations from intermediate layers of the teacher network. Kim et al. proposed Factor Transfer (FT) [37], which compresses features into "factors" as a more understandable form of intermediate representation using an autoencoder in the teacher network, and then uses a translator to extract the "factors" in the student network. Heo et al. proposed to transfer the Activation Boundaries (AB) [38] of the hidden neurons, forcing the student to learn the binarized values of the preactivation map of the teacher network. Instead of transferring knowledge by feedforward information, Czarnecki et al. [39] use gradient transfer with Sobolev training. Heo et al. proposed Boundary Supporting Samples (BSS) [40] to match the decision boundary more explicitly, using an adversarial attack to discover samples that support a decision boundary; they additionally use a boundary supporting loss that encourages the student network to match the output of the teacher network on samples close to the decision boundary. Ahn et al. proposed Variational Information Distillation (VID) [41], which maximizes a lower bound on the mutual information between the student network and the teacher network. Heo et al. proposed Overhaul of Feature Distillation (OFD) [42] to transfer the magnitude of the feature response, which contains both the activation status of each neuron and feature information. Wang et al. proposed Attentive Feature Distillation (AFD) [43], which dynamically learns not only the features to transfer but also the unimportant neurons to skip. Tian et al. proposed Contrastive Representation Distillation (CRD) [44], which maximizes a tighter lower bound on the mutual information via a contrastive loss between the teacher network and the student network. The work in [45] proposed Deep Mutual Learning (DML), in which the teacher and student networks learn collaboratively and teach each other to boost performance for both open-set and closed-set problems; any network can act as either the student or the teacher during training.

Relation-Based Knowledge
While the output of specific layers of the teacher model is used in both response-based and feature-based knowledge, relation-based knowledge further utilizes the relationships between data samples and between different layers. To explore the relationships between different feature maps, Yim et al. proposed the Flow of Solution Procedure (FSP) [46], defined by the Gramian matrix (second-order statistics) across layers, for transfer learning and fast optimization; they demonstrate that the FSP matrix can reflect the data flow of how a problem is solved. Passalis and Tefas proposed Probabilistic Knowledge Transfer [47] to transfer knowledge by matching the probability distribution in feature space. Park et al. proposed RKD [30] to transfer relations of data examples based on angle-wise and distance-wise distillation losses, which measure structural differences in relations. The work in [48] proposed a Similarity Preserving (SP) distillation method to transfer pairwise activation similarities of input samples. Peng et al. proposed a method based on CC [31], in which the distilled knowledge contains information about instances as well as the correlations between instances. Liu et al. proposed the Instance Relationship Graph (IRG) [49], which contains instance features, instance relationships, and the feature-space transformation across layers as the knowledge to transfer.
In this paper, we take advantage of three different types of knowledge distillation to build our ensemble model, as shown in Figure 4. Our proposed ensemble model consists of AT [29] as feature-based knowledge (Figure 4A), where attention maps of intermediate layers are transferred into the student network; Logit [26] as response-based knowledge (Figure 4B), where the logits (the output of the last layer) are transferred into the student network; and RKD [30] as relation-based knowledge (Figure 4C), in which relational information between the outputs of intermediate layers is transferred into the student network.

Proposed Methods
In this section, the architecture of our proposed ensemble method is first presented. After that, we present the details of three key components in the following subsections.
The overall architecture of our proposed framework is illustrated in Figure 5. First, input images are augmented by the image augmentation module (Section 3.1). Second, augmented images are used as the input of three knowledge distillation models (Section 3.2). Third, the outputs of each distillation model are combined by our ensemble module to predict the category of input images (Section 3.3).



Image Augmentation
Image augmentation, as a preprocessing step on the input data, can improve performance and prevent overfitting. In the image augmentation step, input images are mapped into an extended space that covers more of their variation. Prior work has shown the effectiveness of image augmentation in increasing the amount of training data by augmenting the original training dataset; augmenting the training data also reduces overfitting during training. Our image augmentation module consists of (1) normalizing each image by the mean and standard deviation, (2) randomly cropping the image to 32 × 32 pixels with a padding of 4, and (3) applying a random horizontal flip.
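The three augmentation steps can be sketched as follows; the mean and standard deviation are placeholder values rather than the dataset statistics we actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, mean=0.5, std=0.25, pad=4, size=32):
    """Normalize, pad-and-random-crop to 32x32, and randomly flip an image."""
    img = (img - mean) / std                          # (1) normalize
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = int(rng.integers(0, 2 * pad + 1))           # (2) random 32x32 crop
    left = int(rng.integers(0, 2 * pad + 1))
    img = padded[top:top + size, left:left + size]
    if rng.random() < 0.5:                            # (3) random horizontal flip
        img = img[:, ::-1]
    return img

out = augment(rng.random((32, 32, 3)))
print(out.shape)  # (32, 32, 3)
```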

Knowledge Distillation
In our ensemble model, we use three types of individual distillation models: (1) the response-based distillation model, (2) feature-based distillation model, and (3) relation-based distillation model. Each individual distillation model is trained by minimizing its loss function. The loss function consists of the student loss between the output of the student model and ground-truth label and the distilled loss between student and teacher models for each distillation method.

Student Loss
First, the student loss can be calculated as the cross-entropy between the soft target of the student model z_s, estimated by the softmax function, and the ground-truth label as follows:

\mathcal{L}_{CE}(y, z_s) = -\sum_{i} y_i \log\left( \frac{\exp(z_{s_i})}{\sum_{j} \exp(z_{s_j})} \right), (1)

where y is a ground-truth one-hot vector, in which the element corresponding to the ground-truth label of the training example is 1 and all other elements are 0, and z_si is the logit (the output of the last layer) for the i-th class of the student model.
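A minimal sketch of this student loss (with an illustrative one-hot label and logit vector):

```python
import numpy as np

# Cross-entropy between the softmax of the student logits and a one-hot label.
def student_loss(y_onehot, z_s):
    shifted = z_s - z_s.max()                 # shift for numerical stability
    p = np.exp(shifted) / np.exp(shifted).sum()
    return float(-np.sum(y_onehot * np.log(p)))

y = np.array([0.0, 1.0, 0.0])   # ground truth is class 1
z = np.array([1.0, 3.0, -1.0])  # student logits (illustrative)
print(student_loss(y, z))
```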

Distilled Loss of the Response-Based Model
The distilled loss of the response-based model using logits [26] can be calculated as the mean square error between the logits of the student model z_s and the logits of the teacher model z_t as follows:

\mathcal{L}_{logits}(z_s, z_t) = \| z_s - z_t \|_2^2. (2)

The total loss of the response-based model using logits is then calculated as a weighted combination of the student and distilled losses as follows:

\mathcal{L}(x, W) = \alpha \, \mathcal{L}_{CE}(y, z_s) + \beta \, \mathcal{L}_{logits}(z_s, z_t), (3)

where x is a training input, W are the parameters of the student model, and α and β are the weights of the student loss and the distilled loss, respectively.
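A minimal sketch of this distilled loss and the combined objective (the logit vectors and the weights alpha and beta are illustrative):

```python
import numpy as np

# Mean squared error between student and teacher logits (distilled loss),
# combined with a precomputed student loss via the weights alpha and beta.
def logit_mse(z_s, z_t):
    return float(np.mean((z_s - z_t) ** 2))

def total_loss(ce_loss, z_s, z_t, alpha=1.0, beta=0.5):
    return alpha * ce_loss + beta * logit_mse(z_s, z_t)

z_s = np.array([2.0, 0.5, -1.0])  # student logits (illustrative)
z_t = np.array([2.5, 0.0, -0.5])  # teacher logits (illustrative)
print(logit_mse(z_s, z_t))  # 0.25
```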

Distilled Loss of the Feature-Based Model
The distilled loss of the feature-based model using AT [29] can be calculated as follows:

\mathcal{L}_{AT} = \sum_{j \in I} \left\| \frac{Q_S^j}{\| Q_S^j \|_2} - \frac{Q_T^j}{\| Q_T^j \|_2} \right\|_2, (4)

where Q_T^j and Q_S^j are the j-th pair of teacher and student attention maps in vectorized form, respectively, I is the set of indices of all student-teacher activation layer pairs for which we want to transfer attention maps, and ‖·‖_2 is the l2 norm of a vector, i.e., the square root of the sum of the squared absolute values. The total loss of the feature-based model using AT is then calculated as the combination of the student and distilled losses as follows:

\mathcal{L} = \alpha \, \mathcal{L}_{CE} + \beta \, \mathcal{L}_{AT}. (5)
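A sketch of this attention-transfer loss, assuming (as in [29]) that each attention map is the channel-wise sum of squared activations, flattened and l2-normalized; the activation shapes below are illustrative.

```python
import numpy as np

def attention_map(A):
    """Reduce an activation tensor (C, H, W) to a normalized spatial map."""
    q = (A ** 2).sum(axis=0).ravel()   # sum of squares over channels
    return q / np.linalg.norm(q)

def at_loss(student_acts, teacher_acts):
    # Paired activations must share spatial size; channel counts may differ.
    return float(sum(np.linalg.norm(attention_map(s) - attention_map(t))
                     for s, t in zip(student_acts, teacher_acts)))

rng = np.random.default_rng(0)
s_acts = [rng.random((16, 8, 8)), rng.random((32, 4, 4))]   # student layers
t_acts = [rng.random((64, 8, 8)), rng.random((128, 4, 4))]  # teacher layers
print(at_loss(s_acts, t_acts))
```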

Distilled Loss of the Relation-Based Model
The distilled loss of the relation-based model using RKD [30] can be calculated as follows:

\mathcal{L}_{RKD} = \sum_{(x_1, \ldots, x_n) \in \chi^N} l_\delta\big( \phi(t_1, \ldots, t_n), \, \phi(s_1, \ldots, s_n) \big), (6)

where χ^N is a set of N-tuples of distinct data examples (e.g., χ² = {(x_i, x_j) | i ≠ j} and χ³ = {(x_i, x_j, x_k) | i ≠ j ≠ k}), (x_1, . . . , x_n) is an n-tuple extracted from χ^N, t_i = f_T(x_i) and s_i = f_S(x_i) are the outputs of any layer of the teacher and student networks for the given examples, respectively, ϕ is a relational potential function that measures relational properties (the angle and Euclidean distance) of the given n-tuple, and l_δ is the Huber loss, which penalizes the difference between the student model and the teacher model and can be defined as follows:

l_\delta(x, y) = \begin{cases} \frac{1}{2}(x - y)^2 & \text{for } |x - y| \le 1, \\ |x - y| - \frac{1}{2} & \text{otherwise}. \end{cases} (7)

In RKD, the student model is trained to mimic the relational structure of the teacher model using the relational potential function. The total loss of the relation-based model using RKD is then calculated as the combination of the student and distilled losses as follows:

\mathcal{L} = \alpha \, \mathcal{L}_{CE} + \beta \, \mathcal{L}_{RKD}. (8)
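A sketch of the distance-wise part of the RKD loss (pairwise distances normalized by their mean, compared via the Huber loss); the embedding sizes are illustrative assumptions.

```python
import numpy as np

def huber(x, delta=1.0):
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def pairwise_dists(emb):
    """Upper-triangle pairwise Euclidean distances, normalized by their mean."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    d = d[np.triu_indices(len(emb), k=1)]
    return d / d.mean()

def rkd_distance_loss(emb_s, emb_t):
    return float(huber(pairwise_dists(emb_s) - pairwise_dists(emb_t)).mean())

rng = np.random.default_rng(0)
f_s = rng.random((8, 16))  # student embeddings for 8 examples (illustrative)
f_t = rng.random((8, 64))  # teacher embeddings for the same 8 examples
print(rkd_distance_loss(f_s, f_t))
```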

Ensemble of KD Models
The aim of ensemble models is to combine the predictions of several individual models in order to improve robustness and generalizability over any single model, thus increasing accuracy. When predicting a target variable with any algorithm, the main causes of the difference between predicted and actual values are variance, noise, and bias; an ensemble model helps to reduce these factors. Popular methods for ensemble learning are voting, bagging, and boosting. We adopt voting as our ensemble learning technique since it is simple yet effective, and it is well suited to complementing the weaknesses of individual models that already perform well on their own. Other ensemble learning techniques such as bagging and boosting, on the other hand, create very complex classifiers and can be expensive in time and computation, making them unsuitable for our knowledge distillation framework, whose goal is to build a lightweight model with high performance.
Voting methods can be classified into (1) soft voting and (2) hard voting. Soft voting is used when an individual model can output probabilities for the outcomes; the final result is obtained by averaging the probabilities measured by the individual models. In hard voting, by contrast, the final prediction is determined by a simple majority vote. We use soft voting as our voting scheme since the output of each individual model is a probability of class membership. The final prediction of our model is then calculated using soft voting as follows:

$$\hat{y} = \arg\max_i \frac{1}{m} \sum_{j=1}^{m} \mathrm{softmax}(z_j)_i$$

where $\hat{y}$ is the predicted category, $m$ is the number of individual distillation models, $i$ is the index value of the category list, $z_j$ is the output of the last layer of the $j$-th distillation model, and $\mathrm{softmax}(z_j)_i$ is the resulting probability of the $i$-th class.
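A minimal sketch of this soft-voting step (illustrative NumPy, not the authors' code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_vote(model_logits):
    """Average the class probabilities of the individual distillation
    models and return the class with the highest mean probability."""
    probs = np.stack([softmax(z) for z in model_logits])  # (m, num_classes)
    return int(probs.mean(axis=0).argmax())
```

With logits [2, 1, 0], [0, 3, 0], and [1.5, 1, 0], a hard majority vote would pick class 0 (two of the three models rank it first), but soft voting picks class 1 because the second model is far more confident in it.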

Experiments and Results
To verify the suitability of our proposed method, we evaluate it on image classification datasets. In our experiments, we adopt ResNet-20 and ResNet-110 as the student and teacher models, respectively.

Dataset
We perform a set of experiments on two datasets, CIFAR-10 and CIFAR-100, which are widely used as benchmarks for image classification and are often used to evaluate model architectures and novel methods in deep learning. CIFAR-10 consists of 50,000 training images and 10,000 test images from 10 classes, each image of size 32 × 32 pixels. CIFAR-100 is a more complex dataset since it has 100 classes, with 500 training images and 100 test images per class. The task for both datasets is to classify each image into its category.

Experimental Settings
For a fair comparison and to control other factors, the algorithms of the other distillation methods are reproduced under the same conditions, based on their papers and code on GitHub [50]. In our experiments, we use Stochastic Gradient Descent (SGD) as the optimizer for all methods, with Nesterov acceleration and a momentum of 0.9. The learning rate starts at 0.1 and is divided by 10 at 50% and 75% of the total epochs. Weight decay is set to 0.0005. In the training step, we augment the data using the image augmentation module described in Section 3.1; the validation set is only normalized. We use the efficient ResNet-20 and ResNet-110 architectures, which have been fine-tuned to achieve high accuracy on CIFAR-10 and CIFAR-100, for all of our knowledge distillation experiments. We run each distillation method for 200 epochs and collect the highest top-k accuracy (the accuracy of the true class being equal to any of the k predicted classes) on the validation dataset for each run.
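The learning-rate schedule above (start at 0.1, divide by 10 at 50% and 75% of the total epochs) can be sketched as:

```python
def learning_rate(epoch, total_epochs=200, base_lr=0.1):
    """Step decay: divide the base rate by 10 at 50% and again at 75%
    of the total number of epochs (100 and 150 for a 200-epoch run)."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10.0
    if epoch >= 0.75 * total_epochs:
        lr /= 10.0
    return lr
```

In PyTorch, for example, this corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[100, 150]` and `gamma=0.1`.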

Results
The empirical results were obtained on two datasets (CIFAR-10 and CIFAR-100) for the baselines (teacher and student networks), the other knowledge distillation methods discussed earlier in Section 2.2, and our proposed ensemble models with different combinations of two and three knowledge distillation models. Top-k accuracy (%) of the various knowledge distillation models on CIFAR-10 and CIFAR-100 is shown in Table 1 (k = 1,2 on CIFAR-10 and k = 1,5 on CIFAR-100) and Table A1 (k = 1~5 on both CIFAR-10 and CIFAR-100). In Table 1, Top-1 and Top-2 were selected on the CIFAR-10 dataset and Top-1 and Top-5 on the CIFAR-100 dataset, since Top-1 and Top-2 are commonly used metrics for datasets with few classes (e.g., 10 classes on CIFAR-10), while Top-1 and Top-5 are commonly used for datasets with many classes (e.g., 100 classes on CIFAR-100). Additionally, the computational time for inference on the test set and the parameter sizes are shown in Table 2. From Tables 1 and 2, we make the following observations.
Table 1. Top-k accuracy (%) of knowledge distillation models and our proposed models on the CIFAR-10 and CIFAR-100 datasets (k = 1,2 on CIFAR-10 and k = 1,5 on CIFAR-100, where k is the number of top elements to look at for computing accuracy).
Observation 1: Some knowledge distillation models have lower accuracy than the baseline.
Analysis: From Table 1, it can be seen that some knowledge distillation methods (e.g., FSP, VID) perform worse than the baseline (student network) on the CIFAR-10 dataset. These results indicate that some distillation methods based on feature-based knowledge may not generalize well and achieve unsatisfactory results in a slightly different context. This is because each distillation method is only optimized for a particular training scheme and model architecture.

Observation 2: Some relation-based knowledge distillation models have less effect on the CIFAR-100 dataset.
Analysis: From Table 1, it can be seen that some relation-based knowledge distillation methods (e.g., RKD, CC, and PKT) have less effect on the CIFAR-100 dataset. This is because, with 100 classes, a batch contains more inter-class pairs but fewer intra-class pairs, leaving less within-class relational structure to transfer. This could be alleviated by increasing the batch size or designing advanced batch sampling methods.
Observation 3: The ensemble of three models (AT, logits, and RKD) has the highest top-k accuracy among other methods in both the CIFAR-10 and CIFAR-100 datasets.
Analysis: From Table 1, it can be seen that our proposed ensemble models have the highest top-k accuracy among all models, including the large teacher network, on both datasets. Moreover, the ensemble of only two models outperforms the teacher network in most cases. This is because our proposed ensemble methods take advantage of three types of knowledge distillation schemes: response-based (logits), feature-based (AT), and relation-based (RKD).
Observation 4: Our ensemble models require less inference time and have fewer parameters than the teacher network.
Analysis: From Table 2, it can be seen that our proposed ensemble models require less computation than the teacher network in terms of both parameter size and inference time. Together, the results in Tables 1 and 2 suggest that our ensemble model is a suitable lightweight model with high performance for classification tasks.

Discussion
In this section, we discuss the comparison between our experimental results and others, the computational advantages and disadvantages of our method, and the size of the training data.

Comparison between our Experimental Results and Others
Our experiment shows that several knowledge distillation models underperform one of the simplest distillation methods (e.g., logits). In addition, some knowledge distillation models fail to outperform the baseline (student). The authors of [44], who implemented 10 different knowledge distillation methods, also found that all of them fail to outperform the simple knowledge distillation method (e.g., soft targets). This is because most knowledge distillation methods are sensitive to changes in the teacher-student pair and are optimized for a particular training scheme, model architecture, and size. These results indicate that many knowledge distillation methods proposed by others may not generalize well and may achieve poor results under different student-teacher architectures. Another potential reason is that each knowledge distillation model we implemented differs slightly in the design and selection of hyperparameters from the models in the original papers. In our ensemble model, we used AT as the feature-based distillation model, logits as the response-based distillation model, and RKD as the relation-based distillation model, since these distillation models perform well under our student-teacher architecture, as shown in Table 1, and complement each other with different types of knowledge.

Computational Advantage/Disadvantage of our Method
The computational advantage of our method is that its inference time is lower than that of the large teacher model (e.g., ResNet-110). For instance, as shown in Table 2, inference on the test set (10,000 images) took about 3.09 s with the large teacher model, compared with about 2.53 s (ensemble of three) or 1.78 s (ensemble of two) with our proposed methods. Hence, our proposed method can be used for real-time applications, since it takes approximately 0.00025 s to classify each image. Moreover, our proposed method still outperforms the large teacher model in terms of accuracy, as shown in Table 1, whereas the other knowledge distillation methods perform worse than the large teacher model. However, our proposed method takes more computational time than the other knowledge distillation methods, since it is an ensemble that combines three lightweight models trained by three distillation methods. For training (200 epochs), each distillation method of our ensemble took 4600 s on CIFAR-10 and 4800 s on CIFAR-100 with an NVIDIA GeForce GTX 1070 Ti GPU.

The Size of Training Data
The size of the training data is one of the important factors affecting the learning process. When the training data is too small, it is hard to achieve good generalization performance, and the accuracy of the deep learning model (e.g., ResNet-20 or ResNet-110) will be poor. When the training data is insufficient, a pretrained model that was previously trained on a large dataset (e.g., ImageNet [51]) can be used via transfer learning. In our experiments, we do not use a pretrained model since the training data (CIFAR-10 and CIFAR-100) is sufficient.

Conclusions
In summary, we have presented an ensemble that combines three lightweight models learned by three different knowledge distillation strategies (feature-based, response-based, and relation-based). We evaluated our ensemble model on CIFAR-10 and CIFAR-100, which are widely used as benchmarks for image classification, and provided an extensive evaluation of 20 different knowledge distillation methods alongside our proposed ensemble methods. Our experimental results demonstrate that our proposed ensemble model outperforms the other knowledge distillation models as well as the large teacher network. In addition, our ensemble model has fewer parameters and lower inference time than the teacher network. These results indicate that the proposed ensemble model is a suitable lightweight model with high performance for classification tasks. Additionally, our proposed method, which is based on knowledge distillation, can be applied to various tasks such as medical image segmentation [52], character animation [53][54][55], computational design [56], and object recognition [1], since knowledge distillation is not task-specific but a very general approach, and any deep neural network architecture can be used to define both the teacher (usually a large network) and the student (a small network). For instance, CNNs [57][58][59], Long Short-Term Memory (LSTM) [60], Hidden Markov Models (HMMs) [61], Dilated CNNs [62], GANs [63], and graphical models [64] can also be used to design the student and teacher networks for a specific task.
Although our main focus is a model compression method using knowledge distillation, which seeks to decrease the size of a particular network so that it can run on resource-limited devices without a significant drop in accuracy, dimensionality reduction techniques [65][66][67][68], which focus on feature-level compression of the input data, can also be applied as a preprocessing step when handling input data with a high-dimensional space, or in other deep learning architectures. Additionally, Restricted Boltzmann Machines (RBMs) [69][70][71] can be used to preprocess the input data in other deep learning architectures (e.g., deep autoencoders) to make the learning process more efficient. Therefore, our proposed ensemble method can be very useful, since it can easily be applied to other tasks and other deep neural network architectures and, unlike other knowledge distillation models, it reduces the model size while at the same time enhancing performance.
Most knowledge distillation methods focus on new distillation loss functions or new types of knowledge. However, the design of the student-teacher architectures is poorly investigated. In fact, apart from the distillation loss functions and knowledge, the network architectures of the student and the teacher, and the relationship between them, significantly influence the performance of knowledge distillation.
Our experiment shows that several knowledge distillation models underperform a very simple distillation method (e.g., logits). In addition, some knowledge distillation models fail to outperform the baseline (student). These results indicate that many knowledge distillation methods may not generalize well and achieve poor results under different student-teacher architectures. As a result, the design of an effective teacher and student network architecture is still a challenging problem in knowledge distillation.
In the future, we plan to explore our proposed ensemble model with other teacher-student network architectures (e.g., AlexNet, VGG-16, ResNet-32, ResNet-50, and ResNet-110), and combine both knowledge distillation and other model compression techniques to learn effective and efficient lightweight deep learning models. Additionally, we plan to apply our ensemble model to other domains such as medical image classification and segmentation to gain more insights into the benefits of the ensemble model.