Is One Teacher Model Enough to Transfer Knowledge to a Student Model?

: Nowadays, the transfer learning technique can be successfully applied in the deep learning ﬁeld through techniques that ﬁne-tune the CNN’s starting point so it may learn over a huge dataset such as ImageNet and continue to learn on a ﬁxed dataset to achieve better performance. In this paper, we designed a transfer learning methodology that combines the learned features of different teachers to a student network in an end-to-end model, improving the performance of the student network in classiﬁcation tasks over different datasets. In addition to this, we tried to answer the following questions which are in any case directly related to the transfer learning problem addressed here. Is it possible to improve the performance of a small neural network by using the knowledge gained from a more powerful neural network? Can a deep neural network outperform the teacher using transfer learning? Experimental results suggest that neural networks can transfer their learning to student networks using our proposed architecture, designed to bring to light a new interesting approach for transfer learning techniques. Finally, we provide details of the code and the experimental settings.


Introduction
Nowadays, a large part of the research on transfer learning is applied to neural models [1][2][3][4][5][6], and most of this research tries to solve the problem of the transfer of learning from one already trained neural network to another. It is rare to find research works in the literature that try to transfer knowledge from more than one trained model to a new model that needs to be trained.
The best transfer learning results are obtained when the learning is transferred from the same network architecture trained on different datasets [7]: it is common to start the training of classification models from a pre-trained model on ImageNet [8]. Only a small part of the research tries to obtain better results with a less powerful network-and many of these only use specific network structures [2] that one must find by experimentation.
In particular, Romero [2] proposed to iterate over the epochs two times: the first time to transfer the learning using the distillation loss to obtain a pre-trained student; and a second time to train the student with classical cross-entropy. Our proposal instead also improves the accuracy of weaker neural networks.
The transfer learning is today intended as one to one: one teacher and one student. Our proposal instead shows a method to transfer the learning from multiple teachers to one student.
Thus, the primary contributions of this paper are: • We propose an approach to simultaneously transfer learning from multiple teachers to another neural network: more than one neural network can simultaneously transfer its learning to another neural network; • Improve Romero's idea [2] by reducing to only one cycle of training instead of two and by effectively transferring the learning between any kind of network.

Related Work
There are many types of transfer learning [9] in the literature that use different approaches to solve this problem. To better understand how to classify our proposal, we can say that it can be included in inductive transfer learning [9] because we use teachers trained on the same domain or on a different one but we only use supervised data. We are also in the sub-category of self-taught learning [9] because we only train the student and not other models.
In many existing papers on transfer learning [1][2][3][4][5][6], the process of knowledge transfer always takes place from one model to another, however, our proposal extends this concept and introduces a transfer of knowledge from many to one as we can see in Figure 1: more than one teacher can teach a student at the same time. represented as a small robot, can directly learn from the original data represented on the blackboard and also from the knowledge acquired by more than one teacher. The light on the teachers' head, unlike the student's, is not on because their learning is blocked.
A transfer learning method similar to the one proposed here is that of knowledge distillation (KD) [1]. It tries to transfer the learning between two models using as a target the response of the pre-trained model and using as loss function 1/T(z i − v i ) where T is a hyper-parameter named temperature, z i is the new model prediction using softmax and v i is the prediction of the pre-trained model which is obtained through the use of softmax. This method works well if we transfer learning between models of the same type, but it does not work well between two different models. Furthermore, this solution uses the classification level of the neural model to transfer knowledge from the teacher to the student, whereas our solution is instead based on the level just before the classification level, which is more general.
The similarity-preserving knowledge distillation [4] is a variation of knowledge distillation that applies an L2 normalization to classical knowledge distillation.
Another paper close to what we have proposed is the transfer of knowledge with Jacobian matching [5] which proposes the use of two separate loss functions: one to fit the label and another to fit the output activation of the teacher. This applies without the sum of the two loss function, but it does apply inside the layer via the use of Jacobian matching.
Finally, Lee et al. [6] proposed to add a distillation module on the teacher and on the student to extract some multi-dimensional features from the two models that are minimized with an L2 loss such as the similarity-preserving knowledge distillation [4].
Another similar method is the variational information distillation [3]. It minimizes the cross-entropy such as in our proposal while retaining high mutual information with the teacher network. The mutual information is computed on more than one layer. However, its major limitation is that the shape of all of the layers in which the mutual information is computed must be the same. Generally, in this method, the teacher network is the same size as the student network, because it aims to transfer learning across the domain.
In our proposal, we tried to preserve information with the translator and other loss with the same aim that has mutual information in Ahn et al. [3]. However, we want to transfer learning from one neural network to another without any limit, unlike what happens with the layer dimension in Ahn et al. [3].
Romero et al. [2], in their paper, performed something similar to the KD [1] method discussed above. In particular, this paper used the Maxout network [10] as a teacher and a smaller version of the same model as a student. In analogy with KD [1], the model proposed by Romero uses the distillation loss function. However, it is performed on the pre-trained weight after using an MSE loss function between the output of an intermediate layer (hint layer) of the teacher and an intermediate layer of the student.
Our first related paper is that by Romero et al. [2] because our proposal is to improve and generalize their method. At first, we aim to achieve this by reducing the learn transfer to one cycle of learning and at last by generalizing to multiple teachers. For this reason, another important paper for comparison is that on classical distillation, because if the effectiveness of this method is also not visible, then the improvement will have similar results. The only situation in which more than one teacher can be considered is that in Hinton et al. [1] which uses an ensemble, but it uses the same technique: trying to reduce the KD loss from the student output and merge the single output of the ensemble, but it is not different from a transfer learning between a more complex teacher and a simpler student.

Proposed Methods
Our proposal aimed to create a completely general transfer learning approach that could include any network architecture without limitations and the student can learn from multiple networks at the same time. The source code is available here [11].
Our first goal was to explore the problem present in the model proposed by Romero et al. [2] and determine a solution. In particular, the student network proposed by Romero et al. must have a particular topology to be found with a trial and error approach, otherwise the student's performance does not improve compared to the same trained student without making use of transfer learning and therefore of the teacher. The first novelty that we therefore introduce is the possibility of using any type of neural network as a teacher, without the use of a trial and error approach-while as a student, we have no limitations.

Analyzing Romero's Approach
The base idea of Romero et al. [2] was to use a simple linear layer to convert the teacher's features into student's features. There are two potential problems with this approach. The first problem is that they use twice the epochs compared to a normal training (2 · N epochs): the first N epochs only using the MSE loss function, which serves to transfer the knowledge from the teacher to the student, while with the other N epochs, in which the teacher and also the translator are removed, only the cross-entropy loss function is used to fine-tune the student. This approach can decrease the accuracy test of the student in case a student topology suitable for the problem cannot be found, compared to the student alone without any teacher. Our hypothesis is that the problem with this approach lies in the first training phase in which the knowledge is transferred and where the classification objective is also missing, which could lead to further worsening of the translated features compared to the original ones. In particular, this last problem should be caused by the translator. The second potential problem is that the solution adopted by Romero et al. does not work with all types of networks used as students, but requires that one finds the suitable model for each teacher and also for each domain in which this approach is used.
To understand which of the problems listed above are real, we propose the following two potential solutions. Our first idea is to do only one training cycle (N epochs), in which both tasks carried out by the Romero's model (transfer learning and fine-tuning) are performed, this time simultaneously and not in succession. Since we suspect that another problem is in the translator, our second experiment is to eliminate the translator and make sure that the student has a features layer with the same shape as the features layer of the teacher. This way, we overcome the problem of information loss in the transfer learning phase adopted by Romero et al.
The hypotheses formulated above were experimentally validated, and in Table 1, the summary results are shown, where it can be seen that the solution described above leads to an increase in accuracy of all the different neural models used and for each dataset used. Table 1. Comparison of the baseline test accuracies on the Cifar10 and Cifar100 datasets with the knowledge distillation (KD) [1], Romero approach [2], with one teacher (T1S) and one teacher with augmentation (T1S + Aug.), after 200 epochs, using two Resnet50 as teachers (one for Cifar10 and one for Cifar100). The student is of the same size as the feature layer adapted to that of the teacher.

Our Proposal
Since we experimentally discovered that, by eliminating the translator and inserting the whole transfer learning process within a single cycle, we were able to eliminate the problem highlighted in the Romero model, we can thus try to put the translator back, but this time, by adding a new loss function which in Figure 2 is called Xent (2). We do this to ensure that the translated features are, this time, as a function of the classification problem and not independently of this. Formally, the new cross-entropy loss introduced is defined in Equation (3) where we can see that the input values cl (fs) directly link the student's features to the classification layer.
Compared to the solution analyzed in the previous section, we want to remove the limitation that the student and the teacher must have a layer with the same dimensions, but this time adding a new loss function. Thanks to the elimination of this constraint, in our proposed model, all types of networks, i.e., teachers, can teach any other type of network, i.e., students. In particular, we propose a translator layer plus a cross-entropy loss function to overcome this limitation. The translator can be represented by a simple layer or even by another network that transforms the dimensions of the teacher's features layer into the dimensions of the student's features layer, as we can see in Figure 2.
More formally, the cross-entropy loss indicates how far the model predictions are from the supervised labels (y i ). The cross-entropy loss is defined as which computes the loss between predicted classes' probabilities x (in Figure 2, x is the output of the layer called Classify) and target y (in Figure 2, y is the target). In the training process, we update the model weights at each batch of size m. In Equation (1), n is the number of classes, so x i,j is the probability of the j-th class of the i-th prediction and x i,y i is the probability corresponding to the y i -th class of the i-th prediction.
The MSE loss is used to minimize the distance between the teacher's features and the student's features. The general formula of this last loss function is defined as follows: where y = f l 1 is the student's features (l 1 is the name of the features layer) while x = f t is the teacher's features.
Combining together the two losses defined in Equations (1) and (2), we obtain the loss proposed for our model. For the case where only one teacher is used, the loss function is the following: where f l i is the student's features for layer l i , f t 1 is the teacher's features, and l 0 (x) is the classification layer associated with the last layer of our model. Proposed approach with one teacher. We can distinguish three different networks: the teacher, the translator, and the student. The loss function is composed by the mean squared error (MSE) between the student and translator plus the first cross-entropy (XENT (1)) of the student's feature and classify layer flow, plus the second cross-entropy (XENT (2)) of the translator's feature and classify layer flow.

More than One Teacher
From what we saw above with the introduction of our solution, we realized that it is always possible to take advantage of a single teacher. Now, we ask ourselves whether it is possible to generalize this concept to more teachers and understand under what conditions the introduction of other teachers in the transfer learning process continues to bring advantages. In the multi-teacher scenario, the final loss function uses the student's last layers l i that are closest to the output layer l 0 or the classification layer. In particular, the teacher t 1 will use the layer l 1 , the teacher t 2 will use the layer l 2 , etc. As we did in the case of a single teacher t 1 in which we take their features layer f t 1 and transfer it to the student through the use of the MSE loss with the student's layer l 1 , generalizing this process for each teacher t i , we transfer their knowledge to the student via a new MSE loss between the features' layer f t i and the student layer l i (see Equation (5) for details). In addition to minimizing the distance between the features' layers through the use of an MSE loss, as we did in our proposal in the case of a single teacher, we need a new additional L xent (y, x) (defined in Equation (1)) for each new teacher introduced. As shown in Figure 3, the features layer of teacher 2, called f t 2 , is the same size as the student's layer l 2 but not necessarily the same size as the student's layer l 1 . This observation forces us to transform the size of f t 2 into the size of l 1 before we can calculate the XENT3 loss shown in Figure 3. To perform this last step, the existing layer l 1 is used so the L xent3 (y, x) input values will be x = l 0 (l 1 ( f t 2 )). This is best described in Equation (4). All this reasoning can be generalized to the case of a student with T teachers, as better described in Equation (5).   Figure 2, however, here we also add MSE(2) between the features of Translator2 and the student's layer Features2 and the third cross-entropy XENT(3) using the features of Translator2 and the student's classification layer.
Below, we formalize the loss function in the case in which we only have two teachers: where f t 1 and f t 2 are the translated features of the two teachers, f l 1 , f l 2 are the student features into which we want to transfer the learning of the teachers, l 0 it the last layer function or classification layer, and y is the labels. We can generalize the formula in Equation (4) for T-teachers using the function composition • as g • f (x) = g( f (x)). Consequently, the loss function of our model in the case where exactly T teachers and one student are used is the following: Algorithm 1 describes the proposed approach that uses more than one teacher in a more formal way.  compute back propagation wit loss L and update weights of m s , l i , t i 12: end function 13: for x, y ∈ batches do 14: train_step(x, y) 15: end for

Implementation Details
To implement our solution in an efficient way, we used back-propagation with the sum of all losses, but a different optimizer for each component: an optimizer for the translators and one for the student. In our situation, we used the same parameters so it is not fundamental, but it opens up the possibility of changing parameter for each optimizer.
To reduce memory usage, we can store the teacher's features on file and customize the data loader to obtain the features for each teacher associated with the input and output data. This way, the teacher is not included in the learning process and therefore reduces the complexity of the model. To evaluate and compare the results, we used the following accuracy measure: where correct is the number of samples correctly predicted by the model on whole test set and incorrect is the number of mistakes made by the model, so correct + incorrect represents the total number of samples available in the test set.

Data Augmentation
To further improve the student's generalization ability, we introduced a data augmentation strategy in the student network training process. In practice, we use a generic random generator able to represent a sequence of real numbers to build a two-dimensional array or image with random pixel values in the range of [0, 1]. We used these generated random images for each batch size as |batch size| = |batch size| * N where N is a chosen integer that represents the number of random images to be created. In this way, for each training batch, a fixed number of randomly generatedx i images were used by the teacher network to obtain the predictedŷ i label from it and were thus used to improve the student's generalization ability.
This solution represents a simple technique for augmenting data in the case of problems with a low number of classes. It is not recommended that this technique is used for problems with many classes.

Datasets
In this work, we used five image datasets to conduct the various experiments. We moved from less complex datasets such as Cifar10 to very complex datasets such as ImageNet. Below, we briefly describe the datasets used.
Cifar10 [12] dataset consists of 60,000 images divided into 10 classes (6000 per class) with a training size and test size of 50,000 and 10,000, respectively. Each input sample is a 32 × 32 color image with low resolution. The 10 classes are namely airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
Cifar100 [12] dataset consists of 60,000 images divided in 100 classes (600 per class) with a training size and test size of 50,000 and 10,000, respectively. Each input sample is a 32 × 32 color images with low resolution.
In Figure 4, some representative examples of the two datasets are shown, extracted from a subset of classes.
This dataset is a good dataset because they are very useful in many papers. However, we also wanted to test it with a more real and difficult dataset. Thus, we merged the Stanford Dogs [13,14] dataset with the dogs species inside Oxford Cats and Dogs [15]. This merged dogs dataset (SOD) contained 131 classes and was used with a fixed split to obtain 20,492 training images and 5066 test images. Cub200 2011 [16] is a dataset that we use to face up the fine-grained classification problem ( Figure 5). It contains 11, 788 images of 200 species of birds split into 5994 and 5794 images for training and testing, respectively. Each image contains 15 part locations, 312 binary attributes, and 1 bounding boxes containing the bird within the image; however, we only used input images and their labels. ImageNet is the most widely used dataset for image classification. This is mainly due to the high number of classes and the inhomogeneous collection images. We conducted an experiment using the ILSVRC2012 [17] version of this dataset, containing 1000 classes and 1.2 million images (see Figure 4). We used this dataset to obtain pre-trained models.

Experiments
In this section, we describe the set of conducted experiments. In particular, we conducted the following groups of experiments: • In a first group of experiments, we want to understand whether the hypotheses of the problems found in the Romero model can be overcome; • The second group of experiments serves to evaluate our proposed solution in which we add the translator linked to the new loss function; • A last group of experiments aims to evaluate the introduction of more than one teacher for a single student.
In general, in all the experiments, we used Adam Optimizer and the Cosine Annealing learning rate function and performed a total of 200 training epochs.

Hypothesis Validation
In this first group of experiments summarized in Tables 1 and 2, as described in Section 3.1, we eliminated the translator and changed the shape of the penultimate layer of the student to adapt it to the size of the features' layer of the teacher. To try to understand how this solution works, we used different neural network models on the Cifa10, Cifar100, and Cub200 datasets. Furthermore, we compared the results with the original approaches proposed in Romero et al. [2] and KD [1].
For the experiment on Cifar10 and Cifar100, the size of the input images was set to 32 × 32 and we also set a learning rate to 0.0001, the batch size was set at 128, and we performed a total of 200 training epochs. We can see in Table 1 the comparison with the KD and Romero et al. approaches. We can see how the KD approach (which does not use the translator for features but uses the original values) works well compared to the Romero et al. approach which uses a translator. Our solution, on the other hand, without the translator, is able to outperform any other proposed method. The effectiveness of the data augmentation solution (see Section 3.5) on Cifar10, which has a limited number of classes, is particularly noteworthy. Instead, this method is less effective on Cifar100, which has a much higher number of classes. Table 1 validates our hypothesis at the translator level: the model proposed by Romero et al. without the translator level generally works better.
In this last experiment of this first group, we used the same dimensions of the features as for the previous experiments and the aim was to validate the goodness of our model. Table 2 reports all numerical results. We trained a MobileNetV2 [18] on the CUB200 dataset using a pre-trained model NTS-Net [8,19] as a teacher. The MobileNetV2 student feature layer f l 1 was adapted to the size of the teacher that is 10240. We used a batch size equal to 10. In these experiments, we used 448 × 448 size input images obtained by scaling the original image to 600 × 600 and extracting a centered crop of size 448 × 448. In terms of the best results, the teacher-less experiments had a learning rate of 0.0001 which did not change over the epochs. Instead, for the experiment with the teacher, we have a learning rate of 0.001 and after 100 epochs, it is multiplied by 0.1. In Table 2, we can see the accuracy of the student when trained without a teacher compared to the version trained using a teacher. We highlight the capability to increase the learning process using the same baseline but without the teacher model. Table 2. In this table, we report the test accuracy obtained by a MobileNetV2 used as a student with an NTS-Net as a teacher over CUB200 dataset.

Name Without Teacher With Teacher
MobileNetV2 42.5

49.2
The NTS-Net [8] is a CNN successfully applied to solve fine-grained classification problems. This concatenates the features extracted from the whole image and four crops using attention mechanisms. We experimentally proved the effectiveness of our proposal over the CUB200 dataset and report the results in Table 2, where the student network can improve their accuracy thanks to teacher-student learning strategy.

Analysis of the Proposed Method with a Single Teacher and Student
As we saw in the previous group of experts, the problem is not the translator itself, but that the feature translation phase does not know that the features need to be translated for a classification problem that the student has to solve. From this observation, we introduced our proposal: the student topology is not modified and we introduced the translator again but this time, adding another cross-entropy loss function integrated with the transfer learning process. In this way, the student can learn from the teacher's features while also learning to classify.
In this group of experiments, we used the Lenet5 and Mobilenetv3 models as students and Resnet50 and InceptionV3 as teachers on Cifar10 and Cifar100. The solution with only one teacher and one student proposed by us is compared with the accuracy reported by the model proposed by Romero et al. and by the KD model. In this group of experiments, the student model has the feature layer not adapted to the teacher size, because we have the translator. As in the other experiments on Cifar10 and Cifar100, the size of the input images was set to 32 × 32 and we also set a learning rate to 0.0001, the batch size was set at 128, and we performed a total of 200 training epochs. The results are reported in Table 3. From the comparative results, we can see that the KD model, as we explained in Section 2 and as the authors demonstrated in their article [1], does not always show improvements in accuracy compared to the same model without transfer learning (in fact, their best result is obtained when they do the transfer learning between an ensemble and a single network, unlike what we do).
The approach proposed by Romero et al. gives good results in transferring learning between a complex network and a simplified version of the same network, but in a more generic case such as those shown in Table 3, we can see that its results are not exciting.
In conclusion, this group of experiments with Cifar10 and Cifar100, comparing our results with those of the other models, we can see that our proposal always improves accuracy using a single teacher and is therefore more reliable than the other solutions used.

Analysis of the Proposed Solution Using More Than One Teacher
Sometimes, the use of a second teacher can improve accuracy because the two teachers are trained on the same dataset but are two different types of networks. All the experts of this last group are reported in two tables. In Table 4, we analyze the behavior of our proposed model that uses more teachers, while in Table 5, we again analyze the model that uses more teachers but this time we want to understand whether the student is able to overcome one of the teachers thanks to transfer learning.
The experiments were conducted on two different datasets, and in particular, we chose to work on the SOD dataset (the results of which were reported in Table 4) and on the Cifar10 dataset (see results in Table 5) using a student with multiple parameters. In all of these experiments, however, we evaluate the version of our proposed model that uses two teachers.
Analyzing the data reported in the last line of Table 5, obtained on Cifar10, we can say that the student can exceed the performance of the teacher and of itself. In the experiment with two teachers, we use an Upsample layer instead of a linear layer related to the transfer between NTS-Net and Inceptionv3 as memory optimization.
However, to better understand the behavior of the proposed model that uses two teachers for the same student, in addition to analyzing the final accuracies reported in Table 4, we can see the behavior of the model during the entire training phase as shown in Figure 6. In this case, the student who uses two different teachers obtains the best results. We highlight in Figure 6 that the MobileNetv3 with two teacher approaches overcomes the others (orange and green lines). In this experiment, the student improves thanks to the contribution of two different teachers pretrained over the same dataset. Into the last row of Table 4, we used the same models demonstrating that two teachers pretrained over different datasets are more effective for the learning student process reporting an improvement in terms of accuracy. Table 4. We report the results after 200 epochs of a MobileNetV3 with a classical teacher and with a fine-grained teacher such as NTS-NET (demo available in [20]) on SOD datasets.

Student
Teacher/s Pretrained Accuracy  Figure 6. Test accuracy over the SOD dataset. The student network is a MobileNetv3. The teacher one is an NTS-NET. Teacher two is the ResNet50.
In this last case, we have two different teachers, and in particular, we have an NTS-NET [8] which is a model for fine-grained classification and a ResNet50 [21,22] which is a well-known classification model. Therefore, in this situation where the two teachers implement two different approaches to the classification problem, we obtain an improvement in the student's accuracy. In particular, in the last row of Table 4, we can see that if teachers are also trained on different datasets, then better results are obtained.
In the experiments of Table 4, we use a student who is smaller than the teacher in terms of the number of parameters, and therefore the results obtained by the student after transfer learning are never superior to those of the teacher. Thus, let us try on Cifar10 to replicate the experiment, this time using Inceptionv3 as a student and Resnet50 as a teacher. Resnet50 was also trained on ImageNet instead of using Cifar10 and NTS-Net was trained on SOD. As we can see in Table 5, both versions of ResNet50 used as teacher allow the student Inceptionv3 to overcome the results obtained by the teacher alone. We can conclude this last group of experiments by saying that the best results with the single teacher were obtained when the teacher was trained on the same dataset. With two teachers on different datasets, students are also able to outperform the single teacher, the only student, and the single teacher on ImageNet.

Conclusions
Generally, the transfer learning technique is used as a fine-tuning technique to continue the training phase by using the weights learned on a huge dataset in order to achieve better and faster convergence performance. However, a transfer learning technique requires having the same model. In this work, we designed an efficient transfer learning methodology to train a student model from one or more than one pretrained different models without the need to have the same architectures. We confirm that a weak network (student) can learn more using a teacher. If we have two teachers with different skills (in terms of models and datasets), then the student is able to better improve its performance by using two teachers rather than just one. According to "The Lottery Ticket Hypothesis" [23], a smaller network can be equal to a larger one by pruning. Thus, using this hypothesis, we can apply our proposal in the future to find the best pruned network. Our proposal can transfer the learning from any networks so it is possible to test the transfer between a teacher trained on text classification to a student on image classification. In conclusion, we propose a new methodology to transfer learning from different models in a different way from the classical transfer learning techniques, demonstrating the effectiveness of our proposed approach on many different combinations of models and datasets.