Multistructure-Based Collaborative Online Distillation

Recently, deep learning has achieved state-of-the-art performance in more aspects than traditional shallow architecture-based machine-learning methods. However, in order to achieve higher accuracy, it is usually necessary to extend the network depth or ensemble the results of different neural networks. Increasing network depth or ensembling different networks increases the demand for memory resources and computing resources. This leads to difficulties in deploying depth-learning models in resource-constrained scenarios such as drones, mobile phones, and autonomous driving. Improving network performance without expanding the network scale has become a hot topic for research. In this paper, we propose a cross-architecture online-distillation approach to solve this problem by transmitting supplementary information on different networks. We use the ensemble method to aggregate networks of different structures, thus forming better teachers than traditional distillation methods. In addition, discontinuous distillation with progressively enhanced constraints is used to replace fixed distillation in order to reduce loss of information diversity in the distillation process. Our training method improves the distillation effect and achieves strong network-performance improvement. We used some popular models to validate the results. On the CIFAR100 dataset, AlexNet’s accuracy was improved by 5.94%, VGG by 2.88%, ResNet by 5.07%, and DenseNet by 1.28%. Extensive experiments were conducted to demonstrate the effectiveness of the proposed method. On the CIFAR10, CIFAR100, and ImageNet datasets, we observed significant improvements over traditional knowledge distillation.


Introduction
The development of deep learning [1,2] has led to a leap in the fields of computer vision [3][4][5][6] and natural language processing [7][8][9][10][11]. In image recognition in particular [4,12,13], recognition accuracy has reached a high level by using deep-learning methods. However, high-quality models are often accompanied by huge parameter quantities, huge computing resources, and huge storage requirements [3,14,15]. The huge demand for resources is an important obstacle to the promotion and use of deep-learning models in the industry. Especially in resource-preferred scenarios such as FPGA, mobile devices, and microcomputers, the contradiction between performance improvement and resource occupancy is more intense. Traditionally, training a deeper network or merging multiple models [16,17] may achieve better performance, but this cannot avoid the growth of resource consumption. The problem of how to improve performance without increasing network size has received extensive attention. Some training methods, such as model compression [18,19] and model pruning [20], have been proposed to solve this problem. multistructure online distillation enhances information diversity in the distillation process and improves the accuracy of the model through multinetwork cooperation. Extensive experimentation was carried out on image-recognition issues using popular network architectures and achieved the highest performance improvement. Our approach has the following advantages: 1. Effectively utilizes diversity between different structural models. 2. Demand for network resources is low and applicability is stronger. 3. Network performance achieved the highest improvement while not increasing resource occupation.
The organization of this paper is as follows. Section 2 briefly reviews the related work, and Section 3 describes our approach. The experiment is shown in Section 4, and we summarize this work in Section 5.

Related Work
There is a lot of work to improve the performance of deep-learning models through knowledge transfer. This section describes some of the network-interaction methods that are relevant to our work.
In the study of network interaction, some distributed methods [26][27][28][29][30] are used to accelerate the training of the network. For example, the parameter average MA [28][29][30] method and the distributed stochastic gradient descent algorithm [31] are widely used to accelerate the training process through multinode information exchange. In addition, multitasking learning [32][33][34][35] mainly plays a role in feature selection to lift accuracy. In multitask learning, there are multiple training objectives at the same time. The network has different characteristics for feature selection based on different targets, forcing the network to learn complementary features to improve network performance. The ensemble method integrates the output of multiple networks to generate more flat predictions than a single network, improving the generalization of the model. There are many studies integrating multiple models, such as separated score integration (SSI) [36], Bayesian model averaging [37], and score fusion based on alpha integration [38]. These methods inspired us to design our own training framework.
Our approach is based on knowledge distillation [21,22,25,39,40]. Hinton, Geoffrey proposed knowledge distillation [22] that allows small models to learn the knowledge in a pretrained large model. The main motivation for distillation is that the teacher model's soft label provides knowledge of interclass similarity that cannot be provided by the ground-truth goal. The traditional method of knowledge distillation is limited by the direction of information flow, so it cannot improve performance on large networks. Lan, Zhu, and Gong improved knowledge distillation and proposed distillation method ONE [41], which uses a set of multibranch student models to learn from each other. Mutual imitation to enhance the learning of the target network, but ONE shares the low-level features lacking diversified feature learning. Codistilling [39] online distillation through large-scale distributed training can accommodate more nodes for training and accelerates the training process.
Existing distillation methods lack the inclusiveness of the network structure, which limits the complementary information that the structure can provide. In addition, we found that the distillation process forced multiple models to fully adapt to uniform soft-label output, which actually weakened the diversity of student development, which was detrimental to the end result.
In this paper, we created a distributed cross-structure online-distillation method and loosened distillation constraints to enhance network diversity. We used soft labels that are independent of the network structure as the information-exchange medium, abandoning the parameters and gradients commonly used for traditional distributed-information exchange. Soft labels meet the needs of information exchange during the distillation process and are structurally independent, allowing our students to combine multiple structural models. The ensemble method is helpful to form teachers who are better than previous knowledge-distillation methods. Considering that continuity distillation leads to the simplification of model information, we used interval distillation to enhance model diversity, and gradually enhanced distillation to capture small information differences.

Methodology
This section introduces our methods. First, in the overview, we introduce our methods and principles. Then, details of the implementation of the method are introduced in the following section.

Overview
As Figure 1 shows, we used multiple networks for distributed joint training; each network is a student. The teacher network is the knowledge ( S e ) aggregated by all student networks. Training is divided into two stages, independent training and distillation. In the independent-training phase, the model learns the relationship between training data and the ground-truth label by minimizing cross-entropy loss. After a certain independent-training period, students' soft labels are uploaded to the server. In the server, we aggregated student information by calculating the weighted average values of their soft labels ( S 1 , S 2 , ..., S N ). Then, the aggregated soft labels ( S e ) were sent to each student to guide students' distillation training. Then, in the distillation process students are optimized by minimizing Kullback-Leibler divergence between fixed aggregated soft labels S e and students' soft labels S i . After distillation is finished, we return to the independent-training stage. Overview of our approach. Multiple students work together, each student is a separate model, and the teacher was the aggregated information ( S e ) from multiple student networks. First, students train their models by minimizing cross-entropy loss L C to learn from ground-truth label Y. Then, the server aggregates student information ( S 1 , S 2 , ..., S N ) to generate teacher information ( S e ). Finally, the teacher feeds back the student by minimizing Kullback-Leibler divergence L KL between fixed aggregated soft labels S e and student soft labels S i during the distillation phase. Distributed Training. We distributed training a group of students and, by regularly aggregating information in the server, we used the aggregated information as the teacher in distillation. Soft labels were used for information transmission instead of parameters or gradients. Soft labels have network-independent characteristics, so we could mix networks of different structures. Soft labels do not need to be updated at every step, longer transmission interval is allowed. Our distributed training has three advantages: reducing network overhead, compatibility with different network structures, and expansion-capacity enhancement.
Ensemble gets better teachers. The distillation method improves network performance because of supplementary information. Different network structures and different initializations can generate different knowledge. The model expresses the relationship between data and labels by probability output. Soft labels (as shown in Figure 2) are the softening probabilities that not only concern sample classification, but also sample similarity. The similarity relationships provide additional information during distillation. We collected soft labels from all students and ensembled them by calculating the weighted average values. The aggregated soft labels had better generalization performance, abstracted as the teacher in distillation.   Better distillation. Additional knowledge is important for distillation. Compared with the previous method, we used interval distillation to expand information diversity. In our approach, the teacher is generated through the aggregation of students. The teacher has more supplementary information, while the students are more diverse. The paradox is that all students learn from the same teacher, which would make students' knowledge consistent and lack complementarity. We joined an independent-training phase without distillation constraints, so that students could learn independently and have diversified development. Switching the loss function between independent training and distillation causes the gradient value to change dramatically, which may lead to loss of model information. We gradually increased the limitation of teacher guidance (as Equation (10) shows) in the distillation process to retain more details.
The training process is shown in Figure 3. Our approach is a multicycle process. Each cycle can be roughly divided into five steps. These are independent training, distillation training, and information, transmission, reception, aggregation. In this paper, we propose a cross-architecture online-distillation approach that improves classification accuracy without changing the network structure. We combined distillation and independent training in a cycle. The weighted average ensemble method was used to synthesize teacher knowledge; information on multistudents was extracted and utilized. The teacher guides students training through knowledge distillation. Student performance improved through three stages: learning from local data distribution, generating teacher information by aggregating students' information, and learning from the teacher by distillation.

Independent Training
For a C class-classification task, assume there are N samples of input X = {x i |i ∈ (1, 2, ..., N)}, and the corresponding ground-truth labels Y = {y i |i ∈ (1, 2, ..., N)}. In general, ground-truth label y i is a one-hot vector with a dimension of 1 × C, y i is equal to 1 only in the position of the category to which sample x i belongs, and equal to 0 in other positions. We take the i-th student as an example to introduce the training steps.
The student networks optimize the model by minimizing cross-entropy loss between predicted value p i and ground-truth label y i during the independent-training process. Students do not have information interaction, which is conducive to increasing the diversity of model information. Students' loss function could be freely chosen according to their respective situations. Here, we take the cross-entropy-loss function as an example: where p k i represents the normalized probability that a student model F s classifies input x i as the k-th class.
Logits is the output of the penultimate layer of the convolutional neural networks (CNN) used to classify the problem. In the CNN, input x i is subjected to feature processing to obtain a feature vector Then, we used feature-classification layers (linear layer) to map the feature space to the sample mark space by linear transformation, and the output is logits. The linear relationship between logits g i and feature f (x i ) can be abstracted as C is the number of classes), and logits g i is a vector of dimension 1 × C. In general, logits g i is normalized by the softmax layer to obtain classification-probability output p i . Assume that logits were obtained by inputting x i of the student network F s is [g 1 i , g 2 i , ..., g C i ]. Use the following formula to calculate normalized classification probability p k i .
Normalized probability p i characterizes the model's confidence in the classification decision of x i . The closer the p k i value is to 1, the higher the confidence of x i belonging to the k-th class. To optimize the model parameters from θ t to θ t+1 at the t-th iteration, we minimized loss with a back-propagating algorithm. The model was optimized as follows: ∂θ t is the partial derivative gradient of cross-entropy loss L C (y t |F s (θ t , x t )) to θ t .
Independent training in each cycle lasts for a T in epoch. As the loss of L C decreases, the student model fits the data distribution more accurately, and the more likely the sample is to be mapped to the correct classification. In the independent-training phase, student models do not communicate with each other, and they are more likely to fall into different local optimums. Student models have inconsistent information that can produce more complementary information.

Information Aggregation
We used soft labels as the medium of information. Soft labels are a softer normalized probability value that is more gradual and emphasizes the relationship between samples and classes. The soft-label calculation method is as follows: Parameter t is the temperature parameter that is used to increase the degree of relaxation of the soft label and emphasis on the secondary category. However, a too-large t causes confusion in the category, and we chose t = 3 in our experiment.The soft-label aggregation corresponding to all inputs X is recorded as S, S = { p 0 , p 1 , ..., p N−1 }.
At the end of the independent-training period, student model F si calculates its own soft labels S i generated for input X, obtains a new data relationship (X, S i ), and then uploads S i to the server.
Using a soft label to calculate cross-entropy loss: Loss function L C (Y| S) is convexly optimized for soft-label value S. For the nature of convex optimization, the weighted combination of any two students' cross-entropy loss has the following characteristics: Simple promotion to any number of student models: We used the weighted average approach to aggregate student information. The weighted average method is equally simple compared to the average method, but the weights can be set based on model performance to reduce the impact of interference information. Using the weighted average method can ensure that the overall cross-entropy loss of student models reducing (according to Equation (7) shows), the information of the student is effectively integrated to reduce overfitting. The students' score weighted average process is as follows: In the formula, S i are the soft labels of i-th students, M is the student number, a i is the weight, S e are the aggregated soft labels. Weighted value a i of the student's soft labels is determined by the accuracy: Where M is the number of student models and v i is the model accuracy of student model F si .

Distillation Fusion
In the distillation-fusion phase, aggregated soft labels (as teacher knowledge) S e are used to guide student training. Minimizing the Kullback-Leibler divergence of S e and widetildeS i to urge students to learn from the teacher. Kullback-Leibler divergence describes the discrepancy between students' soft-label distribution widetildeS i and the teacher's distribution S e . The Kullback-Leibler divergence-loss calculation formula of the i-th student is as follows: p k i is the soft-label probability of S e for input sample x i on the k-th class, and p k i is the soft-label probability for the x i in the k-th class on soft-label S i of the i-th student. S e is the teacher model information that is sent to the students by the server in the previous step. S e is fixed in the distillation-fusion stage, and S i is the soft-label output of the student model, updated with the update of the model parameters.
Kullback-Leibler divergence combines with standard cross-entropy loss L C to maintain the target of the ground-truth label value. We used a weighted approach to balance the proportion of Kullback-Leibler divergence loss and cross-entropy loss. The loss function of the i-th student in the distillation-fusion phase is as follows: The weights of cross-entropy loss and Kullback-Leibler divergence loss in the distillation-fusion process are α and β. We gradually changed them other than using constant values, which we call "gradual knowledge transfer". Specifically, gradually reducing the weight of cross entropy while gradually increasing Kullback-Leibler divergence to achieve smooth loss transition in distillation.
Here, r is the epoch number of the distillation, and r is set to 0 at the start of each training cycle. The work of distillation in generations [42] shows that gradual knowledge transfer is more effective. We note that teachers should guide students step by step. Our approach is to gradually enhance the guidance role of the teacher in each distillation cycle, which has been experimentally proven to be simple and effective. Gradually reducing the proportion of standard cross entropy during the distillation process is conducive to the smooth transition of the loss function, preventing large changes in the gradient value from causing the network to collapse. At the same time, gradually increasing Kullback-Leibler divergence loss and strengthening the guidance role of the teacher are conducive to students' better learning.
To optimize model form θ t to θ t+1 at the t-th iteration in distillation process, operations are a similar form of independent training.
where η is the learning rate, (x t , y t ) is the input data of the i-th distillation iteration, and ∂L i ∂θ t is the partial derivative gradient of distillation loss L i to θ t .
The student model performs T d epoch distillation training in a training cycle. After the student model has trained the model through T in rounds of independent training and T d rounds of distillation training in a cycle, the model continues as the starting point for next cycle.
The deployment and training processes of the model are shown in Algorithm 1. Unlike general distillation methods, independent training and distillation training are included in one training cycle, while the aggregated information representing the teacher is updated in each new cycle. Student models participating in the training can have different network structures, but all students follow the same training process. End training until all student models are in a state of convergence. It is not allowed to exit early, so as to avoid a reduction in the overall amount of information in the model. After training is completed, select the best model among students for deployment. A collection of multiple student networks may also be chosen under conditions of sufficient memory resources and computational resources.  (2)); Calculate cross-entropy loss (Equation (1)); SGD backpropagation optimization model Θ i (Equation (3)); Testing; Sending soft labels S i to server; Receiving aggregated soft labels S e from server; while r<T d do Calculating standard class probability p k i and soft probability p k i (Equations (2, 4)); Calculate cross entropy loss and Kullback-Leibler divergence (Equations (1,9)); Get ultimate loss L i (Equations (10, 11)); SGD backpropagation optimization model Θ i (Equation (12)); Testing; Model deployment: Θ i ;
Neural Networks: We used seven networks in our experiments, AlexNet [44], VGG19 [45], ResNet18, ResNet50, ResNet110 [3], SqueezeNet1-1 [18], and DenseNet100 [46]. Implementation Details: We used the PyTorch framework and python gRPC (communication uses) to conduct all the following experiments. The student in the experiment was configured with an Nvidia 1080Ti graphics card, and the server could use the host of the CPU processor. We followed the training strategy and first initialized learning rate 0.1, then dropped the rate from 0.1 to 0.01 halfway (50%) through training, and to 0.001 at 75%. For the hyperparameters involved in the experiment, we used T in = 10, T d = 30 in the CIFAR experiment, while T in = 5, T d = 10 in ImageNet. Table 1 compares the top-1 accuracy performance of varying-capacity state-of-the-art network models trained by independent conventional training and our collaborative online-distillation learning algorithms on CIFAR10/CIFAR100. From Table 1, we can make the following observations: (1) All the different networks benefit from our collaborative online-distillation learning algorithm, particularly when small models collaboratively learn with large-capacity models. Top-1 accuracy improved by 5.94% (49.79-43.85) for AlexNet when training together with VGG. This suggests a generic superiority of our online knowledge distillation across different architectures. (2) Performance gains on classification tasks with more classes (CIFAR100 VS CIFAR10) were higher for all networks. This is reasonable because richer interclass knowledge is transferred across individual architectures in an online manner to facilitate model optimization, indicating the favorable scalability of our method in solving large classification problems. (3) All of the large-capacity models benefit from joint training with small networks. Especially when ResNet collaboratively trained with VGG, top-1 accuracy improved by 3.73% (77.75-74.02). Table 1. Top-1 accuracy (%) on the CIFAR10 and CIFAR100 datasets. I represents independent training and MD represents our method. m1 and m2 are the abbreviations of Model1 and Model2.

Model1
Model2 I (m1) I (m2) MD (m1) MD (m2) Experiment Results on ImageNet We tested the large-scale ImageNet with three different networks: ResNet-18, ResNer-50, and Squeeze1-1; results are shown in Table 2. Overall, we observed a similar performance comparison with these networks as on CIFAR10/CIFAR100. This indicates the superiority of our method when testing on large-scale image-classification settings. ImageNet is large, containing more than 1000 types of image datasets. On ImageNet, we verified that this method is effective for complex tasks with very large datasets.

Comparison with Conventional Distillation Methods
DML and Ensemble-Compression (EC-DNN) are the two most advanced distillation methods. In DML, we use two models for mutual distilling, while the EC-DNN method uses ensemble compression to improve model performance. This shows that our operation is effective: independent training was added to the process of model distillation, which increased the model difference with gradually incremental distillation weights to fuse polymerization information in distillation. Our approach still has tremendous advantages compared with state-of-the-art distillation methods, DML [25] and EC-DNN [47]. Table 3 shows that our method, MD, is almost always higher than DML and EC-DNN on CIFAR10/CIFAR100 on three typical networks: AlexNet, VGG, and ResNet. In Figure 4, wee see that MD continuously achieved higher Top-1 accuracy.

Ablation Study
Effect on Student Number. Cluster learning enhances the effects of model aggregation. Intuitively, different students have individual prediction distributions on the same samples. In the process of aggregating the distribution of students, the more students involved, the smaller the common error. The experiment results of AlexNet and VGG19 on CIFAR10 in Figure 5 show that the more nodes that participate in training, the better the training effect is. However, we also noticed that the bigger the network number is, the slower the growth of model accuracy. Analysis of Experimental Parameters. We explored the effects of important parameters in our method, including temperature parameter t, the number of independent training epochs T in and distillation epochs T d . On the CIFAR10 data set, we use two nodes for cooperatively distillation and setting T in = 10, T d = 30 to explore the influence of temperature parameters t. The results are shown in Table 4. We have the following conclusions: 1. The temperature parameters t have less impact for the same structures (AlexNet + AlexNet) compared to different structures (AlexNet + VGG19). We compared the logits statistics of AlexNet and VGG19 after 50 epochs training on CIFAR10. For AlexNet, the maximum, minimum and variance of logits are (24.41, −24.2, 8.15) while (20.5, −11.3, 11.42) for VGG19. It is more important to use an appropriate temperature parameter to unify the classification score scale for networks with different structures.
2. In the experiments of (AlexNet + VGG19), the best results (AlexNet: 80.23, VGG: 93.81) are obtained with t = 3. The classification accuracy decreases when t > 5, which show that a bigger t obscures the major class probability. The gradient explosion occurs when t = 0.5 or 1. A smaller t leading to the loss of secondary class information, and the values of soft labels are close to discrete 0 and 1 which causes the difficulty in model optimization with Kullback-Leibler divergence. Table 4. Setting the number of independent training epochs T in = 10 and the number of distillation epochs T d = 30, the influence of temperature parameters on accuracy (%) is studied. The same structure (AlexNet + AlexNet) and different structure (AlexNet + VGG19) are compared in the experiment. Setting temperature parameter t = 3, the relationship between the number of independent training epochs T in and distillation training epochs T d is shown in Table 5: Table 5. Setting temperature parameter t = 3, the number of independent training epochs T in = 10, change the number of distillation epochs T d , the accuracy (%) of cooperative training of two nodes (AlexNet + AlexNet) using our method. We observed that the best results were obtained when T in = 10, Td = 30. Fixed T in = 10, with the increase of the number of distillation epochs T d , the accuracy increases gradually and stop increasing after T d = 30. Such changes show that a short distillation process cannot adequately transfer teacher's knowledge to the student networks, resulting in the loss of information.
Gradual Knowledge Transfer. An article on knowledge distillation in generations [42] suggests that teachers who are tolerant usually educate better teachers and produce better results by gradually increasing the constraints of the teacher. We used a multicycle training process in which gradual strict KL constraints are used in each cycle of the distillation phase. Validated on CIFAR10 and CIFAR100, our method compares the use of generally incremental distillation weights ((gMD), T in = 10, T d = 30, α = 1 − r/T d , β = 1 + r/T d ) with fixed distillation weights (fMD), T in = 10, T d = 30, α = 1, β = 2). Experiment results are shown in Table 6. On both the CIFAR10 and CIFAR100 datasets, AlexNet and VGG19 were more accurate at using gMD than using fMD. Increasing teacher constraints enhances the distillation effect.

Conclusions and Future Work
In this paper, we proposed a collaborative framework using ensemble and distillation mechanisms, which helps participating nodes learn to integrate the network information of other nodes during training to achieve better performance. Our approach uses soft labels to pass model information and make it compatible with different networks and training methods. The use of progressively increasing distillation and adding independent-training phase constraints enhances the effectiveness of the method. The experiment verified that our method surpassed the best distillation method of the time. Our method has wide applicability in classification tasks. In the hopes of helping other researchers, competition members, and industries to train better networks, our future work will explore the method of intermediate distillation and the effectiveness of distillation in other scenarios.