Multiple-Stage Knowledge Distillation
Round 1
Reviewer 1 Report
The article presents an approach to training the trainee network that utilizes knowledge distilled from the teacher network at different levels of the neural network processing chain. Although distillation of knowledge at various levels of a multi-stage network is neither new nor seldom used, here it is applied in a novel way: the distilled knowledge of the teacher and trainee networks is used to adjust the trainee at the same, equal level of processing, in a somewhat more tightly coupled structural layout of both network topologies. The paper practically and quantitatively demonstrates the success of learning adjustment at the earliest stages of processing, showing improvement in both the precision and the performance of the trainee network more readily and faster than traditional approaches do.
Furthermore, regarding the presentation of the article's content and style, I can confirm that the article is styled and written in a decent form, with enough explanatory detail for experts in the field of the article's topic.
Author Response
Response to Reviewer 1 Comments
Point 1: The article presents an approach to training the trainee network that utilizes knowledge distilled from the teacher network at different levels of the neural network processing chain. Although distillation of knowledge at various levels of a multi-stage network is neither new nor seldom used, here it is applied in a novel way: the distilled knowledge of the teacher and trainee networks is used to adjust the trainee at the same, equal level of processing, in a somewhat more tightly coupled structural layout of both network topologies. The paper practically and quantitatively demonstrates the success of learning adjustment at the earliest stages of processing, showing improvement in both the precision and the performance of the trainee network more readily and faster than traditional approaches do.
Furthermore, regarding the presentation of the article's content and style, I can confirm that the article is styled and written in a decent form, with enough explanatory detail for experts in the field of the article's topic.
Response 1: Thank you very much for your positive comments and appreciation of the manuscript.
Author Response File: Author Response.pdf
Reviewer 2 Report
1- I do not see the motivation of the study; state the motivation of the study clearly in the abstract.
2- We see the main objectives in the introduction; the main contributions should be stated there as well.
3- All figures, equations, and tables should be mentioned in the text before they are used.
4- Give the details of the dataset in the form of a table to make it easier for readers to understand.
5- Add numerical results to the conclusion.
6- Improve the quality of the paper by citing recently published work in the field.
https://journal.esj.edu.iq/index.php/IJCM/article/view/100
Author Response
Response to Reviewer 2 Comments
Point 1: I do not see the motivation of the study; state the motivation of the study clearly in the abstract.
Response 1: We deeply appreciate your suggestion. According to your suggestion, we have
described the motivation for the study more clearly in the abstract (Line 6, page 1). Thanks again for your valuable comment.
Point 2: We see the main objectives in the introduction; the main contributions should be stated there as well.
Response 2: We deeply appreciate your suggestion. According to your suggestion, we have added the main contributions in the introduction (Line 75, page 3). Thanks again for your valuable comment.
Point 3: All figures, equations, and tables should be mentioned in the text before they are used.
Response 3: Thanks for your suggestion. According to your suggestion, we have made the corresponding changes in the revised manuscript. Thanks again for your valuable comment.
Point 4: Give the details of the dataset in the form of a table to make it easier for readers to understand.
Response 4: Thanks for your suggestion. According to your suggestion, we have presented the details of the dataset in a table, as shown in Table 1. Thanks again for your valuable comment.
Point 5: Add numerical results to the conclusion.
Response 5: We are grateful for the suggestion. According to your comment, we have added relevant numerical results in the conclusion section (Line 340, page 15). Thanks again for your valuable comment.
Point 6: Improve the quality of the paper by citing recently published work in the field.
https://journal.esj.edu.iq/index.php/IJCM/article/view/100
Response 6: Thanks for your suggestion. According to your comment, we have cited the suggested work in the manuscript as reference [4]. Thanks again for your valuable comment.
Author Response File: Author Response.pdf
Reviewer 3 Report
The paper presents variants of multi-stage knowledge distillation (KD) methods for transfer learning from a teacher network model to a student network model. The one-to-one method employs a multi-exit architecture in both the teacher and student models. The student model can then mimic the logits of the teacher model at each exit stage. Although this paper does not propose novel technical improvements beyond the current literature in this domain, the authors adopt new improvements to existing KD methods. There are some minor problems in the paper that need to be fixed:
A. One of my main concerns about this paper is the effectiveness of the proposed one-to-one multi-stage KD method. The framework and experiment results have been discussed in good detail in sections 3-6. However, some problems need to be addressed as follows:
- How are the teacher/student network models (e.g., ResNet152) split into multiple parts (multi-exit)?
- With the description of the OtO model in section 3.3 and Figure 3, my impression is the student model is very similar to the teacher model. Does it make sense as we expect KD to be used for neural network compression, and the student model should be much simpler than the teacher model?
- The experimental results were collected mainly on the image dataset. What is the effectiveness of the proposed method on other kinds of data such as text (for NLP), video, etc.?
B. The literature review covered details of KD and multi-exit architectures. However, there is much other related KD research. Please look at the following related works and others:
- Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), 1789-1819.
- Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020, April). Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 04, pp. 5191-5198).
- Wang, L., & Yoon, K. J. (2021). Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Aguilar, G., Ling, Y., Zhang, Y., Yao, B., Fan, X., & Guo, C. (2020, April). Knowledge distillation from internal representations. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 7350-7357).
- Xu, G., Liu, Z., Li, X., & Loy, C. C. (2020, August). Knowledge distillation meets self-supervision. In European Conference on Computer Vision (pp. 588-604). Springer, Cham.
- Allen-Zhu, Z., & Li, Y. (2020). Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816.
- Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., & Ramabhadran, B. (2017, August). Efficient Knowledge Distillation from an Ensemble of Teachers. In Interspeech (pp. 3697-3701).
- Beyer, L., Zhai, X., Royer, A., Markeeva, L., Anil, R., & Kolesnikov, A. (2022). Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10925-10934).
- Liu, Y., Zhang, W., & Wang, J. (2020). Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 415, 106-113.
C. The references for the following terms/phrases need to be cited: CIFAR100, Tiny Image Net (line 70), softmax function (in section 3.1), Kullback-Leibler divergence (in section 3.1), VGG8, VGG13, ResNet18, ResNet34 (section 4.2), SOTA HSAKD (line 220), and Grad-CAM (line 276).
D. In Figures 2-3, components such as the FC and Attention modules need to be explained in more detail in the paragraphs that mention these figures.
E. Figures 4, 5, 6, and 7 can be enlarged as some texts inside those figures are quite small.
F. Some places in this paper need to be improved in written English or formats. Please check the comments in the attached PDF.
Author Response
Response to Reviewer 3 Comments
Point 1: One of my main concerns about this paper is the effectiveness of the proposed one-to-one multi-stage KD method. The framework and experiment results have been discussed in good detail in sections 3-6. However, some problems need to be addressed as follows:
- How are the teacher/student network models (e.g., ResNet152) split into multiple parts (multi-exit)?
Response 1: We deeply appreciate your suggestion. According to your comment, we have added a more detailed description of the teacher/student network division in the Model Architecture section of Section 4.1 (Line 191, page 9). The division of the teacher/student network into multiple parts is mainly based on the constituent blocks of the network itself; for example, the ResNet152 network is divided according to its constituent ResBlocks, and an early exit is added after each constituent block. Thanks again for your valuable comment.
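For illustration, the following is a minimal sketch, assuming PyTorch/torchvision, of splitting a ResNet-style backbone at its constituent blocks and attaching a small classifier head after each intermediate stage; the class and variable names (MultiExitResNet152, EarlyExit) are our own illustrative assumptions and are not taken from the manuscript's code.

```python
# A minimal sketch (not the manuscript's code) of a multi-exit ResNet152:
# the backbone is split at its own constituent blocks (layer1..layer4) and an
# early-exit classifier is attached after each of the first three stages.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet152


class EarlyExit(nn.Module):
    """Small classifier head attached after an intermediate stage."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):
        return self.fc(torch.flatten(self.pool(x), 1))


class MultiExitResNet152(nn.Module):
    """Returns the logits of every exit: three early exits plus the final exit."""

    def __init__(self, num_classes: int = 100):
        super().__init__()
        base = resnet152(weights=None)
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)
        # Split along the network's own constituent blocks (layer1..layer4).
        self.stages = nn.ModuleList([base.layer1, base.layer2, base.layer3, base.layer4])
        # Output channels of layer1..layer3 in ResNet152 are 256, 512, 1024.
        self.exits = nn.ModuleList(EarlyExit(c, num_classes) for c in (256, 512, 1024))
        self.final_fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        x = self.stem(x)
        logits = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.exits):  # early exit after each of the first three stages
                logits.append(self.exits[i](x))
        logits.append(self.final_fc(torch.flatten(F.adaptive_avg_pool2d(x, 1), 1)))
        return logits
```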
- With the description of the OtO model in section 3.3 and Figure 3, my impression is the student model is very similar to the teacher model. Does it make sense as we expect KD to be used for neural network compression, and the student model should be much simpler than the teacher model?
Response 1: Thanks for your suggestion. We would like to explain this. First, the early-exit structures added to the teacher and student networks are similar, but the original teacher and student themselves differ in size, and this gap remains after similar structures are added. The student network with early exits is still much smaller than the teacher network with early exits, so the model compression effect is still achieved.
Second, the student network is trained to become a multi-exit student network, and the early branch networks of the student are then able to outperform those of the baseline method, as shown in Tables 2 and 3. The student branch networks are smaller and more accurate than the baseline, achieving model compression.
Finally, knowledge distillation can be used for either model compression or model enhancement, with the ultimate goal of enabling improved student network performance. It can be used as a model enhancement method without considering model compression. For example, in the first column of Table 3, the accuracy of the model is improved with the same teacher networks and student networks, and the effect of model enhancement is achieved. Thanks again for your valuable comment.
- The experimental results were collected mainly on the image dataset. What is the effectiveness of the proposed method on other kinds of data such as text (for NLP), video, etc.?
Response 1: We are extremely grateful to you for pointing out this problem. Since the related references explore this kind of method only on image classification, we have not yet applied the proposed method to other data types. This was an oversight on our part. Thank you for your valuable suggestion; we will study the effect of the proposed method on other data types in our future work. Thanks again for your valuable comment.
Point 2: The literature review covered details of KD and multi-exit architectures. However, there is much other related KD research. Please look at the following related works and others:
- Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), 1789-1819.
- Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020, April). Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 04, pp. 5191-5198).
- Wang, L., & Yoon, K. J. (2021). Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Aguilar, G., Ling, Y., Zhang, Y., Yao, B., Fan, X., & Guo, C. (2020, April). Knowledge distillation from internal representations. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 7350-7357).
- Xu, G., Liu, Z., Li, X., & Loy, C. C. (2020, August). Knowledge distillation meets self-supervision. In European Conference on Computer Vision (pp. 588-604). Springer, Cham.
- Allen-Zhu, Z., & Li, Y. (2020). Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816.
- Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., & Ramabhadran, B. (2017, August). Efficient Knowledge Distillation from an Ensemble of Teachers. In Interspeech (pp. 3697-3701).
- Beyer, L., Zhai, X., Royer, A., Markeeva, L., Anil, R., & Kolesnikov, A. (2022). Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10925-10934).
- Liu, Y., Zhang, W., & Wang, J. (2020). Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing, 415, 106-113.
Response 2: Thanks for your comments. The papers you provide are very helpful for a comprehensive understanding of knowledge distillation, so we have cited them in the manuscript as references [8], [9], [18], etc. Thanks again for your valuable comment.
Point 3: The references for the following terms/phrases need to be cited: CIFAR100, Tiny Image Net (line 70), softmax function (in section 3.1), Kullback-Leibler divergence (in section 3.1), VGG8, VGG13, ResNet18, ResNet34 (section 4.2), SOTA HSAKD (line 220), and Grad-CAM (line 276).
Response 3: Thank you for pointing out this problem in the manuscript. According to your suggestion, we have added the relevant references. Thanks again for your valuable comment.
Point 4: In Figures 2-3, components such as the FC and Attention modules need to be explained in more detail in the paragraphs that mention these figures.
Response 4: We deeply appreciate your suggestion. According to your comment, we have added a more detailed interpretation of FC and Attention modules in the Model Architecture section of Section 4.1 (Line 193, page 9). Thanks again for your valuable comment.
Point 5: Figures 4, 5, 6, and 7 can be enlarged as some texts inside those figures are quite small.
Response 5: Thank you for pointing out this problem in the manuscript. According to your comment, we have enlarged the relevant figures to ensure that the text is clear. Thanks again for your valuable comment.
Point 6: Some places in this paper need to be improved in written English or formats. Please check the comments in the attached PDF.
Response 6: We deeply appreciate your suggestion. We are very sorry, but we could not find the attached PDF. The manuscript has nevertheless been revised by a native English speaker. Thanks again for your valuable comment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Dear Authors;
I have some comments on your manuscript, and I ask that you read them carefully and make the required changes and modifications, which are:
1- For Equation 4, are τ and τ² the same parameter? And you have to explain how τ² works to keep the gradient contributions constant.
2- For Figure 2, what are the green boxes in the teacher section?
3- In the 7th row of Section 3.2 (MSKD), could you write a few words to define the difference between the true labels and the output of the teacher network, to make the meaning of the classification loss and the distillation loss clearer? You have defined the classification loss and distillation loss as the losses obtained by comparing with ..., but comparing between what?
4- For Equation 5, you have to mention the main difference between Ps(xi,1) and Ps(xi,τ), and what is the range of τ values?
5- I could not find any difference between Eq. 5 and Eq. 8; could you give more details?
6- I suggest you mention some numerical results in the Conclusion section.
Author Response
Response to Reviewer 4 Comments
Point 1: For Equation 4, are τ and τ² the same parameter? And you have to explain how τ² works to keep the gradient contributions constant.
Response 1: We deeply appreciate your suggestion. For Equation 4, τ and τ² refer to the same parameter, the temperature τ. According to your comment, we have added a more detailed interpretation regarding τ² (Line 148, page 6). The main reason is that there is a scale gap between the gradient of the classification loss and that of the distillation loss, so the distillation loss is multiplied by τ² to keep the gradient contributions consistent. The specific derivation is as follows:
According to Equation 1 in Section 3.1, the probability assigned by a network to each category j ∈ {1 . . . C} at temperature τ is

p_j = exp(z_j/τ) / Σ_{k=1}^{C} exp(z_k/τ),

where z^t is the output of the teacher network before the softmax function is applied and z^s is the output of the student network before the softmax function is applied. Substituting Equation 1 into the distillation loss (the cross-entropy between the softened teacher and student outputs) and differentiating with respect to the student logit z_i^s gives

∂L_KD/∂z_i^s = (1/τ) (p_i^s − p_i^t).

When the temperature is large compared with the magnitude of the logits, exp(z/τ) ≈ 1 + z/τ; assuming the logits are approximately zero-mean, the denominator Σ_k exp(z_k/τ) is a constant (≈ C) independent of z_i^s, and the gradient simplifies to

∂L_KD/∂z_i^s ≈ (1/(C τ²)) (z_i^s − z_i^t).

Substituting Equation 1 (with τ = 1) into the classification loss, the cross-entropy with the true label y, gives

∂L_CE/∂z_i^s = p_i^s(τ = 1) − y_i.

The classification gradient carries no 1/τ² factor, so ∂L_KD/∂z_i^s is about 1/τ² times the magnitude of ∂L_CE/∂z_i^s. Therefore, when the distillation loss and the classification loss are used together, the distillation loss must be multiplied by τ² so that the gradient contributions of the two losses remain consistent. Thanks again for your valuable comment.
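A minimal sketch of this scaling, assuming PyTorch; the weighting factor alpha and the function name kd_loss are illustrative assumptions and not the manuscript's exact formulation:

```python
# A sketch of combining the classification loss with a tau^2-scaled distillation loss.
# alpha and the variable names are illustrative, not taken from the manuscript.
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, targets, tau: float = 3.0, alpha: float = 0.5):
    # Classification loss, computed at temperature 1 against the true labels.
    ce = F.cross_entropy(student_logits, targets)
    # Distillation loss at temperature tau, multiplied by tau**2 so that its gradient
    # contribution stays comparable to that of the classification loss.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    return alpha * ce + (1.0 - alpha) * kd
```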
Point 2: For Figure 2, what are the green boxes in the teacher section?
Response 2: Thank you for pointing out the omission of annotation in Figure 2. We are sorry for our carelessness. The green boxes in the teacher section are the constituent blocks of the teacher network. For example, when the teacher network is a ResNet-series network, the green boxes represent ResBlocks. We have added the relevant annotations in Figure 2. Thanks again for your valuable comment.
Point 3: In the 7th row of Section 3.2 (MSKD), could you write a few words to define the difference between the true labels and the output of the teacher network, to make the meaning of the classification loss and the distillation loss clearer? You have defined the classification loss and distillation loss as the losses obtained by comparing with ..., but comparing between what?
Response 3: Thanks for your suggestion. We are very sorry for not explaining the meaning of the classification loss and the distillation loss clearly. According to your comment, we have added a more detailed interpretation of the two losses in line 7 of Section 3.2. The true label is the actual category of a sample. The output of the teacher network is the probability distribution that the teacher predicts over the sample categories, softened by the temperature to obtain a smoother distribution. The classification loss is the distance between the student network's output and the true label, i.e., the loss obtained by comparing the student output with the true label. The distillation loss is the distance between the student network's output and the teacher network's output, i.e., the loss obtained by comparing the student output with the softened teacher output. Thanks again for your valuable comment.
Point 4: For Equation 5, you have to mention the main difference between Ps(xi,1) and Ps(xi,τ), and what is the range of τ values?
Response 4: Thanks for your suggestion. According to your comment, we have added a more detailed interpretation below Equation 5. According to Equation 1, Ps(xi,1) and Ps(xi,τ) are the outputs of the student network at different temperatures. Ps(xi,1) indicates that the temperature is 1, i.e., the output of the student network is not softened by temperature and is the original softmax output. Ps(xi,τ) indicates that the temperature is set to the experimental value τ, so the output of the student network is softened by temperature. The value of τ is greater than or equal to 1. The larger τ is, the more uniform the probability distribution becomes, and when the temperature is infinite the distribution is uniform. At higher temperatures the relative weight of the negative labels increases, and the student network then focuses relatively more on the negative labels. The negative labels contain certain information, but since the training process of the teacher network makes the negative-label part noisier, the choice of temperature is largely empirical; following the related research [26], this paper sets the temperature to 3. Thanks again for your valuable comment.
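To illustrate the effect of the temperature, the short snippet below (with assumed example logits, not values from the manuscript) shows how τ = 3 spreads probability mass onto the negative labels compared with τ = 1, assuming PyTorch:

```python
# An illustration with assumed example logits: a higher temperature softens the softmax
# output, so the negative labels receive relatively more probability mass.
import torch
import torch.nn.functional as F

logits = torch.tensor([6.0, 2.0, 1.0, 0.5])
print(F.softmax(logits / 1.0, dim=0))  # tau = 1: sharp, roughly [0.97, 0.02, 0.01, 0.00]
print(F.softmax(logits / 3.0, dim=0))  # tau = 3: softer, roughly [0.62, 0.16, 0.12, 0.10]
```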
Point 5: I could not find any difference between Eq. 5 and Eq. 8; could you give more details?
Response 5: Thanks for your suggestion. According to your comment, we have added a more detailed interpretation of the difference between Eq. 5 and Eq. 8 below Eq. 8. Eq. 5 and Eq. 8 represent the loss of the k-th exit of the student network for the MSKD and OtO methods, respectively. The main difference between the two equations lies in the distillation-loss term. The distillation loss in Eq. 5 is the distance between the output of the student branch network and the final output of the teacher network, whereas the distillation loss in Eq. 8 is the distance between the output of the student branch network and the output of the corresponding teacher branch network. In other words, the main difference is the source of guidance from the teacher network: in Eq. 5 it is the final output of the teacher network, and in Eq. 8 it is the output of the corresponding teacher branch network. Thanks again for your valuable comment.
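A minimal sketch of this difference, assuming PyTorch; the function names and the simple unweighted sum of the two terms are illustrative assumptions, not the manuscript's exact equations:

```python
# Contrasting the per-exit losses: in MSKD each student exit is distilled from the
# teacher's final output, while in OtO the k-th student exit is distilled from the
# k-th teacher exit. Names and the unweighted sum are illustrative assumptions.
import torch.nn.functional as F


def soft_kl(student_logits, teacher_logits, tau: float):
    return F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)


def mskd_exit_loss(student_exit_logits, teacher_final_logits, targets, tau: float = 3.0):
    # Eq. 5 style: classification loss plus distillation from the teacher's final exit.
    return F.cross_entropy(student_exit_logits, targets) + soft_kl(
        student_exit_logits, teacher_final_logits, tau)


def oto_exit_loss(student_exit_logits, teacher_exit_logits, targets, tau: float = 3.0):
    # Eq. 8 style: classification loss plus distillation from the matching teacher exit.
    return F.cross_entropy(student_exit_logits, targets) + soft_kl(
        student_exit_logits, teacher_exit_logits, tau)
```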
Point 6: I suggest you mention some numerical results in the Conclusion section.
Response 6: We are grateful for the suggestion. According to your comment, we have added relevant numerical results in the conclusion section (Line 340, page 16). Thanks again for your valuable comment.
Author Response File: Author Response.pdf
Round 2
Reviewer 4 Report
Dear Authors;
I'd like to thank you for following all my comments and making the required changes and modifications.
Best Regards