MicroBERT: Distilling MoE-Based Knowledge from BERT into a Lighter Model
Abstract
1. Introduction
- To transfer the linguistic knowledge stored in the BERT teacher to the student effectively, we propose a Transformer distillation approach, MicroBERT, built on routing algorithms for MoE models with reasonable serving costs. We investigate alternative routing strategies that learn to exploit global task-level knowledge, so that all tokens belonging to a given task are routed collectively to the same set of experts.
- We propose a new feature-refinement method, Feature Alignment Loss (FAL), which lets the model learn features from a higher-level perspective, reducing the computational burden and shortening inference time while preserving accuracy.
- We design a new feature-mapping method so that as much information as possible is passed from the teacher model to the student model, improving the efficiency of information utilization. At the same time, the soft targets predicted by the teacher can be transferred to any downstream task.
- We adopt the idea of GANs and train a discriminator on the outputs of the teacher and student models, so that the student's output approximates the teacher's as closely as possible. The overall goal is to compress the model into a more lightweight student without sacrificing accuracy (a minimal sketch of the combined training objective follows this list).
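Taken together, these contributions amount to a single distillation objective combining a feature-alignment term, soft/hard prediction terms, and a GAN-style adversarial term. The following PyTorch sketch shows one plausible way to combine them; the hidden sizes, the projection layer, the temperature, and the loss weights are illustrative assumptions, not the exact formulation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the combined distillation objective described above.
# Dimensions, layer names, and loss weights are illustrative assumptions.

class FeatureAlignment(nn.Module):
    """Projects student hidden states into the teacher's hidden size and
    penalises the mean-squared distance between the two representations."""
    def __init__(self, student_dim: int = 312, teacher_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.proj(h_student), h_teacher)


def soft_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-scaled teacher and student predictions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def hard_loss(student_logits, labels):
    """Standard cross-entropy against the ground-truth labels."""
    return F.cross_entropy(student_logits, labels)


def generator_loss(discriminator, h_student):
    """GAN-style term: the student is rewarded when the discriminator
    mistakes its features for the teacher's (label 1 = 'teacher')."""
    pred = discriminator(h_student)
    return F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))


def distillation_loss(fal, disc, h_s, h_t, logits_s, logits_t, labels,
                      w_feat=1.0, w_soft=1.0, w_hard=1.0, w_adv=0.1):
    """Weighted sum of the four terms; the weights here are placeholders."""
    return (w_feat * fal(h_s, h_t)
            + w_soft * soft_loss(logits_s, logits_t)
            + w_hard * hard_loss(logits_s, labels)
            + w_adv * generator_loss(disc, h_s))
```

In a full GAN-style setup, a separate discriminator (e.g., a small MLP over pooled features) would be trained with the opposite labels on teacher versus student representations, alternating with the student updates above.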
2. Related Work
3. Methods
3.1. Distillation Process
3.2. Feature Alignment Loss
3.3. Soft Loss and Hard Loss
3.4. Efficient Inference on MoE
- Fusion: The original graph and the corresponding distributed strategy of the ultra-large-scale distributed training model are fused to eliminate parameter redundancy.
- Distillation and compression: The many experts of the teacher network are consolidated and condensed, so the student network contains fewer experts.
- Optimization: Relevant IR pass optimizations, such as kernel fusion, are applied to the distributed sub-graphs to further reduce inference time (an illustrative routing sketch follows this list).
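As a rough illustration of the task-level routing idea raised in the introduction (all tokens of a given task are routed to the same small set of experts, so only those experts need to be loaded at serving time), the sketch below contrasts this with per-token gating. The expert shapes, the task-to-expert assignment, and the uniform mixing are assumptions for illustration only, not the system's actual implementation.

```python
import torch
import torch.nn as nn

class TaskLevelMoE(nn.Module):
    """Illustrative MoE layer where routing is decided per task rather than
    per token: every token of one task shares the same subset of experts,
    so unused experts never have to be loaded for that task at inference."""
    def __init__(self, d_model: int = 768, d_ff: int = 3072, n_experts: int = 8):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # Fixed (or learned offline) assignment from task id to expert ids (assumed mapping).
        self.task_to_experts = {0: [0, 1], 1: [2, 3]}

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (batch, seq_len, d_model); every token uses the same experts.
        expert_ids = self.task_to_experts[task_id]
        outputs = torch.stack([self.experts[i](x) for i in expert_ids], dim=0)
        return outputs.mean(dim=0)  # simple uniform mix of the selected experts


# Usage: the whole batch for task 0 goes through experts {0, 1} only.
layer = TaskLevelMoE()
hidden = torch.randn(4, 128, 768)
out = layer(hidden, task_id=0)
print(out.shape)  # torch.Size([4, 128, 768])
```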
3.5. Discriminator Loss
4. Experiments and Results
4.1. Model Accuracy Comparison Experiment
4.2. Distillation Effect Comparison Experiment
4.3. Loss Algorithm Comparison Experiment
4.4. Algorithmic Ablation Experiments
5. Discussion
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Gong, Y.; Liu, L.; Yang, M.; Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv 2014, arXiv:1412.6115.
- Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
- Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
- Aguilar, G.; Ling, Y.; Zhang, Y.; Yao, B.; Fan, X.; Guo, C. Knowledge distillation from internal representations. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7350–7357.
- Ryu, M.; Lee, G.; Lee, K. Knowledge distillation for bert unsupervised domain adaptation. Knowl. Inf. Syst. 2022, 64, 3113–3128.
- Feng, L.; Qiu, M.; Li, Y.; Zheng, H.T.; Shen, Y. Learning to augment for data-scarce domain bert knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 7422–7430.
- Wu, Y.; Rezagholizadeh, M.; Ghaddar, A.; Haidar, M.A.; Ghodsi, A. Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7649–7661.
- Kim, J.; Park, J.H.; Lee, M.; Mok, W.L.; Choi, J.Y.; Lee, S. Tutoring Helps Students Learn Better: Improving Knowledge Distillation for BERT with Tutor Network. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7371–7382.
- Li, M.; Zhao, H.; Gu, T.; Ying, D.; Liao, B. Class imbalance mitigation: A select-then-extract learning framework for emotion-cause pair extraction. Expert Syst. Appl. 2024, 236, 121386.
- Huo, Y.; Wong, D.F.; Ni, L.M.; Chao, L.S.; Zhang, J. Knowledge modeling via contextualized representations for LSTM-based personalized exercise recommendation. Inf. Sci. 2020, 523, 266–278.
- Otmakhova, J.; Verspoor, K.; Lau, J.H. Cross-linguistic comparison of linguistic feature encoding in BERT models for typologically different languages. In Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Seattle, WA, USA, 14 July 2022; pp. 27–35.
- Chen, Q.; Du, J.; Allot, A.; Lu, Z. LitMC-BERT: Transformer-based multi-label classification of biomedical literature with an application on COVID-19 literature curation. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 2584–2595.
- Shen, L.; Wu, Z.; Gong, W.; Hao, H.; Bai, Y.; Wu, H.; Wu, X.; Bian, J.; Xiong, H.; Yu, D.; et al. SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System. arXiv 2023, arXiv:2205.10034.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461.
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Technical Report, 2018.
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
- Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for bert model compression. arXiv 2019, arXiv:1908.09355.
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
- Adhikari, A.; Ram, A.; Tang, R.; Lin, J. Docbert: Bert for document classification. arXiv 2019, arXiv:1904.08398.
- Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. Mobilebert: A compact task-agnostic bert for resource-limited devices. arXiv 2020, arXiv:2004.02984.
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. Tinybert: Distilling bert for natural language understanding. arXiv 2019, arXiv:1909.10351.
- Khattab, O.; Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 39–48.
- Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33, pp. 5776–5788.
- Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Deng, H.; Ju, Q. Fastbert: A self-distilling bert with adaptive inference time. arXiv 2020, arXiv:2004.02178.
- Zuo, S.; Zhang, Q.; Liang, C.; He, P.; Zhao, T.; Chen, W. Moebert: From bert to mixture-of-experts via importance-guided adaptation. arXiv 2022, arXiv:2204.07675.
- Kudugunta, S.; Huang, Y.; Bapna, A.; Krikun, M.; Lepikhin, D.; Luong, M.T.; Firat, O. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference. arXiv 2021, arXiv:2110.03742.
- Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv 2020, arXiv:2006.16668.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
| System | Params | FLOPs | Speedup | SST-2 | MNLI-m | MRPC | QQP | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| BERT (Google) | 109 M | 22.5 B | 1.0× | 93.5 | 84.6 | 88.9 | 71.2 | 66.4 | 80.92 |
| BERT (Teacher) | 109 M | 22.5 B | 1.0× | 91.1 | 84.4 | 86.0 | 88.8 | 64.2 | 82.9 |
| BERT-PKD4 | 52.2 M | 7.6 B | 3.0× | 89.4 | 79.9 | 82.6 | 70.2 | 62.3 | 76.88 |
| DistilBERT4 | 52.2 M | 7.6 B | 3.0× | 91.4 | 78.9 | 82.4 | 68.5 | 54.1 | 75.06 |
| MobileBERT4 | 15.1 M | 3.1 B | - | 91.2 | 81.5 | 87.9 | 68.9 | 65.1 | 78.92 |
| TinyBERT4 | 14.5 M | 1.2 B | 9.4× | 87.6 | 80.5 | 82.3 | 87.7 | 61.7 | 79.96 |
| MiniLMv24 | 5.4 M | 4.0 B | 5.3× | 88.4 | 74.3 | 81.8 | 82.0 | 60.6 | 78.0 |
| MicroBERT | 14.5 M | 1.2 B | 9.3× | 89.6 | 80.3 | 88.7 | 86.6 | 62.8 | 81.6 |
| Avg | - | - | - | 89.8 | 79.9 | 84.5 | 78.9 | 61.9 | 79.2 |
| System | SST-2 | MNLI-m | MRPC | QQP | RTE | Avg |
|---|---|---|---|---|---|---|
| BERT (Teacher, ours) | 91.1 | 84.4 | 88.9 | 88.8 | 54.8 | - |
| MicroBERT | 89.6 | 80.3 | 88.7 | 86.6 | 62.8 | - |
| Retention (%) | 98.3 | 95.1 | 99.7 | 97.5 | 114.5 | 101.0 |
| BERT (Teacher, TinyBERT) | 93.4 | 83.9 | 87.5 | 71.1 | 67.1 | - |
| TinyBERT | 92.6 | 82.5 | 86.4 | 71.3 | 66.6 | - |
| Retention (%) | 99.1 | 98.3 | 98.7 | 100.2 | 99.4 | 99.0 |
| Difference (Tiny-Micro) | −0.79 | −3.2 | 1.03 | −2.76 | 15.1 | −2.0 |
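The "Retention (%)" rows appear to be the student score divided by the corresponding teacher score, expressed as a percentage; minor discrepancies look like rounding or truncation. A quick check against two columns of the first block (values taken from the table above):

```python
# Retention = student score / teacher score * 100 (inferred from the table values).
def retention(student: float, teacher: float) -> float:
    return round(student / teacher * 100, 1)

print(retention(89.6, 91.1))  # SST-2 -> 98.4 (table reports 98.3, presumably truncated)
print(retention(62.8, 54.8))  # RTE   -> 114.6 (table reports 114.5)
```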
| System | SST-2 | QQP | MRPC |
|---|---|---|---|
| MicroBERT (ours) | 89.6 | 86.6 | 88.7 |
| w/o Dis (discriminator loss) | 88.9 | 86.3 | 88.0 |
| w/o Hidn (hidden-state alignment loss) | 88.3 | 85.9 | 87.7 |
| w/o Pred (prediction/soft loss) | 87.4 | 85.1 | 87.1 |