# MicroBERT: Distilling MoE-Based Knowledge from BERT into a Lighter Model


## Abstract


## 1. Introduction

- To promote the proper transfer of the linguistic information stored in the teacher BERT to the student, we propose a Transformer distillation approach, MicroBERT, together with routing algorithms that keep the serving cost of MoE models reasonable. We investigate alternative routing methods that are trained to exploit global task-level knowledge, routing all tokens belonging to a given task collectively to the same set of experts.
- We propose a new feature-refinement method called Feature Alignment Loss (FAL), which enables the model to perform feature learning from a higher-level perspective, effectively reducing the model's computational burden and shortening inference time while maintaining accuracy.
- We design a new feature-mapping method to ensure that as much information as possible is passed from the teacher model to the student model, improving the efficiency of information utilization. At the same time, the soft targets predicted by the teacher model can be transferred to any task.
- We introduce the idea of GANs, training a discriminator on the outputs of the teacher and student models so that the student's outputs approximate the teacher's as closely as possible. This research aims to make the student model more lightweight by compressing the model without reducing accuracy.

## 2. Related Work

## 3. Methods

#### 3.1. Distillation Process

**Hidden Layer Distillation**. Our proposed distillation technique for transformer layers builds upon hidden state distillation. Recent research has shown that the [CLS] token, max pooling and mean pooling of the representations BERT learns provide robust sentence-vector representations. Consequently, distilling such information becomes a crucial factor for sentence vector-based distillation. This linguistic knowledge encompasses syntactic information and other aspects essential for natural language understanding. We anticipate that the student model (MicroBERT) will learn more semantic knowledge from the teacher model (BERT). Therefore, we introduce a hidden layer-based feature-refinement method called FAL. We extract three features for FAL: the [CLS] token, mean pooling and max pooling. The student is trained to match these features, so that the student model focuses not only on word-vector features but also on sentence-vector features. The following objectives have been defined:
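(The objective equations did not survive extraction. A formulation consistent with the description above, assuming an MSE distance, student and teacher hidden states $H^{S}$ and $H^{T}$, and a learned projection $W$ that maps the student's hidden size to the teacher's, would be:)

$$
\mathcal{L}_{\mathrm{FAL}} = \mathrm{MSE}\big(h^{S}_{\mathrm{CLS}}W,\; h^{T}_{\mathrm{CLS}}\big) + \mathrm{MSE}\big(\mathrm{mean}(H^{S})\,W,\; \mathrm{mean}(H^{T})\big) + \mathrm{MSE}\big(\mathrm{max}(H^{S})\,W,\; \mathrm{max}(H^{T})\big)
$$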

**Embedding Layer Distillation**. We also performed distillation on the embedding layer, aiming to achieve goals similar to those of the hidden-state distillation:
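(This equation is likewise missing from the extracted text. Under the same assumptions, with $E^{S}$ and $E^{T}$ the student and teacher embedding matrices and $W_{e}$ a learned mapping, an embedding-layer loss in the spirit of the text would be:)

$$
\mathcal{L}_{\mathrm{embd}} = \mathrm{MSE}\big(E^{S}W_{e},\; E^{T}\big)
$$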

**MoE-guided distillation**. We use the MoE structure proposed in GShard [30], where each MoE layer of the Transformer consists of $E$ prediction layers ($P{L}_{1}, \ldots, P{L}_{E}$).
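(The paper does not spell out the gating function here; a GShard-style sketch, assuming a softmax router with weights $W_{g}$ that weights the prediction-layer experts, is:)

$$
\mathrm{MoE}(x) = \sum_{e=1}^{E} g_{e}(x)\, P{L}_{e}(x), \qquad g(x) = \mathrm{softmax}(W_{g}\,x)
$$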

**Prediction Layer Distillation**. For the prediction layer of our model, we employ the knowledge-distillation method proposed by Hinton et al. [31]. This method computes a soft cross-entropy loss between the temperature-softened output distributions of the student model and the teacher model. Additionally, the soft labels generated by the trained teacher model can be transferred to any task.
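(Concretely, with student logits $z^{S}$, teacher logits $z^{T}$ and a temperature $t$, the standard form of Hinton et al.'s loss [31] is:)

$$
\mathcal{L}_{\mathrm{pred}} = -\,\mathrm{softmax}\big(z^{T}/t\big)\cdot \log\,\mathrm{softmax}\big(z^{S}/t\big)
$$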

#### 3.2. Feature Alignment Loss

In the BERT$_{base}$ model, BERT takes into account that many downstream tasks, such as question answering, rely on modeling the relationship between two sentences. To give the model this capability, BERT prepends a [CLS] token to the beginning of each sentence. The [CLS] token is encoded by BERT, and the resulting vector representation is usually used as the representation of the current sentence. In addition, the BERT model takes word sequences as input and passes them upward through multiple encoder layers, each consisting of Self-Attention and a Feed-Forward Neural Network (FFNN). BERT's encoding of all tokens outputs a vector of size hidden size (768 in BERT$_{base}$) at each position.

The hidden dimension of BERT$_{base}$ is 768, while the hidden dimension of our MicroBERT student model is 312. Because these two dimensions do not match, the distillation operation cannot be performed directly. To solve this problem, we map the student features into the teacher's dimension. The [CLS] token, the mean pooling and the max pooling are then extracted from the mapped features, and the loss between the corresponding teacher and student features is computed; this is the FAL.

A layer-mapping function specifies which layers of BERT$_{base}$ correspond to the layers of MicroBERT when computing the FAL.
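As an illustration, here is a minimal PyTorch sketch of the FAL computation for one mapped layer pair. The class name `FeatureAlignmentLoss`, the single linear projection and the call signature are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class FeatureAlignmentLoss(nn.Module):
    """Sketch of the FAL: align [CLS], mean-pooled and max-pooled hidden
    states of a mapped student/teacher layer pair, after projecting the
    student features (312-d) into the teacher's space (768-d)."""

    def __init__(self, d_student=312, d_teacher=768):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)  # feature mapping
        self.mse = nn.MSELoss()

    def forward(self, h_student, h_teacher):
        # h_student: (batch, seq_len, 312); h_teacher: (batch, seq_len, 768)
        hs = self.proj(h_student)
        loss = self.mse(hs[:, 0], h_teacher[:, 0])                            # [CLS] token
        loss = loss + self.mse(hs.mean(dim=1), h_teacher.mean(dim=1))         # mean pooling
        loss = loss + self.mse(hs.max(dim=1).values, h_teacher.max(dim=1).values)  # max pooling
        return loss
```

Under the layer mapping described in Section 4.1, `h_teacher` would be the average of the hidden states of three consecutive BERT$_{base}$ layers.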

#### 3.3. Soft Loss and Hard Loss

The discriminator judges whether an output was produced by the (BERT$_{base}$) teacher model or by the student model; the procedure is to train the discriminator first and then to train the GAN.
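A minimal sketch of the training signals described in this section and in Section 3.5 follows. The loss helpers and the small MLP discriminator over output logits are our assumptions (the paper does not give the discriminator's architecture), so treat this as illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_loss(student_logits, teacher_logits, t=2.0):
    """Hinton-style soft loss: cross-entropy between temperature-softened outputs."""
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()

def hard_loss(student_logits, labels):
    """Standard cross-entropy against the ground-truth labels."""
    return F.cross_entropy(student_logits, labels)

class Discriminator(nn.Module):
    """Hypothetical discriminator: classifies whether logits came from the
    teacher (label 1) or the student (label 0)."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, logits):
        return self.net(logits)

def discriminator_step(disc, d_opt, student_logits, teacher_logits):
    """Step 1: train the discriminator to separate teacher from student outputs."""
    real = disc(teacher_logits.detach())
    fake = disc(student_logits.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
              + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    return d_loss

def adversarial_loss(disc, student_logits):
    """Step 2: the student tries to make its outputs indistinguishable
    from the teacher's (the GAN generator objective)."""
    fake = disc(student_logits)
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```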

#### 3.4. Efficient Inference on MoE

- Fusion: The original graph and the matching distributed method for the ultra-large-scale distributed training model are combined to eliminate parameter redundancy.
- Distillation and Compression: The teacher network's many experts are concentrated and condensed, so that fewer experts are present in the student network.
- Optimization: Relevant IR Pass optimizations, including kernel fusion, are applied to the distributed sub-graphs in order to further reduce the inference time.

#### 3.5. Discriminator Loss

## 4. Experiments and Results

#### 4.1. Model Accuracy Comparison Experiment

The student model is MicroBERT$_{4}$, with 14.5 M parameters in total. It contains 12 attention heads (h = 12), 4 encoder layers (M = 4), a feed-forward/filter size of 1200 (${d}_{i}$ = 1200) and a hidden dimension of 312 (d = 312). The teacher model, which has 110 M parameters, is based on the original BERT$_{base}$ (d = 768, ${d}_{i}$ = 3072, N = 12 and h = 12). We use the average taken over every third layer as the layer-mapping function in the teacher model, so MicroBERT$_{4}$ learns from the average of every 3 layers of BERT$_{base}$.
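For reference, the two geometries can be written as hypothetical Hugging Face `BertConfig` objects (our illustration; the authors do not state which framework they used):

```python
from transformers import BertConfig

# Student: MicroBERT_4 geometry as reported above.
student_config = BertConfig(
    hidden_size=312,         # d = 312
    num_hidden_layers=4,     # M = 4
    num_attention_heads=12,  # h = 12
    intermediate_size=1200,  # d_i = 1200
)

# Teacher: original BERT_base.
teacher_config = BertConfig(
    hidden_size=768,         # d = 768
    num_hidden_layers=12,    # N = 12
    num_attention_heads=12,  # h = 12
    intermediate_size=3072,  # d_i = 3072
)
```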

We compare MicroBERT with BERT$_{base}$ and with lightweight models built by several different compression methods. For a fair comparison, we also trained 4-layer BERT-PKD$_{4}$, 4-layer DistilBERT$_{4}$ and 4-layer MiniLMv2$_{4}$ using the published code, fine-tuning these 4-layer baselines with the proposed hyper-parameters. The accuracy results on the dataset tasks are shown in Table 1.

On the MNLI-m task, accuracy is 80.3% for MicroBERT$_{4}$ and 84.4% for BERT (Teacher); the accuracy of our ultra-lightweight model on this task is about 0.4% higher than the average accuracy of the four lightweight models, which is within our acceptable range. On MRPC, the semantic-equivalence task, the best result is our ultra-lightweight model's 88.7%, an extremely high level. Its 86.6% on QQP is about 6.5% above the average accuracy of the other lightweight models. Finally, on the RTE task, our model is about 3.6% less accurate than the best model, BERT (Google), while remaining above the average accuracy.

#### 4.2. Distillation Effect Comparison Experiment

#### 4.3. Loss Algorithm Comparison Experiment

#### 4.4. Algorithmic Ablation Experiments

## 5. Discussion

## 6. Conclusions and Future Work

Our method makes it more practical to deploy BERT$_{base}$-level NLP models on hardware. However, it is important to note that our study has certain limitations. Specifically, we have not explored the effectiveness of our method on models larger than BERT$_{base}$, such as BERT-large, and we have not conducted extensive experiments on languages other than English. Future research could investigate how to effectively transfer knowledge from a comprehensive and sophisticated teacher model, such as BERT-large, to a compact student model like MicroBERT, addressing these limitations. Additionally, an interesting avenue for further compressing pre-trained language models is to combine distillation with quantization or pruning techniques. In conclusion, we highlight the importance of focusing future research on enhancing inference efficiency for large models by developing methods that are more inference-friendly while maintaining the quality improvements of MoE models. We believe that further studies on hierarchical variants or more detailed routing hybrids will yield additional benefits and deepen our understanding of large-scale, heavily multi-task and sparsely gated networks.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
2. Gong, Y.; Liu, L.; Yang, M.; Bourdev, L. Compressing deep convolutional networks using vector quantization. arXiv 2014, arXiv:1412.6115.
3. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
4. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
5. Aguilar, G.; Ling, Y.; Zhang, Y.; Yao, B.; Fan, X.; Guo, C. Knowledge distillation from internal representations. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7350–7357.
6. Ryu, M.; Lee, G.; Lee, K. Knowledge distillation for BERT unsupervised domain adaptation. Knowl. Inf. Syst. 2022, 64, 3113–3128.
7. Feng, L.; Qiu, M.; Li, Y.; Zheng, H.T.; Shen, Y. Learning to augment for data-scarce domain BERT knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 7422–7430.
8. Wu, Y.; Rezagholizadeh, M.; Ghaddar, A.; Haidar, M.A.; Ghodsi, A. Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7649–7661.
9. Kim, J.; Park, J.H.; Lee, M.; Mok, W.L.; Choi, J.Y.; Lee, S. Tutoring helps students learn better: Improving knowledge distillation for BERT with tutor network. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 7371–7382.
10. Li, M.; Zhao, H.; Gu, T.; Ying, D.; Liao, B. Class imbalance mitigation: A select-then-extract learning framework for emotion-cause pair extraction. Expert Syst. Appl. 2024, 236, 121386.
11. Huo, Y.; Wong, D.F.; Ni, L.M.; Chao, L.S.; Zhang, J. Knowledge modeling via contextualized representations for LSTM-based personalized exercise recommendation. Inf. Sci. 2020, 523, 266–278.
12. Otmakhova, J.; Verspoor, K.; Lau, J.H. Cross-linguistic comparison of linguistic feature encoding in BERT models for typologically different languages. In Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Seattle, WA, USA, 14 July 2022; pp. 27–35.
13. Chen, Q.; Du, J.; Allot, A.; Lu, Z. LitMC-BERT: Transformer-based multi-label classification of biomedical literature with an application on COVID-19 literature curation. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 2584–2595.
14. Shen, L.; Wu, Z.; Gong, W.; Hao, H.; Bai, Y.; Wu, H.; Wu, X.; Bian, J.; Xiong, H.; Yu, D.; et al. SE-MoE: A scalable and efficient mixture-of-experts distributed training and inference system. arXiv 2023, arXiv:2205.10034.
15. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
16. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461.
17. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Technical Report, 2018.
18. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942.
19. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
20. Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient knowledge distillation for BERT model compression. arXiv 2019, arXiv:1908.09355.
21. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
22. Adhikari, A.; Ram, A.; Tang, R.; Lin, J. DocBERT: BERT for document classification. arXiv 2019, arXiv:1904.08398.
23. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A compact task-agnostic BERT for resource-limited devices. arXiv 2020, arXiv:2004.02984.
24. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. arXiv 2019, arXiv:1909.10351.
25. Khattab, O.; Zaharia, M. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 39–48.
26. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33, pp. 5776–5788.
27. Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Deng, H.; Ju, Q. FastBERT: A self-distilling BERT with adaptive inference time. arXiv 2020, arXiv:2004.02178.
28. Zuo, S.; Zhang, Q.; Liang, C.; He, P.; Zhao, T.; Chen, W. MoEBERT: From BERT to mixture-of-experts via importance-guided adaptation. arXiv 2022, arXiv:2204.07675.
29. Kudugunta, S.; Huang, Y.; Bapna, A.; Krikun, M.; Lepikhin, D.; Luong, M.T.; Firat, O. Beyond distillation: Task-level mixture-of-experts for efficient inference. arXiv 2021, arXiv:2110.03742.
30. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv 2020, arXiv:2006.16668.
31. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.

**Figure 1.** Overall Architecture Diagram: This diagram outlines the design of the MicroBERT framework. ① The student model aligns its intermediate features with those of the teacher model using FAL. The prediction layer is optimized through the application of soft loss derived from the teacher's outputs and hard loss based on the ground-truth labels. ② Both models incorporate Mixture of Experts (MoE) layers, guided by routers, to enhance processing capabilities. ③ Furthermore, a discriminator network introduces a discriminator loss to ensure the student's outputs closely resemble those of the teacher.

**Figure 2.**FAL: This diagram illustrates the feature alignment process between a pretrained Teacher BERT and a Student BERT. Both models process input tokens through their respective transformer layers. The outputs are then aggregated using mean pooling, max pooling and the [CLS] token. These features are aligned by minimizing the difference between corresponding outputs from the teacher and student models.

**Figure 3.**Efficient Inference on MoE: This diagram illustrates the process within the Mixture of Experts (MoE) layer. It begins with the fusion of multiple expert networks, followed by distillation and compression to consolidate their outputs. The final step involves optimization, where the combined network is fine-tuned for improved performance.

| System | Params | FLOPs | Speedup | SST-2 | MNLI-m | MRPC | QQP | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| BERT (Google) | 109 M | 22.5 B | 1.0× | 93.5 | 84.6 | 88.9 | 71.2 | 66.4 | 80.92 |
| BERT (Teacher) | 109 M | 22.5 B | 1.0× | 91.1 | 84.4 | 86.0 | 88.8 | 64.2 | 82.9 |
| BERT-PKD$_{4}$ | 52.2 M | 7.6 B | 3.0× | 89.4 | 79.9 | 82.6 | 70.2 | 62.3 | 76.88 |
| DistilBERT$_{4}$ | 52.2 M | 7.6 B | 3.0× | 91.4 | 78.9 | 82.4 | 68.5 | 54.1 | 75.06 |
| MobileBERT$_{4}$ | 15.1 M | 3.1 B | - | 91.2 | 81.5 | 87.9 | 68.9 | 65.1 | 78.92 |
| TinyBERT$_{4}$ | 14.5 M | 1.2 B | 9.4× | 87.6 | 80.5 | 82.3 | 87.7 | 61.7 | 79.96 |
| MiniLMv2$_{4}$ | 5.4 M | 4.0 B | 5.3× | 88.4 | 74.3 | 81.8 | 82.0 | 60.6 | 78.0 |
| MicroBERT | 14.5 M | 1.2 B | 9.3× | 89.6 | 80.3 | 88.7 | 86.6 | 62.8 | 81.6 |
| Avg | | | | 89.8 | 79.9 | 84.5 | 78.9 | 61.9 | 79.2 |

| System | SST-2 | MNLI-m | MRPC | QQP | RTE | Avg |
|---|---|---|---|---|---|---|
| ${\mathrm{MicroBERT}}_{tea}$ | 91.1 | 84.4 | 88.9 | 88.8 | 54.8 | - |
| ${\mathrm{MicroBERT}}_{stu}$ | 89.6 | 80.3 | 88.7 | 86.6 | 62.8 | - |
| Reappear percent | 98.3 | 95.1 | 99.7 | 97.5 | 114.5 | 101.0 |
| ${\mathrm{TinyBERT}}_{tea}$ | 93.4 | 83.9 | 87.5 | 71.1 | 67.1 | - |
| ${\mathrm{TinyBERT}}_{stu}$ | 92.6 | 82.5 | 86.4 | 71.3 | 66.6 | - |
| Reappear percent | 99.1 | 98.3 | 98.7 | 100.2 | 99.4 | 99.0 |
| Difference (Tiny − Micro) | −0.79 | −3.2 | 1.03 | −2.76 | 15.1 | −2.0 |

| System | SST-2 | QQP | MRPC |
|---|---|---|---|
| our_model | 89.6 | 86.6 | 88.7 |
| w/o Dis | 88.9 | 86.3 | 88.0 |
| w/o Hidn | 88.3 | 85.9 | 87.7 |
| w/o Pred | 87.4 | 85.1 | 87.1 |


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
