Symmetry and Asymmetry in Pre-Trained Transformer Models: A Comparative Study of TinyBERT, BERT, and RoBERTa for Chinese Educational Text Classification
Abstract
1. Introduction
2. Related Work
2.1. Traditional Machine Learning Approaches
2.2. Deep Learning for Text Classification
2.3. Pre-Trained Language Models for NLP
2.4. Chinese PLMs and Domain Adaptation
2.5. Efficiency and Lightweight Transformer Models
2.6. Educational Text Characteristics and Challenges
2.7. Comparative Studies of BERT Variants
2.8. Summary and Research Gap
3. Materials and Methods
3.1. Dataset Description and Exploratory Analysis
3.1.1. Corpus Source and Label Taxonomy
3.1.2. Splitting Strategy and Reproducibility
3.1.3. Basic Statistics and Length Analysis
3.1.4. Lexical Overlap and Class Similarity
3.1.5. Preprocessing Strategy and Case Analysis
3.2. Model and Training Configuration
3.2.1. Model Architecture
3.2.2. Optimization and Training Configuration
3.2.3. BERT-Base-Chinese
3.2.4. RoBERTa-wwm-ext Chinese
3.2.5. Summary of Model and Training Configuration
3.3. Evaluation Protocol
3.3.1. Evaluation Metrics
3.3.2. Experimental Setup
4. Results and Analysis
4.1. Overall Performance Comparison
4.2. Confusion Matrix Analysis
4.3. Efficiency–Effectiveness Trade-Off
4.4. Error Analysis
- Domain-specific pre-training (e.g., finance-adapted or education-adapted PLMs) to enhance fine-grained discrimination in specialized contexts.
- Hierarchical classification strategies, such as first distinguishing between economic and non-economic domains before refining Finance vs. Stock, to reduce semantic interference and strengthen decision boundaries.
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Class | ID | Semantics |
|---|---|---|
| Education | 0 | K–12 and higher education policy, pedagogy, exams |
| Technology | 1 | Educational technology and related innovations |
| Finance | 2 | Macroeconomy, financial reports, market news |
| Stock | 3 | Equity market trends, indices, company disclosures |
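For reference, the taxonomy above translates directly into the id2label/label2id dictionaries that a Hugging Face classification config accepts. The snippet below is an illustrative sketch, not code taken from the paper.

```python
# Illustrative label mapping for the four-class taxonomy above.
id2label = {0: "Education", 1: "Technology", 2: "Finance", 3: "Stock"}
label2id = {name: idx for idx, name in id2label.items()}

# These dictionaries can be passed to a sequence-classification model so that
# predictions are reported with readable class names, e.g.:
#   AutoModelForSequenceClassification.from_pretrained(
#       "bert-base-chinese", num_labels=4, id2label=id2label, label2id=label2id)
```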
| Split | Documents (N) | Avg Length (Chars) | Min–Max (Chars) | Class Balance (E/T/F/S) | Fraction Truncated (>128) |
|---|---|---|---|---|---|
| Train | 22,400 | 935.4 | 7–24,109 | 0.07/0.07/0.07/0.07 | 0.94 |
| Validation | 2800 | 952.8 | 23–43,154 | 0.07/0.07/0.07/0.07 | 0.93 |
| Test | 2800 | 954.8 | 12–22,470 | 0.07/0.07/0.07/0.07 | 0.94 |
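The per-split statistics above can be recomputed with a short pandas pass over the raw corpus. The sketch below assumes a DataFrame with `text` and `label` columns and measures length in characters, which may not match the authors' exact counting rules.

```python
import pandas as pd

def split_stats(df: pd.DataFrame, max_len: int = 128) -> dict:
    """Character-based statistics matching the columns of the table above."""
    lengths = df["text"].str.len()
    return {
        "documents": len(df),
        "avg_length_chars": round(lengths.mean(), 1),
        "min_max_chars": (int(lengths.min()), int(lengths.max())),
        "class_proportions": df["label"].value_counts(normalize=True).round(2).to_dict(),
        "fraction_truncated": round((lengths > max_len).mean(), 2),
    }

# Example usage, assuming train_df / val_df / test_df are loaded from the corpus files:
# for name, df in {"Train": train_df, "Validation": val_df, "Test": test_df}.items():
#     print(name, split_stats(df))
```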
| Class | Education | Technology | Finance | Stock |
|---|---|---|---|---|
| Education | 1.000 | 0.685 | 0.640 | 0.593 |
| Technology | 0.685 | 1.000 | 0.665 | 0.633 |
| Finance | 0.640 | 0.665 | 1.000 | 0.728 |
| Stock | 0.593 | 0.633 | 0.728 | 1.000 |
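This extract does not state how the pairwise class similarities above were computed. One plausible reconstruction, shown below as an assumption rather than the authors' procedure, is the cosine similarity between class-level TF-IDF centroids over character n-grams, which sidesteps Chinese word segmentation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def class_similarity_matrix(texts, labels, classes):
    """Cosine similarity between per-class TF-IDF centroids (character 1-2 grams)."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    X = vec.fit_transform(texts)
    labels = np.asarray(labels)
    # One centroid per class: mean TF-IDF vector over that class's documents.
    centroids = np.vstack([np.asarray(X[labels == c].mean(axis=0)) for c in classes])
    return cosine_similarity(centroids)

# Example usage:
# sim = class_similarity_matrix(train_texts, train_labels,
#                               classes=["Education", "Technology", "Finance", "Stock"])
```

A high Finance–Stock similarity under any such lexical measure is consistent with the confusion patterns reported later in the error analysis.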
| Label | Original Length (Chars) | Processed Length (Chars) | Truncated @128 | Attention Length |
|---|---|---|---|---|
| Finance | 3153 | 2954 | 0 | 128 |
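The case above shows a long finance article being reduced to the 128-token window. A minimal sketch of that truncation step is given below, assuming the standard bert-base-chinese tokenizer; any cleaning applied before tokenization is not reproduced here.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def encode(text: str, max_length: int = 128):
    """Tokenize a document with truncation to the 128-token limit used in the experiments."""
    return tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )

# A ~3000-character article is cut down to 128 token positions (including the
# [CLS] and [SEP] special tokens), so the attention length reported above is 128.
```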
| Component | Value |
|---|---|
| Backbone model | huawei-noah/TinyBERT_General_4L_312D |
| Optimizer | AdamW |
| Initial learning rate | 5 × 10⁻⁵ |
| Warm-up ratio | 0.1 |
| Weight decay | 0.01 |
| Dropout | 0.1 |
| Batch size (train/eval) | 64/128 |
| Max sequence length | 128 |
| Epochs | 10 |
| Gradient clipping | 1.0 |
| Mixed precision | Enabled |
| Random seed | 42 |
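For orientation, the hyperparameters in the table above (listed for TinyBERT and shared by all three models, per the next table) map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch under the stated settings, not the authors' released training script; the output path is hypothetical, and dropout stays at the backbone's default of 0.1.

```python
from transformers import TrainingArguments, set_seed

set_seed(42)  # fixed random seed from the table

training_args = TrainingArguments(
    output_dir="tinybert-4l-news-cls",   # hypothetical output path
    num_train_epochs=10,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=1.0,        # gradient clipping
    fp16=True,                # mixed precision
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # assumes a compute_metrics function reporting "f1"
    seed=42,
)
```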
| Model | L | H | Heads | Params | Pre-Training | Epochs | Batch (Tr/Val) | LR | Warmup | WD | FP16 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TinyBERT-4L | 4 | 312 | 12 | ~14 M | Distilled from BERT-base | 10 | 64/128 | 5 × 10⁻⁵ | 0.1 | 0.01 | Yes |
| BERT-base-Chinese | 12 | 768 | 12 | ~110 M | MLM | 10 | 64/128 | 5 × 10⁻⁵ | 0.1 | 0.01 | Yes |
| RoBERTa-wwm-ext | 12 | 768 | 12 | ~110 M | WWM + ext corpora | 10 | 64/128 | 5 × 10⁻⁵ | 0.1 | 0.01 | Yes |
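All three backbones can be wrapped with an identical 4-class classification head. The TinyBERT checkpoint name is the one listed earlier; the BERT and RoBERTa identifiers below are the commonly used Hugging Face checkpoints and are assumed rather than quoted from the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = {
    "TinyBERT-4L": "huawei-noah/TinyBERT_General_4L_312D",
    "BERT-base-Chinese": "bert-base-chinese",
    "RoBERTa-wwm-ext": "hfl/chinese-roberta-wwm-ext",   # assumed checkpoint name
}

def load_backbone(name: str, num_labels: int = 4):
    """Load the tokenizer and a 4-class classification model for one backbone."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=num_labels)
    return tokenizer, model

# Example: tok, model = load_backbone("RoBERTa-wwm-ext")
```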
| Component | Specification | Notes |
|---|---|---|
| GPU | NVIDIA Tesla T4 (16 GB) | Mixed precision enabled (fp16) |
| CPU & Memory | 2 vCPUs, 13 GB RAM | Kaggle default configuration |
| Software | Python 3.10, PyTorch 2.1.0, Transformers 4.37.0, scikit-learn 1.4.0 | Reproducible across runs |
| Epochs | 10 | Best checkpoint selected by validation F1 |
| Batch size | 64 (train)/128 (val & test) | Unified across all models |
| Max sequence | 128 tokens | Truncation applied beyond this length |
| Optimizer | AdamW | Weight decay = 0.01 |
| Learning rate | 5 × 10⁻⁵ with linear decay scheduler | 10% warm-up steps |
| Loss function | Cross-entropy with class weights | Handles mild class imbalance |
| Seed | 42 | Fixed for reproducibility |
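The setup table specifies cross-entropy with class weights. One way to realize this with the `Trainer` API is to override `compute_loss`, as in the sketch below; the inverse-frequency weighting in the comment is an assumption, since the exact weighting scheme is not given here.

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer variant applying per-class weights in the cross-entropy loss."""

    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = torch.as_tensor(class_weights, dtype=torch.float)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Hypothetical inverse-frequency weighting for mild imbalance, e.g.:
# weights = [total_docs / (4 * count_c) for count_c in per_class_counts]
```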
| Model | Accuracy | Precision | Recall | F1-Score | Inference Latency (ms/Sample) |
|---|---|---|---|---|---|
| TinyBERT-4L | 0.8892 | 0.8876 | 0.8892 | 0.8879 | 2.8 |
| BERT-base-Chinese | 0.9174 | 0.9161 | 0.9174 | 0.9166 | 6.7 |
| RoBERTa-wwm-ext | 0.9312 | 0.9305 | 0.9312 | 0.9308 | 12.4 |
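The averaging scheme behind the precision, recall, and F1 columns is not spelled out in this extract; assuming weighted averaging over the four classes, the scores can be reproduced with scikit-learn as sketched below.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classification_metrics(y_true, y_pred):
    """Accuracy plus weighted-average precision/recall/F1, as in the table above."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```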
| Model | Training Time/Epoch (s) | Inference Latency (ms/Sample) | Throughput (Samples/s) |
|---|---|---|---|
| TinyBERT-4L | 42.5 | 2.8 | 357.1 |
| BERT-base-Chinese | 131.2 | 6.7 | 149.3 |
| RoBERTa-wwm-ext | 248.6 | 12.4 | 80.6 |
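Per-sample latency and throughput figures of this kind can be obtained with a timed pass over the test loader. The sketch below is one possible measurement protocol, not necessarily the authors'; it assumes the loader yields dicts of tensors containing input_ids and attention_mask, and its synchronization and warm-up choices are assumptions.

```python
import time
import torch

@torch.no_grad()
def measure_inference(model, dataloader, device="cuda"):
    """Return (mean latency in ms/sample, throughput in samples/s) over a test loader."""
    model.eval().to(device)
    n_samples = 0
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for batch in dataloader:
        inputs = {k: v.to(device) for k, v in batch.items() if k != "labels"}
        model(**inputs)
        n_samples += inputs["input_ids"].shape[0]
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_samples, n_samples / elapsed
```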
| True Label | Predicted Label | Example (Excerpt) | Possible Cause |
|---|---|---|---|
| Finance | Stock | Fund performance drove the index upward | Shared tokens commonly used in Stock news |
| Stock | Finance | Stock market volatility intensified financial risks | Semantic framing overlaps with macroeconomic discourse |
| Technology | Education | EdTech platforms adopted new algorithms to improve learning experience | EdTech terminology overlaps between Tech and Education |
Muhetaer, M.; Meng, X.; Zhu, J.; Aikebaier, A.; Zu, L.; Bai, Y. Symmetry and Asymmetry in Pre-Trained Transformer Models: A Comparative Study of TinyBERT, BERT, and RoBERTa for Chinese Educational Text Classification. Symmetry 2025, 17, 1812. https://doi.org/10.3390/sym17111812
