A Comparative Study of BERT-Based Models for Teacher Classification in Physical Education
Abstract
1. Introduction
- (1) How accurately can Transformer models classify teacher behaviors from real classroom transcripts into predefined motivational categories?
- (2) To what extent can data augmentation techniques based on generative language models improve classification performance in the presence of class imbalance?
- (3) How does the length of teacher utterances affect model performance, and what preprocessing strategies might be required to optimize outcomes?
- (4) Which Transformer variants offer the best trade-off between accuracy and computational efficiency for this task?
2. Previous Works
BERT and Transformer Models in Education
3. Materials and Methods
3.1. Dataset
- Autonomy Support: Messages that promote choice, student initiative, or personal expression.
- Structure: Phrases that provide guidance, clear explanations, expectations, or structured feedback.
- Control: Authoritative interventions involving imposition, threats, or constraints on student autonomy.
- Chaos: Disorganized, ambiguous, or pedagogically irrelevant utterances.
- Unidentified Style: Utterances that do not clearly fit any defined motivational category.
3.2. Data Augmentation
3.3. Dataset Refinement and Statistical Analysis
3.4. BERT Tokenization
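BERT-style tokenization wraps each utterance in the special tokens listed in the Abbreviations ([CLS], [SEP], [PAD]). As a minimal illustration, the sketch below assumes the public BETO checkpoint dccuchile/bert-base-spanish-wwm-cased; the utterance is invented for illustration, not drawn from the study's corpus.

```python
from transformers import AutoTokenizer

# Assumes the public BETO checkpoint; the utterance is illustrative only.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

encoded = tokenizer(
    "Elegid el calentamiento que prefiráis.",  # "Choose whichever warm-up you prefer."
    padding="max_length",   # right-pad with [PAD] up to max_length
    truncation=True,
    max_length=16,
    return_tensors="pt",
)

# Tokens begin with [CLS], the text ends with [SEP], and [PAD] fills the rest;
# attention_mask is 1 for real tokens and 0 for padding.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
print(encoded["attention_mask"][0])
```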
3.5. BERT Architecture
3.6. BERT Variants
3.6.1. BETO
3.6.2. DistilBERT
3.6.3. RoBERTa
3.6.4. ALBERT
3.6.5. XLNet
3.6.6. mBERT
3.6.7. ELECTRA
4. Results
4.1. Evaluation Metrics
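For reference, the standard definitions underlying the metrics reported in this section, computed per class from true/false positives and negatives (TP, TN, FP, FN) and macro-averaged over the five categories:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN},

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
           {\mathrm{Precision} + \mathrm{Recall}}, \qquad
\text{Macro-}F_1 = \frac{1}{5} \sum_{c=0}^{4} F_1^{(c)}
```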
4.2. Hyperparametrization Results
- Learning rate (1 × 10⁻⁶ to 1 × 10⁻⁴): Controls the step size in each optimization iteration.
- Weight decay (0.0 to 0.3): Adds regularization to reduce overfitting.
- Batch size (4 to 64): Affects gradient stability and computational efficiency.
- Number of epochs (2 to 50): Determines training duration; higher values can improve learning but also increase the risk of overfitting.
- Maximum sequence length (16 to 512): Defines input size; given the complex structure of classroom utterances, this parameter was found to play a critical role in model accuracy. (A sketch of this search space follows the list.)
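A minimal Optuna sketch of the search space above, using the TPE sampler noted in the Abbreviations. The train_and_evaluate helper, the batch-size candidates (powers of two within the stated range), and the trial count are assumptions, not details from the paper.

```python
import optuna

# Hypothetical helper: fine-tunes one configuration and returns a
# validation score (e.g., macro-F1). Not shown in the original paper.
def train_and_evaluate(params):
    raise NotImplementedError

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.3),
        # Powers of two within the stated 4-64 range (an assumption).
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64]),
        "num_epochs": trial.suggest_int("num_epochs", 2, 50),
        "max_seq_length": trial.suggest_int("max_seq_length", 16, 512),
    }
    return train_and_evaluate(params)

# TPE is Optuna's default sampler; made explicit here for clarity.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)  # trial count is an assumption
print(study.best_params)
```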
4.3. BERT Model Performance Comparison
4.4. Statistical Result Analysis
4.5. Generalization to Unseen Classroom Data
5. Discussion and Future Directions
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
AI | Artificial Intelligence |
ASR | Automatic Speech Recognition |
BERT | Bidirectional Encoder Representations from Transformers |
BETO | BERT for Spanish (monolingual adaptation) |
NLP | Natural Language Processing |
MLM | Masked Language Modeling |
NSP | Next Sentence Prediction |
SOP | Sentence Order Prediction |
RTD | Replaced Token Detection |
mBERT | Multilingual BERT |
SDT | Self-Determination Theory |
LLM | Large Language Model |
TPE | Tree-structured Parzen Estimator (used in Optuna for optimization) |
GPU | Graphics Processing Unit |
CLS | Classification token (used in BERT input format) |
SEP | Separator token (used in BERT input format) |
PAD | Padding token (used in BERT input format) |
References
- Falcon, S.; Admiraal, W.; Leon, J. Teachers’ engaging messages and the relationship with students’ performance and teachers’ enthusiasm. Learn. Instr. 2023, 86, 101750. [Google Scholar] [CrossRef]
- Vermote, B.; Aelterman, N.; Beyers, W.; Aper, L.; Buysschaert, F.; Vansteenkiste, M. The role of teachers’ motivation and mindsets in predicting a (de)motivating teaching style in higher education: A circumplex approach. Motiv. Emot. 2020, 44, 270–294. [Google Scholar] [CrossRef]
- Franco, E.; Coterón, J.; Spray, C.M. Antecedents of Teachers’ Motivational Behaviours in Physical Education: A Scoping Review Utilising Achievement Goal and Self-Determination Theory Perspectives. Int. Rev. Sport Exerc. Psychol. 2024, 1–40. [Google Scholar] [CrossRef]
- González-Peño, A.; Franco, E.; Coterón, J. Do Observed Teaching Behaviors Relate to Students’ Engagement in Physical Education. Int. J. Environ. Res. Public Health 2021, 18, 2234. [Google Scholar] [CrossRef]
- Coterón, J.; González-Peño, A.; Martín-Hoz, L.; Franco, E. Predicting students’ engagement through (de)motivating teaching styles: A multi-perspective pilot approach. J. Educ. Res. 2025, 118, 243–256. [Google Scholar] [CrossRef]
- Billings, K.; Chang, H.-Y.; Lim-Breitbart, J.M.; Linn, M.C. Using Artificial Intelligence to Support Peer-to-Peer Discussions in Science Classrooms. Educ. Sci. 2024, 14, 1411. [Google Scholar] [CrossRef]
- Wang, S.; Wang, F.; Zhu, Z.; Wang, J.; Tran, T.; Du, Z. Artificial intelligence in education: A systematic literature review. Expert Syst. Appl. 2024, 252, 124167. [Google Scholar] [CrossRef]
- Adoma, A.F.; Henry, N.-M.; Chen, W. Comparative Analyses of BERT, RoBERTa, DistilBERT, and XLNet for Text-Based Emotion Recognition. Proc. Int. Comput. Conf. Wavelet Act. Media Technol. Inf. Process. 2020, 17, 117–121. [Google Scholar] [CrossRef]
- Alic, S.; Demszky, D.; Mancenido, Z.; Liu, J.; Hill, H.; Jurafsky, D. Computationally Identifying Funneling and Focusing Questions in Classroom Discourse. arXiv 2022, arXiv:2208.04715. [Google Scholar] [CrossRef]
- Jensen, E.; Pugh, S.L.; D’Mello, S.K. A Deep Transfer Learning Approach to Modeling Teacher Discourse in the Classroom. Proc. Int. Learn. Analytics Knowl. Conf. (LAK) 2021, 11, 302–312. [Google Scholar] [CrossRef]
- Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
- Lan, Y.; Li, X.; Du, H.; Lu, X.; Gao, M.; Qian, W.; Zhou, A. Survey of Natural Language Processing for Education: Taxonomy, Systematic Review, and Future Trends. arXiv 2024, arXiv:2401.07518. [Google Scholar] [CrossRef]
- Sajja, R.; Sermet, Y.; Cikmaz, M.; Cwiertny, D.; Demir, I. Artificial Intelligence-Enabled Intelligent Assistant for Personalized and Adaptive Learning in Higher Education. Information 2024, 15, 596. [Google Scholar] [CrossRef]
- Zheng, X.; Zhang, J. The usage of a transformer based and artificial intelligence driven multidimensional feedback system in English writing instruction. Sci. Rep. 2025, 15, 19268. [Google Scholar] [CrossRef]
- Kökver, Y.; Pektaş, H.M.; Çelik, H. Artificial intelligence applications in education: Natural language processing in detecting misconceptions. Educ. Inf. Technol. 2025, 30, 3035–3066. [Google Scholar] [CrossRef]
- Tran, N.; Pierce, B.; Litman, D.; Correnti, R.; Matsumura, L.C. Utilizing Natural Language Processing for Automated Assessment of Classroom Discussion. In Artificial Intelligence in Education. Communications in Computer and Information Science; Wang, N., Rebolledo-Mendez, G., Dimitrova, V., Matsuda, N., Santos, O.C., Eds.; Springer: Cham, Switzerland, 2023. [Google Scholar] [CrossRef]
- Ilagan, M.; Beigman Klebanov, B.; Mikeska, J. Automated Evaluation of Teacher Encouragement of Student-to-Student Interactions in a Simulated Classroom Discussion. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 182–198. [Google Scholar]
- Liu, C.; Yang, F.; Ge, C.; Shao, Z. A Dynamic Precision Evaluation System for Physical Education Classroom Teaching Behaviors Based on the CogVLM2-Video Model. Appl. Sci. 2025, 15, 7712. [Google Scholar] [CrossRef]
- Ramírez Cañas, A. Procesamiento del lenguaje natural y aprendizaje de lenguas extranjeras: Abordaje metodológico desde la realización de una tarea lingüística. Cuad. Activa 2023, 14. [Google Scholar] [CrossRef]
- Wulff, P.; Mientus, L.; Nowak, A.; Borowski, A. Utilizing a Pretrained Language Model (BERT) to Classify Preservice Physics Teachers’ Written Reflections. Int. J. Artif. Intell. Educ. 2023, 33, 439–466. [Google Scholar] [CrossRef]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar] [CrossRef]
- Nazaretsky, T.; Mikeska, J.N.; Beigman Klebanov, B. Empowering Teacher Learning with AI: Automated Evaluation of Teacher Attention to Student Ideas during Argumentation-focused Discussion. Proc. Int. Learn. Analytics Knowl. Conf. (LAK) 2023, 13, 122–132. [Google Scholar] [CrossRef]
- Cañete, J.; Chaperon, G.; Fuentes, R.; Ho, J.-H.; Kang, H.; Pérez, J. Spanish Pre-Trained BERT Model and Evaluation Data. arXiv 2023, arXiv:2308.02976. [Google Scholar] [CrossRef]
- Lin, F. Sentiment analysis in online education: An analytical approach and application. Appl. Comput. Eng. 2024, 33, 9–17. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Kim, T.; Lee, H.-J. Analyzing the Scaling Characteristics of Transformer Feed-Forward Networks for the Trillion-Parameter Era and Beyond. In Proceedings of the International Conference on Electronics, Information, and Communication (ICEIC) 2024, Taipei, Taiwan, 28–31 January 2024; pp. 1–2. [Google Scholar] [CrossRef]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942. [Google Scholar] [CrossRef]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 33, pp. 5753–5763. [Google Scholar]
- Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 6 April–1 May 2020. [Google Scholar]
- Wolpert, D.H.; Macready, W.G. No Free Lunch Theorems for Optimization. IEEE Trans. Evol. Comput. 1997, 1, 67–82. [Google Scholar] [CrossRef]
Author/s (Year) | Aim | Methodology | Findings |
---|---|---|---|
Lin (2024) [24] | To streamline the automatic evaluation of sentiment and the extraction of opinions from the large volume of content produced by learners during online interactions | Data from online learning texts are preprocessed (cleaning, tokenization, lemmatization, vectorization), analyzed with ML models (Naive Bayes, SVM, RNNs) for sentiment, and enriched through topic modeling (LDA) to extract learner concerns. | The developed model predicted the sentiment of learners' texts with considerable reliability.
Nazaretsky et al. (2023) [22] | To explore whether automated analysis of transcripts of the teacher's interaction with simulated students using Natural Language Processing techniques could yield an accurate evaluation of the teacher's performance | Transcripts of 157 simulated classroom discussions were annotated with rubric scores and rater justifications; datasets were built for utterance- and transcript-level tasks, and models (DistilBERT, regression) were trained and evaluated with MSE, F1, Pearson's r, and the Pyramid method. | Automating the holistic scoring of teacher-led argumentation practice is feasible. The system's ability to identify evidence for the score could serve as a basis for formative feedback to the teacher.
Tran et al. (2023) [16] | To experiment with various modern Natural Language Processing techniques to automatically generate rubric scores for individual dimensions of classroom text discussion quality | Instructional Quality Assessment (IQA) scores were predicted using BERT-based models: a baseline end-to-end neural model, hierarchical Analytic Teaching Moves (ATM) classifiers (2-step BERT), and BERT-BiLSTM-CRF for sequence labeling; additional techniques included downsampling for imbalance, merging adjacent ATM codes, and regression from ATM counts to IQA scores. | IQA models using either Hierarchical Classification or Sequence Labeling to first predict ATM codes outperform baseline end-to-end IQA models, while each of the ATM-based IQA models performs better than the other on certain IQA rubrics.
Wulff et al. (2023) [20] | To explore to what extent deep learning approaches can improve classification performance for segments of written reflections | BERT was used to classify segments of preservice physics teachers' written reflections according to elements in a reflection-supporting model. | BERT outperformed the other deep learning architectures and previously reported performances with shallow learning algorithms for classification of segments of reflective writing. BERT starts to outperform the other models when trained on about 20 to 30% of the training data.
Utterance | Label |
---|---|
“We’ll finish by playing a match and applying what we’ve learned today.” | 2 (Structure) |
“We’ll be a bit tight in space, but don’t worry, we all fit.” | 1 (Autonomy Support) |
“Well, you can wait here for the whole hour.” | 4 (Chaos) |
“No. Now you must help me pick up the balls.” | 3 (Control) |
“I’m just observing.” | 0 (Unidentified Style) |
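The integer coding used in this table (and in the per-class results later) can be summarized as a simple mapping; a minimal sketch, with names taken from Section 3.1:

```python
# Integer label coding as used in the tables (0-4); names follow Section 3.1.
ID2LABEL = {
    0: "Unidentified Style",
    1: "Autonomy Support",
    2: "Structure",
    3: "Control",
    4: "Chaos",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}  # inverse lookup
```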
Category | % of Utterances | Avg. Length (Chars) | Avg. Length (Words) | Example Utterance |
---|---|---|---|---|
0—Unidentified | 10.84% | 88.6 | 17.1 | “Well, what’s this for, teacher?”/“Alright. Let’s go.” |
1—Autonomy Support | 27.71% | 64.0 | 11.4 | “Of the two we know, whichever you want: the backward one or the lateral one.” |
2—Structure | 16.94% | 89.8 | 16.4 | “Rubén, take the group and lead your warm-up.” |
3—Control | 15.58% | 80.2 | 14.7 | “Take Mateo out, he doesn’t know what he’s doing.”/“Now I’m on guard duty.” |
4—Chaos | 28.93% | 50.5 | 9.5 | “Don’t ask me, I’m fed up.”/“I don’t understand anything.” |
Model | Language Focus | Pretraining Objective | Architecture | Advantages |
---|---|---|---|---|
BETO | Spanish (monolingual) | MLM + NSP | 12 layers, 12 heads, 110M params | Optimized for Spanish; high accuracy in Spanish NLP tasks |
DistilBERT | Spanish (monolingual, distilled) | MLM (no NSP) | 6 layers, 6 heads, ~66M params | Faster and lighter; competitive performance |
RoBERTa | Spanish (from English RoBERTa variant) | MLM (dynamic masking) | 12 layers, 12 heads, 125M params | Improved training procedure; better generalization |
ALBERT | Multilingual/English | MLM + Sentence Order Prediction | Parameter sharing across layers (~12M params) | Parameter-efficient; reduced size with comparable performance |
XLNet | English (tested in Spanish via transfer) | Permutation Language Modeling | Transformer-XL base (~110M params) | Captures bidirectional context without masking |
mBERT | Multilingual (104 languages) | MLM + NSP | Same as BERT-base (110M params) | Cross-lingual capabilities; robust multilingual support |
ELECTRA | English (evaluated in Spanish) | Replaced Token Detection | Same as BERT-base (110M params) | More efficient pretraining; better use of all tokens |
ELECTRA-Small | English (evaluated in Spanish) | Replaced Token Detection | Reduced version of ELECTRA (~14M params) | Fast, lightweight; strong performance on small datasets |
Model | Learning Rate (×10⁻⁵) | Weight Decay | Batch Size | Epochs | Max Length |
---|---|---|---|---|---|
BETO | 1.3 | 0.18 | 8 | 50 | 436 |
DistilBERT | 5.82 | 0.36 | 16 | 4 | 62 |
RoBERTa | 2.95 | 0.08 | 4 | 14 | 208 |
ALBERT | 1.0 | 0.01 | 16 | 6 | 70 |
XLNet | 0.55 | 0.15 | 32 | 23 | 497 |
mBERT | 1.2 | 0.2 | 16 | 35 | 456 |
ELECTRA | 5.91 | 0.13 | 16 | 24 | 91 |
ELECTRA-Small | 1.98 | 0.14 | 4 | 22 | 277 |
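As a sketch of how BETO's tuned values from the table above would be applied with the Hugging Face Trainer. The checkpoint name, output path, and the pre-tokenized train_ds/eval_ds objects (truncated/padded to the 436-token maximum at tokenization time) are assumptions.

```python
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Five output classes per the label coding above; checkpoint name assumed.
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased", num_labels=5)

args = TrainingArguments(
    output_dir="beto-teaching-style",  # hypothetical output path
    learning_rate=1.3e-5,              # 1.3 x 10^-5 from the table
    weight_decay=0.18,
    per_device_train_batch_size=8,
    num_train_epochs=50,
)

train_ds = eval_ds = None  # placeholders: substitute tokenized Dataset objects
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```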
Model | Accuracy | Precision | Recall | F1-Score | Best Classified | Worst Classified | Training Time (s) |
---|---|---|---|---|---|---|---|
BETO | 0.79 | 0.75 | 0.74 | 0.74 | Chaos | Unidentified | 1029.65
DistilBERT | 0.73 | 0.68 | 0.68 | 0.68 | Chaos | Unidentified | 72.11
RoBERTa | 0.56 | 0.50 | 0.50 | 0.50 | Chaos | Control | 467.55
ALBERT | 0.62 | 0.57 | 0.53 | 0.52 | Autonomy Support | Control | 116.84
XLNet | 0.64 | 0.59 | 0.58 | 0.58 | Chaos | Unidentified | 8980.88
mBERT | 0.67 | 0.62 | 0.60 | 0.61 | Chaos | Control | 768.58
ELECTRA | 0.58 | 0.55 | 0.53 | 0.53 | Autonomy Support | Unidentified | 311.18
ELECTRA-Small | 0.60 | 0.55 | 0.55 | 0.54 | Chaos | Unidentified | 164.39
True Label → Predicted Label | Example Utterance |
---|---|
Control → Autonomy Support | “Who is winning?”/“One?”/“Bad, terrible.”/“Everyone must be willing to collaborate with each other.” |
Unidentified → Control | “Alba, we hand out the bibs in order.”/“I’ll stay right here.”/“Okay.”/“I don’t have a helmet.” |
Model | Accuracy (95% CI) | Macro-F1 (95% CI) | Δ Accuracy | Δ Macro-F1 | ART p-Value |
---|---|---|---|---|---|
BETO + augmentation | 0.78 (0.74–0.83) | 0.73 (0.68–0.79) | +0.20 | +0.19 | <0.001 |
BETO without augmentation | 0.58 (0.54–0.62) | 0.54 (0.51–0.58) | – | – | – |
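The 95% intervals above are consistent with resampling-based estimates; a minimal percentile-bootstrap sketch, assuming per-utterance 0/1 correctness vectors (the paper's exact CI procedure and test-set size are not shown in this extract):

```python
import numpy as np

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for accuracy from a 0/1 outcome vector."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    means = correct[idx].mean(axis=1)  # accuracy of each resampled test set
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Illustrative use: 78% of 400 test utterances correct (numbers assumed).
correct = np.array([1] * 312 + [0] * 88)
print(bootstrap_ci(correct))
```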
Class | n01 (A Correct, B Incorrect) | n10 (A Incorrect, B Correct) | χ2 Statistic | p-Value |
---|---|---|---|---|
Unidentified (0) | 26 | 20 | 0.78 | 0.38 |
Autonomy Support (1) | 52 | 19 | 15.34 | <0.001
Structure (2) | 29 | 25 | 0.30 | 0.59
Control (3) | 22 | 19 | 0.22 | 0.64
Chaos (4) | 77 | 9 | 53.77 | <0.001
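The χ² column matches McNemar's test without continuity correction, e.g. (52 − 19)²/(52 + 19) ≈ 15.34 for Autonomy Support; the sketch below reproduces the table's statistics and p-values from the discordant counts:

```python
from scipy.stats import chi2

def mcnemar(n01, n10):
    """Uncorrected McNemar statistic and p-value (1 degree of freedom)."""
    stat = (n01 - n10) ** 2 / (n01 + n10)
    return stat, chi2.sf(stat, df=1)

# Discordant-pair counts copied from the table above.
rows = [("Unidentified (0)", 26, 20), ("Autonomy Support (1)", 52, 19),
        ("Structure (2)", 29, 25), ("Control (3)", 22, 19), ("Chaos (4)", 77, 9)]
for label, n01, n10 in rows:
    stat, p = mcnemar(n01, n10)
    print(f"{label}: chi2 = {stat:.2f}, p = {p:.3g}")
```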
Model | Accuracy (95% CI) | Macro-F1 (95% CI) | Δ Accuracy vs. BETO | Δ Macro-F1 vs. BETO | ART p-Value (10,000 perms) |
---|---|---|---|---|---|
BETO | 0.78 (0.74–0.83) | 0.73 (0.68–0.79) | – | – | – |
DistilBERT | 0.74 (0.69–0.79) | 0.69 (0.63–0.74) | −0.05 | −0.05 | 0.05 |
mBERT | 0.66 (0.61–0.71) | 0.60 (0.54–0.65) | −0.13 | −0.14 | <0.001 |
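Assuming ART here denotes an approximate randomization (paired permutation) test over per-utterance outcomes, the following is a generic sketch of how such p-values are obtained; the paper's exact procedure and test statistic may differ.

```python
import numpy as np

def art_p_value(correct_a, correct_b, n_perms=10_000, seed=0):
    """Paired sign-flip permutation test on the accuracy difference."""
    rng = np.random.default_rng(seed)
    observed = abs(correct_a.mean() - correct_b.mean())
    hits = 0
    for _ in range(n_perms):
        swap = rng.random(correct_a.size) < 0.5   # randomly swap each pair
        a = np.where(swap, correct_b, correct_a)
        b = np.where(swap, correct_a, correct_b)
        hits += abs(a.mean() - b.mean()) >= observed
    return (hits + 1) / (n_perms + 1)             # add-one smoothing
```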