Article
Peer-Review Record

A Comparative Study of BERT-Based Models for Teacher Classification in Physical Education

Electronics 2025, 14(19), 3849; https://doi.org/10.3390/electronics14193849
by Laura Martín-Hoz 1, Samuel Yanes-Luis 2, Jerónimo Huerta Cejudo 3, Daniel Gutiérrez-Reina 2,* and Evelia Franco Álvarez 3
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 11 August 2025 / Revised: 17 September 2025 / Accepted: 23 September 2025 / Published: 28 September 2025
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Your manuscript presents a valuable contribution to the field of Teacher Behavior Classification by using BERT-based models. The proposed approach potentially enhances classification accuracy, making the study a relevant addition to the literature. However, a few areas require further clarification and refinement.

Specific Comments and Suggestions for Improvement:

  1. The abstract is very generic; the analytical results should be mentioned in it.
  2. It would be better to summarize the analysis of previous works in tabular form and to state the research gap clearly.
  3. The system was trained and tested on a self-curated dataset; why was it not evaluated on a benchmark dataset?
  4. It is concluded that BETO outperforms all other methods; it would be helpful to readers if the authors explained the reasons for its better performance in detail.
  5. The results need to be compared with some benchmark.
  6. Future research directions: a short discussion of future extensions, such as real-time adaptability, would be valuable.

Author Response

See the attached pdf file.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The study explores which BERT-based model is most suitable for classifying teachers' performance. In my opinion, the manuscript needs to improve the quality of its writing.

  1. The study needs to state its research hypotheses in the methodology section and summarize the results of testing them in the "Discussion" section. I have read that the authors formulate research questions; however, readers would like to know whether these research questions have been answered.
  2. Table 5 shows the parameters of the whole study and Table 6 shows the results. The authors need to explain to readers the meaning of the results and the criteria used to choose among the models, because some problems remain unsolved; for example, "'Control' and 'Unclassified' remain the most challenging, underscoring the complexity of modeling motivational discourse categories when theoretical overlap exists" (manuscript lines 726-727).

Author Response

See the attached pdf file.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper presents a comparative evaluation of BERT-based Transformer architectures (BETO, DistilBERT, RoBERTa, ALBERT, XLNet, mBERT, ELECTRA, ELECTRA-Small) for the task of classifying teaching behaviors in Physical Education (PE) classes. The authors constructed a dataset of ~1,393 annotated utterances derived from 34 lessons, categorized according to Self-Determination Theory’s circumplex model (Autonomy Support, Structure, Control, Chaos, Unidentified). To address data imbalance, ChatGPT-4o was used for generative augmentation. Hyperparameter optimization was carried out using Optuna, and models were evaluated using multiple metrics. BETO achieved the best performance (79% accuracy, macro-F1 = 0.74), demonstrating the benefits of monolingual Spanish pretraining for this task.
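For concreteness, the following is a minimal sketch of the kind of Optuna-driven hyperparameter search summarized above, built around Hugging Face fine-tuning. The checkpoint name, toy utterances, search space, metric wiring, and trial budget are all illustrative assumptions, not the authors' reported configuration:

    # Minimal sketch of an Optuna search over fine-tuning hyperparameters.
    # All names and values (checkpoint, toy data, search space, trial budget)
    # are illustrative assumptions, not the paper's actual setup.
    import numpy as np
    import optuna
    from datasets import Dataset
    from sklearn.metrics import f1_score
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # a public BETO checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # Toy stand-in for the annotated utterances (ids 0-4 = the five categories).
    toy = Dataset.from_dict({
        "text": ["Muy bien, sigue asi.", "Haz exactamente lo que digo."] * 8,
        "label": [1, 3] * 8,
    }).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                               max_length=32), batched=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return {"macro_f1": f1_score(labels, preds, average="macro")}

    def objective(trial):
        args = TrainingArguments(
            output_dir=f"trial-{trial.number}",
            learning_rate=trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
            per_device_train_batch_size=trial.suggest_categorical("batch_size", [8, 16]),
            num_train_epochs=trial.suggest_int("epochs", 2, 4),
            report_to="none",
        )
        model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                                   num_labels=5)
        trainer = Trainer(model=model, args=args, train_dataset=toy,
                          eval_dataset=toy, compute_metrics=compute_metrics)
        trainer.train()
        return trainer.evaluate()["eval_macro_f1"]

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=5)
    print("Best hyperparameters:", study.best_params)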

Detailed suggestions for improvement are listed below:

-The dataset, while carefully annotated, remains relatively small and heavily supplemented with synthetic data. Expanding it to other subjects and institutions would improve generalizability, and an ablation study comparing models with and without augmentation would help quantify the true impact of synthetic samples.

-The evaluation would also benefit from stronger statistical validation. Reporting confidence intervals or applying significance tests (e.g., McNemar's) would confirm whether observed improvements, such as BETO's superiority, are meaningful rather than the result of sampling variation; a minimal example of such a test is sketched after this list.

-More detailed error analysis is needed to highlight model limitations. Confusion matrices and qualitative examples of misclassifications—particularly between overlapping categories like “Structure” and “Control”—would clarify where models struggle. Simple interpretability methods such as attention visualizations could further explain model decisions.

-The paper should discuss deployment and generalization more explicitly. Lightweight models (e.g., DistilBERT, ELECTRA-Small) may be more practical in real-time classroom contexts, and the potential to extend this work across languages, subjects, and multimodal data should be emphasized. Ethical and privacy considerations in teacher monitoring also deserve acknowledgment.
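As a concrete illustration of the suggested significance testing, here is a minimal sketch of McNemar's test on paired predictions, reducing the multi-class task to per-item correct/incorrect so the standard 2x2 form applies. The arrays are toy placeholders, not results from the paper:

    # Minimal sketch of McNemar's test comparing two classifiers on the same
    # test items. Multi-class predictions are reduced to per-item
    # correct/incorrect, the usual way to apply the 2x2 test.
    # All arrays below are toy placeholders.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    y_true = np.array([0, 1, 2, 3, 4, 1, 2, 0, 3, 4])  # gold labels (toy)
    pred_a = np.array([0, 1, 2, 3, 4, 1, 1, 0, 3, 2])  # e.g., BETO (toy)
    pred_b = np.array([0, 1, 1, 3, 4, 2, 2, 0, 1, 4])  # e.g., DistilBERT (toy)

    a_ok = pred_a == y_true
    b_ok = pred_b == y_true

    # 2x2 table: rows = model A correct/incorrect, cols = model B.
    table = np.array([
        [np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
        [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)],
    ])

    result = mcnemar(table, exact=True)  # exact binomial version for small counts
    print(f"statistic={result.statistic}, p={result.pvalue:.3f}")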

Author Response

See the attached PDF file.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The article examines the application of Transformer-based language models for the automatic classification of pedagogical behavior in physical education lessons. Based on a corpus of more than 1,300 utterances annotated according to the circumplex approach, eight pretrained models were evaluated, with the BETO model demonstrating the best performance (accuracy of 0.78). The article is clearly structured; the research methodology is described in sufficient detail and contains no semantic inconsistencies. The reference list corresponds to the content of the article and is both comprehensive and up to date.

The following shortcomings should be noted:

  1. In the description of the dataset (lines 310–311), the authors state that segmentation was carried out according to predefined linguistic and contextual criteria agreed upon within the team. However, no details of these criteria are provided. A reference should be included if they were borrowed from another source, or a few statements should be added to clarify and justify them.
  2. It is also unclear in the same section how many segments were obtained in total and subsequently annotated by experts.
  3. On page 8, lines 335–336, it is mentioned that in cases where consensus could not be reached, a third expert was involved to make the final decision. However, earlier (page 7, lines 312–314), it is indicated that there were seven experts supervised by two PhDs in Physical Activity and Sport Sciences. If all seven experts annotated the same item, it remains unclear how they could fail to reach a decision given that there were only five labels. The only conceivable cases are those in which the annotation results in distributions such as 2+2+1+1+1, 2+2+2+1, or 3+3+1, and both supervisors cannot reach agreement on the final label to be assigned. In all such situations, the issue could be resolved within one or two additional rounds of annotation with a reduced set of labels. Therefore, the involvement of a third expert making a unilateral decision would not be necessary. The description of the annotation process requires careful revision.
  4. The class imbalance mentioned in connection with Figure 2 is difficult to assess without close inspection of the figure. Although the number of elements per class is shown, the imbalance is not easily discernible. Since Figures 2 and 3 primarily aim to illustrate the distribution of segment lengths, it would be advisable to add a histogram displaying the distribution of the number of segments per class, or alternatively, revise Figures 2 and 3.
  5. The authors note that some examples were categorized as insufficient (lines 360–362). This requires clarification: examples should be provided to illustrate what constituted “insufficient” texts, and the number of sufficient texts in each class should be reported.
  6. After augmentation, the number of examples in each class was balanced to 600, except for class 2 (700). The reference to Figure 3 in this section (4.2) is inaccurate, as this figure, according to subsequent text, reflects the final distribution after further processing and filtering of the dataset (Section 4.3, lines 398–411). The description of dataset preparation should be carefully reviewed.
  7. In Figure 6, either the labels of the bars or the bars themselves are mismatched. Based on the number of elements, class 0 (Unidentified) with 151 elements (121+30) is presented first, followed by classes 1 through 4.
  8. In the section describing the comparison of model testing results, a visualization of the confusion matrix is lacking. It should be provided at least for the best-performing model (a minimal plotting sketch follows this list). The confusion matrix would offer a clearer understanding of the most frequent classification errors made by the model.
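For point 8, producing such a plot is a one-call affair in scikit-learn. In the sketch below, the class ordering (class 0 = Unidentified, as read from Figure 6), the mapping of ids 1-4 to category names, and the label/prediction arrays are assumptions for illustration only:

    # Minimal sketch of a confusion-matrix plot with scikit-learn. The class
    # ordering (0 = Unidentified) follows the reading of Figure 6 above; the
    # mapping of ids 1-4 to names and the arrays are toy assumptions.
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    class_names = ["Unidentified", "Autonomy Support", "Structure",
                   "Control", "Chaos"]
    y_true = [0, 1, 2, 3, 4, 2, 3, 1, 0, 4]  # gold labels (toy)
    y_pred = [0, 1, 2, 2, 4, 2, 3, 1, 3, 4]  # best model's predictions (toy)

    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred,
        labels=list(range(len(class_names))),  # fix class order on both axes
        display_labels=class_names,
        xticks_rotation=45,
    )
    plt.tight_layout()
    plt.show()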

Author Response

See the attached PDF file.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have revised the manuscript in response to my prior feedback as follows:

  • Clarifying the dataset augmentation process, with explicit description of expert validation of synthetic samples.

  • Adding confusion matrices (Figures 6–7) and an error analysis section, including misclassified examples (Table 7).

  • Expanding the discussion on deployment, generalization, and ethical considerations.

  • Acknowledging limitations and adding a “Future Directions” section.

 

These revisions strengthen the manuscript by improving transparency of methods, interpretability of results, and practical framing.

The authors clarified that synthetic data were validated by domain experts, which improves confidence in their reliability. They also acknowledged the limitation of dataset size and committed to expansion in future work. However, the requested ablation study comparing models with and without augmentation was not conducted. This remains a significant gap, as it is critical to quantify the actual contribution of synthetic samples.

The authors added confusion matrices but declined to provide statistical tests, arguing that tests such as McNemar’s are binary-only and that confidence intervals are unnecessary due to deterministic predictions. While their reasoning is noted, statistical validation is still essential in multi-class classification to ensure that observed differences (e.g., BETO’s superiority) are meaningful and not due to sampling variation.

To make the paper scientifically rigorous, I recommend the following additions:

  • Compact Ablation Study: Train BETO and one lightweight baseline (e.g., DistilBERT) under four conditions: (i) no augmentation, (ii) undersampling only, (iii) synthetic only, (iv) undersampling + synthetic. Report Accuracy, Macro-F1, per-class F1, and confusion matrices. This will quantify the true benefit of augmentation.

  • Statistical Testing: Apply paired statistical methods suitable for multi-class classification (two of these procedures are sketched after this list), such as:

    • Stratified bootstrap confidence intervals (Accuracy, Macro-F1, per-class F1).

    • Approximate randomization test on Macro-F1 and Accuracy.

    • Bowker’s or Stuart–Maxwell test on paired predictions to detect significant disagreement patterns.

    • If preferred, run one-vs-rest McNemar tests.

  • Explicit reporting of effect sizes: Alongside p-values, report the absolute ΔMacro-F1 and ΔAccuracy between models.
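Two of the recommended procedures are sketched below: a bootstrap confidence interval for macro-F1 (plain rather than stratified, for brevity) and an approximate randomization test on the macro-F1 difference between two models. All arrays are synthetic placeholders, not results from the paper:

    # Minimal sketch of a bootstrap CI for macro-F1 and an approximate
    # randomization test for the macro-F1 gap between two models on the
    # same test set. All data below are synthetic placeholders.
    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 5, size=200)  # gold labels (toy, 5 classes)
    pred_a = np.where(rng.random(200) < 0.8, y_true, rng.integers(0, 5, size=200))
    pred_b = np.where(rng.random(200) < 0.7, y_true, rng.integers(0, 5, size=200))

    # Bootstrap 95% CI for model A's macro-F1: resample test items with
    # replacement and take percentiles of the resampled scores.
    boot = []
    for _ in range(2000):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        boot.append(f1_score(y_true[idx], pred_a[idx], average="macro"))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"model A macro-F1 95% CI: [{lo:.3f}, {hi:.3f}]")

    # Approximate randomization test: randomly swap the two models'
    # predictions per item and count how often the shuffled macro-F1 gap
    # matches or exceeds the observed one.
    obs = abs(f1_score(y_true, pred_a, average="macro") -
              f1_score(y_true, pred_b, average="macro"))
    count = 0
    for _ in range(2000):
        swap = rng.random(len(y_true)) < 0.5
        a = np.where(swap, pred_b, pred_a)
        b = np.where(swap, pred_a, pred_b)
        count += abs(f1_score(y_true, a, average="macro") -
                     f1_score(y_true, b, average="macro")) >= obs
    print(f"approximate randomization p-value: {(count + 1) / 2001:.3f}")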

 

Author Response

See the attached document

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for the opportunity to review the second revised version of the manuscript. I have carefully evaluated the updated submission and would like to confirm that all of my previous comments and suggestions have been thoroughly addressed by the authors. The manuscript has significantly improved in clarity, structure, and scientific rigor as a result.
