Critical Thinking Writing Assessment in Middle School Language: Logic Chain Extraction and Expert Score Correlation Test Using BERT-CNN Hybrid Model
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes an automated method for assessing middle school students’ critical thinking in Chinese argumentative writing using a BERT-CNN hybrid model. The authors construct a dataset of 4,827 essays, annotated according to the Paul-Elder framework, and design three logic chain extraction algorithms. The model achieves strong correlation with expert scores (Pearson r = 0.872) and outperforms traditional baselines. The study highlights contributions of semantic, syntactic, and logical features, and discusses implications for intelligent educational assessment.
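As context for the correlation results summarized above, a minimal sketch of how agreement between model outputs and expert scores (Pearson and Spearman, as reported in the paper) is typically computed, assuming SciPy; the score arrays are illustrative placeholders, not the authors' data:

```python
# Illustrative only: computing Pearson/Spearman agreement between model scores
# and expert scores. The arrays below are made-up examples, not the paper's data.
from scipy.stats import pearsonr, spearmanr

expert_scores = [72.5, 85.0, 64.0, 90.5, 78.0]   # hypothetical expert ratings
model_scores  = [70.0, 83.5, 66.5, 88.0, 80.5]   # hypothetical model outputs

pearson_r, pearson_p = pearsonr(expert_scores, model_scores)
spearman_rho, spearman_p = spearmanr(expert_scores, model_scores)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
```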
Major Concerns:
The paper positions itself as innovative, but much of the methodology (BERT-based embeddings, CNN local feature extraction, Paul-Elder framework) has already been explored in related work. The contribution compared to prior hybrid or multi-task approaches is not clearly articulated. The authors must clarify:
- What is genuinely novel in their architecture beyond combining known components?
- How does their dataset or annotation scheme significantly advance prior resources such as the CEDCC or NLPCC benchmarks?
- The description of the logic chain extraction algorithms (argument-evidence mapping, causal reasoning, rebuttal-support) is too brief and mathematically dense, without sufficient intuitive explanation or examples. More illustrative cases and error analyses should be added.
- The annotation process needs clearer detail: how were disagreements between experts resolved? How were the Paul-Elder standards operationalized for middle school essays?
- Some claims (e.g., “significantly outperforming existing models”) are overstated, since no recent strong baselines (e.g., Transformer-only architectures with attention pooling, graph-based argument mining) are included in the comparison.
- The interpretation of the ablation results could be more carefully reasoned. For example, semantic features dominate while the CNN contributes relatively little; does this justify the added architectural complexity?
- The manuscript is overly long and technical in some sections, while the practical educational implications are condensed; the balance should be improved. Several grammatical and stylistic errors also remain.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
There are some minor recommendations for improvement.
I) General recommendations:
- Please ensure that “F1 score” is written consistently throughout the whole paper. In Lines 19, 134, 158, etc., it is “F1 score”, whereas in Line 462 (Table 7) it is “F1-Score” and in Lines 458, 667, etc., it is “F1”.
- Please cite the equations in the text. For instance, “Equation (1) presents …”.
II) Particular recommendations:
1) Abstract
- In Line 11, mention “Bidirectional Encoder Representations from Transformers - Convolutional Neural Network (BERT-CNN)”
2) 1. Introduction
- In Lines 59-66, the stated research innovations should be presented as a bullet list to emphasize them more clearly.
- In Line 47, replace “LSTM” with “Long Short-Term Memory (LSTM)”. This is the first time you mention LSTM in the text, and this abbreviation must be explained.
- In Line 49, replace “BERT” with “Bidirectional Encoder Representations from Transformers (BERT)”.
- Do the same for “CNN” in Line 56.
3) 2.1. Critical Thinking Assessment Framework
- In Line 106, explain that M stands for mean and SD stands for standard deviation.
4) 2.2. Development of Automated Writing Assessment Technology
- Explain the meaning of the following abbreviations: QWK in Line 117, ASAP in Line 128, XGBoost in Line 131, C-LSTM in Line 133, CEDCC in Line 139, NLPCC in Line 141.
5) 2.3. Argumentation Mining and Logic Chain Extraction
- In Lines 151, 156, 164, etc., it is not necessary to mention the conference name; it is enough to state the name of the author. If you need to highlight the year, you can write “In 2024, …”.
- Explain the meaning of the following abbreviations: AAEC, AbstRCT, and CDCP in Line 158, IRT, DNN-AES in Line 173, SHAP in Line 176.
6) 3.1.1. BERT Encoder Design
- Equation (3) does not include “Hi”, the parameter that you explained in Line 204.
- Explain the meaning of “bg” from Equation (4).
- In Line 213, mention “rectified linear unit (ReLU)”.
7) 3.3.3. Annotation Consistency Verification
- Mention in Lines 299-304 that ICC stands for Intraclass Correlation Coefficients and CI for confidence interval.
8) 4.2.1. Comparison with Baseline Models
- Add a note under Table 5 to explain the meaning of the up and down arrows used for the MSE, MAE, Pearson, and Spearman parameters.
9) 4.2.2. Performance of Different Architecture Variants
- Please increase the size of Figure 5.
- Add value labels for each histogram in Figure 5b.
10) 4.3.1. Recognition Accuracy of Three Types of Logic Chains
- In Table 7, add the unit of measurement (%) to the “Error Rate” header, i.e., “Error Rate (%)”, and remove it after each value.
11) 4.3.2. Typical Case Analysis
- In Figure 7, make sure that the “Extraction Performance” text box is brought to the front so it can be visible.
12) 4.3.3. Error Pattern Analysis
- In Table 8, replace “Percentage” with “Weight (%)” and remove it after each value.
- Use only one decimal digit (for instance, 28.4 instead of 28.40) for the values in this column.
13) 4.4.1. Overall Correlation Analysis
- Please make sure that the values on the right in Figure 8b are legible.
- Add value labels for each histogram in Figure 8b.
14) 4.4.3. Cross-Grade Consistency Analysis
- In Table 10, you used “Average” and “Std Dev”, while in Lines 106 and 107 you used “M” for mean and “SD” for standard deviation. Please ensure the same abbreviations are used throughout the entire paper.
15) 4.5.2. Feature Importance Assessment
- In Line 575, use only “SHAP”. Its meaning must be explained at Line 176, where it is used for the first time.
- In Table 12, use “Contribution (%)” and remove it after each value.
- Use only one decimal digit (for instance, 34.6 instead of 34.60) for the values in this column.
16) 6. Conclusions
- Additional limitations of the study should be mentioned, such as the dependence of the results on the dataset, the specificities of the Chinese language, and the students' cognitive development.
17) Add the Abbreviation list according to the MDPI template https://www.mdpi.com/files/word-templates/applsci-template.dot
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes an automated assessment method for middle school students’ critical thinking in Chinese language writing. The proposed method is based on a hybrid BERT-CNN model created to combine deep semantic encoding with local feature extraction. The study uses a dataset of 4,827 argumentative essays annotated by experts according to the Paul–Elder framework across nine dimensions. Three types of logic chain extraction algorithms are implemented (argument–evidence mapping, causal reasoning, rebuttal–support structures). Experimental results show a high correlation between the model’s outputs and expert scores and competitive performance in logic chain recognition.
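As context for the architecture summarized above, a minimal sketch of the kind of BERT-CNN hybrid scorer described, assuming HuggingFace Transformers and PyTorch; the checkpoint name, filter counts, and kernel sizes are illustrative assumptions, not the authors' actual configuration:

```python
# A minimal, illustrative BERT-CNN hybrid regressor: BERT provides contextual
# token embeddings, a CNN branch captures local n-gram patterns, and a linear
# head predicts one score per assessment dimension (e.g., nine Paul-Elder
# dimensions). Hyperparameters are assumptions, not the paper's configuration.
import torch
import torch.nn as nn
from transformers import AutoModel

class BertCnnScorer(nn.Module):
    def __init__(self, pretrained="bert-base-chinese", n_filters=128,
                 kernel_sizes=(2, 3, 4), n_dims=9):
        super().__init__()
        self.bert = AutoModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        # Convolutions over the token dimension on top of BERT's embeddings.
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes]
        )
        self.regressor = nn.Linear(n_filters * len(kernel_sizes), n_dims)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) -> (batch, hidden, seq_len) for Conv1d
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h = h.transpose(1, 2)
        # ReLU + max-over-time pooling for each kernel size, then concatenate.
        pooled = [torch.relu(conv(h)).max(dim=2).values for conv in self.convs]
        return self.regressor(torch.cat(pooled, dim=1))
```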
Weaknesses:
1) The dataset is restricted to Chinese middle school essays. It is unclear how well the proposed approach transfers to other languages, cultures, or genres of writing.
2) The concept of critical thinking should be explained in more detail in the Introduction.
3) The formulas in Sections 3.1.1 and 3.2.1 are poorly formatted.
4) Some references (e.g., recent ACL/EACL papers) are cited, but it would be useful to relate this work more explicitly to the growing body of research on argument mining and automated writing evaluation in multilingual contexts.
5) Table 3: Please discuss why the inter-annotator agreement is lower for some dimensions (for example, depth, breadth, and fairness).
6) Tables 6-7: Please highlight the best results.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I have no further questions.
Reviewer 3 Report
Comments and Suggestions for Authors
All my comments were addressed. I believe the paper can be accepted.
It seems that the formulas in Section 3.1.1 are placed as figures; please check that.
