Article
Peer-Review Record

ChatGPT as a Stable and Fair Tool for Automated Essay Scoring

Educ. Sci. 2025, 15(8), 946; https://doi.org/10.3390/educsci15080946
by Francisco García-Varela *, Miguel Nussbaum, Marcelo Mendoza, Carolina Martínez-Troncoso and Zvi Bekerman
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 8 June 2025 / Revised: 9 July 2025 / Accepted: 18 July 2025 / Published: 23 July 2025
(This article belongs to the Section Technology Enhanced Education)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors, please find the review of your article below:

Article title: ChatGPT as a Stable and Fair Tool for Automated Essay Scoring

Summary:
This study investigates ChatGPT's reliability and fairness as an automated essay scoring tool through systematic experimentation with 40 marketing students. The authors demonstrate that while ChatGPT without proper guidance produces inconsistent results, a carefully engineered approach involving explicit rubrics, decision tables, standardized output formats, and specific hyperparameters can achieve stable grading. The work culminates in developing an algorithm that automatically generates normalized rubrics for new questions, offering a scalable alternative to traditional fine-tuning approaches.
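
To make the nature of this setup concrete, a minimal sketch of the kind of configuration described above might look as follows. This is an illustrative reconstruction, not the authors' actual prompt, rubric, or code: the rubric text, decision table, and output schema are hypothetical placeholders, and the only assumption is the standard OpenAI chat completions API.

```python
# Illustrative sketch only: the rubric, decision table, and JSON schema below are
# hypothetical placeholders, not the materials used in the study under review.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the essay from 1 to 7 using this rubric:
- Argument quality (1-7)
- Use of marketing concepts (1-7)
Decision table: if any criterion scores below 4, the final grade is the minimum
criterion score; otherwise the final grade is the rounded average."""

OUTPUT_FORMAT = ('Respond only with JSON: {"criterion_scores": {...}, '
                 '"final_grade": <int>, "justification": "<one sentence>"}')

def grade_essay(essay: str) -> dict:
    """Grade one essay with deterministic-leaning hyperparameters."""
    response = client.chat.completions.create(
        model="gpt-4o",   # model family named in the study; behavior may drift with updates
        temperature=0,    # reduce sampling variability across runs
        top_p=1,
        messages=[
            {"role": "system", "content": RUBRIC + "\n" + OUTPUT_FORMAT},
            {"role": "user", "content": essay},
        ],
    )
    # Assumes the model complies with the JSON-only instruction.
    return json.loads(response.choices[0].message.content)
```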

General concept comments:
The research addresses a timely and important question in educational technology. The systematic approach from unguided ChatGPT to fully normalized rubrics provides valuable insights into the engineering requirements for reliable AI-based assessment. The concept of using decision tables and normalization processes as an alternative to fine-tuning is innovative and practically relevant for educational institutions with limited technical resources.

Weaknesses of the article:

The study has several methodological and scope limitations that affect its generalizability. The sample size (N=40) is quite small for drawing broad conclusions about fairness across diverse student populations. The focus on a single domain (marketing) and question type limits external validity. The protocol's sensitivity to model updates (demonstrated when ChatGPT-4o changes affected results) raises concerns about long-term stability in production environments. Additionally, the "perfect stability" achieved still required extensive example curation, which the authors acknowledge is equivalent in effort to traditional fine-tuning approaches.

Specific comments:

Introduction and literature review:

The introduction effectively presents the problem and provides sufficient context on the challenges of automated essay scoring, but the literature review would benefit from a more critical analysis of existing AES systems and a clearer positioning of how this work differs from established approaches such as e-rater or IntelliMetric. The connection to the prompt engineering literature is well established, but the review lacks a discussion of recent advances in constitutional AI and alignment techniques that could be relevant.

Methodology:

The methodology is systematic and well structured, following a logical progression from basic reliability testing to algorithm development. The experimental design with independent chat sessions is sound for measuring consistency. However, the reduction from 10 to 5 chat sessions due to "practical constraints" and rate limits raises questions about the robustness of the findings. The decision to use different models (ChatGPT-4o vs. o1) across phases makes direct comparisons challenging. The color-coding system for the consistency analysis is helpful but somewhat subjective in how it defines "moderate" versus "inconsistent" performance.
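
For reference, the consistency check described here amounts to repeating the same grading call in independent sessions and summarizing the spread of scores. A minimal sketch, assuming the illustrative grade_essay() function above and hypothetical banding thresholds (not the paper's exact cut-offs), might look like this:

```python
# Sketch of a consistency check across independent sessions. grade_essay() is the
# illustrative function shown earlier; the banding thresholds are hypothetical.
from statistics import mean, pstdev

def consistency_report(essay: str, sessions: int = 5) -> dict:
    """Grade the same essay in several independent calls and summarize the spread."""
    grades = [grade_essay(essay)["final_grade"] for _ in range(sessions)]
    spread = max(grades) - min(grades)
    if spread == 0:
        band = "consistent"
    elif spread <= 1:
        band = "moderate"      # hypothetical cut-off, not the paper's definition
    else:
        band = "inconsistent"
    return {"grades": grades, "mean": mean(grades), "sd": pstdev(grades),
            "range": spread, "band": band}
```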

Results:

The results are clearly presented, with extensive appendices supporting the claims. The progression from volatile scoring (Section 3.1.1) to perfect consistency (Section 3.1.3) demonstrates the value of systematic prompt engineering, but the impact of model updates on the reliability of the results is concerning and inadequately addressed.

Discussion:

The discussion thoughtfully analyzes the trade-offs between stability and preparation effort. The comparison to traditional fine-tuning approaches is valuable, though the claim that this method requires "minimal human intervention" is questionable given the extensive rubric development required. The sensitivity analysis regarding examples and decision tables provides useful insights for practitioners. However, the discussion could better address the scalability challenges and long-term maintenance requirements of the proposed approach.

Conclusion and limitations:

The conclusions are appropriately cautious and acknowledge the key limitations. The authors honestly discuss the data-intensive nature of achieving perfect consistency and the ongoing sensitivity to model updates.

Overall recommendation:

Accept with major revisions. The work makes valuable contributions to understanding prompt engineering for educational assessment, but requires addressing methodological concerns, expanding the evaluation scope, and providing more realistic guidance for practitioners.

Regards, reviewer

Author Response

Dear Reviewer,

We sincerely thank you for your detailed and constructive feedback, which has greatly improved the quality and clarity of our manuscript. We have carefully addressed each of your comments, including methodological clarifications, expansion of the literature review with critiques of legacy AES systems and transformer-based approaches, explicit statements on external validity limitations, and a deeper analysis of model-update volatility and operational maintenance requirements.

For your convenience, we have attached a document detailing all changes made in response to your comments, alongside your original feedback.

Thank you again for your valuable insights and for highlighting areas requiring further elaboration.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

Great work. You have clearly carried out very comprehensive, detailed, and thorough research. Please also see the attachment.

Thank you. 

Comments for author File: Comments.pdf

Author Response

Dear Reviewer,

Thank you very much for your positive and encouraging comments. We appreciate your recognition of the study’s relevance and originality, as well as your thoughtful suggestions for potential future work, such as extending the algorithm to speaking assessments.

Please note that while your review did not require any changes, we have made minor modifications to the manuscript in response to comments from other reviewers to improve clarity and address additional methodological aspects. We are grateful for your supportive evaluation and valuable endorsement of our work.

Reviewer 3 Report

Comments and Suggestions for Authors

The paper is very interesting and is worthy of publication.

The authors' ideas of a multi-threaded and multi-level approach to using AI in scoring are very important.

The authors have clearly invested great effort and supply the reader with a wealth of very interesting information.

I would suggest some minor improvements that do not diminish the value of the paper in its present form. They are described hereunder.

Most of the lengthy list of state-of-the-art references in the introduction should be placed in a section titled "State of the art" or similar.

It would benefit the reader if the introduction and conclusion each included a paragraph that summarizes the authors' approach and the evidence supporting it.

It would be nice to hear the authors' more general suggestions for other educators.

Author Response

Dear Reviewer,

Thank you for your kind and supportive comments, as well as your constructive suggestions. We have implemented the improvements you recommended by restructuring the Introduction into “Definition of the problem” and “State of the art” subsections, adding a summarizing paragraph at the end of the Introduction, and including practical recommendations for educators in the Conclusions section.

For your convenience, we have attached a document detailing these changes alongside your original comments.

We deeply appreciate your insights and positive evaluation of our study.

Author Response File: Author Response.pdf
