Article

ChatGPT as a Stable and Fair Tool for Automated Essay Scoring

by Francisco García-Varela 1,*, Miguel Nussbaum 1, Marcelo Mendoza 1, Carolina Martínez-Troncoso 2 and Zvi Bekerman 3

1 School of Engineering, Computer Science Department, Pontificia Universidad Católica de Chile, Santiago 8320165, Chile
2 School of Engineering, Department of Industrial and Systems Engineering, Pontificia Universidad Católica de Chile, Santiago 8320165, Chile
3 The Seymour Fox School of Education, The Hebrew University of Jerusalem, Jerusalem 9190500, Israel
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(8), 946; https://doi.org/10.3390/educsci15080946
Submission received: 8 June 2025 / Revised: 9 July 2025 / Accepted: 18 July 2025 / Published: 23 July 2025
(This article belongs to the Section Technology Enhanced Education)

Abstract

The evaluation of open-ended questions is typically performed by human instructors using predefined criteria to uphold academic standards. However, manual grading presents challenges, including high costs, rater fatigue, and potential bias, prompting interest in automated essay scoring systems. While automated essay scoring tools can assess content, coherence, and grammar, discrepancies between human and automated scoring have raised concerns about their reliability as standalone evaluators. Large language models like ChatGPT offer new possibilities, but their consistency and fairness in feedback remain underexplored. This study investigates whether ChatGPT can provide stable and fair essay scoring—specifically, whether identical student responses receive consistent evaluations across multiple AI interactions using the same criteria. The study was conducted in two marketing courses at an engineering school in Chile, involving 40 students. Results showed that ChatGPT, when unprompted or using minimal guidance, produced volatile grades and shifting criteria. Incorporating the instructor’s rubric reduced this variability but did not eliminate it. Only after providing an example-rich rubric, a standardized output format, low temperature settings, and a normalization process based on decision tables did ChatGPT-4o demonstrate consistent and fair grading. Based on these findings, we developed a scalable algorithm that automatically generates normalized grading rubrics and decision tables for new questions with minimal human input, thereby extending the accessibility and reliability of automated assessment.

1. Introduction

1.1. Definition of the Problem

Educational assessment plays a crucial role in student learning, serving both as a tool for measuring knowledge acquisition and fostering critical thinking and comprehension (Brown, 2022). Typically, the evaluation of open-ended questions or essays is carried out by human instructors who use well-defined criteria to assess student responses, ensuring the maintenance of academic standards and the achievement of educational objectives. However, challenges such as the high costs associated with manual grading, along with issues like rater fatigue and potential bias, highlight the need to explore automated essay scoring systems.
Advancements in artificial intelligence (AI) and large language models have significantly expanded the scope of possibilities in educational assessment (Lo, 2023). While traditional uses of AI in automatic assessment typically adhere to stringent, teacher-defined criteria, the potential applications of these technologies in education are far more expansive. Beyond merely enforcing correction frameworks, AI has the capability to transform assessment into a more interactive and engaging learning process. Working in conjunction with educators, AI can enhance the educational experience by providing detailed, personalized feedback that is not limited to right or wrong answers. For instance, AI can analyze the nuances of student responses, offering insights that help students understand not just what they got incorrect, but why, and how they can improve in future attempts (Ouyang et al., 2023). This type of feedback is crucial for learning as it encourages students to engage deeply with the material, fosters critical thinking, and promotes a more reflective approach to learning.
In this line, automated essay scoring systems have emerged as significant tools for evaluating written assignments, utilizing natural language processing (NLP) techniques to assess content, coherence, and grammar. These systems offer quick feedback, which not only alleviates the workload of instructors and human raters but also allows students to iteratively enhance their work (S. Li & Ng, 2024a).
Given that ChatGPT is widely used as an automated essay scoring tool (Bui & Barrot, 2024), we are prompted to ask our research question: is ChatGPT a reliable (stable) and fair tool for essay scoring? Our goal is to determine whether ChatGPT ensures that identical student responses receive consistent evaluations across different AI interactions using the same evaluation criteria. It is crucial to examine ChatGPT’s consistency, particularly considering the significant impact these scores can have on students’ academic trajectories and educators’ instructional methods (S. Li & Ng, 2024b). Building on the insights gained from addressing this research question, we aim to develop an algorithm capable of generating effective grading rubrics with minimal human intervention. This would enhance the efficiency and scalability of AI-assisted educational assessments.
In this way, this study examines whether large language models such as ChatGPT can deliver stable and fair scoring of open-ended student responses, specifically, whether identical answers receive consistent evaluations across multiple AI interactions when the same criteria are applied. Conducted in two marketing courses at an engineering school in Chile with 40 students, the study found that ChatGPT, when used without structured guidance, produced inconsistent scores and shifting interpretations of criteria. Incorporating the instructor’s rubric reduced this variability but did not eliminate it. Consistent and fair grading was achieved only after introducing an example-rich rubric, a standardized output format, low temperature settings, and a normalization process using decision tables. Based on these findings, we developed a scalable algorithm that automatically generates effective grading rubrics and decision tables with minimal human input.

1.2. State of the Art

Legacy, feature-engineered engines such as e-rater and IntelliMetric have long reported high correlations with human raters, yet their heavy reliance on handcrafted linguistic features constrains adaptability across prompts and disciplines. Recent transformer-based approaches seek to overcome this limitation through direct fine-tuning on large, labelled essay corpora (Miao & Xu, 2025; Tang et al., 2024).
Using the widely cited ASAP dataset (≈12 k K-12 essays across eight prompts), Miao and Xu (2025) introduce KWM-B, a BERT-based scorer that applies keyword-to-multi-scale weighting, that is, it gives extra weight to the most informative words, sentences, and paragraphs. On this benchmark, the method raises quadratic-weighted kappa (agreement with human raters) beyond earlier convolutional neural network (CNN) baselines that rely on fixed sliding-window filters. Complementing this, Tang et al. (2024) enrich BERT embeddings with a graph convolutional network (GCN) to capture links between sentences and train the model in a multitask set-up that predicts both holistic and analytic scores; the accuracy gains come at the cost of tens of thousands of hand-scored essays and periodic retraining, resources that many institutions cannot spare.
Documented discrepancies between automated essay scoring and human scoring have led to the recommendation that automated essay scoring should not operate as an independent assessment tool. Instead, it should function as a supplemental aid to instructor evaluations, which include direct scoring and qualitative analysis (Almusharraf & Alotaibi, 2023). Studies indicate that systems like ChatGPT often show less agreement with human ratings on certain texts, which raises concerns about the reliability of automated essay scoring tools (Yancey et al., 2023). Further research by Bui and Barrot (2024) comparing ChatGPT with the evaluations of an experienced human rater found that ChatGPT has limited capabilities in automated essay scoring, underscoring inconsistencies in how this AI tool aligns with human evaluations.
The advent of generative LLMs introduces a data-light alternative. Tate et al. (2024) report substantial agreement between ChatGPT and secondary-level teachers on holistic essay scores, highlighting its potential for low-stakes formative assessment. In contrast, Uyar and Büyükahıska (2025) document statistically significant divergences between ChatGPT and expert raters on English-as-a-foreign-language essays, underscoring persisting reliability concerns within even a single genre.
Given the profound impact that automated essay scoring tool scores have on decisions made by educational stakeholders, in-depth research is essential to assess their reliability and validity. Despite existing studies on the correlation between human and machine scores (Yun, 2023; Y. Wang et al., 2024), a thorough analysis of automated essay scoring model outputs remains lacking, often leaving it unclear how these systems behave (S. Li & Ng, 2024a).
Beyond mean-level agreement, robustness remains a critical challenge: adversarial experiments on the essay-BR corpus show that subtly perturbed texts can systematically elicit inflated scores from state-of-the-art models (Anchiêta et al., 2024). Such findings highlight the need for explicit control mechanisms, such as the decision tables and rubric normalization proposed here, to limit hidden heuristics and prevent score inflation. Taken together, existing evidence suggests a trade-off: fine-tuned transformers yield strong accuracy but demand large labelled datasets, while LLM-based scoring is data-light yet sensitive to prompt wording and model updates (Tate et al., 2024; Uyar & Büyükahıska, 2025). The rubric-driven prompt-engineering protocol presented in this study positions itself as a middle path, achieving stable, reproducible grades without model retraining and human-scored examples.
The performance of automated essay scoring tools based on language models heavily depends on these models’ ability to imitate human reasoning, aligning with the instructions provided in the prompt. One of the most effective strategies is known as the chain of thought (Wei et al., 2022). A chain of thought divides complex tasks into a sequence of steps, emulating sequential thinking. This approach allows chain-of-thought-based large language models to achieve more reliable outcomes. Running a chain-of-thought prompt multiple times enables the analysis of consistency in the model’s reasoning paths, enhancing the reasoning capabilities of these models (X. Wang et al., 2023). Through self-consistency, the stochasticity in response generation can be reduced by retaining responses that consistently lead to the same results. A reasoning path may branch at a step due to the consideration of external information via Retrieval-Augmented Generation or the model’s implicit knowledge (Lou et al., 2024). These branches result in divergent reasoning paths, leading to different outcomes, a concept referred to in the field as a tree of thoughts (Long, 2023). Analyzing consistency within these trees allows the identification of reliable results, even in scenarios involving divergent thinking. By introducing feedback based on reasoning, the sequential thought structure can form loops, a phenomenon described as a graph of thoughts (Besta et al., 2024). Graphs of thoughts offer advantages over sequential strategies, such as chain of thought or tree of thoughts, in various tasks by incorporating mechanisms similar to human reasoning, such as recurrent reasoning. Recurrent strategies improve prompts using methods like Progressive Rectification Prompting (Z. Wu et al., 2024), discarding prompting variants that lead to unreliable outcomes. Contrasting different reasoning paths, a strategy known in the field as self-contrast (W. Zhang et al., 2024), helps the model reflect on inconsistent solving perspectives. These strategies have proven effective in tasks like multi-correct multiple-choice questions and multi-hop question answering (R. Li et al., 2025), demonstrating performance levels comparable to those achieved by humans (Yeadon & Hardy, 2024).
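To make the self-consistency idea above concrete, the following minimal Python sketch samples the same chain-of-thought grading prompt several times and keeps the majority answer. It assumes the OpenAI Python client, an illustrative prompt, and a simple answer-extraction rule; it is not the prompting set-up used in this study.

```python
# Minimal self-consistency sketch: sample one chain-of-thought prompt several
# times and keep the answer that the reasoning paths agree on most often.
# The model name, prompt text, and answer-extraction rule are illustrative only.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_PROMPT = (
    "Grade the following answer from 1 to 10. "
    "Think step by step, then finish with a line 'Final grade: <number>'.\n\n"
    "Answer: {answer}"
)

def sample_grade(answer: str, temperature: float = 0.7) -> str:
    """Run one chain-of-thought grading pass and extract the final grade."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": COT_PROMPT.format(answer=answer)}],
        temperature=temperature,
    )
    text = response.choices[0].message.content
    # Keep only the token after "Final grade:" as the sampled outcome.
    return text.rsplit("Final grade:", 1)[-1].strip()

def self_consistent_grade(answer: str, n_samples: int = 5) -> str:
    """Majority vote over several independent reasoning paths."""
    votes = Counter(sample_grade(answer) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```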
Including humans in imitative reasoning strategies is an area of interest within the field. It has been shown that in argumentative reasoning scenarios, human inclusion in specific rounds of deliberation supports reaching consensus (Triem & Ding, 2024). Along these lines, strategies such as ReConcile (L. Chen et al., 2023) demonstrate that round table conferences in debates facilitate consensus in argumentative reasoning, leveraging diverse large language model systems. Similarly, human-in-the-loop strategies involving rounds of interaction between large language models and humans have proven effective in automated essay scoring tasks (Xiao et al., 2025). Comparing sequential reasoning paths that diverge has also shown utility in automated essay scoring tasks through a prompting strategy known as cross-prompt automated essay scoring (C. Zhang et al., 2025). In this approach, the prompt compares two reasoning paths leading to different outcomes, emulating a contrastive learning process (Y. Chen & Li, 2023).
Although large language models like ChatGPT can generate coherent and contextually relevant text, their effectiveness in providing accurate feedback is still uncertain (Stahl et al., 2024). This highlights the necessity for continued investigation into automated essay scoring behaviors and their practical implementation in educational settings.

2. Methodology

2.1. Research Context and Sample

This study was conducted in a marketing course at an engineering school. Out of 110 enrolled students, 20 provided informed consent to anonymously analyze their responses to a brief open-ended quiz designed to be completed in under 20 min. In a later semester, in the same course, another group of 20 students signed an informed consent to replicate the experiment with a different quiz.
Participants were in their fourth or fifth year of study and close to graduation. The course is a mandatory part of the engineering curriculum, and most students have a major or minor in industrial engineering.

2.2. Research Model and Procedure

To evaluate ChatGPT’s stability and fairness in grading qualitative open-ended questions, we relied on two GPT models with complementary roles: the freely accessible ChatGPT-4o carried out the grading tasks, while ChatGPT-o1, optimized for chain of thought reasoning, powered the algorithmic stage introduced later. No direct statistical comparison between ChatGPT-4o and ChatGPT-o1 was conducted; all quantitative analyses were performed within each model separately, and any cross-model results are reported only descriptively. We followed the steps below:

2.2.1. Assessing ChatGPT Reliability

Consistency in AI-Based Grading Without a Prescribed Rubric
Objective:
To understand if ChatGPT is a reliable (stable) tool for essay scoring, it was necessary to evaluate ChatGPT’s ability to assess a student’s response to a given question (Section S1) without any given (external) rubric.
Description:
  • ChatGPT was tested across 10 independent chat sessions, each representing a single interaction with the AI. A new chat session was initiated for each evaluation by opening a new chat window or thread to ensure independent assessments.
  • In each chat session, ChatGPT-4o was asked to grade the student’s answer without any given (external) rubric. First, ChatGPT-4o was requested to assess the student’s response and then asked to state the evaluation criteria it used to grade that student (Section S3).
  • A table was created (Section S4) summarizing the results. The horizontal axis lists the ten chat sessions, and the vertical axis lists the different evaluation criteria indicated by ChatGPT-4o. For each chat session, the criteria used in that iteration are highlighted in yellow. The last column indicates the evaluation criteria used, and the last two rows indicate, for each iteration, the evaluation criteria applied and the assigned grade (a sketch of this tabulation is provided after this list).
  • Next, evaluation criteria are grouped together (Section S5.1) based on the definitions provided by ChatGPT-4o (Section S5.2). This step aimed to simplify the visualization from Section S4 and explicitly define the scope of the evaluation criteria used by the AI. By consolidating similar terms under shared definitions, it was possible to reduce redundancy and better understand the core dimensions consistently applied during grading.
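The tabulation described above can be sketched as follows. The snippet assumes that the grade and the self-reported criteria of each of the ten independent sessions have already been collected (using the Section S3 prompts); the mock values are illustrative, not data from Section S4.

```python
# Sketch of the Section S4-style tabulation: independent grading sessions,
# each returning a grade and the list of criteria the model says it used.
import pandas as pd

def summarize_sessions(session_results: list[dict]) -> pd.DataFrame:
    """Build a criteria-by-session table (True where a criterion was used)."""
    all_criteria = sorted({c for r in session_results for c in r["criteria"]})
    table = pd.DataFrame(
        {f"session_{i + 1}": [c in r["criteria"] for c in all_criteria]
         for i, r in enumerate(session_results)},
        index=all_criteria,
    )
    # Final two rows of the summary: number of criteria used and the grade.
    table.loc["criteria_used"] = [len(r["criteria"]) for r in session_results]
    table.loc["assigned_grade"] = [r["grade"] for r in session_results]
    return table

# Example with two mock sessions (illustrative values only):
results = [
    {"grade": 8.0, "criteria": ["Structure", "Clarity", "Value Proposition"]},
    {"grade": 8.5, "criteria": ["Structure", "Analysis", "Originality"]},
]
print(summarize_sessions(results))
```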
Assessing Reliability in AI-Based Grading Using a Predefined Rubric
Objective:
To understand if ChatGPT is a reliable (stable) tool for essay scoring using the same rubric given by the teacher (Section S2) to a given question (Section S1).
Description:
  • ChatGPT-4o is provided with the instructor’s rubric (Section S2), which specifies the evaluation criteria for the same question of Section 2.2.1. “Consistency in AI-Based Grading Without a Prescribed Rubric”.
  • As in Section 2.2.1. “Consistency in AI-Based Grading Without a Prescribed Rubric”, ChatGPT was tested across 10 independent chat sessions, each representing a single interaction with the AI. A new chat session was initiated for each evaluation by opening a new chat window or thread to ensure independent assessments and using Section S3 prompts.
  • In each chat session, ChatGPT-4o was asked to grade the student’s answer with the teacher’s rubric (Section S2).
  • A table was created (Section S6.1) summarizing the results. The horizontal axis lists the ten iterations, and the vertical axis lists the different criteria indicated by ChatGPT-4o. A yellow highlight indicates each evaluation criterion when it appears in a given chat session, and purple highlights any evaluation criteria not defined by the instructor. The last column indicates the evaluation criteria used, and the last two rows indicate, for each iteration, the evaluation criteria applied and the assigned grade.
  • The definitions of the evaluation criteria that ChatGPT-4o implicitly uses (those highlighted in purple in Section S6.1) are documented, resulting in Section S6.2.
Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples
Objective:
To evaluate whether ChatGPT-4o can consistently assign the same grade and apply the same evaluation criteria across 10 independent chat sessions when assessing the same student response, using a predefined explicit rubric (Section S7.1) and illustrative examples (Section S7.2) for a given question (Section S1).
Description:
  • The instructor’s rubric is modified to make the evaluation criteria explicit (example for evaluation criterion Functional Benefits in Section S7.1). For each evaluation criterion defined by the teacher, the following is indicated:
    (a)
    Evaluation Criterion Name.
    (b)
    Formal Definition: A clear explanation of what is being evaluated.
    (c)
    Evaluated Concepts: A list of elements considered in the response, along with their respective scope and definition. These elements determine the assigned score.
    (d)
    Possible Scores: The permissible score range for the defined evaluation criterion.
  • Examples are created for each possible score in each evaluation criterion (example for evaluation criterion Functional Benefits in Section S7.2) to guide the AI on how to apply the rubric’s evaluation criteria. The examples consist of the following:
    (a)
    Concrete responses that illustrate how a student might address the evaluation criterion.
    (b)
    Explanations that justify the score given to each example, highlighting whether and how the response meets the expectations of the criterion.
  • Development of a document with a Standardized Output Format for AI Grading Responses: To further enhance stability, a uniform response format was developed for ChatGPT-4o, ensuring that all grading outputs followed a standardized structure. At least two uniform reference examples were used (Section S7.3) and consist of the following elements:
    (a)
    Evaluated Criterion: The name of the evaluation criterion being assessed.
    (b)
    Score: The assigned score for the evaluation criterion.
    (c)
    Observation: A brief description of what the student mentioned in their response.
    (d)
    Justification: An explanation—based on the decision table—of why the assigned score was given.
    (e)
    Final Score: The sum of all partial scores obtained for each evaluated criterion.
    (f)
    Final Comments: A conclusive feedback section provided to the student after assigning the final score. These comments summarize the main strengths of the response, highlight areas that require further development or justification based on the rubric, and offer concrete recommendations for improvement in future responses. The feedback explicitly avoids vague or unrelated suggestions that do not align with the evaluation rubric.
  • A prompt was created to explicitly instruct ChatGPT-4o to isolate the evaluation criteria from the instructor’s rubric, prohibiting the use of any additional criteria. The prompt also specifies the output format, clarifies the focal points for grading, and includes basic consistency hyperparameters, as consolidated in Section S7.4.
  • After updating both the instructor’s rubric and the AI prompt, ChatGPT-4o was asked to grade the student’s response across 10 different chat sessions. For each session, the input provided includes the prompt (Section S7.4), the question (Section S2), the revised rubric (Section S7.1, with examples from Section S7.2 for each evaluation criterion), and the output format example (Section S7.3); a sketch of one such grading session follows this list. The results are recorded in a table as indicated in Section S7.5.
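As a rough illustration of one such grading session, the sketch below assembles the prompt, rubric, examples, and output format into a single request and applies the consistency hyperparameters reported in Section S7.4. The file names, system message, and message layout are assumptions made for illustration; the actual materials are those of Sections S7.1–S7.4, and the study itself used the ChatGPT interface rather than this API call.

```python
# Sketch of one grading session with the explicit rubric, per-score examples,
# standardized output format, and the consistency hyperparameters reported in
# Section S7.4. File paths and the system message are illustrative placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def grade_once(student_answer: str, question: str) -> str:
    rubric = Path("S7_1_explicit_rubric.txt").read_text(encoding="utf-8")
    examples = Path("S7_2_scored_examples.txt").read_text(encoding="utf-8")
    output_format = Path("S7_3_output_format.txt").read_text(encoding="utf-8")
    grading_prompt = Path("S7_4_prompt.txt").read_text(encoding="utf-8")

    user_message = (
        f"{grading_prompt}\n\nQUESTION:\n{question}\n\n"
        f"RUBRIC:\n{rubric}\n\nSCORED EXAMPLES:\n{examples}\n\n"
        f"OUTPUT FORMAT:\n{output_format}\n\n"
        f"STUDENT ANSWER:\n{student_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Use only the rubric criteria; do not add new criteria."},
            {"role": "user", "content": user_message},
        ],
        temperature=0.1,          # low randomness for repeatable grading
        frequency_penalty=0.0,    # do not penalize repeated rubric terms
        presence_penalty=-1.0,    # encourage reuse of rubric vocabulary
    )
    return response.choices[0].message.content
```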

2.2.2. Assessing ChatGPT Fairness

Refining the Rubric for Consistent Evaluation Across Multiple Students
Objective:
Generalize the methodology achieved in the previous step (stability for a single student) to a larger group of students.
Description:
  • The process is divided into two phases, as explained in Section 3.1.3.
I.
The first phase involves the following:
  • Randomly selecting the responses of 10 students from the sample who took Quiz 1 (Section S1).
  • Normalizing the instructor’s rubric by keeping the previously applied modifications and adding new elements beneath the “Possible Scores” section (item (d) of Section 2.2.1, “Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples”) in the instructor’s modified rubric for each evaluation criterion. A complete example of a normalized rubric for one evaluation criterion is shown in Section S8.1, without the examples from Section S7.2. This last decision was made to isolate the effect of normalization, as analyzed later (see Section 4.3). The new elements of the normalization process are as follows:
    (a)
    Problem Concepts: Definition of key concepts that indicate which reasons or specific ideas are considered valid justifications for the fourth evaluation question and the subsequent decision tables (item (c) below). These are added because the first three questions focus on general aspects (presence, focus, coherence), while the fourth and subsequent questions require more specific justifications (see Section S11, step 4).
    (b)
    Notes: Additional clarifications regarding the evaluation criterion, the evaluated concepts, or the possible scores. These help disambiguate scoring logic and avoid contradictory interpretations, especially in borderline cases (see findings in Section 3.2, “Results for Step Refining the Rubric for Consistent Evaluation Across Multiple Students”).
    (c)
    Decision Table: A four-column table that utilizes binary yes/no questions to determine compliance with evaluated concepts and objectively assign scores. Decision tables were introduced to make the scoring logic fully explicit and reproducible across chat sessions (see Section 3.2, “Results for Step Refining the Rubric for Consistent Evaluation Across Multiple Students”); a minimal code sketch of this structure is provided at the end of the first-phase description below. The structure is as follows:
    • Column 1: Step—A sequential identifier for each stage within the decision table.
    • Column 2: Evaluation Question—Binary yes/no questions based on the evaluated concepts, progressing from general to specific criteria to ensure an objective scoring process.
    • Column 3: Action—Specifies what action should be taken depending on a “Yes” or “No” response. If “No,” a brief explanation is provided.
    • Column 4: Result—Records the assigned score for the evaluation criterion or the conclusion derived from applying the corresponding Action.
  • The phrase “…based on the decision tables for score assignment” was added to the prompt, as shown in Section S8.2. This ensures that ChatGPT does not reinterpret the rubric and follows the decision logic strictly, avoiding changes to the evaluation criteria.
  • For each student, five chat sessions instead of ten (explanation in Section 3.2 “Results for Step Refining the Rubric for Consistent Evaluation Across Multiple Students”) were conducted with ChatGPT-4o, providing as input the modified prompt (Section S8.2), the question (Section S2), the normalized rubric (Section S8.1 for each evaluation criterion), and the output format example in (Section S7.3).
  • The scores for each student across all sessions were recorded and tabulated as shown in Section S9.1. In cases of discrepancies, the causes are analyzed and the examples for each evaluation criterion (Section S7.2) are revised until the desired level of consistency is achieved, as noted in Section S9.2 and subsequently Section S9.3. The following color-coding is applied to these tables (a small classification sketch is provided at the end of this subsection):
    • Yellow highlighting indicates full consistency across all chat sessions.
    • Blue highlighting signifies the presence of two different assigned grades.
    • Red highlighting indicates three or more different assigned grades, representing a clear inconsistency.
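The following sketch illustrates how a four-column decision table (Step, Evaluation Question, Action, Result) could be encoded and walked to assign a score. The questions, actions, and scores are invented placeholders, not the actual tables of Section S13.

```python
# Illustrative encoding of a four-column decision table (Step, Evaluation
# Question, Action, Result). Question texts and scores are placeholders; the
# binary yes/no judgments are supplied as the `answers` mapping.
from dataclasses import dataclass

@dataclass
class DecisionStep:
    step: int
    question: str               # binary yes/no evaluation question
    action_if_no: str           # explanation / action when the answer is "No"
    result_if_no: int | None    # score assigned if the chain stops here

def evaluate(table: list[DecisionStep], answers: dict[int, bool],
             full_score: int) -> tuple[int, str]:
    """Walk the table in order; stop at the first 'No' and return its result."""
    for step in table:
        if not answers[step.step]:
            return step.result_if_no or 0, f"Step {step.step}: {step.action_if_no}"
    return full_score, "All evaluation questions answered 'Yes'."

functional_benefits = [
    DecisionStep(1, "Does the answer mention any functional benefit?",
                 "No benefit mentioned; assign 0.", 0),
    DecisionStep(2, "Is the benefit tied to the user's problem?",
                 "Benefit is generic; assign partial credit.", 1),
]
score, justification = evaluate(functional_benefits, {1: True, 2: False}, full_score=2)
print(score, justification)  # -> 1 Step 2: Benefit is generic; assign partial credit.
```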
II.
The second phase involves the following:
  • Adding the responses of the remaining 10 students from the sample who took Quiz 1, thus obtaining a total of 20 students.
  • This second phase is further divided into three segments to examine the role of examples in the normalization process and to evaluate the use of decision tables. The aim is to determine whether examples alone, or in conjunction with other components, are necessary to ensure consistent grading.
    (a)
    Grading using the same methodology from the first phase, generating a new table with 5 chat sessions per student as shown in Section S10.1. The results in Section S9.3 were compared with those in Section S10.1. The difference between them stems from an update applied to the ChatGPT-4o model between 12 and 13 February 2025, which affected the outcomes and led to the development of the following two parts (b and c).
    (b)
    Removing all examples from Section S7.2 except for three arbitrarily selected examples for each possible score that are deemed representative, as shown in Section S10.4 for the Functional Benefits evaluation criterion. A new table was created as shown in Section S10.2.
    (c)
    Removing all remaining examples from Section S7.2. Another new table was created, as shown in Section S10.3.
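The color-coding convention used in the result tables can be expressed as a small classification helper, sketched below with mock grades (one distinct grade across sessions is full consistency, two is moderate, three or more is inconsistent).

```python
# Helper reproducing the color-coding convention of the result tables:
# one distinct grade across sessions -> "yellow" (full consistency),
# exactly two -> "blue" (moderate), three or more -> "red" (inconsistent).
# The grades below are mock values, not data from Sections S9-S10.
def consistency_category(session_grades: list[float]) -> str:
    distinct = len(set(session_grades))
    if distinct == 1:
        return "yellow"
    if distinct == 2:
        return "blue"
    return "red"

students = {
    "student_01": [5, 5, 5, 5, 5],
    "student_02": [5, 5, 6, 5, 5],
    "student_08": [4, 6, 5, 6, 4],
}
for student, grades in students.items():
    print(student, consistency_category(grades))
# student_01 yellow / student_02 blue / student_08 red
```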

2.2.3. Automating the Protocols: Designing an Algorithmic Approach

Proposed Algorithm
Objective:
Develop an algorithm that, for different open-ended questions such as the one in Section S1, autonomously identifies the evaluation criteria and reproduces for those criteria the decision tables and normalization process: the Formal Definition, Evaluated Concepts, Possible Scores, Problem Concepts, and Notes. Additionally, it must create at least two structural examples using the format from Section S7.3 and a prompt. Two examples were included arbitrarily to establish a consistent output structure; more could be added, but two are sufficient to demonstrate the expected format and ensure alignment with the decision logic.
Description:
  • Four sequential phases were defined to create the algorithm, each building upon the previous; a schematic sketch of the resulting pipeline is provided at the end of this subsection. The phases are outlined in Section S11 and are composed of the following:
    (a)
    Phase 1, Definition of Evaluation Criteria:
    The open-ended question was decomposed into concrete sub-questions to identify the evaluation criteria related to value, benefits, and costs for the user.
    (b)
    Phase 2, Creation of Decision Tables for the Evaluation:
    Rule-based evaluation structures were created for each criterion (normalization), including binary (Yes/No) decision steps, scoring rules, and detailed justifications, such as the one in Section S13.
    (c)
    Phase 3, Definition of the Output Format:
    A unified response format was defined, including Score, Observation, and Justification per criterion, followed by Final Score and Final Comments for consistent evaluation, as in Section S7.3.
    (d)
    Phase 4, Creation of the Evaluation Prompt:
    A precise and constrained prompt was designed for the AI, specifying evaluation logic, excluded variables, response format, and fixed hyperparameters to ensure uniform assessment, as in Section S8.2.
  • To test the algorithm, a new chat session is opened in ChatGPT-o1 (see Section 4.5.4 for the explanation of the new model), and the following parameters are provided:
    (a)
    A prompt containing the expected outputs and instructions, as shown in Section S12. More details in Section 3.3 “Results for Step Proposed Algorithm”.
    (b)
    The normalized rubric for Section S1 and all the decision tables are loaded as a file, as an example for the AI (Section S13).
    (c)
    Section S7.3 examples are loaded as a file.
    (d)
    The algorithm for creating automatic correction, Section S11, is loaded as a file.
  • After sending the prompt and the documents, the algorithm requests the user to upload the question and complete the four phases with the respective steps of Section S11 in sequence. The user decides when to proceed to the next phase.
  • Two actions were taken to validate the algorithm:
    (a)
    Verify whether, by loading the same question used in the study (Section S1), it can replicate the decision tables. This process is documented at the following link (https://github.com/anonymous-researcher-0/ChatGPT-as-a-stable-and-fair-tool-for-Automated-Essay-Scoring/blob/main/Section%20S16%20Process.pdf, accessed on 8 June 2025), and the result is in Section S16.
    (b)
    Use another open-ended question, such as the one in Section S14, along with its original rubric (Section S15), to generate the four phases of the algorithm. This process is documented at the following link (https://github.com/anonymous-researcher-0/ChatGPT-as-a-stable-and-fair-tool-for-Automated-Essay-Scoring/blob/main/Section%20S18%20Process.pdf, accessed on 8 June 2025), and the result for the new question is in Section S18.
  • The information obtained through the new prompt, output format, and the normalization process is then applied to a new set of 20 students from the marketing course who answered the new question (Section S14).
    (a)
    A table is created, indicating the student number in the first column and showing 5 chat sessions per student, with the assigned grade recorded and color-coded by category. This can be seen in Section S19, described in Section 3.3 “Results for Step Proposed Algorithm” and analyzed in Section 4.4.
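The four-phase generation procedure can be summarized as a simple pipeline in which each phase’s deliverable is reviewed by the user before the next phase starts. The sketch below is schematic: the phase instructions, model identifier, and confirmation mechanism are placeholders for the actual Section S11–S13 materials and the ChatGPT-o1 chat workflow.

```python
# Schematic of the four-phase rubric-generation pipeline (Section S11): each
# phase sends its instructions plus the accumulated context to the model and
# waits for the user to confirm before moving on. Instructions and reference
# material are placeholders for the actual Section S11-S13 documents.
from openai import OpenAI

client = OpenAI()

PHASES = [
    ("Definition of Evaluation Criteria",
     "Decompose the open-ended question into sub-questions and list the criteria."),
    ("Creation of Decision Tables",
     "Build a yes/no decision table with scores and justifications per criterion."),
    ("Definition of the Output Format",
     "Define Score, Observation, Justification, Final Score and Final Comments."),
    ("Creation of the Evaluation Prompt",
     "Write a constrained grading prompt with fixed hyperparameters."),
]

def run_pipeline(question: str, reference_material: str) -> list[str]:
    context = f"QUESTION:\n{question}\n\nREFERENCE MATERIAL:\n{reference_material}"
    deliverables = []
    for name, instructions in PHASES:
        response = client.chat.completions.create(
            model="o1",  # reasoning-optimized model, as in the algorithmic stage
            messages=[{"role": "user",
                       "content": f"{context}\n\nPHASE: {name}\n{instructions}"}],
        )
        output = response.choices[0].message.content
        deliverables.append(output)
        context += f"\n\n{name} (confirmed):\n{output}"
        # The user reviews each deliverable and decides when to proceed.
        if input(f"Accept phase '{name}'? [y/n] ").lower() != "y":
            break
    return deliverables
```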

3. Results

3.1. Assessing ChatGPT Reliability

3.1.1. Results for Step “Consistency in AI-Based Grading Without a Prescribed Rubric” (Section 2.2.1)

After running ten (10) independent chat sessions where ChatGPT-4o graded the same student response without any predefined rubric (ChatGPT-4o was used in this step instead of ChatGPT-o1 due to its free accessibility, ensuring the study reflects a tool available to all users without requiring a paid subscription):
  • Variability in Evaluation Criteria:
    (a)
    The AI produced multiple distinct evaluation criteria across different chat sessions (Section S4).
    (b)
    Each chat session used between five and six different evaluation criteria (Section S4), although some consolidated sessions featured five or fewer, as shown in Section S5.1. Section S5.2 explains the criteria used by ChatGPT-4o.
    (c)
    The criteria in Section S5.1 showed overlapping definitions, such as “Structure”, “Clarity”, “Analysis”, and “Understanding of the Question”, but labeled or grouped them differently across sessions, leading to inconsistencies and making it necessary to define the criteria explicitly (Section S5.2).
  • Variability in Grades:
    (a)
    As Section S5.2 shows, the final assigned grades differed across the 10 chat sessions, suggesting that ChatGPT-4o’s implicit evaluation framework was not stable (different criteria and grades between Chat sessions) when prompted repeatedly with no rubric.
    (b)
    In some sessions, for example chat session 1 and 8 in Section S5.1, ChatGPT-4o assigned multiple criteria but still arrived at a similar grade (8 points with “Solid Conclusion” and “Value Proposition” as differences); in others, the grade varied, even when seemingly similar criteria were referenced, such as chat sessions 5 and 8 in Section S5.1 (8.5 points vs. 8 points using the same evaluation criteria). Section S5.1 was used as a reference instead of Section S4 due to the high variability of criteria in Section S4, which makes it difficult to identify consistent patterns or meaningful similarities.
  • Emergence of Non-Teacher-Defined Criteria:
    (a)
    As Section S5.1 shows, ChatGPT-4o often introduced additional elements such as “Originality” or “Value Proposition” (five and six times, respectively).
    (b)
    While some of these elements (“Originality”, “Value Proposition”, etc.) could be conceptually valid from a pedagogical standpoint, their inconsistent appearance across chat sessions, despite the same student response, made it difficult to align them with the original grading intentions (Section S2). This lack of stability undermines the reliability of the criteria, especially when they are not part of the instructor-defined rubric and appear arbitrarily depending on the chat session.

3.1.2. Results for Step “Assessing Reliability in AI-Based Grading Using a Predefined Rubric” (Section 2.2.1)

Building on the initial findings, this step introduced the instructor’s rubric (Section S2) to guide ChatGPT-4o:
  • Improved Alignment with the Instructor’s Criteria:
    (a)
    When the explicit rubric was provided (Section S2), ChatGPT-4o consistently recognized and applied the criteria outlined in the rubric (e.g., perceived value, functional benefits, psychological benefits, monetary costs, non-monetary costs).
    (b)
    As Section S6.1 shows, all sessions yielded references to the expected instructor-defined evaluation criteria (Section S2).
  • Inconsistencies in Final Scores:
    (a)
    Analysis of the documented chat sessions (Section S6.1) showed that while each session consistently included the five rubric criteria, two of the ten sessions introduced non-instructor-defined criteria. This occurred only in chat sessions 4 and 8, which included more evaluation criteria than the other chat sessions, such as “Depth and Analysis” and “Clarity”. These additional criteria were not part of the instructor’s rubric but were introduced by ChatGPT-4o during the evaluation process.
    (b)
    The AI introduced an extra criterion in all chat sessions (“Justification” in Section S6.1) and factored it into the final grade. While this may appear similar to “Depth and Analysis” from Section S5.2, the two differ in focus: “Justification” (Section S6.2) assesses whether the student’s ideas are supported with clear reasoning or evidence, whereas “Depth and Analysis” evaluates the complexity, insight, and critical thinking applied in exploring those ideas.

3.1.3. Results for Step “Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples” (Section 2.2.1)

This step focused on revising and standardizing the instructor’s rubric, with the goal of making ChatGPT-4o’s grading completely consistent for a single student’s response over 10 chat sessions:
  • Rubric Restructuring:
    (a)
    The teacher’s rubric was reorganized to include the following: a Formal Definition, Evaluated Concepts, Possible Scores, and clear evaluation criteria (Section S7.1).
    (b)
    Examples illustrating why a score was assigned were added to guide ChatGPT-4o precisely (Section S7.2). The number of examples per evaluation criterion was not fixed; instead, a trial-and-error approach was used, incorporating both real student responses and simulated ones (examples designed to resemble plausible student answers).
  • Standardized Output Format:
    (a)
    A new standardized output structure (Section S7.3) was introduced, detailing how ChatGPT-4o should present the evaluation for each criterion, listing the criterion name, an observation, a justification, a score, a final score, and final comments. This structure allowed the AI to follow a fixed format and avoid including additional elements that could lead to the incorporation of unintended evaluation criteria, as previously observed in Section 3.1.2.
  • Refined Prompt:
    (a)
    A new prompt (Section S7.4) was introduced to reinforce which evaluation criteria should be applied and which should be ignored, improving consistency across sessions. Prompt 2 from Section S3 was still included to identify the evaluation criteria used in each chat session.
    (b)
    The hyperparameters used in Section S7.4—{“temperature”: 0.1, “frequency_penalty”: 0.0, “presence_penalty”: −1.0}—helped shape the behavior of the model:
    • Temperature (0.1): Controls randomness; a low value makes the responses more focused and deterministic.
    • Frequency penalty (0.0): Prevents the model from penalizing repeated words or phrases.
    • Presence penalty (−1.0): Encourages the model to reuse certain concepts or terms, rather than avoiding repetition.
  • Perfect Consistency Achieved:
    (a)
    After implementing the newly structured rubric, the same student response was graded across 10 separate chat sessions (Section S7.5). All sessions produced the same final score and applied identical evaluation criteria, demonstrating the rubric’s effectiveness in achieving consistent grading under controlled conditions. Notably, the Justification criterion (Section S6.1) emerged as an inherent element, appearing consistently across all sessions.

3.2. Assessing ChatGPT Fairness

Results for Step “Refining the Rubric for Consistent Evaluation Across Multiple Students” (Section 2.2.2)

Expanding from the single-student success in Section 2.2.1, Step “Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples”, this step aimed to generalize the method for a larger group:
  • Initial Testing With 10 Students
    (a)
    Ten students’ responses were selected and graded five times each under the new format. Color-coded tables (Section S9.1) displayed the consistency of results (yellow for perfect alignment: the same grade in all five sessions; blue as moderate: exactly two different grades across the five sessions; red as inconsistent: three or more different grades). Five chat sessions per student were necessary to assess stability, based on findings in Section S6.1, where inconsistencies, when present, tended to appear at least once within the first five sessions. The number of sessions was reduced from ten to five due to practical constraints: running multiple iterations with ChatGPT-4o proved time- and resource-intensive, as OpenAI imposes temporary rate limits on file uploads and prompt executions (which vary depending on global demand). These restrictions frequently caused delays, interrupted uploads, or incomplete prompts, making larger-scale iteration impractical within a limited timeframe. Reducing from 10 to 5 sessions weakens the resolution, stability, and statistical confidence of the evaluation process. It increases the likelihood of false positives for consistency, under-detection of model brittleness, and missed opportunities for rubric improvement. As this reduction was necessary, we used a validation subset, still using 10 sessions to benchmark accuracy loss. Importantly, the analysis showed that student grades remained stable despite using only five sessions. For example, although adjustments were made to improve the evaluation protocol between Sections S6.1 and S9.1, Student 1 ultimately maintained the same final grade (5) across both datasets. This suggests that the protocol adjustments made at this stage, such as clearer evaluation instructions and improved grading structures, helped maintain consistency even with fewer repetitions. Thus, while using five sessions may reduce resolution, the main findings and student grading outcomes remained robust.
    (b)
    Early rounds revealed grading inconsistencies in Section S9.1. While following the same grading protocols that led to success in Section S7.5, some inconsistencies (blue cases, meaning the presence of two different assigned grades in different chat sessions for the same student) still appeared in certain sessions, such as with Student 2 and Student 5. Others, like Student 8 and Student 16, showed even greater inconsistency (red cases, meaning three or more different assigned grades in different chat sessions for the same student). Notably, Student 1, whose response was the same one used in Section S7.5, maintained full stability in scores, as shown in Section S9.1.
    (c)
    Final Decision Tables were developed (Section S13), incorporating additional elements such as Problem Concepts and Notes (normalization process). Examples and Clarifications were also refined iteratively to avoid internal contradictions within or across evaluation criteria. This process led to progressive improvements in grading consistency, as documented in Section S9.2 and ultimately reflected in the outcomes presented in Section S9.3, where all 10 students received consistent grading across five chat sessions each.
  • Scaling to 20 Students
    (a)
    After achieving perfect consistency across five chat sessions for an initial group of 10 students, Section S9.3, 10 additional responses were introduced. These new samples had never been iterated before, allowing for an unbiased analysis of the grading system’s fairness.
    (b)
    It is important to note a discrepancy between the results in Section S9.3 and those of the same 10 students in Section S10.1. As indicated in the methodology (Section 2.2.2 “Refining the Rubric for Consistent Evaluation Across Multiple Students”, second phase), an update applied to the ChatGPT-4o model between 12 and 13 February 2025, impacted how evaluation criteria were interpreted and applied. Specifically, although stability remained consistent for most students, the actual grades assigned changed in several cases. Furthermore, the last two students (14 and 16 in Section S9.3), who previously showed full consistency, exhibited a change in grade patterns (14 and 16 in Section S10.1), indicating a shift in the model’s stability following the update. This suggests that model updates can modify how evaluation criteria are interpreted and applied, producing observable shifts in grading outcomes despite using identical prompts and rubrics. This aspect is analyzed in greater depth in Section 4.4.2
    (c)
    Given the sensitivity of AI model adjustments and their impact on the study’s primary goal (achieving stable grading), it became necessary to evaluate the role of Examples and Clarifications in maintaining stability. Section S10.1 shows 12 out of 20 cases with perfect consistency, 5 with moderate consistency, and 3 with inconsistencies. However, as examples were gradually removed until reaching the case with no examples at all (Section S10.3), consistency declined to 7 perfect cases, 11 moderate, and 2 inconsistent.
  • Key Factors Influencing Fairness
    (a)
    Thorough normalization and decision tables with yes/no questions based on the evaluation criteria.
    (b)
    Carefully drafted examples illustrating correct and incorrect ways of meeting those criteria.
    (c)
    Prompt instructions explicitly disallowing any supplementary or AI-generated criteria.

3.3. Automating the Protocols: Designing an Algorithmic Approach

Results for Step “Proposed Algorithm” (Section 2.2.3)

Lastly, an algorithm was developed to automate the creation and application of these grading protocols across different open-ended questions:
  • Algorithm Development and Components:
    (a)
    The algorithm was structured into four sequential phases (Definition of Evaluation Criteria, Construction of Decision Tables, Standardization of Output Format, Creation of the Evaluation Prompt).
    (b)
    The prompt in Section S12 ensures that the AI generates four deliverables (a list of evaluation criteria, all decision tables for each criterion, an evaluation structure, and a ready-to-use prompt), allowing the user to review and confirm the criteria and elements created. The process does not advance to the next phase until the user confirms that no further changes are needed in the current one. It also allows the user to add or specify details, as was the case with incorporating Section S18 to replicate the instructor’s intended structure.
  • Verification With Original Question:
    (a)
    To confirm the algorithm’s correctness, ChatGPT-o1 (rather than ChatGPT-4o) was first supplied with the relevant documents (Sections S7.3, S11 and S13), and the same marketing question from Section S1 was used.
    (b)
    The algorithm successfully replicated (Section S16) the Formal Definitions, Evaluated Concepts, Possible Scores, Problem Concepts, Notes, and Decision Tables in a comparable manner, maintaining the original structure and evaluation logic, and using equivalent but not identical wording, relative to the reference materials in Section S13. Section S17 was used as a supporting guideline during the process, but in the algorithm’s first phase, it correctly detected the same evaluation criteria, as shown in the process (https://github.com/anonymous-researcher-0/ChatGPT-as-a-stable-and-fair-tool-for-Automated-Essay-Scoring/blob/main/Section%20S16%20Process.pdf, accessed on 8 June 2025).
  • Testing With a New Question:
    (a)
    Next, a different open-ended question (Section S14) and its original rubric (Section S15) were uploaded to test whether the algorithm could generalize its approach by following the same procedure and using the relevant documents. It is important to note that the rubric in Section S15 had a less structured and less detailed format—it did not separate components such as Formal Definition, Evaluated Concepts, Possible Scores, or Evaluation Rules, which are required by the algorithm to build a systematic and replicable evaluation process. In contrast, Section S17 served as a model of the expected structure, providing a clear and complete example of how these elements should be organized. This contrast allowed us to observe whether the algorithm could still extract and reorganize the relevant evaluation logic from a less structured rubric, which, as the results show, it successfully did.
    (b)
    The algorithm generated a new set of evaluation criteria and decision tables (Section S18) that aligned with the distinct requirements of the question in Section S15.
  • Application to a New Group of 20 Students:
    (a)
    With the revised rubric and decision tables created automatically by the algorithm, the question from Section S14 was administered to a new group of 20 students in the marketing course.
    (b)
    Each student’s response was assessed via five chat sessions (mirroring previous methodology). A preliminary data table is presented in Section S19, showing a high degree of fairness, defined as consistency in assigned grades and evaluation criteria across sessions, with 11 perfect cases, 8 moderate, and 1 inconsistent.

4. Discussion

4.1. Discussion for Step 3.1.1: Consistency in AI-Based Grading Without a Prescribed Rubric

This first step of the study explored the behavior of ChatGPT-4o when assessing a student’s response without a predefined rubric. The findings indicate significant inconsistencies in both the evaluation criteria applied and the final grades assigned across ten independent chat sessions. These outcomes reveal crucial implications for the use of large language models in educational assessment:

4.1.1. Emergence of Implicit and Unregulated Evaluation Criteria:

Without explicit guidance (S. Li & Ng, 2024b; Yancey et al., 2023), ChatGPT-4o autonomously introduced a variety of evaluation dimensions, such as Structure, Clarity and Coherence, Originality, and others listed in Section S5.1. While some of these may be contextually appropriate in academic settings, their spontaneous inclusion suggests that the model relies on internal heuristics that may not align with instructor expectations or learning objectives. This behavior undermines standardization, as it allows each session to apply distinct and unverified evaluative lenses.

4.1.2. Variability in Scoring Outcomes:

The grades assigned to the exact same student response varied notably across sessions (Yun, 2023; Bui & Barrot, 2024). This variance was not always attributable to major differences in criteria application, which suggests that ChatGPT-4o’s internal evaluation logic fluctuates between iterations. In contexts where reliability and transparency are essential, such as formal assessment, this degree of stochastic behavior is problematic. In some cases, similar sets of criteria yielded different scores; in others, different criteria resulted in similar grades, further highlighting the model’s unpredictability.

4.1.3. Misalignment with Educational Standards:

The lack of a shared reference framework allowed the AI to emphasize dimensions unrelated to the intended learning outcomes. This includes overemphasizing presentation features or value-laden constructs, which are not part of the course’s objectives. The implications are clear: grading without an explicit and carefully constructed teacher’s rubric can lead to misaligned feedback, confuse students, and potentially distort their perception of performance expectations (Brown, 2022; Ouyang et al., 2023).

4.1.4. Lessons Learned:

The findings of Step 3.1.1 confirm that the absence of a prescribed rubric severely compromises grading stability (S. Li & Ng, 2024b). Without a controlled structure, ChatGPT-4o’s evaluations lack repeatability, transparency, and alignment with curricular goals. This reinforces the necessity of implementing well-defined rubrics and clear prompts when using AI for assessment tasks. Even when AI-generated feedback appears reasonable, its arbitrary foundations make it unreliable in high-stakes academic contexts.

4.2. Discussion for Step 3.1.2: Stability in AI-Based Grading Using a Predefined Rubric

Section 3.1.2 examined whether introducing a predefined rubric (Section S2) could improve the consistency of ChatGPT-4o’s grading. Compared to the results from Section 3.1.1, the inclusion of instructor-defined criteria significantly reduced the spontaneous introduction of ad hoc dimensions and improved alignment with pedagogical expectations. However, some variability persisted, revealing critical insights into the strengths and limitations of rubric-based AI assessment.

4.2.1. Improved Alignment with Instructor Expectations:

Across the ten chat sessions, ChatGPT-4o consistently referenced the rubric-defined criteria—perceived value, functional benefits, psychological benefits, monetary costs, and non-monetary costs. This marked a substantial improvement over Step 3.1.1, where criteria appeared without justification (Bui & Barrot, 2024). With the rubric in place, the AI’s evaluations better reflected the course’s learning goals and showed a reduced tendency to rely on internal heuristics.

4.2.2. Persistence of Non-Rubric Criteria:

Despite overall improvements, certain deviations still appeared in specific sessions, particularly in chat sessions 4 and 8 (Section S6.1). In these instances, ChatGPT-4o introduced evaluation criteria not presented in the original rubric (Y. Chen & Li, 2023), such as Depth and Analysis or Justification. Although these additions may be pedagogically valid, their unanticipated inclusion influenced the final scores and affected grading consistency. Interestingly, whenever a non-rubric criterion was introduced (with the exception of Justification, which appeared consistently), it resulted in a different final grade (Section S6.1). This pattern indicates that the specific evaluation criteria applied by the AI have a direct impact on the outcomes. As such, clearly defining and constraining the set of evaluation criteria, whether through prompt design or rubric refinement, proves essential for maintaining grading stability.

4.2.3. Lessons Learned:

The introduction of a predefined rubric markedly enhances grading stability, but it does not eliminate all variability. Even with an explicit instructor rubric, ChatGPT-4o occasionally introduced extra, unintended criteria, leading to variations in grading (X. Wu et al., 2024). This highlights the importance of not only having a well-defined rubric but also providing precise prompt instructions and maintaining continuous oversight of the model’s behavior. As such, the rubric represents a critical, but not self-sufficient, component in achieving stable AI-based assessment, and must be complemented by prompt refinement and, when necessary, rubric adjustments to ensure full alignment with the intended evaluation framework.

4.3. Discussion for Step 3.1.3: Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples

Section 3.1.3 investigated whether refining the evaluation structure, through a detailed rubric, illustrative examples, standardized output format, and strict prompt controls, could ensure grading consistency for a single student response across multiple ChatGPT-4o interactions. The results showed complete stability in the grading outputs across 10 chat sessions (Section S7.5).

4.3.1. Structuring the Rubric to Constrain Interpretation

The complete grading stability achieved in this step demonstrates that ambiguity in AI assessment can be minimized (S. Li & Ng, 2024b) when the rubric is designed with precise definitions, clearly delimited concepts, and explicit scoring boundaries (Section S7.1). Unlike previous stages, this structure left little room for the model to improvise, resulting in strict adherence to the instructor’s evaluation logic.

4.3.2. Examples as Calibration Anchors

Examples played a pivotal role (Yancey et al., 2023) in guiding the model’s decision-making (Section S7.2), helping it distinguish between acceptable and unacceptable justifications for each score level. However, integrating examples also introduced complexity; selecting or designing appropriate examples required fine-tuning the rubric, avoiding contradictions across scoring levels or between different evaluation criteria. The optimal number and type of examples remained uncertain, raising questions about how much instructional scaffolding is necessary for the model to achieve perfect consistency.

4.3.3. Standardization of Output and Prompt Design

Stability was further reinforced by a standardized output format (Section S7.3) and a tightly constrained prompt (Section S7.4) that explicitly banned non-rubric criteria and limited model stochasticity via hyperparameter tuning. This combination ensured that ChatGPT-4o followed a deterministic, rule-based evaluation logic and avoided unintended improvisations. Additionally, the use of hyperparameters in the final prompt, specifically a low temperature (0.1), a frequency penalty of 0.0, and a negative presence penalty (−1.0), contributed meaningfully to the consistency of ChatGPT-4o’s behavior (Ilemobayo et al., 2024). The low temperature limited randomness, encouraging more stable and repeatable outputs. The neutral frequency penalty avoided penalizing repeated terms that are essential in rubric-based grading, such as specific evaluative labels. Meanwhile, the negative presence penalty promoted the reuse of key rubric concepts across responses, reinforcing alignment with the expected criteria. Together, these parameters helped reduce improvisation and focus the model on a fixed evaluation logic. If adjusted, e.g., using a higher temperature or positive presence penalty, the model could introduce more variation or evaluation criteria, which may be desirable for creative tasks but detrimental in contexts requiring strict consistency, such as rubric-based assessment.
Importantly, the standardized output format used in this step was arbitrary and can be freely defined by the instructor. By specifying the desired structure, length, and evaluative focus within the prompt—and supporting it with illustrative examples—the AI can be guided to replicate that output reliably. In contrast, if no format is explicitly established, ChatGPT-4o may generate responses with inconsistent structure and depth, often reintroducing non-rubric criteria such as Depth of Analysis (as seen in Section S5.1), thereby undermining consistency.
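As an illustration of how these settings can be locked in practice, the following sketch performs a grading call through the OpenAI Python SDK with the hyperparameters reported above (temperature 0.1, frequency penalty 0.0, presence penalty −1.0) and an arbitrary, instructor-defined output template. The prompt wording, model name, and OUTPUT_FORMAT string are illustrative assumptions, not the exact materials used in the study.

```python
# Minimal sketch: a rubric-constrained grading call with fixed hyperparameters and a
# standardized output format. Prompt text and template are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

OUTPUT_FORMAT = (
    "For each rubric criterion return one line:\n"
    "criterion name | score | one-sentence justification quoting the rubric\n"
    "Finish with: TOTAL | <sum of scores>"
)

def grade(rubric: str, student_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,        # limits randomness, encouraging repeatable outputs
        frequency_penalty=0.0,  # do not penalize repeated rubric labels
        presence_penalty=-1.0,  # encourage reuse of key rubric concepts
        messages=[
            {"role": "system",
             "content": "Grade strictly with the rubric below. Do not introduce any "
                        "criterion that is not in the rubric.\n" + rubric
                        + "\n" + OUTPUT_FORMAT},
            {"role": "user", "content": student_answer},
        ],
    )
    return response.choices[0].message.content
```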

4.3.4. Trade-Off Between Stability and Preparation Effort

While the approach successfully eliminated grading variability in this controlled scenario, it came at the cost of increased preparation. Constructing and refining rubrics, examples, and prompts demanded significant time and expertise, particularly in ensuring internal consistency and avoiding contradictory cues (Kaldaras et al., 2022). Moreover, rubrics may require updates over time as instructional goals or content evolve, meaning the system must remain adaptable without compromising stability (Ling, 2024). Although developing example-rich rubrics and decision tables does require an up-front design effort, the algorithm subsequently automates rubric creation for new questions and limits the instructor’s role to brief spot-checks. Classroom deployments of LLM-based graders report average time savings of 40–80%, with 65% savings reported in a recent multi-institution science, technology, engineering, and mathematics (STEM) study, once the initial calibration is complete (Yang et al., 2025). Likewise, Xie et al. (2024) demonstrate that while designing effective prompts and frameworks requires upfront instructor effort, these can be reused and adapted across courses and semesters, enabling scalable deployment of LLM-based educational tools. Hence, our use of the term “minimal human intervention” refers to the ongoing workload once the system is deployed, not to the one-time rubric-design phase.

4.3.5. Lessons Learned

This step confirms that highly stable AI grading is achievable through comprehensive rubric design, rigorous prompt control, and well-chosen examples. However, it also reveals the operational cost of such precision. Achieving consistency is not merely a technical task; it is an iterative design challenge that blends pedagogical clarity with prompt engineering (Lo, 2023). When these elements are harmonized, ChatGPT-4o can serve as a reliable grading assistant, provided that human oversight remains central to the design and refinement of its evaluative framework.

4.4. Discussion for Step “Refining the Rubric for Consistent Evaluation Across Multiple Students” (Section 3.2)

While the first three experimental stages focused primarily on grading stability for a single student response, the step “Refining the Rubric for Consistent Evaluation Across Multiple Students” (Section 3.2) extended the analysis to a broader set of students to evaluate fairness, defined here as the consistent application of evaluation standards across varied content and students. This scaling process revealed new dimensions of variability and emphasized the ongoing challenges of maintaining uniformity in real-world classroom conditions.

4.4.1. Variability in Scaling Across Diverse Responses

As the refined rubric was applied to 10 and later 20 distinct student answers, new inconsistencies emerged in the final scores (Sections S9.1–S9.3 and S10.1–S10.3). These discrepancies were not always due to flaws in the rubric itself, but rather arose from the diversity of student expression, which exposed edge cases or ambiguities in interpretation. Even with identical rubrics and prompts, some responses triggered unexpected variation in scoring, highlighting the complexity of ensuring fairness at scale (Yigiter & Boduroğlu, 2025).

4.4.2. Sensitivity to Model Updates and Example Reduction

Two experimental changes revealed the fragility of fairness in AI-based grading:
(a)
Model Updates: A system update to ChatGPT-4o between iterations introduced subtle changes in how evaluation criteria were applied, even when instructions and inputs remained constant. Although the same ChatGPT version was used throughout this phase of the study, an update to the model’s operational parameters, rather than to its pre-trained weights, may account for the differences observed after that update. These changes stem from GPT-4o replacing GPT-4 as the base model for ChatGPT; several of its features were still in a testing phase during February 2025 and were later consolidated in the latest GPT-4o release, which modified the default values of internal operational parameters. These adjustments enhanced the model’s ability to follow instructions and improved formatting precision, alongside changes aimed at better interpreting the implied intent behind user prompts, referred to by OpenAI as fuzzy improvements (OpenAI, n.d.). This demonstrates that versioning must be tracked and tested whenever AI tools are used in longitudinal educational assessments. Longitudinal evaluations have shown that GPT-4 performance can drift across releases; for example, its accuracy in identifying prime numbers dropped from 84% (March 2023) to 51% (June 2023), with chain of thought compliance decreasing accordingly (L. Chen et al., 2023). Such shifts threaten the reproducibility of automated grading pipelines. To contain this risk, it could be useful to (i) log the exact model identifier and timestamp for each evaluation instance, (ii) lock inference hyper-parameters (e.g., temperature = 0.1, presence-penalty = −1.0) to minimize stochasticity, and (iii) rerun a fixed calibration set after any detected version change to quantify drift; a minimal sketch of such version logging is provided after this list.
(b)
Example Removal: The gradual removal of illustrative examples (Section S7.2) reduced the effort involved in rubric deployment but negatively impacted the rate of fully consistent evaluations. As shown in Sections S10.1–S10.3, the absence of examples correlated with a drop in perfect consistency. However, this effect depends on how well the examples are constructed; Section S10.2 shows that including only three selected examples was actually less effective than providing none at all, as in Section S10.3 (Hmoud et al., 2024).
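As a minimal illustration of safeguards (i) and (ii) above, the sketch below records the resolved model snapshot, a timestamp, and the locked hyperparameters alongside every graded output so that version changes can be detected later. The field names, log path, and model identifier are assumptions for illustration rather than the study’s actual tooling.

```python
# Minimal sketch: append an audit record (model snapshot, timestamp, locked parameters,
# output) to a JSONL log for every grading call, enabling later drift analysis.
import json
import datetime
from openai import OpenAI

client = OpenAI()
LOCKED_PARAMS = {"temperature": 0.1, "frequency_penalty": 0.0, "presence_penalty": -1.0}

def grade_with_audit_trail(messages, log_path="grading_log.jsonl"):
    response = client.chat.completions.create(model="gpt-4o", messages=messages, **LOCKED_PARAMS)
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_snapshot": response.model,   # resolved version, e.g. a dated gpt-4o snapshot
        "params": LOCKED_PARAMS,
        "output": response.choices[0].message.content,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```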

4.4.3. Normalization and Decision Tables as a Scalable Alternative to Fine-Tuning

The normalization process and the introduction of detailed decision tables (Section S8.1) helped by enforcing a binary, rule-based evaluation logic, in which specific Yes/No questions led to clearly defined scoring actions, as illustrated in the sketch below. These tools reduced interpretive flexibility and enhanced transparency in how grades were assigned.
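The following sketch shows the general shape of such a binary decision table; the questions and score values are hypothetical placeholders rather than the tables used in the study.

```python
# Minimal sketch: a binary decision table pairing Yes/No questions with scoring actions.
# Rows are evaluated top-down; the first row that yields a score decides the criterion.
DECISION_TABLE = [
    # (question asked of the response, score if Yes, score if No)
    ("Does the answer name a specific target segment?",            None, 0),
    ("Is the segment justified with at least one course concept?", 2,    1),
]

def apply_decision_table(answers: list[bool]) -> int:
    """answers: one True/False per question, in table order."""
    for (question, score_if_yes, score_if_no), answer in zip(DECISION_TABLE, answers):
        score = score_if_yes if answer else score_if_no
        if score is not None:
            return score
    return 0  # fall-through default

# A response that names a segment but gives no justification -> answers [True, False]
print(apply_decision_table([True, False]))  # prints 1
```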
One of the most relevant outcomes of this approach is that it avoids the need for fine-tuning language models, which is currently a central requirement in most automated essay scoring (AES) systems. Traditional AES methods rely on training models with thousands of human-scored examples to approximate consistency. In contrast, the method presented here reaches comparable levels of reliability through the use of structured logic and targeted prompt design, without the need for large datasets or retraining. This distinction is critical: it significantly lowers the technical barrier to implementation and allows for the rapid creation of grading protocols for new open-ended questions, such as the one in Section S1 (Mayfield & Black, 2020).
Although the system does not always produce perfect stability (i.e., all sessions in yellow), the high proportion of near-consistent scores (blue cases in Sections S9.1 and S10.1–S10.3) reflects an effective balance between scalability and reliability. These small variations typically occur in borderline responses, where interpretation is more subjective and even human graders might disagree. For this reason, blue cases are not necessarily evidence of system failure but rather a fair reflection of the complexity of student responses. In fact, many of these answers include contradictions, vague phrasing, or partial reasoning that could reasonably be scored in more than one way depending on the rater’s strictness.
The approach also enables the design of hybrid systems where human judgment complements automated grading. Since the AI produces structured, explainable outputs, instructors can quickly review and, if necessary, adjust individual scores. This workflow substantially reduces grading time while maintaining pedagogical control.
Finally, the implementation of a cross-session consensus strategy, where the AI compares multiple outputs to detect stable patterns and resolve inconsistencies, could further enhance performance in ambiguous cases; a minimal sketch of this idea follows. Taken together, these results suggest that normalization and decision tables offer a robust, transparent, and scalable alternative to data-intensive fine-tuning approaches. While fine-tuning has proven valuable in enhancing model accuracy for complex domains (Yamtinah et al., 2025), the integration of normalization processes and decision tables offers an alternative, scalable, low-effort pathway that enables consistent evaluation logic without requiring large datasets, positioning it as a viable path forward in the evolution of AI-assisted assessment (Atkinson & Palma, 2025).
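A minimal sketch of that consensus strategy, assuming per-criterion scores collected from several independent chat sessions and a simple majority-vote rule with an arbitrary agreement threshold, might look as follows.

```python
# Minimal sketch: grade the same response in several independent sessions, keep the
# majority score per criterion, and flag criteria without a clear majority for review.
from collections import Counter

def consensus(scores_per_session: list[dict], min_agreement: float = 0.6):
    """scores_per_session: one {criterion: score} dict per chat session."""
    result, needs_review = {}, []
    for criterion in scores_per_session[0]:
        votes = Counter(session[criterion] for session in scores_per_session)
        score, count = votes.most_common(1)[0]
        if count / len(scores_per_session) >= min_agreement:
            result[criterion] = score
        else:
            needs_review.append(criterion)   # ambiguous: route to the instructor
    return result, needs_review

sessions = [{"segment": 2, "justification": 1},
            {"segment": 2, "justification": 2},
            {"segment": 2, "justification": 1}]
print(consensus(sessions))  # ({'segment': 2, 'justification': 1}, [])
```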

4.4.4. Operational Maintenance and Model Drift

Once the seed rubrics and decision tables are in place, the same pipeline grades any new question with no additional rubric edits. Day-to-day upkeep is limited to a quick spot-check on a small calibration set whenever the LLM version changes. A similar “rubric + spot-check” workflow has already scaled to thousands of submissions in three large MOOCs, with instructors intervening only when model drift is detected (Golchin et al., 2024). These early results suggest that the main bottleneck for expansion is computational rather than human, although multi-semester studies are still needed to fully characterize long-term costs.
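A minimal sketch of such a spot-check, assuming a stored set of reference scores and an arbitrary drift tolerance, is shown below; the data layout and threshold are illustrative assumptions.

```python
# Minimal sketch: re-grade a small fixed calibration set after a model version change
# and report how far the new scores drift from the stored reference scores.
def spot_check(grade_fn, calibration_set, tolerance=0.05):
    """calibration_set: list of (student_answer, reference_total_score) pairs.
    grade_fn: callable returning the total score the current model assigns."""
    mismatches = sum(1 for answer, reference in calibration_set if grade_fn(answer) != reference)
    drift_rate = mismatches / len(calibration_set)
    if drift_rate > tolerance:
        print(f"Drift {drift_rate:.0%} exceeds tolerance: re-calibrate before grading.")
    else:
        print(f"Drift {drift_rate:.0%} within tolerance: pipeline can keep running.")
    return drift_rate
```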

4.4.5. Lessons Learned

Scaling AI-based grading to diverse student responses introduces new fairness challenges, even when using refined rubrics. Results show that variability often stems from content diversity, model updates, or poorly constructed examples. However, leveraging normalization techniques and decision tables provides a scalable, low-effort substitute for fine-tuning, supporting consistent evaluation logic without the need for large datasets (Atkinson & Palma, 2025). While not flawless, this approach supports human–AI hybrid workflows and provides a transparent, adaptable path toward fair assessment across varied educational contexts.

4.5. Discussion for Step “Proposed Algorithm” (Section 3.3)

The results for the step “Proposed Algorithm” (Section 3.3) confirm that the proposed algorithm successfully replicates the normalization process and decision tables for both the original question (Section S16) and a novel question with a distinct structure (Section S18). This validates the algorithm as a self-sufficient system capable of generating complete evaluation rubrics, including formal criteria definitions, evaluated concepts, scoring ranges, problem concepts, clarifying notes, and decision logic. Although autonomous in its operation, the algorithm was designed as an iterative tool that enables instructor intervention at each phase, allowing for the adjustment, refinement, or redefinition of evaluation elements. This flexibility ensures alignment with pedagogical intentions while maintaining process consistency.

4.5.1. Automating Rubric Construction for Consistency

The algorithm successfully encoded a step-by-step pipeline for defining evaluation criteria, constructing decision tables, producing uniform output formats, and generating correction prompts (Sections S12 and S13). By systematizing this process, the model reduces the reliance on manual rubric design for each new question. This promotes stability through procedural standardization and fairness by applying a consistent framework to new educational contexts (Mangili et al., 2022).
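The sketch below outlines this pipeline at a high level, with an instructor checkpoint after each phase. The phase functions and their names are hypothetical placeholders standing in for the corresponding LLM prompts, not the study’s actual implementation.

```python
# High-level sketch of the instructor-in-the-loop rubric-construction pipeline:
# each phase drafts an artefact with the LLM, then pauses for expert review.
def run_rubric_pipeline(question: str, draft_with_llm, instructor_review):
    """draft_with_llm(phase, context) -> draft artefact for that phase;
    instructor_review(phase, draft) -> approved (possibly edited) artefact."""
    phases = [
        "define evaluation criteria",
        "construct decision tables",
        "define standardized output format",
        "generate correction prompt",
    ]
    artefacts = {"question": question}
    for phase in phases:
        draft = draft_with_llm(phase, artefacts)             # automated proposal
        artefacts[phase] = instructor_review(phase, draft)   # pause for expert review
    return artefacts  # everything needed to grade the new question

# Example wiring with trivial stand-ins (a real deployment would call the LLM and
# present each draft to the instructor):
artefacts = run_rubric_pipeline(
    "Explain how you would segment the market for product X.",
    draft_with_llm=lambda phase, ctx: f"<draft {phase}>",
    instructor_review=lambda phase, draft: draft,  # instructor accepts as-is
)
```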

4.5.2. Equivalent Outcomes Using Structurally Aligned Versus Original Rubrics

When the rubrics generated by the algorithm were applied to a new group of 20 students (Section S19), the results showed that using a reformulated rubric—one explicitly structured to match the algorithm’s decision tables and format (Section S17)—produced scoring outcomes similar to those obtained with the original rubric designed by the instructor (Section S15). In other words, both rubrics led to functionally equivalent grades. However, the version aligned with the algorithm’s structure offered important advantages: it used clearer language, reduced ambiguity in interpretation, and made the evaluation process easier to understand and replicate (Kaldaras et al., 2022). These benefits suggest that structurally aligning rubrics with the algorithm’s logic does not affect grading outcomes, but it does enhance transparency and consistency in how the evaluation is carried out.

4.5.3. The Algorithm Accurately Reproduces Evaluative Logic

The algorithm’s output, executed according to the designed protocol, successfully reconstructed the instructor’s evaluation logic, even for a new assessment context. Section S18 illustrates how the system correctly reproduced the formal rubric structure, decision tables, and scoring logic for a previously unseen question. The evaluation of 20 new student responses showed high consistency, with only one inconsistent case observed across all sessions (Section S19). This performance level suggests that the algorithm can deliver reliable results for new open-ended questions (Bernard et al., 2024).

4.5.4. Enhanced Reasoning Capabilities

The inclusion of enhanced reasoning capabilities has direct implications for the outcomes produced by our algorithm. A comparison between GPT-4o and o1 (see Section S20) revealed that GPT-4o faced greater challenges when following more extensive protocols, primarily due to its limited long-term memory management. These shortcomings were addressed by GPT-o1, which demonstrated improved consistency throughout the evaluation process when applying the same protocol. The observed improvements in following complex instructions and structured procedures—such as those defined in this study—are highlighted by OpenAI as key features of o1. This model was specifically trained to enhance its performance in chain of thought (COT) reasoning. The “COT step-by-step problem solving” feature of GPT-o1 enables it to tackle complex tasks requiring multiple interaction steps, thereby providing it with greater analytical capacity, as documented in OpenAI’s model system card for GPT-o1 (OpenAI, 2024).

4.5.5. Lessons Learned

This phase highlights that consistent and fair grading is attainable without fine-tuning language models. Through a normalization process, decision tables, prompt engineering, and structured output formats, the algorithm achieves high stability with minimal resource demands. Importantly, the tool’s iterative nature allows instructors to intervene at any step to refine criteria, clarify logic, or adapt to specific pedagogical goals. This makes the system both scalable and adaptable across contexts, while preserving transparency and interpretability, essential features for integration into real-world educational workflows (Yeung et al., 2025).

5. Conclusions

The aim of this work was to determine whether ChatGPT is a reliable (stable) and fair tool for essay scoring. We found that when the model was prompted without a rubric, it produced volatile criteria and grades; adding the instructor’s rubric reduced, yet did not eliminate, this variability; and only after introducing an explicit, example-rich rubric, a standardized output format with low-temperature hyper-parameters, and a normalization process based on decision tables did ChatGPT-4o reach perfect stability for a single response and a high degree of fairness across twenty students (11 perfect, 8 moderate, 1 inconsistent). Drawing on these insights, we developed an algorithm capable of generating effective grading rubrics with minimal human intervention. This algorithm is characterized by the following:
(a)
Automatic detection of evaluation criteria from any open-ended question.
(b)
Creation of fully explicit normalization artefacts—formal definitions, evaluated concepts, possible scores, problem concepts, notes, and binary decision tables.
(c)
Construction of a standardized output format and a tightly constrained prompt with fixed hyper-parameters.
(d)
An instructor-in-the-loop workflow that delivers high stability and fairness without fine-tuning, keeping the need for specialized knowledge and additional resources low.
The rapid advancement of generative AI highlights the importance of monitoring how model updates affect the consistency of automated essay scoring tools. This study demonstrated that improvements in reasoning capabilities enabled GPT-o1 to outperform GPT-4o when evaluated under the protocol described in this article. Recent trends indicate a growing interest in enhancing the reasoning abilities of large language models (Xia et al., 2025). A common feature of these approaches involves aligning the pretrained model using multi-hop datasets—that is, datasets composed of examples that require solving complex problems through a sequence of reasoning steps. Although such improvements suggest that adherence to complex protocols should improve, it remains essential to assess the actual capabilities of these models. Future evaluations should consider whether enhanced reasoning capacities translate into gains in stability and fairness, as these qualities are not necessarily correlated.
The added value of this work is as follows:
(a)
Offers a prompt-engineering, model-agnostic pathway that achieves stable and fair automated essay scoring, requiring only limited technical or computational resources.
(b)
Provides a scalable algorithm that automatically builds normalized rubrics and decision tables for new questions, widening the reach of reliable assessment.
(c)
Produces transparent, explainable, standardized feedback that supports hybrid human-AI workflows and lowers the technical barrier to adoption in educational settings.
The study also quantifies the trade-off between precision and stability: by comparing no-rubric, rubric-only, and fully normalized conditions (with examples), it identifies how much instructional effort is needed for each incremental improvement in grading consistency, offering instructors an evidence-based framework to guide effort allocation. It further validates the use of normalization and decision tables as a low-data alternative to traditional model fine-tuning, demonstrating that near-perfect grading consistency can be achieved and maintained across entirely new open-ended questions without requiring large datasets. The algorithm operates as an instructor-in-the-loop pipeline, with each of its four phases including a pause for expert review, allowing teachers to adjust criteria and clarify rules before continuing. The paper also outlines concrete operational safeguards for real-world deployment, including calibrated hyperparameter settings to mitigate model drift. Additionally, it presents a controlled comparison between GPT-4o and GPT-o1 in chained grading tasks, showing that GPT-o1’s extended context and chain of thought training yield more consistent outcomes, informing future model selection for rubric generation. Finally, the protocol transforms grading into a pedagogically valuable activity: by requiring criterion-level justifications, students receive actionable feedback while instructors can audit scores swiftly, multiplying pedagogical value and reducing grading time.
Although the algorithm was built using a bottom-up data-driven approach, deriving criteria from authentic prompts and refining decision tables, it proved effective beyond its original dataset. By successfully applying the same evaluative logic to a new question (Section S15) and achieving consistent grading for 20 new students (Section S19), the framework demonstrated its comprehensiveness and rigor. In contrast, a top-down approach based on classical assessment theory would require extensive upfront effort, including domain analysis, large expert-scored datasets, and regular updates with each LLM revision, undermining the lightweight, reusable nature that makes the proposed bottom-up method practical for real-world use.
This work offers several practical recommendations for educators seeking to integrate AI tools into assessment processes:
(a)
Avoid using AI tools for grading without a clear structure. AI should not be treated as a standalone evaluator; it must operate within a carefully designed and pedagogically sound framework.
(b)
Translate rubrics into explicit decision rules. Converting evaluation criteria into a structured yes/no logic enhances transparency and improves consistency across assessments.
(c)
Standardize AI configurations. To ensure reliable results, prompts, output formats, and temperature settings should be standardized and consistently applied.
(d)
Use AI-generated feedback as a learning tool. Share the underlying evaluation logic with students to foster metacognitive awareness and support self-assessment.
The limitations of this work are fourfold. First, perfect stability still demands a form of data-intensive fine-tuning by example: the stage that yielded flawless agreement (Section S9.3) required a large, carefully curated set of illustrative answers for every evaluation criterion. Crafting, curating, and periodically refreshing that example bank is equivalent, in effort if not in model-weight updates, to the data demands of classic AES fine-tuning. We showed that a Pareto-optimal baseline that keeps only the normalization process, well-constructed prompt, and decision tables (Section S8) attains “high-but-not-perfect” consistency; however, reaching the last few percentage points still costs disproportionately more instructional data, so the workflow should include cross-prompt grading or a light human audit to reconcile residual divergences.
Second, external validity is constrained by the small sample (N = 40). Expanding the study to hundreds of students would strengthen the fairness claims, but doing so manually is prohibitive. While an API-driven pipeline could automate data collection and grading, such infrastructure entails additional monetary costs (API calls, server time, dataset storage, and compliance). In the same line, the focus on a single domain (marketing) and question type also limits external validity. Future research should test this protocol across different disciplines and assessment formats to assess its generalizability.
Third, the protocol remains sensitive to model updates. Even minor parameter changes in GPT-4o altered rubric interpretation; therefore, any production deployment must track versioning, re-benchmark after each release, and roll back or re-calibrate if stability drifts. This sensitivity was reduced in the final implementation, as the use of illustrative examples was removed in favor of a purely structured format (Section S8).
Fourth, long-horizon tasks such as running the rubric-construction algorithm require a model with stronger session memory (GPT-o1) than the freely available GPT-4o. Relying on GPT-o1 (or an equivalent long-context model) raises accessibility and cost issues for institutions/users that cannot afford premium tiers.

6. Declaration of Generative AI and AI-Assisted Technologies in the Writing Process

During the preparation of this work, the author(s) used ChatGPT 4 to expand ideas and explore different versions of the text, particularly in refining the language. As non-native English speakers, the author(s) found ChatGPT to be an effective and affordable tool for revising their language and considering multiple options for phrasing and expression. This tool was much more cost-effective compared to the human editors previously used. All work carried out by ChatGPT was based entirely on the inputs provided by the author(s), who reviewed and edited the content as needed. The author(s) take full responsibility for the final content of the publication.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/educsci15080946/s1.

Author Contributions

Conceptualization, F.G.-V.; Methodology, M.N.; Validation, M.M. and Z.B.; Formal analysis, C.M.-T.; Investigation, F.G.-V.; Writing—original draft, F.G.-V.; Writing—review & editing, F.G.-V., M.M., C.M.-T. and Z.B.; Supervision, M.N.; Project administration, M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Agencia Nacional de Investigación y Desarrollo (ANID), FONDECYT Grant No. 1241462. Additional support was provided by the Millennium Institute for Foundational Research on Data (IMFD), ICN17_002, and the National Center for Artificial Intelligence (CENIA), FB210017.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee for Social Sciences, Arts and Humanities at the Pontificia Universidad Católica de Chile (approval code: 241120006; approval date: 8 January 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. A blank version of the consent form was submitted to the editorial office and is not publicly available due to confidentiality considerations.

Data Availability Statement

The data supporting the findings of this study are not publicly available due to ethical and privacy restrictions. However, selected materials and results are provided as Supplementary Material and are available upon request from the corresponding author.

Acknowledgments

All individuals to be acknowledged have provided their consent.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Almusharraf, N., & Alotaibi, H. (2023). An error-analysis study from an EFL writing context: Human and automated essay scoring approaches. Technology, Knowledge and Learning, 28(3), 1015–1031. [Google Scholar] [CrossRef]
  2. Anchiêta, R. T., de Sousa, R. F., & Moura, R. S. (2024, October 17). A robustness analysis of automated essay scoring methods. Anais do Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL), Belém, Brazil. [Google Scholar] [CrossRef]
  3. Atkinson, J., & Palma, D. (2025). An LLM-based hybrid approach for enhanced automated essay scoring. Scientific Reports, 15(1), 5623. [Google Scholar] [CrossRef] [PubMed]
  4. Bernard, R., Raza, S., Das, S., & Murugan, R. (2024). EQUATOR: A deterministic framework for evaluating LLM reasoning with open-ended questions. arXiv. [Google Scholar] [CrossRef]
  5. Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., & Hoefler, T. (2024, February 20–27). Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada. [Google Scholar] [CrossRef]
  6. Brown, G. T. (2022). The past, present and future of educational assessment: A transdisciplinary perspective. Frontiers in Education, 7, 1060633. [Google Scholar] [CrossRef]
  7. Bui, N. M., & Barrot, J. S. (2024). ChatGPT as an automated essay-scoring tool in the writing classroom: How it compares with human scoring. Education and Information Technologies, 30, 2041–2058. [Google Scholar] [CrossRef]
  8. Chen, L., Zaharia, M., & Zou, J. (2023). How is ChatGPT’s behavior changing over time? arXiv. [Google Scholar] [CrossRef]
  9. Chen, Y., & Li, X. (2023). PMAES: Prompt-mapping contrastive learning for cross-prompt automated essay scoring. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st annual meeting of the association for computational linguistics (Vol. 1, pp. 1489–1503). Association for Computational Linguistics. [Google Scholar]
  10. Golchin, S., Garuda, N., Impey, C., & Wenger, M. (2024). Grading massive open online courses using large language models. arXiv. [Google Scholar] [CrossRef]
  11. Hmoud, M., Swaity, H., Anjass, E., & Aguaded-Ramírez, E. M. (2024). Rubric development and validation for assessing tasks’ solving via AI chatbots. Electronic Journal of e-Learning, 22(6), 1–17. Available online: https://files.eric.ed.gov/fulltext/EJ1434299.pdf (accessed on 30 January 2025). [CrossRef]
  12. Ilemobayo, J. A., Durodola, O., Alade, O., Awotunde, O. J., Olanrewaju, A. T., Falana, O., Ogungbire, A., Osinuga, A., Ogunbiyi, D., Ifeanyi, A., Odezuligbo, I. E., & Edu, O. E. (2024). Hyperparameter tuning in machine learning: A comprehensive review. Journal of Engineering Research And Reports, 26(6), 388–395. [Google Scholar] [CrossRef]
  13. Kaldaras, L., Yoshida, N. R., & Haudek, K. C. (2022). Rubric development for AI-enabled scoring of three-dimensional constructed-response assessment aligned to NGSS learning progression. Frontiers in Education, 7, 983055. [Google Scholar] [CrossRef]
  14. Li, R., Wang, Y., Wen, Z., Cui, M., & Miao, Q. (2025). Different paths to the same destination: Diversifying LLM generation for multi-hop open-domain question answering. Knowledge-Based Systems, 309, 112789. [Google Scholar] [CrossRef]
  15. Li, S., & Ng, V. (2024a, November 12–16). Automated essay scoring: A reflection on the state of the art. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 17876–17888), Miami, FL, USA. [Google Scholar]
  16. Li, S., & Ng, V. (2024b, August 3–9). Automated essay scoring: Recent successes and future directions. Proceedings of the 33rd International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea. [Google Scholar]
  17. Ling, J. H. (2024). A review of rubrics in education: Potential and challenges. Pedagogy: Indonesian Journal of Teaching and Learning Research, 2(1), 1–14. Available online: https://ejournal.aecindonesia.org/index.php/pedagogy/article/view/199 (accessed on 30 January 2025). [CrossRef]
  18. Lo, C. K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13(4), 410. [Google Scholar] [CrossRef]
  19. Long, J. (2023). Large language model guided tree-of-thought. arXiv. [Google Scholar] [CrossRef]
  20. Lou, R., Zhang, K., & Yin, W. (2024). Large language model instruction following: A survey of progresses and challenges. Computational Linguistics, 50(3), 1053–1095. [Google Scholar] [CrossRef]
  21. Mangili, F., Adorni, G., Piatti, A., Bonesana, C., & Antonucci, A. (2022). Modelling assessment rubrics through bayesian networks: A pragmatic approach. arXiv. [Google Scholar] [CrossRef]
  22. Mayfield, E., & Black, A. W. (2020, July 10). Should you fine-tune BERT for automated essay scoring? The 15th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 151–160), Seattle, WA, USA. Available online: https://aclanthology.org/2020.bea-1.16 (accessed on 30 January 2025).
  23. Miao, T., & Xu, D. (2025). KWM-B: Key-information weighting methods at multiple scale for automated essay scoring with BERT. Electronics, 14(1), 155. [Google Scholar] [CrossRef]
  24. OpenAI. (n.d.). ChatGPT release notes. OpenAI Help Center. Available online: https://help.openai.com/en/articles/6825453-chatgpt-release-notes (accessed on 25 February 2025).
  25. OpenAI. (2024, December 5). GPT-o1 system card [Technical report]. Available online: https://arxiv.org/pdf/2412.16720 (accessed on 30 January 2025).
  26. Ouyang, F., Dinh, T. A., & Xu, W. (2023). A systematic review of AI-driven educational assessment in STEM education. Journal for STEM Education Research, 6(3), 408–426. [Google Scholar] [CrossRef]
  27. Stahl, M., Biermann, L., Nehring, A., & Wachsmuth, H. (2024). Exploring LLM prompting strategies for joint essay scoring and feedback generation. arXiv. [Google Scholar] [CrossRef]
  28. Tang, X., Lin, D., & Li, K. (2024). Enhancing automated essay scoring with GCNs and multi-level features for robust multidimensional assessments. Linguistics Vanguard, 10(1). [Google Scholar] [CrossRef]
  29. Tate, T. P., Steiss, J., Bailey, D., Graham, S., Moon, Y., Ritchie, D., Tseng, W., & Warschauer, M. (2024). Can AI provide useful holistic essay scoring? Computers & Education: Artificial Intelligence, 5, 100255. [Google Scholar] [CrossRef]
  30. Triem, H., & Ding, Y. (2024). Tipping the balance: Human intervention in large language model multi-agent debate. Proceedings of the Association for Information Science and Technology, 61(1), 1034. [Google Scholar] [CrossRef]
  31. Uyar, A. C., & Büyükahıska, D. (2025). Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT. International Journal of Assessment Tools in Education, 12(1), 20–32. [Google Scholar] [CrossRef]
  32. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023, May 1–5). Self-consistency improves chain of thought reasoning in language models. The International Conference on Learning Representations, Kigali, Rwanda. [Google Scholar]
  33. Wang, Y., Hu, R., & Zhao, Z. (2024). Beyond agreement: Diagnosing the rationale alignment of automated essay-scoring methods based on linguistically informed counterfactuals. arXiv. [Google Scholar] [CrossRef]
  34. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. [Google Scholar]
  35. Wu, X., Saraf, P. P., Lee, G., Latif, E., Liu, N., & Zhai, X. (2024). Unveiling scoring processes: Dissecting the differences between LLMs and human graders in automatic scoring. arXiv. [Google Scholar] [CrossRef]
  36. Wu, Z., Jiang, M., & Shen, C. (2024, February 20–27). Get an A in math: Progressive rectification prompting. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada. [Google Scholar] [CrossRef]
  37. Xia, Y., Wang, R., Liu, X., Li, M., Yu, T., Chen, X., McAuley, J., & Li, S. (2025, January 19–24). Beyond chain-of-thought: A survey of chain-of-X paradigms for LLMs. 31st International Conference on Computational Linguistics (pp. 10795–10809), Abu Dhabi, United Arab Emirates. [Google Scholar]
  38. Xiao, C., Ma, W., Song, Q., Xu, S. X., Zhang, K., Wang, Y., & Fu, Q. (2025, March 3–7). Human-AI collaborative essay scoring: A dual-process framework with LLMs. 15th International Learning Analytics and Knowledge Conference (LAK 2025), Dublin, Ireland. [Google Scholar]
  39. Xie, W., Niu, J., Xue, C. J., & Guan, N. (2024). Grade like a human: Rethinking automated assessment with large language models. arXiv. [Google Scholar] [CrossRef]
  40. Yamtinah, S., Wiyarsi, A., Widarti, H. R., Shidiq, A. S., & Ramadhani, D. G. (2025). Fine-tuning AI Models for enhanced consistency and precision in chemistry educational assessments. Computers And Education Artificial Intelligence, 8, 100399. [Google Scholar] [CrossRef]
  41. Yancey, K. P., Laflair, G., Verardi, A., & Burstein, J. (2023, July 13). Rating short L2 essays on the CEFR scale with GPT-4. 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 576–584), Toronto, ON, Canada. [Google Scholar]
  42. Yang, Y., Kim, M., Rondinelli, M., & Shao, K. (2025). Pensieve grader: An AI-powered, ready-to-use platform for effortless handwritten STEM grading. arXiv. [Google Scholar] [CrossRef]
  43. Yeadon, W., & Hardy, T. (2024). The impact of AI in physics education: A comprehensive review from GCSE to university levels. Physics Education, 59(2), 025010. [Google Scholar] [CrossRef]
  44. Yeung, C., Yu, J., Cheung, K. C., Wong, T. W., Chan, C. M., Wong, K. C., & Fujii, K. (2025). A zero-shot LLM framework for automatic assignment grading in higher education. arXiv. [Google Scholar] [CrossRef]
  45. Yigiter, M. S., & Boduroğlu, E. (2025). Examining the performance of artificial intelligence in scoring students’ handwritten responses to open-ended items. TED EĞİTİM VE BİLİM, 50, 1–18. [Google Scholar] [CrossRef]
  46. Yun, J. (2023). Meta-analysis of inter-rater agreement and discrepancy between human and automated English essay scoring. English Teaching, 78(3), 105–124. [Google Scholar] [CrossRef]
  47. Zhang, C., Deng, J., Dong, X., Zhao, H., Liu, K., & Cui, C. (2025). Pairwise dual-level alignment for cross-prompt automated essay scoring. Expert Systems with Applications, 125, 125924. [Google Scholar] [CrossRef]
  48. Zhang, W., Shen, Y., Wu, L., Peng, Q., Wang, J., Zhuang, Y., & Lu, W. (2024). Self-contrast: Better reflection through inconsistent solving perspectives. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd annual meeting of the association for computational linguistics (Vol. 1, pp. 3602–3622). Association for Computational Linguistics. [Google Scholar] [CrossRef]
