ChatGPT as a Stable and Fair Tool for Automated Essay Scoring
Abstract
1. Introduction
1.1. Definition of the Problem
1.2. State of the Art
2. Methodology
2.1. Research Context and Sample
2.2. Research Model and Procedure
2.2.1. Assessing ChatGPT Reliability
Consistency in AI-Based Grading Without a Prescribed Rubric
- ChatGPT was tested across 10 independent chat sessions, each representing a single interaction with the AI. A new chat session was initiated for each evaluation by opening a new chat window or thread to ensure independent assessments.
- In each chat session, ChatGPT-4o was asked to grade the student’s answer without being given any (external) rubric. First, ChatGPT-4o was asked to assess the student’s response and then to report the evaluation criteria it had used to grade that student (Section S3). An illustrative sketch of this looping procedure is provided after this list.
- A table was created (Section S4) summarizing the results. On the horizontal axis are the ten chat sessions and on the vertical axis the different evaluation criteria indicated by ChatGPT-4o. For each chat session, the criteria used in that iteration are highlighted in yellow. The last column lists the evaluation criteria used, and the last two rows indicate, for each iteration, the evaluation criteria applied and the assigned grade.
- Next, evaluation criteria are grouped together (Section S5.1) based on the definitions provided by ChatGPT-4o (Section S5.2). This step aimed to simplify the visualization from Section S4 and explicitly define the scope of the evaluation criteria used by the AI. By consolidating similar terms under shared definitions, it was possible to reduce redundancy and better understand the core dimensions consistently applied during grading.
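The study ran this protocol manually through the ChatGPT-4o web interface. Purely as an illustration of how the same looping protocol could be reproduced programmatically, the sketch below assumes the OpenAI Python SDK; GRADE_PROMPT, CRITERIA_PROMPT, and STUDENT_ANSWER are placeholders for the Section S3 prompts and the Section S1 response, not the actual study materials.

```python
# Minimal sketch: replicate the "no rubric" protocol across 10 independent sessions.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
GRADE_PROMPT = "Grade the following student answer: ..."     # placeholder for Section S3 prompt 1
CRITERIA_PROMPT = "Which evaluation criteria did you use?"    # placeholder for Section S3 prompt 2
STUDENT_ANSWER = "..."                                        # placeholder for the Section S1 response

results = []
for session in range(10):                                     # each loop = one independent "chat session"
    history = [{"role": "user", "content": f"{GRADE_PROMPT}\n\n{STUDENT_ANSWER}"}]
    grade = client.chat.completions.create(model="gpt-4o", messages=history)
    history.append({"role": "assistant", "content": grade.choices[0].message.content})
    history.append({"role": "user", "content": CRITERIA_PROMPT})
    criteria = client.chat.completions.create(model="gpt-4o", messages=history)
    results.append({
        "session": session + 1,
        "grade": grade.choices[0].message.content,
        "criteria": criteria.choices[0].message.content,
    })
```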
Assessing Reliability in AI-Based Grading Using a Predefined Rubric
- ChatGPT-4o is provided with the instructor’s rubric (Section S2), which specifies the evaluation criteria for the same question of Section 2.2.1. “Consistency in AI-Based Grading Without a Prescribed Rubric”.
- As in Section 2.2.1. “Consistency in AI-Based Grading Without a Prescribed Rubric”, ChatGPT was tested across 10 independent chat sessions, each representing a single interaction with the AI. A new chat session was initiated for each evaluation by opening a new chat window or thread to ensure independent assessments and using Section S3 prompts.
- In each chat session, ChatGPT-4o was asked to grade the student’s answer with the teacher’s rubric (Section S2).
- A table was created (Section S6.1) summarizing the results. On the horizontal axis are the ten iterations and on the vertical axis the different criteria indicated by ChatGPT-4o. A yellow highlight marks each evaluation criterion when it appears in a given chat session, and purple marks any evaluation criterion not defined by the instructor. The last column lists the evaluation criteria used, and the last two rows indicate, for each iteration, the evaluation criteria applied and the assigned grade. A sketch of how such a criteria-by-session table can be assembled is provided after this list.
- The definitions of the evaluation criteria that ChatGPT-4o implicitly uses (those highlighted in purple in Section S6.1) are documented, resulting in Section S6.2.
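To illustrate how the Section S4 and S6.1 tables can be assembled, the sketch below builds a criteria-by-session presence matrix and flags criteria outside the instructor’s rubric (the purple cells). The session contents and criterion labels are hypothetical examples, not the study’s actual data.

```python
# Sketch: presence matrix of evaluation criteria per chat session, flagging
# criteria not defined by the instructor. All values below are illustrative.
INSTRUCTOR_CRITERIA = {"Perceived Value", "Functional Benefits", "Psychological Benefits",
                       "Monetary Costs", "Non-Monetary Costs"}

sessions = {  # criteria reported by ChatGPT-4o in each chat session (hypothetical)
    1: {"Perceived Value", "Functional Benefits", "Justification"},
    2: {"Perceived Value", "Monetary Costs", "Clarity"},
}

all_criteria = sorted(set().union(*sessions.values()))
for criterion in all_criteria:
    row = ["X" if criterion in used else " " for used in sessions.values()]
    extra = "" if criterion in INSTRUCTOR_CRITERIA else "  <- not in instructor's rubric"
    print(f"{criterion:25s} | {' | '.join(row)}{extra}")
```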
Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples
- The instructor’s rubric is modified to make the evaluation criteria explicit (example for evaluation criterion Functional Benefits in Section S7.1). For each evaluation criterion defined by the teacher, the following is indicated:
- (a)
- Evaluation Criterion Name.
- (b)
- Formal Definition: A clear explanation of what is being evaluated.
- (c)
- Evaluated Concepts: A list of elements considered in the response, along with their respective scope and definition. These elements determine the assigned score.
- (d)
- Possible Scores: The permissible score range for the defined evaluation criterion.
- Examples are created for each possible score in each evaluation criterion (example for evaluation criterion Functional Benefits in Section S7.2) to guide the AI on how to apply the rubric’s evaluation criteria. The examples consist of the following:
- (a)
- Concrete responses that illustrate how a student might address the evaluation criterion.
- (b)
- Explanations that justify the score given to each example, highlighting whether and how the response meets the expectations of the criterion.
- Development of a document with a Standardized Output Format for AI Grading Responses: To further enhance stability, a uniform response format was developed for ChatGPT-4o, ensuring that all grading outputs followed a standardized structure. At least two uniform reference examples were used (Section S7.3) and consist of the following elements:
- (a)
- Evaluated Criterion: The name of the evaluation criterion being assessed.
- (b)
- Score: The assigned score for the evaluation criterion.
- (c)
- Observation: A brief description of what the student mentioned in their response.
- (d)
- Justification: An explanation—based on the decision table—of why the assigned score was given.
- (e)
- Final Score: The sum of all partial scores obtained for each evaluated criterion.
- (f)
- Final Comments: A conclusive feedback section provided to the student after assigning the final score. These comments summarize the main strengths of the response, highlight areas that require further development or justification based on the rubric, and offer concrete recommendations for improvement in future responses. The feedback explicitly avoids vague or unrelated suggestions that do not align with the evaluation rubric.
- A prompt was created to explicitly instruct ChatGPT-4o to isolate the evaluation criteria from the instructor’s rubric, prohibiting the use of any additional criteria. The prompt also specifies the output format, clarifies the focal points for grading, and includes basic consistency hyperparameters, as consolidated in Section S7.4.
- After updating both the instructor’s rubric and the AI prompt, ChatGPT-4o was asked to grade the student’s response across 10 different chat sessions. For each session, the input provided includes the prompt (Section S7.4), the question (Section S2), the revised rubric (Section S7.1) with the examples from Section S7.2 for each evaluation criterion, and the output format example (Section S7.3). The results are recorded in a table as indicated in Section S7.5. A structural sketch of these inputs is provided after this list.
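The explicit criterion fields (a)–(d), the scored examples, and the standardized output format can be viewed as typed records. The sketch below shows one possible representation; the field values are hypothetical, and the real content lives in Sections S7.1–S7.4.

```python
# Sketch of the data assembled for each grading session in this step.
# Field values are hypothetical placeholders for the Section S7.1-S7.3 content.
from dataclasses import dataclass, field

@dataclass
class ScoredExample:
    score: float
    response: str        # a concrete (real or simulated) student answer
    explanation: str     # why this answer earns that score

@dataclass
class RubricCriterion:
    name: str                                   # (a) Evaluation Criterion Name
    formal_definition: str                      # (b) what is being evaluated
    evaluated_concepts: list[str]               # (c) elements that determine the score
    possible_scores: list[float]                # (d) permissible score range
    examples: list[ScoredExample] = field(default_factory=list)

@dataclass
class GradedCriterionOutput:                    # standardized per-criterion output (Section S7.3)
    evaluated_criterion: str
    score: float
    observation: str
    justification: str

@dataclass
class GradedReport:                             # full standardized output (Section S7.3)
    per_criterion: list[GradedCriterionOutput]
    final_comments: str = ""

    @property
    def final_score(self) -> float:             # (e) sum of all partial scores
        return sum(item.score for item in self.per_criterion)

functional_benefits = RubricCriterion(
    name="Functional Benefits",
    formal_definition="Whether the answer identifies concrete functional benefits.",
    evaluated_concepts=["benefit named", "benefit linked to the product"],
    possible_scores=[0, 0.5, 1.0],
    examples=[ScoredExample(1.0, "The service saves the user commuting time...",
                            "Names a concrete benefit and links it to the offer.")],
)
```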
2.2.2. Assessing ChatGPT Fairness
Refining the Rubric for Consistent Evaluation Across Multiple Students
- The process is divided into two phases, as is explained in Section 3.1.3. The first phase involves the following:
- Randomly selecting the responses of 10 students from the sample who took Quiz 1 (Section S1).
- Normalizing the instructor’s rubric by keeping the previously applied modifications and, for each evaluation criterion, adding new elements beneath the “Possible Scores” element (item (d) in Section 2.2.1. “Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples”) of the instructor’s modified rubric. A complete example of a normalized rubric for one evaluation criterion is shown in Section S8.1, without the examples from Section S7.2. This decision was made to isolate the effect of normalization, as later analyzed in the results (see Section 4.3). The new elements of the normalization process are as follows:
- (a)
- Problem Concepts: Definitions of key concepts indicating which reasons or specific ideas count as valid justifications for the fourth and subsequent evaluation questions in the decision tables (item (c) below). These are added because the first three questions focus on general aspects (presence, focus, coherence), while the fourth and subsequent questions require more specific justifications (see Section S11, step 4).
- (b)
- Notes: Additional clarifications regarding the evaluation criterion, the evaluated concepts, or the possible scores. These help disambiguate scoring logic and avoid contradictory interpretations, especially in borderline cases (see findings in Section 3.2 “Results for Step Refining the Rubric for Consistent Evaluation Across Multiple Students”).
- (c)
- Decision Table: A four-column table that uses binary yes/no questions to determine compliance with the evaluated concepts and objectively assign scores. Decision tables were introduced to make the scoring logic fully explicit and reproducible across chat sessions (see Section 3.2 “Results for Step Refining the Rubric for Consistent Evaluation Across Multiple Students”); a minimal sketch of such a table is provided after this subsection’s list. The structure is as follows:
- Column 1: Step—A sequential identifier for each stage within the decision table.
- Column 2: Evaluation Question—Binary yes/no questions based on the evaluated concepts, progressing from general to specific criteria to ensure an objective scoring process.
- Column 3: Action—Specifies what action should be taken depending on a “Yes” or “No” response. If “No,” a brief explanation is provided.
- Column 4: Result—Records the assigned score for the evaluation criterion or the conclusion derived from applying the corresponding Action.
- The phrase “…based on the decision tables for score assignment” was added to the prompt, as shown in Section S8.2. This ensures that ChatGPT does not reinterpret the rubric and follows the decision logic strictly, avoiding changes to the evaluation criteria.
- For each student, five chat sessions instead of ten (explanation in Section 3.2 “Results for Step Refining the Rubric for Consistent Evaluation Across Multiple Students”) were conducted with ChatGPT-4o, providing as input the modified prompt (Section S8.2), the question (Section S2), the normalized rubric (Section S8.1 for each evaluation criterion), and the output format example in (Section S7.3).
- The scores for each student across all sessions were recorded and tabulated as shown in Section S9.1. In cases of discrepancies, the causes are analyzed and the examples for each evaluation criterion (Section S7.2) are revised until the desired level of consistency is achieved, as noted in Section S9.2 and subsequently Section S9.3. The following color-coding is applied to these tables:
- Yellow highlighting indicates full consistency across all chat sessions.
- Blue highlighting signifies the presence of two different assigned grades.
- Red highlighting indicates three or more different assigned grades, representing a clear inconsistency.
- The second phase involves the following:
- Adding the responses of the remaining 10 students from the sample who took Quiz 1, thus obtaining a total of 20 students.
- This second phase is further divided into three segments to examine the role of examples in the normalization process and to evaluate the use of decision tables. The aim is to determine whether examples alone, or in conjunction with other components, are necessary to ensure consistent grading.
- (a)
- Grading using the same methodology from the first phase, generating a new table with 5 chat sessions per student as shown in Section S10.1. The results in Section S9.3 were compared with those in Section S10.1. The difference between them stems from an update applied to the ChatGPT-4o model between 12 and 13 February 2025, which affected the outcomes and led to the development of the following two parts (b and c).
- (b)
- Removing all examples from Section S7.2 except for three arbitrarily selected examples for each possible score that are deemed representative, as shown in Section S10.4 for the Functional Benefits evaluation criterion. A new table was created as shown in Section S10.2.
- (c)
- Removing all remaining examples from Section S7.2. Another new table was created, as shown in Section S10.3.
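To make the decision-table idea concrete, the sketch below shows one hypothetical table for the Functional Benefits criterion and a function that walks its yes/no steps in order. The questions, actions, and partial scores are illustrative assumptions; the normalized tables actually used in the study are in Section S13.

```python
# Sketch: a decision table as an ordered list of binary steps, walked top-down.
# Questions, actions, and scores are hypothetical; see Section S13 for the real tables.
DECISION_TABLE = [
    # (step, evaluation question, action if "No", score assigned on "No")
    (1, "Does the answer mention any functional benefit?",
        "Stop and assign 0 points.", 0.0),
    (2, "Is the benefit explicitly linked to the product or service?",
        "Assign 0.5 points (benefit named but not linked).", 0.5),
    (3, "Is the benefit justified with a valid problem concept?",
        "Assign 0.5 points (benefit linked but not justified).", 0.5),
]
FULL_SCORE = 1.0

def apply_decision_table(answers: list[bool]) -> float:
    """answers[i] is the yes/no response to step i+1, in order."""
    for (step, question, action, result), yes in zip(DECISION_TABLE, answers):
        if not yes:
            print(f"Step {step}: No -> {action}")
            return result
    return FULL_SCORE   # every evaluation question answered "Yes"

print(apply_decision_table([True, True, False]))   # -> 0.5
```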
2.2.3. Automating the Protocols: Designing an Algorithmic Approach
Proposed Algorithm
- Four sequential phases were defined to create the algorithm, each building upon the previous. They are outlined in Section S11 and consist of the following:
- (a)
- Phase 1, Definition of Evaluation Criteria: The open-ended question was decomposed into concrete sub-questions to identify the evaluation criteria related to value, benefits, and costs for the user.
- (b)
- Phase 2, Creation of Decision Tables for the Evaluation: Rule-based evaluation structures were created for each criterion (normalization), including binary (Yes/No) decision steps, scoring rules, and detailed justifications, such as the one in Section S13.
- (c)
- Phase 3, Definition of the Output Format: A unified response format was defined, including Score, Observation, and Justification per criterion, followed by Final Score and Final Comments for consistent evaluation, as in Section S7.3.
- (d)
- Phase 4, Creation of the Evaluation Prompt: A precise and constrained prompt was designed for the AI, specifying evaluation logic, excluded variables, response format, and fixed hyperparameters to ensure uniform assessment, as in Section S8.2.
- To test the algorithm, a new chat session is opened in ChatGPT-o1 (see Section 4.5.4 for the explanation of the new model), and the following parameters are provided:
- (a)
- A prompt containing the expected outputs and instructions, as shown in Section S12. More details in Section 3.3 “Results for Step Proposed Algorithm”.
- (b)
- The normalized rubric for Section S1 and all the decision tables are loaded as a file, serving as an example for the AI (Section S13).
- (c)
- Section S7.3 examples are loaded as a file.
- (d)
- The algorithm for creating the automatic correction (Section S11) is loaded as a file.
- After sending the prompt and the documents, the algorithm asks the user to upload the question and to complete the four phases of Section S11, with their respective steps, in sequence. The user decides when to proceed to the next phase. A minimal sketch of this phased, user-confirmed flow is provided after this list.
- Two actions were taken to validate the algorithm:
- (a)
- Verify whether, by loading the same question used in the study (Section S1), it can replicate the decision tables. This process is documented at https://github.com/anonymous-researcher-0/ChatGPT-as-a-stable-and-fair-tool-for-Automated-Essay-Scoring/blob/main/Section%20S16%20Process.pdf (accessed on 8 June 2025), and the result is in Section S16.
- (b)
- Use another open-ended question, such as the one in Section S14, along with its original rubric (Section S15), to generate the four phases of the algorithm. This process is documented at https://github.com/anonymous-researcher-0/ChatGPT-as-a-stable-and-fair-tool-for-Automated-Essay-Scoring/blob/main/Section%20S18%20Process.pdf (accessed on 8 June 2025), and the result for the new question is in Section S18.
- The information obtained through the new prompt, output format, and the normalization process is then applied to a new set of 20 students from the marketing course who answered the new question (Section S14).
- (a)
- A table is created, indicating the student number in the first column and showing 5 chat sessions per student, with the assigned grade recorded and color-coded by category. This can be seen in Section S19, described in Section 3.3 “Results for Step Proposed Algorithm” and analyzed in Section 4.4.
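The phased, instructor-in-the-loop flow described above can be summarized as a simple driver loop that does not advance until the user accepts each deliverable. The sketch below is a simplification under assumptions: the ask() helper is a placeholder for a ChatGPT-o1 session with the Section S12 prompt and uploaded files, and the phase instructions are abbreviated stand-ins for Section S11.

```python
# Sketch: the four-phase, user-confirmed flow of the Section S11 algorithm.
# ask() is a placeholder for the model call; phase instructions are abbreviated.
PHASES = [
    ("Phase 1", "Decompose the open-ended question into evaluation criteria."),
    ("Phase 2", "Build a normalized decision table for each criterion."),
    ("Phase 3", "Define the standardized output format (Section S7.3 style)."),
    ("Phase 4", "Generate the constrained evaluation prompt with fixed hyperparameters."),
]

def ask(instruction: str) -> str:
    # Placeholder: in the study this was a ChatGPT-o1 chat session with
    # Sections S7.3, S11 and S13 uploaded as files.
    return f"<model output for: {instruction}>"

def run_algorithm(question: str) -> dict:
    deliverables = {}
    for name, instruction in PHASES:
        draft = ask(f"{instruction}\n\nQuestion:\n{question}")
        # The user reviews each deliverable and only then moves on.
        while input(f"{name}: accept this deliverable? [y/n] ").lower() != "y":
            feedback = input("Describe the change you want: ")
            draft = ask(f"{instruction}\nRevise according to: {feedback}")
        deliverables[name] = draft
    return deliverables
```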
3. Results
3.1. Assessing ChatGPT Reliability
3.1.1. Results for Step “Consistency in AI-Based Grading Without a Prescribed Rubric” (Section 2.2.1)
- Variability in Evaluation Criteria:
- (a)
- The AI produced multiple distinct evaluation criteria across different chat sessions (Section S4).
- (b)
- Each chat session produced between five and six different evaluation criteria (Section S4), although some consolidated sessions featured five or fewer, as shown in Section S5.1. Section S5.2 explains the criteria used by ChatGPT-4o.
- (c)
- The criteria in Section S5.1 overlapped in their definitions (e.g., “Structure”, “Clarity”, “Analysis”, and “Understanding of the Question”) but were labeled or grouped differently across sessions. This led to inconsistencies and made it necessary to define the criteria explicitly (Section S5.2).
- Variability in Grades:
- (a)
- As Section S5.2 shows, the final assigned grades differed across the 10 chat sessions, suggesting that ChatGPT-4o’s implicit evaluation framework was not stable (different criteria and grades between chat sessions) when prompted repeatedly with no rubric.
- (b)
- In some sessions, for example chat sessions 1 and 8 in Section S5.1, ChatGPT-4o assigned multiple criteria but still arrived at a similar grade (8 points, with “Solid Conclusion” and “Value Proposition” as the only differences); in others, the grade varied even when seemingly similar criteria were referenced, such as chat sessions 5 and 8 in Section S5.1 (8.5 points vs. 8 points using the same evaluation criteria). Section S5.1 was used as a reference instead of Section S4 because of the high variability of criteria in Section S4, which makes it difficult to identify consistent patterns or meaningful similarities.
- Emergence of Non-Teacher-Defined Criteria:
- (a)
- As Section S5.1 shows, ChatGPT-4o often introduced additional elements such as “Originality” or “Value Proposition” (five and six times, respectively).
- (b)
- While some of these elements (“Originality”, “Value Proposition”, etc.) could be conceptually valid from a pedagogical standpoint, their inconsistent appearance across chat sessions, despite the same student response, made it difficult to align them with the original grading intentions (Section S2). This lack of stability undermines the reliability of the criteria, especially when they are not part of the instructor-defined rubric and appear arbitrarily depending on the chat session.
3.1.2. Results for Step “Assessing Reliability in AI-Based Grading Using a Predefined Rubric” (Section 2.2.1)
- Improved Alignment with the Instructor’s Criteria:
- (a)
- When the explicit rubric was provided (Section S2), ChatGPT-4o consistently recognized and applied the criteria outlined in the rubric (e.g., perceived value, functional benefits, psychological benefits, monetary costs, non-monetary costs).
- (b)
- As Section S6.1 shows, all sessions yielded references to the expected instructor-defined evaluation criteria (Section S2).
- Inconsistencies in Final Scores:
- (a)
- Analysis of the documented chat sessions (Section S6.1) showed that while each session consistently included the five rubric criteria, two of the ten sessions introduced non-instructor-defined criteria. This occurred only in chat sessions 4 and 8, which applied more evaluation criteria than the other sessions, namely “Depth and Analysis” and “Clarity”. These additional criteria were not part of the instructor’s rubric but were introduced by ChatGPT-4o during the evaluation process.
- (b)
- The AI introduced an extra criterion in all chat sessions (“Justification” in Section S6.1) and factored it into the final grade. While this may appear similar to “Depth and Analysis” from Section S5.2, the two differ in focus: “Justification” (Section S6.2) assesses whether the student’s ideas are supported with clear reasoning or evidence, whereas “Depth and Analysis” evaluates the complexity, insight, and critical thinking applied in exploring those ideas.
3.1.3. Results for Step “Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples” (Section 2.2.1)
- Rubric Restructuring:
- (a)
- The teacher’s rubric was reorganized to include the following: a Formal Definition, Evaluated Concepts, Possible Scores, and clear evaluation criteria (Section S7.1).
- (b)
- Examples illustrating why a score was assigned were added to guide ChatGPT-4o precisely (Section S7.2). The number of examples per evaluation criterion was not fixed; instead, a trial-and-error approach was used, incorporating both real student responses and simulated ones (examples designed to resemble plausible student answers).
- Standardized Output Format:
- (a)
- A new standardized output structure (Section S7.3) was introduced, detailing how ChatGPT-4o should present the evaluation for each criterion, listing the criterion name, an observation, a justification, a score, a final score, and final comments. This structure allowed the AI to follow a fixed format and avoid including additional elements that could lead to the incorporation of unintended evaluation criteria, as previously observed in Section 3.1.2.
- Refined Prompt:
- (a)
- A new prompt (Section S7.4) was introduced to reinforce which evaluation criteria should be applied and which should be ignored, improving consistency across sessions. Prompt 2 from Section S3 was still included to identify the evaluation criteria used in each chat session.
- (b)
- The hyperparameters used in Section S7.4—{“temperature”: 0.1, “frequency_penalty”: 0.0, “presence_penalty”: −1.0}—helped shape the behavior of the model:
- Temperature (0.1): Controls randomness; a low value makes the responses more focused and deterministic.
- Frequency penalty (0.0): A value of 0.0 applies no penalty to repeated words or phrases.
- Presence penalty (−1.0): Encourages the model to reuse certain concepts or terms, rather than avoiding repetition. An illustrative API call using these settings is provided after this list.
- Perfect Consistency Achieved:
- (a)
- After implementing the newly structured rubric, the same student response was graded across 10 separate chat sessions (Section S7.5). All sessions produced the same final score and applied identical evaluation criteria, demonstrating the rubric’s effectiveness in achieving consistent grading under controlled conditions. Notably, the “Justification” criterion (Section S6.1) emerged as an inherent element, appearing consistently across all sessions.
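The hyperparameters reported above map directly onto standard chat-completion request parameters. The sketch below is an illustration assuming the OpenAI Python SDK rather than the ChatGPT web interface actually used in the study; PROMPT, RUBRIC, and ANSWER are placeholders for the Section S7.4 prompt, the Section S7.1–S7.3 materials, and the student response.

```python
# Sketch: one grading call with the Section S7.4 hyperparameters.
# Prompt, rubric, and answer contents are placeholders.
from openai import OpenAI

client = OpenAI()
PROMPT, RUBRIC, ANSWER = "...", "...", "..."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": PROMPT},           # constrained grading instructions
        {"role": "user", "content": f"Rubric:\n{RUBRIC}\n\nStudent answer:\n{ANSWER}"},
    ],
    temperature=0.1,          # low randomness: focused, near-deterministic output
    frequency_penalty=0.0,    # no penalty on repeated tokens
    presence_penalty=-1.0,    # negative value encourages reusing rubric terminology
)
print(response.choices[0].message.content)
```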
3.2. Assessing ChatGPT Fairness
Results for Step “Refining the Rubric for Consistent Evaluation Across Multiple Students” (Section 2.2.2)
- Initial Testing With 10 Students
- (a)
- Ten students’ responses were selected and graded five times each under the new format. Color-coded tables (Section S9.1) displayed the consistency of results (yellow for perfect alignment: the same grade in all five sessions; blue as moderate: exactly two different grades across the five sessions; red as inconsistent: three or more different grades). Five chat sessions per student were necessary to assess stability, based on findings in Section S6.1, where inconsistencies, when present, tended to appear at least once within the first five sessions. The number of sessions was reduced from ten to five due to practical constraints: running multiple iterations with ChatGPT-4o proved time- and resource-intensive, as OpenAI imposes temporary rate limits on file uploads and prompt executions (which vary depending on global demand). These restrictions frequently caused delays, interrupted uploads, or incomplete prompts, making larger-scale iteration impractical within a limited timeframe. Reducing from 10 to 5 sessions weakens the resolution, stability, and statistical confidence of the evaluation process. It increases the likelihood of false positives for consistency, under-detection of model brittleness, and missed opportunities for rubric improvement. As this reduction was necessary, we used a validation subset, still using 10 sessions to benchmark accuracy loss. Importantly, the analysis showed that student grades remained stable despite using only five sessions. For example, although adjustments were made to improve the evaluation protocol between Sections S6.1 and S9.1, Student 1 ultimately maintained the same final grade (5) across both datasets. This suggests that the protocol adjustments made at this stage, such as clearer evaluation instructions and improved grading structures, helped maintain consistency even with fewer repetitions. Thus, while using five sessions may reduce resolution, the main findings and student grading outcomes remained robust.
- (b)
- Early rounds revealed grading inconsistencies in Section S9.1. While following the same grading protocols that led to success in Section S7.5, some inconsistencies (blue cases, meaning the presence of two different assigned grades in different chat sessions for the same student) still appeared in certain sessions, such as with Student 2 and Student 5. Others, like Student 8 and Student 16, showed even greater inconsistency (red cases, meaning three or more different assigned grades in different chat sessions for the same student). Notably, Student 1, whose response was the same one used in Section S7.5, maintained full stability in scores, as shown in Section S9.1.
- (c)
- Final Decision Tables were developed (Section S13), incorporating additional elements such as Problem Concepts and Notes (normalization process). Examples and Clarifications were also refined iteratively to avoid internal contradictions within or across evaluation criteria. This process led to progressive improvements in grading consistency, as documented in Section S9.2 and ultimately reflected in the outcomes presented in Section S9.3, where all 10 students received consistent grading across five chat sessions each.
- Scaling to 20 Students
- (a)
- After achieving perfect consistency across five chat sessions for an initial group of 10 students (Section S9.3), 10 additional responses were introduced. These new samples had never been iterated on before, allowing for an unbiased analysis of the grading system’s fairness.
- (b)
- It is important to note a discrepancy between the results in Section S9.3 and those of the same 10 students in Section S10.1. As indicated in the methodology (Section 2.2.2 “Refining the Rubric for Consistent Evaluation Across Multiple Students”, second phase), an update applied to the ChatGPT-4o model between 12 and 13 February 2025 impacted how evaluation criteria were interpreted and applied. Specifically, although stability remained consistent for most students, the actual grades assigned changed in several cases. Furthermore, the last two students (14 and 16 in Section S9.3), who previously showed full consistency, exhibited a change in grade patterns (14 and 16 in Section S10.1), indicating a shift in the model’s stability following the update. This suggests that model updates can modify how evaluation criteria are interpreted and applied, producing observable shifts in grading outcomes despite identical prompts and rubrics. This aspect is analyzed in greater depth in Section 4.4.2.
- (c)
- Given the sensitivity of AI model adjustments and their impact on the study’s primary goal (achieving stable grading), it became necessary to evaluate the role of Examples and Clarifications in maintaining stability. Section S10.1 shows 12 out of 20 cases with perfect consistency, 5 with moderate consistency, and 3 with inconsistencies. However, as examples were gradually removed until reaching the case with no examples at all (Section S10.3), consistency declined to 7 perfect cases, 11 moderate, and 2 inconsistent. A sketch of the consistency classification behind these counts is provided after this list.
- Key Factors Influencing Fairness
- (a)
- Thorough normalization and decision tables with yes/no questions based on the evaluation criteria.
- (b)
- Carefully drafted examples illustrating correct and incorrect ways of meeting those criteria.
- (c)
- Prompt instructions explicitly disallowing any supplementary or AI-generated criteria.
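The yellow/blue/red categories used throughout Sections S9.1 and S10.1–S10.3 reduce to counting the distinct grades a student received across the five sessions. The sketch below shows that classification logic with hypothetical grade data.

```python
# Sketch: classify per-student consistency from five session grades,
# following the color-coding rules (yellow / blue / red). Grades are hypothetical.
def consistency_category(grades: list[float]) -> str:
    distinct = len(set(grades))
    if distinct == 1:
        return "yellow (full consistency)"
    if distinct == 2:
        return "blue (moderate: two different grades)"
    return "red (inconsistent: three or more different grades)"

sessions_per_student = {
    "Student 1": [5, 5, 5, 5, 5],
    "Student 2": [6, 6, 5.5, 6, 6],
    "Student 8": [4, 5, 4.5, 4, 5.5],
}
for student, grades in sessions_per_student.items():
    print(student, "->", consistency_category(grades))
```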
3.3. Automating the Protocols: Designing an Algorithmic Approach
Results for Step “Proposed Algorithm” (Section 2.2.3)
- Algorithm Development and Components:
- (a)
- The algorithm was structured into four sequential phases (Definition of Evaluation Criteria, Construction of Decision Tables, Standardization of Output Format, Creation of the Evaluation Prompt).
- (b)
- The prompt in Section S12 ensures that the AI generates four deliverables (a list of evaluation criteria, all decision tables for each criterion, an evaluation structure, and a ready-to-use prompt), allowing the user to review and confirm the criteria and elements created. The process does not advance to the next phase until the user confirms that no further changes are needed in the current one. It also allows the user to add or specify details, as was the case with incorporating Section S18 to replicate the instructor’s intended structure.
- Verification With Original Question:
- (a)
- To confirm the algorithm’s correctness, ChatGPT-o1 (instead of ChatGPT-4o) was first supplied with the relevant documents (Sections S7.3, S11 and S13), and the same marketing question from Section S1 was used.
- (b)
- The algorithm successfully replicated (Section S16) the Formal Definitions, Evaluated Concepts, Possible Scores, Problem Concepts, Notes, and Decision Tables in a comparable manner, maintaining the original structure and evaluation logic and using equivalent but not identical wording relative to the reference materials in Section S13. Section S17 was used as a supporting guideline during the process, and in the algorithm’s first phase the same evaluation criteria were correctly detected, as shown in the documented process (https://github.com/anonymous-researcher-0/ChatGPT-as-a-stable-and-fair-tool-for-Automated-Essay-Scoring/blob/main/Section%20S16%20Process.pdf, accessed on 8 June 2025).
- Testing With a New Question:
- (a)
- Next, a different open-ended question (Section S14) and its original rubric (Section S15) were uploaded to test whether the algorithm could generalize its approach by following the same procedure and using the relevant documents. It is important to note that the rubric in Section S15 had a less structured and less detailed format—it did not separate components such as Formal Definition, Evaluated Concepts, Possible Scores, or Evaluation Rules, which are required by the algorithm to build a systematic and replicable evaluation process. In contrast, Section S17 served as a model of the expected structure, providing a clear and complete example of how these elements should be organized. This contrast allowed us to observe whether the algorithm could still extract and reorganize the relevant evaluation logic from a less structured rubric, which, as the results show, it successfully did.
- (b)
- The algorithm generated a new set of evaluation criteria and decision tables (Section S18) that aligned with the distinct requirements of the question in Section S14.
- Application to a New Group of 20 Students:
- (a)
- With the revised rubric and decision tables created automatically by the algorithm, the question from Section S14 was administered to a new group of 20 students in the marketing course.
- (b)
- Each student’s response was assessed via five chat sessions (mirroring previous methodology). A preliminary data table is presented in Section S19, showing a high degree of fairness, defined as consistency in assigned grades and evaluation criteria across sessions, with 11 perfect cases, 8 moderate, and 1 inconsistent.
4. Discussion
4.1. Discussion for Step 3.1.1: Consistency in AI-Based Grading Without a Prescribed Rubric
4.1.1. Emergence of Implicit and Unregulated Evaluation Criteria:
4.1.2. Variability in Scoring Outcomes:
4.1.3. Misalignment with Educational Standards:
4.1.4. Lessons Learned:
4.2. Discussion for Step 3.1.2: Stability in AI-Based Grading Using a Predefined Rubric
4.2.1. Improved Alignment with Instructor Expectations:
4.2.2. Persistence of Non-Rubric Criteria:
4.2.3. Lessons Learned:
4.3. Discussion for Step 3.1.3: Assessing ChatGPT Reliability with a Predefined Explicit Rubric and Examples
4.3.1. Structuring the Rubric to Constrain Interpretation
4.3.2. Examples as Calibration Anchors
4.3.3. Standardization of Output and Prompt Design
4.3.4. Trade-Off Between Stability and Preparation Effort
4.3.5. Lessons Learned
4.4. Discussion for Step “Results for Step Refining the Rubric for Consistent Evaluation Across Multiple Students” (Section 3.2)
4.4.1. Variability in Scaling Across Diverse Responses:
4.4.2. Sensitivity to Model Updates and Example Reduction:
- (a)
- Model Updates: A system update to ChatGPT-4o between iterations introduced subtle changes in how evaluation criteria were applied, even when instructions and inputs remained constant. Although the version of ChatGPT used during this phase of the study remained the same, an update to the model’s operational parameters, rather than its pre-trained weights, may account for the differences observed in the results following this update. The changes associated with GPT-4o stem from the fact that this model replaced GPT-4 as the base model for ChatGPT, and several of its features were in a testing phase during February 2025. These features were later consolidated in the latest release of GPT-4o, which introduced modifications to the default values of internal operational parameters. These adjustments enhanced the model’s ability to follow instructions and improved formatting precision, alongside changes aimed at better interpreting the implied intent behind user prompts—referred to by OpenAI as fuzzy improvements (OpenAI, n.d.). This demonstrates that versioning must be tracked and tested whenever AI tools are used in longitudinal educational assessments. Longitudinal evaluations have shown that GPT-4 performance can drift across releases; for example, its accuracy in identifying prime numbers dropped from 84% (March 2023) to 51% (June 2023), with chain-of-thought compliance decreasing accordingly (L. Chen et al., 2023). Such shifts threaten the reproducibility of automated grading pipelines. To contain this risk, it could be useful to (i) log the exact model identifier and timestamp for each evaluation instance, (ii) lock inference hyper-parameters (e.g., temperature = 0.1, presence-penalty = −1.0) to minimize stochasticity, and (iii) rerun a fixed calibration set after any detected version change to quantify drift. A minimal logging sketch along these lines is provided after this list.
- (b)
- Example Removal: The gradual removal of illustrative examples (Section S7.2) reduced the effort involved in rubric deployment but negatively impacted the rate of fully consistent evaluations. As shown in Sections S10.1–S10.3, the absence of examples correlated with a drop in perfect consistency. However, this effect depends on how well the examples are constructed; Section S10.2 shows that including only three selected examples was actually less effective than providing none at all, as in Section S10.3 (Hmoud et al., 2024).
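To make the drift-containment suggestions in point (a) concrete, the sketch below shows one way the model identifier, timestamp, and hyperparameters could be logged per evaluation and how a fixed calibration set could be re-scored to quantify drift. The field names, file format, and tolerance threshold are illustrative assumptions, not part of the study’s protocol.

```python
# Sketch: log model identity per evaluation and re-score a fixed calibration
# set after a suspected version change. Names, fields, and tolerance are illustrative.
import json
from datetime import datetime, timezone

def log_evaluation(model_id: str, student_id: str, score: float,
                   path: str = "evaluation_log.jsonl") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,               # exact model identifier in use
        "hyperparameters": {"temperature": 0.1, "presence_penalty": -1.0},
        "student_id": student_id,
        "score": score,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def drift(calibration_before: dict[str, float], calibration_after: dict[str, float],
          tolerance: float = 0.0) -> list[str]:
    """Return calibration items whose score changed by more than the tolerance."""
    return [item for item, old in calibration_before.items()
            if abs(calibration_after[item] - old) > tolerance]
```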
4.4.3. Normalization and Decision Tables as a Scalable Alternative to Fine-Tuning
4.4.4. Operational Maintenance and Model Drift
4.4.5. Lessons Learned
4.5. Discussion for Step “Results for Step Proposed Algorithm” (Section 3.3)
4.5.1. Automating Rubric Construction for Consistency
4.5.2. Equivalent Outcomes Using Structurally Aligned Versus Original Rubrics
4.5.3. The Algorithm Accurately Reproduces Evaluative Logic
4.5.4. Enhanced Reasoning Capabilities
4.5.5. Lessons Learned
5. Conclusions
- (a)
- Automatic detection of evaluation criteria from any open-ended question.
- (b)
- Creation of fully explicit normalization artefacts—formal definitions, evaluated concepts, possible scores, problem concepts, notes, and binary decision tables.
- (c)
- Construction of a standardized output format and a tightly constrained prompt with fixed hyper-parameters.
- (d)
- An instructor-in-the-loop workflow that delivers high stability and fairness without fine-tuning, keeping the need for technical knowledge and extra resources low.
- (a)
- Offers a prompt-engineering, model-agnostic pathway that achieves stable and fair automated essay scoring, requiring only limited technical or computational resources.
- (b)
- Provides a scalable algorithm that automatically builds normalized rubrics and decision tables for new questions, widening the reach of reliable assessment.
- (c)
- Produces transparent, explainable, standardized feedback that supports hybrid human-AI workflows and lowers the technical barrier to adoption in educational settings.
- (a)
- Avoid using AI tools for grading without a clear structure. AI should not be treated as a standalone evaluator; it must operate within a carefully designed and pedagogically sound framework.
- (b)
- Translate rubrics into explicit decision rules. Converting evaluation criteria into a structured yes/no logic enhances transparency and improves consistency across assessments.
- (c)
- Standardize AI configurations. To ensure reliable results, prompts, output formats, and temperature settings should be standardized and consistently applied.
- (d)
- Use AI-generated feedback as a learning tool. Share the underlying evaluation logic with students to foster metacognitive awareness and support self-assessment.
6. Declaration of Generative AI and AI-Assisted Technologies in the Writing Process
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Almusharraf, N., & Alotaibi, H. (2023). An error-analysis study from an EFL writing context: Human and automated essay scoring approaches. Technology, Knowledge and Learning, 28(3), 1015–1031. [Google Scholar] [CrossRef]
- Anchiêta, R. T., de Sousa, R. F., & Moura, R. S. (2024, October 17). A robustness analysis of automated essay scoring methods. Anais do Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL), Belém, Brazil. [Google Scholar] [CrossRef]
- Atkinson, J., & Palma, D. (2025). An LLM-based hybrid approach for enhanced automated essay scoring. Scientific Reports, 15(1), 5623. [Google Scholar] [CrossRef] [PubMed]
- Bernard, R., Raza, S., Das, S., & Murugan, R. (2024). EQUATOR: A deterministic framework for evaluating LLM reasoning with open-ended questions. arXiv. [Google Scholar] [CrossRef]
- Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., & Hoefler, T. (2024, February 20–27). Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada. [Google Scholar] [CrossRef]
- Brown, G. T. (2022). The past, present and future of educational assessment: A transdisciplinary perspective. Frontiers in Education, 7, 1060633. [Google Scholar] [CrossRef]
- Bui, N. M., & Barrot, J. S. (2024). ChatGPT as an automated essay-scoring tool in the writing classroom: How it compares with human scoring. Education and Information Technologies, 30, 2041–2058. [Google Scholar] [CrossRef]
- Chen, L., Zaharia, M., & Zou, J. (2023). How is ChatGPT’s behavior changing over time? arXiv. [Google Scholar] [CrossRef]
- Chen, Y., & Li, X. (2023). PMAES: Prompt-mapping contrastive learning for cross-prompt automated essay scoring. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st annual meeting of the association for computational linguistics (Vol. 1, pp. 1489–1503). Association for Computational Linguistics. [Google Scholar]
- Golchin, S., Garuda, N., Impey, C., & Wenger, M. (2024). Grading massive open online courses using large language models. arXiv. [Google Scholar] [CrossRef]
- Hmoud, M., Swaity, H., Anjass, E., & Aguaded-Ramírez, E. M. (2024). Rubric development and validation for assessing tasks’ solving via AI chatbots. Electronic Journal of e-Learning, 22(6), 1–17. Available online: https://files.eric.ed.gov/fulltext/EJ1434299.pdf (accessed on 30 January 2025). [CrossRef]
- Ilemobayo, J. A., Durodola, O., Alade, O., Awotunde, O. J., Olanrewaju, A. T., Falana, O., Ogungbire, A., Osinuga, A., Ogunbiyi, D., Ifeanyi, A., Odezuligbo, I. E., & Edu, O. E. (2024). Hyperparameter tuning in machine learning: A comprehensive review. Journal of Engineering Research And Reports, 26(6), 388–395. [Google Scholar] [CrossRef]
- Kaldaras, L., Yoshida, N. R., & Haudek, K. C. (2022). Rubric development for AI-enabled scoring of three-dimensional constructed-response assessment aligned to NGSS learning progression. Frontiers in Education, 7, 983055. [Google Scholar] [CrossRef]
- Li, R., Wang, Y., Wen, Z., Cui, M., & Miao, Q. (2025). Different paths to the same destination: Diversifying LLM generation for multi-hop open-domain question answering. Knowledge-Based Systems, 309, 112789. [Google Scholar] [CrossRef]
- Li, S., & Ng, V. (2024a, November 12–16). Automated essay scoring: A reflection on the state of the art. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 17876–17888), Miami, FL, USA. [Google Scholar]
- Li, S., & Ng, V. (2024b, August 3–9). Automated essay scoring: Recent successes and future directions. Proceedings of the 33rd International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea. [Google Scholar]
- Ling, J. H. (2024). A review of rubrics in education: Potential and challenges. Pedagogy: Indonesian Journal of Teaching and Learning Research, 2(1), 1–14. Available online: https://ejournal.aecindonesia.org/index.php/pedagogy/article/view/199 (accessed on 30 January 2025). [CrossRef]
- Lo, C. K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13(4), 410. [Google Scholar] [CrossRef]
- Long, J. (2023). Large language model guided tree-of-thought. arXiv. [Google Scholar] [CrossRef]
- Lou, R., Zhang, K., & Yin, W. (2024). Large language model instruction following: A survey of progresses and challenges. Computational Linguistics, 50(3), 1053–1095. [Google Scholar] [CrossRef]
- Mangili, F., Adorni, G., Piatti, A., Bonesana, C., & Antonucci, A. (2022). Modelling assessment rubrics through bayesian networks: A pragmatic approach. arXiv. [Google Scholar] [CrossRef]
- Mayfield, E., & Black, A. W. (2020, July 10). Should you fine-tune BERT for automated essay scoring? The 15th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 151–160), Seattle, WA, USA. Available online: https://aclanthology.org/2020.bea-1.16 (accessed on 30 January 2025).
- Miao, T., & Xu, D. (2025). KWM-B: Key-information weighting methods at multiple scale for automated essay scoring with BERT. Electronics, 14(1), 155. [Google Scholar] [CrossRef]
- OpenAI. (n.d.). ChatGPT release notes. OpenAI Help Center. Available online: https://help.openai.com/en/articles/6825453-chatgpt-release-notes (accessed on 25 February 2025).
- OpenAI. (2024, December 5). GPT-o1 system card [Technical report]. Available online: https://arxiv.org/pdf/2412.16720 (accessed on 30 January 2025).
- Ouyang, F., Dinh, T. A., & Xu, W. (2023). A systematic review of AI-driven educational assessment in STEM education. Journal for STEM Education Research, 6(3), 408–426. [Google Scholar] [CrossRef]
- Stahl, M., Biermann, L., Nehring, A., & Wachsmuth, H. (2024). Exploring LLM prompting strategies for joint essay scoring and feedback generation. arXiv. [Google Scholar] [CrossRef]
- Tang, X., Lin, D., & Li, K. (2024). Enhancing automated essay scoring with GCNs and multi-level features for robust multidimensional assessments. Linguistics Vanguard, 10(1). [Google Scholar] [CrossRef]
- Tate, T. P., Steiss, J., Bailey, D., Graham, S., Moon, Y., Ritchie, D., Tseng, W., & Warschauer, M. (2024). Can AI provide useful holistic essay scoring? Computers & Education: Artificial Intelligence, 5, 100255. [Google Scholar] [CrossRef]
- Triem, H., & Ding, Y. (2024). Tipping the balance: Human intervention in large language model multi-agent debate. Proceedings of the Association for Information Science and Technology, 61(1), 1034. [Google Scholar] [CrossRef]
- Uyar, A. C., & Büyükahıska, D. (2025). Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT. International Journal of Assessment Tools in Education, 12(1), 20–32. [Google Scholar] [CrossRef]
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023, May 1–5). Self-consistency improves chain of thought reasoning in language models. The International Conference on Learning Representations, Kigali, Rwanda. [Google Scholar]
- Wang, Y., Hu, R., & Zhao, Z. (2024). Beyond agreement: Diagnosing the rationale alignment of automated essay-scoring methods based on linguistically informed counterfactuals. arXiv. [Google Scholar] [CrossRef]
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. [Google Scholar]
- Wu, X., Saraf, P. P., Lee, G., Latif, E., Liu, N., & Zhai, X. (2024). Unveiling scoring processes: Dissecting the differences between LLMs and human graders in automatic scoring. arXiv. [Google Scholar] [CrossRef]
- Wu, Z., Jiang, M., & Shen, C. (2024, February 20–27). Get an A in math: Progressive rectification prompting. Thirty-Eighth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada. [Google Scholar] [CrossRef]
- Xia, Y., Wang, R., Liu, X., Li, M., Yu, T., Chen, X., McAuley, J., & Li, S. (2025, January 19–24). Beyond chain-of-thought: A survey of chain-of-X paradigms for LLMs. 31st International Conference on Computational Linguistics (pp. 10795–10809), Abu Dhabi, United Arab Emirates. [Google Scholar]
- Xiao, C., Ma, W., Song, Q., Xu, S. X., Zhang, K., Wang, Y., & Fu, Q. (2025, March 3–7). Human-AI collaborative essay scoring: A dual-process framework with LLMs. 15th International Learning Analytics and Knowledge Conference (LAK 2025), Dublin, Ireland. [Google Scholar]
- Xie, W., Niu, J., Xue, C. J., & Guan, N. (2024). Grade like a human: Rethinking automated assessment with large language models. arXiv. [Google Scholar] [CrossRef]
- Yamtinah, S., Wiyarsi, A., Widarti, H. R., Shidiq, A. S., & Ramadhani, D. G. (2025). Fine-tuning AI Models for enhanced consistency and precision in chemistry educational assessments. Computers And Education Artificial Intelligence, 8, 100399. [Google Scholar] [CrossRef]
- Yancey, K. P., Laflair, G., Verardi, A., & Burstein, J. (2023, July 13). Rating short L2 essays on the CEFR scale with GPT-4. 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023) (pp. 576–584), Toronto, ON, Canada. [Google Scholar]
- Yang, Y., Kim, M., Rondinelli, M., & Shao, K. (2025). Pensieve grader: An AI-powered, ready-to-use platform for effortless handwritten STEM grading. arXiv. [Google Scholar] [CrossRef]
- Yeadon, W., & Hardy, T. (2024). The impact of AI in physics education: A comprehensive review from GCSE to university levels. Physics Education, 59(2), 025010. [Google Scholar] [CrossRef]
- Yeung, C., Yu, J., Cheung, K. C., Wong, T. W., Chan, C. M., Wong, K. C., & Fujii, K. (2025). A zero-shot LLM framework for automatic assignment grading in higher education. arXiv. [Google Scholar] [CrossRef]
- Yigiter, M. S., & Boduroğlu, E. (2025). Examining the performance of artificial intelligence in scoring students’ handwritten responses to open-ended items. TED EĞİTİM VE BİLİM, 50, 1–18. [Google Scholar] [CrossRef]
- Yun, J. (2023). Meta-analysis of inter-rater agreement and discrepancy between human and automated English essay scoring. English Teaching, 78(3), 105–124. [Google Scholar] [CrossRef]
- Zhang, C., Deng, J., Dong, X., Zhao, H., Liu, K., & Cui, C. (2025). Pairwise dual-level alignment for cross-prompt automated essay scoring. Expert Systems with Applications, 125, 125924. [Google Scholar] [CrossRef]
- Zhang, W., Shen, Y., Wu, L., Peng, Q., Wang, J., Zhuang, Y., & Lu, W. (2024). Self-contrast: Better reflection through inconsistent solving perspectives. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd annual meeting of the association for computational linguistics (Vol. 1, pp. 3602–3622). Association for Computational Linguistics. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).