GradeAgentOps: A Verification-First Framework for Evidence-Anchored LLM Exam Grading

Anghel, Catalin; Anghel, Andreea Alexandra; Craciun, Marian Viorel; Cocu, Adina; Vulpe, Diana-Elena; Andrei, Constantin Adrian; Maier, Calina; Scheau, Cristian; Dragosloveanu, Serban; Cergan, Romica

doi:10.3390/ai7060198



by Catalin Anghel ¹, Andreea Alexandra Anghel ², Marian Viorel Craciun ¹, Adina Cocu ¹,*, Diana-Elena Vulpe ³,⁴, Constantin Adrian Andrei ³,⁴, Calina Maier ³,⁵, Cristian Scheau ³,⁶, Serban Dragosloveanu ³,⁴ and Romica Cergan ³,⁶

¹ Department of Computer Science and Information Technology, “Dunărea de Jos” University of Galati, 800146 Galati, Romania

² Faculty of Automation, Computer Science, Electrical and Electronic Engineering, “Dunărea de Jos” University of Galati, 800008 Galati, Romania

³ Faculty of Medicine, The “Carol Davila” University of Medicine and Pharmacy, 050474 Bucharest, Romania

⁴ Department of Orthopaedics, “Foisor” Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania

⁵ Panait Sirbu Obstetrics and Gynaecology Hospital Bucharest, 060251 Bucharest, Romania

⁶ Department of Radiology and Medical Imaging, “Foisor” Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania

* Author to whom correspondence should be addressed.

AI 2026, 7(6), 198; https://doi.org/10.3390/ai7060198

Submission received: 27 April 2026 / Revised: 18 May 2026 / Accepted: 28 May 2026 / Published: 29 May 2026


Abstract

Large Language Models (LLMs) are increasingly used for rubric-based assessment, but reliable automated grading requires more than a single prompt-response step. This study presents GradeAgentOps, a verification-first framework for evidence-anchored LLM exam grading that combines strict grading contracts, deterministic verification and canonicalization, bounded semantic repair, optional memory modules, and provenance-aware logging. The framework is evaluated on a university-level dataset of 1000 short open-ended exam responses from 100 students across 10 questions, with annotations from two independent expert human graders. A controlled ablation protocol compares six configurations, including a rubric-only baseline and progressively stronger variants with repair and memory augmentation. Human–human agreement provides the reference context, with overall ICC(2,1) = 0.678 and QWK = 0.678 between the two graders. Using one expert grader as the operational reference, the FULL configuration achieves the strongest model-human agreement (MAE = 1.935, RMSE = 2.500, QWK = 0.652, Within ±2 = 0.667), with the consistency memory configuration with repair (C1) emerging as the closest alternative. The gains are not uniform: improvements are more pronounced on argumentative items than on technical ones, and pairwise comparisons show that consistency memory contributes more strongly than rubric memory alone. The stronger configurations also produce cleaner accepted outputs, with lower rates of evidence-related postprocess issues, although at a moderate operational cost. Overall, within this controlled single-course evaluation setting, the results show that reliable automated grading benefits not only from model capability, but also from pipeline design choices that promote verification, evidential coherence, and stable grading behavior.

Keywords: automated grading; large language models; agentic AI; evidence-anchored assessment; verification-first framework; semantic repair; provenance graphs; rubric-based evaluation; short open-ended responses; educational assessment

1. Introduction

Large Language Models (LLMs) are increasingly adopted for grading short, open-ended exam responses at scale [1]. However, real-world deployment requires more than producing a plausible score: grading outputs must follow a strict scoring contract, justify awarded rubric points with evidence grounded in the student’s answer, and remain auditable under re-evaluation [2]. This paper presents GradeAgentOps, a verification-first pipeline that enforces a rubric-aligned JSON (JavaScript Object Notation) [3] contract, applies deterministic validation of evidence and point coverage, and triggers targeted semantic repair when verification fails. All prompts, model outputs, repair attempts, and timing metadata are recorded in Neo4j Desktop v2.1.4 [4] to enable reproducible experiments and systematic error analysis. We evaluate GradeAgentOps on a 1000-answer exam dataset through ablations that isolate the effects of semantic repair, rubric memory, and consistency memory.

1.1. Background and Motivation

Rubric-based assessment is a long-standing instrument for improving scoring transparency and supporting consistent evaluation, and meta-analytic evidence has quantified its effects on academic performance and related learning outcomes [5]. Building on this foundation, recent work has examined LLM-based automatic short answer grading in higher-education settings, reporting both opportunities and practical constraints when applying generative models to criterion-driven scoring [6]. Importantly, recent criterion-based grading studies show that agreement with human scores and consistency across evaluations can diverge, which complicates calibration and the interpretation of automated grades even when criteria are explicit [7]. Complementary reliability analyses further indicate that LLM evaluators may omit crucial criteria or introduce unnecessary criteria, leading to systematic deviations from expert expectations under complex evaluation requirements [2].

These empirical observations highlight several risks that are especially relevant for rubric-driven grading pipelines. First, criterion drift can occur when an evaluator operationalizes the task using a shifted or incomplete set of criteria, which is consistent with evidence that LLM evaluators may add or omit evaluation dimensions relative to expert judgments [2]. Second, comparative grading and ranking setups introduce additional instability: position bias has been documented in LLM-based evaluators and can distort pairwise judgments, with downstream consequences for score stability [8]. Third, grading justifications are only useful if they are faithful and well grounded; however, the faithfulness of model explanations remains a recognized challenge in Natural Language Processing (NLP), which motivates explicit mechanisms for checking and constraining explanatory evidence in high-stakes decision settings such as assessment [9].

From a systems perspective, these issues motivate grading pipelines that produce checkable scoring artifacts and support rigorous auditing. In particular, reproducible deployment benefits from capturing the full decision process—inputs, intermediate outputs, and evaluation metadata—in a structured provenance record that enables post hoc analysis and controlled experimentation. End-to-end provenance capture has been proposed for machine learning pipelines precisely to support reproducibility, debugging, and accountability across runs and evolving system components [10]. Related AI-based decision-support studies in other operational domains also emphasize that predictive performance must be coupled with robust pipeline design when model outputs inform real-world decisions [11].

1.2. Related Work

Recent work shows that LLM-powered automated assessment has rapidly diversified across task types (e.g., short answers, essays, and domain-specific assessments), while repeatedly flagging reliability and validation as central deployment constraints [12]. In short-answer grading, empirical studies in higher-education settings have evaluated LLM-generated grades against instructor judgments and analyzed factors affecting agreement and consistency [6]. Classroom deployments have also reported end-to-end workflows where LLMs grade text-based assignment items and provide feedback in real course settings [1]. At larger scale, work on Massive Open Online Courses (MOOCs) has examined whether LLMs can replace peer grading by combining rubric- and key-based prompting strategies to improve alignment with instructor scores [13]. Beyond “natural language only” responses, LLMs have been applied to grading short answers in software engineering courses, including approaches that combine embeddings with LLM completions to broaden acceptable answer variants [14]. LLM-assisted assessment has additionally been explored for programming assignments, where instructor-aided pipelines integrate multiple models to streamline grading while aiming to reduce time cost and inconsistency [15].

A second body of work focuses on how LLM grades should be interpreted and validated under criterion-driven scoring. Criterion-based grading studies show that human-score agreement and evaluator consistency can diverge, implying that prompt/rubric designs must be evaluated with metrics that separate these dimensions [7]. In large-scale writing assessment, psychometric frameworks such as generalizability theory and many-facet Rasch modeling have been used to compare human and LLM raters and to characterize severity and stability differences across scoring regimes [16]. Related rubric-guided evaluation research proposes multidimensional, calibrated prompting schemes where rubric questions define evaluation dimensions and judge-aware modeling is used to better match human annotation patterns [17]. Reliability analyses of LLM evaluators further report that judge models may omit crucial criteria or introduce unnecessary criteria, producing subtle but systematic deviations from expert expectations on complex requirements [2].

LLM-as-a-judge research also documents that evaluator outputs can be systematically biased, motivating both diagnostics and debiasing strategies. A dedicated benchmark for cognitive biases in LLM evaluators reports multiple bias types, including egocentric bias, raising concerns about robustness when LLMs are used as general-purpose judges [18]. Complementing this, a large-scale empirical study of position bias evaluates pairwise and list-wise judging setups and introduces bias metrics (e.g., position consistency and preference fairness) to characterize when order effects compromise reliability [19]. Recent work presented at the Annual Meeting of the Association for Computational Linguistics (ACL) proposes strengthening judge reasoning by injecting additional ‘crowd’ comparison responses, aiming to produce more comprehensive judgments and improve evaluation reliability beyond majority voting or criteria expansion alone [20].

Finally, reproducible and auditable assessment pipelines overlap with broader systems work on evaluator consistency and provenance capture. In education-specific settings, LLMs-as-evaluators have been experimentally studied for feedback consistency and inter-model agreement, emphasizing that evaluator selection and prompting materially influence reliability [21]. From a pipeline perspective, provenance models have been proposed to capture end-to-end traces of Machine Learning (ML) workflows (including artifacts and relationships) to enable querying, debugging, and repeatable experimentation [10]. Complementary provenance systems for deep-learning workflows argue for database-backed provenance graphs (rather than ad hoc logs) to support traceability across data preparation, training, and evaluation with low runtime overhead [22].

1.3. Research Gap and Contributions

Despite rapid progress in LLM-based grading and evaluation, the literature still leaves a practical gap between reporting score-level performance and building deployable exam-grading systems that can be audited and debugged at the scale of a full exam dataset (in this study, 100 students, 10 questions, and 1000 graded responses). A recurring issue is that LLM-based evaluators can conflate or misapply evaluation criteria, which undermines reliability even when criteria are explicitly stated. This makes it difficult to interpret improvements reported in isolation (e.g., a better prompt) without an accompanying mechanism that can detect and localize failures in a structured way [23].

A second gap concerns what a grading system should do when verification fails. In practice, many pipelines either accept imperfect outputs or regenerate the entire evaluation, which can introduce additional variance and obscure the root cause of errors. However, research on LLM self-correction shows that refinement is not uniformly reliable across tasks and conditions; success depends on the availability of trustworthy feedback signals and well-defined correction targets. This motivates a controlled repair policy that is triggered by specific, machine-detectable violations rather than unconstrained regeneration [24].

A third gap is the lack of a standard way to guarantee strict output compliance for integration into software systems. Prompting alone is often insufficient to enforce formal constraints (e.g., grammar- or schema-level requirements) in a robust manner, and recent work on constrained decoding emphasizes that enforcing strict structural constraints during generation is both feasible and beneficial for structured outputs. Moreover, as grading pipelines become agentic and multi-step, they resemble structured language model programs with multiple generation calls and structured I/O, making the system aspect (execution, tracing, and controllable decoding) an integral part of the solution [25,26].

Finally, while provenance has been widely recognized as essential for reproducibility and accountability in machine learning workflows, exam grading introduces an acute need for run-level traceability: beyond the final score, the system must expose the sequence of prompts, model outputs, retries, and timing that produced the decision. Provenance capture frameworks and practical tooling for ML workflows support exactly this form of reproducible experimentation and systematic debugging, but they are rarely integrated as first-class components in exam-grading systems [27].

GradeAgentOps is a verification-first framework for evidence-anchored LLM exam grading. Its novelty lies in operationalizing grading as a structured, checkable, and auditable decision process rather than a single prompt-response interaction. The framework integrates strict rubric-aligned output contracts, deterministic verification and canonicalization, explicit evidence-grounding checks, verifier-triggered bounded repair, optional memory-based calibration, and provenance-aware logging into a grading-specific control architecture. Each grading decision is represented as a machine-checkable artifact whose score, rubric-point coverage, supporting evidence, validation status, and repair history can be inspected and reproduced. This design makes GradeAgentOps a systems-level contribution for reliable educational assessment, with the ablation protocol isolating how repair and memory components affect agreement, evidence validity, output compliance, and operational cost. In this context, GradeAgentOps contributes a verification-first architecture for reliable LLM-based exam grading through the following elements:

First, we define a strict rubric-aligned JSON contract that represents grades as checkable artifacts, including integer-only scoring, explicit covered/missed rubric-point partitions, and evidence fields for covered points.

Second, we introduce a deterministic verifier that enforces schema and range constraints, recomputes totals from subscores, and canonicalizes rubric-point partitions into a stable representation, producing an explicit taxonomy of structural vs. semantic failures.

Third, we operationalize evidence faithfulness as an explicit obligation by automatically checking evidence spans against the student response and emitting structured postprocess signals when evidence is invalid.

Fourth, we propose a two-model repair policy triggered only by verifier-detected violations, with a dedicated semantic repair budget that corrects offending fields while preserving already-valid parts of the artifact to reduce regeneration variance.

Fifth, we integrate two optional, independently ablated modules: rubric memory that selects a compact subset of rubric content for each response, and consistency memory that retrieves prior same-question exemplars to stabilize grading.

Sixth, we record prompts, model outputs, repair attempts, and timing metadata in Neo4j, enabling auditability, reproducible experimentation, and fine-grained analyses of failures and cost.

Seventh, we evaluate on a 1000-answer university exam dataset with instructor rubrics and human scoring, using the ablation suite (B0, R1, C1, M1, M2, FULL) and reporting compliance, evidence validity, human agreement, and computational cost.

2. Materials and Methods

This section describes the dataset, grading rubric artifacts, and experimental protocol used to evaluate GradeAgentOps. We then detail the GradeAgentOps pipeline, including the strict grading contract, deterministic verification and canonicalization procedures, the targeted semantic repair policy, optional memory modules, and the Neo4j-based provenance logging layer that enables reproducible ablations and fine-grained error analysis.

2.1. Dataset and Human Reference Scores

We evaluate GradeAgentOps on a university-level exam dataset comprising 100 students answering 10 short, open-ended questions, resulting in 1000 student responses. The dataset contains both technical and argumentative items. Each instance includes the question text and a free-form student answer, together with instructor-provided grading guidance and human scoring annotations from two independent expert human graders (E1 and E2).

The data are stored in three sources. The first is a JSON file containing the student answers and the instructor-authored rubric artifacts for each question, while the second and third are Comma-Separated Values (CSV) files containing human grading annotations from the two expert graders. These sources are merged deterministically by student and question identifiers to form a single evaluation table used across all experiments, ensuring that all ablations operate on identical inputs and human reference scores.
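The deterministic merge described above can be illustrated with a minimal sketch. The JSON layout (`answers` list) and the column names (`student_id`, `question_id`, `final_score`) are hypothetical, since the file schemas are not specified here; only the join on student and question identifiers comes from the text.

```python
import csv
import io
import json


def load_scores(csv_text):
    """Map (student_id, question_id) -> final score from one grader's CSV text.

    Column names are assumptions made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (row["student_id"], row["question_id"]): int(row["final_score"])
        for row in reader
    }


def merge_sources(answers_json, e1_csv, e2_csv):
    """Join answers with both graders' scores on (student_id, question_id).

    Rows are sorted by key so the resulting table is identical across runs,
    and a missing human score raises instead of being silently dropped.
    """
    answers = json.loads(answers_json)["answers"]  # hypothetical layout
    e1, e2 = load_scores(e1_csv), load_scores(e2_csv)
    table = []
    for item in sorted(answers, key=lambda a: (a["student_id"], a["question_id"])):
        key = (item["student_id"], item["question_id"])
        if key not in e1 or key not in e2:
            raise KeyError(f"missing human score for {key}")
        table.append({**item, "e1_final": e1[key], "e2_final": e2[key]})
    return table
```

Sorting by key and failing loudly on missing scores are design choices of the sketch; they are one way to guarantee that every ablation sees exactly the same evaluation table.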

Table 1 summarizes the dataset composition, the available rubric artifacts, and the human scoring dimensions used by the two expert graders throughout our experiments. This compact overview anchors the dataset description before detailing the rubric structure and scoring procedure.

Instructor grading guidance is represented by a reference solution and a rubric decomposed into atomic expected elements, referred to as gold points. Gold points enumerate the key concepts that should be present in a correct answer and may be weighted to reflect their relative importance. In addition, some items include explicitly listed misconceptions, referred to as banned misconceptions, that capture common incorrect statements; if a banned misconception is present in the student response, it should be penalized. Together, gold points and banned misconceptions provide a structured, point-level rubric that supports both coverage accounting—identifying which expected elements are addressed versus missed—and evidence-based justification at the granularity of individual rubric elements.

Human reference scores are provided by the two expert graders, who assign integer criterion-level subscores and a final integer score on a 0–10 scale. Technical items are scored using the dimensions accuracy, clarity, completeness, and terminology, while argumentative items are scored using clarity, coherence, originality, and dialecticality. The final E2 score serves as the primary reference for agreement analyses, providing a fixed operational reference for comparing all pipeline configurations under identical conditions, while E1 is used to quantify inter-rater reliability between the two graders. This design allows the model-human results to be interpreted together with the observed human–human variability, rather than treating the human reference as an error-free ground truth. The subscore breakdown enables finer-grained analysis of where and why automated grading decisions differ.

2.2. GradeAgentOps Pipeline and Experimental Setup

GradeAgentOps is a verification-first grading pipeline that produces structured, audit-ready grading artifacts. For each student response, the pipeline starts from a fixed set of inputs: the question statement, the student’s free-form answer, and the instructor-provided rubric artifacts (reference answer, gold points, and—when applicable—banned misconceptions). The pipeline is designed to ensure that every scoring decision can be reproduced and inspected, and that common failure modes are detected explicitly rather than hidden behind free-form explanations.

In the implementation used in this study, the primary grading stage employed the Llama 3.3 70B Instruct model [28], using the 4-bit quantized variant llama3.3:70b-instruct-q4_K_M, while targeted semantic repair employed the Qwen 2.5 14B Instruct model [29], using the 4-bit quantized variant qwen2.5:14b-instruct-q4_K_M, as a separate repair model. Both models were used as pre-trained instruction-following LLMs served locally through Ollama [30] and were not additionally trained, fine-tuned, or otherwise adapted on the exam dataset used in this study. Thus, the empirical results reported here characterize this specific model pairing and deployment setup, while the GradeAgentOps control architecture itself is not tied to these particular models. Inference was executed with fixed generation settings consisting of temperature 0, seed 42, top-p 0.9, top-k 40, and a context length of 2048 tokens. All experiments were carried out on a dedicated virtual machine running 64-bit Windows 11, using Python 3.11 in the PyCharm v2026.1.2 development environment [31]. The virtual machine was equipped with an AMD EPYC 9654 96-core processor at 2.40 GHz, 128 GB of RAM, a 3 TB SSD, and an NVIDIA L40S-48Q GPU with 48 GB of VRAM.
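For readers who want to reproduce the generation settings, the following sketch maps them onto Ollama's request options. The model tags and numeric values come from the text; the use of the `/api/generate` endpoint, the JSON output mode, and the helper name `build_request` are assumptions of this sketch rather than details confirmed by the study.

```python
import json

# Generation settings reported in the text, expressed with Ollama option names.
GENERATION_OPTIONS = {
    "temperature": 0,
    "seed": 42,
    "top_p": 0.9,
    "top_k": 40,
    "num_ctx": 2048,  # context length in tokens
}

GRADER_MODEL = "llama3.3:70b-instruct-q4_K_M"
REPAIR_MODEL = "qwen2.5:14b-instruct-q4_K_M"


def build_request(model, prompt):
    """Build a non-streaming request body for Ollama's /api/generate.

    Requesting JSON output here is an assumption; the actual client call
    used in the study is not described.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",
        "options": dict(GENERATION_OPTIONS),  # copy, so callers cannot mutate the defaults
    }


if __name__ == "__main__":
    body = build_request(GRADER_MODEL, "Grade the following answer ...")
    print(json.dumps(body, indent=2))
```

The body would typically be sent as an HTTP POST to a local Ollama server; keeping it as plain data makes the exact settings easy to log alongside each attempt.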

The end-to-end processing flow is summarized in Figure 1. The diagram emphasizes the verification-first control path, the bounded targeted repair loop triggered only by verifier signals, and the provenance logging used for auditability.

At execution time, GradeAgentOps proceeds in a staged manner. First, it performs lightweight deterministic handling for degenerate cases (e.g., empty or extremely short responses) to avoid unnecessary model calls when an answer clearly cannot satisfy rubric requirements. For non-degenerate responses, the system constructs a grading prompt that encodes the rubric-based requirements and requests a strict JSON grading artifact from a grader LLM. The generated artifact includes criterion-level subscores and a final score, together with point-level rubric coverage information and supporting evidence intended to justify awarded credit at the granularity of individual gold points. When enabled, rubric memory and consistency memory may be injected at this stage: rubric memory compresses the rubric context to a small set of highly relevant elements for the current response, while consistency memory provides same-question exemplars that act as calibration anchors.
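The degenerate-case shortcut in the first stage can be sketched as a small gate in front of the grader call. The token threshold, the zero-score outcome, and the artifact field names are all hypothetical; the text only states that empty or extremely short responses are handled deterministically without a model call.

```python
def is_degenerate(answer, min_tokens=3):
    """True when the answer is empty or has fewer than min_tokens words.

    min_tokens is an illustrative threshold, not a value from the study.
    """
    return len(answer.split()) < min_tokens


def zero_artifact(dimensions, n_gold_points):
    """Deterministic artifact for a degenerate answer (assumed to score zero).

    Field names are hypothetical; every gold point is marked as missed.
    """
    return {
        "subscores": {name: 0 for name in dimensions},
        "final_score": 0,
        "covered": [],
        "missed": list(range(n_gold_points)),
        "evidence": [],
        "rationale": "Degenerate response handled without a model call.",
    }


def grade(answer, dimensions, n_gold_points, call_grader):
    """Route degenerate answers around the LLM; otherwise delegate to call_grader."""
    if is_degenerate(answer):
        return zero_artifact(dimensions, n_gold_points)
    return call_grader(answer)
```

Because the shortcut emits the same artifact shape as the grader, downstream verification and logging do not need a special case for skipped responses.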

The grader output is then processed by a deterministic verification and canonicalization stage. This stage enforces the grading contract and produces a stable representation that downstream analyses can rely on. In addition to structural checks (e.g., required fields and integer ranges), the verifier performs semantic checks that are central to trustworthy grading, most notably verifying that supporting evidence is grounded in the student’s answer and that rubric coverage is internally consistent. Verification produces structured failure signals that explicitly separate structural contract violations from semantic violations, enabling measurable error analysis and targeted intervention.

When verification detects semantic violations that cannot be resolved by canonicalization alone—such as unverifiable evidence—GradeAgentOps triggers a targeted semantic repair loop using a separate repair LLM. Repair attempts are bounded by a fixed budget and are guided by explicit verifier signals describing what must be corrected. The repair policy is designed to modify only the offending parts of the artifact while preserving already-valid content whenever possible, which reduces variance relative to full regeneration and makes the repair process easier to audit.

Throughout the grading process, GradeAgentOps records a complete provenance trace in Neo4j. For each answer and each attempt (initial grading and any repairs), the system logs the prompt, the raw LLM output, validation outcomes, structured failure signals, and timing information. The final validated grading artifact, together with its postprocess metadata, is exported as a run-level JSON output used for downstream quantitative evaluation and ablation analysis.
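As an illustration of what one provenance record might look like, the sketch below pairs a Cypher statement with the parameters for a single attempt. The graph model (`Answer` and `Attempt` nodes linked by `HAS_ATTEMPT`) and all property names are hypothetical; the text specifies only which pieces of information are logged, not how they are modeled in Neo4j.

```python
# Hypothetical graph model: (:Answer)-[:HAS_ATTEMPT]->(:Attempt).
# Labels, relationship type, and property names are illustrative only.
LOG_ATTEMPT_CYPHER = """
MERGE (a:Answer {student_id: $student_id, question_id: $question_id})
CREATE (t:Attempt {
  kind: $kind,
  attempt_no: $attempt_no,
  prompt: $prompt,
  raw_output: $raw_output,
  valid: $valid,
  signals: $signals,
  elapsed_s: $elapsed_s
})
CREATE (a)-[:HAS_ATTEMPT]->(t)
"""


def attempt_params(student_id, question_id, kind, attempt_no,
                   prompt, raw_output, valid, signals, elapsed_s):
    """Collect one attempt (initial grading or a repair) as Cypher parameters."""
    return {
        "student_id": student_id,
        "question_id": question_id,
        "kind": kind,                # e.g. "initial" or "repair"
        "attempt_no": attempt_no,
        "prompt": prompt,
        "raw_output": raw_output,    # stored verbatim, before any canonicalization
        "valid": valid,
        "signals": list(signals),    # structured failure signals as strings
        "elapsed_s": elapsed_s,
    }
```

With the official `neo4j` Python driver, such a statement would typically be executed inside a session, for example `session.run(LOG_ATTEMPT_CYPHER, params)`. Storing the raw output verbatim is what allows later re-verification of an attempt without calling the model again.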

2.3. Verification-First Grading Contract and Deterministic Verifier

GradeAgentOps treats each grading decision as a structured artifact governed by a strict grading contract. The grader LLM is instructed to output a single JSON object following a fixed schema. The schema encodes (i) criterion-level integer subscores, (ii) an integer final score, (iii) an explicit decision about rubric-point coverage, and (iv) supporting evidence used to justify awarded credit at the level of individual rubric elements. This design makes grading outputs machine-checkable and suitable for downstream automation and auditing.

The contract is rubric-aligned and supports two question types. For technical items, the artifact contains integer subscores for accuracy, clarity, completeness, and terminology; for argumentative items, it contains integer subscores for clarity, coherence, originality, and dialecticality. In both cases, the final score is represented as an integer on a 0–10 scale and is expected to be consistent with the subscore definition. In addition to scores, the artifact includes a point-level rubric coverage representation: a list of covered gold-point indices and a complementary list of missed gold-point indices. To justify each covered gold point, the artifact includes an evidence list aligned with the covered list, where each evidence entry is intended to correspond to the respective covered point. The artifact also includes a list of detected misconceptions when such misconceptions are defined for the question, together with evidence spans, and a short free-text rationale.
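To make the shape of the contract concrete, the following sketch models a grading artifact as a Python dataclass. The subscore dimensions and the alignment between the covered and evidence lists follow the description above; the field names, the 0-based indexing of gold points, and the layout of misconception entries are assumptions, as the exact schema is not reproduced here.

```python
from dataclasses import dataclass, field

TECHNICAL_DIMENSIONS = ("accuracy", "clarity", "completeness", "terminology")
ARGUMENTATIVE_DIMENSIONS = ("clarity", "coherence", "originality", "dialecticality")


def dimensions_for(question_type):
    """Return the subscore dimensions for a question type."""
    if question_type == "technical":
        return TECHNICAL_DIMENSIONS
    if question_type == "argumentative":
        return ARGUMENTATIVE_DIMENSIONS
    raise ValueError(f"unknown question type: {question_type!r}")


@dataclass
class GradingArtifact:
    """Hypothetical in-memory view of one grading artifact."""
    question_type: str   # "technical" or "argumentative"
    subscores: dict      # dimension name -> integer subscore
    final_score: int     # integer on the 0-10 scale
    covered: list        # indices of covered gold points
    missed: list         # indices of missed gold points
    evidence: list       # evidence[i] justifies covered[i]
    misconceptions: list = field(default_factory=list)  # assumed: [{"index": int, "evidence": str}]
    rationale: str = ""

    def evidence_by_point(self):
        """Pair each covered gold point with its evidence span."""
        if len(self.covered) != len(self.evidence):
            raise ValueError("evidence list must align with covered list")
        return dict(zip(self.covered, self.evidence))


# Illustrative values only; they do not come from the dataset.
example = GradingArtifact(
    question_type="technical",
    subscores={"accuracy": 2, "clarity": 2, "completeness": 1, "terminology": 2},
    final_score=7,
    covered=[0, 2],
    missed=[1, 3],
    evidence=["the last item pushed is popped first", "push and pop"],
    rationale="Explains LIFO order and the core operations.",
)
```

Keeping evidence positionally aligned with the covered list means a verifier can check each awarded point independently, which is what makes the artifact machine-checkable.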

Table 2 provides a compact summary of the grading contract enforced by GradeAgentOps and the corresponding deterministic verifier actions. It groups constraints into core contract components (JSON envelope, scores, coverage, and evidence) and indicates whether violations are handled by canonicalization, deterministic artifact updates, or explicit postprocess failure signals.

Outputs are processed by a deterministic verifier that performs both structural and semantic checks. Structural verification enforces that the output is valid JSON, that the top-level object contains exactly the required fields, and that all rubric subscores are integers within their allowed ranges. The verifier recomputes the final score from the reported subscores and overwrites inconsistent totals to ensure that the artifact remains internally consistent and comparable across runs. The verifier also enforces well-formed rubric-point partitions by validating point indices against the rubric definition and canonicalizing the covered and missed sets into a stable representation.
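Two of the structural steps above, recomputing the total and canonicalizing the coverage partition, can be sketched as follows. Several details are assumptions made for illustration: the aggregation rule (a plain sum by default, passed in as `aggregate`), the per-dimension maximum `max_subscore`, the 0-based gold-point indices, and the field names.

```python
def canonicalize_partition(covered, n_gold_points):
    """Return (covered, missed) as sorted, disjoint lists spanning all gold points.

    Duplicate and out-of-range indices are dropped; everything not covered
    is placed in missed, so the partition is always complete.
    """
    valid = {
        i for i in covered
        # bool is a subclass of int in Python, so it is excluded explicitly
        if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < n_gold_points
    }
    missed = [i for i in range(n_gold_points) if i not in valid]
    return sorted(valid), missed


def recompute_final(artifact, dimensions, max_subscore, aggregate=sum):
    """Validate integer subscores and overwrite final_score with the recomputed total.

    aggregate and max_subscore stand in for the scoring definition, which is
    an assumption of this sketch.
    """
    subscores = artifact["subscores"]
    for name in dimensions:
        value = subscores[name]
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"subscore {name!r} must be an integer")
        if not 0 <= value <= max_subscore:
            raise ValueError(f"subscore {name!r} out of range")
    fixed = dict(artifact)  # shallow copy, so the grader's raw artifact is left untouched
    fixed["final_score"] = aggregate(subscores[name] for name in dimensions)
    return fixed
```

Overwriting the reported total rather than rejecting the artifact matches the described behavior: an arithmetic slip by the grader is corrected deterministically and does not consume a repair attempt.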

Semantic verification focuses on evidence and consistency. For each covered gold point, the associated evidence span must be verifiable against the student response; in GradeAgentOps, evidence is treated as an explicit obligation rather than an optional explanation. Evidence strings that cannot be matched to the student answer are flagged as semantic violations. In the implementation evaluated here, this evidence check is intentionally conservative and primarily verifies textual grounding: the evidence span must be recoverable from the student answer, after deterministic normalization, rather than inferred from the reference answer, rubric text, or memory context. When such violations occur, the verifier deterministically updates the grading artifact so that only rubric points with verifiable evidence remain marked as covered, and the points whose evidence could not be verified are moved to the missed list. An analogous evidence check is applied to detected misconceptions when present: misconception evidence must also be grounded in the student answer, otherwise the misconception detection is discarded to avoid unverifiable penalties. This design improves auditability by preventing unsupported evidence claims, but it may under-recognize semantically valid paraphrases when the supporting idea is expressed without a close textual span.
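A minimal sketch of such a grounding check is given below. The normalization rule (Unicode NFKC, case folding, whitespace collapsing), the substring test, the signal strings, and the field names are assumptions chosen for illustration; the exact normalization used by the verifier is not specified in this section.

```python
import re
import unicodedata


def normalize(text):
    """Deterministic normalization: NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def ground_evidence(artifact, student_answer):
    """Keep only covered points whose evidence occurs in the student answer.

    Returns (updated_artifact, signals). Points with unverifiable evidence
    are moved to missed, and one signal is emitted per demoted point.
    """
    answer = normalize(student_answer)
    covered, evidence, demoted = [], [], []
    for idx, span in zip(artifact["covered"], artifact["evidence"]):
        needle = normalize(span)
        if needle and needle in answer:  # empty evidence never counts as grounded
            covered.append(idx)
            evidence.append(span)
        else:
            demoted.append(idx)
    fixed = dict(artifact)
    fixed["covered"] = covered
    fixed["evidence"] = evidence
    fixed["missed"] = sorted(set(artifact["missed"]) | set(demoted))
    signals = [f"semantic:evidence_mismatch:{idx}" for idx in demoted]
    return fixed, signals
```

A plain substring test makes the conservative nature of the check visible: an evidence string that paraphrases the student rather than quoting them fails, which is exactly the under-recognition risk noted above.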

Finally, the verifier emits structured postprocess signals that serve as a failure taxonomy for downstream analysis and targeted repair. These signals distinguish structural contract issues from semantic violations (most notably evidence mismatches) and enable measurable reporting of failure modes at scale. This separation is critical for the targeted repair policy introduced next, which triggers corrective actions only in response to explicit verifier-detected violations.

2.4. Targeted Semantic Repair Policy

GradeAgentOps applies a two-stage control logic to handle failures after an initial grading attempt: deterministic verification and canonicalization of the generated grading artifact, followed by a bounded repair loop when the verifier signals that the artifact is unusable or semantically invalid. The design goal is to avoid treating grading as unconstrained regeneration. Instead, repair is triggered only by explicit, machine-detectable violations and is guided by structured failure information emitted by the verifier.

We distinguish two classes of failures. Structural contract failures occur when the LLM output does not satisfy the required JSON grading contract, for example because the output is not valid JSON, required fields are missing, fields have incorrect types, or integer scores fall outside allowed ranges. Semantic violations occur when the output is structurally valid but fails verifiability constraints, most notably when evidence associated with covered rubric points cannot be grounded in the student response. This separation is important because structural failures primarily block automation, whereas semantic violations undermine trust and auditability even when the artifact is machine-readable.

Repair is performed using a separate repair LLM and is executed under fixed budgets to keep computational cost predictable. For structural contract failures, the system performs up to a configurable number of repair attempts. For semantic violations, the system can enable a dedicated semantic repair mode with its own maximum number of attempts. In both cases, each repair attempt is required to return the same strict grading contract, ensuring that repaired outputs remain compatible with downstream processing and analysis.

Each repair attempt is guided by explicit verifier feedback. The repair prompt includes the original grading context (question, student answer, and rubric artifacts), the verifier’s failure description, and the previous LLM output. The failure description indicates what must be corrected—such as contract mismatch or evidence mismatch—and localizes problematic elements when applicable (e.g., which evidence entries are invalid). This design turns the verifier into a deterministic supervisor that provides reliable correction targets.

The repair policy is targeted by construction: it aims to correct only the offending parts of the grading artifact while preserving already-valid content. For example, when semantic verification flags invalid evidence, repair focuses on producing verifiable evidence and consistent coverage decisions rather than re-evaluating the entire response and unnecessarily altering subscores. If semantic repair is exhausted but a structurally valid artifact exists, the pipeline may accept the last valid artifact while retaining the semantic failure signals in the postprocess metadata, enabling later quantitative analysis of remaining semantic violations.
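The bounded repair policy described in this subsection can be sketched as a loop with separate budgets per failure class. The function name, the default budgets, the flag strings, and the `verify` and `repair` callables are hypothetical placeholders; the source states only that the budgets are fixed and configurable.

```python
def grade_with_repair(draft, verify, repair,
                      max_contract_attempts=2, max_semantic_attempts=1):
    """Run verification, then bounded repair driven by verifier reports.

    verify(output) -> {"status": "ok" | "structural" | "semantic", ...}
    repair(output, report) -> new output (for example, a repair-LLM call).
    """
    budgets = {"structural": max_contract_attempts,
               "semantic": max_semantic_attempts}
    output, last_valid, trace = draft, None, []
    while True:
        report = verify(output)
        kind = report["status"]
        trace.append(kind)
        if kind == "ok":
            return {"accepted": output, "flags": [], "trace": trace}
        if kind == "semantic":
            last_valid = output  # structurally valid, usable as a fallback
        if budgets[kind] == 0:
            break  # budget for this failure class is exhausted
        budgets[kind] -= 1
        output = repair(output, report)
    if last_valid is not None:
        # Accept the last structurally valid artifact but keep the signal.
        return {"accepted": last_valid,
                "flags": ["semantic_exhausted_but_accepted"], "trace": trace}
    return {"accepted": None, "flags": ["contract_fail"], "trace": trace}
```

Because every iteration either returns, stops, or consumes one unit of a finite budget, the number of repair calls is bounded by the sum of the two budgets, which is what keeps the cost predictable.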

2.5. Memory Modules: Rubric Memory and Consistency Memory

GradeAgentOps optionally augments prompt construction with two deterministic, retrieval-based memory modules that can be enabled independently in ablation studies. The objective is to improve focus and stability without adding generative steps and without weakening the evidence-grounding obligations imposed by the grading contract. One module performs rubric-focused retrieval by selecting a bounded subset of rubric elements that are lexically related to the current student answer, reducing prompt bloat while keeping the most relevant rubric constraints salient. The other module provides within-run calibration by retrieving previously graded exemplars for the same question that are similar to the current response, enabling more consistent scoring across comparable answers. In both cases, retrieved memory is treated strictly as guidance. Awarded credit must still be justified by evidence spans copied from the current student answer, and evidence must not be taken from the reference solution, rubric text, or memory content.

Rubric-focused retrieval constructs a per-question index of short memory cards derived from the instructor rubric artifacts. Cards are created from gold points and, when present, banned misconceptions. Each card is tokenized deterministically into lowercase alphanumeric tokens, and an Inverse Document Frequency weighting is computed over the card collection so that common tokens contribute less than discriminative tokens. For a given student answer, the module tokenizes the answer and ranks rubric cards by an IDF-weighted token overlap score computed between the answer token set and the card token set. Only cards with strictly positive overlap are retained, and ranking is performed deterministically with a stable tie-breaking rule. The injected context is bounded by fixed top-K limits to ensure predictable prompt size. The default configuration injects the top-ranked three gold-point cards and the top-ranked two misconception cards as attention guides.
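A minimal sketch of this ranking step is given below. The smoothed IDF formula and the tie-breaking rule (lower card index first) are assumptions, since the source does not state the exact variants used.

```python
import math
import re


def tokenize(text):
    """Deterministic lowercase alphanumeric token set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_idf(cards):
    """Smoothed IDF over the card collection (assumed formula)."""
    n = len(cards)
    df = {}
    for card in cards:
        for token in tokenize(card):
            df[token] = df.get(token, 0) + 1
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def rank_cards(answer, cards, top_k=3):
    """Return (card_index, score) pairs with positive IDF-weighted overlap."""
    idf = build_idf(cards)
    answer_tokens = tokenize(answer)
    scored = []
    for index, card in enumerate(cards):
        score = sum(idf[tok] for tok in answer_tokens & tokenize(card))
        if score > 0:
            scored.append((index, score))
    # Stable, deterministic order: higher score first, then lower index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

Under the default configuration described above, such a function would be called once with the gold-point cards (`top_k=3`) and once with the misconception cards (`top_k=2`).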

Within-run calibration is implemented as a streaming exemplar store maintained per question during a run. This mechanism is intended to operationalize consistency-oriented calibration, similar to human grading practice, where previously evaluated answers to the same question can help maintain a stable scoring standard across comparable responses. Because the store is updated as answers are processed, the set of exemplars available for a given answer is conditional on the fixed processing order used in that run.

After each successful grading decision, the system may store a compact exemplar comprising an excerpt of the student answer together with the model-awarded final score, the criterion-level subscores, and the set of covered gold points. The store is restricted to prior same-question exemplars and contains model-generated grading artifacts rather than human reference scores or external labels. Retrieved exemplars are therefore used as calibration anchors, while awarded credit must still be justified by evidence from the current student answer. Storage is bounded and deterministic. Answers below a minimum length threshold are excluded to avoid storing non-informative exemplars. Stored answer text is truncated to a fixed maximum excerpt length. A per-question capacity limit is enforced by retaining only the most recent exemplars, yielding deterministic eviction and predictable memory size.

When grading a new answer, retrieval is restricted to exemplars from the same question. Similarity is computed using an IDF-weighted Jaccard measure over token sets [32,33]. Only exemplars above a minimum similarity threshold are eligible, and the top-K most similar exemplars are returned with deterministic sorting. The default configuration retrieves three exemplars.
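The streaming store and its retrieval rule can be sketched as follows. All thresholds except the default of three retrieved exemplars are illustrative assumptions, as are the class and field names, the choice to measure answer length in unique tokens, the smoothed IDF formula computed over the stored exemplars, and the simplification of storing only the final score.

```python
import math
import re
from collections import deque


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ExemplarStore:
    """Per-question bounded store of model-graded exemplars (sketch)."""

    def __init__(self, capacity=50, min_tokens=5, max_chars=300,
                 min_sim=0.1, top_k=3):
        self.capacity, self.min_tokens, self.max_chars = capacity, min_tokens, max_chars
        self.min_sim, self.top_k = min_sim, top_k
        self._by_question = {}

    def add(self, question_id, answer, final_score):
        if len(tokenize(answer)) < self.min_tokens:
            return False  # too short to be an informative exemplar
        queue = self._by_question.setdefault(question_id, deque(maxlen=self.capacity))
        # deque(maxlen=...) evicts the oldest exemplar deterministically.
        queue.append({"excerpt": answer[: self.max_chars], "score": final_score})
        return True

    def retrieve(self, question_id, answer):
        exemplars = list(self._by_question.get(question_id, ()))
        token_sets = [tokenize(ex["excerpt"]) for ex in exemplars]
        n = len(token_sets)

        def idf(token):
            df = sum(token in tokens for tokens in token_sets)
            return math.log((1 + n) / (1 + df)) + 1.0

        query = tokenize(answer)
        scored = []
        for pos, tokens in enumerate(token_sets):
            union = query | tokens
            if not union:
                continue
            # IDF-weighted Jaccard: weighted intersection over weighted union.
            sim = sum(map(idf, query & tokens)) / sum(map(idf, union))
            if sim >= self.min_sim:
                scored.append((sim, pos))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [dict(exemplars[pos], similarity=sim) for sim, pos in scored[: self.top_k]]
```

Because exemplars are added only after an answer has been graded, the result of `retrieve` for a given answer depends on which answers were processed before it, which is why the processing order is held fixed within a run.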

Both memory modules operate only at prompt-construction time and do not modify the grading contract, the deterministic verification and canonicalization logic, or the evidence requirements. Because retrieval is bounded and deterministic under identical inputs, the modules support reproducible ablation studies that isolate their individual contributions to grading accuracy and stability.

2.6. Provenance Logging and Experimental Protocol

We evaluate GradeAgentOps under a controlled protocol in which all configurations operate on identical inputs and human reference scores. Across ablations, the dataset, rubric artifacts (provided in JSON), prompt template, and deterministic verifier/canonicalizer are held fixed, and only the explicitly ablated components are varied: bounded repair and the optional memory modules. This design isolates the contribution of each component while preserving a consistent grading contract and directly comparable output structure. Unless stated otherwise, agreement metrics use E2 as the reference scorer, while E1 is used to quantify inter-rater reliability.

For each student response, execution follows a staged flow. A lightweight deterministic precheck first identifies degenerate responses, in which case the system returns a contract-compliant fallback artifact without invoking the grader LLM. For non-degenerate responses, the system constructs a rubric-aligned grading prompt and, when enabled, injects retrieved rubric hints and/or same-question exemplars as compact guidance. The grader LLM produces a draft JSON grading artifact, which is immediately processed by deterministic verification and canonicalization. This stage enforces structural constraints, canonicalizes score representations, and applies evidence-grounding and coverage-consistency checks, yielding a stable artifact together with structured postprocess signals that separate structural contract issues from semantic violations.

When repair is enabled, GradeAgentOps invokes bounded corrective loops only in response to verifier-detected violations. Contract repair is triggered when the draft output fails to satisfy the grading contract and cannot be brought into compliance by canonicalization alone. In this case, repair prompts incorporate the fixed rubric-aligned requirements together with the verifier’s structured failure signals, and the process repeats under a strict budget to limit variance and cost. Separately, when semantic repair is enabled, verifier-detected semantic violations—most notably evidence mismatches—may also trigger bounded corrective actions. The repair policy is designed to preserve already-valid content whenever possible and to make all interventions auditable through explicit verifier signals rather than implicit free-form regeneration.

Table 3 summarizes the ablation configurations used in this protocol. The ablation suite comprises six settings: a baseline configuration without repair or memory (B0), a repair-enabled configuration without memory (R1), a rubric-memory configuration without repair (M1), a consistency-memory configuration with repair (C1), a rubric-memory configuration with repair (M2), and the full configuration with both memory modules and both repair loops enabled (FULL). The baseline configuration B0 uses rubric-only prompting with a single grader call and deterministic verification/canonicalization, without repair or memory augmentation. R1 enables bounded repair with both the contract repair loop and the semantic repair loop, while leaving both memory modules disabled. M1 enables rubric memory only, without repair or consistency memory. C1 enables consistency memory together with bounded repair (contract and semantic), while leaving rubric memory disabled. M2 enables rubric memory together with bounded repair (contract and semantic), while leaving consistency memory disabled. FULL enables both memory modules together with both repair loops (contract and semantic), representing the complete GradeAgentOps pipeline used for the main results. For all configurations, the system records a complete provenance trace for each attempt, including prompts, raw outputs, verifier outcomes, postprocess signals, and repair iterations when applicable.
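The six settings can be written compactly as a set of component toggles, which also makes explicit which component a given pairwise comparison isolates. The class and function names are hypothetical; the flag values follow the configuration descriptions in the paragraph above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AblationConfig:
    contract_repair: bool = False
    semantic_repair: bool = False
    rubric_memory: bool = False
    consistency_memory: bool = False


REPAIR = dict(contract_repair=True, semantic_repair=True)

CONFIGS = {
    "B0": AblationConfig(),
    "R1": AblationConfig(**REPAIR),
    "M1": AblationConfig(rubric_memory=True),
    "C1": AblationConfig(consistency_memory=True, **REPAIR),
    "M2": AblationConfig(rubric_memory=True, **REPAIR),
    "FULL": AblationConfig(rubric_memory=True, consistency_memory=True, **REPAIR),
}


def toggled(first, second):
    """Names of the components that differ between two configurations."""
    a, b = CONFIGS[first], CONFIGS[second]
    return [f.name for f in fields(AblationConfig)
            if getattr(a, f.name) != getattr(b, f.name)]
```

For example, `toggled("R1", "C1")` returns only `consistency_memory`, whereas `toggled("B0", "C1")` returns three components, so the latter comparison does not isolate a single module.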

The methodological design above yields a verification-first grading workflow in which every output is contract-compliant, evidence-aware, and fully auditable through deterministic postprocessing and provenance logging. By holding the dataset and grading contract fixed while toggling bounded repair and memory augmentation, the ablation protocol enables a clean attribution of performance differences to specific pipeline components rather than to prompt drift or uncontrolled changes in output format.

For the quantitative analyses reported in the Results section, agreement between human graders or between a model configuration and the human reference was quantified using both error-based and ordinal agreement metrics. In addition to MAE and RMSE, we report the Intraclass Correlation Coefficient, ICC(2,1), and Quadratic Weighted Kappa, QWK. ICC(2,1) denotes a two-way random-effects, absolute-agreement, single-measure intraclass correlation coefficient, computed as:

$$ICC(2,1) = \frac{MS_{R} - MS_{E}}{MS_{R} + (k - 1)\, MS_{E} + \frac{k}{n} (MS_{C} - MS_{E})} \qquad (1)$$

where $MS_{R}$ is the mean square for targets, $MS_{C}$ is the mean square for raters or scoring methods, $MS_{E}$ is the residual mean square error, $n$ is the number of scored responses, and $k$ is the number of raters or scoring methods. QWK measures ordinal agreement while penalizing larger score disagreements more strongly. For score categories $i, j \in \{0, \dots, 10\}$, it is computed as:

$$QWK = 1 - \frac{\sum_{i = 0}^{10} \sum_{j = 0}^{10} w_{ij} O_{ij}}{\sum_{i = 0}^{10} \sum_{j = 0}^{10} w_{ij} E_{ij}} \qquad (2)$$

$$w_{ij} = \frac{(i - j)^{2}}{100} \qquad (3)$$

where $O_{ij}$ is the observed agreement matrix, $E_{ij}$ is the expected agreement matrix derived from the marginal score distributions, and $w_{ij}$ is the quadratic disagreement weight between score categories $i$ and $j$. In the present study, scores are defined on the 0–10 scale; therefore, the maximum squared disagreement is $100$. Higher ICC(2,1) and QWK values indicate stronger agreement, while lower MAE and RMSE values indicate smaller score errors.
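Equations (1)–(3) can be computed directly from a score table, as in the dependency-free sketch below. The function names are illustrative, and the convention of returning 1.0 when the QWK denominator is zero (both raters constant at the same score) is an assumption rather than something stated in the source.

```python
def icc_2_1(scores):
    """ICC(2,1) for a list of rows: one row per response, one column per rater."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ms_r = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)  # targets
    ms_c = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)  # raters
    ss_e = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ms_e = ss_e / ((n - 1) * (k - 1))  # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + (k / n) * (ms_c - ms_e))


def qwk(rater_a, rater_b, max_score=10):
    """Quadratic weighted kappa for integer scores in 0..max_score."""
    size = max_score + 1
    total = len(rater_a)
    observed = [[0.0] * size for _ in range(size)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(size)) for j in range(size)]
    num = den = 0.0
    for i in range(size):
        for j in range(size):
            weight = (i - j) ** 2 / max_score ** 2  # equals (i - j)^2 / 100 here
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / total  # expected count
    return 1.0 - num / den if den else 1.0
```

For a human–human comparison, `scores` would hold one `[E1, E2]` pair per response and `qwk` would receive the two score lists; for a model–human comparison, the second column is replaced by the model score.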

3. Results

The results present quantitative comparisons across the ablation configurations, focusing on agreement with human grading and the robustness of contract-compliant grading artifacts under practical failure modes. The analysis begins with human inter-rater reliability in order to establish the variability of the human reference scores, and then examines the ablation outcomes using E2 as the reference scorer. In addition to overall score-level agreement, the results also consider item-type differences, verifier-emitted failure signals, repair behavior, and the effects of the memory modules.

3.1. Human Inter-Rater Reliability

Before analyzing model–human agreement, it is necessary to establish the degree of agreement between the two expert human graders, E1 and E2, on the final 0–10 score. This provides the human reference context for interpreting the agreement levels achieved by the automated grading configurations and clarifies the extent of variability already present at the human level.

Table 4 reports the inter-rater reliability results between E1 and E2 on the final score, both overall and separately for technical and argumentative items.

Across all 1000 responses, the two graders achieved an overall intraclass correlation coefficient ICC(2,1) of 0.678 and a quadratic weighted kappa of 0.678, indicating moderate agreement on the final score. The mean score assigned by E1 was 4.746, whereas the mean score assigned by E2 was 5.415, resulting in a positive mean bias of 0.669 points for E2 − E1. The mean absolute error between the two graders was 1.827, and the root mean squared error was 2.296. Taken together, these results indicate that the two graders were reasonably consistent overall, while also showing a systematic tendency for E2 to assign slightly higher scores than E1. Thus, the human reference should be interpreted as an expert benchmark with measurable inter-rater variability, rather than as an error-free ground truth. This variability provides an important context for the subsequent model–human agreement analyses.

The same pattern remained visible after stratifying the results by item type. For technical items, the mean scores were 5.190 for E1 and 6.038 for E2, with a mean bias of 0.848, MAE = 1.828, RMSE = 2.310, QWK = 0.654, and ICC(2,1) = 0.655. For argumentative items, the mean scores were 4.302 for E1 and 4.792 for E2, with a smaller mean bias of 0.490, MAE = 1.826, RMSE = 2.283, QWK = 0.678, and ICC(2,1) = 0.678. Thus, inter-rater agreement was slightly weaker for technical items, mainly because the positive scoring shift in E2 relative to E1 was more pronounced in that subset.

Figure 2 further illustrates this pattern by showing the distribution of integer score differences, defined as Δ = E2 − E1. The distribution is centered slightly to the right of zero, with positive differences occurring more frequently than negative ones. This visual pattern is consistent with the quantitative results reported in Table 4 and confirms that E2 was, on average, the more lenient grader.

Overall, the human–human agreement results establish an important reference point for the subsequent model–human analyses, where automated grading performance should be interpreted relative to the natural variability already present between expert human evaluators.

3.2. Main Ablation Results

Having established the level of agreement between the two human graders, the analysis now turns to the main model–human comparison across the six ablation configurations. In this stage, agreement is evaluated on the final 0–10 score using E2 as the reference scorer, in accordance with the evaluation protocol defined earlier. The aim is to determine how the different ablation settings affect score-level agreement under otherwise comparable grading conditions.

Table 5 reports the main ablation results for all six configurations, including error-based measures, ordinal agreement, and tolerance-based agreement rates.

The overall pattern is directionally consistent but moderate in magnitude. FULL achieved the strongest agreement with the reference scorer, with MAE = 1.935, RMSE = 2.500, QWK = 0.652, and Within ±2 = 0.667. The second-best configuration was C1, which also performed strongly, with MAE = 1.983, RMSE = 2.540, QWK = 0.648, and Within ±2 = 0.660. In contrast, the remaining configurations formed a weaker cluster. M1 improved slightly over the baseline family, reaching MAE = 2.078 and QWK = 0.631, whereas B0, M2, and R1 remained close to one another, with MAE values between 2.089 and 2.105 and QWK values between 0.624 and 0.628. Among all six configurations, R1 produced the weakest overall agreement, with MAE = 2.105 and QWK = 0.624. This indicates that repair alone did not improve score-level alignment over the baseline in the absence of memory-based calibration. Although repair can address verifier-detected contract or evidence issues, this ablation shows that it is not, by itself, a sufficient mechanism for improving agreement with the human reference scorer. These differences should therefore be interpreted as modest score-level gains, especially when comparing FULL with B0, where MAE decreases by 0.162 points and QWK increases by 0.024.

A second pattern is the systematic negative bias observed for all configurations. The mean model score ranged from 3.864 to 4.226, whereas the mean score assigned by E2 was 5.415 in all cases. Consequently, all configurations scored below the reference scorer on average, with Bias (Model − E2) ranging from −1.551 for R1 to −1.189 for FULL. This indicates that the automated grader remained consistently stricter than E2, even in its best-performing setting. At the same time, the strongest configurations partially reduced this gap: both C1 and FULL showed modest improvements in absolute error, ordinal agreement, and the proportion of responses graded within a narrow tolerance of the human reference.

To assess whether the observed differences between the main configurations were statistically robust, Table 6 reports paired student-cluster bootstrap comparisons with 10,000 resamples. Resampling was performed at the student level rather than at the individual-answer level, preserving the within-student dependence across the ten exam questions.

Positive Δ values indicate improvement of the first configuration over the second configuration. For MAE, Δ is computed as the second configuration value minus the first configuration value; for QWK and Within ±2, Δ is computed as the first configuration value minus the second configuration value.
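A student-cluster paired bootstrap for the MAE difference can be sketched as follows. The percentile interval, the function name, and the input layout (a mapping from student identifier to that student's absolute errors) are assumptions; the source specifies only the resampling unit, the number of resamples, and the sign convention for Δ.

```python
import random


def cluster_bootstrap_delta_mae(abs_err_first, abs_err_second,
                                n_resamples=10_000, seed=0):
    """Paired bootstrap over students for delta = MAE(second) - MAE(first).

    abs_err_*: dict mapping student id -> list of absolute errors, one per
    answer. Positive delta means the first configuration has lower error.
    """
    students = sorted(abs_err_first)
    rng = random.Random(seed)

    def delta(sample):
        first = [e for s in sample for e in abs_err_first[s]]
        second = [e for s in sample for e in abs_err_second[s]]
        return sum(second) / len(second) - sum(first) / len(first)

    point = delta(students)
    # Resample whole students with replacement, keeping all of their answers
    # together and using the same draw for both configurations (pairing).
    draws = sorted(
        delta(rng.choices(students, k=len(students))) for _ in range(n_resamples)
    )
    lower = draws[int(0.025 * n_resamples)]
    upper = draws[int(0.975 * n_resamples) - 1]
    return point, (lower, upper)
```

Drawing students rather than individual answers keeps each student's ten responses in the same resample, which is what preserves the within-student dependence mentioned above. For QWK and Within ±2 the same resampling applies, with Δ taken as first minus second.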

The bootstrap analysis indicates that both FULL and C1 improve significantly over B0 across MAE, QWK, and Within ±2. However, the direct comparison between FULL and C1 is not statistically significant for any of these metrics. Thus, FULL should be interpreted as the numerically strongest configuration, while C1 remains statistically comparable within this dataset.

Figure 3 provides a compact visual summary of the same ablation results by showing MAE and QWK side by side for the six configurations, ordered by performance.

The figure reinforces the ranking observed in Table 5. FULL and C1 stand apart as the two strongest configurations, combining lower error with higher ordinal agreement than the remaining variants. The separation is especially clear in the QWK panel, where the gains of FULL and C1 over B0, R1, and M2 are more visually pronounced. M1 occupies an intermediate position, outperforming the weaker configurations but not reaching the agreement levels of C1 or FULL. Overall, the figure confirms that the highest-performing configurations are those that combine lower score error with stronger ordinal consistency relative to the reference scorer.

Taken together, these results show that FULL provides the strongest agreement with the human reference scorer, with C1 as a close second. The ablation pattern therefore indicates that the best-performing configurations are those that achieve both lower absolute error and higher ordinal agreement, while the weaker variants remain systematically more conservative and less aligned with the reference human scores.

3.3. Performance Breakdown by Question and Item Type

The overall ablation results reported above establish that FULL achieves the strongest agreement with the reference scorer, with B0 serving as the baseline configuration. However, those aggregate results do not show whether the observed gain is distributed evenly across the exam or whether it is concentrated in particular subsets of responses. A more fine-grained analysis is therefore required. To clarify where the improvement of FULL actually comes from, the comparison is narrowed in this subsection to B0 and FULL and is examined at two complementary levels: first by item type and then by individual question, always using E2 as the reference scorer.

The item-type analysis is important because the exam contains both technical and argumentative questions, which differ not only in content but also in the nature of the expected response. A configuration that improves overall agreement may therefore do so by performing better on one category while offering only limited gains on the other. Table 7 provides this first level of breakdown by reporting the performance of B0 and FULL separately for technical and argumentative items in terms of MAE and QWK. In this way, the table shows whether the overall advantage of FULL reflects a broad gain across both categories or a more uneven effect concentrated in only one of them.

The item-type comparison shows that the advantage of FULL is not equally strong across the two response categories. For technical items, B0 achieved MAE = 2.208 and QWK = 0.541, whereas FULL achieved MAE = 2.118 and QWK = 0.538. This corresponds to a relatively small MAE reduction of 0.090, while ordinal agreement changes only marginally and in fact decreases slightly. By contrast, the difference is much more substantial for argumentative items. In this subset, B0 obtained MAE = 1.986 and QWK = 0.673, while FULL improved to MAE = 1.752 and QWK = 0.717. Here, the gain is clearly stronger, yielding an MAE reduction of 0.234 together with a QWK increase of 0.044. Taken together, these results show that the overall superiority of FULL over B0 is driven primarily by better alignment on argumentative responses, whereas the improvement on technical items is much smaller and not equally reflected across both agreement measures.

Although the item-type breakdown clarifies the broader source of the gain, it still remains too coarse to show how this improvement is distributed across the ten individual questions. In particular, the category-level comparison cannot reveal whether the benefit of FULL is broad and consistent or whether it is concentrated in a smaller subset of items. A question-level view is therefore needed. Figure 4 provides this finer-grained perspective by plotting the per-question change in error, defined as ΔMAE = MAE(B0) − MAE(FULL), for each question from Q1 to Q10. Under this definition, positive values indicate questions on which FULL reduces error relative to the baseline, whereas negative values indicate questions on which B0 remains stronger.

Figure 4 shows that the improvement delivered by FULL is clearly non-uniform across the exam. The largest gains are observed on Q8 (ΔMAE = 0.39), Q3 (ΔMAE = 0.38), Q7 (ΔMAE = 0.38), Q2 (ΔMAE = 0.36), and Q10 (ΔMAE = 0.36). These questions therefore account for a substantial part of the overall advantage of FULL over the baseline. At the same time, the gains are much smaller on Q6 and Q9, where the improvement is only 0.02, indicating near-equivalent behavior between the two configurations on those items. In addition, FULL performs slightly worse than B0 on Q1 (−0.07), Q4 (−0.11), and Q5 (−0.11). This pattern is important because it shows that the higher overall agreement achieved by FULL does not result from a uniform improvement across all questions. Rather, it is produced by a combination of substantial gains on several items, negligible differences on others, and small losses on a limited subset of questions.
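As a consistency check, the per-question values quoted above can be averaged and compared with the overall MAE reduction of 0.162 reported for FULL versus B0. This short sketch assumes that each question contributes the same number of responses (100 students answering each of the 10 questions), so that the overall ΔMAE equals the unweighted mean of the per-question values; because the per-question values are rounded to two decimals, the agreement is approximate.

```python
# Per-question delta MAE = MAE(B0) - MAE(FULL), as quoted in the text.
delta_mae = {
    "Q1": -0.07, "Q2": 0.36, "Q3": 0.38, "Q4": -0.11, "Q5": -0.11,
    "Q6": 0.02, "Q7": 0.38, "Q8": 0.39, "Q9": 0.02, "Q10": 0.36,
}

mean_delta = sum(delta_mae.values()) / len(delta_mae)
improved = sorted(q for q, d in delta_mae.items() if d > 0)
worse = sorted(q for q, d in delta_mae.items() if d < 0)

print(f"mean delta MAE = {mean_delta:.3f}")
print(f"questions improved: {len(improved)}, worse: {len(worse)}")
```

The mean of the ten values rounds to 0.162, matching the aggregate difference, with seven questions showing a reduction in error and three showing a small increase.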

Overall, the breakdown by item type and by question clarifies the structure of the gain observed in the main ablation results. The advantage of FULL over B0 arises mainly from stronger agreement on argumentative items and from marked improvements on a subset of individual questions, rather than from a consistent improvement across the entire exam. The benefit of FULL is therefore real and meaningful, but also clearly localized, which helps explain why the aggregate results reported in the previous subsection do not translate into uniform gains at every level of analysis.

3.4. Verifier and Postprocess Outcomes

The agreement analyses reported in the previous subsections quantify how closely the six configurations align with the human reference scores. However, score-level agreement alone does not show how the accepted grading artifacts differ internally once they pass through the verification and postprocess stages. A complementary analysis is therefore needed at the level of verifier and postprocess outcomes. This is especially important in the present experiments because the final outputs show almost no contract-fail cases, which means that the relevant variation between configurations lies less in outright output invalidity and more in the postprocess profile of the accepted grading artifacts.

A first step is to summarize these outcomes at the configuration level. In addition to Pass@1 and the item-level rate of any observed contract-fail case, it is useful to examine the main postprocess signals that remain visible in the final accepted outputs. Table 8 provides this overview by reporting, for each configuration, the rates of contract-fail items together with the principal verifier- and postprocess-related signals observed in the accepted grading artifacts.

The table shows that outright contract-fail cases were essentially absent across all six configurations. B0, R1, M1, and M2 produced a contract-fail item rate of 0.000, while C1 and FULL reached only 0.001, indicating that contract non-compliance was not a meaningful source of experimental variation in the final accepted outputs. At the same time, Pass@1 remained high throughout, ranging from 0.927 for FULL to 1.000 for B0 and M1. These results indicate that first-pass acceptance differed by at most 0.073 across configurations and that the central differences between them must be sought in the postprocess profile of the accepted outputs rather than in widespread output rejection.

The clearest variation appears in the evidence- and completeness-related signals. The rate of “covered gold evidence invalid” was highest for M1 (0.059) and B0 (0.057), lower for R1 (0.025) and M2 (0.022), and lowest for the strongest configurations, namely C1 (0.013) and FULL (0.010). This pattern indicates that the better-performing configurations were also less likely to retain invalid covered-gold evidence in the final accepted output. A second signal, “forced completeness from gold”, remained present in all configurations, but with a wider spread, ranging from 0.085 for C1 to 0.129 for M2. The rate was also relatively high for M1 (0.128), whereas B0 (0.122), R1 (0.121), and FULL (0.117) occupied an intermediate range. By contrast, the “short answer gate” signal was constant across all six configurations at 0.029, indicating that this behavior was driven by the dataset and answer characteristics rather than by the configuration itself. Similarly, the “banned evidence invalid dropped” signal remained low in all cases, varying only from 0.001 to 0.011.

A further distinction is visible in the “semantic exhausted but accepted” signal, which appears only in the configurations with semantic repair enabled. This rate was 0.025 for R1, 0.013 for C1, 0.022 for M2, and 0.010 for FULL, while it remained 0.000 for B0 and M1. Among the configurations in which this signal occurs, FULL produces the lowest rate, which indicates that the complete pipeline leaves the smallest proportion of accepted outputs still carrying this marker.

Although the full table provides the complete configuration-level summary, not all signals are equally informative for visual comparison. In practice, the clearest differences are concentrated in the evidence- and completeness-related outcomes. Figure 5 provides a focused view of the three most informative postprocess signals by showing their rates across all six configurations: invalid gold evidence, forced completeness, and exhausted semantic repair.

The figure makes two patterns especially clear. First, the strongest configurations substantially reduce the rate of invalid gold evidence relative to the weaker variants. This is most evident for FULL and C1, which achieve the lowest values, whereas B0 and M1 retain noticeably higher rates. Second, forced completeness remains the most prevalent of the three signals in every configuration, showing that completeness-oriented postprocessing remains relevant throughout the pipeline even when the other signals are less frequent. Within this signal, C1 stands out with the lowest rate, whereas M1 and M2 remain the highest. The third signal, exhausted semantic repair, is comparatively rare overall, but its selective presence in R1, C1, M2, and FULL confirms that it is associated with the semantic repair loop rather than with the dataset alone.

Overall, these results show that the internal differences between configurations are expressed far less through contract-fail outcomes than through the postprocess profile of the final accepted grading artifacts. The strongest configurations, especially FULL and C1, are characterized by lower rates of invalid covered-gold evidence and, more generally, by a cleaner postprocess profile, whereas the weaker variants retain a larger share of evidence-related issues in the accepted outputs.

3.5. Memory Module Effects

The previous subsections established the overall ablation ranking, localized the strongest gains by item type and question, and described how the accepted outputs differ in their verifier and postprocess profiles. A more targeted analysis is still needed, however, to isolate the contribution of the two memory-related components introduced in the pipeline. Because the raw ablation ranking alone does not show which improvements are attributable specifically to rubric memory and which are attributable to consistency memory, this subsection evaluates these effects through a set of pairwise comparisons designed to isolate each module as directly as possible.

The analysis is structured around four comparisons. The transition from B0 to M1 captures the effect of rubric memory without repair. The transition from R1 to C1 captures the effect of consistency memory in the repair-enabled setting. The transition from R1 to M2 captures the effect of rubric memory under the same repair-enabled conditions. Finally, the transition from M2 to FULL captures the additional contribution of consistency memory when rubric memory and repair are already present. Table 9 provides this pairwise comparison by reporting, for each transition, the baseline and augmented values for MAE and QWK, together with their corresponding deltas.

The table shows that the effect of the memory modules is not uniform. The smallest improvement is observed for B0 → M1, where introducing rubric memory without repair reduces MAE only from 2.097 to 2.078 and increases QWK only from 0.628 to 0.631, yielding ΔMAE = 0.019 and ΔQWK = 0.003. A similarly small effect appears in R1 → M2, where adding rubric memory in the repair-enabled setting changes MAE from 2.105 to 2.089 and QWK from 0.624 to 0.626, corresponding to ΔMAE = 0.016 and ΔQWK = 0.002. These two comparisons indicate that rubric memory, taken in isolation, contributes only modest gains in this dataset.

By contrast, the comparisons involving consistency memory show substantially larger improvements. In R1 → C1, the addition of consistency memory reduces MAE from 2.105 to 1.983 and improves QWK from 0.624 to 0.648, yielding ΔMAE = 0.122 and ΔQWK = 0.024. An even stronger effect is observed in M2 → FULL, where adding consistency memory on top of rubric memory and repair reduces MAE from 2.089 to 1.935 and raises QWK from 0.626 to 0.652, corresponding to ΔMAE = 0.154 and ΔQWK = 0.026. These two transitions consistently show that consistency memory produces the dominant gains among the memory-related components evaluated here.

Although the table provides the exact numerical deltas, a visual comparison helps clarify their relative magnitude. In particular, the side-by-side inspection of ΔMAE and ΔQWK makes it easier to see which component produces only marginal changes and which one leads to a clearly visible shift in agreement. Figure 6 provides this visual summary by plotting the change in MAE and QWK for the four pairwise memory-module comparisons.

The figure makes the ranking of the effects immediately visible. The two smallest bars appear in the rubric-memory comparisons B0 → M1 and R1 → M2, in both the ΔMAE and ΔQWK panels, confirming that rubric memory alone contributes only limited improvements. In contrast, the largest bars are observed for M2 → FULL and R1 → C1, showing that consistency memory produces a clearly stronger effect under both comparison settings. The same ordering appears in both panels, which indicates that the gain is not confined to one metric alone but is reflected consistently in both lower absolute error and stronger ordinal agreement.

Overall, these results show that the memory-related gains in this study are driven primarily by consistency memory, whereas rubric memory has only a modest effect when introduced on its own. The strongest memory-related improvement is obtained when consistency memory is added on top of an already stronger configuration, as seen in M2 → FULL, while the weakest gains are associated with the two transitions that isolate rubric memory. This pattern indicates that, within the present pipeline, consistency-oriented contextualization contributes more strongly to grading alignment than rubric memory alone.

3.6. Efficiency and Operational Overhead

The previous subsections established how the six configurations differ in agreement quality, where the strongest gains are concentrated, and how the accepted outputs differ internally in their verifier and postprocess profile. A practical comparison, however, also requires an operational perspective. Higher agreement is useful only if the associated overhead remains interpretable and manageable in realistic grading conditions. For this reason, the analysis in this subsection examines the computational cost of the six configurations in terms of total runtime, average time per response, average number of attempts per response, and repair rate.

A first step is to summarize these efficiency-related quantities at the configuration level. This makes it possible to determine whether the configurations that achieve stronger agreement do so at only marginal additional cost or whether the gain is accompanied by a substantial increase in operational overhead. Table 10 provides this overview by reporting, for each configuration, the total wall time, the average processing time per item, the average number of attempts per item, and the overall repair rate.

The table shows a clear efficiency gradient across the six configurations. The most lightweight configuration is B0, with an average processing time of 37.551 s/item, followed closely by M1 at 38.425 s/item. The repair-enabled configurations are consistently more expensive, with R1 reaching 40.345 s/item, M2 reaching 40.555 s/item, and C1 reaching 41.320 s/item. The largest overhead is observed for FULL, which requires 42.263 s/item on average. Relative to B0, this corresponds to an increase of approximately 12.5% in average per-item runtime. Thus, the best-performing configuration is also the most computationally demanding one.
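The relative overhead quoted above follows from the per-item times alone. The sketch below recomputes it for every configuration; the names `avg_item_time_s` and `overhead_pct` are illustrative and do not come from the evaluated implementation.

```python
# Average processing time per item in seconds, as reported in the text.
avg_item_time_s = {
    "B0": 37.551,
    "M1": 38.425,
    "R1": 40.345,
    "M2": 40.555,
    "C1": 41.320,
    "FULL": 42.263,
}


def overhead_pct(config, baseline="B0"):
    """Relative increase in average per-item runtime over the baseline, in percent."""
    ratio = avg_item_time_s[config] / avg_item_time_s[baseline]
    return round(100.0 * (ratio - 1.0), 1)


# List configurations from cheapest to most expensive.
ranking = sorted(avg_item_time_s, key=avg_item_time_s.get)
for name in ranking:
    print(f"{name}: {avg_item_time_s[name]:.3f} s/item, +{overhead_pct(name):.1f}% vs B0")
```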

A broadly similar pattern is reflected in the number of attempts per item. B0 and M1 both remain at 1.000, indicating that they finalize every response in a single attempt. The remaining configurations require more attempts on average: C1 reaches 1.078, R1 reaches 1.090, M2 reaches 1.091, and FULL reaches 1.093. This indicates that the additional overhead arises mainly in the configurations that perform iterative processing, rather than from a broad cost increase unrelated to the number of attempts. The repair rates follow the same ordering as the attempt counts. B0 and M1 remain at 0.000, while C1, R1, M2, and FULL rise to 0.078, 0.090, 0.091, and 0.093, respectively. This confirms that the operational differences between configurations are driven primarily by the frequency of additional repair-related processing.
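One way to read the attempt and repair figures together is to check whether the mean number of attempts exceeds one by exactly the repair rate. The sketch below performs that check on the reported values. The interpretation that each repaired item used exactly one additional attempt is an inference from the numbers, not a statement from the source, and the implied item counts assume that every configuration processed all 1000 responses.

```python
# Average attempts per item and repair rate, as reported in the text.
avg_attempts = {"B0": 1.000, "M1": 1.000, "C1": 1.078, "R1": 1.090, "M2": 1.091, "FULL": 1.093}
repair_rate = {"B0": 0.000, "M1": 0.000, "C1": 0.078, "R1": 0.090, "M2": 0.091, "FULL": 0.093}

N_RESPONSES = 1000  # assumption: each configuration graded the full dataset


def extra_attempts(config):
    """Mean number of attempts beyond the first one."""
    return round(avg_attempts[config] - 1.0, 3)


# True when the extra attempts match the repair rate for every configuration.
consistent = all(extra_attempts(c) == repair_rate[c] for c in avg_attempts)

# Number of repaired items implied by each rate under the assumption above.
implied_repairs = {c: round(rate * N_RESPONSES) for c, rate in repair_rate.items()}

print("attempts - 1 equals repair rate everywhere:", consistent)
print("implied repaired items:", implied_repairs)
```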

Although the table provides the exact numerical values, a visual comparison makes the cost profile of the configurations easier to interpret. In particular, plotting Avg Item Time next to Avg Attempts/Item shows directly whether the runtime overhead follows the same ordering as the attempt overhead. Figure 7 provides this comparison by presenting the two measures side by side for all six configurations.

The figure confirms that the ranking by runtime closely mirrors the ranking by average number of attempts. B0 and M1 occupy the lowest-cost region in both panels, while FULL occupies the highest-cost region in both. R1, M2, and C1 form an intermediate group, with C1 slightly more expensive than R1 and M2 in average item time despite a somewhat lower average number of attempts. This indicates that the operational cost of the stronger configurations is driven primarily, though not exclusively, by the frequency of additional attempts. The figure also makes clear that the increase in cost is gradual rather than abrupt, with the main difference lying between the single-attempt configurations and the more iterative ones.

Overall, these results show that the strongest-performing configurations do incur additional operational cost, but the overhead remains structured and interpretable. The best agreement is obtained by FULL, but this comes at the highest average runtime, the highest mean number of attempts per item, and the highest repair rate. Conversely, B0 and M1 are operationally the most efficient, but they do not reach the same agreement levels as the stronger repair- and memory-enabled configurations. The trade-off is therefore clear: stronger alignment with the human reference scores is associated with a moderate but measurable increase in computational overhead.

3.7. Representative Case Analysis

The aggregate quantitative analyses establish the overall ranking of the six configurations, show where the strongest gains are concentrated, and characterize the internal verifier and postprocess profile of the accepted outputs. A final qualitative step is still useful, however, in order to show how these broader patterns appear in concrete grading situations. Rather than introducing additional global statistics, this subsection focuses on a small set of representative examples selected to illustrate four distinct situations: a clear argumentative improvement, a substantial technical improvement, a case in which the internal postprocess profile becomes cleaner without changing the final score, and a case in which the baseline remains closer to the reference scorer.

Table 11 provides these representative examples by summarizing the question focus, the final scores assigned by B0 and FULL relative to the reference scorer E2, the corresponding absolute deviations from E2, and a short observation describing the main pattern illustrated by each case.

The first case illustrates a clear improvement for an argumentative response. For the item asking whether passing the Turing Test is sufficient to regard a machine as intelligent, E2 assigned a final score of 10. In this case, B0 assigned 6, whereas FULL assigned 10, matching the reference exactly. This example is representative of the broader tendency for the strongest gains to appear for argumentative responses and for questions that require coherent justification rather than short technical recall alone.

The second case illustrates a substantial but incomplete gain on a technical response. For the item concerning the role of the Naive Bayes classifier in decision-making, E2 assigned 10, B0 assigned 4, and FULL assigned 7. Here, FULL does not fully match the reference score, but it halves the absolute scoring gap relative to the baseline, from 6 points to 3. This case is also notable because the accepted output of B0 still carried the "covered gold evidence invalid" signal, whereas the accepted output of FULL did not, linking the score-level improvement to a cleaner internal postprocess profile.

The third case shows that a cleaner internal profile does not necessarily imply an immediate change in the final score. For the item asking about the role of a version control system, E2 assigned 8, while both B0 and FULL assigned 5, so the score-level behavior remains unchanged. However, the baseline output still carried the "covered gold evidence invalid" signal, while the corresponding FULL output did not. This example shows that internal output quality can improve even when the final score itself remains fixed.

The fourth case provides an important counterexample in which the baseline remains closer to the reference scorer. For the item concerning the role of facial recognition in modern security applications, E2 assigned 1. In this case, B0 also assigned 1, whereas FULL assigned 5. This is therefore a case in which the best-performing configuration at the aggregate level is clearly worse than the baseline on an individual response. The example is important because it confirms that the gains achieved by FULL are not uniform and do not eliminate all cases in which the simpler baseline remains more accurate.
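The four cases can be restated as a small calculation over the scores quoted in the text. The sketch below computes the absolute deviation of B0 and FULL from E2 and labels which configuration is closer; the short case labels and the function name `compare` are illustrative.

```python
# (question focus, E2 reference score, B0 score, FULL score), as quoted in the text.
cases = [
    ("Turing Test sufficiency", 10, 6, 10),
    ("Naive Bayes in decision-making", 10, 4, 7),
    ("Role of a version control system", 8, 5, 5),
    ("Facial recognition in security", 1, 1, 5),
]


def compare(e2, b0, full):
    """Return (|B0 - E2|, |FULL - E2|, which configuration is closer to E2)."""
    dev_b0, dev_full = abs(b0 - e2), abs(full - e2)
    if dev_full < dev_b0:
        closer = "FULL"
    elif dev_b0 < dev_full:
        closer = "B0"
    else:
        closer = "tie"
    return dev_b0, dev_full, closer


summary = {focus: compare(e2, b0, full) for focus, e2, b0, full in cases}
for focus, (dev_b0, dev_full, closer) in summary.items():
    print(f"{focus}: |B0-E2|={dev_b0}, |FULL-E2|={dev_full}, closer={closer}")
```

The output covers all three possible outcomes: FULL is closer in the first two cases, the third is a tie at the score level, and B0 is closer in the fourth.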

Overall, these representative cases reinforce the quantitative evidence without repeating it. They show concretely that the gains of FULL are most visible on argumentative responses, can also produce meaningful reductions in error on selected technical responses, may improve the internal postprocess profile even when the final score is unchanged, and do not eliminate all situations in which the baseline remains closer to the human reference. In this way, the case analysis provides a concise qualitative complement to the aggregate quantitative results by illustrating how the main performance patterns appear in concrete grading situations.

4. Discussion

The empirical results highlight several consistent patterns regarding the behavior of the GradeAgentOps pipeline. Most notably, the strongest agreement with the human reference scorer is achieved by the full configuration, while the gains are distributed unevenly across item types and individual questions. At the same time, the results show that the contribution of the pipeline is not explained solely by score-level agreement, but also by differences in postprocess behavior, memory-module effects, and operational overhead. These aspects are examined below in order to clarify the meaning, implications, and limitations of the observed findings.

4.1. Principal Findings

The empirical evaluation reveals several important but bounded findings regarding the behavior of the GradeAgentOps pipeline. First, among the six ablation configurations, FULL achieves the strongest numerical agreement with the human reference scorer E2, with C1 emerging as the closest alternative. This pattern is consistent across the main score-level agreement measures, although the observed differences between configurations are modest in magnitude. This interpretation should also be considered together with the paired student-cluster bootstrap analysis reported in Table 6. The bootstrap results show that both FULL and C1 improve significantly over B0 across MAE, QWK, and Within ±2, whereas the direct comparison between FULL and C1 is not statistically significant. The results should therefore be read as showing that the complete pipeline provides the best numerical alignment within this dataset, with C1 statistically comparable to FULL, rather than as evidence of a large performance gap over the baseline. At the same time, the advantage of the strongest configurations is not absolute: all configurations remain, on average, stricter than E2, which shows that improved agreement does not eliminate the underlying tendency of the automated grader to assign lower scores than the reference scorer.
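To make the idea of a paired student-cluster bootstrap concrete, the following sketch resamples students with replacement, keeps all responses of each sampled student together, and evaluates both configurations on the same resample. Only the name of the procedure comes from the text. The percentile interval, the number of resamples, the restriction to MAE, and the toy error data are all assumptions made for illustration, and none of the numbers below are results from the study.

```python
import random


def mae(errors):
    """Mean absolute error over a flat list of signed errors (model minus reference)."""
    return sum(abs(e) for e in errors) / len(errors)


def paired_cluster_bootstrap(err_a, err_b, n_boot=1000, seed=0):
    """Point estimate and 95% percentile interval for MAE(a) - MAE(b).

    err_a and err_b map student id -> list of signed errors for the same items,
    so every resampled student contributes to both configurations (paired design).
    """
    rng = random.Random(seed)
    students = sorted(err_a)

    def mae_diff(sample):
        pooled_a = [e for s in sample for e in err_a[s]]
        pooled_b = [e for s in sample for e in err_b[s]]
        return mae(pooled_a) - mae(pooled_b)

    point = mae_diff(students)
    reps = sorted(
        mae_diff(rng.choices(students, k=len(students))) for _ in range(n_boot)
    )
    lower = reps[int(0.025 * n_boot)]
    upper = reps[int(0.975 * n_boot) - 1]
    return point, lower, upper


def shrink(error):
    """Move a signed error one point toward zero (toy stand-in for a better grader)."""
    if error > 0:
        return error - 1
    if error < 0:
        return error + 1
    return 0


# Toy data only: 20 hypothetical students with 10 items each.
toy_rng = random.Random(42)
baseline = {f"s{i:02d}": [toy_rng.randint(-4, 2) for _ in range(10)] for i in range(20)}
improved = {s: [shrink(e) for e in errs] for s, errs in baseline.items()}

point, lower, upper = paired_cluster_bootstrap(baseline, improved)
print(f"MAE difference: {point:.3f}, 95% interval: [{lower:.3f}, {upper:.3f}]")
```

Under this reading, a difference would be called significant when the interval excludes zero. Resampling at the student level rather than the response level respects the fact that the ten answers of one student are not independent observations.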

A second principal finding is that the observed gains are not distributed uniformly across the evaluation set. The improvements achieved by FULL are concentrated more strongly in argumentative items than in technical items, and the question-level breakdown shows that the gain is driven by a subset of items rather than by a consistent improvement across all ten questions. This is important because it indicates that the strongest configuration does not simply raise performance in a global and homogeneous way, but instead improves alignment more effectively in response types that appear to benefit from stronger consistency and contextualization during grading.

A third finding concerns the relative contribution of the pipeline components. The pairwise comparisons show that the largest memory-related improvements are associated with consistency memory, whereas rubric memory alone produces only modest gains. In addition, the internal analysis of verifier and postprocess outcomes shows that the stronger configurations are characterized less by differences in contract-fail behavior and more by a cleaner postprocess profile, especially through lower rates of invalid covered-gold evidence in the accepted outputs. Taken together, these results suggest that the most meaningful improvements do not arise from simple rubric exposure alone, but from mechanisms that stabilize the grading decision and reduce evidence-level inconsistencies in the final artifact.

A fourth finding is that these quality gains are accompanied by a measurable but moderate operational cost. The best-performing configurations require higher average item time, more attempts per item, and higher repair rates than the most lightweight variants. Nevertheless, the overhead remains structured and interpretable rather than excessive or unstable. In practical terms, the results indicate that stronger agreement with the human reference scorer can be obtained, but not for free: the gains in grading quality are coupled with a clear, though still manageable, increase in computational effort.

Overall, the main findings support the conclusion that the full GradeAgentOps pipeline produces modest but statistically supported improvements over the baseline, while C1 remains statistically comparable to FULL within this dataset. These improvements are selective, component-dependent, and associated with a transparent operational trade-off.

4.2. The Role of Pipeline Components

The ablation results make it possible to interpret the contribution of the main GradeAgentOps components more precisely than would be possible from the overall ranking alone. The first important observation is that the pipeline does not behave as a set of equally influential additions. Instead, the results show a clearly uneven contribution across components, with some elements producing only marginal gains and others accounting for a substantial part of the observed improvement in agreement.

The weaker performance of R1 is informative in this respect. In this configuration, bounded repair is enabled without rubric memory or consistency memory. The repair mechanism is triggered by verifier-detected violations and is designed to restore contract compliance, evidence grounding, and internal consistency of the grading artifact, rather than to recalibrate the final score globally. As a result, repair alone may produce cleaner or more compliant artifacts without necessarily improving agreement with the human reference scorer. This suggests that the repair model and repair prompt are useful as targeted correction mechanisms, but they should not be interpreted as an independent source of score calibration. The stronger performance of C1 and FULL indicates that repair is more effective when combined with consistency-oriented contextualization.
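The distinction between repairing an artifact and recalibrating a score can be illustrated with a bounded verify-and-repair loop. The sketch below is a hypothetical reconstruction, not the evaluated implementation: the artifact fields, the violation names, the verbatim-substring evidence check, the limit of one repair, and the stub grader and repair functions are all assumptions.

```python
from typing import Any, Callable, Dict, List, Tuple

Artifact = Dict[str, Any]


def verify(artifact: Artifact, answer: str) -> List[str]:
    """Hypothetical deterministic verifier: return the list of detected violations."""
    violations = []
    score = artifact.get("final_score")
    if not isinstance(score, int) or not 0 <= score <= 10:
        violations.append("final_score_out_of_range")
    for span in artifact.get("evidence", []):
        if span not in answer:  # evidence must appear verbatim in the answer
            violations.append("evidence_not_in_answer")
    return violations


def grade_with_bounded_repair(
    answer: str,
    grade: Callable[[str], Artifact],
    repair: Callable[[str, Artifact, List[str]], Artifact],
    max_repairs: int = 1,
) -> Tuple[Artifact, int, List[str]]:
    """Grade once, then repair only while violations remain and the budget allows."""
    artifact = grade(answer)
    attempts = 1
    violations = verify(artifact, answer)
    while violations and attempts - 1 < max_repairs:
        artifact = repair(answer, artifact, violations)
        attempts += 1
        violations = verify(artifact, answer)
    return artifact, attempts, violations


def stub_grade(answer: str) -> Artifact:
    """Stand-in for the grader model; the second evidence span is not in the answer."""
    return {"final_score": 7, "evidence": ["tracks changes", "enables time travel"]}


def stub_repair(answer: str, artifact: Artifact, violations: List[str]) -> Artifact:
    """Stand-in for the repair model: drop ungrounded evidence, keep the score."""
    kept = [span for span in artifact["evidence"] if span in answer]
    return {**artifact, "evidence": kept}


answer = "A version control system tracks changes to files and lets a team merge work."
artifact, attempts, remaining = grade_with_bounded_repair(answer, stub_grade, stub_repair)
print(artifact, attempts, remaining)
```

In this toy run the repair removes the ungrounded evidence span and leaves the score of 7 untouched, which mirrors the point made above: a repaired artifact can be cleaner without being any closer to the human reference score.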

The clearest distinction emerges between rubric memory and consistency memory. In the pairwise comparisons designed to isolate their effects, rubric memory produces only limited improvements, both when introduced without repair and when added in the repair-enabled setting. By contrast, consistency memory yields markedly larger gains under both comparison conditions. This indicates that simple access to rubric-related contextual information is not, by itself, sufficient to generate a strong improvement in grading alignment. What appears to matter more is the component that promotes a more stable and coherent use of that information during the grading decision itself.

This interpretation is reinforced by the broader behavior of the strongest configurations. The best-performing variants are not simply those that expose the model to more rubric-related context, but those that combine contextual support with mechanisms that reduce inconsistency in the final grading artifact. In this sense, the role of consistency memory appears to be less about adding content and more about constraining the grading process toward a more internally coherent decision pattern. This helps explain why the gains are especially visible on argumentative items, where grading depends more strongly on maintaining stable reasoning across multiple elements of the response rather than identifying only a small set of isolated technical cues.

A similar conclusion follows from the verifier and postprocess analyses. The stronger configurations are not distinguished by a dramatic reduction in outright contract-fail cases, because such failures are already rare across the board. Instead, they are distinguished by a cleaner internal postprocess profile, particularly through lower rates of invalid covered-gold evidence in the final accepted outputs. This suggests that the key contribution of the pipeline is not merely to reject malformed outputs, but to support the production of grading artifacts that are more internally aligned with the expected evidential structure. In other words, the most meaningful effect of the pipeline lies not at the level of coarse output validity alone, but at the level of how consistently the final score is supported by the accepted evidence representation.

Taken together, these findings suggest that the contribution of the GradeAgentOps architecture is best understood as a decision-stabilization effect rather than a simple information-augmentation effect. The weaker components provide some benefit, but the strongest gains arise when the pipeline introduces mechanisms that improve the consistency of grading behavior and reduce evidence-level irregularities in the final accepted output. This interpretation also helps explain why the strongest gains are selective rather than uniform: the pipeline is most helpful in cases where grading quality depends on maintaining coherent judgment across multiple elements of a response, and less helpful in cases where the answer can already be handled adequately by a simpler baseline strategy.

4.3. Human Reference Variability

The interpretation of model–human agreement in this study must be considered in light of the variability already present at the human level. The results show that the two expert graders, E1 and E2, do not coincide perfectly on the final 0–10 score, but instead display a measurable level of disagreement together with a systematic directional shift. In particular, E2 tends to assign higher scores than E1, which means that the human reference used for the main ablation analysis is not simply a neutral benchmark, but one specific realization of expert judgment within a broader range of plausible human scoring behavior.

This point is important because the automated configurations are evaluated primarily against E2. As a consequence, their agreement levels reflect not only how well they capture the intended grading standard, but also how closely they align with the particular scoring tendency represented by that grader. The consistent negative bias observed across all configurations relative to E2 therefore should not be interpreted as evidence of arbitrary underscoring alone. It also reflects the fact that the chosen reference scorer, E2, is on average somewhat more lenient than the other expert grader, E1. In this sense, part of the apparent strictness of the automated grader is inseparable from the variability of the human reference itself.

The human–human comparison also helps place the model–human results in a more realistic perspective. Automated grading systems are often discussed as if they were being compared against a perfectly stable gold standard, but the present results show that this assumption is not appropriate even in an expert-graded setting. The human reference score is itself subject to variation, and the automated pipeline should therefore be interpreted relative to that already-existing uncertainty. This does not weaken the importance of agreement metrics; rather, it makes their interpretation more careful and more credible. A model–human result should not be read in isolation from the human–human baseline that defines the attainable context of agreement.

At the same time, the existence of human reference variability does not reduce the value of the observed ablation ranking. The comparisons between configurations remain meaningful because all six variants are evaluated against the same reference scorer under the same protocol. Thus, the relative ordering of the configurations is still informative, even if the absolute agreement values must be interpreted with appropriate caution. In other words, the variability between E1 and E2 mainly affects how strongly the agreement levels can be generalized as human-equivalent, but it does not undermine the internal validity of the comparison between B0, R1, M1, C1, M2, and FULL.

Overall, the presence of measurable human reference variability strengthens rather than weakens the methodological interpretation of the study. It shows that automated grading performance should be judged against a realistic human benchmark rather than an idealized one, and it provides an important context for understanding both the bias and the agreement levels observed in the main results.

4.4. Practical Implications for Automated Grading

The results have several practical implications for the design and deployment of automated grading systems for short open-ended responses. The first implication is that stronger grading alignment does not appear to depend solely on using a more capable base model or exposing the model to more rubric-related information. Instead, the findings suggest that the structure of the grading pipeline itself matters substantially. In particular, the best results are obtained not by the simplest rubric-conditioned setup, but by a configuration that combines verification, bounded repair, postprocessing, and consistency-oriented memory. This indicates that practical gains in automated grading quality may come less from isolated prompt enrichment and more from building a grading process that is internally constrained and behaviorally stable.

A second implication concerns the type of responses for which such a pipeline is most beneficial. The gains are more pronounced on argumentative items than on technical ones, which suggests that agentic support is especially valuable when grading requires coherent judgment across multiple aspects of a response rather than recognition of a small number of expected technical elements. In practical educational settings, this means that the strongest benefits of a structured grading pipeline may appear precisely in the kinds of questions that are traditionally harder to assess consistently with simple automated methods.

A third implication is that the choice of configuration should depend on the intended operational setting. If the main priority is maximum agreement quality, the results support the use of the FULL configuration, despite its higher runtime and attempt overhead. If, however, the priority is lower computational cost with acceptable but not maximal alignment, lighter variants such as B0 or M1 may still be attractive. This suggests that automated grading pipelines should not be viewed as having a single universally optimal setting. Rather, different deployment contexts may justify different trade-offs between grading quality and efficiency.

The results also carry an implication for the role of internal output quality in practical grading systems. The stronger configurations do not merely produce better score-level agreement; they also produce accepted outputs with a cleaner postprocess profile. This matters because a grading system that is intended for real educational use must support not only a plausible final score, but also an output structure that remains internally coherent and evidentially defensible. From a deployment perspective, this makes the pipeline more suitable for settings in which transparency, reviewability, and downstream auditing are important.

From an explainable artificial intelligence (XAI) perspective, GradeAgentOps provides artifact-level and process-level explainability. Artifact-level explainability is supported by the structured grading output, which exposes the criterion-level subscores, final score, covered and missed rubric elements, and evidence spans used to justify awarded credit. Process-level explainability is supported by verifier signals, postprocess flags, repair attempts, and provenance logs, which make it possible to inspect how a grading artifact was produced, validated, and corrected. This form of explainability is therefore operational and audit-oriented: it supports teacher review, error tracing, and reproducibility at the grading-pipeline level, while mechanistic interpretation of the internal reasoning or parameters of the underlying LLM remains outside the evaluated system.
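The two levels of explainability can be pictured as two records stored side by side for every graded response. The sketch below is illustrative only: the class names, field names, and JSON layout are assumptions, and the flag string simply reuses the wording of the postprocess signal mentioned in the case analysis.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class GradingArtifact:
    """Artifact-level view: what was scored and on which evidence."""
    criterion_subscores: Dict[str, int]
    final_score: int
    covered_elements: List[str]
    missed_elements: List[str]
    evidence_spans: Dict[str, str]  # covered element -> span quoted from the answer


@dataclass
class ProvenanceRecord:
    """Process-level view: how the artifact was produced, validated, and corrected."""
    response_id: str
    configuration: str
    attempts: int
    verifier_signals: List[str] = field(default_factory=list)
    postprocess_flags: List[str] = field(default_factory=list)


def audit_line(artifact: GradingArtifact, provenance: ProvenanceRecord) -> str:
    """Serialize both views into one JSON line suitable for an append-only log."""
    payload = {"artifact": asdict(artifact), "provenance": asdict(provenance)}
    return json.dumps(payload, sort_keys=True)


artifact = GradingArtifact(
    criterion_subscores={"definition": 3, "justification": 2},
    final_score=5,
    covered_elements=["tracks changes"],
    missed_elements=["supports collaboration"],
    evidence_spans={"tracks changes": "it keeps a history of every change"},
)
provenance = ProvenanceRecord(
    response_id="student-017/q04",  # hypothetical identifier format
    configuration="FULL",
    attempts=2,
    verifier_signals=["evidence_not_in_answer"],
    postprocess_flags=["covered_gold_evidence_invalid"],
)
print(audit_line(artifact, provenance))
```

Keeping the two records separate lets a reviewer inspect the justification for a score without reading the processing history, and lets an auditor trace repairs without re-reading the rubric.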

Overall, the practical message is that automated grading quality can be improved in a meaningful way through pipeline design choices that promote consistency and internal validity, but these gains must be balanced against measurable computational overhead. The results therefore support a view of automated grading not as a single inference step, but as a structured decision process whose configuration should be matched to the pedagogical and operational priorities of the intended use case.

4.5. Limitations and Future Directions

Several limitations should be considered when interpreting the present findings. First, the evaluation is conducted on a single university-level exam dataset comprising 100 students, 10 short open-ended questions, and 1000 graded responses from one course and one institutional context. Although this setting is appropriate for controlled ablation analysis, it necessarily restricts the scope of generalization. The dataset includes both technical and argumentative items, but remains domain-specific and reflects one language, one assessment format, and one set of grading conventions. The observed behavior of the pipeline may differ across disciplines, languages, educational levels, rubric structures, response formats, and institutional grading practices. The present results should therefore be interpreted as evidence of effectiveness within a well-defined assessment setting, while broader generalization requires replication on additional courses, domains, languages, and educational contexts.

A second limitation concerns the human reference itself. The study benefits from the availability of two expert human graders, which makes it possible to quantify human inter-rater variability and to interpret model-human agreement more realistically. However, the observed human–human agreement is moderate (ICC(2,1) = 0.678; QWK = 0.678), indicating that the human benchmark itself contains measurable variability and should not be treated as an error-free ground truth. At the same time, the main ablation analysis still uses E2 as the operational reference scorer. This provides a consistent basis for internal comparison across configurations, but it also means that the reported agreement values remain tied to one specific expert scoring tendency. Consequently, model-human agreement should be interpreted relative to the observed human–human agreement context. In settings where additional graders are available, future work should examine whether the same configuration ranking remains stable when evaluated against alternative human references, aggregated human scores, or adjudicated consensus scores.

A third limitation concerns the scope of the architectural evaluation. The present study focuses on a bounded set of pipeline components, namely verification, postprocessing, and two optional memory modules, under a fixed grading setting and a fixed family of generated artifacts. This design is appropriate for isolating the contribution of the proposed components, but it does not exhaust the broader design space of agentic grading systems. In particular, the evidence-matching mechanism used in the evaluated implementation is deterministic and primarily lexical, which makes it conservative from an auditability perspective. This design reduces the risk of accepting unsupported evidence claims, but it may also under-recognize semantically correct paraphrases or alternative valid reasoning patterns when they are not expressed through a close textual span in the student answer. Other forms of memory, alternative evidence-selection strategies, semantic evidence matching, different repair policies, or more explicit reasoning constraints may produce different interaction patterns than those observed here. In addition, the consistency-memory results should be interpreted in light of the streaming nature of the same-question exemplar store. This design intentionally introduces within-run calibration, similar to the way human graders may use previously evaluated answers to maintain a stable scoring standard. However, because the available exemplars depend on the fixed answer-processing order, C1 and FULL may show some order sensitivity. The memory does not contain human reference scores or external labels, but future work should quantify this order effect by repeating the consistency-memory configurations under shuffled or counterbalanced processing orders.
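The order sensitivity described above can be demonstrated with a toy streaming exemplar store. In the sketch below, the store only holds answers graded earlier in the same run, together with the model's own scores and no human labels, as stated in the text. Everything else is an assumption made for illustration: the retrieval rule (the last `k` exemplars for the same question), the toy grader that pulls its score toward the exemplar mean, and the shuffling protocol.

```python
import random
from collections import defaultdict


class StreamingExemplarStore:
    """Same-question memory that only sees answers graded earlier in the run."""

    def __init__(self, k=3):
        self.k = k
        self._by_question = defaultdict(list)

    def retrieve(self, question_id):
        return list(self._by_question[question_id][-self.k:])

    def add(self, question_id, answer, model_score):
        # Only the model's own score is stored; no human reference labels.
        self._by_question[question_id].append((answer, model_score))


def run(items, grade, k=3):
    """Grade items in the given order; items are (item_id, question_id, answer)."""
    store = StreamingExemplarStore(k)
    scores = {}
    for item_id, question_id, answer in items:
        score = grade(answer, store.retrieve(question_id))
        store.add(question_id, answer, score)
        scores[item_id] = score
    return scores


def toy_grade(answer, exemplars):
    """Toy grader: word count capped at 10, pulled toward the mean exemplar score."""
    base = min(10, len(answer.split()))
    if not exemplars:
        return base
    anchor = sum(score for _, score in exemplars) / len(exemplars)
    return round((2 * base + anchor) / 3)


def order_sensitivity(items, grade, n_shuffles=20, seed=0):
    """Mean absolute score change per item between the given order and shuffled orders."""
    rng = random.Random(seed)
    reference = run(items, grade)
    total = 0
    for _ in range(n_shuffles):
        shuffled = items[:]
        rng.shuffle(shuffled)
        scores = run(shuffled, grade)
        total += sum(abs(scores[i] - reference[i]) for i in reference)
    return total / (n_shuffles * len(reference))


items = [
    ("a", "q1", "one two three four"),
    ("b", "q1", "one two three four five six seven eight nine ten"),
]
forward = run(items, toy_grade)
backward = run(items[::-1], toy_grade)
print(forward, backward, order_sensitivity(items, toy_grade))
```

With these two toy answers the same pair receives different scores depending on which one is graded first, which is the kind of effect that repeated runs under shuffled or counterbalanced orders would quantify.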

A fourth limitation is operational. The reported runtime and overhead results are meaningful within the experimental setup used in this study, but they should not be interpreted as hardware-independent performance guarantees. The relative differences between configurations remain informative, yet the absolute timing values depend on the specific execution environment, implementation details, and model-serving conditions under which the experiments were conducted. Future work should therefore complement the present cost analysis with evaluations across additional deployment settings and model backends.

These limitations suggest several natural directions for future research. An immediate next step is to validate the pipeline on additional courses, assessment tasks, and question types in order to test the robustness of the observed ablation patterns beyond the present dataset. A second direction is to extend the human reference framework by incorporating more expert graders and by exploring alternative reference constructions, such as aggregated or adjudicated human scores. A third direction is to investigate the interaction between consistency-oriented mechanisms and rubric-guided reasoning in more detail, especially in tasks where justification quality and evidence selection play a stronger role than short factual recall. Finally, future work should examine how the pipeline behaves when paired with other model families and deployment regimes, because the absolute agreement values, operational overhead, and even the relative ranking of configurations may vary with the grader and repair models used. Such experiments would allow both agreement quality and operational trade-offs to be assessed under a broader range of practical conditions.

Overall, the limitations of the present study do not undermine the internal validity of the reported comparison, but they do define the boundaries within which its conclusions should be interpreted. At the same time, they identify a clear research path for extending the current findings into a broader and more general account of agentic automated grading.

5. Conclusions

This study evaluated GradeAgentOps as an agentic grading pipeline for short open-ended exam responses under a controlled ablation protocol. Across the six examined configurations, the results showed that the full pipeline achieved the strongest agreement with the human reference scorer, while the clearest gains were associated primarily with consistency-oriented mechanisms rather than with rubric memory alone. The improvements were most evident on argumentative items, where the stronger configurations achieved better alignment with human grading and produced cleaner accepted outputs. These gains, however, were accompanied by a moderate and measurable increase in operational overhead.

Beyond the ranking of individual configurations, the findings also highlight a broader methodological point. Automated grading quality in this setting is not determined solely by model capability or prompt content, but also by how the grading process is structured. Verification, postprocessing, and consistency-oriented contextualization contribute to more stable and internally coherent grading behavior, whereas simpler configurations remain more conservative, with lower mean scores relative to E2, and less aligned with the human reference scores. At the same time, the presence of measurable human inter-rater variability reinforces the need to interpret model–human agreement against a realistic expert benchmark rather than an idealized one.

Taken together, these results support the view that reliable automated grading is better understood as a structured decision process than as a single inference step. Within the present evaluation setting, GradeAgentOps produced modest but interpretable improvements in alignment with expert human grading, while making the trade-off between grading quality and computational overhead explicit. This makes the proposed pipeline a promising basis for further work on robust, reviewable, and practically deployable automated assessment.

Author Contributions

Conceptualization, C.A., A.A.A., M.V.C., A.C., C.A.A., C.S. and S.D.; methodology, C.A., M.V.C., A.A.A., D.-E.V. and R.C.; software, A.C., C.A., A.A.A. and C.M.; validation, C.A., M.V.C., A.A.A., A.C., C.A.A. and C.S.; formal analysis, C.A.A. and S.D.; data curation, A.C., M.V.C., D.-E.V., C.M. and R.C.; writing—original draft preparation, C.A., A.A.A., M.V.C., A.C., D.-E.V., C.A.A., C.M., C.S., S.D. and R.C.; writing—review and editing, C.A., A.A.A., A.C., M.V.C., C.A.A., C.S. and S.D.; visualization, A.A.A., A.C., D.-E.V. and C.M.; supervision, C.A., A.C., C.S. and R.C. All authors have read and agreed to the published version of the manuscript.

Funding

Publication of this paper was supported by the University of Medicine and Pharmacy Carol Davila, through the institutional program Publish not Perish.

Institutional Review Board Statement

Ethical review and approval were not required for this study because it used anonymous, non-interventional educational assessment responses collected through an online form. The form did not collect names, email addresses, or directly identifying personal data. Only responses with affirmative electronic consent were retained and analyzed. The study did not involve patients, clinical data, biological material, medical intervention, sensitive personal data, vulnerable clinical populations, or identifiable participant information. The fact that formal ethical approval was not required was confirmed by the Faculty of Automation, Computer Science, Electrical and Electronic Engineering, “Dunărea de Jos” University of Galați, Romania, through an official faculty confirmation issued by the Dean’s Office [Reference No. 754/Date 28 April 2026].

Informed Consent Statement

Informed consent was obtained electronically from all participants included in the study. The online form included an explicit consent question asking whether participants agreed that their anonymous answers could be used for scientific research purposes. Only responses with affirmative consent were retained and analyzed.

Data Availability Statement

The source code and anonymized evaluation dataset required to reproduce the GradeAgentOps experiments are publicly available at: https://github.com/anghelcata/GradeAgentOps (accessed on 8 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
CSV	Comma-Separated Values
E1	Expert Human Grader 1
E2	Expert Human Grader 2
ICC	Intraclass Correlation Coefficient
JSON	JavaScript Object Notation
LLM	Large Language Model
MAE	Mean Absolute Error
QWK	Quadratic Weighted Kappa
RMSE	Root Mean Squared Error
B0	Baseline configuration without repair or memory
R1	Repair-enabled configuration without memory
M1	Rubric memory configuration without repair
C1	Consistency memory configuration with repair
M2	Rubric memory configuration with repair
FULL	Full configuration with repair and both memory modules

References

Poličar, P.G.; Špendl, M.; Curk, T.; Zupan, B. Automated assignment grading with large language models: Insights from a bioinformatics course. Bioinformatics 2025, 41, i21–i29.
Li, Q.; Cui, L.; Kong, L.; Bi, W. Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks. In Proceedings of the 31st International Conference on Computational Linguistics (COLING), Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 10325–10344.
Bray, T. RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format. Available online: https://www.rfc-editor.org/rfc/rfc8259 (accessed on 28 February 2026).
Neo4j, Inc. Neo4j Graph Database Platform. Available online: https://neo4j.com/product/neo4j-graph-database/ (accessed on 28 February 2026).
Panadero, E.; Jonsson, A.; Pinedo, L.; Fernández-Castilla, B. Effects of Rubrics on Academic Performance, Self-Regulated Learning, and self-Efficacy: A Meta-analytic Review. Educ. Psychol. Rev. 2023, 35, 113.
Christian, G. LLM-based automatic short answer grading in undergraduate medical education. BMC Med. Educ. 2024, 24, 1060.
Zhang, D.-W.; Boey, M.; Tan, Y.Y.; Jia, A.H.S. Evaluating large language models for criterion-based grading from agreement to consistency. npj Sci. Learn. 2024, 9, 79.
Li, Z.; Wang, C.; Ma, P.; Wu, D.; Wang, S.; Gao, C.; Liu, Y. Split and Merge: Aligning Position Biases in LLM-based Evaluators. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Miami, FL, USA, 12–16 November 2024; pp. 11084–11108.
Lyu, Q.; Apidianaki, M.; Callison-Burch, C. Towards Faithful Model Explanation in NLP: A Survey. Comput. Linguist. 2024, 50, 657–723.
Schlegel, M.; Sattler, K.-U. Capturing end-to-end provenance for machine learning pipelines. Inf. Syst. 2025, 132, 102495.
Cavus, M.; Jiang, J.; Allahham, A. Deep Multi-Task Forecasting of Net-Load and EV Charging with a Residual-Normalised GRU in IoT-Enabled Microgrids. Energies 2026, 19, 311.
Emirtekin, E. Large Language Model-Powered Automated Assessment: A Systematic Review. Appl. Sci. 2025, 15, 5683.
Golchin, S.; Garuda, N.; Impey, C.; Wenger, M. Grading Massive Open Online Courses Using Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 3899–3912.
Duong, T.N.B.; Chai, Y.M. Automatic Grading of Short Answers Using Large Language Models in Software Engineering Courses. In Proceedings of the 2024 IEEE Global Engineering Education Conference (EDUCON), Kos, Greece, 8–11 May 2024; pp. 1–10.
Cisneros-González, J.; Gordo-Herrera, N.; Barcia-Santos, I.; Sánchez-Soriano, J. JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs). Future Internet 2025, 17, 265.
Wang, Y.; Huang, J.; Du, L.; Guo, Y.; Liu, Y.; Wang, R. Evaluating large language models as raters in large-scale writing assessments: A psychometric framework for reliability and validity. Comput. Educ. Artif. Intell. 2025, 9, 100481.
Hashemi, H.; Eisner, J.; Rosset, C.; Van Durme, B.; Kedzie, C. LLM-RUBRIC: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024; pp. 13806–13834.
Koo, R.; Lee, M.; Raheja, V.; Park, J.I.; Kim, Z.M.; Kang, D. Benchmarking Cognitive Biases in Large Language Models as Evaluators. In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 517–545.
Shi, L.; Ma, C.; Liang, W.; Diao, X.; Ma, W.; Vosoughi, S. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. In 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics; The Asian Federation of Natural Language Processing and The Association for Computational Linguistics: Mumbai, India, 2025.
Zhang, Q.; Wang, Y.; Jiang, Y.; Li, L.; Wu, C.; Wang, Y.; Jiang, X.; Shang, L.; Tang, R.; Lyu, F.; et al. Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27 July–1 August 2025; pp. 5059–5074.
Seo, H.; Hwang, T.; Jung, J.; Namgoong, H.; Lee, J.; Jung, S. Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy. Appl. Sci. 2025, 15, 671.
Pina, D.; Kunstmann, L.; Chapman, A.; de Oliveira, D.; Mattoso, M. DLProv: A suite of provenance services for deep learning workflow analyses. PeerJ Comput. Sci. 2025, 11, e2985.
Hu, X.; Gao, M.; Hu, S.; Zhang, Y.; Chen, Y.; Xu, T.; Wan, X. Are LLM-based Evaluators Confusing NLG Quality Criteria? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 9530–9570.
Kamoi, R.; Zhang, Y.; Zhang, N.; Han, J.; Zhang, R. When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs. Trans. Assoc. Comput. Linguist. 2024, 12, 1417–1440.
Beurer-Kellner, L.; Fischer, M.; Vechev, M. Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
Zheng, L.; Yin, L.; Xie, Z.; Sun, C.; Huang, J.; Yu, C.H.; Cao, S.; Kozyrakis, C.; Stoica, I.; Gonzalez, J.E.; et al. SGLang: Efficient Execution of Structured Language Model Programs. In Advances in Neural Information Processing Systems 37; Neural Information Processing Systems Foundation, Inc. (NeurIPS): Vancouver, BC, Canada, 2024.
Padovani, G.; Anantharaj, V.; Fiore, S. yProv4ML: Effortless provenance tracking for machine learning systems. SoftwareX 2025, 31, 102298.
Meta Platforms, Inc. Llama-3.3-70B-Instruct. Available online: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct (accessed on 9 March 2026).
Qwen Team. Qwen2.5-14B-Instruct. Available online: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct (accessed on 9 March 2026).
Ollama-Inc. Ollama: Local Deployment of Large Language Models. Available online: https://ollama.com (accessed on 9 March 2026).
JetBrains. PyCharm. Available online: https://www.jetbrains.com/pycharm/ (accessed on 9 March 2026).
Jaccard, P. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bull. Soc. Vaudoise Sci. Nat. 1901, 37, 547–579.
Spärck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21.

Figure 1. GradeAgentOps pipeline overview: grader LLM produces a JSON grading artifact that is deterministically verified and canonicalized; verifier-detected violations trigger bounded targeted repair; provenance is logged for auditability.

Figure 2. Distribution of integer score differences between the two expert graders, defined as Δ = E2 − E1, on the final 0–10 score across all 1000 responses.

Figure 3. Main ablation results shown as dual horizontal bar charts for MAE and QWK across the six configurations, ordered by performance.

Figure 4. Per-question change in error between the baseline configuration B0 and the best-performing configuration FULL, defined as ΔMAE = MAE(B0) − MAE(FULL) relative to the reference scorer E2. Positive values indicate lower error for FULL, whereas negative values indicate lower error for B0.

Figure 5. Rates of the three most informative postprocess signals across the six grading configurations: invalid gold evidence, forced completeness, and exhausted semantic repair.

Figure 6. Memory-module effects shown as changes in MAE and QWK across four pairwise comparisons designed to isolate rubric memory and consistency memory. Positive values indicate improvement relative to the corresponding baseline configuration.

Figure 7. Efficiency and operational overhead shown as average item time and average attempts per item across the six grading configurations.

Table 1. Dataset composition, rubric artifacts, and human scoring dimensions.

Component	Summary
Dataset	University-level exam dataset: 100 students, 10 short open-ended questions, 1000 student responses
Item types	Technical and argumentative items
Instance contents	Question text + free-form student answer + instructor-provided grading guidance + human scoring annotations from two independent expert human graders
Data sources	Three sources: JSON file with student answers + instructor-authored rubric artifacts; two CSV files with human grading annotations from the two expert graders
Data integration	Sources merged deterministically by student and question identifiers into one evaluation table (shared across all ablations)
Rubric artifacts	Reference solution + gold points (atomic expected elements; may be weighted) + banned misconceptions (listed for some items)
Rubric purpose	Supports coverage accounting (covered vs. missed expected elements) and evidence-based justification at the level of individual rubric elements
Human reference scores	The two expert graders assign integer criterion-level subscores and a final integer score (0–10)
Human dimensions (technical)	Accuracy; Clarity; Completeness; Terminology
Human dimensions (argumentative)	Clarity; Coherence; Originality; Dialecticality
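
The deterministic integration step summarized in Table 1 can be sketched as a keyed join. This is a minimal illustration only: the field names (`student_id`, `question_id`, `final`) and the function name are hypothetical and are not taken from the GradeAgentOps repository, and the sketch assumes identifiers have already been normalized to the same type across the JSON and CSV sources.

```python
def merge_sources(answers, e1_rows, e2_rows):
    """Join student answers with both graders' final scores on (student, question).

    All three inputs are lists of dicts, for example as produced by json.load
    and csv.DictReader. Field names are hypothetical placeholders.
    """
    def index(rows, name):
        table = {}
        for row in rows:
            key = (row["student_id"], row["question_id"])
            if key in table:
                raise ValueError(f"duplicate {name} row for {key}")
            table[key] = row
        return table

    e1, e2 = index(e1_rows, "E1"), index(e2_rows, "E2")
    merged = []
    # Sorting by key makes the output order independent of the input order.
    for ans in sorted(answers, key=lambda a: (a["student_id"], a["question_id"])):
        key = (ans["student_id"], ans["question_id"])
        if key not in e1 or key not in e2:
            raise KeyError(f"missing human score for {key}")
        merged.append({
            **ans,
            "e1_final": int(e1[key]["final"]),  # CSV values arrive as strings
            "e2_final": int(e2[key]["final"]),
        })
    return merged
```

Failing loudly on duplicate or missing keys, rather than silently dropping rows, is one way to keep a single evaluation table identical across all ablation runs.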

Table 2. Grading contract groups and deterministic verifier actions.

Contract Group	Requirement (Summary)	Verifier Action (Summary)
JSON envelope	Single JSON object with the required fields for the question type.	Reject non-JSON; enforce required fields (schema compliance).
Scores	Integer subscores (per question type) and integer final score (0–10).	Validate ranges; recompute and overwrite inconsistent final totals (canonicalization).
Gold-point coverage	Covered and missed gold-point indices form a valid, rubric-aligned partition.	Validate indices; canonicalize covered/missed into a stable representation.
Evidence (gold points)	Evidence entries aligned with covered points; each evidence span must match the student answer.	Flag mismatches; deterministically move unverifiable covered points to missed.
Banned misconceptions (when defined)	If defined, detected misconceptions must include evidence spans grounded in the student answer.	If evidence is not verifiable, discard misconception detection to avoid unverifiable penalties.
Failure signaling	Separate structural contract issues from semantic violations.	Emit structured postprocess signals used for analysis and targeted repair.
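
Several of the verifier actions in Table 2 can be illustrated with a short sketch. It is written under stated assumptions and is not the verifier shipped in the repository: the field names (`final_score`, `covered`, `missed`, `evidence`), the signal strings, and the exact-substring matching rule are all hypothetical. Recomputation of the final total is omitted because the aggregation rule is not stated in this part of the paper.

```python
import json

REQUIRED_FIELDS = {"final_score", "covered", "missed", "evidence"}  # hypothetical names


def verify_and_canonicalize(raw, student_answer, n_gold_points):
    """Return (artifact, signals); artifact is None when the contract is violated."""
    try:
        artifact = json.loads(raw)
    except json.JSONDecodeError:
        return None, ["contract:not_json"]
    if not isinstance(artifact, dict) or not REQUIRED_FIELDS <= artifact.keys():
        return None, ["contract:missing_fields"]

    score = artifact["final_score"]
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        return None, ["contract:score_out_of_range"]

    covered_raw, evidence = artifact["covered"], artifact["evidence"]
    if not isinstance(covered_raw, list) or not isinstance(evidence, dict):
        return None, ["contract:bad_types"]
    if not all(isinstance(i, int) for i in covered_raw):
        return None, ["contract:bad_indices"]

    valid = set(range(n_gold_points))
    signals, verified = [], set()
    for idx in sorted(set(covered_raw) & valid):
        span = evidence.get(str(idx), "")
        # Assumed rule: an evidence span must occur verbatim in the student answer.
        if isinstance(span, str) and span and span in student_answer:
            verified.add(idx)
        else:
            signals.append(f"semantic:invalid_gold_evidence:{idx}")

    # Canonical form: sorted lists that partition the gold-point indices.
    # Covered points without verifiable evidence are moved to missed.
    artifact["covered"] = sorted(verified)
    artifact["missed"] = sorted(valid - verified)
    artifact["evidence"] = {str(i): evidence[str(i)] for i in artifact["covered"]}
    return artifact, signals
```

Structural failures return no artifact, whereas semantic issues return a canonicalized artifact together with signals, mirroring the separation described in the failure-signaling row.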

Table 3. Ablation configurations used in the experimental protocol.

Config	Rubric Memory	Consistency Memory	Contract Repair Loop	Semantic Repair Loop
B0	Off	Off	Off	Off
R1	Off	Off	On	On
M1	On	Off	Off	Off
C1	Off	On	On	On
M2	On	Off	On	On
FULL	On	On	On	On

Table 4. Human inter-rater reliability between the two expert graders E1 and E2 on the final 0–10 score, reported overall and by item type.

Category	n	Mean E1	Mean E2	Bias E2 − E1	MAE	RMSE	QWK	ICC(2,1)
Overall	1000	4.746	5.415	0.669	1.827	2.296	0.678	0.678
Technical	500	5.190	6.038	0.848	1.828	2.310	0.654	0.655
Argumentative	500	4.302	4.792	0.490	1.826	2.283	0.678	0.678
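
For readers who want to recompute the ICC(2,1) column of Table 4, the following is a minimal sketch of the standard two-way random-effects, absolute-agreement, single-rater coefficient (the Shrout and Fleiss form). The paper does not state here which software produced the reported values, so the function below is an illustrative assumption rather than the authors' implementation.

```python
def icc_2_1(ratings):
    """ICC(2,1) for an n x k table (rows = responses, columns = raters)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between responses
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because this form measures absolute agreement, a systematic offset between raters lowers the coefficient even when their rankings coincide. For example, `icc_2_1([[1, 2], [2, 3], [3, 4]])` evaluates to 2/3 although the second rater is always exactly one point higher, which is relevant when reading Table 4 alongside the positive E2 − E1 bias.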

Table 5. Main ablation results for the six grading configurations, reported against the reference scorer E2 on the final 0–10 score.

Config	Mean Model	Mean E2	Bias Model − E2	MAE	RMSE	QWK	Within ±1	Within ±2
B0	3.874	5.415	−1.541	2.097	2.644	0.628	0.444	0.634
R1	3.864	5.415	−1.551	2.105	2.655	0.624	0.443	0.632
M1	3.905	5.415	−1.510	2.078	2.619	0.631	0.447	0.642
C1	4.124	5.415	−1.291	1.983	2.540	0.648	0.474	0.660
M2	3.892	5.415	−1.523	2.089	2.634	0.626	0.446	0.640
FULL	4.226	5.415	−1.189	1.935	2.500	0.652	0.468	0.667
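
The columns of Table 5 follow standard definitions, which can be sketched in plain Python as below. The function name and the dictionary keys are illustrative choices; the quadratic weights use the usual form w_ij = (i − j)² / (K − 1)² over the K = 11 integer score levels from 0 to 10. Whether the authors computed QWK with this exact code path is not stated, so treat this as a reference sketch.

```python
from math import sqrt


def agreement_metrics(model, reference, min_score=0, max_score=10):
    """Bias, MAE, RMSE, Within ±1/±2, and QWK for two lists of integer scores."""
    n = len(model)
    diffs = [m - r for m, r in zip(model, reference)]

    k = max_score - min_score + 1
    observed = [[0] * k for _ in range(k)]
    for m, r in zip(model, reference):
        observed[m - min_score][r - min_score] += 1
    hist_m = [sum(row) for row in observed]
    hist_r = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]               # weighted observed disagreement
            den += w * hist_m[i] * hist_r[j] / n    # weighted chance disagreement

    return {
        "bias": sum(diffs) / n,                     # model minus reference
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "within_1": sum(abs(d) <= 1 for d in diffs) / n,
        "within_2": sum(abs(d) <= 2 for d in diffs) / n,
        # den is zero only when both lists contain a single identical value.
        "qwk": 1.0 - num / den if den else float("nan"),
    }
```

A call such as `agreement_metrics(model_scores, e2_scores)` returns one row of the table for a given configuration, where `model_scores` and `e2_scores` are hypothetical aligned lists of final scores.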

Table 6. Paired student-cluster bootstrap comparisons between the main grading configurations.

Comparison	Metric	Δ	95% CI	p
FULL vs. B0	MAE	0.162	[0.094, 0.227]	<0.001
FULL vs. B0	QWK	0.024	[0.001, 0.046]	0.043
FULL vs. B0	Within ±2	0.033	[0.011, 0.055]	0.002
C1 vs. B0	MAE	0.114	[0.057, 0.171]	<0.001
C1 vs. B0	QWK	0.021	[0.007, 0.035]	0.005
C1 vs. B0	Within ±2	0.026	[0.006, 0.046]	0.010
FULL vs. C1	MAE	0.048	[−0.005, 0.098]	0.075
FULL vs. C1	QWK	0.003	[−0.016, 0.021]	0.658
FULL vs. C1	Within ±2	0.007	[−0.011, 0.024]	0.451
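
The paired student-cluster bootstrap behind Table 6 can be sketched as follows for the MAE difference. Students, not individual responses, are resampled with replacement, and the same resampled students are used for both configurations so that the comparison stays paired. The number of resamples, the seed, the percentile interval, and the function name are assumptions made for illustration; the p-value construction used in the paper is not reproduced here.

```python
import random


def cluster_bootstrap_mae_gain(rows, n_boot=2000, seed=0):
    """Estimate delta = MAE(a) - MAE(b) with a percentile 95% interval.

    rows: iterable of (student_id, reference, score_a, score_b).
    Positive delta means configuration b is closer to the reference.
    """
    by_student = {}
    for sid, ref, a, b in rows:
        by_student.setdefault(sid, []).append((abs(a - ref), abs(b - ref)))
    students = sorted(by_student)

    def delta(sample):
        # Pool every response of every drawn student (duplicates included).
        pairs = [p for sid in sample for p in by_student[sid]]
        return sum(ea - eb for ea, eb in pairs) / len(pairs)

    rng = random.Random(seed)
    draws = sorted(
        delta(rng.choices(students, k=len(students))) for _ in range(n_boot)
    )
    low = draws[int(0.025 * n_boot)]
    high = draws[int(0.975 * n_boot) - 1]
    return delta(students), low, high
```

Resampling at the student level keeps all ten responses of a student together, which respects the within-student dependence that a response-level bootstrap would ignore.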

Table 7. Performance breakdown by item type for the baseline configuration B0 and the best-performing configuration FULL, reported against the reference scorer E2 on the final 0–10 score.

Category	n	B0 MAE	FULL MAE	ΔMAE (B0 − FULL)	B0 QWK	FULL QWK	ΔQWK (FULL − B0)
Technical	500	2.208	2.118	0.090	0.541	0.538	−0.003
Argumentative	500	1.986	1.752	0.234	0.673	0.717	0.044

Table 8. Verifier and postprocess outcomes across the six grading configurations, including Pass@1, contract-fail item rate, and selected postprocess signal rates.

Config	Pass@1	Contract-Fail Rate	Invalid Gold Evidence	Forced Completeness	Short-Answer Gate	Banned Evidence Dropped	Semantic Exhausted
B0	1.000	0.000	0.057	0.122	0.029	0.001	0.000
R1	0.943	0.000	0.025	0.121	0.029	0.001	0.025
M1	1.000	0.000	0.059	0.128	0.029	0.011	0.000
C1	0.941	0.001	0.013	0.085	0.029	0.004	0.013
M2	0.941	0.000	0.022	0.129	0.029	0.009	0.022
FULL	0.927	0.001	0.010	0.117	0.029	0.009	0.010

Table 9. Pairwise comparison of memory-module effects, showing the impact of rubric memory and consistency memory on MAE and QWK relative to the reference scorer E2.

Comparison	Module Effect	Baseline MAE	Augmented MAE	ΔMAE	Baseline QWK	Augmented QWK	ΔQWK
B0 → M1	Rubric memory, no repair	2.097	2.078	0.019	0.628	0.631	0.003
R1 → C1	Consistency memory, with repair	2.105	1.983	0.122	0.624	0.648	0.024
R1 → M2	Rubric memory, with repair	2.105	2.089	0.016	0.624	0.626	0.002
M2 → FULL	Consistency memory over rubric memory	2.089	1.935	0.154	0.626	0.652	0.026
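
The logic of the four comparisons in Table 9 is that each pair of configurations differs in exactly one memory module. The sketch below encodes the switches from Table 3 and checks that property. The `Config` type and the helper name are illustrative, and the two repair loops are collapsed into one flag because Table 3 always toggles them together.

```python
from typing import NamedTuple


class Config(NamedTuple):
    rubric_memory: bool
    consistency_memory: bool
    repair: bool  # contract and semantic repair loops, toggled together in Table 3


# Switch settings copied from Table 3.
CONFIGS = {
    "B0": Config(False, False, False),
    "R1": Config(False, False, True),
    "M1": Config(True, False, False),
    "C1": Config(False, True, True),
    "M2": Config(True, False, True),
    "FULL": Config(True, True, True),
}


def changed_switches(base, augmented):
    """Names of the switches that differ between two configurations."""
    a, b = CONFIGS[base], CONFIGS[augmented]
    return [f for f in Config._fields if getattr(a, f) != getattr(b, f)]


for pair in [("B0", "M1"), ("R1", "C1"), ("R1", "M2"), ("M2", "FULL")]:
    print(pair, changed_switches(*pair))
```

By contrast, `changed_switches("B0", "FULL")` lists all three switches, so the FULL versus B0 comparison measures the combined effect of the pipeline rather than any single component.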

Table 10. Efficiency and operational overhead across the six grading configurations, including total wall time, average item time, average attempts per item, and repair rate.

Config	Wall Time (h)	Avg Item Time (s)	Avg Attempts/Item	Repair Rate
B0	10.431	37.551	1.000	0.000
R1	11.207	40.345	1.090	0.090
M1	10.674	38.425	1.000	0.000
C1	11.478	41.320	1.078	0.078
M2	11.265	40.555	1.091	0.091
FULL	11.740	42.263	1.093	0.093
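
The overhead figures in Table 10 can be put on a relative scale with a few lines of arithmetic. The values below are copied from the table; the helper names are illustrative. The second helper reflects an observation rather than a documented definition: the reported average item times are numerically consistent with total wall time divided by the 1000 graded responses.

```python
# (wall time in hours, average item time in seconds), copied from Table 10.
TABLE_10 = {
    "B0": (10.431, 37.551),
    "R1": (11.207, 40.345),
    "M1": (10.674, 38.425),
    "C1": (11.478, 41.320),
    "M2": (11.265, 40.555),
    "FULL": (11.740, 42.263),
}
N_ITEMS = 1000  # responses graded per configuration


def relative_overhead(config, baseline="B0"):
    """Fractional increase in average item time over the baseline."""
    return TABLE_10[config][1] / TABLE_10[baseline][1] - 1.0


def item_time_from_wall(config):
    """Average seconds per item implied by the total wall time."""
    return TABLE_10[config][0] * 3600 / N_ITEMS


for name in TABLE_10:
    print(f"{name}: {relative_overhead(name):+.1%} vs. B0")
```

On these numbers, FULL costs roughly 12.5% more time per item than B0, and C1 roughly 10%, which gives a concrete reading of the moderate operational cost.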

Table 11. Representative grading cases illustrating improvement, stability, cleaner postprocess behavior, and residual failure when comparing the baseline configuration B0 with the best-performing configuration FULL against the reference scorer E2, including question focus and absolute deviations from the reference score.

Case	Question Focus	Item Type	E2	B0	FULL	|B0 − E2|	|FULL − E2|	Main Pattern
1	Turing Test	Argumentative	10	6	10	4	0	Clear gain
2	Naive Bayes	Technical	10	4	7	6	3	Partial gain
3	Version control	Technical	8	5	5	3	3	Cleaner output
4	Facial recognition	Technical	1	1	5	0	4	Baseline closer to E2

