1. Introduction
Large Language Models (LLMs) are increasingly adopted for grading short, open-ended exam responses at scale [1]. However, real-world deployment requires more than producing a plausible score: grading outputs must follow a strict scoring contract, justify awarded rubric points with evidence grounded in the student’s answer, and remain auditable under re-evaluation [2]. This paper presents GradeAgentOps, a verification-first pipeline that enforces a rubric-aligned JSON (JavaScript Object Notation) [3] contract, applies deterministic validation of evidence and point coverage, and triggers targeted semantic repair when verification fails. All prompts, model outputs, repair attempts, and timing metadata are recorded in Neo4j Desktop v2.1.4 [4] to enable reproducible experiments and systematic error analysis. We evaluate GradeAgentOps on a 1000-answer exam dataset through ablations that isolate the effects of semantic repair, rubric memory, and consistency memory.
1.1. Background and Motivation
Rubric-based assessment is a long-standing instrument for improving scoring transparency and supporting consistent evaluation, and meta-analytic evidence has quantified its effects on academic performance and related learning outcomes [5]. Building on this foundation, recent work has examined LLM-based automatic short answer grading in higher-education settings, reporting both opportunities and practical constraints when applying generative models to criterion-driven scoring [6]. Importantly, recent criterion-based grading studies show that agreement with human scores and consistency across evaluations can diverge, which complicates calibration and the interpretation of automated grades even when criteria are explicit [7]. Complementary reliability analyses further indicate that LLM evaluators may omit crucial criteria or introduce unnecessary criteria, leading to systematic deviations from expert expectations under complex evaluation requirements [2].
These empirical observations highlight several risks that are especially relevant for rubric-driven grading pipelines. First, criterion drift can occur when an evaluator operationalizes the task using a shifted or incomplete set of criteria, which is consistent with evidence that LLM evaluators may add or omit evaluation dimensions relative to expert judgments [2]. Second, comparative grading and ranking setups introduce additional instability: position bias has been documented in LLM-based evaluators and can distort pairwise judgments, with downstream consequences for score stability [8]. Third, grading justifications are only useful if they are faithful and well grounded; however, the faithfulness of model explanations remains a recognized challenge in Natural Language Processing (NLP), which motivates explicit mechanisms for checking and constraining explanatory evidence in high-stakes decision settings such as assessment [9].
From a systems perspective, these issues motivate grading pipelines that produce checkable scoring artifacts and support rigorous auditing. In particular, reproducible deployment benefits from capturing the full decision process (inputs, intermediate outputs, and evaluation metadata) in a structured provenance record that enables post hoc analysis and controlled experimentation. End-to-end provenance capture has been proposed for machine learning pipelines precisely to support reproducibility, debugging, and accountability across runs and evolving system components [10]. Related AI-based decision-support studies in other operational domains also emphasize that predictive performance must be coupled with robust pipeline design when model outputs inform real-world decisions [11].
1.2. Related Work
Recent work shows that LLM-powered automated assessment has rapidly diversified across task types (e.g., short answers, essays, and domain-specific assessments), while repeatedly flagging reliability and validation as central deployment constraints [12]. In short-answer grading, empirical studies in higher-education settings have evaluated LLM-generated grades against instructor judgments and analyzed factors affecting agreement and consistency [6]. Classroom deployments have also reported end-to-end workflows where LLMs grade text-based assignment items and provide feedback in real course settings [1]. At larger scale, work on Massive Open Online Courses (MOOCs) has examined whether LLMs can replace peer grading by combining rubric- and key-based prompting strategies to improve alignment with instructor scores [13]. Beyond “natural language only” responses, LLMs have been applied to grading short answers in software engineering courses, including approaches that combine embeddings with LLM completions to broaden acceptable answer variants [14]. LLM-assisted assessment has additionally been explored for programming assignments, where instructor-aided pipelines integrate multiple models to streamline grading while aiming to reduce time cost and inconsistency [15].
A second body of work focuses on how LLM grades should be interpreted and validated under criterion-driven scoring. Criterion-based grading studies show that human-score agreement and evaluator consistency can diverge, implying that prompt/rubric designs must be evaluated with metrics that separate these dimensions [7]. In large-scale writing assessment, psychometric frameworks such as generalizability theory and many-facet Rasch modeling have been used to compare human and LLM raters and to characterize severity and stability differences across scoring regimes [16]. Related rubric-guided evaluation research proposes multidimensional, calibrated prompting schemes where rubric questions define evaluation dimensions and judge-aware modeling is used to better match human annotation patterns [17]. Reliability analyses of LLM evaluators further report that judge models may omit crucial criteria or introduce unnecessary criteria, producing subtle but systematic deviations from expert expectations on complex requirements [2].
LLM-as-a-judge research also documents that evaluator outputs can be systematically biased, motivating both diagnostics and debiasing strategies. A dedicated benchmark for cognitive biases in LLM evaluators reports multiple bias types, including egocentric bias, raising concerns about robustness when LLMs are used as general-purpose judges [18]. Complementing this, a large-scale empirical study of position bias evaluates pairwise and list-wise judging setups and introduces bias metrics (e.g., position consistency and preference fairness) to characterize when order effects compromise reliability [19]. Recent work presented at the Annual Meeting of the Association for Computational Linguistics (ACL) proposes strengthening judge reasoning by injecting additional ‘crowd’ comparison responses, aiming to produce more comprehensive judgments and improve evaluation reliability beyond majority voting or criteria expansion alone [20].
Finally, reproducible and auditable assessment pipelines overlap with broader systems work on evaluator consistency and provenance capture. In education-specific settings, LLMs-as-evaluators have been experimentally studied for feedback consistency and inter-model agreement, emphasizing that evaluator selection and prompting materially influence reliability [21]. From a pipeline perspective, provenance models have been proposed to capture end-to-end traces of Machine Learning (ML) workflows (including artifacts and relationships) to enable querying, debugging, and repeatable experimentation [10]. Complementary provenance systems for deep-learning workflows argue for database-backed provenance graphs (rather than ad hoc logs) to support traceability across data preparation, training, and evaluation with low runtime overhead [22].
1.3. Research Gap and Contributions
Despite rapid progress in LLM-based grading and evaluation, the literature still leaves a practical gap between reporting score-level performance and building deployable exam-grading systems that can be audited and debugged over datasets comprising 100 students, 10 questions, and 1000 graded responses. A recurring issue is that LLM-based evaluators can conflate or misapply evaluation criteria, which undermines reliability even when criteria are explicitly stated. This makes it difficult to interpret improvements reported in isolation (e.g., a better prompt) without an accompanying mechanism that can detect and localize failures in a structured way [23].
A second gap concerns what a grading system should do when verification fails. In practice, many pipelines either accept imperfect outputs or regenerate the entire evaluation, which can introduce additional variance and obscure the root cause of errors. However, research on LLM self-correction shows that refinement is not uniformly reliable across tasks and conditions; success depends on the availability of trustworthy feedback signals and well-defined correction targets. This motivates a controlled repair policy that is triggered by specific, machine-detectable violations rather than unconstrained regeneration [24].
A third gap is the lack of a standard way to guarantee strict output compliance for integration into software systems. Prompting alone is often insufficient to enforce formal constraints (e.g., grammar- or schema-level requirements) in a robust manner, and recent work on constrained decoding emphasizes that enforcing strict structural constraints during generation is both feasible and beneficial for structured outputs. Moreover, as grading pipelines become agentic and multi-step, they resemble structured language model programs with multiple generation calls and structured I/O, making the system aspect (execution, tracing, and controllable decoding) an integral part of the solution [25, 26].
Finally, while provenance has been widely recognized as essential for reproducibility and accountability in machine learning workflows, exam grading introduces an acute need for run-level traceability: beyond the final score, the system must expose the sequence of prompts, model outputs, retries, and timing that produced the decision. Provenance capture frameworks and practical tooling for ML workflows support exactly this form of reproducible experimentation and systematic debugging, but they are rarely integrated as first-class components in exam-grading systems [27].
GradeAgentOps is a verification-first framework for evidence-anchored LLM exam grading. Its novelty lies in operationalizing grading as a structured, checkable, and auditable decision process rather than a single prompt-response interaction. The framework integrates strict rubric-aligned output contracts, deterministic verification and canonicalization, explicit evidence-grounding checks, verifier-triggered bounded repair, optional memory-based calibration, and provenance-aware logging into a grading-specific control architecture. Each grading decision is represented as a machine-checkable artifact whose score, rubric-point coverage, supporting evidence, validation status, and repair history can be inspected and reproduced. This design makes GradeAgentOps a systems-level contribution for reliable educational assessment, with the ablation protocol isolating how repair and memory components affect agreement, evidence validity, output compliance, and operational cost. In this context, GradeAgentOps contributes a verification-first architecture for reliable LLM-based exam grading through the following elements:
First, we define a strict rubric-aligned JSON contract that represents grades as checkable artifacts, including integer-only scoring, explicit covered/missed rubric-point partitions, and evidence fields for covered points.
Second, we introduce a deterministic verifier that enforces schema and range constraints, recomputes totals from subscores, and canonicalizes rubric-point partitions into a stable representation, producing an explicit taxonomy of structural vs. semantic failures.
Third, we operationalize evidence faithfulness as an explicit obligation by automatically checking evidence spans against the student response and emitting structured postprocess signals when evidence is invalid.
Fourth, we propose a two-model repair policy triggered only by verifier-detected violations, with a dedicated semantic repair budget that corrects offending fields while preserving already-valid parts of the artifact to reduce regeneration variance.
Fifth, we integrate two optional, independently ablated modules: rubric memory that selects a compact subset of rubric content for each response, and consistency memory that retrieves prior same-question exemplars to stabilize grading.
Sixth, we record prompts, model outputs, repair attempts, and timing metadata in Neo4j, enabling auditability, reproducible experimentation, and fine-grained analyses of failures and cost.
Seventh, we evaluate on a 1000-answer university exam dataset with instructor rubrics and human scoring, using the ablation suite (B0, R1, C1, M1, M2, FULL) and reporting compliance, evidence validity, human agreement, and computational cost.
2. Materials and Methods
This section describes the dataset, grading rubric artifacts, and experimental protocol used to evaluate GradeAgentOps. We then detail the GradeAgentOps pipeline, including the strict grading contract, deterministic verification and canonicalization procedures, the targeted semantic repair policy, optional memory modules, and the Neo4j-based provenance logging layer that enables reproducible ablations and fine-grained error analysis.
2.1. Dataset and Human Reference Scores
We evaluate GradeAgentOps on a university-level exam dataset comprising 100 students answering 10 short, open-ended questions, resulting in 1000 student responses. The dataset contains both technical and argumentative items. Each instance includes the question text and a free-form student answer, together with instructor-provided grading guidance and human scoring annotations from two independent expert human graders (E1 and E2).
The data are stored in three sources. The first is a JSON file containing the student answers and the instructor-authored rubric artifacts for each question, while the second and third are Comma-Separated Values (CSV) files containing human grading annotations from the two expert graders. These sources are merged deterministically by student and question identifiers to form a single evaluation table used across all experiments, ensuring that all ablations operate on identical inputs and human reference scores.
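As an illustration of the deterministic merge described above, the following minimal Python sketch joins the answer file with the two grader files on the student and question identifiers. The field and column names (`answers`, `student_id`, `question_id`, `final`) are hypothetical, since the actual file layout is not specified here; the sketch only shows one way to make the join order-independent and to fail loudly when a human score is missing.

```python
import csv
import io
import json


def merge_sources(answers_json: str, e1_csv: str, e2_csv: str) -> list[dict]:
    """Join answers with both graders' final scores on (student_id, question_id)."""

    def load_scores(text: str) -> dict:
        # Column names are illustrative assumptions, not the dataset's real headers.
        rows = csv.DictReader(io.StringIO(text))
        return {(r["student_id"], r["question_id"]): int(r["final"]) for r in rows}

    e1, e2 = load_scores(e1_csv), load_scores(e2_csv)
    table = []
    for item in json.loads(answers_json)["answers"]:
        key = (item["student_id"], item["question_id"])
        if key not in e1 or key not in e2:
            raise KeyError(f"missing human score for {key}")
        table.append({**item, "e1_final": e1[key], "e2_final": e2[key]})
    # Sorting by the join key makes the row order independent of file order.
    return sorted(table, key=lambda r: (r["student_id"], r["question_id"]))
```

Sorting on the join key is one simple way to guarantee that every ablation sees the same rows in the same order, which matters later for order-dependent components such as the consistency memory.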
Table 1 summarizes the dataset composition, the available rubric artifacts, and the human scoring dimensions used by the two expert graders throughout our experiments. This compact overview anchors the dataset description before detailing the rubric structure and scoring procedure.
Instructor grading guidance is represented by a reference solution and a rubric decomposed into atomic expected elements, referred to as gold points. Gold points enumerate the key concepts that should be present in a correct answer and may be weighted to reflect their relative importance. In addition, some items include explicitly listed misconceptions, referred to as banned misconceptions, that capture common incorrect statements; if a banned misconception is present in the student response, it should be penalized. Together, gold points and banned misconceptions provide a structured, point-level rubric that supports both coverage accounting—identifying which expected elements are addressed versus missed—and evidence-based justification at the granularity of individual rubric elements.
Human reference scores are provided by the two expert graders, who assign integer criterion-level subscores and a final integer score on a 0–10 scale. Technical items are scored using the dimensions accuracy, clarity, completeness, and terminology, while argumentative items are scored using clarity, coherence, originality, and dialecticality. The final E2 score serves as the primary reference for agreement analyses, providing a fixed operational reference for comparing all pipeline configurations under identical conditions, while E1 is used to quantify inter-rater reliability between the two graders. This design allows the model-human results to be interpreted together with the observed human–human variability, rather than treating the human reference as an error-free ground truth. The subscore breakdown enables finer-grained analysis of where and why automated grading decisions differ.
2.2. GradeAgentOps Pipeline and Experimental Setup
GradeAgentOps is a verification-first grading pipeline that produces structured, audit-ready grading artifacts. For each student response, the pipeline starts from a fixed set of inputs: the question statement, the student’s free-form answer, and the instructor-provided rubric artifacts (reference answer, gold points, and—when applicable—banned misconceptions). The pipeline is designed to ensure that every scoring decision can be reproduced and inspected, and that common failure modes are detected explicitly rather than hidden behind free-form explanations.
In the implementation used in this study, the primary grading stage employed the Llama 3.3 70B Instruct model [28], using the 4-bit quantized variant llama3.3:70b-instruct-q4_K_M, while targeted semantic repair employed the Qwen 2.5 14B Instruct model [29], using the 4-bit quantized variant qwen2.5:14b-instruct-q4_K_M, as a separate repair model. Both models were used as pre-trained instruction-following LLMs served locally through Ollama [30] and were not additionally trained, fine-tuned, or otherwise adapted on the exam dataset used in this study. Thus, the empirical results reported here characterize this specific model pairing and deployment setup, while the GradeAgentOps control architecture itself is not tied to these particular models. Inference was executed with fixed generation settings consisting of temperature 0, seed 42, top-p 0.9, top-k 40, and a context length of 2048 tokens. All experiments were carried out on a dedicated virtual machine running 64-bit Windows 11, using Python 3.11 in the PyCharm v2026.1.2 development environment [31]. The virtual machine was equipped with an AMD EPYC 9654 96-core processor at 2.40 GHz, 128 GB of RAM, a 3 TB SSD, and an NVIDIA L40S-48Q GPU with 48 GB of VRAM.
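For concreteness, the fixed generation settings can be expressed as a request body for a locally served Ollama model. The sketch below is only an assumption about how such a call might be configured: the choice of the `/api/generate` endpoint, the `format` value, and the helper name `build_generate_request` are illustrative and are not confirmed by the implementation described here. The option keys (`temperature`, `seed`, `top_p`, `top_k`, `num_ctx`) follow Ollama's documented parameter names.

```python
import json


def build_generate_request(model: str, prompt: str) -> dict:
    """Build a JSON body for a POST to Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",   # assumption: ask the server for JSON-only output
        "stream": False,    # return one complete response instead of chunks
        "options": {
            "temperature": 0,
            "seed": 42,
            "top_p": 0.9,
            "top_k": 40,
            "num_ctx": 2048,  # context length in tokens
        },
    }


body = build_generate_request("llama3.3:70b-instruct-q4_K_M", "Grade this answer ...")
print(json.dumps(body, indent=2))
```

Keeping the settings in a single builder function makes it easy to log the exact decoding configuration alongside each attempt.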
The end-to-end processing flow is summarized in Figure 1. The diagram emphasizes the verification-first control path, the bounded targeted repair loop triggered only by verifier signals, and the provenance logging used for auditability.
At execution time, GradeAgentOps proceeds in a staged manner. First, it performs lightweight deterministic handling for degenerate cases (e.g., empty or extremely short responses) to avoid unnecessary model calls when an answer clearly cannot satisfy rubric requirements. For non-degenerate responses, the system constructs a grading prompt that encodes the rubric-based requirements and requests a strict JSON grading artifact from a grader LLM. The generated artifact includes criterion-level subscores and a final score, together with point-level rubric coverage information and supporting evidence intended to justify awarded credit at the granularity of individual gold points. When enabled, rubric memory and consistency memory may be injected at this stage: rubric memory compresses the rubric context to a small set of highly relevant elements for the current response, while consistency memory provides same-question exemplars that act as calibration anchors.
The grader output is then processed by a deterministic verification and canonicalization stage. This stage enforces the grading contract and produces a stable representation that downstream analyses can rely on. In addition to structural checks (e.g., required fields and integer ranges), the verifier performs semantic checks that are central to trustworthy grading, most notably verifying that supporting evidence is grounded in the student’s answer and that rubric coverage is internally consistent. Verification produces structured failure signals that explicitly separate structural contract violations from semantic violations, enabling measurable error analysis and targeted intervention.
When verification detects semantic violations that cannot be resolved by canonicalization alone—such as unverifiable evidence—GradeAgentOps triggers a targeted semantic repair loop using a separate repair LLM. Repair attempts are bounded by a fixed budget and are guided by explicit verifier signals describing what must be corrected. The repair policy is designed to modify only the offending parts of the artifact while preserving already-valid content whenever possible, which reduces variance relative to full regeneration and makes the repair process easier to audit.
Throughout the grading process, GradeAgentOps records a complete provenance trace in Neo4j. For each answer and each attempt (initial grading and any repairs), the system logs the prompt, the raw LLM output, validation outcomes, structured failure signals, and timing information. The final validated grading artifact, together with its postprocess metadata, is exported as a run-level JSON output used for downstream quantitative evaluation and ablation analysis.
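A per-attempt provenance record of this kind could be written with a single parameterized Cypher statement. The sketch below is an assumption about one possible graph layout: the node labels (`Answer`, `Attempt`), the relationship type (`HAS_ATTEMPT`), and all property names are hypothetical and are not taken from the actual schema. The code only builds the query parameters and does not open a database connection.

```python
# Hypothetical graph layout: one Answer node per (student, question),
# with one Attempt node per grading or repair call.
LOG_ATTEMPT = """
MERGE (a:Answer {student_id: $student_id, question_id: $question_id})
CREATE (t:Attempt {
  idx: $idx, kind: $kind, prompt: $prompt, raw_output: $raw_output,
  valid: $valid, signals: $signals, latency_ms: $latency_ms
})
CREATE (a)-[:HAS_ATTEMPT]->(t)
"""


def attempt_params(student_id: str, question_id: str, idx: int, kind: str,
                   prompt: str, raw_output: str, signals: list[str],
                   latency_ms: float) -> dict:
    """Collect the parameters for LOG_ATTEMPT; an attempt is valid if it has no signals."""
    return {
        "student_id": student_id, "question_id": question_id,
        "idx": idx, "kind": kind, "prompt": prompt, "raw_output": raw_output,
        "valid": not signals, "signals": list(signals),
        "latency_ms": float(latency_ms),
    }
```

With the official Neo4j Python driver, the statement and parameters could then be passed to `session.run(LOG_ATTEMPT, params)`; using `MERGE` for the answer node keeps repeated attempts attached to a single node.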
2.3. Verification-First Grading Contract and Deterministic Verifier
GradeAgentOps treats each grading decision as a structured artifact governed by a strict grading contract. The grader LLM is instructed to output a single JSON object following a fixed schema. The schema encodes (i) criterion-level integer subscores, (ii) an integer final score, (iii) an explicit decision about rubric-point coverage, and (iv) supporting evidence used to justify awarded credit at the level of individual rubric elements. This design makes grading outputs machine-checkable and suitable for downstream automation and auditing.
The contract is rubric-aligned and supports two question types. For technical items, the artifact contains integer subscores for accuracy, clarity, completeness, and terminology; for argumentative items, it contains integer subscores for clarity, coherence, originality, and dialecticality. In both cases, the final score is represented as an integer on a 0–10 scale and is expected to be consistent with the subscore definition. In addition to scores, the artifact includes a point-level rubric coverage representation: a list of covered gold-point indices and a complementary list of missed gold-point indices. To justify each covered gold point, the artifact includes an evidence list aligned with the covered list, where each evidence entry is intended to correspond to the respective covered point. The artifact also includes a list of detected misconceptions when such misconceptions are defined for the question, together with evidence spans, and a short free-text rationale.
Table 2 provides a compact summary of the grading contract enforced by GradeAgentOps and the corresponding deterministic verifier actions. It groups constraints into core contract components (JSON envelope, scores, coverage, and evidence) and indicates whether violations are handled by canonicalization, deterministic artifact updates, or explicit postprocess failure signals.
Outputs are processed by a deterministic verifier that performs both structural and semantic checks. Structural verification enforces that the output is valid JSON, that the top-level object contains exactly the required fields, and that all rubric subscores are integers within their allowed ranges. The verifier recomputes the final score from the reported subscores and overwrites inconsistent totals to ensure that the artifact remains internally consistent and comparable across runs. The verifier also enforces well-formed rubric-point partitions by validating point indices against the rubric definition and canonicalizing the covered and missed sets into a stable representation.
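The structural checks described above can be illustrated with a short sketch. The field names, the signal labels, and in particular the `aggregate` function that maps subscores to a final score are assumptions: the exact scoring formula and subscore ranges are not given in this section, so both are passed in as parameters rather than fixed.

```python
import json

REQUIRED = {"subscores", "final", "covered", "missed", "evidence", "rationale"}  # hypothetical


def verify_structure(raw, criteria, n_gold, max_sub, aggregate):
    """Return (artifact, signals); artifact is None when the contract is violated."""
    try:
        art = json.loads(raw)
    except json.JSONDecodeError:
        return None, ["invalid_json"]
    if not isinstance(art, dict) or set(art) != REQUIRED:
        return None, ["field_mismatch"]
    subs = art["subscores"]
    if not isinstance(subs, dict) or set(subs) != set(criteria):
        return None, ["subscore_mismatch"]
    for c in criteria:
        v = subs[c]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(v, bool) or not isinstance(v, int) or not 0 <= v <= max_sub:
            return None, [f"subscore_out_of_range:{c}"]
    signals = []
    total = aggregate(subs)  # recompute the total instead of trusting the model
    if art["final"] != total:
        signals.append("final_overwritten")
        art["final"] = total
    cov, ev = art["covered"], art["evidence"]
    if not (isinstance(cov, list) and isinstance(ev, list) and len(cov) == len(ev)):
        return None, ["coverage_evidence_misaligned"]
    by_point = {}
    for i, e in zip(cov, ev):
        if isinstance(i, int) and not isinstance(i, bool) and 0 <= i < n_gold:
            by_point.setdefault(i, e)  # keep the first evidence for duplicate indices
        else:
            signals.append(f"invalid_point_index:{i}")
    # Canonical form: sorted covered indices, aligned evidence, missed = complement.
    art["covered"] = sorted(by_point)
    art["evidence"] = [by_point[i] for i in art["covered"]]
    art["missed"] = [i for i in range(n_gold) if i not in by_point]
    return art, signals
```

In this sketch the missed list is always rebuilt as the complement of the covered list, so the two partitions cannot overlap or leave a gold point unaccounted for, regardless of what the model reported.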
Semantic verification focuses on evidence and consistency. For each covered gold point, the associated evidence span must be verifiable against the student response; in GradeAgentOps, evidence is treated as an explicit obligation rather than an optional explanation. Evidence strings that cannot be matched to the student answer are flagged as semantic violations. In the implementation evaluated here, this evidence check is intentionally conservative and primarily verifies textual grounding: the evidence span must be recoverable from the student answer, after deterministic normalization, rather than inferred from the reference answer, rubric text, or memory context. When such violations occur, the verifier deterministically updates the grading artifact so that only rubric points with verifiable evidence remain marked as covered, and points whose evidence cannot be verified are moved to missed. An analogous evidence check is applied to detected misconceptions when present: misconception evidence must also be grounded in the student answer, otherwise the misconception detection is discarded to avoid unverifiable penalties. This design improves auditability by preventing unsupported evidence claims, but it may under-recognize semantically valid paraphrases when the supporting idea is expressed without a close textual span.
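A minimal sketch of this conservative grounding check is given below. The normalization steps (Unicode NFKC, case folding, whitespace collapsing) and the signal label are assumptions chosen for illustration; the text states only that the normalization is deterministic and that evidence must be recoverable from the student answer.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Deterministic normalization: NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def ground_evidence(art: dict, student_answer: str) -> tuple[dict, list[str]]:
    """Keep only covered points whose evidence is a substring of the answer."""
    haystack = normalize(student_answer)
    kept, signals = [], []
    for idx, ev in zip(art["covered"], art["evidence"]):
        needle = normalize(ev) if isinstance(ev, str) else ""
        if needle and needle in haystack:
            kept.append((idx, ev))
        else:
            signals.append(f"evidence_mismatch:{idx}")
    kept_ids = {i for i, _ in kept}
    dropped = [i for i in art["covered"] if i not in kept_ids]
    out = dict(art)  # shallow copy so the input artifact is left unchanged
    out["covered"] = [i for i, _ in kept]
    out["evidence"] = [e for _, e in kept]
    out["missed"] = sorted(set(art["missed"]) | set(dropped))
    return out, signals
```

A substring test of this kind accepts only near-verbatim spans, which illustrates the trade-off noted above: unsupported claims are rejected reliably, but a correct paraphrase without a matching span would also be rejected.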
Finally, the verifier emits structured postprocess signals that serve as a failure taxonomy for downstream analysis and targeted repair. These signals distinguish structural contract issues from semantic violations (most notably evidence mismatches) and enable measurable reporting of failure modes at scale. This separation is critical for the targeted repair policy introduced next, which triggers corrective actions only in response to explicit verifier-detected violations.
2.4. Targeted Semantic Repair Policy
GradeAgentOps applies a two-stage control logic to handle failures after an initial grading attempt: deterministic verification and canonicalization of the generated grading artifact, followed by a bounded repair loop when the verifier signals that the artifact is unusable or semantically invalid. The design goal is to avoid treating grading as unconstrained regeneration. Instead, repair is triggered only by explicit, machine-detectable violations and is guided by structured failure information emitted by the verifier.
We distinguish two classes of failures. Structural contract failures occur when the LLM output does not satisfy the required JSON grading contract, for example because the output is not valid JSON, required fields are missing, fields have incorrect types, or integer scores fall outside allowed ranges. Semantic violations occur when the output is structurally valid but fails verifiability constraints, most notably when evidence associated with covered rubric points cannot be grounded in the student response. This separation is important because structural failures primarily block automation, whereas semantic violations undermine trust and auditability even when the artifact is machine-readable.
Repair is performed using a separate repair LLM and is executed under fixed budgets to keep computational cost predictable. For structural contract failures, the system performs up to a configurable number of repair attempts. For semantic violations, the system can enable a dedicated semantic repair mode with its own maximum number of attempts. In both cases, each repair attempt is required to return the same strict grading contract, ensuring that repaired outputs remain compatible with downstream processing and analysis.
Each repair attempt is guided by explicit verifier feedback. The repair prompt includes the original grading context (question, student answer, and rubric artifacts), the verifier’s failure description, and the previous LLM output. The failure description indicates what must be corrected—such as contract mismatch or evidence mismatch—and localizes problematic elements when applicable (e.g., which evidence entries are invalid). This design turns the verifier into a deterministic supervisor that provides reliable correction targets.
The repair policy is targeted by construction: it aims to correct only the offending parts of the grading artifact while preserving already-valid content. For example, when semantic verification flags invalid evidence, repair focuses on producing verifiable evidence and consistent coverage decisions rather than re-evaluating the entire response and unnecessarily altering subscores. If semantic repair is exhausted but a structurally valid artifact exists, the pipeline may accept the last valid artifact while retaining the semantic failure signals in the postprocess metadata, enabling later quantitative analysis of remaining semantic violations.
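The bounded, verifier-driven control flow can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single budget rather than separate structural and semantic budgets, and `verify` and `repair` are hypothetical callables standing in for the deterministic verifier and the repair LLM call.

```python
from typing import Callable, Optional

Verifier = Callable[[str], tuple[Optional[dict], list[str]]]
Repairer = Callable[[str, list[str]], str]


def grade_with_repair(draft: str, verify: Verifier, repair: Repairer, budget: int = 2):
    """Verify a draft and repair it at most `budget` times.

    Returns (artifact, remaining_signals, repairs_used). If the budget is
    exhausted, the last structurally valid artifact is returned together
    with the signals that were raised for it.
    """
    raw = draft
    best, best_signals, signals = None, [], []
    for attempt in range(budget + 1):
        artifact, signals = verify(raw)
        if artifact is not None:
            best, best_signals = artifact, signals
        if not signals:
            return artifact, [], attempt
        if attempt < budget:
            # The verifier's signals are passed on as explicit correction targets.
            raw = repair(raw, signals)
    return best, (best_signals if best is not None else signals), budget
```

Because the loop is driven only by verifier signals, a clean first draft incurs no repair call at all, and the number of additional model calls per answer is bounded by the budget.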
2.5. Memory Modules: Rubric Memory and Consistency Memory
GradeAgentOps optionally augments prompt construction with two deterministic, retrieval-based memory modules that can be enabled independently in ablation studies. The objective is to improve focus and stability without adding generative steps and without weakening the evidence-grounding obligations imposed by the grading contract. One module performs rubric-focused retrieval by selecting a bounded subset of rubric elements that are lexically related to the current student answer, reducing prompt bloat while keeping the most relevant rubric constraints salient. The other module provides within-run calibration by retrieving previously graded exemplars for the same question that are similar to the current response, enabling more consistent scoring across comparable answers. In both cases, retrieved memory is treated strictly as guidance. Awarded credit must still be justified by evidence spans copied from the current student answer, and evidence must not be taken from the reference solution, rubric text, or memory content.
Rubric-focused retrieval constructs a per-question index of short memory cards derived from the instructor rubric artifacts. Cards are created from gold points and, when present, banned misconceptions. Each card is tokenized deterministically into lowercase alphanumeric tokens, and an Inverse Document Frequency (IDF) weighting is computed over the card collection so that common tokens contribute less than discriminative tokens. For a given student answer, the module tokenizes the answer and ranks rubric cards by an IDF-weighted token overlap score computed between the answer token set and the card token set. Only cards with strictly positive overlap are retained, and ranking is performed deterministically with a stable tie-breaking rule. The injected context is bounded by fixed top-K limits to ensure predictable prompt size. The default configuration injects the top-ranked three gold-point cards and the top-ranked two misconception cards as attention guides.
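The ranking step can be illustrated with a short sketch. The smoothed IDF formula and the tie-breaking rule (lower card index first) are assumptions, since the exact variants are not specified here; the tokenization and the positive-overlap filter follow the description above.

```python
import math
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def idf_weights(cards: list[str]) -> dict[str, float]:
    """Smoothed IDF over the card collection (assumed formula)."""
    n = len(cards)
    df: dict[str, int] = {}
    for card in cards:
        for t in tokens(card):
            df[t] = df.get(t, 0) + 1
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def top_cards(answer: str, cards: list[str], k: int = 3) -> list[int]:
    """Indices of the k cards with the highest IDF-weighted overlap with the answer."""
    idf = idf_weights(cards)
    ans = tokens(answer)
    scored = []
    for i, card in enumerate(cards):
        # Summing in sorted token order keeps the float result reproducible.
        score = sum(idf[t] for t in sorted(tokens(card) & ans))
        if score > 0:
            scored.append((-score, i))  # ties are broken by the lower card index
    return [i for _, i in sorted(scored)[:k]]
```

With the default limits described above, such a function would be called once with the gold-point cards (`k=3`) and once with the misconception cards (`k=2`).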
Within-run calibration is implemented as a streaming exemplar store maintained per question during a run. This mechanism is intended to operationalize consistency-oriented calibration, similar to human grading practice, where previously evaluated answers to the same question can help maintain a stable scoring standard across comparable responses. Because the store is updated as answers are processed, the set of exemplars available for a given answer is conditional on the fixed processing order used in that run. After each successful grading decision, the system may store a compact exemplar comprising an excerpt of the student answer together with the model-awarded final score, the criterion-level subscores, and the set of covered gold points. The store is restricted to prior same-question exemplars and contains model-generated grading artifacts rather than human reference scores or external labels. Retrieved exemplars are therefore used as calibration anchors, while awarded credit must still be justified by evidence from the current student answer. Storage is bounded and deterministic. Answers below a minimum length threshold are excluded to avoid storing non-informative exemplars. Stored answer text is truncated to a fixed maximum excerpt length. A per-question capacity limit is enforced by retaining only the most recent exemplars, yielding deterministic eviction and predictable memory size. When grading a new answer, retrieval is restricted to exemplars from the same question. Similarity is computed using an IDF-weighted Jaccard measure over token sets [32, 33]. Only exemplars above a minimum similarity threshold are eligible, and the top-K most similar exemplars are returned with deterministic sorting. The default configuration retrieves three exemplars.
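A bounded, per-question exemplar store with IDF-weighted Jaccard retrieval might look like the sketch below. The class name, the parameter names and their default values (other than the three retrieved exemplars), the token-count length check, the inclusive similarity bound, and the default weight of 1.0 for tokens missing from the IDF table are all assumptions made for illustration.

```python
import re
from collections import deque


def token_set(text):
    """Lowercase alphanumeric token set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def weighted_jaccard(a, b, idf):
    """IDF-weighted Jaccard: weight of shared tokens over weight of all tokens."""
    union = a | b
    if not union:
        return 0.0
    total = sum(idf.get(tok, 1.0) for tok in union)
    return sum(idf.get(tok, 1.0) for tok in a & b) / total


class ExemplarStore:
    """Hypothetical streaming store of model-graded exemplars, keyed by question."""

    def __init__(self, capacity=20, min_tokens=3, max_chars=300, min_sim=0.2, top_k=3):
        self.capacity, self.min_tokens, self.max_chars = capacity, min_tokens, max_chars
        self.min_sim, self.top_k = min_sim, top_k
        self.by_question = {}
        self.counter = 0  # insertion order, used for deterministic tie-breaking

    def add(self, question_id, answer, final_score):
        if len(token_set(answer)) < self.min_tokens:
            return False  # too short to be an informative exemplar
        bucket = self.by_question.setdefault(question_id, deque(maxlen=self.capacity))
        # deque(maxlen=...) evicts the oldest exemplar once capacity is reached.
        bucket.append((self.counter, answer[: self.max_chars], final_score))
        self.counter += 1
        return True

    def retrieve(self, question_id, answer, idf=None):
        idf = idf or {}
        query = token_set(answer)
        hits = []
        for seq, excerpt, score in self.by_question.get(question_id, ()):
            sim = weighted_jaccard(query, token_set(excerpt), idf)
            if sim >= self.min_sim:
                hits.append((sim, seq, excerpt, score))
        hits.sort(key=lambda hit: (-hit[0], hit[1]))  # most similar first, then oldest
        return [(excerpt, score) for _, _, excerpt, score in hits[: self.top_k]]
```

Because the store holds only what was graded earlier in the same run, calling `retrieve` for the same answer at two different points in the processing order can return different exemplars. This is the order dependence noted above.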
Both memory modules operate only at prompt-construction time and do not modify the grading contract, the deterministic verification and canonicalization logic, or the evidence requirements. Because retrieval is bounded and deterministic under identical inputs, the modules support reproducible ablation studies that isolate their individual contributions to grading accuracy and stability.
2.6. Provenance Logging and Experimental Protocol
We evaluate GradeAgentOps under a controlled protocol in which all configurations operate on identical inputs and human reference scores. Across ablations, the dataset, rubric artifacts (provided in JSON), prompt template, and deterministic verifier/canonicalizer are held fixed, and only the explicitly ablated components are varied: bounded repair and the optional memory modules. This design isolates the contribution of each component while preserving a consistent grading contract and directly comparable output structure. Unless stated otherwise, agreement metrics use E2 as the reference scorer, while E1 is used to quantify inter-rater reliability.
For each student response, execution follows a staged flow. A lightweight deterministic precheck first identifies degenerate responses, in which case the system returns a contract-compliant fallback artifact without invoking the grader LLM. For non-degenerate responses, the system constructs a rubric-aligned grading prompt and, when enabled, injects retrieved rubric hints and/or same-question exemplars as compact guidance. The grader LLM produces a draft JSON grading artifact, which is immediately processed by deterministic verification and canonicalization. This stage enforces structural constraints, canonicalizes score representations, and applies evidence-grounding and coverage-consistency checks, yielding a stable artifact together with structured postprocess signals that separate structural contract issues from semantic violations.
When repair is enabled, GradeAgentOps invokes bounded corrective loops only in response to verifier-detected violations. Contract repair is triggered when the draft output fails to satisfy the grading contract and cannot be brought into compliance by canonicalization alone. In this case, repair prompts incorporate the fixed rubric-aligned requirements together with the verifier’s structured failure signals, and the process repeats under a strict budget to limit variance and cost. Separately, when semantic repair is enabled, verifier-detected semantic violations—most notably evidence mismatches—may also trigger bounded corrective actions. The repair policy is designed to preserve already-valid content whenever possible and to make all interventions auditable through explicit verifier signals rather than implicit free-form regeneration.
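The verifier-driven control flow can be illustrated with a small bounded loop. This is a sketch under stated assumptions: the three callables stand in for the grader, the deterministic verifier, and the repair prompt. The function name, the budget of two repairs, the status strings, and the trace layout are hypothetical and are not taken from the GradeAgentOps code.

```python
def grade_with_bounded_repair(draft_fn, verify_fn, repair_fn, max_repairs=2):
    """Run draft -> verify -> (repair -> verify)* under a strict repair budget.

    draft_fn() returns a draft artifact, verify_fn(artifact) returns a list of
    violation signals (empty when the artifact passes), and
    repair_fn(artifact, violations) returns a corrected artifact.
    """
    artifact = draft_fn()
    trace = []  # one entry per attempt, so every intervention stays auditable
    for attempt in range(max_repairs + 1):
        violations = verify_fn(artifact)
        trace.append({"attempt": attempt + 1, "violations": list(violations)})
        if not violations:
            return artifact, trace, "accepted"
        if attempt == max_repairs:
            break  # budget exhausted: stop instead of regenerating indefinitely
        # Repair is triggered only by explicit verifier signals.
        artifact = repair_fn(artifact, violations)
    return artifact, trace, "budget_exhausted"


# Toy stand-ins for the grader, verifier, and repair step.
def toy_draft():
    return {"score": 7, "evidence": []}


def toy_verify(artifact):
    return [] if artifact["evidence"] else ["evidence_missing"]


def toy_repair(artifact, violations):
    # Preserve already-valid content (the score) and fix only the flagged field.
    return {**artifact, "evidence": ["quoted span from the answer"]}


result, attempts, status = grade_with_bounded_repair(toy_draft, toy_verify, toy_repair)
```

With these stand-ins the first draft fails the evidence check, one repair resolves it, and the trace records two attempts. A repair step that never resolves the violation stops after `max_repairs + 1` verifications.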
Table 3 summarizes the ablation configurations used in this protocol. The ablation suite comprises six settings: a baseline configuration without repair or memory (B0), a repair-enabled configuration without memory (R1), a rubric-memory configuration without repair (M1), a consistency-memory configuration with repair (C1), a rubric-memory configuration with repair (M2), and the full configuration with both memory modules and both repair loops enabled (FULL). The baseline configuration B0 uses rubric-only prompting with a single grader call and deterministic verification/canonicalization, without repair or memory augmentation. R1 enables bounded repair with both the contract repair loop and the semantic repair loop, while leaving both memory modules disabled. M1 enables rubric memory only, without repair or consistency memory. C1 enables consistency memory together with bounded repair (contract and semantic), while leaving rubric memory disabled. M2 enables rubric memory together with bounded repair (contract and semantic), while leaving consistency memory disabled. FULL enables both memory modules together with both repair loops (contract and semantic), representing the complete GradeAgentOps pipeline used for the main results. For all configurations, the system records a complete provenance trace for each attempt, including prompts, raw outputs, verifier outcomes, postprocess signals, and repair iterations when applicable.
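For readers who prefer a machine-readable view, the six settings described above can be written as a small configuration table. The toggles follow the textual description, while the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Component toggles for one ablation setting (field names are assumptions)."""
    contract_repair: bool
    semantic_repair: bool
    rubric_memory: bool
    consistency_memory: bool


ABLATIONS = {
    "B0": AblationConfig(False, False, False, False),  # baseline: no repair, no memory
    "R1": AblationConfig(True, True, False, False),    # repair only
    "M1": AblationConfig(False, False, True, False),   # rubric memory only
    "C1": AblationConfig(True, True, False, True),     # consistency memory + repair
    "M2": AblationConfig(True, True, True, False),     # rubric memory + repair
    "FULL": AblationConfig(True, True, True, True),    # both memories + both repair loops
}

# Configurations that differ only in consistency memory (repair held fixed).
consistency_pairs = [
    (a, b)
    for a, cfg_a in ABLATIONS.items()
    for b, cfg_b in ABLATIONS.items()
    if not cfg_a.consistency_memory
    and cfg_b.consistency_memory
    and (cfg_a.contract_repair, cfg_a.semantic_repair, cfg_a.rubric_memory)
    == (cfg_b.contract_repair, cfg_b.semantic_repair, cfg_b.rubric_memory)
]
```

Written this way, it is easy to see that the suite is not a full factorial design: consistency memory is never evaluated without repair, so its effect can only be isolated in the repair-enabled pairs R1 → C1 and M2 → FULL.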
The methodological design above yields a verification-first grading workflow in which every output is contract-compliant, evidence-aware, and fully auditable through deterministic postprocessing and provenance logging. By holding the dataset and grading contract fixed while toggling bounded repair and memory augmentation, the ablation protocol enables a clean attribution of performance differences to specific pipeline components rather than to prompt drift or uncontrolled changes in output format.
For the quantitative analyses reported in the Results section, agreement between human graders or between a model configuration and the human reference was quantified using both error-based and ordinal agreement metrics. In addition to MAE and RMSE, we report the Intraclass Correlation Coefficient, ICC(2,1), and Quadratic Weighted Kappa, QWK. ICC(2,1) denotes a two-way random-effects, absolute-agreement, single-measure intraclass correlation coefficient, computed as:
$$\mathrm{ICC}(2,1) = \frac{MS_R - MS_E}{MS_R + (k - 1)\,MS_E + \frac{k}{n}\,(MS_C - MS_E)}$$
where $MS_R$ is the mean square for targets, $MS_C$ is the mean square for raters or scoring methods, $MS_E$ is the residual mean square error, $n$ is the number of scored responses, and $k$ is the number of raters or scoring methods. QWK measures ordinal agreement while penalizing larger score disagreements more strongly. For score categories $i$ and $j$, it is computed as:
$$\mathrm{QWK} = 1 - \frac{\sum_{i,j} w_{ij}\,O_{ij}}{\sum_{i,j} w_{ij}\,E_{ij}}$$
where $O_{ij}$ is the observed agreement matrix, $E_{ij}$ is the expected agreement matrix derived from the marginal score distributions, and $w_{ij}$ is the quadratic disagreement weight between score categories $i$ and $j$. In the present study, scores are defined on the 0–10 scale; therefore, the maximum squared disagreement is $(10 - 0)^2 = 100$. Higher ICC(2,1) and QWK values indicate stronger agreement, while lower MAE and RMSE values indicate smaller score errors.
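Both agreement coefficients can be computed directly from their standard definitions. The sketch below is a dependency-free illustration and not the evaluation code used in the study. It assumes the usual two-way ANOVA form of ICC(2,1) and quadratic weights normalized by the maximum squared disagreement. The function names are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for an n x k table: one row per response, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ms_r = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)  # targets
    ms_c = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)  # raters
    ss_e = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_e = ss_e / ((n - 1) * (k - 1))  # residual
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)


def qwk(scores_a, scores_b, min_score=0, max_score=10):
    """Quadratic weighted kappa for integer scores on [min_score, max_score].

    Undefined (division by zero) when the expected weighted disagreement is
    zero, for example when both raters give one identical constant score.
    """
    size = max_score - min_score + 1
    total = len(scores_a)
    observed = [[0.0] * size for _ in range(size)]
    for a, b in zip(scores_a, scores_b):
        observed[a - min_score][b - min_score] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(size)) for j in range(size)]
    weighted_observed = weighted_expected = 0.0
    for i in range(size):
        for j in range(size):
            w = (i - j) ** 2 / (size - 1) ** 2  # 100 in the denominator on 0-10
            weighted_observed += w * observed[i][j]
            weighted_expected += w * hist_a[i] * hist_b[j] / total
    return 1.0 - weighted_observed / weighted_expected
```

Two graders who always agree obtain a value of 1 on both coefficients. Two graders who are maximally opposed on a two-response sample, for example scores (0, 10) against (10, 0), obtain QWK = −1.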
3. Results
The results present quantitative comparisons across the ablation configurations, focusing on agreement with human grading and the robustness of contract-compliant grading artifacts under practical failure modes. The analysis begins with human inter-rater reliability in order to establish the variability of the human reference scores, and then examines the ablation outcomes using E2 as the reference scorer. In addition to overall score-level agreement, the results also consider item-type differences, verifier-emitted failure signals, repair behavior, and the effects of the memory modules.
3.1. Human Inter-Rater Reliability
Before analyzing model–human agreement, it is necessary to establish the degree of agreement between the two expert human graders, E1 and E2, on the final 0–10 score. This provides the human reference context for interpreting the agreement levels achieved by the automated grading configurations and clarifies the extent of variability already present at the human level.
Table 4 reports the inter-rater reliability results between E1 and E2 on the final score, both overall and separately for technical and argumentative items.
Across all 1000 responses, the two graders achieved an overall intraclass correlation coefficient ICC(2,1) of 0.678 and a quadratic weighted kappa of 0.678, indicating moderate agreement on the final score. The mean score assigned by E1 was 4.746, whereas the mean score assigned by E2 was 5.415, resulting in a positive mean bias of 0.669 points for E2 − E1. The mean absolute error between the two graders was 1.827, and the root mean squared error was 2.296. Taken together, these results indicate that the two graders were reasonably consistent overall, while also showing a systematic tendency for E2 to assign slightly higher scores than E1. Thus, the human reference should be interpreted as an expert benchmark with measurable inter-rater variability, rather than as an error-free ground truth. This variability provides an important context for the subsequent model–human agreement analyses.
The same pattern remained visible after stratifying the results by item type. For technical items, the mean scores were 5.190 for E1 and 6.038 for E2, with a mean bias of 0.848, MAE = 1.828, RMSE = 2.310, QWK = 0.654, and ICC(2,1) = 0.655. For argumentative items, the mean scores were 4.302 for E1 and 4.792 for E2, with a smaller mean bias of 0.490, MAE = 1.826, RMSE = 2.283, QWK = 0.678, and ICC(2,1) = 0.678. Thus, inter-rater agreement was slightly weaker for technical items, mainly because the positive scoring shift in E2 relative to E1 was more pronounced in that subset.
Figure 2 further illustrates this pattern by showing the distribution of integer score differences, defined as Δ = E2 − E1. The distribution is centered slightly to the right of zero, with positive differences occurring more frequently than negative ones. This visual pattern is consistent with the quantitative results reported in Table 4 and confirms that E2 was, on average, the more lenient grader.
Overall, the human–human agreement results establish an important reference point for the subsequent model–human analyses, where automated grading performance should be interpreted relative to the natural variability already present between expert human evaluators.
3.2. Main Ablation Results
Having established the level of agreement between the two human graders, the analysis now turns to the main model–human comparison across the six ablation configurations. In this stage, agreement is evaluated on the final 0–10 score using E2 as the reference scorer, in accordance with the evaluation protocol defined earlier. The aim is to determine how the different ablation settings affect score-level agreement under otherwise comparable grading conditions.
Table 5 reports the main ablation results for all six configurations, including error-based measures, ordinal agreement, and tolerance-based agreement rates.
The overall pattern is directionally consistent but moderate in magnitude. FULL achieved the strongest agreement with the reference scorer, with MAE = 1.935, RMSE = 2.500, QWK = 0.652, and Within ±2 = 0.667. The second-best configuration was C1, which also performed strongly, with MAE = 1.983, RMSE = 2.540, QWK = 0.648, and Within ±2 = 0.660. In contrast, the remaining configurations formed a weaker cluster. M1 improved slightly over the baseline family, reaching MAE = 2.078 and QWK = 0.631, whereas B0, M2, and R1 remained close to one another, with MAE values between 2.089 and 2.105 and QWK values between 0.624 and 0.628. Among all six configurations, R1 produced the weakest overall agreement, with MAE = 2.105 and QWK = 0.624. This indicates that repair alone did not improve score-level alignment over the baseline in the absence of memory-based calibration. Although repair can address verifier-detected contract or evidence issues, this ablation shows that it is not, by itself, a sufficient mechanism for improving agreement with the human reference scorer. These differences should therefore be interpreted as modest score-level gains, especially when comparing FULL with B0, where MAE decreases by 0.162 points and QWK increases by 0.024.
A second pattern is the systematic negative bias observed for all configurations. The mean model score ranged from 3.864 to 4.226, whereas the mean score assigned by E2 was 5.415 in all cases. Consequently, all configurations scored below the reference scorer on average, with Bias (Model − E2) ranging from −1.551 for R1 to −1.189 for FULL. This indicates that the automated grader remained consistently stricter than E2, even in its best-performing setting. At the same time, the strongest configurations partially reduced this gap: both C1 and FULL showed modest improvements in absolute error, ordinal agreement, and the proportion of responses graded within a narrow tolerance of the human reference.
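The error-based quantities discussed here are simple functions of the per-response score differences. The sketch below shows one way to compute them. The function name is hypothetical, and reading Within ±2 as an absolute difference of at most two points is an assumption about the tolerance definition.

```python
import math


def score_agreement(model_scores, reference_scores, tolerance=2):
    """Bias, MAE, RMSE, and within-tolerance rate of model vs. reference scores."""
    diffs = [m - r for m, r in zip(model_scores, reference_scores)]
    n = len(diffs)
    return {
        "bias": sum(diffs) / n,  # negative when the model is stricter
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "within_tol": sum(abs(d) <= tolerance for d in diffs) / n,
    }


# Bias equals the difference of the mean scores, so the reported range follows
# directly from the reported means (values taken from the text above).
mean_e2 = 5.415
bias_lowest_mean = 3.864 - mean_e2   # -1.551, reported for R1
bias_highest_mean = 4.226 - mean_e2  # -1.189, reported for FULL
```

Because bias is a signed mean, it can be much smaller in magnitude than MAE when over- and under-scoring cancel. Here the bias magnitude (1.19 to 1.55) is a large share of the MAE (about 1.9 to 2.1), which indicates that most of the disagreement is one-directional.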
To assess whether the observed differences between the main configurations were statistically robust, Table 6 reports paired student-cluster bootstrap comparisons with 10,000 resamples. Resampling was performed at the student level rather than at the individual-answer level, preserving the within-student dependence across the ten exam questions.
Positive Δ values indicate improvement of the first configuration over the second configuration. For MAE, Δ is computed as the second configuration value minus the first configuration value; for QWK and Within ±2, Δ is computed as the first configuration value minus the second configuration value.
The bootstrap analysis indicates that both FULL and C1 improve significantly over B0 across MAE, QWK, and Within ±2. However, the direct comparison between FULL and C1 is not statistically significant for any of these metrics. Thus, FULL should be interpreted as the numerically strongest configuration, while C1 remains statistically comparable within this dataset.
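A paired student-cluster bootstrap for the MAE comparison can be sketched as follows. The sketch assumes a percentile interval and a fixed random seed, neither of which is specified in the text. The function name and the input layout (a mapping from student identifier to that student's per-question absolute errors) are hypothetical.

```python
import random


def cluster_bootstrap_mae_delta(abs_err_first, abs_err_second, n_resamples=10_000, seed=0):
    """Paired bootstrap over students for Δ = MAE(second) − MAE(first).

    Both inputs map student id -> list of absolute errors, one per question.
    Whole students are resampled with replacement, so all answers of a student
    enter or leave a resample together, and both configurations are always
    evaluated on the same resampled students (paired design).
    """
    rng = random.Random(seed)
    students = sorted(abs_err_first)

    def delta(sample):
        total_first = sum(sum(abs_err_first[s]) for s in sample)
        total_second = sum(sum(abs_err_second[s]) for s in sample)
        n_answers = sum(len(abs_err_first[s]) for s in sample)
        return (total_second - total_first) / n_answers

    resampled = sorted(
        delta([rng.choice(students) for _ in students]) for _ in range(n_resamples)
    )
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return delta(students), (lower, upper)
```

Under the sign convention stated above, a positive Δ means the first configuration has the lower MAE. An interval that excludes zero would then correspond to a significant improvement at the 5% level, under the percentile assumption.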
Figure 3 provides a compact visual summary of the same ablation results by showing MAE and QWK side by side for the six configurations, ordered by performance.
The figure reinforces the ranking observed in Table 5. FULL and C1 stand apart as the two strongest configurations, combining lower error with higher ordinal agreement than the remaining variants. The separation is especially clear in the QWK panel, where the gains of FULL and C1 over B0, R1, and M2 are more visually pronounced. M1 occupies an intermediate position, outperforming the weaker configurations but not reaching the agreement levels of C1 or FULL. Overall, the figure confirms that the highest-performing configurations are those that combine lower score error with stronger ordinal consistency relative to the reference scorer.
Taken together, these results show that FULL provides the strongest agreement with the human reference scorer, with C1 as a close second. The ablation pattern therefore indicates that the best-performing configurations are those that achieve both lower absolute error and higher ordinal agreement, while the weaker variants remain systematically more conservative and less aligned with the reference human scores.
3.3. Performance Breakdown by Question and Item Type
The overall ablation results reported above establish that FULL achieves the strongest agreement with the reference scorer, with B0 serving as the baseline configuration. However, those aggregate results do not show whether the observed gain is distributed evenly across the exam or whether it is concentrated in particular subsets of responses. A more fine-grained analysis is therefore required. To clarify where the improvement of FULL actually comes from, the comparison is narrowed in this subsection to B0 and FULL and is examined at two complementary levels: first by item type and then by individual question, always using E2 as the reference scorer.
The item-type analysis is important because the exam contains both technical and argumentative questions, which differ not only in content but also in the nature of the expected response. A configuration that improves overall agreement may therefore do so by performing better on one category while offering only limited gains on the other.
Table 7 provides this first level of breakdown by reporting the performance of B0 and FULL separately for technical and argumentative items in terms of MAE and QWK. In this way, the table shows whether the overall advantage of FULL reflects a broad gain across both categories or a more uneven effect concentrated in only one of them.
The item-type comparison shows that the advantage of FULL is not equally strong across the two response categories. For technical items, B0 achieved MAE = 2.208 and QWK = 0.541, whereas FULL achieved MAE = 2.118 and QWK = 0.538. This corresponds to a relatively small MAE reduction of 0.090, while ordinal agreement changes only marginally and in fact decreases slightly. By contrast, the difference is much more substantial for argumentative items. In this subset, B0 obtained MAE = 1.986 and QWK = 0.673, while FULL improved to MAE = 1.752 and QWK = 0.717. Here, the gain is clearly stronger, yielding an MAE reduction of 0.234 together with a QWK increase of 0.044. Taken together, these results show that the overall superiority of FULL over B0 is driven primarily by better alignment on argumentative responses, whereas the improvement on technical items is much smaller and not equally reflected across both agreement measures.
Although the item-type breakdown clarifies the broader source of the gain, it still remains too coarse to show how this improvement is distributed across the ten individual questions. In particular, the category-level comparison cannot reveal whether the benefit of FULL is broad and consistent or whether it is concentrated in a smaller subset of items. A question-level view is therefore needed.
Figure 4 provides this finer-grained perspective by plotting the per-question change in error, defined as ΔMAE = MAE(B0) − MAE(FULL), for each question from Q1 to Q10. Under this definition, positive values indicate questions on which FULL reduces error relative to the baseline, whereas negative values indicate questions on which B0 remains stronger.
Figure 4 shows that the improvement delivered by FULL is clearly non-uniform across the exam. The largest gains are observed on Q8 (ΔMAE = 0.39), Q3 (ΔMAE = 0.38), Q7 (ΔMAE = 0.38), Q2 (ΔMAE = 0.36), and Q10 (ΔMAE = 0.36). These questions therefore account for a substantial part of the overall advantage of FULL over the baseline. At the same time, the gains are much smaller on Q6 and Q9, where the improvement is only 0.02, indicating near-equivalent behavior between the two configurations on those items. In addition, FULL performs slightly worse than B0 on Q1 (−0.07), Q4 (−0.11), and Q5 (−0.11). This pattern is important because it shows that the higher overall agreement achieved by FULL does not result from a uniform improvement across all questions. Rather, it is produced by a combination of substantial gains on several items, negligible differences on others, and small losses on a limited subset of questions.
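As a consistency check, the per-question values can be averaged and compared with the aggregate result. The short sketch below uses only the ΔMAE values quoted above and assumes that every question contributes the same number of responses, so that the overall ΔMAE is the unweighted mean of the per-question values.

```python
# Per-question ΔMAE = MAE(B0) − MAE(FULL), as quoted in the text.
delta_mae = {
    "Q1": -0.07, "Q2": 0.36, "Q3": 0.38, "Q4": -0.11, "Q5": -0.11,
    "Q6": 0.02, "Q7": 0.38, "Q8": 0.39, "Q9": 0.02, "Q10": 0.36,
}

# Unweighted mean, valid if all questions have equally many responses (assumed).
mean_delta = sum(delta_mae.values()) / len(delta_mae)

improved = sorted(q for q, d in delta_mae.items() if d > 0)
worsened = sorted(q for q, d in delta_mae.items() if d < 0)

# Share of the total positive gain contributed by the five largest gains.
top_five = sorted(delta_mae.values(), reverse=True)[:5]
top_five_share = sum(top_five) / sum(d for d in delta_mae.values() if d > 0)
```

The mean of the ten values is 0.162, which matches the overall MAE reduction from B0 (2.097) to FULL (1.935). Seven questions improve and three worsen, and the five largest gains account for nearly all of the positive contribution.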
Overall, the breakdown by item type and by question clarifies the structure of the gain observed in the main ablation results. The advantage of FULL over B0 arises mainly from stronger agreement on argumentative items and from marked improvements on a subset of individual questions, rather than from a consistent improvement across the entire exam. The benefit of FULL is therefore real and meaningful, but also clearly localized, which helps explain why the aggregate results reported in the previous subsection do not translate into uniform gains at every level of analysis.
3.4. Verifier and Postprocess Outcomes
The agreement analyses reported in the previous subsections quantify how closely the six configurations align with the human reference scores. However, score-level agreement alone does not show how the accepted grading artifacts differ internally once they pass through the verification and postprocess stages. A complementary analysis is therefore needed at the level of verifier and postprocess outcomes. This is especially important in the present experiments because the final outputs show almost no contract-fail cases, which means that the relevant variation between configurations lies less in outright output invalidity and more in the postprocess profile of the accepted grading artifacts.
A first step is to summarize these outcomes at the configuration level. In addition to Pass@1 and the item-level rate of any observed contract-fail case, it is useful to examine the main postprocess signals that remain visible in the final accepted outputs.
Table 8 provides this overview by reporting, for each configuration, the rates of contract-fail items together with the principal verifier- and postprocess-related signals observed in the accepted grading artifacts.
The table shows that outright contract-fail cases were essentially absent across all six configurations. B0, R1, M1, and M2 produced a contract-fail item rate of 0.000, while C1 and FULL reached only 0.001, indicating that contract non-compliance was not a meaningful source of experimental variation in the final accepted outputs. At the same time, Pass@1 remained high throughout, ranging from 0.927 for FULL to 1.000 for B0 and M1. These results indicate that the configurations differed only marginally in first-pass acceptance and that the central differences between them must be sought in the postprocess profile of the accepted outputs rather than in widespread output rejection.
The clearest variation appears in the evidence- and completeness-related signals. The rate of the "covered gold evidence invalid" signal was highest for M1 (0.059) and B0 (0.057), lower for R1 (0.025) and M2 (0.022), and lowest for the strongest configurations, namely C1 (0.013) and FULL (0.010). This pattern indicates that the better-performing configurations were also less likely to retain invalid covered-gold evidence in the final accepted output. A second signal, "forced completeness from gold", remained present in all configurations at consistently higher rates, ranging from 0.085 for C1 to 0.129 for M2. The rate was also relatively high for M1 (0.128), whereas B0 (0.122), R1 (0.121), and FULL (0.117) occupied an intermediate range. By contrast, the "short answer gate" signal was constant across all six configurations at 0.029, indicating that this behavior was driven by the dataset and answer characteristics rather than by the configuration itself. Similarly, the "banned evidence invalid dropped" signal remained low in all cases, varying only from 0.001 to 0.011.
A further distinction is visible in the "semantic exhausted but accepted" signal, which appears only in the configurations with semantic repair enabled. This rate was 0.025 for R1, 0.013 for C1, 0.022 for M2, and 0.010 for FULL, while it remained 0.000 for B0 and M1. Among the configurations in which this signal occurs, FULL produces the lowest rate, which indicates that the complete pipeline leaves the smallest proportion of accepted outputs still carrying this marker.
Although the full table provides the complete configuration-level summary, not all signals are equally informative for visual comparison. In practice, the clearest differences are concentrated in the evidence- and completeness-related outcomes.
Figure 5 provides a focused view of the three most informative postprocess signals by showing their rates across all six configurations: invalid gold evidence, forced completeness, and exhausted semantic repair.
The figure makes two patterns especially clear. First, the strongest configurations substantially reduce the rate of invalid gold evidence relative to the weaker variants. This is most evident for FULL and C1, which achieve the lowest values, whereas B0 and M1 retain noticeably higher rates. Second, forced completeness remains the most prevalent of the three signals in every configuration, showing that completeness-oriented postprocessing remains relevant throughout the pipeline even when the other signals are less frequent. Within this signal, C1 stands out with the lowest rate, whereas M1 and M2 remain the highest. The third signal, exhausted semantic repair, is comparatively rare overall, but its selective presence in R1, C1, M2, and FULL confirms that it is associated with the semantic repair loop rather than with the dataset alone.
Overall, these results show that the internal differences between configurations are expressed far less through contract-fail outcomes than through the postprocess profile of the final accepted grading artifacts. The strongest configurations, especially FULL and C1, are characterized by lower rates of invalid covered-gold evidence and, more generally, by a cleaner postprocess profile, whereas the weaker variants retain a larger share of evidence-related issues in the accepted outputs.
3.5. Memory Module Effects
The previous subsections established the overall ablation ranking, localized the strongest gains by item type and question, and described how the accepted outputs differ in their verifier and postprocess profiles. A more targeted analysis is still needed, however, to isolate the contribution of the two memory-related components introduced in the pipeline. Because the raw ablation ranking alone does not show which improvements are attributable specifically to rubric memory and which are attributable to consistency memory, this subsection evaluates these effects through a set of pairwise comparisons designed to isolate each module as directly as possible.
The analysis is structured around four comparisons. The transition from B0 to M1 captures the effect of rubric memory without repair. The transition from R1 to C1 captures the effect of consistency memory in the repair-enabled setting. The transition from R1 to M2 captures the effect of rubric memory under the same repair-enabled conditions. Finally, the transition from M2 to FULL captures the additional contribution of consistency memory when rubric memory and repair are already present.
Table 9 provides this pairwise comparison by reporting, for each transition, the baseline and augmented values for MAE and QWK, together with their corresponding deltas.
The table shows that the effect of the memory modules is not uniform. The smallest improvement is observed for B0 → M1, where introducing rubric memory without repair reduces MAE only from 2.097 to 2.078 and increases QWK only from 0.628 to 0.631, yielding ΔMAE = 0.019 and ΔQWK = 0.003. A similarly small effect appears in R1 → M2, where adding rubric memory in the repair-enabled setting changes MAE from 2.105 to 2.089 and QWK from 0.624 to 0.626, corresponding to ΔMAE = 0.016 and ΔQWK = 0.002. These two comparisons indicate that rubric memory, taken in isolation, contributes only modest gains in this dataset.
By contrast, the comparisons involving consistency memory show substantially larger improvements. In R1 → C1, the addition of consistency memory reduces MAE from 2.105 to 1.983 and improves QWK from 0.624 to 0.648, yielding ΔMAE = 0.122 and ΔQWK = 0.024. An even stronger effect is observed in M2 → FULL, where adding consistency memory on top of rubric memory and repair reduces MAE from 2.089 to 1.935 and raises QWK from 0.626 to 0.652, corresponding to ΔMAE = 0.154 and ΔQWK = 0.026. These two transitions consistently show that consistency memory produces the dominant gains among the memory-related components evaluated here.
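The four pairwise deltas follow directly from the MAE and QWK values reported for the six configurations. The sketch below recomputes them. It uses only numbers quoted in the text, and the variable and function names are hypothetical.

```python
# (MAE, QWK) against E2, as reported in the text for each configuration.
reported = {
    "B0": (2.097, 0.628), "R1": (2.105, 0.624), "M1": (2.078, 0.631),
    "C1": (1.983, 0.648), "M2": (2.089, 0.626), "FULL": (1.935, 0.652),
}

# (baseline, augmented, module isolated by the comparison)
transitions = [
    ("B0", "M1", "rubric"),
    ("R1", "M2", "rubric"),
    ("R1", "C1", "consistency"),
    ("M2", "FULL", "consistency"),
]


def memory_deltas(baseline, augmented):
    """Return (ΔMAE, ΔQWK), both signed so that positive means improvement."""
    mae_base, qwk_base = reported[baseline]
    mae_aug, qwk_aug = reported[augmented]
    return round(mae_base - mae_aug, 3), round(qwk_aug - qwk_base, 3)


mean_gain = {}
for module in ("rubric", "consistency"):
    gains = [memory_deltas(a, b)[0] for a, b, m in transitions if m == module]
    mean_gain[module] = sum(gains) / len(gains)
```

Averaged over its two comparisons, rubric memory reduces MAE by about 0.018 and consistency memory by about 0.138. These are descriptive differences between point estimates, and the text reports no significance test for them.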
Although the table provides the exact numerical deltas, a visual comparison helps clarify their relative magnitude. In particular, the side-by-side inspection of ΔMAE and ΔQWK makes it easier to see which component produces only marginal changes and which one leads to a clearly visible shift in agreement.
Figure 6 provides this visual summary by plotting the change in MAE and QWK for the four pairwise memory-module comparisons.
The figure makes the ranking of the effects immediately visible. The two smallest bars appear in the rubric-memory comparisons B0 → M1 and R1 → M2, in both the ΔMAE and ΔQWK panels, confirming that rubric memory alone contributes only limited improvements. In contrast, the largest bars are observed for M2 → FULL and R1 → C1, showing that consistency memory produces a clearly stronger effect under both comparison settings. The same ordering appears in both panels, which indicates that the gain is not confined to one metric alone but is reflected consistently in both lower absolute error and stronger ordinal agreement.
Overall, these results show that the memory-related gains in this study are driven primarily by consistency memory, whereas rubric memory has only a modest effect when introduced on its own. The strongest memory-related improvement is obtained when consistency memory is added on top of an already stronger configuration, as seen in M2 → FULL, while the weakest gains are associated with the two transitions that isolate rubric memory. This pattern indicates that, within the present pipeline, consistency-oriented contextualization contributes more strongly to grading alignment than rubric memory alone.
3.6. Efficiency and Operational Overhead
The previous subsections established how the six configurations differ in agreement quality, where the strongest gains are concentrated, and how the accepted outputs differ internally in their verifier and postprocess profile. A practical comparison, however, also requires an operational perspective. Higher agreement is useful only if the associated overhead remains interpretable and manageable in realistic grading conditions. For this reason, the analysis in this subsection examines the computational cost of the six configurations in terms of total runtime, average time per response, average number of attempts per response, and repair rate.
A first step is to summarize these efficiency-related quantities at the configuration level. This makes it possible to determine whether the configurations that achieve stronger agreement do so at only marginal additional cost or whether the gain is accompanied by a substantial increase in operational overhead.
Table 10 provides this overview by reporting, for each configuration, the total wall time, the average processing time per item, the average number of attempts per item, and the overall repair rate.
The table shows a clear efficiency gradient across the six configurations. The most lightweight configuration is B0, with an average processing time of 37.551 s/item, followed closely by M1 at 38.425 s/item. The repair-enabled configurations are consistently more expensive, with R1 reaching 40.345 s/item, M2 reaching 40.555 s/item, and C1 reaching 41.320 s/item. The largest overhead is observed for FULL, which requires 42.263 s/item on average. Relative to B0, this corresponds to an increase of approximately 12.5% in average per-item runtime. Thus, the best-performing configuration is also the most computationally demanding one.
The same pattern is reflected in the number of attempts per item. B0 and M1 both remain at 1.000, indicating that they finalize every response in a single attempt. The remaining configurations require more attempts on average: C1 reaches 1.078, R1 reaches 1.090, M2 reaches 1.091, and FULL reaches 1.093. This indicates that the additional overhead is closely tied to configurations that engage more often in iterative processing rather than to any broad increase in cost unrelated to the attempt structure. The repair rates follow the same ordering. B0 and M1 remain at 0.000, while C1, R1, M2, and FULL rise to 0.078, 0.090, 0.091, and 0.093, respectively. This confirms that the operational differences between configurations are driven primarily by the frequency of additional repair-related processing.
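The relative overhead of each configuration can be derived from the per-item times quoted above. The following sketch recomputes the percentage increase over B0. It uses only values stated in the text, and the variable names are hypothetical.

```python
# Average processing time in seconds per item, as reported in the text.
avg_item_time = {
    "B0": 37.551, "M1": 38.425, "R1": 40.345,
    "M2": 40.555, "C1": 41.320, "FULL": 42.263,
}
# Average attempts per item, as reported in the text.
avg_attempts = {
    "B0": 1.000, "M1": 1.000, "C1": 1.078,
    "R1": 1.090, "M2": 1.091, "FULL": 1.093,
}

baseline = avg_item_time["B0"]
overhead_pct = {
    name: 100.0 * (seconds - baseline) / baseline
    for name, seconds in avg_item_time.items()
}

by_time = sorted(avg_item_time, key=avg_item_time.get)
by_attempts = sorted(avg_attempts, key=avg_attempts.get)
```

The overhead grows from roughly 2% for M1 to roughly 12.5% for FULL. Sorting by time and by attempts gives slightly different orders, because C1 has the fewest attempts among the repair-enabled configurations but the second-highest per-item time.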
Although the table provides the exact numerical values, a visual comparison makes the cost profile of the configurations easier to interpret. In particular, plotting Avg Item Time next to Avg Attempts/Item shows directly whether the runtime overhead follows the same ordering as the attempt overhead.
Figure 7 provides this comparison by presenting the two measures side by side for all six configurations.
The figure confirms that the ranking by runtime closely mirrors the ranking by average number of attempts. B0 and M1 occupy the lowest-cost region in both panels, while FULL occupies the highest-cost region in both. R1, M2, and C1 form an intermediate group, with C1 slightly more expensive than R1 and M2 in average item time despite a somewhat lower average number of attempts. This indicates that the operational cost of the stronger configurations is driven primarily, though not exclusively, by the frequency of additional attempts. The figure also makes clear that the increase in cost is gradual rather than abrupt, with the main difference lying between the single-attempt configurations and the more iterative ones.
Overall, these results show that the strongest-performing configurations do incur additional operational cost, but the overhead remains structured and interpretable. The best agreement is obtained by FULL, but this comes at the highest average runtime, the highest mean number of attempts per item, and the highest repair rate. Conversely, B0 and M1 are operationally the most efficient, but they do not reach the same agreement levels as the stronger repair- and memory-enabled configurations. The trade-off is therefore clear: stronger alignment with the human reference scores is associated with a moderate but measurable increase in computational overhead.
3.7. Representative Case Analysis
The aggregate quantitative analyses establish the overall ranking of the six configurations, show where the strongest gains are concentrated, and characterize the internal verifier and postprocess profile of the accepted outputs. A final qualitative step is still useful, however, in order to show how these broader patterns appear in concrete grading situations. Rather than introducing additional global statistics, this subsection focuses on a small set of representative examples selected to illustrate four distinct situations: a clear argumentative improvement, a substantial technical improvement, a case in which the internal postprocess profile becomes cleaner without changing the final score, and a case in which the baseline remains closer to the reference scorer.
Table 11 provides these representative examples by summarizing the question focus, the final scores assigned by B0 and FULL relative to the reference scorer E2, the corresponding absolute deviations from E2, and a short observation describing the main pattern illustrated by each case.
The first case illustrates a clear improvement for an argumentative response. For the item asking whether passing the Turing Test is sufficient to regard a machine as intelligent, E2 assigned a final score of 10. In this case, B0 assigned 6, whereas FULL assigned 10, matching the reference exactly. This example is representative of the broader tendency for the strongest gains to appear for argumentative responses and for questions that require coherent justification rather than short technical recall alone.
The second case illustrates a substantial but incomplete gain on a technical response. For the item concerning the role of the Naive Bayes classifier in decision-making, E2 assigned 10, B0 assigned 4, and FULL assigned 7. Here, FULL does not fully match the reference score, but it reduces the absolute scoring gap by half relative to the baseline. This case is also notable because the accepted output of B0 still carried the covered gold evidence invalid signal, whereas the accepted output of FULL did not, linking the score-level improvement to a cleaner internal postprocess profile.
The third case shows that a cleaner internal profile does not necessarily imply an immediate change in the final score. For the item asking about the role of a version control system, E2 assigned 8, while both B0 and FULL assigned 5. In other words, the score-level behavior remains unchanged. However, the baseline output still carried the covered gold evidence invalid signal, while the corresponding FULL output did not. This example shows that internal output quality can improve even when the final score itself remains fixed.
The fourth case provides an important counterexample in which the baseline remains closer to the reference scorer. For the item concerning the role of facial recognition in modern security applications, E2 assigned 1. In this case, B0 also assigned 1, whereas FULL assigned 5. This is therefore a case in which the best-performing configuration at the aggregate level is clearly worse than the baseline on an individual response. The example is important because it confirms that the gains achieved by FULL are not uniform and do not eliminate all cases in which the simpler baseline remains more accurate.
Overall, these representative cases reinforce the quantitative evidence without repeating it. They show concretely that the gains of FULL are most visible on argumentative responses, can also produce meaningful reductions in error on selected technical responses, may improve the internal postprocess profile even when the final score is unchanged, and do not eliminate all situations in which the baseline remains closer to the human reference. In this way, the case analysis provides a concise qualitative complement to the aggregate quantitative results by illustrating how the main performance patterns appear in concrete grading situations.
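The absolute deviations discussed in the four cases can be recomputed from the quoted scores. The sketch below uses only the scores stated in this subsection; the short labels and the helper names are illustrative.

```python
# Final scores (0-10 scale) quoted for the four representative cases.
cases = [
    {"focus": "Turing Test", "E2": 10, "B0": 6, "FULL": 10},
    {"focus": "Naive Bayes classifier", "E2": 10, "B0": 4, "FULL": 7},
    {"focus": "Version control system", "E2": 8, "B0": 5, "FULL": 5},
    {"focus": "Facial recognition", "E2": 1, "B0": 1, "FULL": 5},
]


def abs_dev(case, system):
    """Absolute deviation of a system's score from the reference scorer E2."""
    return abs(case[system] - case["E2"])


def closer_system(case):
    """Report which configuration is closer to E2, or 'tie' if equal."""
    b0, full = abs_dev(case, "B0"), abs_dev(case, "FULL")
    if full < b0:
        return "FULL"
    if b0 < full:
        return "B0"
    return "tie"


for case in cases:
    print(case["focus"], abs_dev(case, "B0"), abs_dev(case, "FULL"), closer_system(case))
```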
4. Discussion
The empirical results highlight several consistent patterns regarding the behavior of the GradeAgentOps pipeline. Most notably, the strongest agreement with the human reference scorer is achieved by the full configuration, while the gains are distributed unevenly across item types and individual questions. At the same time, the results show that the contribution of the pipeline is not explained solely by score-level agreement, but also by differences in postprocess behavior, memory-module effects, and operational overhead. These aspects are examined below in order to clarify the meaning, implications, and limitations of the observed findings.
4.1. Principal Findings
The empirical evaluation reveals several important but bounded findings regarding the behavior of the GradeAgentOps pipeline. First, among the six ablation configurations, FULL achieves the strongest numerical agreement with the human reference scorer E2, with C1 emerging as the closest alternative. This pattern is consistent across the main score-level agreement measures, although the observed differences between configurations are modest in magnitude. This interpretation should also be considered together with the paired student-cluster bootstrap analysis reported in Table 6. The bootstrap results show that both FULL and C1 improve significantly over B0 across MAE, QWK, and Within ±2, whereas the direct comparison between FULL and C1 is not statistically significant. The complete pipeline therefore provides the best numerical alignment within this dataset, while C1 remains statistically comparable to FULL; the results should be read as a modest, statistically supported improvement rather than as evidence of a large performance gap over the baseline. At the same time, the advantage of the strongest configurations is not absolute: all configurations remain, on average, stricter than E2, which shows that improved agreement does not eliminate the underlying tendency of the automated grader to assign lower scores than the reference scorer.
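To illustrate what a paired student-cluster bootstrap involves, the following is a minimal sketch for the MAE difference only. It assumes a percentile confidence interval, 2000 resamples, and a simple tuple layout for the records; none of these details are confirmed by the analysis behind Table 6, and the toy data are synthetic. The key idea is that whole students are resampled, so all answers from one student stay together, and both configurations are evaluated on the same resample.

```python
import random
from collections import defaultdict


def paired_cluster_bootstrap(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for MAE(candidate) - MAE(baseline), resampling whole students.

    records: iterable of (student_id, reference, baseline, candidate) tuples.
    Returns (observed_difference, ci_low, ci_high).
    """
    # Group per-answer absolute errors by student (the resampling cluster).
    by_student = defaultdict(list)
    for student, ref, base, cand in records:
        by_student[student].append((abs(base - ref), abs(cand - ref)))
    students = list(by_student)
    rng = random.Random(seed)

    def mae_diff(sample):
        base_err = cand_err = n = 0
        for s in sample:
            for b, c in by_student[s]:
                base_err += b
                cand_err += c
                n += 1
        return (cand_err - base_err) / n

    observed = mae_diff(students)
    # Draw students with replacement; the pairing is kept within each draw.
    diffs = sorted(
        mae_diff(rng.choices(students, k=len(students))) for _ in range(n_boot)
    )
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return observed, low, high


# Synthetic toy data: 20 students, 3 answers each, reference score 8.
toy = [(s, 8, 6 - (s % 2), 7) for s in range(20) for _ in range(3)]
observed, low, high = paired_cluster_bootstrap(toy)
print(observed, low, high)
```

Under this sketch, a negative difference means the candidate has lower MAE than the baseline, and an interval that excludes zero is the usual reading of a significant difference.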
A second principal finding is that the observed gains are not distributed uniformly across the evaluation set. The improvements achieved by FULL are concentrated more strongly in argumentative items than in technical items, and the question-level breakdown shows that the gain is driven by a subset of items rather than by a consistent improvement across all ten questions. This is important because it indicates that the strongest configuration does not simply raise performance in a global and homogeneous way, but instead improves alignment more effectively in response types that appear to benefit from stronger consistency and contextualization during grading.
A third finding concerns the relative contribution of the pipeline components. The pairwise comparisons show that the largest memory-related improvements are associated with consistency memory, whereas rubric memory alone produces only modest gains. In addition, the internal analysis of verifier and postprocess outcomes shows that the stronger configurations are characterized less by differences in contract-fail behavior and more by a cleaner postprocess profile, especially through lower rates of invalid covered-gold evidence in the accepted outputs. Taken together, these results suggest that the most meaningful improvements do not arise from simple rubric exposure alone, but from mechanisms that stabilize the grading decision and reduce evidence-level inconsistencies in the final artifact.
A fourth finding is that these quality gains are accompanied by a measurable but moderate operational cost. The best-performing configurations require higher average item time, more attempts per item, and higher repair rates than the most lightweight variants. Nevertheless, the overhead remains structured and interpretable rather than excessive or unstable. In practical terms, the results indicate that stronger agreement with the human reference scorer can be obtained, but not for free: the gains in grading quality are coupled with a clear, though still manageable, increase in computational effort.
Overall, the main findings support the conclusion that the full GradeAgentOps pipeline produces modest but statistically supported improvements over the baseline, while C1 remains statistically comparable to FULL within this dataset. These improvements are selective, component-dependent, and associated with a transparent operational trade-off.
4.2. The Role of Pipeline Components
The ablation results make it possible to interpret the contribution of the main GradeAgentOps components more precisely than would be possible from the overall ranking alone. The first important observation is that the pipeline does not behave as a set of equally influential additions. Instead, the results show a clearly uneven contribution across components, with some elements producing only marginal gains and others accounting for a substantial part of the observed improvement in agreement.
The weaker performance of R1 is informative in this respect. In this configuration, bounded repair is enabled without rubric memory or consistency memory. The repair mechanism is triggered by verifier-detected violations and is designed to restore contract compliance, evidence grounding, and internal consistency of the grading artifact, rather than to recalibrate the final score globally. As a result, repair alone may produce cleaner or more compliant artifacts without necessarily improving agreement with the human reference scorer. This suggests that the repair model and repair prompt are useful as targeted correction mechanisms, but they should not be interpreted as an independent source of score calibration. The stronger performance of C1 and FULL indicates that repair is more effective when combined with consistency-oriented contextualization.
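The control flow of bounded, verifier-triggered repair can be sketched as follows. This is an illustrative outline only: the callables `grade`, `verify`, and `repair`, the artifact layout, and the default bound of one repair are assumptions, and the toy stand-ins replace what would be model calls in a real pipeline.

```python
from typing import Callable, List, Tuple


def grade_with_bounded_repair(
    answer: str,
    grade: Callable[[str], dict],
    verify: Callable[[dict, str], List[str]],
    repair: Callable[[dict, List[str], str], dict],
    max_repairs: int = 1,
) -> Tuple[dict, int, List[str]]:
    """Return (artifact, attempts, remaining_violations)."""
    artifact = grade(answer)
    attempts = 1
    violations = verify(artifact, answer)
    # Repair runs only when the verifier reports violations, and only
    # up to the configured bound.
    while violations and attempts <= max_repairs:
        artifact = repair(artifact, violations, answer)
        attempts += 1
        violations = verify(artifact, answer)
    return artifact, attempts, violations


# Toy stand-ins (hypothetical): the verifier flags evidence not found in the
# answer, and the repair step drops it without changing the score.
def toy_grade(answer):
    return {"score": 6, "evidence": ["uses commits", "tracks blame"]}


def toy_verify(artifact, answer):
    return [e for e in artifact["evidence"] if e not in answer]


def toy_repair(artifact, violations, answer):
    kept = [e for e in artifact["evidence"] if e not in violations]
    return {**artifact, "evidence": kept}


answer = "A version control system uses commits to record changes."
artifact, attempts, remaining = grade_with_bounded_repair(
    answer, toy_grade, toy_verify, toy_repair
)
print(artifact, attempts, remaining)
```

In the toy run, the repaired artifact is cleaner but the score is unchanged, which mirrors the point that repair alone need not improve agreement with the reference scorer.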
The clearest distinction emerges between rubric memory and consistency memory. In the pairwise comparisons designed to isolate their effects, rubric memory produces only limited improvements, both when introduced without repair and when added in the repair-enabled setting. By contrast, consistency memory yields markedly larger gains under both comparison conditions. This indicates that simple access to rubric-related contextual information is not, by itself, sufficient to generate a strong improvement in grading alignment. What appears to matter more is the component that promotes a more stable and coherent use of that information during the grading decision itself.
This interpretation is reinforced by the broader behavior of the strongest configurations. The best-performing variants are not simply those that expose the model to more rubric-related context, but those that combine contextual support with mechanisms that reduce inconsistency in the final grading artifact. In this sense, the role of consistency memory appears to be less about adding content and more about constraining the grading process toward a more internally coherent decision pattern. This helps explain why the gains are especially visible on argumentative items, where grading depends more strongly on maintaining stable reasoning across multiple elements of the response rather than identifying only a small set of isolated technical cues.
A similar conclusion follows from the verifier and postprocess analyses. The stronger configurations are not distinguished by a dramatic reduction in outright contract-fail cases, because such failures are already rare across the board. Instead, they are distinguished by a cleaner internal postprocess profile, particularly through lower rates of invalid covered-gold evidence in the final accepted outputs. This suggests that the key contribution of the pipeline is not merely to reject malformed outputs, but to support the production of grading artifacts that are more internally aligned with the expected evidential structure. In other words, the most meaningful effect of the pipeline lies not at the level of coarse output validity alone, but at the level of how consistently the final score is supported by the accepted evidence representation.
Taken together, these findings suggest that the contribution of the GradeAgentOps architecture is best understood as a decision-stabilization effect rather than a simple information-augmentation effect. The weaker components provide some benefit, but the strongest gains arise when the pipeline introduces mechanisms that improve the consistency of grading behavior and reduce evidence-level irregularities in the final accepted output. This interpretation also helps explain why the strongest gains are selective rather than uniform: the pipeline is most helpful in cases where grading quality depends on maintaining coherent judgment across multiple elements of a response, and less helpful in cases where the answer can already be handled adequately by a simpler baseline strategy.
4.3. Human Reference Variability
The interpretation of model–human agreement in this study must be considered in light of the variability already present at the human level. The results show that the two expert graders, E1 and E2, do not coincide perfectly on the final 0–10 score, but instead display a measurable level of disagreement together with a systematic directional shift. In particular, E2 tends to assign higher scores than E1, which means that the human reference used for the main ablation analysis is not simply a neutral benchmark, but one specific realization of expert judgment within a broader range of plausible human scoring behavior.
This point is important because the automated configurations are evaluated primarily against E2. As a consequence, their agreement levels reflect not only how well they capture the intended grading standard, but also how closely they align with the particular scoring tendency represented by that grader. The consistent negative bias observed across all configurations relative to E2 therefore should not be interpreted as evidence of arbitrary underscoring alone. It also reflects the fact that the chosen reference scorer is, on average, somewhat more lenient than the other expert grader, E1. In this sense, part of the apparent strictness of the automated grader is inseparable from the variability of the human reference itself.
The human–human comparison also helps place the model–human results in a more realistic perspective. Automated grading systems are often discussed as if they were being compared against a perfectly stable gold standard, but the present results show that this assumption is not appropriate even in an expert-graded setting. The human reference score is itself subject to variation, and the automated pipeline should therefore be interpreted relative to that already-existing uncertainty. This does not weaken the importance of agreement metrics; rather, it makes their interpretation more careful and more credible. A model–human result should not be read in isolation from the human–human baseline, which indicates the level of agreement that can realistically be expected.
At the same time, the existence of human reference variability does not reduce the value of the observed ablation ranking. The comparisons between configurations remain meaningful because all six variants are evaluated against the same reference scorer under the same protocol. Thus, the relative ordering of the configurations is still informative, even if the absolute agreement values must be interpreted with appropriate caution. In other words, the variability between E1 and E2 mainly affects how strongly the agreement levels can be generalized as human-equivalent, but it does not undermine the internal validity of the comparison between B0, R1, M1, C1, M2, and FULL.
Overall, the presence of measurable human reference variability strengthens rather than weakens the methodological interpretation of the study. It shows that automated grading performance should be judged against a realistic human benchmark rather than an idealized one, and it provides an important context for understanding both the bias and the agreement levels observed in the main results.
4.4. Practical Implications for Automated Grading
The results have several practical implications for the design and deployment of automated grading systems for short open-ended responses. The first implication is that stronger grading alignment does not appear to depend solely on using a more capable base model or exposing the model to more rubric-related information. Instead, the findings suggest that the structure of the grading pipeline itself matters substantially. In particular, the best results are obtained not by the simplest rubric-conditioned setup, but by a configuration that combines bounded verification, postprocessing, and consistency-oriented memory. This indicates that practical gains in automated grading quality may come less from isolated prompt enrichment and more from building a grading process that is internally constrained and behaviorally stable.
A second implication concerns the type of responses for which such a pipeline is most beneficial. The gains are more pronounced on argumentative items than on technical ones, which suggests that agentic support is especially valuable when grading requires coherent judgment across multiple aspects of a response rather than recognition of a small number of expected technical elements. In practical educational settings, this means that the strongest benefits of a structured grading pipeline may appear precisely in the kinds of questions that are traditionally harder to assess consistently with simple automated methods.
A third implication is that the choice of configuration should depend on the intended operational setting. If the main priority is maximum agreement quality, the results support the use of the FULL configuration, despite its higher runtime and attempt overhead. If, however, the priority is lower computational cost with acceptable but not maximal alignment, lighter variants such as B0 or M1 may still be attractive. This suggests that automated grading pipelines should not be viewed as having a single universally optimal setting. Rather, different deployment contexts may justify different trade-offs between grading quality and efficiency.
The results also carry an implication for the role of internal output quality in practical grading systems. The stronger configurations do not merely produce better score-level agreement; they also produce accepted outputs with a cleaner postprocess profile. This matters because a grading system that is intended for real educational use must support not only a plausible final score, but also an output structure that remains internally coherent and evidentially defensible. From a deployment perspective, this makes the pipeline more suitable for settings in which transparency, reviewability, and downstream auditing are important.
From an explainable artificial intelligence (XAI) perspective, GradeAgentOps provides artifact-level and process-level explainability. Artifact-level explainability is supported by the structured grading output, which exposes the criterion-level subscores, final score, covered and missed rubric elements, and evidence spans used to justify awarded credit. Process-level explainability is supported by verifier signals, postprocess flags, repair attempts, and provenance logs, which make it possible to inspect how a grading artifact was produced, validated, and corrected. This form of explainability is therefore operational and audit-oriented: it supports teacher review, error tracing, and reproducibility at the grading-pipeline level, while mechanistic interpretation of the internal reasoning or parameters of the underlying LLM remains outside the evaluated system.
Overall, the practical message is that automated grading quality can be improved in a meaningful way through pipeline design choices that promote consistency and internal validity, but these gains must be balanced against measurable computational overhead. The results therefore support a view of automated grading not as a single inference step, but as a structured decision process whose configuration should be matched to the pedagogical and operational priorities of the intended use case.
4.5. Limitations and Future Directions
Several limitations should be considered when interpreting the present findings. First, the evaluation is conducted on a single university-level exam dataset comprising 100 students, 10 short open-ended questions, and 1000 graded responses from one course and one institutional context. Although this setting is appropriate for controlled ablation analysis, it necessarily restricts the scope of generalization. The dataset includes both technical and argumentative items, but remains domain-specific and reflects one language, one assessment format, and one set of grading conventions. The observed behavior of the pipeline may differ across disciplines, languages, educational levels, rubric structures, response formats, and institutional grading practices. The present results should therefore be interpreted as evidence of effectiveness within a well-defined assessment setting, while broader generalization requires replication on additional courses, domains, languages, and educational contexts.
A second limitation concerns the human reference itself. The study benefits from the availability of two expert human graders, which makes it possible to quantify human inter-rater variability and to interpret model–human agreement more realistically. However, the observed human–human agreement is moderate (ICC(2,1) = 0.678; QWK = 0.678), indicating that the human benchmark itself contains measurable variability and should not be treated as an error-free ground truth. At the same time, the main ablation analysis still uses E2 as the operational reference scorer. This provides a consistent basis for internal comparison across configurations, but it also means that the reported agreement values remain tied to one specific expert scoring tendency. Consequently, model–human agreement should be interpreted relative to the observed human–human agreement context. In settings where additional graders are available, future work should examine whether the same configuration ranking remains stable when evaluated against alternative human references, aggregated human scores, or adjudicated consensus scores.
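For readers less familiar with QWK, the following minimal sketch computes quadratic weighted kappa between two raters on the 0–10 scale using the standard definition. The toy scores are synthetic and are not the study data; they only mimic the direction of the reported shift, with the second rater scoring somewhat higher.

```python
def quadratic_weighted_kappa(a, b, min_score=0, max_score=10):
    """Quadratic weighted kappa between two equal-length lists of integer scores."""
    n_cat = max_score - min_score + 1
    n = len(a)
    # Observed joint counts of (rater A score, rater B score).
    observed = [[0] * n_cat for _ in range(n_cat)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[i][j]
            # Expected count under independence of the two raters.
            den += weight * hist_a[i] * hist_b[j] / n
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1.0 - num / den


# Synthetic toy scores (not the study data).
e1 = [4, 6, 7, 5, 8, 9]
e2 = [5, 7, 7, 6, 9, 10]
kappa = quadratic_weighted_kappa(e1, e2)
mean_shift = sum(y - x for x, y in zip(e1, e2)) / len(e1)
print(round(kappa, 3), round(mean_shift, 3))
```

The example also shows why agreement and bias should be reported together: a consistent one-point shift leaves kappa fairly high while the mean signed difference remains clearly nonzero.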
A third limitation concerns the scope of the architectural evaluation. The present study focuses on a bounded set of pipeline components, namely verification, postprocessing, and two optional memory modules, under a fixed grading setting and a fixed family of generated artifacts. This design is appropriate for isolating the contribution of the proposed components, but it does not exhaust the broader design space of agentic grading systems. In particular, the evidence-matching mechanism used in the evaluated implementation is deterministic and primarily lexical, which makes it conservative from an auditability perspective. This design reduces the risk of accepting unsupported evidence claims, but it may also under-recognize semantically correct paraphrases or alternative valid reasoning patterns when they are not expressed through a close textual span in the student answer. Other forms of memory, alternative evidence-selection strategies, semantic evidence matching, different repair policies, or more explicit reasoning constraints may produce different interaction patterns than those observed here. In addition, the consistency-memory results should be interpreted in light of the streaming nature of the same-question exemplar store. This design intentionally introduces within-run calibration, similar to the way human graders may use previously evaluated answers to maintain a stable scoring standard. However, because the available exemplars depend on the fixed answer-processing order, C1 and FULL may show some order sensitivity. The memory does not contain human reference scores or external labels, but future work should quantify this order effect by repeating the consistency-memory configurations under shuffled or counterbalanced processing orders.
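The conservativeness of deterministic lexical matching can be illustrated with a minimal sketch. The normalization rule and function names below are assumptions chosen for illustration and do not reproduce the matcher used in the evaluated implementation; the example answer is invented.

```python
import re


def normalize(text):
    """Lowercase and keep word tokens only, so matching ignores punctuation and spacing."""
    return " ".join(re.findall(r"\w+", text.lower()))


def evidence_is_grounded(evidence, answer):
    """True if the normalized evidence occurs as a contiguous token span in the answer."""
    ev = normalize(evidence)
    # Pad with spaces so that matches respect token boundaries.
    return bool(ev) and f" {ev} " in f" {normalize(answer)} "


answer = "Naive Bayes assumes that features are independent, given the class."

# A close textual span is accepted.
print(evidence_is_grounded("features are independent given the class", answer))
# A semantically similar paraphrase is rejected by a purely lexical check.
print(evidence_is_grounded("attributes do not depend on each other", answer))
```

A check of this kind is easy to audit because acceptance depends only on the student text, but it rejects valid paraphrases, which is the trade-off described above.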
A fourth limitation is operational. The reported runtime and overhead results are meaningful within the experimental setup used in this study, but they should not be interpreted as hardware-independent performance guarantees. The relative differences between configurations remain informative, yet the absolute timing values depend on the specific execution environment, implementation details, and model-serving conditions under which the experiments were conducted. Future work should therefore complement the present cost analysis with evaluations across additional deployment settings and model backends.
These limitations suggest several natural directions for future research. An immediate next step is to validate the pipeline on additional courses, assessment tasks, and question types in order to test the robustness of the observed ablation patterns beyond the present dataset. A second direction is to extend the human reference framework by incorporating more expert graders and by exploring alternative reference constructions, such as aggregated or adjudicated human scores. A third direction is to investigate the interaction between consistency-oriented mechanisms and rubric-guided reasoning in more detail, especially in tasks where justification quality and evidence selection play a stronger role than short factual recall. Finally, future work should examine how the pipeline behaves when paired with other model families and deployment regimes, because the absolute agreement values, operational overhead, and even the relative ranking of configurations may vary with the grader and repair models used. Such experiments would allow both agreement quality and operational trade-offs to be assessed under a broader range of practical conditions.
Overall, the limitations of the present study do not undermine the internal validity of the reported comparison, but they do define the boundaries within which its conclusions should be interpreted. At the same time, they identify a clear research path for extending the current findings into a broader and more general account of agentic automated grading.