1. Introduction
Large language models (LLMs) are increasingly being integrated into higher-education assessment workflows, where they can generate questions, provide feedback and support automated scoring [1]. At the same time, recent work on automatic short-answer grading and AI-assisted evaluation of student work has shown both promising results and open challenges related to reliability, transparency and alignment with human graders [2]. These developments raise important questions about how well current LLMs perform on real university exam questions under human grading, how question source (instructor-authored versus pipeline-generated) affects their performance, and how to design exam items that remain difficult, or effectively LLM-resistant, for such models [3].
In this work, we use distributional shift to denote a systematic change in the distribution of exam artifacts induced by the generation workflow, operationalized here by the contrast between instructor-authored (HUMAN) and pipeline-generated (PIPELINE) questions and answers. This shift can modify surface properties such as length, structure, phrasing, and completeness signals, even when course content and the instructor’s target rubric remain fixed. We use sustainable issues to denote deployment-relevant failure modes that are systematic enough to persist under repeated use of the same workflow (rather than isolated outliers), such as condition-dependent grading conservatism that scales into credit denial under strict pass thresholds.
1.1. Background and Motivation
Written exams with short-answer questions remain a central component of summative assessment in many higher-education programs, especially in computing and applied informatics, where they are used to probe conceptual understanding, basic quantitative reasoning and simple programming skills in a time-constrained setting [1]. Such exams are typically constructed and refined by lecturers over multiple cohorts, with item difficulty and coverage adjusted informally based on experience, observed student performance and institutional grading norms [4]. This traditional calibration assumes that only human candidates take the exam and that solving the questions requires a combination of recall, reasoning and problem-solving skills that cannot be outsourced to external systems.
The rapid adoption of large language models (LLMs) in educational technology challenges this assumption. Recent systematic reviews document a growing number of LLM-powered tools that support automatic assessment, short-answer scoring and feedback generation in higher education, moving beyond small pilots towards deployment in real courses [1,5]. Case studies in programming education report LLM-based assistants that help instructors review and grade code and open-ended responses, effectively embedding general-purpose models into the assessment workflow [6,7]. At the same time, empirical studies of LLM- and machine-learning-based scoring of scientific explanations show that current models can match or approach human performance on some explanation and reasoning tasks, while still displaying systematic weaknesses on others [3]. In practice, this means that students can increasingly use external LLMs to obtain plausible answers to exam-style questions during preparation or unsupervised assessments, potentially altering the effective difficulty and discriminatory power of existing item banks.
A parallel line of work investigates LLMs as automatic graders or evaluators for short-answer questions, comparing their scores and feedback with those assigned by human markers in domains such as computer science, physics and health education [2,3,7]. These studies often report moderate to high agreement at an aggregate level, but they also reveal discrepancies for individual items, sensitivity to prompt design and rubric specification, and context-dependent failures, especially in high-stakes settings [8,9]. Broader surveys on evaluation methodologies for natural language generation and LLMs argue that automatic metrics and model-based judges can be biased or unstable, and recommend grounding evaluation in carefully designed human judgement protocols wherever possible [10,11]. In educational contexts, research on explainable AI and teacher-facing AI tools similarly shows that trust and acceptance depend on transparent, domain-specific explanations that make automated recommendations understandable and controllable for instructors [12,13]. Together, these findings suggest that LLM-based graders are best viewed as auxiliary tools or baselines; we therefore treat ExamQ-Gen as an instructor-in-the-loop workflow: the system supports exam authoring and produces grading recommendations, and the instructor assigns the final grade and pass/fail decision.
Beyond the question of whether LLMs can generate or grade exam answers at a useful level of quality, recent work on LLM evaluation highlights that model performance can vary substantially across domains, item types and experimental setups, and that headline accuracy figures may conceal pockets of systematic weakness [11,14]. From an assessment perspective, this implies that some questions in an exam may be trivially easy for contemporary models, whereas others may remain consistently difficult even for relatively capable LLMs, depending on how they combine conceptual knowledge, numerical computation and reasoning about problem setups. For lecturers who continue to rely on written exams, it becomes important not only to know how well a given model performs overall, but also which kinds of questions it tends to fail under human grading and how large the subset of such questions is within different sources of items.
These considerations motivate a course-level, human-centered analysis in which large language models are treated explicitly as virtual students taking a real exam. In this study, we focus on exam-style questions derived from a first-year applied informatics course and use this course as an empirically grounded case study for examining LLM behavior under instructor grading, while avoiding course-level generalization beyond the studied setting. Our goals are to measure how two locally deployed instruction-tuned models perform when answering short free-text versions of university exam questions that are graded by an expert instructor, to compare their behavior on instructor-authored versus pipeline-generated items, and to identify questions that remain systematically difficult for both models. We refer to such consistently challenging items, which fail to receive a passing human exam decision from any of the LLM “students”, as LLM-resistant questions and treat their prevalence as a practically relevant indicator of how robust an exam question set is to contemporary language models.
1.2. Related Work and Research Gap
To avoid conflating adjacent research threads, we organize the related work around three themes that motivate our evaluation setting: prior studies that treat LLMs as exam takers, LLM-based grading and “LLM-as-judge” approaches, and automatic question generation pipelines. We then clarify how our work differs from each theme by focusing on an instructor-in-the-loop, course-level evaluation under an operational pass/fail policy, with an explicit HUMAN versus PIPELINE comparison and severity-aware decision analysis.
Large language models have increasingly been investigated as automatic graders for short textual answers in higher education. Schneider et al. [15] evaluated LLM-based autograding across multiple courses and languages and showed that model scores could approximate instructor grades but still exhibited notable item-level disagreements and sensitivity to prompt design and scoring categories. Emirtekin [1] provided a systematic review of LLM-powered automated assessment covering 49 studies and concluded that such systems could substantially reduce grading workload, yet important concerns about validity, fairness and transparency remained, especially in high-stakes settings. Duong et al. [16] similarly reported that out-of-the-box LLMs were not yet ready to replace human examiners and should be used as decision-support tools whose recommendations are verified by instructors.
In programming and STEM education, LLM-based autograding was explored specifically for code and technical assignments. Cisneros-González et al. [6] introduced JorGPT, an instructor-aided grading system that integrated several LLMs to assess programming assignments in an undergraduate course and found that LLM-proposed grades and feedback could be incorporated into the grading workflow, but systematic instructor review was still required before assigning final marks. Jukiewicz [17] compared multiple LLMs for automated assessment of programming assignments and observed substantial variation between architectures and vendors, highlighting the need for careful model selection and calibration when deploying LLM-based graders in authentic courses. Together, these studies demonstrated that LLMs could support large-scale grading of open-ended student work while still leaving a non-trivial gap between model and human grading.
Another line of research treated LLMs explicitly as exam takers. Ros-Arlanzón et al. [18] evaluated several general-purpose LLMs on end-of-course multiple-choice exams in undergraduate medical education and reported that some models reached or exceeded median student performance, while showing considerable variation across courses and exam configurations. Gaggioli et al. [19] analyzed the reliability and validity of LLM-based assessment and argued that high performance on multiple-choice questions did not necessarily imply robust conceptual understanding, since models could exploit superficial regularities and benchmark artefacts. Taken together, these results suggested that headline accuracy on static multiple-choice question (MCQ) banks might overestimate LLM capabilities in more realistic exam scenarios that require open-ended reasoning and detailed explanations.
LLMs were also used as tools for exam question generation and related question-centric tasks. Nikolovski et al. [20] presented a comparative study of LLM-based agents for exam question generation, improvement and evaluation from higher-education course materials and showed that orchestrated LLM agents could support large-scale exam design, although expert filtering was still needed to ensure alignment with learning objectives and difficulty expectations. Scaria et al. [21] investigated automated educational question generation at different Bloom’s skill levels and found that modern LLMs produced linguistically correct and pedagogically relevant questions across multiple cognitive levels, but that quality and control over difficulty varied markedly between models and domains. Al Faraby et al. [22] analyzed the use of ChatGPT-3.5 for educational question classification and generation and concluded that, although it covered a broad range of categories, fine-grained control over domain specificity, depth and difficulty remained challenging in practice. Most of these studies assessed generated questions in terms of face validity, topical coverage and perceived usefulness, rather than in terms of how difficult they were for LLM exam takers under human grading.
In parallel, LLMs were increasingly deployed as evaluation agents or “LLM-as-a-judge”. Hashemi et al. [23] proposed the LLM-RUBRIC framework, a multidimensional, calibrated evaluation scheme in which an LLM was queried along several rubric dimensions and a calibration model was trained to better match the distribution of human scores, demonstrating that LLM-based judges were highly sensitive to rubric wording and prompt specification. Liang et al. [24] introduced the HELM framework for holistic evaluation of language models and argued that evaluation was inherently scenario-dependent and multi-metric, warning that over-reliance on any single automatic metric or LLM judge could introduce systematic bias and instability in reported performance. These findings supported the view that LLM-based evaluators should be embedded into human-controlled evaluation pipelines and used primarily as auxiliary tools.
Existing work addresses key components of our setting in isolation. Studies that compare LLM-generated assessment items against faculty-authored questions report measurable differences in question quality and item properties, and consistently treat instructor review as necessary before classroom or exam use [25,26]. Complementarily, research on question generation from instructional materials has explored deriving multiple-choice items directly from course artifacts such as lecture or video transcripts, followed by explicit quality assessment of the generated questions [27]. Finally, work on AI-assisted grading investigates how automated signals can support human grading decisions and shows that model-derived cues (e.g., attention and confidence) may not align perfectly with human judgment, motivating human oversight in high-stakes assessment contexts [28].
Despite these advances, several aspects that are central to our study remained underexplored. First, existing LLM-based autograding studies typically considered only instructor-authored questions or pre-existing short-answer datasets and did not jointly compare human-authored exam questions with questions generated automatically from the same official course script, both answered by LLMs and graded under a unified human exam scale [1,15,16,20]. Second, most evaluations of LLMs as exam takers relied on multiple-choice formats and automatic scoring, instead of using free-text answers that were systematically graded by course instructors across an entire written exam [18,19]. Third, while work on LLM-based judges and holistic evaluation documented bias, instability and persistent gaps between model and human grades, there was almost no quantitative evidence on questions that consistently defeated multiple instruction-tuned LLMs within a single course, nor on how such items could be operationalized and measured as LLM-resistant questions for future exam design [17,22,24].
This study was designed to address these gaps by combining instructor-authored and pipeline-generated exam questions from a real applied informatics course, treating two instruction-tuned student models (Llama3-8B-Instruct and Mistral-7B-Instruct) as virtual students, grading all their free-text answers on a 1–10 exam scale under expert human control, and defining LLM-resistant items as those that received a failing human exam decision across all student models. This course-level, human-aligned perspective complemented prior work on LLM-based assessment and evaluation frameworks by focusing specifically on the interaction between question source, LLM behavior and expert grading in a realistic written exam setting.
This study connects three lines of work that are often treated separately: evaluating large language models as exam takers, using LLMs as grading assistants, and generating exam questions automatically. The manuscript does not propose a new grading architecture and it does not introduce a new question generator. Instead, it contributes an instructor-in-the-loop, course-level evaluation that links these components under a realistic pass/fail policy. By contrasting instructor-authored items with pipeline-generated items and analyzing both score-level and policy-induced decision outcomes, the study isolates high-impact false-fail regimes such as credit denial, which can remain hidden under aggregate score agreement. We therefore formulate below the study design and contributions that address this gap.
1.3. Study Design and Contributions
This study was designed as a course-level analysis of large language models treated explicitly as virtual students taking a real university exam. ExamQ-Gen is designed for instructor-in-the-loop exam use: the question-generation pipeline supports drafting self-contained exam items grounded in course materials, while grading-related analyses are meant to inform instructor-controlled decision policies. Accordingly, all high-stakes outcomes (including pass/fail) remain instructor-authorized, and automatic grading is treated only as a secondary baseline. We focused on an introductory applied informatics course (IA2) in which written exams with short-answer items are the main summative assessment instrument. The exam content was organized into two complementary sets of questions. The HUMAN set consisted of instructor-authored items taken from past editions of the course exams, covering conceptual questions on introductory artificial intelligence, basic probability and statistics, simple machine-learning workflows and elementary Python programming. The PIPELINE set consisted of questions generated automatically from the official course script using a dedicated ExamQ-Gen pipeline, which extracted topic-specific fragments of the PDF, prompted a teacher LLM to propose short-answer exam items with reference solutions, and filtered malformed outputs. All questions were reformulated in a self-contained style that does not rely on external materials during testing.
Two instruction-tuned LLMs deployed locally were treated as student models. For each question in both the HUMAN and PIPELINE sets, the models received only the question text and produced a single short free-text answer, without access to the course script or to the teacher-model reference solutions. The models answered the entire exam in one pass, following the same ordering and time-agnostic constraints as a human student, and no chain-of-thought prompts or external tools were used. This setup was intended to approximate a realistic “LLM sits the exam” scenario in which models must respond concisely and directly to each item.
All model answers were graded by the course instructor, using the same 1–10 numeric scale and pass/fail policy as in the actual exam. For each question-model pair, the course instructor assigned a score between 1 and 10 and a categorical correctness label (correct/partial/incorrect). From these judgments, we derived a binary exam_point indicator that takes the value 1 when the instructor score is ≥9 (passing under the local IA2 exam policy) and 0 otherwise. In addition to this human grading, we also computed scores with an auxiliary automatic grader based on a smaller LLM, which received the question, the teacher-model reference solution and the student-model answer. However, these automatic scores were used only as a secondary baseline; all analyses in this paper are based on the human-assigned scores and exam_point values. This design reflects the intended deployment setting, where automated grading can be used for triage or feedback, but does not replace instructor judgment.
Within this framework, we defined LLM-resistant questions as exam items that remained unsolved by all student models under human grading. Concretely, a question is marked as LLM-resistant if every LLM student receives a failing human decision for that item, that is, an exam_point of 0 across all models. We then analyzed the prevalence and characteristics of such questions across topics, cognitive skills and question sources (HUMAN vs. PIPELINE). This allowed us to examine not only how well the models performed on average, but also which parts of the exam landscape remained systematically challenging for them under realistic grading conditions. From an instructor-in-the-loop perspective, these items provide a practical signal for exam design: they help identify concepts and competencies that the evaluated student models fail to demonstrate reliably, and they motivate question types that are less susceptible to generic, template-like answers. At the same time, we treat this notion as diagnostic rather than normative (LLM-resistant does not imply pedagogical superiority by itself) and interpret it alongside standard quality checks (clarity, coverage, and alignment with the intended learning outcomes).
The study addressed three guiding research questions:
RQ1: How do locally deployed instruction-tuned LLMs (Llama3-8B-Instruct and Mistral-7B-Instruct) perform as virtual students on a real written exam, when evaluated on a 1–10 scale by the course instructor?
RQ2: How does model performance differ between HUMAN exam questions authored by the course instructor and PIPELINE questions generated automatically from the official course script?
RQ3: Which questions remain consistently difficult for both evaluated student models under human grading, and how can these items be characterized as LLM-resistant in terms of topic, required skills and source?
Our main contributions can be summarized as follows:
Course-level evaluation framework. We designed and implemented a realistic exam setting in which locally deployed LLMs act as virtual students on short-answer questions from an actual university course, with answers graded by an expert lecturer on the operational 1–10 exam scale and an explicit pass/fail decision.
Dual-source exam question set. We constructed and analyzed a paired collection of HUMAN exam questions authored by the course instructor and PIPELINE questions generated automatically from the official course script, with all items reformulated into a self-contained exam format suitable for both human and LLM candidates.
Human-aligned grading and auxiliary automatic baseline. We combined expert human grading of all LLM answers with an auxiliary LLM-based grader used strictly as a decision-support baseline, and we treated the human scores and exam_point decisions as the sole ground truth for all subsequent analyses, thereby aligning evaluation with real exam practices rather than with purely automatic metrics.
Operational definition and analysis of LLM-resistant questions. We introduced a concrete, exam-centered definition of LLM-resistant questions as items that all student models fail under human grading, and we quantified their prevalence and characteristics across topics, required skills and question sources, providing actionable signals for future exam design in the presence of powerful language models.
2. Materials and Methods
This section describes the experimental setup used to study large language models on exam-style questions derived from a real university course. We first introduce the construction of two exam question datasets, consisting of instructor-authored items and questions generated automatically from the official course script. We then present the ExamQ-Gen pipeline for automatic question generation, followed by the protocol used to obtain LLM-based answers and human-aligned grades on a 1–10 scale. Finally, we define the metrics and aggregation procedures used to quantify model performance and to analyze which types of questions are more difficult for LLMs or effectively LLM-resistant.
2.1. Exam Question Datasets
Our experiments used exam-style questions derived from IA2, a first-year undergraduate course in a Computer and Information Technology bachelor’s program. IA2 covers introductory material on artificial intelligence, uncertainty and probabilities, simple machine-learning workflows and elementary Python programming. The official course script is provided as a single PDF and served as the authoritative content source for automatic question generation, while past written exams provided instructor-authored items.
We considered two families of exam questions. HUMAN questions were taken from historical IA2 exam sheets authored by the course instructor. These items cover conceptual aspects of artificial intelligence and typical learning tasks, numerical exercises based on odds and simple probability calculations, Naive Bayes spam filtering with small word-count tables, linear regression on receipt-style feature vectors, and short Python questions that require predicting the output or the data type of simple code fragments. Each multiple-choice item was converted into a free-text format by retaining the question stem and the correct solution, while discarding the answer options. The resulting HUMAN items were reformulated as short, self-contained prompts so that the student models received only the question text, without multiple-choice cues.
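The conversion of multiple-choice items into self-contained free-text prompts can be sketched as follows. The field names (`stem`, `options`, `correct_index`) are hypothetical and chosen for illustration only; the actual record layout used in the study may differ.

```python
def mcq_to_free_text(item: dict) -> dict:
    """Convert a multiple-choice record into a self-contained free-text item.

    The question stem and the correct solution are kept; the answer options
    are moved to a provenance-only field and never shown to student models
    or graders, mirroring the conversion described in the text.
    """
    return {
        "question": item["stem"],
        "reference_solution": item["options"][item["correct_index"]],
        "source": "HUMAN",
        # Original options retained for provenance, not exposed downstream.
        "mc_options": item["options"],
    }

# Hypothetical example item.
mcq = {
    "stem": "What data type does the expression 3 / 2 evaluate to in Python 3?",
    "options": ["int", "float", "str", "bool"],
    "correct_index": 1,
}
free_text = mcq_to_free_text(mcq)
```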
PIPELINE questions were generated automatically from the IA2 course script by our ExamQ-Gen pipeline. Starting from predefined page ranges corresponding to IA2 topics, the pipeline extracted text from the PDF, constructed prompts anchored in the script, and used a locally hosted instruction-tuned Llama-3.3-70B-Instruct [29] model to produce one question and a short reference answer per generation call. The generator was instructed to produce self-contained exam-style questions aligned with the same topical areas as the HUMAN items. As in the HUMAN set, the question text was designed to be solvable without access to the course script, with all necessary context stated explicitly in the prompt.
All questions were represented in a unified format as short, self-contained prompts with concise reference answers and were stored as two separate line-based JSONL files, one for HUMAN questions and one for PIPELINE questions, using a common schema with metadata indicating the question source, topic labels and other identifiers. For HUMAN items we also stored the original multiple-choice options in a dedicated field, although these options were not exposed to the student models or graders. An overview of the two datasets is given in Table 1.
Together, the HUMAN and PIPELINE datasets provide the empirical basis for all experiments reported in the remainder of this paper.
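To make the shared JSONL schema concrete, the following sketch round-trips one hypothetical record; the field names shown here are illustrative assumptions, not the exact schema used in the study.

```python
import json

# Hypothetical record in the common schema described above.
record = {
    "course": "IA2",
    "source": "PIPELINE",            # "HUMAN" or "PIPELINE"
    "topic": "naive_bayes",
    "question_id": "nb-07",
    "question": "A spam filter observes the word counts below ...",
    "reference_solution": "Applying Bayes' rule with the given counts ...",
}

# Each JSONL line is one independently parseable JSON object.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```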
2.2. Automatic Question Generation Pipeline
While HUMAN questions were taken from past written exams, PIPELINE questions were produced automatically from the official course script by a Python pipeline that we refer to as ExamQ-Gen. The pipeline took as input the IA2 PDF, operated at the level of course topics and page ranges, and produced as output exam-style question–answer pairs in the same format as the HUMAN items.
In a first step, the pipeline used a PDF reader to extract plain text from a specified contiguous range of pages corresponding to a given topic. The extracted text was lightly cleaned and truncated to a fixed maximum length in characters in order to stay within the 128 k-token context window of the Llama-3.3-70B-Instruct generator, while preserving the main definitions, examples and numerical tables relevant to the topic. This fragment was treated as teacher-only background material: it was visible to the generator but not to the LLMs that later acted as students.
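The cleaning and truncation step can be sketched as a small pure function. The raw page text would come from a PDF reader (e.g. pypdf's `page.extract_text()`); the 20,000-character cap below is an assumed value, stated only to illustrate the fixed maximum length mentioned above.

```python
def clean_and_truncate(raw_text: str, max_chars: int = 20_000) -> str:
    """Lightly clean extracted PDF text and cap its length in characters.

    Whitespace runs and line breaks are collapsed to single spaces, then
    the fragment is truncated so it fits the generator's context window.
    The default cap is an illustrative assumption.
    """
    return " ".join(raw_text.split())[:max_chars]

# Simulated raw extraction output with ragged whitespace.
raw = "Naive  Bayes\n\nclassifiers assume   conditional independence.\n" * 1000
fragment = clean_and_truncate(raw, max_chars=120)
```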
In a second step, ExamQ-Gen built a chat-style prompt that combined the course fragment with instructions for producing exactly one exam-style question and one corresponding solution. The instructions required that both the question and the solution were written in the course language, were fully self-contained, and reused concrete elements from the fragment such as numerical values, feature names, dataset descriptions or code snippets. The model was also encouraged to favor medium-difficulty, multi-step reasoning tasks over purely definitional questions. The prompt was sent to a locally hosted instruction-tuned Llama-3.3-70B-Instruct model, which returned a single text block containing both the question and the solution.
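The prompt-assembly step can be sketched as below. The instruction wording is an illustrative paraphrase of the requirements described in the text, not the exact prompt used in the study, and the `QUESTION:`/`SOLUTION:` markers are assumed names for the explicit markers mentioned later.

```python
def build_generation_prompt(fragment: str, topic: str) -> list[dict]:
    """Assemble a chat-style prompt asking for one question-solution pair.

    Combines the teacher-only course fragment with instructions requiring a
    self-contained, medium-difficulty item that reuses concrete elements
    (numbers, names, code) from the fragment.
    """
    system = (
        "You are an exam author. Write exactly ONE self-contained, "
        "medium-difficulty, multi-step exam question and ONE reference "
        "solution in the course language. Reuse concrete numerical values, "
        "feature names, dataset descriptions or code snippets from the "
        "material. Prefix the parts with 'QUESTION:' and 'SOLUTION:'."
    )
    user = f"Topic: {topic}\n\nCourse material:\n{fragment}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_generation_prompt("Odds of 3:1 correspond to p = 0.75.", "odds")
```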
In a final step, the output text was parsed into separate question and solution fields using explicit markers requested in the prompt. The generation prompt also required an internal self-check ensuring that numeric values and specific entity names used in the solution are present in the question text. We applied a simple numeric consistency check: all numeric literals in the solution were extracted and required to also appear in the question text, with any violations flagged for later inspection. Each generated item was stored as one record in a line-based JSONL file, with fields indicating the course code, topic or chapter identifier, question index, question text, reference solution and the page range used to build the context. These JSONL files constituted the PIPELINE dataset used in the LLM-based answering and auxiliary automated grading stages described later in this section. In the experiments reported here, we used the resulting 70-item PIPELINE JSONL dataset as generated, without additional filtering or manual post-processing. The main components of the ExamQ-Gen pipeline are summarized in Figure 1.
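The numeric consistency check can be sketched as a small filter; the regular expression below is an illustrative assumption about how numeric literals are matched. Note that under this strict form of the check, derived values that appear only in the solution (e.g. a computed probability) are also flagged, which is why violations are inspected rather than discarded automatically.

```python
import re

# Matches integers and simple decimals with '.' or ',' separators (assumed pattern).
NUM_RE = re.compile(r"-?\d+(?:[.,]\d+)?")

def numeric_consistency(question: str, solution: str) -> list[str]:
    """Return numeric literals that occur in the solution but not in the question.

    An empty list means the item passes the check; a non-empty list is
    flagged for later inspection.
    """
    question_nums = set(NUM_RE.findall(question))
    return [n for n in NUM_RE.findall(solution) if n not in question_nums]

q = "A spam filter sees the word 'free' 12 times in 40 spam mails. Estimate P(free | spam)."
```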
Together, these steps defined the ExamQ-Gen pipeline and yielded the PIPELINE question set that we later used as input for the LLM-based answering and grading stages.
2.3. LLM-Based Answering and Human-Aligned Grading
For each question in the HUMAN and PIPELINE datasets, we simulated two students by querying two locally deployed instruction-tuned models, Llama3-8B-Instruct [30] and Mistral-7B-Instruct [31]. Each student model was instructed to answer each item in the same language as the question, using a short free-text style appropriate for a written exam. The prompt included only the question text and a brief instruction to produce a short free-text answer suitable for grading, without access to the course script, to multiple-choice options, or to any additional context. Each question was answered once by each student model, in separate runs. Student answers were generated via the Ollama chat API using temperature = 0.0, top_p = 1.0, max_tokens = 512, num_ctx = 8192, and seed = 42. The generated answers were appended to the corresponding record as additional fields, so that every item contained a question, a reference solution, and two model-generated answers produced by the Llama and Mistral models.
To obtain grades on the same 1–10 scale used in the local examination system, we used a second instruction-tuned model, Qwen2.5-7B-Instruct [32], as a fast decision-support automatic grader. For each item, the grader received the exam question, the reference solution (instructor-authored for HUMAN items or generated by the pipeline for PIPELINE items) and one of the model-generated answers (from Llama 3 or Mistral 7B). The grading procedure was applied separately to the answer of each student model. The grading instructions asked the model to compare the model-generated answer with the reference solution, to assign an integer score between 1 and 10 based on semantic correctness and completeness, and to select a correctness label in {correct, partial, incorrect}. The grader prompt included explicit rubric guidance (comparison against the reference solution, numerical/key-concept checks) and enforced a structured JSON output, and the grader was run deterministically (temperature = 0). From this label we derived a binary exam indicator, exam_point, set to 1 only when the label is correct and to 0 otherwise. The resulting grader outputs were stored in intermediate JSONL files and served as an automatic baseline and sanity check, but were not used as the primary evaluation signal in our analyses.
Final grades were provided by the course instructor. For each question–model pair, the instructor reviewed the exam question, the reference solution, and the model-generated answer, and recorded (i) an integer score on the same 1–10 scale as in the local examination system and (ii) a categorical correctness label in {correct, partial, incorrect}. In our graded dataset, these labels correspond to coarse score bands: incorrect → 1 or 3, partial → 6–8, and correct → 9–10. Following the operational pass policy used in the course, we derived the binary exam indicator exam_point as 1 if the instructor score was ≥9, and 0 otherwise. These instructor-assigned scores, labels, and exam_point values were treated as the ground truth for our analyses.
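The operational pass policy and the coarse label-to-score bands described above can be captured by two small helpers; the set representation of the bands is an illustrative assumption.

```python
def exam_point_from_score(score: int) -> int:
    """Apply the operational IA2 pass policy: pass (1) only when score >= 9."""
    return 1 if score >= 9 else 0

def band_is_consistent(label: str, score: int) -> bool:
    """Check an instructor record against the coarse label-to-score bands.

    incorrect -> {1, 3}, partial -> {6, 7, 8}, correct -> {9, 10},
    as observed in the graded dataset.
    """
    bands = {
        "incorrect": {1, 3},
        "partial": {6, 7, 8},
        "correct": {9, 10},
    }
    return score in bands.get(label, set())
```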
Table 2 provides representative graded examples from both instructor-authored (HUMAN) and pipeline-generated (PIPELINE) items, illustrating how the instructor rubric is applied across correct, partially correct, and incorrect answers.
The two grading stages returned their decisions in a compact, structured format that specified the numeric score, the correctness label and, in some cases, a short justification. All outputs were parsed and stored both in line-based JSONL/CSV files and in a Neo4j [
33] graph database, together with the original question, reference solution and corresponding model-generated answer, yielding a graded dataset with one record per question-model pair. Unless otherwise stated, all subsequent analyses in this paper are based on the human grades and the associated human
exam_point indicator, while the grader outputs are used only as an auxiliary decision-support baseline. The overall answering and grading process is summarized in
Figure 2.
Together, the LLM-based answering setup, the automatic baseline grades and the course instructor’s final grades provided a unified graded dataset that we use in the next subsection to define our performance metrics, difficulty labels (easy/medium/hard) and the notion of LLM-resistant questions.
2.4. Metrics and Analysis of LLM Performance and Question Difficulty
The graded dataset obtained from the answering and grading setup assigned, for each question and for each student model, a numeric score on a 1–10 scale and a categorical correctness label (correct/partial/incorrect), together with a derived binary exam indicator, exam_point, that marked whether the answer was counted as correct (1) or not (0). In what follows, we used the human grades and the associated exam_point values as our primary evaluation signals.
At the dataset level, we computed the mean, median and empirical distribution of human scores and exam_point separately for each combination of question source (HUMAN versus PIPELINE) and student model (Llama3-8B-Instruct versus Mistral-7B-Instruct). In addition, we reported the proportion of questions whose answers were graded correct, partial, or incorrect. These aggregates were further broken down by topic (introductory artificial intelligence, odds and probabilities, Naive Bayes spam filtering, linear regression on receipt-style data and basic Python programming), providing an overview of how easily each model handled different types of exam questions. Beyond these marginal summaries, we also examined the agreement between the two student models by comparing their item-level scores and by measuring how often they received the same correctness label on a given question.
To obtain an interpretable notion of item difficulty that does not conflate question difficulty with student preparedness, the course instructor labeled each exam item as easy, medium, or hard based on the expected difficulty for IA2 students and the reasoning steps required by the question. The labels were assigned at the question level before any scoring and they are independent of the scores later assigned to LLM answers. We use these labels only for stratified reporting of model performance across difficulty levels.
The presence of two student models also allowed us to identify questions that were consistently challenging under the human grading scheme. Items that received human exam_point = 0 for both Llama3-8B-Instruct and Mistral-7B-Instruct were treated as LLM-resistant questions in this study. For each topic and question source we reported the proportion of such LLM-resistant items, which served as our main quantitative indicator of how robustly resistant the HUMAN and PIPELINE question sets were to reasonably capable LLM students.
2.5. Implementation Details
All experiments were carried out on a dedicated virtual machine running a 64-bit Windows operating system. The machine was equipped with an AMD EPYC 9654 96-core processor (2.40 GHz), 128 GB of RAM, a 3 TB SSD, and an NVIDIA L40S-48Q GPU with 48 GB of VRAM.
The ExamQ-Gen question generation pipeline was implemented in Python 3.11 using the HuggingFace transformers [
34] and accelerate libraries together with bitsandbytes [
35], a GPU quantization library that enables low-precision 4-bit weight representations, and pypdf for text extraction from the IA2 course script. For PIPELINE questions we loaded a locally stored, 4-bit quantized Llama-3.3-70B-Instruct model from disk, using a helper that configured the tokenizer with left padding, set the padding token to the end-of-sequence token if needed, and mapped all layers to the single GPU with an appropriate floating-point compute type. Context fragments of up to 6000 characters were extracted from user-specified page ranges of the course PDF, and for each requested item we built a chat-style prompt, generated up to 512 new tokens using nucleus sampling (temperature = 0.4, top_p = 0.9), parsed the result into a single question-solution pair and wrote it to a JSONL file together with metadata about the chapter and page range.
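To make the item-generation step concrete, the sketch below shows a chat-style prompt builder and the JSONL record writer; the prompt wording, function names, and record fields are illustrative assumptions, not the ExamQ-Gen source code, and the actual generation call to the quantized model is omitted:

```python
import json

MAX_CONTEXT_CHARS = 6000  # context fragment length used for PIPELINE generation

def build_prompt(context: str, topic: str):
    """Chat-style prompt requesting one question-solution pair (wording is illustrative)."""
    return [
        {"role": "system",
         "content": "You write one short exam question with a reference solution."},
        {"role": "user",
         "content": f"Topic: {topic}\nCourse material:\n{context[:MAX_CONTEXT_CHARS]}\n"
                    "Return the question and the solution, clearly separated."},
    ]

def write_item(path, question, solution, chapter, pages):
    """Append one generated item plus provenance metadata as a JSONL record."""
    record = {"question": question, "solution": solution,
              "chapter": chapter, "pages": pages}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

One JSONL record per generated item keeps the downstream answering and grading stages line-oriented.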
The answering and grading stages used the HUMAN and PIPELINE JSONL files as input. Llama3-8B-Instruct and Mistral-7B-Instruct were deployed locally as student models, and Qwen2.5-7B-Instruct was deployed as the automatic grader. All three models were served through a lightweight API interface provided by Ollama [
36] on the same machine. For each question, Python scripts issued one request per student model to obtain short free-text answers, stored these answers in intermediate JSONL files, and then submitted the question, reference solution and model-generated answer to the automatic grader. Grader outputs were parsed automatically, with a small number of retries in case of malformed responses, and were written both to JSONL files and to a Neo4j graph database that stored questions, answers, models and grades as nodes and relationships. In a subsequent step, an expert instructor reviewed each question–answer pair and recorded the final human score and
exam_point label in CSV files with one record per model–question pair. Configuration files specifying model names, decoding parameters, database connection settings and random seeds were stored together with the datasets to facilitate replication and further experimentation.
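The retry-on-malformed-response behavior mentioned above can be sketched as follows; the expected JSON shape ({"score": ..., "label": ...}) and the function boundaries are our assumptions, and the actual HTTP call to the Ollama endpoint is abstracted behind a callable:

```python
import json

def parse_grade(raw: str):
    """Parse the grader's structured JSON reply; raise ValueError if malformed."""
    obj = json.loads(raw)
    score, label = int(obj["score"]), obj["label"]
    if not (1 <= score <= 10) or label not in {"correct", "partial", "incorrect"}:
        raise ValueError("out-of-range score or unknown label")
    return {"score": score, "label": label}

def grade_with_retries(call_grader, payload, max_retries: int = 3):
    """Query the grader and retry a few times on malformed output.

    `call_grader` is any callable (e.g., wrapping an Ollama API request)
    that returns the raw model reply as a string."""
    for _ in range(max_retries):
        try:
            return parse_grade(call_grader(payload))
        except (ValueError, KeyError):
            continue
    return None  # give up; flag the item for manual review
```

Because json.JSONDecodeError subclasses ValueError, invalid JSON and out-of-range scores are handled by the same retry path.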
3. Results
This section reports the empirical evaluation of the student LLMs under a standardized instructor grading protocol. Results are presented from an overall model-level comparison of performance and score distributions to progressively more fine-grained breakdowns that isolate the contribution of question source and other experimental factors.
3.1. Overall Performance of the Student LLMs
Overall performance of the student LLMs (Llama3-8B-Instruct and Mistral-7B-Instruct) was evaluated under a standardized instructor grading protocol on the operational 1–10 grading scale. For each model, results were pooled across both question sources (HUMAN and PIPELINE), yielding n = 150 graded answers per model (80 HUMAN + 70 PIPELINE). In addition to the numeric score, each graded response was associated with a binary exam indicator (exam_point), set to 1 when the answer would receive credit in an exam setting and 0 otherwise.
To provide a compact statistical characterization of score distributions,
Table 3 reports the mean and standard deviation (SD), median, minimum/maximum, and the interquartile range (IQR) for each model. The IQR is defined as IQR = Q3 − Q1 (75th minus 25th percentile) and captures the spread of the middle 50% of the distribution, complementing the SD, which is more sensitive to extreme values. In the pooled analysis, Mistral achieved a higher mean score (5.353) than Llama3 (3.407), reflecting stronger performance in the upper part of the score distribution. However, the median was 1.0 for both models, indicating that at least half of all responses received the minimum score, with the difference between models driven primarily by the upper tail rather than by typical (median) performance.
To move beyond aggregate statistics and expose distributional shape,
Figure 3 presents boxplots of the instructor scores and annotates the proportions of extreme outcomes at 1 and 10 for each model. In a boxplot, the box spans the first to third quartile, also known as the interquartile range, and the horizontal line marks the median. Whiskers summarize the remaining spread and outliers indicate unusually low or high values. The score distributions are highly polarized, with most mass at the endpoints rather than spread smoothly across the 1 to 10 scale. For Mistral, 149 of 150 answers received an extreme score, with 77 graded 1 and 72 graded 10, leaving only a single intermediate score of 6. For Llama3, 135 of 150 answers received an extreme score, with 104 graded 1 and 31 graded 10. This near-binary pattern indicates that in this setting the models rarely produced answers that merited partial credit. Intermediate scores were reserved for partially correct answers (correct core idea but missing required elements or containing non-trivial mistakes), consistent with the instructor rubric. Representative intermediate-score cases (e.g., scores 6 and 8) are provided in
Table 2. These boxplots pool HUMAN and PIPELINE items, aggregating over two exam conditions that can place the models in different performance regimes; we therefore break down performance by question source next.
To report exam-level outcomes aligned with the pass or fail decision used in our setting,
Table 4 reports the pass-point rate, together with 95% Wilson confidence intervals for the underlying binomial proportion. This decision-level summary remains informative even when the full 1 to 10 score distribution is polarized, because it directly measures how often each model would receive exam credit under the same human grading policy. The Wilson interval was used because it remains well-behaved for proportions away from 0.5 and provides stable uncertainty estimates at moderate sample sizes. The pass-point rate was 0.240 (36/150; 95% CI [0.179, 0.314]) for Llama3 and 0.480 (72/150; 95% CI [0.402, 0.559]) for Mistral, indicating that Mistral was substantially more likely to receive exam credit. Notably, in this dataset, passing decisions occurred only for high scores 9 and 10, meaning that the pass-point rate tracked the proportion of scores ≥ 9 rather than the overall mean.
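The Wilson score interval used for these proportions has a closed form and is easy to reproduce; a minimal implementation (standard formula, z = 1.96 for a 95% interval):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, wilson_ci(36, 150) reproduces the Llama3 interval [0.179, 0.314] reported above.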
Overall, the pooled analysis indicated a clear performance advantage for Mistral over Llama3, both in score-level and in exam-level credit assignment, while also revealing a strongly polarized grading pattern (many minimum scores coexisting with a large mass at the maximum score). This global view motivated the subsequent breakdowns, which disentangled whether the observed differences were driven primarily by question source (HUMAN vs. PIPELINE) and other experimental factors.
3.2. Performance by Question Source (HUMAN vs. PIPELINE)
Because the exam set combined two qualitatively different sources of questions—instructor-authored items (HUMAN) and automatically generated items derived from the course script (PIPELINE)—overall model comparisons were decomposed by source to determine whether the pooled performance differences were driven by a specific subset. This stratification was necessary because question source can alter both the difficulty profile and the linguistic form of items, which in turn can change not only average scores but also the shape of the grading distribution (e.g., concentration at the minimum score versus concentration at the maximum score).
To quantify instructor scores within each source,
Table 5 reports descriptive statistics for the 1–10 scale by (model, source): mean and standard deviation (SD), median, minimum/maximum, and the interquartile range (IQR). Here, IQR = Q3 − Q1, i.e., the difference between the 75th and 25th percentiles; it summarizes the spread of the middle 50% of outcomes and is especially informative when distributions are heavy at the extremes. In highly polarized grading, the IQR can either collapse to zero (when both Q1 and Q3 coincide at the same value, typically 1 or 10) or become very large (when Q1 sits at 1 and Q3 at 10); both outcomes directly reflect endpoint-heavy distributions rather than anomalies.
These summaries indicated a strong dependence on question source. On HUMAN questions (n = 80 per model), Llama3 achieved a higher mean score than Mistral (4.200 vs. 3.087), while the median remained 1.0 for both models. This combination (higher mean but identical median) implied that differences were driven primarily by the upper part of the distribution rather than by small shifts in typical performance. A striking feature was the very large IQR for HUMAN/Llama3 (9.0), which occurred because the 25th percentile sat at the minimum and the 75th percentile reached the maximum, indicating substantial mass at both endpoints within the middle 50% span. In contrast, HUMAN/Mistral had IQR = 0.0, indicating that the central mass remained tightly concentrated at the minimum despite the presence of some perfect scores (as reflected by the maximum of 10).
On PIPELINE questions (n = 70 per model), the pattern changed qualitatively. Mistral reached a mean of 7.943 and a median of 10.0, placing it in a near-ceiling regime on this subset, whereas Llama3 remained concentrated near the floor (mean 2.500, median 1.0). The fact that both PIPELINE groups had IQR = 0.0 did not mean an absence of meaningful variation; rather, it meant that at least half of each distribution’s mass accumulated at a single value (1 for Llama3, 10 for Mistral), consistent with extreme polarization.
To make this polarization explicit and visually interpretable (beyond percentiles and averages),
Figure 4 visualizes the same four groups using boxplots and annotates the proportions of score = 1 and score = 10. In a boxplot, the box spans Q1–Q3 (the IQR) and the central line marks the median; when distributions concentrate at a single value, the box can degenerate (IQR = 0), which directly reflects the data.
The annotated extremes made the distributional regimes behind the summary statistics explicit. On HUMAN, both models frequently received the minimum score, with score = 1 occurring for 58.8% of Llama3 answers and 76.3% of Mistral answers; nonetheless, perfect scores were still common (26.3% for Llama3 and 22.5% for Mistral), which explained why means differed even though both medians stayed at 1. On PIPELINE, outcomes separated sharply: Llama3 concentrated at the minimum (score = 1 in 81.4% of cases; score = 10 in 14.3%), whereas Mistral concentrated at the maximum (score = 10 in 77.1% of cases; score = 1 in 22.9%). This constituted a clear source-dependent reversal, meaning that the relative ranking of the student models depended on the question source rather than being stable across subsets.
Because exam outcomes are determined by thresholded decisions, pass/fail behavior was also reported.
Table 6 reports the pass-point rate (exam_point) for each (model, source) group together with 95% Wilson confidence intervals for the corresponding binomial proportions. The Wilson interval provided stable uncertainty estimates for proportions that were far from 0.5 and avoided pathological behavior near 0 or 1.
The pass-point breakdown reinforced the score-level findings and quantified their practical impact. On HUMAN, Llama3 achieved 0.325 (26/80), while Mistral achieved 0.225 (18/80), indicating a modest advantage for Llama3 on instructor-authored items. On PIPELINE, the separation was large and reversed: Mistral achieved 0.771 (54/70), whereas Llama3 achieved 0.143 (10/70). Importantly, these were not small proportional shifts around a common baseline; they reflected fundamentally different regimes of exam credit assignment—near-ceiling performance for Mistral on PIPELINE versus persistent near-floor outcomes for Llama3 on the same subset.
To provide a compact visual summary of these exam-level outcomes,
Figure 5 presents the pass-point rates and confidence intervals across sources and models.
Overall, decomposing performance by question source showed that pooled results masked a strong interaction: the two models behaved similarly poorly on many HUMAN questions (with Llama3 retaining a modest advantage), while PIPELINE questions amplified differences substantially and favored Mistral decisively. This source dependence explained why aggregate comparisons in the pooled analysis could be misleading, and it motivated subsequent analyses of which properties of PIPELINE items were associated with near-ceiling performance for Mistral and near-floor outcomes for Llama3.
3.3. Agreement Between the Course Instructor and the Automatic Grader
To verify that the automatic grading pipeline aligns with the human grading signal used throughout the analysis, we compared the grader against the course instructor on both the continuous 1–10 score and the binary exam_point decision. Because the dataset mixes HUMAN (instructor-authored) and PIPELINE (pipeline-generated) items, we report agreement overall and by source to detect source-dependent shifts.
Score-level agreement is summarized in
Table 7, which reports MAE (mean absolute error), RMSE (root mean squared error), the mean bias (Grader − Instructor), and Spearman’s ρ (rank consistency under the discrete score scale). Overall, the grader shows moderate alignment with the instructor (MAE 2.21; bias −0.457; ρ 0.559), but this aggregate masks a strong source effect: agreement is much higher for HUMAN (MAE 0.81; bias −0.125; ρ 0.799) and substantially weaker for PIPELINE (MAE 3.81; bias −0.836; ρ 0.189). The consistently negative bias indicates that the Grader tends to assign lower scores than the instructor, with under-scoring substantially more pronounced on pipeline-generated items.
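The score-level agreement metrics in Table 7 are standard; a minimal sketch of MAE, RMSE, and mean bias over paired scores (Spearman's ρ can be obtained from scipy.stats.spearmanr and is omitted here):

```python
from math import sqrt

def score_agreement(grader, instructor):
    """MAE, RMSE and mean bias (grader - instructor) over paired 1-10 scores."""
    diffs = [g - i for g, i in zip(grader, instructor)]
    n = len(diffs)
    return {
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "bias": sum(diffs) / n,  # negative means the grader under-scores
    }
```
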
To inspect whether disagreement depends on score magnitude and source, the score differences are visualized in
Figure 6 using a Bland–Altman plot, where the y-axis is (Grader − Instructor) and the x-axis is the mean of the two scores. The plot confirms a small overall negative bias (−0.457) and wide limits of agreement (approximately [−7.674, 6.761]). Coloring by source makes the shift explicit: bias is near zero on HUMAN (−0.125) but becomes clearly more negative on PIPELINE (−0.836).
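The Bland–Altman quantities plotted in Figure 6 follow directly from the paired differences; a sketch of the mean difference and the conventional 95% limits of agreement (mean ± 1.96 × SD of the differences):

```python
from statistics import mean, stdev

def bland_altman(grader, instructor):
    """Mean difference and 95% limits of agreement for paired scores."""
    diffs = [g - i for g, i in zip(grader, instructor)]
    m, s = mean(diffs), stdev(diffs)  # sample SD of the differences
    return m, (m - 1.96 * s, m + 1.96 * s)
```
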
Beyond numeric scores, the evaluation also uses exam_point as an operational pass/fail decision. In this dataset, the instructor’s exam_point follows a strict threshold rule: exam_point = 1 only for instructor scores 9–10, and exam_point = 0 otherwise. Given the strongly polarized score distribution (dominated by scores at the minimum and maximum), this implies that pass-point rates primarily reflect the prevalence of scores ≥ 9 rather than partial-credit improvements.
Pass-point agreement is reported in
Table 8 through confusion counts (TP/FP/FN/TN) and derived metrics, including FPR (false-pass rate) and FNR (false-fail rate). Overall accuracy is 0.78, but the error profile is asymmetric: the Grader produces few false passes (FPR 0.03) while missing a large fraction of instructor passes (Recall 0.44; FNR 0.56). This asymmetry is again strongly source-dependent. On HUMAN, the Grader closely matches the instructor (Accuracy 0.93; Recall 0.84; FNR 0.16). On PIPELINE, the Grader becomes highly conservative (Accuracy 0.61) and recovers only 17% of Instructor passes (Recall 0.17; FNR 0.83), meaning that many answers that receive exam credit from the instructor are labeled as fails by the Grader in the pipeline setting.
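The confusion counts and the FPR/FNR definitions used in Table 8 can be made explicit with a short helper, treating the instructor's exam_point as ground truth (function and key names are ours):

```python
def decision_errors(grader_pass, instructor_pass):
    """Confusion counts and error rates for binary pass decisions."""
    pairs = list(zip(grader_pass, instructor_pass))
    tp = sum(g == 1 and i == 1 for g, i in pairs)
    fp = sum(g == 1 and i == 0 for g, i in pairs)
    fn = sum(g == 0 and i == 1 for g, i in pairs)
    tn = sum(g == 0 and i == 0 for g, i in pairs)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "fnr": fn / (tp + fn) if tp + fn else 0.0,  # false fails among instructor passes
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # false passes among instructor fails
    }
```
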
In
Figure 7, pass-point rates are compared between the instructor and Grader across model × source groups, with 95% Wilson confidence intervals for each proportion. The discrepancies are not uniform: the largest gaps appear in the PIPELINE condition, where instructor pass rates remain high while Grader pass rates are much lower, and the confidence intervals remain well separated, indicating a substantive disagreement rather than sampling variability.
Overall, these results indicate that the Grader approximates Instructor grading well on HUMAN items but diverges substantially on PIPELINE, both in numeric scores and, more critically, in pass/fail decisions, where the Grader exhibits a strong false-fail tendency.
3.4. Diagnosing Instructor–Automatic Grader Discrepancies by Model and Question Source
This analysis pinpoints where disagreements between the course instructor and the automatic Grader concentrate by stratifying results across the four model × source groups (HUMAN vs. PIPELINE; Llama3 vs. Mistral). We quantify discrepancies at two levels: score-level differences on the 1–10 scale, using Δscore = (Grader − Instructor), and decision-level differences on the binary exam_point outcome. For score-level comparisons, we report both magnitude (MAE, RMSE) and direction (mean bias), together with the frequency of exact matches (Δscore = 0) and large mismatches (|Δscore| ≥ 3). For pass/fail comparisons, we focus on FNR (false-fail rate among instructor passes) and FPR (false-pass rate among instructor fails), because they directly characterize whether the grader is conservative (high FNR) or permissive (high FPR).
Discrepancy metrics are reported in
Table 9. Agreement is tight in the HUMAN setting for both models: exact score matches occur for 82.5% of HUMAN/Llama3 and 76.2% of HUMAN/Mistral, and large mismatches remain rare (11.2% and 8.8%, respectively). The direction of the disagreement differs slightly by model within HUMAN: HUMAN/Llama3 shows mild under-scoring (bias −0.750), whereas HUMAN/Mistral shows mild over-scoring (bias +0.500). In contrast, PIPELINE introduces large and systematic shifts that depend strongly on the model. For PIPELINE/Llama3, the grader tends to over-score relative to the instructor (bias +1.886) and large mismatches become common (42.9%). For PIPELINE/Mistral, the grader shows strong under-scoring (bias −3.557) and very frequent large mismatches (82.9%). Overall,
Table 9 indicates that discrepancies are not uniformly larger on PIPELINE; they become directional and model-specific.
The distribution of score differences is shown in Figure 8. The two HUMAN groups remain tightly centered around Δscore = 0, and disagreements appear primarily as isolated outliers. In contrast, the PIPELINE distributions separate clearly: PIPELINE/Llama3 shifts upward (grader > instructor), while PIPELINE/Mistral shows a pronounced downward shift (grader < instructor), consistent with frequent large mismatches and pervasive under-scoring.
Decision-level discrepancies are summarized in
Figure 9, which reports FNR and FPR with 95% Wilson confidence intervals. Across all four groups, disagreement is dominated by false fails rather than false passes: FPR remains near zero or small (0.000–0.065), whereas FNR varies substantially across conditions. The largest discrepancy occurs for PIPELINE/Mistral, where the grader is highly conservative (FNR = 0.907, FPR = 0.000), indicating that instructor passes are rarely recovered by the grader. PIPELINE/Llama3 also shows elevated conservatism (FNR = 0.400) but to a lesser extent, while the HUMAN groups remain comparatively well calibrated, particularly HUMAN/Mistral (FNR = 0.056).
The underlying confusion counts and derived classification metrics are reported in
Table 10; PIPELINE/Mistral exhibits an extreme conservative profile relative to the instructor (ground truth): the LLM grader produces no false passes (FP = 0; Precision = 1.00) but misses most instructor passes (TP = 5 vs. FN = 49), resulting in very low recall (0.093) and low F1 (0.169). PIPELINE/Llama3 is more balanced (TP = 6, FN = 4, FP = 2), yielding moderate recall (0.600) with a low false-pass rate (FPR = 0.033). The HUMAN groups show high overall accuracy (0.925–0.938) with stronger recall.
Overall, discrepancy patterns are stable on HUMAN items but become strongly source-dependent and model-dependent on PIPELINE. The PIPELINE/Mistral condition concentrates most of the disagreement through an extreme false-fail tendency, whereas PIPELINE/Llama3 shows a higher false-pass tendency (FPR = 0.033) with more moderate pass/fail discrepancies. This localized view clarifies why aggregate agreement statistics can mask source- and model-specific effects and motivates reporting instructor–grader alignment separately by source and model.
3.5. Stability and Bias of Automatic Grading Across Experimental Conditions
Automatic grading stability is examined across experimental conditions by comparing instructor scores with the automatic grader at both the score level and the pass/fail decision level. Tolerance-invariant agreement and directional bias in 1–10 scores are quantified across model × source groups, and the exam-relevant impact at the instructor’s pass threshold is assessed by focusing on the severity of decision errors, including credit denial for instructor-perfect (score = 10) answers.
3.5.1. Grader–Instructor Score Agreement Across Experimental Conditions
We first assess grader–instructor agreement at the score level on the full 1–10 scale, independently of any pass threshold. This analysis is designed to capture whether the automatic grading procedure assigns scores that are close to the instructor's scores in magnitude, and whether any remaining disagreement exhibits a consistent direction (systematic under- or over-scoring). For each response $i$, we compute the signed discrepancy
$$\Delta_i = s_i^{\mathrm{grader}} - s_i^{\mathrm{instructor}}.$$
From these item-wise discrepancies, we report two complementary summaries within each model × source group (HUMAN/PIPELINE × Llama3/Mistral). The first is the mean absolute error (MAE),
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \Delta_i \rvert,$$
which measures the typical absolute deviation in score points regardless of direction. The second is the signed bias,
$$\mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} \Delta_i,$$
which indicates whether the grader systematically assigns lower or higher scores than the instructor on average. Under this definition, negative bias implies under-scoring by the grader (instructor scores tend to be higher), whereas positive bias implies over-scoring (grader scores tend to be higher). To quantify uncertainty in the magnitude of disagreement, we additionally report 95% bootstrap confidence intervals for MAE by resampling items within each group. In the present setup, the average panel size equals 1 because a single automatic grader score is used per item.
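The bootstrap CI for MAE described above can be sketched as a simple percentile bootstrap over the item-wise discrepancies (the resampling size and seed handling are our choices, not the paper's exact configuration):

```python
import random

def bootstrap_mae_ci(diffs, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the MAE, resampling items within a group."""
    rng = random.Random(seed)
    abs_d = [abs(d) for d in diffs]
    n = len(abs_d)
    maes = sorted(sum(rng.choice(abs_d) for _ in range(n)) / n
                  for _ in range(n_boot))
    lo = maes[int((alpha / 2) * n_boot)]
    hi = maes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```
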
Table 11 reports these tolerance-invariant agreement statistics by group. Two patterns stand out. First, score-level agreement is consistently tight in the HUMAN setting: MAE is approximately 0.8 for both models (0.825 for HUMAN/Llama3 and 0.800 for HUMAN/Mistral), indicating that the grader’s score typically stays within roughly one point of the instructor’s score on average. Second, agreement degrades sharply in the PIPELINE setting, where the typical discrepancy becomes multi-point: MAE rises to 2.743 for PIPELINE/Llama3 and to 4.871 for PIPELINE/Mistral, showing that large score gaps become common, particularly for PIPELINE/Mistral. The bias values further indicate that disagreement is not purely symmetric noise but also directional and model-dependent: PIPELINE/Mistral exhibits a large negative bias (−3.557), consistent with systematic under-scoring by the grader relative to the instructor, whereas PIPELINE/Llama3 exhibits a positive bias (+1.886), consistent with systematic over-scoring. Even within HUMAN, smaller directional tendencies are visible (negative for Llama3: −0.750 and positive for Mistral: +0.500), but these are modest compared with the PIPELINE shifts.
Overall, score-level agreement is stable and close in the HUMAN setting, but substantially weaker in the PIPELINE setting, with strong directional shifts in the latter. This motivates examining whether these score differences translate into pass/fail mismatches at the instructor pass threshold.
3.5.2. Decision-Level Severity at the Pass Threshold
Instructor-facing decision errors at the operational pass threshold (pass if score ≥ 9) can vary widely in severity, so
Table 12 summarizes a severity decomposition focused on instructor-perfect cases (instructor score = 10) by measuring how often the automatic grader converts these clear passes into a failing outcome (grader < 9), with counts and 95% Wilson confidence intervals. Perfect-pass denial is rare under HUMAN sources (HUMAN/Llama3: 1/21 = 0.048; HUMAN/Mistral: 1/18 = 0.056), indicating that clear passes are typically preserved. The PIPELINE sources show a sharp escalation: PIPELINE/Llama3 reaches 5/10 = 0.500 (95% CI: 0.237–0.763), while PIPELINE/Mistral rises to 50/54 = 0.926 (95% CI: 0.824–0.971), implying that most instructor-perfect answers are denied credit in that condition even after accounting for uncertainty.
A compact visualization of this “credit-denial for clear passes” effect is provided in
Figure 10, which plots the perfect-pass miss rate (instructor score = 10 but grader < 9) with 95% Wilson confidence intervals across the four source × model groups. The figure makes the asymmetry across conditions immediately visible: both HUMAN groups remain close to zero, PIPELINE/Llama3 shifts upward with substantial uncertainty due to smaller support, and PIPELINE/Mistral concentrates near the top of the scale with a comparatively tight interval, reflecting an operationally severe regime in which instructor-perfect answers are frequently converted into failing outcomes at the pass threshold.
Overall, this severity-focused view shows that decision-level errors are not confined to borderline threshold flips: under PIPELINE conditions—especially PIPELINE/Mistral—the grader frequently denies credit even for answers judged fully correct by the instructor, which constitutes a direct, exam-relevant failure mode at the operational pass criterion.
3.5.3. Sensitivity to Alternative Pass Thresholds
Because pass policies differ across institutions, we conducted a sensitivity analysis by recomputing the pass/fail indicator at lower thresholds (t = 6 and t = 5), applying the same rule (score ≥ t) to both instructor and grader scores.
Table 13 reports decision-level error rates aggregated by source group (HUMAN vs. PIPELINE) across both evaluated student models. Lowering the threshold reduces false fails under PIPELINE, but it also increases false passes, making the policy-dependent trade-off explicit.
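The threshold sensitivity analysis amounts to re-deriving pass decisions from the raw scores at each candidate threshold; a minimal sketch (function and key names are ours):

```python
def error_rates_at_threshold(grader_scores, instructor_scores, t: int):
    """False-fail and false-pass rates when pass = (score >= t) for both parties."""
    pairs = list(zip(grader_scores, instructor_scores))
    inst_pass = [g for g, i in pairs if i >= t]   # instructor passes
    inst_fail = [g for g, i in pairs if i < t]    # instructor fails
    fnr = sum(g < t for g in inst_pass) / len(inst_pass) if inst_pass else 0.0
    fpr = sum(g >= t for g in inst_fail) / len(inst_fail) if inst_fail else 0.0
    return {"threshold": t, "false_fail_rate": fnr, "false_pass_rate": fpr}
```

Sweeping t over candidate policies makes the false-fail versus false-pass trade-off explicit on the same graded data.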
4. Discussion
The results indicate that automatic grading performance must be interpreted at two levels: agreement on the 1–10 score scale and the downstream consequences of thresholded decisions. While score-level discrepancies can be modest on average, small systematic biases can translate into large exam-relevant risks at the instructor’s pass criterion, especially under condition shifts between HUMAN and PIPELINE sources. These patterns suggest that the validation of LLM-based graders should prioritize decision-level error asymmetry and severity—not only aggregate score agreement—when assessing suitability for high-stakes educational use.
4.1. Interpreting Agreement: Score-Level Similarity vs. Threshold Risk
Automatic grading performance in educational settings is not captured by a single notion of “agreement,” because the same score-level discrepancy can have radically different operational consequences depending on how grades are consumed. When scores are used as continuous signals (e.g., for formative feedback), moderate deviations may be tolerable if rankings and broad performance bands are preserved. In contrast, when scores are mapped into discrete outcomes—most importantly, pass/fail—small systematic deviations can become high-impact, because a fixed threshold compresses all outcomes into a binary decision and concentrates the cost of error near the boundary.
This thresholding effect creates an intrinsic coupling between calibration and fairness. A grader that is slightly conservative on the 1–10 scale may still appear stable under score-level metrics, yet it can deny credit disproportionately often when the pass criterion is strict. In such cases, overall agreement statistics can mask a harmful asymmetry: low false-pass tendencies may coexist with high false-fail tendencies. Operationally, these two error types are not interchangeable. False passes can be treated as a standards risk, while false fails directly produce credit denial for legitimate passes; in many instructional contexts, the latter carries a higher fairness and trust cost.
Decision-level metrics therefore serve a distinct purpose from score-level agreement measures. They expose whether the system’s errors are balanced or skewed, and whether its performance depends on the base rate of passes/fails in the evaluated pool. Accuracy alone can be misleading under imbalance or conservative decision policies: a model can achieve acceptable accuracy by predicting “fail” frequently if many cases are indeed fails, while still producing an unacceptable rate of false fails among instructor passes. For exam deployment, the relevant question is not only “How close are the scores?” but “How often does the system flip the instructor’s decision at the pass boundary, and in which direction?”
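To make the two evaluation levels concrete, the following sketch computes score-level agreement and thresholded decision metrics side by side. The scores are hypothetical; only the pass criterion (pass if score ≥ 9) is taken from the setting described above. The example grader applies a constant one-point conservative bias, so its mean absolute error looks modest while its false-fail rate is severe:

```python
PASS = 9  # pass if score >= 9, the instructor's criterion used in this study

def decision_metrics(instructor, grader, threshold=PASS):
    """Contrast score-level agreement with thresholded decision agreement."""
    n = len(instructor)
    mae = sum(abs(i - g) for i, g in zip(instructor, grader)) / n
    inst_pass = [i >= threshold for i in instructor]
    grad_pass = [g >= threshold for g in grader]
    accuracy = sum(a == b for a, b in zip(inst_pass, grad_pass)) / n
    passes = sum(inst_pass)
    fails = n - passes
    false_fail = sum(a and not b for a, b in zip(inst_pass, grad_pass))
    false_pass = sum((not a) and b for a, b in zip(inst_pass, grad_pass))
    return {
        "mae": mae,
        "accuracy": accuracy,
        # denied legitimate passes, as a fraction of instructor passes
        "false_fail_rate": false_fail / passes if passes else 0.0,
        # undeserved passes granted, as a fraction of instructor fails
        "false_pass_rate": false_pass / fails if fails else 0.0,
    }

# Hypothetical paired scores on the 1-10 scale; the grader is uniformly
# one point more conservative than the instructor.
instructor = [10, 9, 9, 10, 8, 7, 9, 10]
grader = [s - 1 for s in instructor]
m = decision_metrics(instructor, grader)
# MAE is only 1.0, yet half of the instructor's passes become false fails,
# because every pass scored exactly 9 is pushed below the cutoff.
```

The design choice is deliberate: both views are computed from the same paired scores, so any divergence between them is purely an effect of thresholding, not of the data.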
Beyond frequency, the severity of decision errors matters. A threshold flip can arise from genuinely borderline cases (e.g., instructor score at the threshold) or from cases that are unambiguously correct by the instructor’s judgment. These two failure modes have different interpretations and remediation strategies. Borderline flips may reflect inevitable subjectivity or noise around the cutoff, whereas denial of instructor-perfect answers indicates a deeper calibration mismatch or a systematic scoring bias. Separating these regimes clarifies whether conservative grading is merely “cautious near the boundary” or whether it extends to clearly correct work and thus represents a more serious deployment risk.
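The severity decomposition above can be sketched as a simple stratification of false fails. The strata below are illustrative choices on the 1–10 scale: "borderline" means the instructor score sits exactly at the threshold, and "severe" means the grader fails an instructor-perfect answer:

```python
PASS = 9  # pass if score >= 9

def stratify_false_fails(instructor, grader, threshold=PASS, perfect=10):
    """Split false fails into borderline flips, denials of instructor-perfect
    answers, and any remaining cases (non-empty only on finer score scales)."""
    strata = {"borderline": 0, "severe": 0, "other": 0}
    for i, g in zip(instructor, grader):
        if i >= threshold and g < threshold:  # a false fail
            if i == perfect:
                strata["severe"] += 1
            elif i == threshold:
                strata["borderline"] += 1
            else:
                strata["other"] += 1
    return strata

# Hypothetical scores: two denials of perfect answers, two borderline flips.
instructor = [10, 10, 9, 9, 9, 10]
grader     = [ 8, 10, 8, 9, 8,  7]
s = stratify_false_fails(instructor, grader)
```

Reporting the strata separately, rather than one aggregated false-fail count, is what distinguishes "cautious near the boundary" from the deeper calibration mismatch discussed above.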
Finally, this interpretation must be conditioned on experimental setting. The observed differences between HUMAN and PIPELINE sources indicate that stability is not only a property of the grader, but also of the response distribution it is exposed to. A grader that behaves acceptably under one source condition may become miscalibrated under another, turning a small score-level bias into a large decision-level impact at the pass threshold. For this reason, the discussion that follows emphasizes decision-level asymmetry and severity under condition shifts, because these features determine whether automatic grading is operationally reliable for high-stakes educational use.
4.2. Condition Effects: HUMAN vs. PIPELINE as a Robustness Stress Test
The contrast between HUMAN and PIPELINE conditions functions as a robustness stress test for automatic grading, because it probes whether the grader’s behavior is stable under changes in the distribution and presentation of answers. Even when the grading rubric and the instructor standard remain fixed, the observed performance shifts indicate that the automatic grader is sensitive to properties of the response pool that are not directly tied to content correctness alone. In practical terms, this means that “good performance” measured on one type of student response does not automatically transfer to another, and validation must explicitly cover the kinds of answers the system will face in deployment.
Several mechanisms can plausibly drive this condition dependence without requiring any change in the instructor’s underlying criteria. PIPELINE answers may differ from HUMAN answers in length, structure, phrasing, completeness signals, or stylistic markers that correlate imperfectly with correctness. A grader that relies on such cues—implicitly or explicitly—can become systematically conservative or systematically lenient when the cue distribution shifts. Importantly, this kind of shift can be difficult to detect if evaluation focuses only on average score-level agreement, because the same mean error can be operationally benign in one regime and harmful in another once a threshold is applied.
The decision-level results suggest that the main failure mode under PIPELINE is not an increase in false passes but a strong increase in false fails, consistent with a conservative grading policy under distribution shift. This pattern is especially concerning in exam settings, because it converts calibration drift into credit denial. Under such a regime, a grader may appear “safe” in the sense of rarely granting undeserved passes, yet it can simultaneously violate fairness expectations by rejecting legitimate passes at scale. The severity analysis sharpens this interpretation by showing that the shift is not confined to borderline cases: in the most affected condition, the grader denies credit even for answers that the instructor judges fully correct, indicating a misalignment that is unlikely to be explained by boundary noise alone.
From a methodological standpoint, these condition effects imply that the HUMAN/PIPELINE split should be treated as more than an incidental experimental detail. It operationalizes a realistic deployment concern: graders are often evaluated on curated or homogeneous datasets, but deployed on heterogeneous response distributions shaped by different generation processes, student writing styles, or tool-mediated workflows. A robust grader should therefore be assessed across such condition variations, and reported performance should make explicit whether observed errors are stable or shift-dependent. In this sense, HUMAN vs. PIPELINE provides an interpretable axis of stress that exposes whether the grader generalizes or whether it requires condition-specific calibration to remain reliable at the pass threshold.
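Disaggregating by condition is straightforward to operationalize. The sketch below (hypothetical records; the condition labels mirror the HUMAN/PIPELINE split) reports the false-fail rate per source condition, showing how an aggregate rate can hide a shift-dependent regime:

```python
from collections import defaultdict

PASS = 9  # pass if score >= 9

def false_fail_rate_by_condition(records, threshold=PASS):
    """Compute the false-fail rate separately per source condition.

    `records` holds (condition, instructor_score, grader_score) triples.
    """
    passes = defaultdict(int)
    false_fails = defaultdict(int)
    for cond, inst, grad in records:
        if inst >= threshold:
            passes[cond] += 1
            if grad < threshold:
                false_fails[cond] += 1
    return {c: false_fails[c] / passes[c] for c in passes}

# Illustrative data: the aggregate false-fail rate is 0.5, which conceals
# that one condition drives three of the four denied-credit cases.
records = [
    ("HUMAN", 10, 10), ("HUMAN", 9, 9), ("HUMAN", 9, 8), ("HUMAN", 10, 9),
    ("PIPELINE", 10, 8), ("PIPELINE", 9, 7), ("PIPELINE", 10, 10),
    ("PIPELINE", 9, 8),
]
rates = false_fail_rate_by_condition(records)
```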
4.3. Conservative Grading and Error Asymmetry at the Pass Threshold
A central theme emerging from the decision-level analyses is the asymmetry between false passes and false fails at the instructor’s pass criterion. In many educational settings, conservative grading can appear attractive because it prioritizes avoiding false passes—i.e., it reduces the chance that an insufficient answer is incorrectly awarded credit. However, when the pass threshold is operational (pass if score ≥ 9), conservative bias does not simply “play it safe”; it reallocates error mass toward false fails, which directly translates into credit denial for legitimate passes. This creates a fairness–risk trade-off that cannot be evaluated using accuracy alone and must instead be examined through directional error rates.
Error asymmetry matters because the two decision errors have different consequences and different tolerances. A small false-pass rate may be acceptable from a standards perspective, but a large false-fail rate undermines trust and can be difficult to justify to students and instructors, particularly when it affects clearly correct responses rather than borderline cases. In a thresholded grading regime, even a modest downward bias on the 1–10 scale can produce a disproportionate number of false fails if many true passes cluster near the cutoff, or if the grader’s calibration drifts under condition shifts. This explains why score-level agreement can coexist with strong decision-level harm: the same systematic bias that looks small when averaged across the entire score range becomes consequential when evaluated at a single boundary.
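The clustering argument can be illustrated with two hypothetical pools of true passes: one clustered just above the cutoff and one spread toward the top of the scale. The same 0.5-point downward bias flips most of the clustered passes and none of the spread ones:

```python
PASS = 9  # pass if score >= 9

def flipped_fraction(true_scores, bias, threshold=PASS):
    """Fraction of legitimate passes flipped to fail by a constant
    downward grading bias."""
    passes = [s for s in true_scores if s >= threshold]
    flipped = sum(1 for s in passes if s - bias < threshold)
    return flipped / len(passes)

# Hypothetical score pools: identical size, identical bias, very
# different decision-level damage.
clustered = [9.0, 9.4, 10.0, 9.2, 9.8, 9.3]  # passes hug the cutoff
spread    = [9.6, 9.9, 10.0, 9.7, 9.8, 9.5]  # passes sit well above it
# flipped_fraction(clustered, 0.5) is 4/6; flipped_fraction(spread, 0.5) is 0.
```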
The severity perspective reinforces that not all false fails are equivalent. Borderline flips (e.g., an instructor score at the threshold) can often be interpreted as noise or subjectivity around a hard cutoff, and mitigation may focus on clarifying rubrics or introducing an appeal/review process for near-threshold cases. By contrast, false fails that occur for instructor-perfect answers indicate a deeper misalignment: the grader is not merely uncertain near the boundary, but is applying criteria that systematically under-recognize correctness in certain conditions. This is operationally the worst case because it cannot be addressed by small threshold adjustments alone without risking a surge in false passes; it instead suggests that the grader’s scoring function (or its calibration) changes qualitatively across conditions.
These observations imply that deploying automatic grading at a strict pass threshold requires explicit safeguards against asymmetric harm. At minimum, evaluation should report both false-pass and false-fail behavior and treat them as separate objectives rather than collapsing them into a single aggregate score. More importantly, when conservative behavior is observed—especially under distribution shift—its acceptability should be judged in terms of the severity of the denied-credit cases. In exam settings, a conservative grader that rarely grants undeserved passes but frequently rejects legitimate passes may be unsuitable without additional calibration, human-in-the-loop review for high-severity cases, or policy constraints that cap false-fail rates.
4.4. Methodological Implications for LLM-Based Educational Assessment
The results motivate several methodological implications for how LLM-based graders should be evaluated and reported in educational contexts. First, score-level agreement and decision-level reliability should be treated as complementary, not interchangeable, evaluation targets. Reporting only continuous-score metrics (e.g., average error, correlation) can obscure operational failure modes that emerge after thresholding, while reporting only pass/fail metrics can hide systematic score biases that may matter for ranking, feedback, or grade boundaries beyond a single cutoff. A complete evaluation therefore requires both views, with explicit attention to how conclusions change when moving from the 1–10 scale to thresholded outcomes.
Second, decision-level evaluation should go beyond overall accuracy and include directional error rates and their uncertainty. Accuracy can be dominated by class prevalence and may remain high even when a grader adopts a strongly conservative policy that rejects many legitimate passes. Directional rates—false-pass and false-fail behavior—directly quantify the relevant risks, while confidence intervals prevent over-interpreting small-sample fluctuations and make condition comparisons more defensible. In practice, this means that evaluation reports should treat the false-pass rate (FPR) and the false-fail rate (FNR) as first-class metrics, rather than secondary diagnostics.
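One standard way to attach uncertainty to a directional error rate is the Wilson score interval, which behaves sensibly at the small sample sizes typical of per-condition strata. The counts below are hypothetical:

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial rate, e.g., the false-fail
    rate among instructor passes (k errors out of n relevant cases)."""
    if n == 0:
        return (0.0, 1.0)  # no support: the rate is unconstrained
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# Hypothetical stratum: 6 false fails among 20 instructor passes.
lo, hi = wilson_interval(6, 20)
# The 95% interval spans roughly 0.15 to 0.52 at n = 20, which is a
# direct warning against over-interpreting a single-point FNR estimate.
```

The same computation also illustrates the uneven-support caveat in the limitations: strata with fewer instructor-perfect cases yield proportionally wider intervals.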
Third, severity-aware analysis should be incorporated when automatic grading is intended for high-stakes use. Aggregated false-fail rates alone do not distinguish between benign boundary noise and severe credit denial for clearly correct work. Decompositions that isolate high-severity strata—such as instructor-perfect answers or high-confidence rubric satisfaction—reveal whether errors cluster at the threshold or extend to unequivocal cases. This distinction matters for both interpretation and mitigation: borderline flips may be addressed through review policies near the cutoff, while severe misses indicate the need for recalibration, rubric alignment, or changes to the grader’s reasoning prompts and scoring constraints.
Fourth, robustness to condition shifts should be made explicit in evaluation design. The HUMAN/PIPELINE contrast illustrates that automatic graders can change behavior under shifts in response distribution even when the instructional target remains the same. Consequently, benchmarks that evaluate graders on a single homogeneous answer pool may overestimate deployment reliability. Methodologically, this suggests that educational grading evaluations should include deliberate stress tests that vary response source, style, and structure, and should report performance disaggregated by these factors to avoid hiding worst-case regimes in aggregate averages.
Finally, these findings support a shift in how “grading quality” is operationalized for LLM-based assessment: suitability should be judged not only by average agreement but by risk profiles under the intended decision policy. When the grading output feeds into a strict pass criterion, evaluation should prioritize threshold calibration, directional decision errors, and severity—because these are the dimensions that determine whether an automatic grader behaves as a trustworthy component of an exam pipeline.
4.5. Limitations
Several limitations should be considered when interpreting these findings. First, the automatic grading configuration effectively corresponds to a single-grader setting (panel size = 1), which means that variability due to grader stochasticity or inter-grader disagreement is not averaged out. While this reflects a realistic low-cost deployment scenario, it may overstate instability relative to designs that aggregate multiple independent grading judgments.
Second, the decision analysis is tied to a specific operational policy: a fixed pass threshold (pass if score ≥ 9) applied identically to instructor and automatic scores. This choice is appropriate for the targeted exam setting, but different courses or grading schemes may use alternative cutoffs, multi-level grade bands, or curved policies. As a result, the reported decision-level risks—especially near-threshold behavior—should be interpreted as policy-conditional rather than as universal properties of the grader.
Third, the number of instructor-perfect cases varies substantially across model × source groups, leading to uneven statistical support for the severity estimates. In particular, some conditions contain relatively few instructor-perfect observations, which produces wider confidence intervals and limits the precision of cross-condition comparisons. Conversely, conditions with larger support yield tighter intervals and therefore more reliable estimates of systematic effects.
Fourth, the analysis is restricted to the experimental conditions and answer pools used in this study (a single course (IA2) with a single instructor reference, HUMAN vs. PIPELINE, and the examined model groups). These conditions capture meaningful distribution shifts, but they do not exhaust the diversity encountered in real educational deployment, where student writing quality, prompt phrasing, language proficiency, and use of external tools can introduce additional sources of variation. Generalization beyond the studied setting should therefore be made cautiously.
Finally, the evaluation compares automatic grading directly to a single instructor reference. The instructor provides a practical and relevant ground truth for the target course, and the technical, reference-solution-anchored nature of the items together with the explicit instructor rubric constrains the grading degrees of freedom; however, these factors do not eliminate potential instructor-specific strictness or leniency. Instructor grading itself may contain subjectivity, and alternative reference designs (e.g., multi-instructor panels or adjudication) could change the estimated disagreement rates. These limitations do not negate the observed patterns, but they bound the scope of the claims to the operational setup and reference standard used here.
These findings have direct implications for teaching practice in technical, reference-solution–anchored assessments. Instructors should treat LLM-based grading primarily as decision support rather than as an autonomous pass/fail gate, because threshold-sensitive miscalibration can translate into disproportionate consequences under strict pass policies. In addition, pipeline-generated questions should undergo brief pre-deployment validation focused on clarity, uniqueness of the intended solution, and the absence of unintended shortcuts that reduce discrimination. Operationally, the grader is most useful for triage and feedback: responses near the pass threshold or exhibiting apparent mismatch with the reference solution should be prioritized for instructor review. Finally, instructors should monitor severe false-fail regimes (e.g., credit denial) as a separate risk indicator alongside aggregate score agreement, and adjust prompts, review rules, or workflow safeguards when such regimes exceed an acceptable level.
4.6. Future Work
Several concrete directions follow from these results. A first priority is threshold-focused calibration aimed at reducing false-fail severity without introducing an unacceptable increase in false passes. This can include learning a simple monotone mapping from grader scores to instructor scores, estimating a condition-specific offset, or optimizing a decision threshold under explicit constraints on false-fail rates. Because the observed risks are policy-conditional, calibration should be evaluated directly under the intended grading policy rather than only through score-level error reduction.
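The constrained-threshold idea can be sketched as a one-dimensional search: among the grader-side cutoffs, pick the lowest one whose false-pass rate stays under an explicit cap, which by construction minimizes false fails subject to that constraint. All data and the 5% cap are hypothetical:

```python
PASS = 9  # instructor-side pass criterion: pass if score >= 9

def calibrate_threshold(instructor, grader, max_false_pass_rate=0.05,
                        inst_threshold=PASS):
    """Lowest grader-side cutoff whose false-pass rate is within the cap.

    Candidates are scanned in ascending order, so the first feasible
    cutoff is the one that denies credit least often.
    """
    inst_pass = [i >= inst_threshold for i in instructor]
    fails = sum(1 for p in inst_pass if not p)
    for t in sorted(set(grader)):
        fp = sum(1 for p, g in zip(inst_pass, grader) if (not p) and g >= t)
        fpr = fp / fails if fails else 0.0
        if fpr <= max_false_pass_rate:
            return t
    return None  # no feasible cutoff under the cap

# Hypothetical scores from a conservatively biased grader. At the raw
# cutoff of 9, four of the five instructor passes would be false fails;
# the calibrated cutoff of 7 removes them without admitting false passes.
instructor = [10, 9, 9, 8, 7, 10, 9, 8]
grader     = [ 8, 7, 8, 6, 5,  9, 8, 6]
t_star = calibrate_threshold(instructor, grader)
```

As the text notes, such a calibration must be validated under the intended decision policy, since the feasible cutoff is itself policy- and condition-dependent.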
Second, the mechanisms behind severe failures should be localized. For cases where the instructor assigns a perfect score but the grader produces a failing outcome, targeted error analysis can determine whether the discrepancy is driven by rubric misinterpretation, missing required elements, over-penalization of style or brevity, or sensitivity to answer format. This kind of diagnosis can then inform prompt design, rubric encoding, or structured scoring strategies that reduce reliance on superficial cues and improve alignment with instructor criteria.
Third, robustness evaluation should be expanded beyond the current HUMAN/PIPELINE split to cover additional, deployment-relevant shifts. Examples include varying answer length distributions, introducing paraphrase and formatting perturbations, mixing levels of student proficiency, or including answers produced under different tool-use regimes. The goal is to identify which shifts cause calibration drift and to validate mitigation strategies under worst-case conditions rather than only on average.
Fourth, the role of aggregation should be investigated. While single-judge grading is operationally appealing, small ensembles of independent grading runs or multi-agent panels may reduce variance and mitigate systematic conservatism, particularly near the pass boundary. Future work can quantify the cost–reliability trade-off of such aggregation strategies and test whether they reduce severe false fails on instructor-perfect responses.
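A minimal form of the aggregation idea is to take the median of several independent grading runs before thresholding, so a single conservative outlier run cannot flip a decision on its own. The run scores are hypothetical:

```python
import statistics

PASS = 9  # pass if score >= 9

def panel_decision(run_scores, threshold=PASS):
    """Aggregate independent grading runs by their median score, then
    threshold once; a simple variance-reduction sketch."""
    return statistics.median(run_scores) >= threshold

# With runs [9, 10, 7], the median is 9, so one conservative outlier
# run no longer denies credit; a single run scoring 7 would have.
```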
Finally, extending the decision analysis beyond a single threshold to multi-level grade bands would better match many real grading policies. Evaluating stability across multiple cut points (e.g., fail/pass, pass/excellent) and reporting severity-aware errors for each band would provide a more complete characterization of the grader’s operational risk profile and improve the transferability of the methodology to other courses and assessment settings.
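Extending the analysis to multiple cut points amounts to mapping scores into band indices and counting band flips rather than a single pass/fail flip. The second cut point below (an "excellent" band at 10) is a hypothetical illustration:

```python
PASS = 9
EXCELLENT = 10  # hypothetical second cut point for an 'excellent' band

def band(score, cuts=(PASS, EXCELLENT)):
    """Map a 1-10 score to a band index: 0=fail, 1=pass, 2=excellent."""
    return sum(score >= c for c in cuts)

def band_flip_rate(instructor, grader, cuts=(PASS, EXCELLENT)):
    """Fraction of items whose grade band differs between graders."""
    flips = sum(band(i, cuts) != band(g, cuts)
                for i, g in zip(instructor, grader))
    return flips / len(instructor)

# Hypothetical scores: three of five items land in a different band,
# even though only one of them crosses the pass/fail boundary.
instructor = [10, 9, 8, 10, 9]
grader     = [ 9, 9, 8,  8, 8]
r = band_flip_rate(instructor, grader)
```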
In addition, future work should expand the set of evaluated student models to include a broader range of LLMs, including strong proprietary systems (e.g., GPT-4, Claude, Gemini), to test the stability of the observed effects across model families and capability levels.