1. Introduction
Large language models (LLMs) are increasingly being integrated into higher-education assessment workflows, where they can generate questions, provide feedback and support automated scoring [1]. At the same time, recent work on automatic short-answer grading and AI-assisted evaluation of student work has shown both promising results and open challenges related to reliability, transparency and alignment with human graders [2]. These developments raise important questions about how well current LLMs perform on real university exam questions under human grading, how question source (instructor-authored versus pipeline-generated) affects their performance, and how to design exam items that remain difficult, or effectively LLM-resistant, for such models [3].
In this work, we use distributional shift to denote a systematic change in the distribution of exam artifacts induced by the generation workflow, operationalized here by the contrast between instructor-authored (HUMAN) and pipeline-generated (PIPELINE) questions and answers. This shift can modify surface properties such as length, structure, phrasing, and completeness signals, even when course content and the instructor’s target rubric remain fixed. We use sustainable issues to denote deployment-relevant failure modes that are systematic enough to persist under repeated use of the same workflow (rather than isolated outliers), such as condition-dependent grading conservatism that scales into credit denial under strict pass thresholds.
1.1. Background and Motivation
Written exams with short-answer questions remain a central component of summative assessment in many higher-education programs, especially in computing and applied informatics, where they are used to probe conceptual understanding, basic quantitative reasoning and simple programming skills in a time-constrained setting [1]. Such exams are typically constructed and refined by lecturers over multiple cohorts, with item difficulty and coverage adjusted informally based on experience, observed student performance and institutional grading norms [4]. This traditional calibration assumes that only human candidates take the exam and that solving the questions requires a combination of recall, reasoning and problem-solving skills that cannot be outsourced to external systems.
The rapid adoption of large language models (LLMs) in educational technology challenges this assumption. Recent systematic reviews document a growing number of LLM-powered tools that support automatic assessment, short-answer scoring and feedback generation in higher education, moving beyond small pilots towards deployment in real courses [1,5]. Case studies in programming education report LLM-based assistants that help instructors review and grade code and open-ended responses, effectively embedding general-purpose models into the assessment workflow [6,7]. At the same time, empirical studies of LLM- and machine-learning-based scoring of scientific explanations show that current models can match or approach human performance on some explanation and reasoning tasks, while still displaying systematic weaknesses on others [3]. In practice, this means that students can increasingly use external LLMs to obtain plausible answers to exam-style questions during preparation or unsupervised assessments, potentially altering the effective difficulty and discriminatory power of existing item banks.
A parallel line of work investigates LLMs as automatic graders or evaluators for short-answer questions, comparing their scores and feedback with those assigned by human markers in domains such as computer science, physics and health education [2,3,7]. These studies often report moderate to high agreement at an aggregate level, but they also reveal discrepancies for individual items, sensitivity to prompt design and rubric specification, and context-dependent failures, especially in high-stakes settings [8,9]. Broader surveys on evaluation methodologies for natural language generation and LLMs argue that automatic metrics and model-based judges can be biased or unstable, and recommend grounding evaluation in carefully designed human judgement protocols wherever possible [10,11]. In educational contexts, research on explainable AI and teacher-facing AI tools similarly shows that trust and acceptance depend on transparent, domain-specific explanations that make automated recommendations understandable and controllable for instructors [12,13]. Together, these findings suggest that LLM-based graders are best viewed as auxiliary tools or baselines; we therefore treat ExamQ-Gen as an instructor-in-the-loop workflow: the system supports exam authoring and produces grading recommendations, and the instructor assigns the final grade and pass/fail decision.
Beyond the question of whether LLMs can generate or grade exam answers at a useful level of quality, recent work on LLM evaluation highlights that model performance can vary substantially across domains, item types and experimental setups, and that headline accuracy figures may conceal pockets of systematic weakness [11,14]. From an assessment perspective, this implies that some questions in an exam may be trivially easy for contemporary models, whereas others may remain consistently difficult even for relatively capable LLMs, depending on how they combine conceptual knowledge, numerical computation and reasoning about problem setups. For lecturers who continue to rely on written exams, it becomes important not only to know how well a given model performs overall, but also which kinds of questions it tends to fail under human grading and how large the subset of such questions is within different sources of items.
These considerations motivate a course-level, human-centered analysis in which large language models are treated explicitly as virtual students taking a real exam. In this study, we focus on exam-style questions derived from a first-year applied informatics course and use this course as an empirically grounded case study for examining LLM behavior under instructor grading, while avoiding course-level generalization beyond the studied setting. Our goals are to measure how two locally deployed instruction-tuned models perform when answering short free-text versions of university exam questions that are graded by an expert instructor, to compare their behavior on instructor-authored versus pipeline-generated items, and to identify questions that remain systematically difficult for both models. We refer to such consistently challenging items, which fail to receive a passing human exam decision from any of the LLM “students”, as LLM-resistant questions and treat their prevalence as a practically relevant indicator of how robust an exam question set is to contemporary language models.
1.2. Related Work and Research Gap
To avoid conflating adjacent research threads, we organize the related work around three themes that motivate our evaluation setting: prior studies that treat LLMs as exam takers, LLM-based grading and “LLM-as-judge” approaches, and automatic question generation pipelines. We then clarify how our work differs from each theme by focusing on an instructor-in-the-loop, course-level evaluation under an operational pass/fail policy, with an explicit HUMAN versus PIPELINE comparison and severity-aware decision analysis.
Large language models have increasingly been investigated as automatic graders for short textual answers in higher education. Schneider et al. [15] evaluated LLM-based autograding across multiple courses and languages and showed that model scores could approximate instructor grades but still exhibited notable item-level disagreements and sensitivity to prompt design and scoring categories. Emirtekin [1] provided a systematic review of LLM-powered automated assessment covering 49 studies and concluded that such systems could substantially reduce grading workload, yet important concerns about validity, fairness and transparency remained, especially in high-stakes settings. Duong et al. [16] similarly reported that out-of-the-box LLMs were not yet ready to replace human examiners and should be used as decision-support tools whose recommendations are verified by instructors.
In programming and STEM education, LLM-based autograding was explored specifically for code and technical assignments. Cisneros-González et al. [6] introduced JorGPT, an instructor-aided grading system that integrated several LLMs to assess programming assignments in an undergraduate course and found that LLM-proposed grades and feedback could be incorporated into the grading workflow, but systematic instructor review was still required before assigning final marks. Jukiewicz [17] compared multiple LLMs for automated assessment of programming assignments and observed substantial variation between architectures and vendors, highlighting the need for careful model selection and calibration when deploying LLM-based graders in authentic courses. Together, these studies demonstrated that LLMs could support large-scale grading of open-ended student work while still leaving a non-trivial gap between model and human grading.
Another line of research treated LLMs explicitly as exam takers. Ros-Arlanzón et al. [18] evaluated several general-purpose LLMs on end-of-course multiple-choice exams in undergraduate medical education and reported that some models reached or exceeded median student performance, while showing considerable variation across courses and exam configurations. Gaggioli et al. [19] analyzed the reliability and validity of LLM-based assessment and argued that high performance on multiple-choice questions did not necessarily imply robust conceptual understanding, since models could exploit superficial regularities and benchmark artefacts. Taken together, these results suggested that headline accuracy on static multiple-choice question (MCQ) banks might overestimate LLM capabilities in more realistic exam scenarios that require open-ended reasoning and detailed explanations.
LLMs were also used as tools for exam question generation and related question-centric tasks. Nikolovski et al. [20] presented a comparative study of LLM-based agents for exam question generation, improvement and evaluation from higher-education course materials and showed that orchestrated LLM agents could support large-scale exam design, although expert filtering was still needed to ensure alignment with learning objectives and difficulty expectations. Scaria et al. [21] investigated automated educational question generation at different Bloom’s skill levels and found that modern LLMs produced linguistically correct and pedagogically relevant questions across multiple cognitive levels, but that quality and control over difficulty varied markedly between models and domains. Al Faraby et al. [22] analyzed the use of ChatGPT-3.5 for educational question classification and generation and concluded that, although it covered a broad range of categories, fine-grained control over domain specificity, depth and difficulty remained challenging in practice. Most of these studies assessed generated questions in terms of face validity, topical coverage and perceived usefulness, rather than in terms of how difficult they were for LLM exam takers under human grading.
In parallel, LLMs were increasingly deployed as evaluation agents or “LLM-as-a-judge”. Hashemi et al. [23] proposed the LLM-RUBRIC framework, a multidimensional, calibrated evaluation scheme in which an LLM was queried along several rubric dimensions and a calibration model was trained to better match the distribution of human scores, demonstrating that LLM-based judges were highly sensitive to rubric wording and prompt specification. Liang et al. [24] introduced the HELM framework for holistic evaluation of language models and argued that evaluation was inherently scenario-dependent and multi-metric, warning that over-reliance on any single automatic metric or LLM judge could introduce systematic bias and instability in reported performance. These findings supported the view that LLM-based evaluators should be embedded into human-controlled evaluation pipelines and used primarily as auxiliary tools.
Existing work addresses key components of our setting in isolation. Studies that compare LLM-generated assessment items against faculty-authored questions report measurable differences in question quality and item properties, and consistently treat instructor review as necessary before classroom or exam use [25,26]. Complementarily, research on question generation from instructional materials has explored deriving multiple-choice items directly from course artifacts such as lecture or video transcripts, followed by explicit quality assessment of the generated questions [27]. Finally, work on AI-assisted grading investigates how automated signals can support human grading decisions and shows that model-derived cues (e.g., attention and confidence) may not align perfectly with human judgment, motivating human oversight in high-stakes assessment contexts [28].
Despite these advances, several aspects that are central to our study remained underexplored. First, existing LLM-based autograding studies typically considered only instructor-authored questions or pre-existing short-answer datasets and did not jointly compare human-authored exam questions with questions generated automatically from the same official course script, both answered by LLMs and graded under a unified human exam scale [1,15,16,20]. Second, most evaluations of LLMs as exam takers relied on multiple-choice formats and automatic scoring, instead of using free-text answers that were systematically graded by course instructors across an entire written exam [18,19]. Third, while work on LLM-based judges and holistic evaluation documented bias, instability and persistent gaps between model and human grades, there was almost no quantitative evidence on questions that consistently defeated multiple instruction-tuned LLMs within a single course, nor on how such items could be operationalized and measured as LLM-resistant questions for future exam design [17,22,24].
This study was designed to address these gaps by combining instructor-authored and pipeline-generated exam questions from a real applied informatics course, treating two instruction-tuned student models (Llama3-8B-Instruct and Mistral-7B-Instruct) as virtual students, grading all their free-text answers on a 1–10 exam scale under expert human control, and defining LLM-resistant items as those that received a failing human exam decision across all student models. This course-level, human-aligned perspective complemented prior work on LLM-based assessment and evaluation frameworks by focusing specifically on the interaction between question source, LLM behavior and expert grading in a realistic written exam setting.
This study connects three lines of work that are often treated separately: evaluating large language models as exam takers, using LLMs as grading assistants, and generating exam questions automatically. The manuscript does not propose a new grading architecture and it does not introduce a new question generator. Instead, it contributes an instructor-in-the-loop, course-level evaluation that links these components under a realistic pass/fail policy. By contrasting instructor-authored items with pipeline-generated items and analyzing both score-level and policy-induced decision outcomes, the study isolates high-impact false-fail regimes such as credit denial, which can remain hidden under aggregate score agreement. We therefore formulate below the study design and contributions that address this gap.
1.3. Study Design and Contributions
This study was designed as a course-level analysis of large language models treated explicitly as virtual students taking a real university exam. ExamQ-Gen is designed for instructor-in-the-loop exam use: the question-generation pipeline supports drafting self-contained exam items grounded in course materials, while grading-related analyses are meant to inform instructor-controlled decision policies. Accordingly, all high-stakes outcomes (including pass/fail) remain instructor-authorized, and automatic grading is treated only as a secondary baseline. We focused on an introductory applied informatics course (IA2) in which written exams with short-answer items are the main summative assessment instrument. The exam content was organized into two complementary sets of questions. The HUMAN set consisted of instructor-authored items taken from past editions of the course exams, covering conceptual questions on introductory artificial intelligence, basic probability and statistics, simple machine-learning workflows and elementary Python programming. The PIPELINE set consisted of questions generated automatically from the official course script using a dedicated ExamQ-Gen pipeline, which extracted topic-specific fragments of the PDF, prompted a teacher LLM to propose short-answer exam items with reference solutions, and filtered malformed outputs. All questions were reformulated in a self-contained style that does not rely on external materials during testing.
Two instruction-tuned LLMs deployed locally were treated as student models. For each question in both the HUMAN and PIPELINE sets, the models received only the question text and produced a single short free-text answer, without access to the course script or to the teacher-model reference solutions. The models answered the entire exam in one pass, following the same ordering and time-agnostic constraints as a human student, and no chain-of-thought prompts or external tools were used. This setup was intended to approximate a realistic “LLM sits the exam” scenario in which models must respond concisely and directly to each item.
All model answers were graded by the course instructor, using the same 1–10 numeric scale and pass/fail policy as in the actual exam. For each question-model pair, the course instructor assigned a score between 1 and 10 and a categorical correctness label (correct/partial/incorrect). From these judgments, we derived a binary exam_point indicator that takes the value 1 when the instructor score is ≥9 (passing under the local IA2 exam policy) and 0 otherwise. In addition to this human grading, we also computed scores with an auxiliary automatic grader based on a smaller LLM, which received the question, the teacher-model reference solution and the student-model answer. However, these automatic scores were used only as a secondary baseline; all analyses in this paper are based on the human-assigned scores and exam_point values. This design reflects the intended deployment setting, where automated grading can be used for triage or feedback, but does not replace instructor judgment.
Within this framework, we defined LLM-resistant questions as exam items that remained unsolved by all student models under human grading. Concretely, a question is marked as LLM-resistant if every LLM student receives a failing human decision for that item, that is, an exam_point of 0 across all models. We then analyzed the prevalence and characteristics of such questions across topics, cognitive skills and question sources (HUMAN vs. PIPELINE). This allowed us to examine not only how well the models performed on average, but also which parts of the exam landscape remained systematically challenging for them under realistic grading conditions. From an instructor-in-the-loop perspective, these items provide a practical signal for exam design: they help identify concepts and competencies that the evaluated student models fail to demonstrate reliably, and they motivate question types that are less susceptible to generic, template-like answers. At the same time, we treat this notion as diagnostic rather than normative (LLM-resistant does not imply pedagogical superiority by itself) and interpret it alongside standard quality checks (clarity, coverage, and alignment with the intended learning outcomes).
The study addressed three guiding research questions:
RQ1: How do locally deployed instruction-tuned LLMs (Llama3-8B-Instruct and Mistral-7B-Instruct) perform as virtual students on a real written exam, when evaluated on a 1–10 scale by the course instructor?
RQ2: How does model performance differ between HUMAN exam questions authored by the course instructor and PIPELINE questions generated automatically from the official course script?
RQ3: Which questions remain consistently difficult for both evaluated student models under human grading, and how can these items be characterized as LLM-resistant in terms of topic, required skills and source?
Our main contributions can be summarized as follows:
Course-level evaluation framework. We designed and implemented a realistic exam setting in which locally deployed LLMs act as virtual students on short-answer questions from an actual university course, with answers graded by an expert lecturer on the operational 1–10 exam scale and an explicit pass/fail decision.
Dual-source exam question set. We constructed and analyzed a paired collection of HUMAN exam questions authored by the course instructor and PIPELINE questions generated automatically from the official course script, with all items reformulated into a self-contained exam format suitable for both human and LLM candidates.
Human-aligned grading and auxiliary automatic baseline. We combined expert human grading of all LLM answers with an auxiliary LLM-based grader used strictly as a decision-support baseline, and we treated the human scores and exam_point decisions as the sole ground truth for all subsequent analyses, thereby aligning evaluation with real exam practices rather than with purely automatic metrics.
Operational definition and analysis of LLM-resistant questions. We introduced a concrete, exam-centered definition of LLM-resistant questions as items that all student models fail under human grading, and we quantified their prevalence and characteristics across topics, required skills and question sources, providing actionable signals for future exam design in the presence of powerful language models.
2. Materials and Methods
This section describes the experimental setup used to study large language models on exam-style questions derived from a real university course. We first introduce the construction of two exam question datasets, consisting of instructor-authored items and questions generated automatically from the official course script. We then present the ExamQ-Gen pipeline for automatic question generation, followed by the protocol used to obtain LLM-based answers and human-aligned grades on a 1–10 scale. Finally, we define the metrics and aggregation procedures used to quantify model performance and to analyze which types of questions are more difficult for LLMs or effectively LLM-resistant.
2.1. Exam Question Datasets
Our experiments used exam-style questions derived from IA2, a first-year undergraduate course in a Computer and Information Technology bachelor’s program. IA2 covers introductory material on artificial intelligence, uncertainty and probabilities, simple machine-learning workflows and elementary Python programming. The official course script is provided as a single PDF and served as the authoritative content source for automatic question generation, while past written exams provided instructor-authored items.
We considered two families of exam questions. HUMAN questions were taken from historical IA2 exam sheets authored by the course instructor. These items cover conceptual aspects of artificial intelligence and typical learning tasks, numerical exercises based on odds and simple probability calculations, Naive Bayes spam filtering with small word-count tables, linear regression on receipt-style feature vectors, and short Python questions that require predicting the output or the data type of simple code fragments. Each multiple-choice item was converted into a free-text format by retaining the question stem and the correct solution, while discarding the answer options. The resulting HUMAN items were reformulated as short, self-contained prompts so that the student models received only the question text, without multiple-choice cues.
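The conversion of multiple-choice items into self-contained free-text prompts can be sketched as follows. The field names (`stem`, `options`, `correct_index`) are hypothetical and chosen for illustration only; the actual record layout used in the study may differ.

```python
def mcq_to_free_text(item: dict) -> dict:
    """Convert a multiple-choice record into a self-contained free-text item.

    The question stem and the correct solution are kept; the answer options
    are moved to a provenance-only field and never shown to student models
    or graders, mirroring the conversion described in the text.
    """
    return {
        "question": item["stem"],
        "reference_solution": item["options"][item["correct_index"]],
        "source": "HUMAN",
        # Original options retained for provenance, not exposed downstream.
        "mc_options": item["options"],
    }

# Hypothetical example item.
mcq = {
    "stem": "What data type does the expression 3 / 2 evaluate to in Python 3?",
    "options": ["int", "float", "str", "bool"],
    "correct_index": 1,
}
free_text = mcq_to_free_text(mcq)
```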
PIPELINE questions were generated automatically from the IA2 course script by our ExamQ-Gen pipeline. Starting from predefined page ranges corresponding to IA2 topics, the pipeline extracted text from the PDF, constructed prompts anchored in the script, and used a locally hosted instruction-tuned Llama-3.3-70B-Instruct [29] model to produce one question and a short reference answer per generation call. The generator was instructed to produce self-contained exam-style questions aligned with the same topical areas as the HUMAN items. As in the HUMAN set, the question text was designed to be solvable without access to the course script, with all necessary context stated explicitly in the prompt.
All questions were represented in a unified format as short, self-contained prompts with concise reference answers and were stored as two separate line-based JSONL files, one for HUMAN questions and one for PIPELINE questions, using a common schema with metadata indicating the question source, topic labels and other identifiers. For HUMAN items we also stored the original multiple-choice options in a dedicated field, although these options were not exposed to the student models or graders. An overview of the two datasets is given in Table 1.
Together, the HUMAN and PIPELINE datasets provide the empirical basis for all experiments reported in the remainder of this paper.
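To make the shared JSONL schema concrete, the following sketch round-trips one hypothetical record; the field names shown here are illustrative assumptions, not the exact schema used in the study.

```python
import json

# Hypothetical record in the common schema described above.
record = {
    "course": "IA2",
    "source": "PIPELINE",            # "HUMAN" or "PIPELINE"
    "topic": "naive_bayes",
    "question_id": "nb-07",
    "question": "A spam filter observes the word counts below ...",
    "reference_solution": "Applying Bayes' rule with the given counts ...",
}

# Each JSONL line is one independently parseable JSON object.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
```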
2.2. Automatic Question Generation Pipeline
While HUMAN questions were taken from past written exams, PIPELINE questions were produced automatically from the official course script by a Python pipeline that we refer to as ExamQ-Gen. The pipeline took as input the IA2 PDF, operated at the level of course topics and page ranges, and produced as output exam-style question–answer pairs in the same format as the HUMAN items.
In a first step, the pipeline used a PDF reader to extract plain text from a specified contiguous range of pages corresponding to a given topic. The extracted text was lightly cleaned and truncated to a fixed maximum length in characters in order to stay within the 128 k-token context window of the Llama-3.3-70B-Instruct generator, while preserving the main definitions, examples and numerical tables relevant to the topic. This fragment was treated as teacher-only background material: it was visible to the generator but not to the LLMs that later acted as students.
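The cleaning and truncation step can be sketched as a small pure function. The raw page text would come from a PDF reader (e.g. pypdf's `page.extract_text()`); the 20,000-character cap below is an assumed value, stated only to illustrate the fixed maximum length mentioned above.

```python
def clean_and_truncate(raw_text: str, max_chars: int = 20_000) -> str:
    """Lightly clean extracted PDF text and cap its length in characters.

    Whitespace runs and line breaks are collapsed to single spaces, then
    the fragment is truncated so it fits the generator's context window.
    The default cap is an illustrative assumption.
    """
    return " ".join(raw_text.split())[:max_chars]

# Simulated raw extraction output with ragged whitespace.
raw = "Naive  Bayes\n\nclassifiers assume   conditional independence.\n" * 1000
fragment = clean_and_truncate(raw, max_chars=120)
```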
In a second step, ExamQ-Gen built a chat-style prompt that combined the course fragment with instructions for producing exactly one exam-style question and one corresponding solution. The instructions required that both the question and the solution were written in the course language, were fully self-contained, and reused concrete elements from the fragment such as numerical values, feature names, dataset descriptions or code snippets. The model was also encouraged to favor medium-difficulty, multi-step reasoning tasks over purely definitional questions. The prompt was sent to a locally hosted instruction-tuned Llama-3.3-70B-Instruct model, which returned a single text block containing both the question and the solution.
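The prompt-assembly step can be sketched as below. The instruction wording is an illustrative paraphrase of the requirements described in the text, not the exact prompt used in the study, and the `QUESTION:`/`SOLUTION:` markers are assumed names for the explicit markers mentioned later.

```python
def build_generation_prompt(fragment: str, topic: str) -> list[dict]:
    """Assemble a chat-style prompt asking for one question-solution pair.

    Combines the teacher-only course fragment with instructions requiring a
    self-contained, medium-difficulty item that reuses concrete elements
    (numbers, names, code) from the fragment.
    """
    system = (
        "You are an exam author. Write exactly ONE self-contained, "
        "medium-difficulty, multi-step exam question and ONE reference "
        "solution in the course language. Reuse concrete numerical values, "
        "feature names, dataset descriptions or code snippets from the "
        "material. Prefix the parts with 'QUESTION:' and 'SOLUTION:'."
    )
    user = f"Topic: {topic}\n\nCourse material:\n{fragment}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_generation_prompt("Odds of 3:1 correspond to p = 0.75.", "odds")
```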
In a final step, the output text was parsed into separate question and solution fields using explicit markers requested in the prompt. The generation prompt also required an internal self-check ensuring that numeric values and specific entity names used in the solution are present in the question text. We applied a simple numeric consistency check: all numeric literals in the solution were extracted and required to also appear in the question text, with any violations flagged for later inspection. Each generated item was stored as one record in a line-based JSONL file, with fields indicating the course code, topic or chapter identifier, question index, question text, reference solution and the page range used to build the context. These JSONL files constituted the PIPELINE dataset used in the LLM-based answering and auxiliary automated grading stages described later in this section. In the experiments reported here, we used the resulting 70-item PIPELINE JSONL dataset as generated, without additional filtering or manual post-processing. The main components of the ExamQ-Gen pipeline are summarized in Figure 1.
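The numeric consistency check can be sketched as a small filter; the regular expression below is an illustrative assumption about how numeric literals are matched. Note that under this strict form of the check, derived values that appear only in the solution (e.g. a computed probability) are also flagged, which is why violations are inspected rather than discarded automatically.

```python
import re

# Matches integers and simple decimals with '.' or ',' separators (assumed pattern).
NUM_RE = re.compile(r"-?\d+(?:[.,]\d+)?")

def numeric_consistency(question: str, solution: str) -> list[str]:
    """Return numeric literals that occur in the solution but not in the question.

    An empty list means the item passes the check; a non-empty list is
    flagged for later inspection.
    """
    question_nums = set(NUM_RE.findall(question))
    return [n for n in NUM_RE.findall(solution) if n not in question_nums]

q = "A spam filter sees the word 'free' 12 times in 40 spam mails. Estimate P(free | spam)."
```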
Together, these steps defined the ExamQ-Gen pipeline and yielded the PIPELINE question set that we later used as input for the LLM-based answering and grading stages.
2.3. LLM-Based Answering and Human-Aligned Grading
For each question in the HUMAN and PIPELINE datasets, we simulated two students by querying two locally deployed instruction-tuned models, Llama3-8B-Instruct [30] and Mistral-7B-Instruct [31]. Each student model was instructed to answer each item in the same language as the question, using a short free-text style appropriate for a written exam. The prompt included only the question text and a brief instruction to produce a short free-text answer suitable for grading, without access to the course script, to multiple-choice options, or to any additional context. Each question was answered once by each student model, in separate runs. Student answers were generated via the Ollama chat API using temperature = 0.0, top_p = 1.0, max_tokens = 512, num_ctx = 8192, and seed = 42. The generated answers were appended to the corresponding record as additional fields, so that every item contained a question, a reference solution, and two model-generated answers produced by the Llama and Mistral models.
To obtain grades on the same 1–10 scale used in the local examination system, we used a second instruction-tuned model, Qwen2.5-7B-Instruct [32], as a fast decision-support automatic grader. For each item, the grader received the exam question, the reference solution (instructor-authored for HUMAN items or generated by the pipeline for PIPELINE items) and one of the model-generated answers (from Llama 3 or Mistral 7B). The grading procedure was applied separately to the answer of each student model. The grading instructions asked the model to compare the model-generated answer with the reference solution, to assign an integer score between 1 and 10 based on semantic correctness and completeness, and to select a correctness label in {correct, partial, incorrect}. The grader prompt included explicit rubric guidance (comparison against the reference solution, numerical/key-concept checks) and enforced a structured JSON output, and the grader was run deterministically (temperature = 0). From this label we derived a binary exam indicator, exam_point, set to 1 only when the label is correct and to 0 otherwise. The resulting grader outputs were stored in intermediate JSONL files and served as an automatic baseline and sanity check, but were not used as the primary evaluation signal in our analyses.
Final grades were provided by the course instructor. For each question–model pair, the instructor reviewed the exam question, the reference solution, and the model-generated answer, and recorded (i) an integer score on the same 1–10 scale as in the local examination system and (ii) a categorical correctness label in {correct, partial, incorrect}. In our graded dataset, these labels correspond to coarse score bands: incorrect → 1 or 3, partial → 6–8, and correct → 9–10. Following the operational pass policy used in the course, we derived the binary exam indicator exam_point as 1 if the instructor score was ≥9, and 0 otherwise. These instructor-assigned scores, labels, and exam_point values were treated as the ground truth for our analyses.
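The operational pass policy and the coarse label-to-score bands described above can be captured by two small helpers; the set representation of the bands is an illustrative assumption.

```python
def exam_point_from_score(score: int) -> int:
    """Apply the operational IA2 pass policy: pass (1) only when score >= 9."""
    return 1 if score >= 9 else 0

def band_is_consistent(label: str, score: int) -> bool:
    """Check an instructor record against the coarse label-to-score bands.

    incorrect -> {1, 3}, partial -> {6, 7, 8}, correct -> {9, 10},
    as observed in the graded dataset.
    """
    bands = {
        "incorrect": {1, 3},
        "partial": {6, 7, 8},
        "correct": {9, 10},
    }
    return score in bands.get(label, set())
```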
Table 2 provides representative graded examples from both instructor-authored (HUMAN) and pipeline-generated (PIPELINE) items, illustrating how the instructor rubric is applied across correct, partially correct, and incorrect answers.
The two grading stages returned their decisions in a compact, structured format that specified the numeric score, the correctness label and, in some cases, a short justification. All outputs were parsed and stored both in line-based JSONL/CSV files and in a Neo4j [
33] graph database, together with the original question, reference solution and corresponding model-generated answer, yielding a graded dataset with one record per question-model pair. Unless otherwise stated, all subsequent analyses in this paper are based on the human grades and the associated human
exam_point indicator, while the grader outputs are used only as an auxiliary decision-support baseline. The overall answering and grading process is summarized in
Figure 2.
Together, the LLM-based answering setup, the automatic baseline grades and the course instructor’s final grades provided a unified graded dataset that we use in the next subsection to define our performance metrics, difficulty labels (easy/medium/hard) and the notion of LLM-resistant questions.
2.4. Metrics and Analysis of LLM Performance and Question Difficulty
The graded dataset obtained from the answering and grading setup assigned, for each question and for each student model, a numeric score on a 1–10 scale and a categorical correctness label (correct/partial/incorrect), together with a derived binary exam indicator, exam_point, that marked whether the answer was counted as correct (1) or not (0). In what follows, we used the human grades and the associated exam_point values as our primary evaluation signals.
At the dataset level, we computed the mean, median and empirical distribution of human scores and exam_point separately for each combination of question source (HUMAN versus PIPELINE) and student model (Llama3-8B-Instruct versus Mistral-7B-Instruct). In addition, we reported the proportion of questions whose answers were graded correct, partial, or incorrect. These aggregates were further broken down by topic (introductory artificial intelligence, odds and probabilities, Naive Bayes spam filtering, linear regression on receipt-style data and basic Python programming), providing an overview of how easily each model handled different types of exam questions. Beyond these marginal summaries, we also examined the agreement between the two student models by comparing their item-level scores and by measuring how often they received the same correctness label on a given question.
To obtain an interpretable notion of item difficulty that does not conflate question difficulty with student preparedness, the course instructor labeled each exam item as easy, medium, or hard based on the expected difficulty for IA2 students and the reasoning steps required by the question. The labels were assigned at the question level before any scoring and they are independent of the scores later assigned to LLM answers. We use these labels only for stratified reporting of model performance across difficulty levels.
The presence of two student models also allowed us to identify questions that were consistently challenging under the human grading scheme. Items that received human exam_point = 0 for both Llama3-8B-Instruct and Mistral-7B-Instruct were treated as LLM-resistant questions in this study. For each topic and question source we reported the proportion of such LLM-resistant items, which served as our main quantitative indicator of how robustly resistant the HUMAN and PIPELINE question sets were to reasonably capable LLM students.
2.5. Implementation Details
All experiments were carried out on a dedicated virtual machine running a 64-bit Windows operating system. The machine was equipped with an AMD EPYC 9654 96-core processor (2.40 GHz), 128 GB of RAM, a 3 TB SSD, and an NVIDIA L40S-48Q GPU with 48 GB of VRAM.
The ExamQ-Gen question generation pipeline was implemented in Python 3.11 using the HuggingFace transformers [
34] and accelerate libraries together with bitsandbytes [
35], a GPU quantization library that enables low-precision 4-bit weight representations, and pypdf for text extraction from the IA2 course script. For PIPELINE questions we loaded a locally stored, 4-bit quantized Llama-3.3-70B-Instruct model from disk, using a helper that configured the tokenizer with left padding, set the padding token to the end-of-sequence token if needed, and mapped all layers to the single GPU with an appropriate floating-point compute type. Context fragments of up to 6000 characters were extracted from user-specified page ranges of the course PDF, and for each requested item we built a chat-style prompt, generated up to 512 new tokens using nucleus sampling (temperature = 0.4, top_p = 0.9), parsed the result into a single question-solution pair and wrote it to a JSONL file together with metadata about the chapter and page range.
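To make the item-generation step concrete, the sketch below shows a chat-style prompt builder and the JSONL record writer; the prompt wording, function names, and record fields are illustrative assumptions, not the ExamQ-Gen source code, and the actual generation call to the quantized model is omitted:

```python
import json

MAX_CONTEXT_CHARS = 6000  # context fragment length used for PIPELINE generation

def build_prompt(context: str, topic: str):
    """Chat-style prompt requesting one question-solution pair (wording is illustrative)."""
    return [
        {"role": "system",
         "content": "You write one short exam question with a reference solution."},
        {"role": "user",
         "content": f"Topic: {topic}\nCourse material:\n{context[:MAX_CONTEXT_CHARS]}\n"
                    "Return the question and the solution, clearly separated."},
    ]

def write_item(path, question, solution, chapter, pages):
    """Append one generated item plus provenance metadata as a JSONL record."""
    record = {"question": question, "solution": solution,
              "chapter": chapter, "pages": pages}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

One JSONL record per generated item keeps the downstream answering and grading stages line-oriented.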
The answering and grading stages used the HUMAN and PIPELINE JSONL files as input. Llama3-8B-Instruct and Mistral-7B-Instruct were deployed locally as student models, and Qwen2.5-7B-Instruct was deployed as the automatic grader. All three models were served through a lightweight API interface provided by Ollama [
36] on the same machine. For each question, Python scripts issued one request per student model to obtain short free-text answers, stored these answers in intermediate JSONL files, and then submitted the question, reference solution and model-generated answer to the automatic grader. Grader outputs were parsed automatically, with a small number of retries in case of malformed responses, and were written both to JSONL files and to a Neo4j graph database that stored questions, answers, models and grades as nodes and relationships. In a subsequent step, an expert instructor reviewed each question–answer pair and recorded the final human score and
exam_point label in CSV files with one record per model–question pair. Configuration files specifying model names, decoding parameters, database connection settings and random seeds were stored together with the datasets to facilitate replication and further experimentation.
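The retry-on-malformed-response behavior mentioned above can be sketched as follows; the expected JSON shape ({"score": ..., "label": ...}) and the function boundaries are our assumptions, and the actual HTTP call to the Ollama endpoint is abstracted behind a callable:

```python
import json

def parse_grade(raw: str):
    """Parse the grader's structured JSON reply; raise ValueError if malformed."""
    obj = json.loads(raw)
    score, label = int(obj["score"]), obj["label"]
    if not (1 <= score <= 10) or label not in {"correct", "partial", "incorrect"}:
        raise ValueError("out-of-range score or unknown label")
    return {"score": score, "label": label}

def grade_with_retries(call_grader, payload, max_retries: int = 3):
    """Query the grader and retry a few times on malformed output.

    `call_grader` is any callable (e.g., wrapping an Ollama API request)
    that returns the raw model reply as a string."""
    for _ in range(max_retries):
        try:
            return parse_grade(call_grader(payload))
        except (ValueError, KeyError):
            continue
    return None  # give up; flag the item for manual review
```

Because json.JSONDecodeError subclasses ValueError, invalid JSON and out-of-range scores are handled by the same retry path.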
3. Results
This section reports the empirical evaluation of the student LLMs under a standardized instructor grading protocol. Results are presented from an overall model-level comparison of performance and score distributions to progressively more fine-grained breakdowns that isolate the contribution of question source and other experimental factors.
3.1. Overall Performance of the Student LLMs
Overall performance of the student LLMs (Llama3-8B-Instruct and Mistral-7B-Instruct) was evaluated under a standardized instructor grading protocol on the operational 1–10 grading scale. For each model, results were pooled across both question sources (HUMAN and PIPELINE), yielding n = 150 graded answers per model (80 HUMAN + 70 PIPELINE). In addition to the numeric score, each graded response was associated with a binary exam indicator (exam_point), set to 1 when the answer would receive credit in an exam setting and 0 otherwise.
To provide a compact statistical characterization of score distributions,
Table 3 reports the mean and standard deviation (SD), median, minimum/maximum, and the interquartile range (IQR) for each model. The IQR is defined as IQR = Q3 − Q1 (75th minus 25th percentile) and captures the spread of the middle 50% of the distribution, complementing the SD, which is more sensitive to extreme values. In the pooled analysis, Mistral achieved a higher mean score (5.353) than Llama3 (3.407), reflecting stronger performance in the upper part of the score distribution. However, the median was 1.0 for both models, indicating that at least half of all responses received the minimum score, with the difference between models driven primarily by the upper tail rather than by typical (median) performance.
To move beyond aggregate statistics and expose distributional shape,
Figure 3 presents boxplots of the instructor scores and annotates the proportions of extreme outcomes at 1 and 10 for each model. In a boxplot, the box spans the first to third quartile, also known as the interquartile range, and the horizontal line marks the median. Whiskers summarize the remaining spread and outliers indicate unusually low or high values. The score distributions are highly polarized, with most mass at the endpoints rather than spread smoothly across the 1 to 10 scale. For Mistral, 149 of 150 answers received an extreme score, with 77 graded 1 and 72 graded 10, leaving only a single intermediate score of 6. For Llama3, 135 of 150 answers received an extreme score, with 104 graded 1 and 31 graded 10. This near-binary pattern indicates that in this setting the models rarely produced answers that merited partial credit. Intermediate scores were reserved for partially correct answers (correct core idea but missing required elements or containing non-trivial mistakes), consistent with the instructor rubric. Representative intermediate-score cases (e.g., scores 6 and 8) are provided in
Table 2. These boxplots pool HUMAN and PIPELINE items, aggregating over two exam conditions that can place the models in different performance regimes; we therefore break down performance by question source next.
To report exam-level outcomes aligned with the pass or fail decision used in our setting,
Table 4 reports the pass-point rate, together with 95% Wilson confidence intervals for the underlying binomial proportion. This decision-level summary remains informative even when the full 1 to 10 score distribution is polarized, because it directly measures how often each model would receive exam credit under the same human grading policy. The Wilson interval was used because it remains well-behaved for proportions away from 0.5 and provides stable uncertainty estimates at moderate sample sizes. The pass-point rate was 0.240 (36/150; 95% CI [0.179, 0.314]) for Llama3 and 0.480 (72/150; 95% CI [0.402, 0.559]) for Mistral, indicating that Mistral was substantially more likely to receive exam credit. Notably, in this dataset, passing decisions occurred only for high scores 9 and 10, meaning that the pass-point rate tracked the proportion of scores ≥ 9 rather than the overall mean.
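The Wilson score interval used for these proportions has a closed form and is easy to reproduce; a minimal implementation (standard formula, z = 1.96 for a 95% interval):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, wilson_ci(36, 150) reproduces the Llama3 interval [0.179, 0.314] reported above.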
Overall, the pooled analysis indicated a clear performance advantage for Mistral over Llama3, both in score-level and in exam-level credit assignment, while also revealing a strongly polarized grading pattern (many minimum scores coexisting with a large mass at the maximum score). This global view motivated the subsequent breakdowns, which disentangled whether the observed differences were driven primarily by question source (HUMAN vs. PIPELINE) and other experimental factors.
3.2. Performance by Question Source (HUMAN vs. PIPELINE)
Because the exam set combined two qualitatively different sources of questions—instructor-authored items (HUMAN) and automatically generated items derived from the course script (PIPELINE)—overall model comparisons were decomposed by source to determine whether the pooled performance differences were driven by a specific subset. This stratification was necessary because question source can alter both the difficulty profile and the linguistic form of items, which in turn can change not only average scores but also the shape of the grading distribution (e.g., concentration at the minimum score versus concentration at the maximum score).
To quantify instructor scores within each source,
Table 5 reports descriptive statistics for the 1–10 scale by (model, source): mean and standard deviation (SD), median, minimum/maximum, and the interquartile range (IQR). Here, IQR = Q3 − Q1, i.e., the difference between the 75th and 25th percentiles; it summarizes the spread of the middle 50% of outcomes and is especially informative when distributions are heavy at the extremes. In highly polarized grading, the IQR can either collapse to zero (when both Q1 and Q3 coincide at the same value, typically 1 or 10) or become very large (when Q1 sits at 1 and Q3 at 10); both outcomes directly reflect endpoint-heavy distributions rather than anomalies.
These summaries indicated a strong dependence on question source. On HUMAN questions (n = 80 per model), Llama3 achieved a higher mean score than Mistral (4.200 vs. 3.087), while the median remained 1.0 for both models. This combination (higher mean but identical median) implied that differences were driven primarily by the upper part of the distribution rather than by small shifts in typical performance. A striking feature was the very large IQR for HUMAN/Llama3 (9.0), which occurred because the 25th percentile sat at the minimum and the 75th percentile reached the maximum, indicating substantial mass at both endpoints within the middle 50% span. In contrast, HUMAN/Mistral had IQR = 0.0, indicating that the central mass remained tightly concentrated at the minimum despite the presence of some perfect scores (as reflected by the maximum of 10).
On PIPELINE questions (n = 70 per model), the pattern changed qualitatively. Mistral reached a mean of 7.943 and a median of 10.0, placing it in a near-ceiling regime on this subset, whereas Llama3 remained concentrated near the floor (mean 2.500, median 1.0). The fact that both PIPELINE groups had IQR = 0.0 did not mean an absence of meaningful variation; rather, it meant that at least half of each distribution’s mass accumulated at a single value (1 for Llama3, 10 for Mistral), consistent with extreme polarization.
To make this polarization explicit and visually interpretable (beyond percentiles and averages),
Figure 4 visualizes the same four groups using boxplots and annotates the proportions of score = 1 and score = 10. In a boxplot, the box spans Q1–Q3 (the IQR) and the central line marks the median; when distributions concentrate at a single value, the box can degenerate (IQR = 0), which directly reflects the data.
The annotated extremes made the distributional regimes behind the summary statistics explicit. On HUMAN, both models frequently received the minimum score, with score = 1 occurring for 58.8% of Llama3 answers and 76.3% of Mistral answers; nonetheless, perfect scores were still common (26.3% for Llama3 and 22.5% for Mistral), which explained why means differed even though both medians stayed at 1. On PIPELINE, outcomes separated sharply: Llama3 concentrated at the minimum (score = 1 in 81.4% of cases; score = 10 in 14.3%), whereas Mistral concentrated at the maximum (score = 10 in 77.1% of cases; score = 1 in 22.9%). This constituted a clear source-dependent reversal, meaning that the relative ranking of the student models depended on the question source rather than being stable across subsets.
Because exam outcomes are determined by thresholded decisions, pass/fail behavior was also reported.
Table 6 reports the pass-point rate (exam_point) for each (model, source) group together with 95% Wilson confidence intervals for the corresponding binomial proportions. The Wilson interval provided stable uncertainty estimates for proportions that were far from 0.5 and avoided pathological behavior near 0 or 1.
The pass-point breakdown reinforced the score-level findings and quantified their practical impact. On HUMAN, Llama3 achieved 0.325 (26/80), while Mistral achieved 0.225 (18/80), indicating a modest advantage for Llama3 on instructor-authored items. On PIPELINE, the separation was large and reversed: Mistral achieved 0.771 (54/70), whereas Llama3 achieved 0.143 (10/70). Importantly, these were not small proportional shifts around a common baseline; they reflected fundamentally different regimes of exam credit assignment—near-ceiling performance for Mistral on PIPELINE versus persistent near-floor outcomes for Llama3 on the same subset.
To provide a compact visual summary of these exam-level outcomes,
Figure 5 presents the pass-point rates and confidence intervals across sources and models.
Overall, decomposing performance by question source showed that pooled results masked a strong interaction: the two models behaved similarly poorly on many HUMAN questions (with Llama3 retaining a modest advantage), while PIPELINE questions amplified differences substantially and favored Mistral decisively. This source dependence explained why aggregate comparisons in the pooled analysis could be misleading, and it motivated subsequent analyses of which properties of PIPELINE items were associated with near-ceiling performance for Mistral and near-floor outcomes for Llama3.
3.3. Agreement Between the Course Instructor and the Automatic Grader
To verify that the automatic grading pipeline aligns with the human grading signal used throughout the analysis, we compared the grader against the course instructor on both the continuous 1–10 score and the binary exam_point decision. Because the dataset mixes HUMAN (instructor-authored) and PIPELINE (pipeline-generated) items, we report agreement overall and by source to detect source-dependent shifts.
Score-level agreement is summarized in
Table 7, which reports MAE (mean absolute error), RMSE (root mean squared error), the mean bias (Grader − Instructor), and Spearman’s ρ (rank consistency under the discrete score scale). Overall, the grader shows moderate alignment with the instructor (MAE 2.21; bias −0.457; ρ 0.559), but this aggregate masks a strong source effect: agreement is much higher for HUMAN (MAE 0.81; bias −0.125; ρ 0.799) and substantially weaker for PIPELINE (MAE 3.81; bias −0.836; ρ 0.189). The consistently negative bias indicates that the Grader tends to assign lower scores than the instructor, with under-scoring substantially more pronounced on pipeline-generated items.
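The score-level agreement metrics in Table 7 are standard; a minimal sketch of MAE, RMSE, and mean bias over paired scores (Spearman's ρ can be obtained from scipy.stats.spearmanr and is omitted here):

```python
from math import sqrt

def score_agreement(grader, instructor):
    """MAE, RMSE and mean bias (grader - instructor) over paired 1-10 scores."""
    diffs = [g - i for g, i in zip(grader, instructor)]
    n = len(diffs)
    return {
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "bias": sum(diffs) / n,  # negative means the grader under-scores
    }
```
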
To inspect whether disagreement depends on score magnitude and source, the score differences are visualized in
Figure 6 using a Bland–Altman plot, where the y-axis is (Grader − Instructor) and the x-axis is the mean of the two scores. The plot confirms a small overall negative bias (−0.457) and wide limits of agreement (approximately [−7.674, 6.761]). Coloring by source makes the shift explicit: bias is near zero on HUMAN (−0.125) but becomes clearly more negative on PIPELINE (−0.836).
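The Bland–Altman quantities plotted in Figure 6 follow directly from the paired differences; a sketch of the mean difference and the conventional 95% limits of agreement (mean ± 1.96 × SD of the differences):

```python
from statistics import mean, stdev

def bland_altman(grader, instructor):
    """Mean difference and 95% limits of agreement for paired scores."""
    diffs = [g - i for g, i in zip(grader, instructor)]
    m, s = mean(diffs), stdev(diffs)  # sample SD of the differences
    return m, (m - 1.96 * s, m + 1.96 * s)
```
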
Beyond numeric scores, the evaluation also uses exam_point as an operational pass/fail decision. In this dataset, the instructor’s exam_point follows a strict threshold rule: exam_point = 1 only for instructor scores 9–10, and exam_point = 0 otherwise. Given the strongly polarized score distribution (dominated by scores at the minimum and maximum), this implies that pass-point rates primarily reflect the prevalence of scores ≥ 9 rather than partial-credit improvements.
Pass-point agreement is reported in
Table 8 through confusion counts (TP/FP/FN/TN) and derived metrics, including FPR (false-pass rate) and FNR (false-fail rate). Overall accuracy is 0.78, but the error profile is asymmetric: the Grader produces few false passes (FPR 0.03) while missing a large fraction of instructor passes (Recall 0.44; FNR 0.56). This asymmetry is again strongly source-dependent. On HUMAN, the Grader closely matches the instructor (Accuracy 0.93; Recall 0.84; FNR 0.16). On PIPELINE, the Grader becomes highly conservative (Accuracy 0.61) and recovers only 17% of Instructor passes (Recall 0.17; FNR 0.83), meaning that many answers that receive exam credit from the instructor are labeled as fails by the Grader in the pipeline setting.
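The confusion counts and the FPR/FNR definitions used in Table 8 can be made explicit with a short helper, treating the instructor's exam_point as ground truth (function and key names are ours):

```python
def decision_errors(grader_pass, instructor_pass):
    """Confusion counts and error rates for binary pass decisions."""
    pairs = list(zip(grader_pass, instructor_pass))
    tp = sum(g == 1 and i == 1 for g, i in pairs)
    fp = sum(g == 1 and i == 0 for g, i in pairs)
    fn = sum(g == 0 and i == 1 for g, i in pairs)
    tn = sum(g == 0 and i == 0 for g, i in pairs)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "fnr": fn / (tp + fn) if tp + fn else 0.0,  # false fails among instructor passes
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # false passes among instructor fails
    }
```
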
In
Figure 7, pass-point rates are compared between the instructor and Grader across model × source groups, with 95% Wilson confidence intervals for each proportion. The discrepancies are not uniform: the largest gaps appear in the PIPELINE condition, where instructor pass rates remain high while Grader pass rates are much lower, and the confidence intervals remain well separated, indicating a substantive disagreement rather than sampling variability.
Overall, these results indicate that the Grader approximates Instructor grading well on HUMAN items but diverges substantially on PIPELINE, both in numeric scores and, more critically, in pass/fail decisions, where the Grader exhibits a strong false-fail tendency.
3.4. Diagnosing Instructor–Automatic Grader Discrepancies by Model and Question Source
This analysis pinpoints where disagreements between the course instructor and the automatic Grader concentrate by stratifying results across the four model × source groups (HUMAN vs. PIPELINE; Llama3 vs. Mistral). We quantify discrepancies at two levels: score-level differences on the 1–10 scale, using Δscore = (Grader − Instructor), and decision-level differences on the binary exam_point outcome. For score-level comparisons, we report both magnitude (MAE, RMSE) and direction (mean bias), together with the frequency of exact matches (Δscore = 0) and large mismatches (|Δscore| ≥ 3). For pass/fail comparisons, we focus on FNR (false-fail rate among instructor passes) and FPR (false-pass rate among instructor fails), because they directly characterize whether the grader is conservative (high FNR) or permissive (high FPR).
Discrepancy metrics are reported in
Table 9. Agreement is tight in the HUMAN setting for both models: exact score matches occur for 82.5% of HUMAN/Llama3 and 76.2% of HUMAN/Mistral, and large mismatches remain rare (11.2% and 8.8%, respectively). The direction of the disagreement differs slightly by model within HUMAN: HUMAN/Llama3 shows mild under-scoring (bias −0.750), whereas HUMAN/Mistral shows mild over-scoring (bias +0.500). In contrast, PIPELINE introduces large and systematic shifts that depend strongly on the model. For PIPELINE/Llama3, the grader tends to over-score relative to the instructor (bias +1.886) and large mismatches become common (42.9%). For PIPELINE/Mistral, the grader shows strong under-scoring (bias −3.557) and very frequent large mismatches (82.9%). Overall,
Table 9 indicates that discrepancies are not uniformly larger on PIPELINE; they become directional and model-specific.
The distribution of score differences is shown in Figure 8. The two HUMAN groups remain tightly centered around Δscore = 0, and disagreements appear primarily as isolated outliers. In contrast, the PIPELINE distributions separate clearly: PIPELINE/Llama3 shifts upward (grader > instructor), while PIPELINE/Mistral shows a pronounced downward shift (grader < instructor), consistent with frequent large mismatches and pervasive under-scoring.
Decision-level discrepancies are summarized in
Figure 9, which reports FNR and FPR with 95% Wilson confidence intervals. Across all four groups, disagreement is dominated by false fails rather than false passes: FPR remains near zero or small (0.000–0.065), whereas FNR varies substantially across conditions. The largest discrepancy occurs for PIPELINE/Mistral, where the grader is highly conservative (FNR = 0.907, FPR = 0.000), indicating that instructor passes are rarely recovered by the grader. PIPELINE/Llama3 also shows elevated conservatism (FNR = 0.400) but to a lesser extent, while the HUMAN groups remain comparatively well calibrated, particularly HUMAN/Mistral (FNR = 0.056).
The underlying confusion counts and derived classification metrics are reported in
Table 10; PIPELINE/Mistral exhibits an extreme conservative profile relative to the instructor (ground truth): the LLM grader produces no false passes (FP = 0; Precision = 1.00) but misses most instructor passes (TP = 5 vs. FN = 49), resulting in very low recall (0.093) and low F1 (0.169). PIPELINE/Llama3 is more balanced (TP = 6, FN = 4, FP = 2), yielding moderate recall (0.600) with a low false-pass rate (FPR = 0.033). The HUMAN groups show high overall accuracy (0.925–0.938) with stronger recall.
Overall, discrepancy patterns are stable on HUMAN items but become strongly source-dependent and model-dependent on PIPELINE. The PIPELINE/Mistral condition concentrates most of the disagreement through an extreme false-fail tendency, whereas PIPELINE/Llama3 shows a higher false-pass tendency (FPR = 0.033) with more moderate pass/fail discrepancies. This localized view clarifies why aggregate agreement statistics can mask source- and model-specific effects and motivates reporting instructor–grader alignment separately by source and model.
3.5. Stability and Bias of Automatic Grading Across Experimental Conditions
Automatic grading stability is examined across experimental conditions by comparing instructor scores with the automatic grader at both the score level and the pass/fail decision level. Tolerance-invariant agreement and directional bias in 1–10 scores are quantified across model × source groups, and the exam-relevant impact at the instructor’s pass threshold is assessed by focusing on the severity of decision errors, including credit denial for instructor-perfect (score = 10) answers.
3.5.1. Grader–Instructor Score Agreement Across Experimental Conditions
We first assess grader–instructor agreement at the score level on the full 1–10 scale, independently of any pass threshold. This analysis is designed to capture whether the automatic grading procedure assigns scores that are close to the instructor's scores in magnitude, and whether any remaining disagreement exhibits a consistent direction (systematic under- or over-scoring). For each response $i$, we compute the signed discrepancy
$$\Delta_i = s_i^{\mathrm{grader}} - s_i^{\mathrm{instructor}}.$$
From these item-wise discrepancies, we report two complementary summaries within each model × source group (HUMAN/PIPELINE × Llama3/Mistral). The first is the mean absolute error (MAE),
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \Delta_i \rvert,$$
which measures the typical absolute deviation in score points regardless of direction. The second is the signed bias,
$$\mathrm{Bias} = \frac{1}{n} \sum_{i=1}^{n} \Delta_i,$$
which indicates whether the grader systematically assigns lower or higher scores than the instructor on average. Under this definition, negative bias implies under-scoring by the grader (instructor scores tend to be higher), whereas positive bias implies over-scoring (grader scores tend to be higher). To quantify uncertainty in the magnitude of disagreement, we additionally report 95% bootstrap confidence intervals for MAE by resampling items within each group. In the present setup, the average panel size equals 1 because a single automatic grader score is used per item.
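The bootstrap CI for MAE described above can be sketched as a simple percentile bootstrap over the item-wise discrepancies (the resampling size and seed handling are our choices, not the paper's exact configuration):

```python
import random

def bootstrap_mae_ci(diffs, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the MAE, resampling items within a group."""
    rng = random.Random(seed)
    abs_d = [abs(d) for d in diffs]
    n = len(abs_d)
    maes = sorted(sum(rng.choice(abs_d) for _ in range(n)) / n
                  for _ in range(n_boot))
    lo = maes[int((alpha / 2) * n_boot)]
    hi = maes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```
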
Table 11 reports these tolerance-invariant agreement statistics by group. Two patterns stand out. First, score-level agreement is consistently tight in the HUMAN setting: MAE is approximately 0.8 for both models (0.825 for HUMAN/Llama3 and 0.800 for HUMAN/Mistral), indicating that the grader’s score typically stays within roughly one point of the instructor’s score on average. Second, agreement degrades sharply in the PIPELINE setting, where the typical discrepancy becomes multi-point: MAE rises to 2.743 for PIPELINE/Llama3 and to 4.871 for PIPELINE/Mistral, showing that large score gaps become common, particularly for PIPELINE/Mistral. The bias values further indicate that disagreement is not purely symmetric noise but also directional and model-dependent: PIPELINE/Mistral exhibits a large negative bias (−3.557), consistent with systematic under-scoring by the grader relative to the instructor, whereas PIPELINE/Llama3 exhibits a positive bias (+1.886), consistent with systematic over-scoring. Even within HUMAN, smaller directional tendencies are visible (negative for Llama3: −0.750 and positive for Mistral: +0.500), but these are modest compared with the PIPELINE shifts.
Overall, score-level agreement is stable and close in the HUMAN setting, but substantially weaker in the PIPELINE setting, with strong directional shifts in the latter. This motivates examining whether these score differences translate into pass/fail mismatches at the instructor pass threshold.
3.5.2. Decision-Level Severity at the Pass Threshold
Instructor-facing decision errors at the operational pass threshold (pass if score ≥ 9) can vary widely in severity, so
Table 12 summarizes a severity decomposition focused on instructor-perfect cases (instructor score = 10) by measuring how often the automatic grader converts these clear passes into a failing outcome (grader < 9), with counts and 95% Wilson confidence intervals. Perfect-pass denial is rare under HUMAN sources (HUMAN/Llama3: 1/21 = 0.048; HUMAN/Mistral: 1/18 = 0.056), indicating that clear passes are typically preserved. The PIPELINE sources show a sharp escalation: PIPELINE/Llama3 reaches 5/10 = 0.500 (95% CI: 0.237–0.763), while PIPELINE/Mistral rises to 50/54 = 0.926 (95% CI: 0.824–0.971), implying that most instructor-perfect answers are denied credit in that condition even after accounting for uncertainty.
A compact visualization of this “credit-denial for clear passes” effect is provided in
Figure 10, which plots the perfect-pass miss rate (instructor score = 10 but grader < 9) with 95% Wilson confidence intervals across the four source × model groups. The figure makes the asymmetry across conditions immediately visible: both HUMAN groups remain close to zero, PIPELINE/Llama3 shifts upward with substantial uncertainty due to smaller support, and PIPELINE/Mistral concentrates near the top of the scale with a comparatively tight interval, reflecting an operationally severe regime in which instructor-perfect answers are frequently converted into failing outcomes at the pass threshold.
Overall, this severity-focused view shows that decision-level errors are not confined to borderline threshold flips: under PIPELINE conditions—especially PIPELINE/Mistral—the grader frequently denies credit even for answers judged fully correct by the instructor, which constitutes a direct, exam-relevant failure mode at the operational pass criterion.
3.5.3. Sensitivity to Alternative Pass Thresholds
Because pass policies differ across institutions, we conducted a sensitivity analysis by recomputing the pass/fail indicator at lower thresholds (t = 6 and t = 5), applying the same rule (score ≥ t) to both instructor and grader scores.
Table 13 reports decision-level error rates aggregated by source group (HUMAN vs. PIPELINE) across both evaluated student models. Lowering the threshold reduces false fails under PIPELINE, but it also increases false passes, making the policy-dependent trade-off explicit.
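The threshold sensitivity analysis amounts to re-deriving pass decisions from the raw scores at each candidate threshold; a minimal sketch (function and key names are ours):

```python
def error_rates_at_threshold(grader_scores, instructor_scores, t: int):
    """False-fail and false-pass rates when pass = (score >= t) for both parties."""
    pairs = list(zip(grader_scores, instructor_scores))
    inst_pass = [g for g, i in pairs if i >= t]   # instructor passes
    inst_fail = [g for g, i in pairs if i < t]    # instructor fails
    fnr = sum(g < t for g in inst_pass) / len(inst_pass) if inst_pass else 0.0
    fpr = sum(g >= t for g in inst_fail) / len(inst_fail) if inst_fail else 0.0
    return {"threshold": t, "false_fail_rate": fnr, "false_pass_rate": fpr}
```

Sweeping t over candidate policies makes the false-fail versus false-pass trade-off explicit on the same graded data.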
4. Discussion
The results indicate that automatic grading performance must be interpreted at two levels: agreement on the 1–10 score scale and the downstream consequences of thresholded decisions. While score-level discrepancies can be modest on average, small systematic biases can translate into large exam-relevant risks at the instructor’s pass criterion, especially under condition shifts between HUMAN and PIPELINE sources. These patterns suggest that the validation of LLM-based graders should prioritize decision-level error asymmetry and severity—not only aggregate score agreement—when assessing suitability for high-stakes educational use.
4.1. Interpreting Agreement: Score-Level Similarity vs. Threshold Risk
Automatic grading performance in educational settings is not captured by a single notion of “agreement,” because the same score-level discrepancy can have radically different operational consequences depending on how grades are consumed. When scores are used as continuous signals (e.g., for formative feedback), moderate deviations may be tolerable if rankings and broad performance bands are preserved. In contrast, when scores are mapped into discrete outcomes—most importantly, pass/fail—small systematic deviations can become high-impact, because a fixed threshold compresses all outcomes into a binary decision and concentrates the cost of error near the boundary.
This thresholding effect creates an intrinsic coupling between calibration and fairness. A grader that is slightly conservative on the 1–10 scale may still appear stable under score-level metrics, yet it can deny credit disproportionately often when the pass criterion is strict. In such cases, overall agreement statistics can mask a harmful asymmetry: low false-pass tendencies may coexist with high false-fail tendencies. Operationally, these two error types are not interchangeable. False passes can be treated as a standards risk, while false fails directly produce credit denial for legitimate passes; in many instructional contexts, the latter carries a higher fairness and trust cost.
Decision-level metrics therefore serve a distinct purpose from score-level agreement measures. They expose whether the system’s errors are balanced or skewed, and whether its performance depends on the base rate of passes/fails in the evaluated pool. Accuracy alone can be misleading under imbalance or conservative decision policies: a model can achieve acceptable accuracy by predicting “fail” frequently if many cases are indeed fails, while still producing an unacceptable rate of false fails among instructor passes. For exam deployment, the relevant question is not only “How close are the scores?” but “How often does the system flip the instructor’s decision at the pass boundary, and in which direction?”
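To make the two evaluation levels concrete, the following sketch computes score-level agreement and thresholded decision metrics side by side. The scores are hypothetical; only the pass criterion (pass if score ≥ 9) is taken from the setting described above. The example grader applies a constant one-point conservative bias, so its mean absolute error looks modest while its false-fail rate is severe:

```python
PASS = 9  # pass if score >= 9, the instructor's criterion used in this study

def decision_metrics(instructor, grader, threshold=PASS):
    """Contrast score-level agreement with thresholded decision agreement."""
    n = len(instructor)
    mae = sum(abs(i - g) for i, g in zip(instructor, grader)) / n
    inst_pass = [i >= threshold for i in instructor]
    grad_pass = [g >= threshold for g in grader]
    accuracy = sum(a == b for a, b in zip(inst_pass, grad_pass)) / n
    passes = sum(inst_pass)
    fails = n - passes
    false_fail = sum(a and not b for a, b in zip(inst_pass, grad_pass))
    false_pass = sum((not a) and b for a, b in zip(inst_pass, grad_pass))
    return {
        "mae": mae,
        "accuracy": accuracy,
        # denied legitimate passes, as a fraction of instructor passes
        "false_fail_rate": false_fail / passes if passes else 0.0,
        # undeserved passes granted, as a fraction of instructor fails
        "false_pass_rate": false_pass / fails if fails else 0.0,
    }

# Hypothetical paired scores on the 1-10 scale; the grader is uniformly
# one point more conservative than the instructor.
instructor = [10, 9, 9, 10, 8, 7, 9, 10]
grader = [s - 1 for s in instructor]
m = decision_metrics(instructor, grader)
# MAE is only 1.0, yet half of the instructor's passes become false fails,
# because every pass scored exactly 9 is pushed below the cutoff.
```

The design choice is deliberate: both views are computed from the same paired scores, so any divergence between them is purely an effect of thresholding, not of the data.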
Beyond frequency, the severity of decision errors matters. A threshold flip can arise from genuinely borderline cases (e.g., instructor score at the threshold) or from cases that are unambiguously correct by the instructor’s judgment. These two failure modes have different interpretations and remediation strategies. Borderline flips may reflect inevitable subjectivity or noise around the cutoff, whereas denial of instructor-perfect answers indicates a deeper calibration mismatch or a systematic scoring bias. Separating these regimes clarifies whether conservative grading is merely “cautious near the boundary” or whether it extends to clearly correct work and thus represents a more serious deployment risk.
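The severity decomposition above can be sketched as a simple stratification of false fails. The strata below are illustrative choices on the 1–10 scale: "borderline" means the instructor score sits exactly at the threshold, and "severe" means the grader fails an instructor-perfect answer:

```python
PASS = 9  # pass if score >= 9

def stratify_false_fails(instructor, grader, threshold=PASS, perfect=10):
    """Split false fails into borderline flips, denials of instructor-perfect
    answers, and any remaining cases (non-empty only on finer score scales)."""
    strata = {"borderline": 0, "severe": 0, "other": 0}
    for i, g in zip(instructor, grader):
        if i >= threshold and g < threshold:  # a false fail
            if i == perfect:
                strata["severe"] += 1
            elif i == threshold:
                strata["borderline"] += 1
            else:
                strata["other"] += 1
    return strata

# Hypothetical scores: two denials of perfect answers, two borderline flips.
instructor = [10, 10, 9, 9, 9, 10]
grader     = [ 8, 10, 8, 9, 8,  7]
s = stratify_false_fails(instructor, grader)
```

Reporting the strata separately, rather than one aggregated false-fail count, is what distinguishes "cautious near the boundary" from the deeper calibration mismatch discussed above.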
Finally, this interpretation must be conditioned on experimental setting. The observed differences between HUMAN and PIPELINE sources indicate that stability is not only a property of the grader, but also of the response distribution it is exposed to. A grader that behaves acceptably under one source condition may become miscalibrated under another, turning a small score-level bias into a large decision-level impact at the pass threshold. For this reason, the discussion that follows emphasizes decision-level asymmetry and severity under condition shifts, because these features determine whether automatic grading is operationally reliable for high-stakes educational use.
4.2. Condition Effects: HUMAN vs. PIPELINE as a Robustness Stress Test
The contrast between HUMAN and PIPELINE conditions functions as a robustness stress test for automatic grading, because it probes whether the grader’s behavior is stable under changes in the distribution and presentation of answers. Even when the grading rubric and the instructor standard remain fixed, the observed performance shifts indicate that the automatic grader is sensitive to properties of the response pool that are not directly tied to content correctness alone. In practical terms, this means that “good performance” measured on one type of student response does not automatically transfer to another, and validation must explicitly cover the kinds of answers the system will face in deployment.
Several mechanisms can plausibly drive this condition dependence without requiring any change in the instructor’s underlying criteria. PIPELINE answers may differ from HUMAN answers in length, structure, phrasing, completeness signals, or stylistic markers that correlate imperfectly with correctness. A grader that relies on such cues—implicitly or explicitly—can become systematically conservative or systematically lenient when the cue distribution shifts. Importantly, this kind of shift can be difficult to detect if evaluation focuses only on average score-level agreement, because the same mean error can be operationally benign in one regime and harmful in another once a threshold is applied.
The decision-level results suggest that the main failure mode under PIPELINE is not an increase in false passes but a strong increase in false fails, consistent with a conservative grading policy under distribution shift. This pattern is especially concerning in exam settings, because it converts calibration drift into credit denial. Under such a regime, a grader may appear “safe” in the sense of rarely granting undeserved passes, yet it can simultaneously violate fairness expectations by rejecting legitimate passes at scale. The severity analysis sharpens this interpretation by showing that the shift is not confined to borderline cases: in the most affected condition, the grader denies credit even for answers that the instructor judges fully correct, indicating a misalignment that is unlikely to be explained by boundary noise alone.
From a methodological standpoint, these condition effects imply that the HUMAN/PIPELINE split should be treated as more than an incidental experimental detail. It operationalizes a realistic deployment concern: graders are often evaluated on curated or homogeneous datasets, but deployed on heterogeneous response distributions shaped by different generation processes, student writing styles, or tool-mediated workflows. A robust grader should therefore be assessed across such condition variations, and reported performance should make explicit whether observed errors are stable or shift-dependent. In this sense, HUMAN vs. PIPELINE provides an interpretable axis of stress that exposes whether the grader generalizes or whether it requires condition-specific calibration to remain reliable at the pass threshold.
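Disaggregating by condition is straightforward to operationalize. The sketch below (hypothetical records; the condition labels mirror the HUMAN/PIPELINE split) reports the false-fail rate per source condition, showing how an aggregate rate can hide a shift-dependent regime:

```python
from collections import defaultdict

PASS = 9  # pass if score >= 9

def false_fail_rate_by_condition(records, threshold=PASS):
    """Compute the false-fail rate separately per source condition.

    `records` holds (condition, instructor_score, grader_score) triples.
    """
    passes = defaultdict(int)
    false_fails = defaultdict(int)
    for cond, inst, grad in records:
        if inst >= threshold:
            passes[cond] += 1
            if grad < threshold:
                false_fails[cond] += 1
    return {c: false_fails[c] / passes[c] for c in passes}

# Illustrative data: the aggregate false-fail rate is 0.5, which conceals
# that one condition drives three of the four denied-credit cases.
records = [
    ("HUMAN", 10, 10), ("HUMAN", 9, 9), ("HUMAN", 9, 8), ("HUMAN", 10, 9),
    ("PIPELINE", 10, 8), ("PIPELINE", 9, 7), ("PIPELINE", 10, 10),
    ("PIPELINE", 9, 8),
]
rates = false_fail_rate_by_condition(records)
```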
4.3. Conservative Grading and Error Asymmetry at the Pass Threshold
A central theme emerging from the decision-level analyses is the asymmetry between false passes and false fails at the instructor’s pass criterion. In many educational settings, conservative grading can appear attractive because it prioritizes avoiding false passes—i.e., it reduces the chance that an insufficient answer is incorrectly awarded credit. However, when the pass threshold is operational (pass if score ≥ 9), conservative bias does not simply “play it safe”; it reallocates error mass toward false fails, which directly translates into credit denial for legitimate passes. This creates a fairness–risk trade-off that cannot be evaluated using accuracy alone and must instead be examined through directional error rates.
Error asymmetry matters because the two decision errors have different consequences and different tolerances. A small false-pass rate may be acceptable from a standards perspective, but a large false-fail rate undermines trust and can be difficult to justify to students and instructors, particularly when it affects clearly correct responses rather than borderline cases. In a thresholded grading regime, even a modest downward bias on the 1–10 scale can produce a disproportionate number of false fails if many true passes cluster near the cutoff, or if the grader’s calibration drifts under condition shifts. This explains why score-level agreement can coexist with strong decision-level harm: the same systematic bias that looks small when averaged across the entire score range becomes consequential when evaluated at a single boundary.
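The clustering argument can be illustrated with two hypothetical pools of true passes: one clustered just above the cutoff and one spread toward the top of the scale. The same 0.5-point downward bias flips most of the clustered passes and none of the spread ones:

```python
PASS = 9  # pass if score >= 9

def flipped_fraction(true_scores, bias, threshold=PASS):
    """Fraction of legitimate passes flipped to fail by a constant
    downward grading bias."""
    passes = [s for s in true_scores if s >= threshold]
    flipped = sum(1 for s in passes if s - bias < threshold)
    return flipped / len(passes)

# Hypothetical score pools: identical size, identical bias, very
# different decision-level damage.
clustered = [9.0, 9.4, 10.0, 9.2, 9.8, 9.3]  # passes hug the cutoff
spread    = [9.6, 9.9, 10.0, 9.7, 9.8, 9.5]  # passes sit well above it
# flipped_fraction(clustered, 0.5) is 4/6; flipped_fraction(spread, 0.5) is 0.
```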
The severity perspective reinforces that not all false fails are equivalent. Borderline flips (e.g., an instructor score at the threshold) can often be interpreted as noise or subjectivity around a hard cutoff, and mitigation may focus on clarifying rubrics or introducing an appeal/review process for near-threshold cases. By contrast, false fails that occur for instructor-perfect answers indicate a deeper misalignment: the grader is not merely uncertain near the boundary, but is applying criteria that systematically under-recognize correctness in certain conditions. This is operationally the worst case because it cannot be addressed by small threshold adjustments alone without risking a surge in false passes; it instead suggests that the grader’s scoring function (or its calibration) changes qualitatively across conditions.
These observations imply that deploying automatic grading at a strict pass threshold requires explicit safeguards against asymmetric harm. At minimum, evaluation should report both false-pass and false-fail behavior and treat them as separate objectives rather than collapsing them into a single aggregate score. More importantly, when conservative behavior is observed—especially under distribution shift—its acceptability should be judged in terms of the severity of the denied-credit cases. In exam settings, a conservative grader that rarely grants undeserved passes but frequently rejects legitimate passes may be unsuitable without additional calibration, human-in-the-loop review for high-severity cases, or policy constraints that cap false-fail rates.
4.4. Methodological Implications for LLM-Based Educational Assessment
The results motivate several methodological implications for how LLM-based graders should be evaluated and reported in educational contexts. First, score-level agreement and decision-level reliability should be treated as complementary, not interchangeable, evaluation targets. Reporting only continuous-score metrics (e.g., average error, correlation) can obscure operational failure modes that emerge after thresholding, while reporting only pass/fail metrics can hide systematic score biases that may matter for ranking, feedback, or grade boundaries beyond a single cutoff. A complete evaluation therefore requires both views, with explicit attention to how conclusions change when moving from the 1–10 scale to thresholded outcomes.
Second, decision-level evaluation should go beyond overall accuracy and include directional error rates and their uncertainty. Accuracy can be dominated by class prevalence and may remain high even when a grader adopts a strongly conservative policy that rejects many legitimate passes. Directional rates—false-pass and false-fail behavior—directly quantify the relevant risks, while confidence intervals prevent over-interpreting small-sample fluctuations and make condition comparisons more defensible. In practice, this means that evaluation reports should treat the false-pass rate (FPR) and the false-fail rate (FNR) as first-class metrics, rather than secondary diagnostics.
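One standard way to attach uncertainty to a directional error rate is the Wilson score interval, which behaves sensibly at the small sample sizes typical of per-condition strata. The counts below are hypothetical:

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial rate, e.g., the false-fail
    rate among instructor passes (k errors out of n relevant cases)."""
    if n == 0:
        return (0.0, 1.0)  # no support: the rate is unconstrained
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# Hypothetical stratum: 6 false fails among 20 instructor passes.
lo, hi = wilson_interval(6, 20)
# The 95% interval spans roughly 0.15 to 0.52 at n = 20, which is a
# direct warning against over-interpreting a single-point FNR estimate.
```

The same computation also illustrates the uneven-support caveat in the limitations: strata with fewer instructor-perfect cases yield proportionally wider intervals.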
Third, severity-aware analysis should be incorporated when automatic grading is intended for high-stakes use. Aggregated false-fail rates alone do not distinguish between benign boundary noise and severe credit denial for clearly correct work. Decompositions that isolate high-severity strata—such as instructor-perfect answers or high-confidence rubric satisfaction—reveal whether errors cluster at the threshold or extend to unequivocal cases. This distinction matters for both interpretation and mitigation: borderline flips may be addressed through review policies near the cutoff, while severe misses indicate the need for recalibration, rubric alignment, or changes to the grader’s reasoning prompts and scoring constraints.
Fourth, robustness to condition shifts should be made explicit in evaluation design. The HUMAN/PIPELINE contrast illustrates that automatic graders can change behavior under shifts in response distribution even when the instructional target remains the same. Consequently, benchmarks that evaluate graders on a single homogeneous answer pool may overestimate deployment reliability. Methodologically, this suggests that educational grading evaluations should include deliberate stress tests that vary response source, style, and structure, and should report performance disaggregated by these factors to avoid hiding worst-case regimes in aggregate averages.
Finally, these findings support a shift in how “grading quality” is operationalized for LLM-based assessment: suitability should be judged not only by average agreement but by risk profiles under the intended decision policy. When the grading output feeds into a strict pass criterion, evaluation should prioritize threshold calibration, directional decision errors, and severity—because these are the dimensions that determine whether an automatic grader behaves as a trustworthy component of an exam pipeline.
4.5. Limitations
Several limitations should be considered when interpreting these findings. First, the automatic grading configuration effectively corresponds to a single-grader setting (panel size = 1), which means that variability due to grader stochasticity or inter-grader disagreement is not averaged out. While this reflects a realistic low-cost deployment scenario, it may overstate instability relative to designs that aggregate multiple independent grading judgments.
Second, the decision analysis is tied to a specific operational policy: a fixed pass threshold (pass if score ≥ 9) applied identically to instructor and automatic scores. This choice is appropriate for the targeted exam setting, but different courses or grading schemes may use alternative cutoffs, multi-level grade bands, or curved policies. As a result, the reported decision-level risks—especially near-threshold behavior—should be interpreted as policy-conditional rather than as universal properties of the grader.
Third, the number of instructor-perfect cases varies substantially across model × source groups, leading to uneven statistical support for the severity estimates. In particular, some conditions contain relatively few instructor-perfect observations, which produces wider confidence intervals and limits the precision of cross-condition comparisons. Conversely, conditions with larger support yield tighter intervals and therefore more reliable estimates of systematic effects.
Fourth, the analysis is restricted to the experimental conditions and answer pools used in this study (a single course (IA2) with a single instructor reference, HUMAN vs. PIPELINE, and the examined model groups). These conditions capture meaningful distribution shifts, but they do not exhaust the diversity encountered in real educational deployment, where student writing quality, prompt phrasing, language proficiency, and use of external tools can introduce additional sources of variation. Generalization beyond the studied setting should therefore be made cautiously.
Finally, the evaluation compares automatic grading directly to a single instructor reference. The instructor provides a practical and relevant ground truth for the target course, and the technical, reference-solution-anchored nature of the items together with the explicit instructor rubric constrains the grading degrees of freedom; however, these factors do not eliminate potential instructor-specific strictness or leniency. Instructor grading itself may contain subjectivity, and alternative reference designs (e.g., multi-instructor panels or adjudication) could change the estimated disagreement rates. These limitations do not negate the observed patterns, but they bound the scope of the claims to the operational setup and reference standard used here.
These findings have direct implications for teaching practice in technical, reference-solution–anchored assessments. Instructors should treat LLM-based grading primarily as decision support rather than as an autonomous pass/fail gate, because threshold-sensitive miscalibration can translate into disproportionate consequences under strict pass policies. In addition, pipeline-generated questions should undergo brief pre-deployment validation focused on clarity, uniqueness of the intended solution, and the absence of unintended shortcuts that reduce discrimination. Operationally, the grader is most useful for triage and feedback: responses near the pass threshold or exhibiting apparent mismatch with the reference solution should be prioritized for instructor review. Finally, instructors should monitor severe false-fail regimes (e.g., credit denial) as a separate risk indicator alongside aggregate score agreement, and adjust prompts, review rules, or workflow safeguards when such regimes exceed an acceptable level.
4.6. Future Work
Several concrete directions follow from these results. A first priority is threshold-focused calibration aimed at reducing false-fail severity without introducing an unacceptable increase in false passes. This can include learning a simple monotone mapping from grader scores to instructor scores, estimating a condition-specific offset, or optimizing a decision threshold under explicit constraints on false-fail rates. Because the observed risks are policy-conditional, calibration should be evaluated directly under the intended grading policy rather than only through score-level error reduction.
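The constrained-threshold idea can be sketched as a one-dimensional search: among the grader-side cutoffs, pick the lowest one whose false-pass rate stays under an explicit cap, which by construction minimizes false fails subject to that constraint. All data and the 5% cap are hypothetical:

```python
PASS = 9  # instructor-side pass criterion: pass if score >= 9

def calibrate_threshold(instructor, grader, max_false_pass_rate=0.05,
                        inst_threshold=PASS):
    """Lowest grader-side cutoff whose false-pass rate is within the cap.

    Candidates are scanned in ascending order, so the first feasible
    cutoff is the one that denies credit least often.
    """
    inst_pass = [i >= inst_threshold for i in instructor]
    fails = sum(1 for p in inst_pass if not p)
    for t in sorted(set(grader)):
        fp = sum(1 for p, g in zip(inst_pass, grader) if (not p) and g >= t)
        fpr = fp / fails if fails else 0.0
        if fpr <= max_false_pass_rate:
            return t
    return None  # no feasible cutoff under the cap

# Hypothetical scores from a conservatively biased grader. At the raw
# cutoff of 9, four of the five instructor passes would be false fails;
# the calibrated cutoff of 7 removes them without admitting false passes.
instructor = [10, 9, 9, 8, 7, 10, 9, 8]
grader     = [ 8, 7, 8, 6, 5,  9, 8, 6]
t_star = calibrate_threshold(instructor, grader)
```

As the text notes, such a calibration must be validated under the intended decision policy, since the feasible cutoff is itself policy- and condition-dependent.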
Second, the mechanisms behind severe failures should be localized. For cases where the instructor assigns a perfect score but the grader produces a failing outcome, targeted error analysis can determine whether the discrepancy is driven by rubric misinterpretation, missing required elements, over-penalization of style or brevity, or sensitivity to answer format. This kind of diagnosis can then inform prompt design, rubric encoding, or structured scoring strategies that reduce reliance on superficial cues and improve alignment with instructor criteria.
Third, robustness evaluation should be expanded beyond the current HUMAN/PIPELINE split to cover additional, deployment-relevant shifts. Examples include varying answer length distributions, introducing paraphrase and formatting perturbations, mixing levels of student proficiency, or including answers produced under different tool-use regimes. The goal is to identify which shifts cause calibration drift and to validate mitigation strategies under worst-case conditions rather than only on average.
Fourth, the role of aggregation should be investigated. While single-judge grading is operationally appealing, small ensembles of independent grading runs or multi-agent panels may reduce variance and mitigate systematic conservatism, particularly near the pass boundary. Future work can quantify the cost–reliability trade-off of such aggregation strategies and test whether they reduce severe false fails on instructor-perfect responses.
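A minimal form of the aggregation idea is to take the median of several independent grading runs before thresholding, so a single conservative outlier run cannot flip a decision on its own. The run scores are hypothetical:

```python
import statistics

PASS = 9  # pass if score >= 9

def panel_decision(run_scores, threshold=PASS):
    """Aggregate independent grading runs by their median score, then
    threshold once; a simple variance-reduction sketch."""
    return statistics.median(run_scores) >= threshold

# With runs [9, 10, 7], the median is 9, so one conservative outlier
# run no longer denies credit; a single run scoring 7 would have.
```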
Finally, extending the decision analysis beyond a single threshold to multi-level grade bands would better match many real grading policies. Evaluating stability across multiple cut points (e.g., fail/pass, pass/excellent) and reporting severity-aware errors for each band would provide a more complete characterization of the grader’s operational risk profile and improve the transferability of the methodology to other courses and assessment settings.
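Extending the analysis to multiple cut points amounts to mapping scores into band indices and counting band flips rather than a single pass/fail flip. The second cut point below (an "excellent" band at 10) is a hypothetical illustration:

```python
PASS = 9
EXCELLENT = 10  # hypothetical second cut point for an 'excellent' band

def band(score, cuts=(PASS, EXCELLENT)):
    """Map a 1-10 score to a band index: 0=fail, 1=pass, 2=excellent."""
    return sum(score >= c for c in cuts)

def band_flip_rate(instructor, grader, cuts=(PASS, EXCELLENT)):
    """Fraction of items whose grade band differs between graders."""
    flips = sum(band(i, cuts) != band(g, cuts)
                for i, g in zip(instructor, grader))
    return flips / len(instructor)

# Hypothetical scores: three of five items land in a different band,
# even though only one of them crosses the pass/fail boundary.
instructor = [10, 9, 8, 10, 9]
grader     = [ 9, 9, 8,  8, 8]
r = band_flip_rate(instructor, grader)
```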
In addition, future work should expand the set of evaluated student models to include a broader range of LLMs, including strong proprietary systems (e.g., GPT-4, Claude, Gemini), to test the stability of the observed effects across model families and capability levels.