Article

EvalCouncil: A Committee-Based LLM Framework for Reliable and Unbiased Automated Grading

by Catalin Anghel 1, Marian Viorel Craciun 1, Andreea Alexandra Anghel 2, Adina Cocu 1,*, Antonio Stefan Balau 2, Constantin Adrian Andrei 3,4, Calina Maier 3,5, Serban Dragosloveanu 3,4,*, Dana-Georgiana Nedelea 3,4 and Cristian Scheau 3,6
1 Department of Computer Science and Information Technology, “Dunărea de Jos” University of Galati, Științei St. 2, 800146 Galati, Romania
2 Faculty of Automation, Computer Science, Electrical and Electronic Engineering, “Dunărea de Jos” University of Galati, 800008 Galati, Romania
3 Faculty of Medicine, The “Carol Davila” University of Medicine and Pharmacy, 050474 Bucharest, Romania
4 Department of Orthopaedics, “Foisor” Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania
5 Panait Sirbu Obstetrics and Gynaecology Hospital Bucharest, 060251 Bucharest, Romania
6 Department of Radiology and Medical Imaging, “Foisor” Clinical Hospital of Orthopaedics, Traumatology and Osteoarticular TB, 021382 Bucharest, Romania
* Authors to whom correspondence should be addressed.
Computers 2025, 14(12), 530; https://doi.org/10.3390/computers14120530
Submission received: 7 November 2025 / Revised: 28 November 2025 / Accepted: 2 December 2025 / Published: 3 December 2025
(This article belongs to the Section AI-Driven Innovations)

Abstract

Large Language Models (LLMs) are increasingly used for rubric-based assessment, yet reliability is limited by instability, bias, and weak diagnostics. We present EvalCouncil, a committee-and-chief framework for rubric-guided grading with auditable traces and a human adjudication baseline. Our objectives are to (i) characterize domain structure in Human–LLM alignment, (ii) assess robustness to concordance tolerance and panel composition, and (iii) derive a domain-adaptive audit policy grounded in dispersion and chief–panel differences. Authentic student responses from two domains–Computer Networks (CNs) and Machine Learning (ML)–are graded by multiple heterogeneous LLM evaluators using identical rubric prompts. A designated chief arbitrator operates within a tolerance band and issues the final grade. We quantify within-panel dispersion via MPAD (mean pairwise absolute deviation), measure chief–panel concordance (e.g., absolute error and bias), and compute Human–LLM deviation. Robustness is examined by sweeping the tolerance and performing leave-one-out perturbations of panel composition. All outputs and reasoning traces are stored in a graph database for full provenance. Human–LLM alignment exhibits systematic domain dependence: ML shows tighter central tendency and shorter upper tails, whereas CN displays broader dispersion with heavier upper tails and larger extreme spreads. Disagreement increases with item difficulty as captured by MPAD, concentrating misalignment on a relatively small subset of items. These patterns are stable to tolerance variation and single-grader removals. The signals support a practical triage policy: accept low-dispersion, small-gap items; apply a brief check to borderline cases; and adjudicate high-dispersion or large-gap items with targeted rubric clarification. EvalCouncil instantiates a committee-and-chief, rubric-guided grading workflow with committee arbitration, a human adjudication baseline, and graph-based auditability in a real classroom deployment. By linking domain-aware dispersion (MPAD), a policy tolerance dial, and chief–panel discrepancy, the study shows how these elements can be combined into a replicable, auditable, and capacity-aware approach for organizing LLM-assisted grading and identifying instability and systematic misalignment, while maintaining pedagogical interpretability.

1. Introduction

Large Language Models (LLMs), such as GPT-5 [1], Llama [2], or Mistral [3], have emerged as fundamental tools in natural language processing, advancing tasks like summarization, question answering, and educational assessment. Despite their rapid adoption, the evaluation of LLMs–particularly in instructional contexts–remains challenging because traditional automatic metrics such as BLEU [4], ROUGE [5], or METEOR [6] provide limited diagnostic insight into the quality of reasoning. Pairwise preference scoring, increasingly used in public leaderboards, often amplifies superficial biases and provides little insight into reasoning quality [7,8]. As a result, rubric-based evaluation has gained traction as a more interpretable and pedagogically meaningful alternative, emphasizing dimensions such as accuracy, clarity, completeness, and the quality of explanation [9,10,11].

1.1. Background and Problem Statement

Large language models (LLMs) are being increasingly explored for automated assessment due to their capacity to deliver context-aware judgments at scale. Their flexibility makes them attractive for grading both multiple-choice and open-ended tasks, while also offering potential benefits in terms of efficiency and scalability [12]. Nevertheless, the adoption of LLMs in high-stakes instructional contexts is constrained by two critical issues–instability and bias–which undermine trust and reproducibility [7,8,11].
Beyond evaluation pipelines, studies in software engineering education have examined how LLM-based tools reshape student modelling workflows in UML tasks and how tree-based machine learning can support data-driven, defect-prediction-based assessment in programming courses [13,14].
Instability refers to the lack of consistency in grading outcomes. Identical student responses can receive divergent scores across different runs due to prompt sensitivity, randomness, or even minor formatting differences. Such variability complicates reproducibility and raises concerns for educators who require transparent and defensible evaluation protocols [7,11]. The challenge becomes particularly acute when small changes in prompts shift evaluations significantly, leaving uncertainty over whether scores accurately reflect actual student performance or incidental model behavior.
Bias represents an equally significant challenge. Evaluations can be influenced by superficial attributes such as fluency, verbosity, or stylistic conventions rather than substantive correctness. Prior work has shown that pairwise preference judgments–widely used in public leaderboards–tend to amplify these biases while overlooking deeper dimensions such as factual accuracy and reasoning quality [8,11]. In educational contexts, this raises ethical concerns, as students who rely on non-standard dialects or come from underrepresented backgrounds may be unfairly penalized, thereby exacerbating existing inequities.
Although pairwise comparison methods have gained popularity for their simplicity and scalability, they provide limited diagnostic insight. A global preference label does not reveal which dimensions of a response are strong or weak–such as factuality, completeness, or clarity–and therefore cannot serve as a reliable pedagogical tool [8,11]. This lack of interpretability limits their usefulness in supporting student learning and feedback.
To overcome these shortcomings, researchers emphasize the importance of rubric-based, multi-criteria, and explanation-aware frameworks. Studies in explainable AI highlight the need to evaluate the clarity, fidelity, and usefulness of model explanations rather than relying on surface-level similarity [15,16]. Building on this perspective, committee-based and multi-model approaches have emerged as promising solutions. In particular, multi-model dialectical evaluation frameworks improve robustness, interpretability, and transparency by aggregating diverse judgments and explicitly arbitrating disagreements [17]. At the same time, recent work highlights that diagnosing bias and instability necessitates meta-evaluators that produce auditable reasoning traces and achieve stable, reproducible outcomes [7]. These insights collectively motivate the development of a committee-and-arbiter architecture, such as the one proposed in this study.
To provide an interpretable and verifiable baseline for automated scoring, we include a human adjudication condition. Two independent Human Chief Evaluators applied a common 1–10 rubric to all responses in both Computer Networking (CN) and Machine Learning (ML) and issued one final grade per item. This human baseline enables principled comparison with the LLM chief and supports the analysis of chief–panel bias and instability in our committee-and-arbiter pipeline [18,19,20].

1.2. Related Work

One major line of research in LLM evaluation has centered on pairwise preference methods, which ask annotators–or increasingly, other LLMs–to judge which of two responses is superior. This approach has become central to leaderboards such as MT-Bench and Chatbot Arena [21]. Its main advantages are simplicity and scalability, which explain its widespread adoption by both academia and industry. However, recent studies highlight its limitations: it provides only a binary preference without diagnostic detail, and it tends to reward superficial fluency rather than deeper reasoning quality [8,22].
These limitations are compounded by the problem of instability. Empirical analyses have demonstrated that minor adjustments in prompts, evaluation seeds, or formatting can substantially impact preference outcomes [10]. Such variability undermines reproducibility and poses a serious challenge for deploying LLMs in educational contexts, where reliability is non-negotiable. Furthermore, instability can cascade in benchmarking scenarios: when comparative judgments are inconsistent, model rankings become volatile and difficult to interpret [7].
Closely related are concerns about bias. Pairwise judgments are often swayed by stylistic factors such as verbosity, politeness, or the use of formal language. Comparative assessments have been shown to amplify pre-existing biases [23,24], raising ethical concerns when such methods are used to grade student work or inform high-stakes decisions [8].
Recognizing these weaknesses, researchers have proposed rubric-based and multi-dimensional frameworks that offer more transparency. For instance, LLM-RUBRIC introduces calibrated criteria such as accuracy, completeness, and conciseness, enabling evaluations that align more closely with human graders [9]. Other approaches, such as SedarEval [25] and rubric-relation analyses in education [26,27], demonstrate that structured rubrics provide interpretable and pedagogically meaningful evaluation methods, although they can still be sensitive to perturbations [10,28].
In parallel, research in explainable AI (XAI) emphasizes that evaluation should also target the quality of explanations provided by models. Reviews underline the importance of clarity, fidelity, and usefulness as key dimensions of trustworthy AI systems [15,16,29,30,31]. Incorporating these criteria into LLM assessment provides richer insights for developers and educators alike, bridging the gap between automated scoring and human expectations for interpretability.
Complementing multi-model approaches, rubric-guided, multi-rater human evaluation with explicit inter-rater analysis provides a reference standard for calibrating LLM-mediated assessment [18,19,20]. It offers interpretable, criterion-level judgments and auditable agreement measures, enabling principled checks of alignment, bias, and stability. Accordingly, we include a human adjudication baseline.
Finally, recent work has explored multi-model and committee-based frameworks as a promising alternative to single-model graders. Dialectical and multi-agent evaluation frameworks demonstrate improvements in robustness and transparency through explicit arbitration [17]. Systematic reviews of automated assessment confirm the rising interest in such approaches [32]. By aggregating diverse evaluators and explicitly resolving disagreements, committee-based systems offer more stable and auditable outcomes. These developments lay the foundation for architectures like EvalCouncil, which extend the logic of rubric-based and dialectical evaluation into a scalable, committee-and-arbiter framework.

1.3. Research Gap and Contributions

Research on LLM evaluation has advanced along multiple directions, yet persistent limitations remain. Rather than relying on pairwise leaderboards such as MT-Bench and Chatbot Arena, we adopt rubric-aligned evaluation, which reports criterion-level evidence on accuracy, clarity, completeness, and explanation quality [8,21,33]. Moreover, empirical studies have shown that such evaluations are unstable: small changes in prompts or evaluation seeds can alter comparative outcomes and undermine reproducibility [10,29]. Concerns about bias further complicate their use, as stylistic attributes such as verbosity or politeness can distort judgments and raise concerns about fairness [8,23,24].
Rubric-based frameworks represent a more interpretable alternative, providing multi-criteria feedback aligned with human graders. Approaches such as LLM-RUBRIC [9] and SedarEval [25] introduce calibrated dimensions including accuracy, completeness, and conciseness, which enable richer diagnostic insights. Recent studies have also examined rubric interdependence in educational contexts, highlighting their pedagogical value [26,27]. We therefore adopt a perturbation-aware protocol that quantifies rubric stability under prompt and grader–panel variation, reporting robustness alongside scores [10,28].
Parallel developments in explainable AI emphasize the need to assess not only outcomes but also the quality of model explanations. Surveys underline clarity, fidelity, and usefulness as core dimensions for trustworthy evaluation [15,16,29,30,31]. Yet, these insights are rarely operationalized in LLM grading pipelines, leaving a gap between the theory of XAI and its systematic application to educational assessment.
Committee-based and multi-agent approaches have recently emerged as a response to instability and bias, leveraging diversity among evaluators to improve robustness and transparency. Dialectical frameworks and systematic reviews suggest that arbitration among multiple models can deliver more stable outcomes [17,32]. However, to our knowledge, existing work has only partially combined rubric-driven assessment, explanation quality, and committee-based arbitration, and has rarely investigated them as a single, instrumented framework in realistic educational deployments.
This work presents EvalCouncil as a committee-and-chief evaluation framework. It combines rubric-based grading, multi-evaluator committees, chief arbitration, and graph-based logging into a single, instrumented pipeline for grading student responses with large language models. The framework is designed to turn these elements into a coherent workflow and to study their behavior in practice, with a focus on reliability, stability, fairness, and interpretability in realistic educational deployments. Its contribution is articulated along several complementary dimensions.
First, EvalCouncil introduces a committee of evaluators, composed of multiple heterogeneous LLMs. Each model independently grades the same response, thereby generating a spectrum of perspectives rather than relying on a single judgment. This multiplicity reduces the dominance of model-specific biases, ensuring that evaluation outcomes are less sensitive to random fluctuations or idiosyncratic prompt effects.
Second, EvalCouncil incorporates a chief arbiter, a dedicated component that synthesizes the committee’s evaluations. The arbiter does not merely average scores; it arbitrates disagreements, applies tolerance thresholds, and issues a final decision. In our implementation, the arbiter is instantiated as an LLM chief, and we also report a human adjudication baseline using two independent Human Chief Evaluators, who employ the same 1–10 rubric to calibrate the chief’s decisions and analyze alignment, chief–panel bias, and stability.
Third, the framework employs rubric-based multi-criteria evaluation. Rather than issuing opaque overall scores, EvalCouncil grounds its assessments in explicit rubrics that capture key educational dimensions such as accuracy, clarity, completeness, and terminology. This approach enhances interpretability, aligns automated evaluation with human grading practices, and provides actionable feedback for both students and educators.
Fourth, EvalCouncil is explanation-aware. In addition to scoring student responses, it evaluates the quality of the reasoning generated by the models, considering clarity, fidelity, and pedagogical usefulness. This integration connects the field of explainable AI with automated grading, ensuring that evaluations are not only numerically consistent but also cognitively meaningful.
Finally, EvalCouncil emphasizes auditability and transparency. Every evaluation step, from individual committee judgments to the arbiter’s decision, is stored together with reasoning traces. All evaluation artifacts and reasoning traces are stored in the graph database Neo4j [34], enabling lineage queries, reproducibility checks, and exact replay. These records can be inspected, replicated, and validated, allowing educators, researchers, and policymakers to audit the process and build trust in the outcomes.
Taken together, these contributions position EvalCouncil as a pragmatic, domain-specific framework for LLM-assisted grading in higher education. It is a unified architecture that integrates committee-based diversity, arbitration, rubric-based interpretability, explanation-awareness, and traceable decision-making into a single scalable framework. This framework is deployed in two technical university courses. By doing so, it provides a reliable and pedagogically aligned alternative to existing evaluation methods within these settings and offers a structured template that instructors can adapt and extend to related educational contexts.

2. Materials and Methods

This section presents the methodological foundation of our study. We first describe the datasets used in the experiments–authentic student responses collected from summative assessments in computer networking and from coursework in machine learning–and summarize the task categories associated with each dataset. We then detail how responses were structured, annotated with ground truth, and processed within the EvalCouncil pipeline. Finally, we explain the overall evaluation setup, including the role of the committee of evaluators, the chief arbiter, and the rubric-based assessment procedures.

2.1. Dataset Description

The experimental setup relied on authentic student responses collected from two educational contexts: computer networking (CN) and machine learning (ML). By combining these sources, the study ensured both domain diversity and a wide spectrum of response correctness, ranging from fully accurate to partially correct and incorrect answers. This variability was essential for testing the robustness and interpretability of evaluation frameworks.

2.1.1. Computer Networking Dataset (CN)

We collected responses from 26 students to 8 questions in computer networking examinations. The dataset covered several distinct categories of tasks:
  • Classification tasks: students had to assign communication factors (e.g., message size, complexity, or quality of the route) to internal or external categories.
  • Numerical tasks: for example, calculating the subnet address of an IPv4 host given its address and subnet mask.
  • Representation tasks: compressing IPv6 addresses into their canonical short form.
  • Conceptual verification tasks: verifying networking principles, such as whether a host encapsulates an IP packet with the MAC address of the default gateway when communicating outside its subnet.
  • Matching tasks: associating protocols such as SSH, DHCP, and DNS with their correct functions.
Each question was paired with an official ground truth solution. This allowed us to identify fully correct, partially correct, and incorrect answers, reflecting both conceptual understanding and common misconceptions among students.

2.1.2. Machine Learning Dataset (ML)

We also collected responses from 22 students to 10 open-ended questions in a machine learning course. The dataset was balanced across two categories of tasks:
  • Technical items (5 in total): requiring precise definitions or problem-solving steps, such as explaining version control systems, distinguishing narrow from general AI, applying Naive Bayes classification, comparing search algorithms, or discussing the use of facial recognition in security.
  • Argumentative items (5 in total): requiring reasoned explanations and ethical considerations, such as debating the use of facial recognition in public spaces, comparing platforms like GitHub and Moodle, discussing AI and ethics, analyzing personalization algorithms, or reflecting on the validity of the Turing Test.
Each item was provided with an instructor-defined reference answer, allowing for a systematic comparison of student responses. The dataset captured a wide range of reasoning quality, from concise and incomplete answers to well-structured and comprehensive explanations.

2.1.3. Integration for Evaluation

Both datasets are integrated into the EvalCouncil pipeline to ensure consistent handling and auditability. Table 1 summarizes the datasets and the task categories used in the study, reporting the counts of items and responses. Each item is associated with its dataset-specific task category. For Computer Networking, the categories are classification, numerical, representation, conceptual verification, and matching. For Machine Learning, the categories are technical and argumentative. Each item is linked to a reference answer and to the rubric criteria that guide grading.
Student responses are normalized to a canonical format. Personal identifiers are removed where present, technical notation is preserved, and each response is assigned a stable identifier allowing it to be traced across the pipeline.
Evaluation inputs are built from a single prompt template that injects the item text, the reference answer, and the rubric criteria. The same template is used for all committee models to limit prompt-induced variance. Model outputs are collected and reconciled by the chief arbiter, who adjudicates disagreements under predefined tolerance thresholds and issues the final grade and rationale. The full prompt templates used for LLM grading are available in the public repository (file prompts_evaluator.py).
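As a minimal illustration of this templating step, the sketch below shows how a single shared template could be filled per response; the function and field names are hypothetical, and the authoritative template text is the one in prompts_evaluator.py.

```python
# Minimal sketch of the shared prompt-construction step (hypothetical names;
# the real template text is defined in prompts_evaluator.py).
EVALUATOR_TEMPLATE = """You are a strict grader. Apply the rubric below.
Question:
{question}

Reference answer:
{reference}

Rubric criteria (score each 1-10):
{criteria}

Student response:
{response}

Return ONLY a JSON object with one integer score per criterion,
a "final_score" (1-10), and a short "reasoning" string."""


def build_evaluator_prompt(question: str, reference: str,
                           criteria: list[str], response: str) -> str:
    """Fill the single shared template so every committee model sees
    exactly the same instructions, rubric, and reference material."""
    return EVALUATOR_TEMPLATE.format(
        question=question.strip(),
        reference=reference.strip(),
        criteria="\n".join(f"- {c}" for c in criteria),
        response=response.strip(),
    )
```

Because the same rendered string is sent to every committee model, any residual score variance can be attributed to the models rather than to prompt differences.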
All artifacts are stored in the graph database Neo4j with explicit provenance links. The stored entities include items, responses, rubrics, committee judgments, rationales, and arbiter decisions. This structure enables lineage queries, reproducibility checks, and exact replay. Summary tables for subsequent analysis are generated from the graph, preserving links back to the underlying nodes.

2.2. Evaluation Pipeline

The evaluation pipeline implemented in EvalCouncil was designed to ensure robustness, transparency, and pedagogical relevance in grading student responses. The workflow begins with the collection of a student’s answer, which is paired with the original question and ground-truth solution. This package is then passed to the evaluation committee, where four independent LLMs apply the same rubric and produce criterion-based scores accompanied by reasoning. Their outputs are aggregated and adjudicated by the chief arbiter, who synthesizes consensus or resolves disagreements to deliver a stable final grade, as illustrated in Figure 1. The outcome is expressed in a structured JSON format, ensuring machine-readability and comparability across runs. Finally, all intermediate evaluations, arbitration steps, and final decisions are persistently stored in a Neo4j graph database, which encodes the relationships among runs, items, responses, scores, and decisions. This graph structure captures the full evaluation lineage across prompts, rubrics, graders, and scores, supporting systematic post hoc analysis and facilitating reproducibility across experiments [35].
Five open-source LLMs were deployed locally using the Ollama runtime, ensuring full control over inference parameters. The evaluation committee consisted of Mistral-7B-Instruct [36], Gemma-7B-Instruct [37], Zephyr-7B-Beta [38], and OpenHermes [39]. Arbitration was handled by a designated chief arbiter model, LLaMA3-Instruct [40]. All models received the same structured prompt, which contained the rubric, question, and student answer, and were required to return output in a JSON-only format.
To ensure fairness and consistency, all models were configured with identical generation parameters. The temperature was fixed at 0.0 to enforce deterministic output, and the context length was set to 8192 tokens to accommodate full prompts without truncation [41]. Inference was constrained to single-pass execution without self-correction loops, memory mechanisms, or external access. These restrictions ensured that variability across outputs reflected model differences rather than stochastic sampling or dynamic context management. In larger-scale deployments that rely on commercial LLMs or non-zero temperature settings, additional safeguards such as averaging scores across multiple stochastic runs or using self-consistency schemes would be required to mitigate sampling-induced instability. The models used in the evaluation pipeline are summarized in Table 2.
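The sketch below illustrates how such a deterministic, single-pass call could be issued through Ollama's documented local REST interface; it is an assumption-labeled example rather than the study's actual invocation code.

```python
import json
import requests

# Sketch of one committee call through the local Ollama REST API
# (endpoint and option names follow Ollama's /api/chat interface;
# the exact invocation used in the study is not reproduced here).
OLLAMA_URL = "http://localhost:11434/api/chat"


def grade_once(model: str, prompt: str) -> dict:
    """Single-pass, deterministic grading call: temperature 0.0,
    8192-token context, JSON-only output, no streaming."""
    payload = {
        "model": model,                      # e.g. "mistral:7b-instruct"
        "messages": [{"role": "user", "content": prompt}],
        "format": "json",                    # force a JSON-only reply
        "stream": False,
        "options": {"temperature": 0.0, "num_ctx": 8192},
    }
    reply = requests.post(OLLAMA_URL, json=payload, timeout=300)
    reply.raise_for_status()
    return json.loads(reply.json()["message"]["content"])
```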
A detailed description of the pipeline components follows, starting with the committee of evaluators that forms the first layer of assessment before arbitration and final scoring.

2.2.1. Committee of Evaluators

Each student’s response was graded independently by a committee of four large language models (LLMs). All evaluators received the same prompt, rubric, and reference information, ensuring consistency in task framing and eliminating opportunities for prompt leakage or differential task interpretation. The committee was intentionally designed to introduce diversity of perspectives while preserving identical evaluation conditions, allowing the system to capture the natural variance across models when applying the same rubric.
This multi-evaluator design served two methodological purposes. First, it mitigated the influence of stochastic variation and single-model bias: different LLMs may produce slightly different judgments even under identical conditions; however, by combining multiple voices, the system achieves greater robustness [17,42]. Second, it allowed responses to be characterized not only by a single grade but also by a distribution of scores, which is valuable for analyzing uncertainty, disagreement, and evaluator-specific tendencies [11,24].
For multiple-choice items, additional hard constraints were enforced to guarantee fairness and correctness. If the student’s selection did not match the ground truth solution, all committee scores were automatically clamped to a maximum of 2, regardless of other rubric dimensions. This strict rule ensured that factual correctness always dominated over stylistic clarity or proper terminology, preventing evaluators from rewarding otherwise incorrect answers.
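A minimal sketch of this clamping rule is shown below; the function name and the string comparison used to detect a mismatch are illustrative assumptions rather than the pipeline's actual implementation.

```python
def clamp_multiple_choice(scores: dict[str, int],
                          student_choice: str,
                          correct_choice: str,
                          cap: int = 2) -> dict[str, int]:
    """Hard constraint for multiple-choice items: if the selected option
    does not match the ground truth, every rubric score is capped at 2
    so stylistic criteria cannot outweigh factual incorrectness."""
    if student_choice.strip().lower() == correct_choice.strip().lower():
        return scores
    return {criterion: min(value, cap) for criterion, value in scores.items()}
```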

2.2.2. Chief Evaluator

Once the committee evaluations were completed, the final grade was determined by a chief arbiter model. The chief arbiter was not merely a passive aggregator but an active decision-maker with two complementary responsibilities.
First, it conducted a self-evaluation of the student response using the same rubric framework as the committee. This independent judgment served as a stabilizing reference point offsetting variability in panel results.
Second, the chief evaluator performed arbitration across the four panel scores. If at least three evaluators converged within a tolerance band of ±1 point, the arbiter adopted their consensus through the majority method. When no such majority could be established, it defaulted to the arbitration_with_self method, combining its own self-evaluation with the distribution of panel scores to justify a single final grade. This ensured that sharp disagreements or outlier judgments did not destabilize the outcome, while still grounding the decision in the broader evidence provided by the committee.
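The sketch below illustrates one simple reading of this arbitration rule; the pairwise clustering test for the ±1 band, the aggregation of the majority cluster by its mean, and the equal weighting of self-evaluation and panel mean in the fallback are assumptions made for illustration, not the exact implementation.

```python
from statistics import mean


def chief_decision(panel_scores: list[float],
                   chief_self_score: float,
                   tolerance: float = 1.0) -> tuple[float, str]:
    """Sketch of the arbitration step: adopt the consensus when at least
    three panel scores lie within the tolerance band of one another
    ("majority"); otherwise combine the chief's own evaluation with the
    panel distribution ("arbitration_with_self")."""
    for anchor in panel_scores:
        cluster = [s for s in panel_scores if abs(s - anchor) <= tolerance]
        if len(cluster) >= 3:
            return round(mean(cluster), 1), "majority"
    # No majority: blend the chief's self-evaluation with the panel mean
    # (equal weights are an illustrative choice).
    blended = 0.5 * chief_self_score + 0.5 * mean(panel_scores)
    return round(blended, 1), "arbitration_with_self"
```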
The output of the chief arbiter included multiple layers of information: the self-evaluation, the list of panel scores, the arbitration method applied, the final rubric-based grade, and a concise justification in natural language. All these components were stored in the Neo4j database as ChiefDecision nodes, explicitly linked to both the corresponding item and the underlying panel evaluations. This graph-based representation records the complete evaluation lineage leading to each final grade and supports pedagogical feedback: students can review not only their score but also the underlying reasoning and areas of consensus or disagreement, turning automated grading into a formative learning opportunity.

2.2.3. Rubric-Based Evaluation

The grading process in EvalCouncil was driven by explicit rubrics that specified evaluation criteria, scale, and expectations. Rubrics provided a structured framework to guide both evaluators and the chief evaluator, ensuring that assessments were consistent, transparent, and pedagogically meaningful.
Two rubric families were applied depending on the task type [17]. Technical tasks were assessed in terms of accuracy, reflecting the correctness of the response with respect to core content; clarity, indicating the precision and syntactic coherence of the language; completeness, measuring the extent to which all relevant aspects of the task were addressed; and terminology, capturing the appropriateness and precision of domain-specific vocabulary. In contrast, argumentative tasks emphasized clarity as comprehensibility of the argument, coherence as the logical flow and structure of ideas, originality as the degree of novelty in reasoning, and dialecticality as the capacity to engage with counterarguments or multiple perspectives. Each criterion was scored on a 1–10 scale, and evaluators were required to provide concise textual reasoning aligned with these dimensions.
An overview of the rubric dimensions applied across task types is presented in Table 3, which contrasts the criteria guiding technical versus argumentative evaluations.
To prevent ambiguities in interpretation, all rubrics were rendered in a standardized machine-readable format before being supplied to the evaluators. This ensured that both committee members and the chief arbiter received identical instructions and constraints. Beyond enforcing consistency, rubric-based evaluation enabled fine-grained analysis, allowing researchers to examine not only overall grades but also compare criterion-level scores, track disagreement patterns, and study the stability of evaluations across runs.
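An illustrative machine-readable rendering of the two rubric families is given below, using the criterion names from Table 3; the concrete serialization format used in the pipeline is not specified here, so the structure shown is an assumption.

```python
# Illustrative machine-readable rendering of the two rubric families
# (criterion names from Table 3; the serialized format actually supplied
# to the evaluators may differ).
RUBRICS = {
    "technical": {
        "scale": [1, 10],
        "criteria": ["accuracy", "clarity", "completeness", "terminology"],
    },
    "argumentative": {
        "scale": [1, 10],
        "criteria": ["clarity", "coherence", "originality", "dialecticality"],
    },
}
```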

2.2.4. Traceability and Auditability

A central design principle of EvalCouncil was to ensure end-to-end provenance and auditability of the grading pipeline. To this end, all intermediate and final artifacts–including student responses, evaluator outputs, rubric-based scores, panel deliberations, and chief arbitration–were persistently stored in a Neo4j graph database. Each stage of the workflow was represented as a node, while directed relationships captured the dependencies between them, transforming the evaluation process into a transparent and queryable structure.
The graph model encoded runs, items, responses, scores, panel evaluations, panel members, and chief decisions as distinct entities. Relationships such as HAS_RESPONSE, HAS_SCORE, HAS_MEMBER, and HAS_DECISION specified how results flowed from individual evaluators to the committee level and finally to the chief arbiter. This design enabled researchers and instructors to reconstruct the provenance of any grade: which evaluators contributed, how they reasoned, where disagreements arose, and how the final decision was reached. Figure 2 illustrates the EvalCouncil data model implemented in Neo4j.
Beyond methodological transparency, graph-based persistence facilitated systematic post hoc analysis. It allowed queries over distributions of panel scores, identification of systematic evaluator biases, and investigation of disagreement patterns across task types. Instructors can, for example, list all responses where the chief–panel gap exceeds a chosen threshold to build a short queue of grades for manual review, or retrieve items with persistently high MPAD across runs to prioritize rubric refinement. Auditors can retrieve, for any disputed grade, the full chain from item and student response through individual evaluator scores and rationales to the chief’s decision using a single path query over the graph. Because both raw model outputs and normalized JSON evaluations were stored, the system supports accountability and reproducibility, enabling experiments to be revisited and audited in detail. From a pedagogical perspective, this traceability also enriches feedback: students can be provided not only with their final grade but also with the reasoning and consensus dynamics that produced it.
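As an assumption-labeled sketch of such an audit query, the snippet below uses the official Neo4j Python driver to retrieve responses whose chief–panel gap exceeds a chosen threshold; node labels and relationship names follow the data-model description above, while property names such as final_grade and panel_mean are hypothetical.

```python
from neo4j import GraphDatabase

# Node labels and relationships follow the described data model (Figure 2);
# property names such as final_grade and panel_mean are assumptions about
# the stored schema, not confirmed field names.
AUDIT_QUERY = """
MATCH (i:Item)-[:HAS_RESPONSE]->(r:Response)-[:HAS_DECISION]->(d:ChiefDecision)
WHERE abs(d.final_grade - d.panel_mean) > $gap_threshold
RETURN i.id AS item, r.id AS response,
       d.final_grade AS chief, d.panel_mean AS panel,
       abs(d.final_grade - d.panel_mean) AS gap
ORDER BY gap DESC
"""


def flag_large_gaps(uri: str, user: str, password: str,
                    gap_threshold: float = 1.0) -> list[dict]:
    """Return a review queue of responses whose chief-panel gap exceeds
    the chosen threshold, ordered from largest to smallest gap."""
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            records = session.run(AUDIT_QUERY, gap_threshold=gap_threshold)
            return [dict(record) for record in records]
    finally:
        driver.close()
```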

2.3. Human Evaluation Protocol

We include a human adjudication reference that is separate from the automated pipeline and used only for comparison. In each course, the Human Chief Evaluator was the course instructor, who graded every response on eight rubric criteria using the same 1–10 scale–four technical criteria (accuracy, clarity, completeness, terminology) and four argumentative criteria (clarity, coherence, originality, dialecticality)–and also issued a single final score per response. Human decisions never feed back into the committee or the LLM chief; they serve as an external reference to examine alignment, chief–panel bias, and stability [18,43,44].
For item $i$, let the panel contain $k$ graders who produce final scores $s_{i1}, \ldots, s_{ik}$ (if $k$ varies across items, it denotes the number of available graders for that item). Let $c_i$ denote the LLM chief's final decision and let $h_{i1}$ and $h_{i2}$ be the two human final scores. We summarize the panel by the mean:
$$\bar{s}_i = \frac{1}{k} \sum_{j=1}^{k} s_{ij}$$
the chief–panel difference:
$$b_i = c_i - \bar{s}_i$$
and the panel spread:
$$d_i = \max_j s_{ij} - \min_j s_{ij}$$
Sensitivity to panel composition is assessed with a leave-one-out calculation. For each grader $j \in \{1, \ldots, k\}$, let $d_i^{(j)}$ be the spread recomputed on the $k-1$ scores that remain when grader $j$ is excluded,
$$d_i^{(j)} = \max_{m \neq j} s_{im} - \min_{m \neq j} s_{im}$$
and define the change $\Delta_{ij} = d_i^{(j)} - d_i$. Per item, we report the mean absolute change $\mathrm{mean}_j |\Delta_{ij}|$ and the maximum absolute change $\max_j |\Delta_{ij}|$.
Agreement between the human reference and the LLM chief at the final-score level is measured by the human average $\bar{h}_i = \tfrac{1}{2}(h_{i1} + h_{i2})$ and the absolute divergence:
$$\delta_i = |c_i - \bar{h}_i|$$
When criterion-level analysis is required, we index by rubric criterion $c$. The panel provides scores $s_{ij}^{(c)}$ and the two human chiefs provide $h_{i1}^{(c)}$ and $h_{i2}^{(c)}$. We use
$$\bar{s}_i^{(c)} = \frac{1}{k} \sum_{j=1}^{k} s_{ij}^{(c)}, \qquad \bar{h}_i^{(c)} = \frac{1}{2}\left(h_{i1}^{(c)} + h_{i2}^{(c)}\right)$$
Where the pipeline exposes chief quantities at the criterion level (technical rubric), we compare them accordingly; otherwise, criterion-level summaries are reported for the human reference and the panel. Domain-level summaries for CN and ML use medians, means, and the 95th percentile; where applicable, we accompany estimates with bias-corrected bootstrap 95% confidence intervals and maintain identical axes across figures.
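The per-item quantities defined above translate directly into code; the sketch below computes, for a single item, the panel mean, chief–panel difference, panel spread, leave-one-out spread changes, and the absolute Chief–human divergence.

```python
def panel_summaries(scores: list[float], chief: float,
                    humans: tuple[float, float]) -> dict:
    """Per-item quantities as defined in Section 2.3: panel mean, chief-panel
    difference b_i, spread d_i, leave-one-out spread changes, and the
    absolute Chief-human divergence delta_i."""
    k = len(scores)
    panel_mean = sum(scores) / k
    bias = chief - panel_mean                          # b_i
    spread = max(scores) - min(scores)                 # d_i
    # Leave-one-out spreads d_i^(j) and their changes Delta_ij.
    deltas = []
    for j in range(k):
        rest = scores[:j] + scores[j + 1:]
        deltas.append((max(rest) - min(rest)) - spread)
    human_mean = sum(humans) / 2.0                     # h-bar_i
    return {
        "panel_mean": panel_mean,
        "chief_panel_bias": bias,
        "spread": spread,
        "mean_abs_delta": sum(abs(d) for d in deltas) / k,
        "max_abs_delta": max(abs(d) for d in deltas),
        "human_llm_abs_dev": abs(chief - human_mean),  # delta_i
    }
```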

3. Results

We evaluated EvalCouncil across two domains under a uniform committee-and-chief pipeline. The corpus comprised 8 items and 26 students in Computer Networking and 10 items and 22 students in Machine Learning, yielding 208 and 220 graded responses, respectively, for a total of 428. Items fell under two rubric families, argumentative and technical, and were scored at the criterion level by four evaluators. Judgments were consolidated by the chief into a single final grade. The sections that follow present score distributions, per-item performance, within-panel agreement, chief–panel concordance, and sensitivity checks.

3.1. Score Distributions

Grades were summarized on the original 1–10 scale after chief adjudication, using identical binning and axis limits to enable direct comparison across domains. Machine Learning displayed a compact high-score concentration, with a mean of 7.73 and a median of 8.00, indicating limited dispersion and a small lower tail. Computer Networking showed a broader distribution with an extended lower tail. The mean was 7.17 and the median was 9.00, indicating a negative skew driven by a subset of minimal answers. Figure 3 displays the two histograms on the same scale and binning, with medians indicated by solid vertical lines and means by dashed lines.
Sample sizes were 208 for Computer Networking and 220 for Machine Learning. The shared scale and common binning anchor the comparison across domains and provide the baseline for the item-level and agreement analyses that follow.

3.2. Per-Item Performance

Per-item performance summarized the distribution of final grades for each assessment item across students after chief adjudication. Distributions were visualized as box-and-whisker plots, which report the median, the interquartile range, the whiskers, and potential outliers. Identical y-axis limits were used across domains to ensure a common reference frame. Figure 4 displays the per-item grade distributions, ordered by each item's median to provide a stable and interpretable ranking independent of the question index. In the CN domain, the long lower-tailed distributions and lower medians for Q3, Q7, and Q2 reflect high rates of non-response; these items cluster near the minimum of the scale, and human graders and the chief evaluator agree on all such cases (100% agreement).
The profiles revealed heterogeneous behaviour across items in both domains. Several items clustered near the upper band with short interquartile ranges and few outliers. This pattern indicated consistent performance across students and limited ambiguity for evaluators. Other items exhibited wider boxes with longer lower tails and occasional extreme values. This pattern indicated greater difficulty and a higher incidence of incomplete or minimal answers that pulled down the lower part of the distribution.
Differences in dispersion were informative about the roles played by the rubric criteria. Items with compact boxes suggested that criteria were aligned and saturated at higher scores, which reduced evaluator discretion and stabilized judgments. Items with broad boxes and extended lower tails suggested that criteria separated students more finely and exposed uneven mastery of underlying concepts, which increased evaluator discretion and the potential for divergent views.
These item-level distributions identified where the committee’s task was straightforward and where finer judgment was required. They also provided local context for the agreement analysis reported next, as items with wide interquartile ranges and longer lower tails are more likely to induce larger pairwise differences among evaluators and to necessitate decisive arbitration by the chief.

3.3. Within-Panel Agreement

We quantify within-panel agreement on the 1–10 scale using the mean pairwise absolute difference (MPAD), computed for each response. MPAD averages the absolute differences across all unordered pairs of panel scores for the same response; lower values indicate tighter consensus, while higher values reflect greater dispersion (e.g., MPAD ≈ 1 means raters differ by about one point on average). Complementarily, we report the within-panel standard deviation, which summarizes the spread around the panel mean, and the spread, defined as the max–min difference, which captures the most divergent judgments. We use these three metrics jointly–MPAD for typical pairwise disagreement, standard deviation for overall variability, and spread for extremes–to characterize panel agreement. The sample comprises n = 208 responses in Computer Networking (CN) and n = 220 in Machine Learning (ML).
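For reference, the sketch below computes the three per-response agreement metrics exactly as defined here (MPAD over unordered pairs, within-panel standard deviation, and max–min spread); whether the study used the population or sample standard deviation is not specified, so the population version is an assumption.

```python
from itertools import combinations
from statistics import pstdev


def panel_agreement(scores: list[float]) -> dict:
    """Per-response agreement metrics: MPAD (mean absolute difference over
    all unordered pairs of panel scores), within-panel standard deviation
    (population form, assumed), and the max-min spread."""
    pairs = list(combinations(scores, 2))
    mpad = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return {
        "mpad": mpad,
        "std": pstdev(scores),
        "spread": max(scores) - min(scores),
    }


# Example: a four-member panel disagreeing by at most one point.
print(panel_agreement([8, 8, 7, 8]))  # mpad = 0.5, spread = 1
```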
Figure 5 shows vertical box-and-whisker plots of the per-response MPAD distributions by domain (CN on the left, ML on the right), using identical y-axis limits (0–5.5) for direct comparison. In both domains, most of the mass lies below MPAD = 1, indicating generally tight within-panel agreement. The central tendency in ML is marginally higher, whereas CN exhibits a heavier right tail with more high-MPAD outliers and a larger maximum, indicating rarer but more extreme disagreements on a subset of responses.
In CN, the MPAD distribution has a median of 0.50 and an interquartile range (IQR) of 0.19–0.67; the mean MPAD is 0.78, the 95th percentile is 3.67, and the maximum spread is 9. In ML, the MPAD distribution has a median of 0.67 and an IQR of 0.50–1.00; the mean MPAD is 0.84, the 95th percentile is 2.17, and the maximum spread is 8. A visible point mass at zero indicates exact agreement on a nontrivial subset of responses–about one quarter (≈25%) in CN, and present but smaller in ML. For panel-level reliability (interval), Krippendorff’s α = 0.892 in CN and α = 0.633 in ML. Taken together with Figure 5, these summaries indicate generally tight within-panel agreement in both domains, with rarer but more extreme disagreements in CN.
To quantify this difference, we also test the null hypothesis that response-level MPAD is equal across domains using a Mann–Whitney U test. Using all graded responses (CN: n = 208, ML: n = 220), we find a statistically significant shift toward higher disagreement in ML than in CN (U = 28,331, p = 1.16 × 10−5), with medians of 0.50 and 0.67, respectively. This confirms that the heavier dispersion observed graphically in ML is not just a sampling artifact but reflects a systematic cross-domain difference in response-level disagreement.
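The cross-domain comparison can be reproduced with SciPy's Mann–Whitney U implementation, as sketched below on synthetic placeholder arrays; the study's per-response MPAD values are not reproduced here.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder arrays standing in for the per-response MPAD values
# (lengths 208 and 220 in the study); the values are illustrative only.
rng = np.random.default_rng(0)
mpad_cn = rng.exponential(scale=0.7, size=208)
mpad_ml = rng.exponential(scale=0.8, size=220)

u_stat, p_value = mannwhitneyu(mpad_cn, mpad_ml, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3g}")
```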
Table 4 summarizes within-panel agreement by domain. In CN (n = 208), the MPAD median is 0.500 with an interquartile range of 0.188–0.667. The mean MPAD is 0.779, the 95th percentile is 3.667, and the maximum MPAD is 5.500. Exact agreement occurs in 25.0% of responses, and 6.3% of responses have MPAD ≥ 2. Krippendorff’s α is 0.892. In ML (n = 220), the MPAD median is 0.667 with an interquartile range of 0.500–1.000. The mean MPAD is 0.841, the 95th percentile is 2.167, and the maximum MPAD is 4.333. Exact agreement occurs in 8.6% of responses, and 6.8% of responses have MPAD ≥ 2. Krippendorff’s α is 0.633. Overall, both domains show compact within-panel variability, with CN exhibiting rarer but more extreme disagreements and ML a slightly higher central tendency.
Overall, within-panel variability is compact in both domains. ML concentrates slightly higher at the center, whereas CN exhibits a heavier upper tail with occasional high-disagreement cases. Despite these rarer extremes, panel reliability is higher in CN than in ML, indicating more consistent scoring on most responses in CN, while ML maintains steadier mid-range variability. Qualitative inspection of responses with high MPAD shows that disagreement concentrates on items with underspecified prompts, multi-step reasoning where partial credit is ambiguous, and borderline rubric descriptors between adjacent score levels. In contrast, low-MPAD items typically have tightly scoped questions with unambiguous target reasoning steps and clearly separated rubric anchors.

3.4. Chief–Panel Concordance

We assess chief–panel agreement by comparing the chief's final score with the panel mean for each response on the 1–10 scale. In CN (n = 208), the chief–panel mean absolute error is 0.813, with a 95% confidence interval of 0.569 to 1.093, and the average panel size is $\bar{k} = 4.000$. On average, the chief scores 0.386 points below the panel. Absolute differences are compact near the center, with a median of 0.250 and a 75th percentile of 0.500. The upper tail is long, with a 95th percentile of 4.238 and a maximum of 9.000. The corresponding ML values are reported in Table 5.
As shown in Figure 6, we plot the chief’s score against the panel mean on the 1–10 scale with the identity line and identical axis limits for both domains. Concordance is good in CN, with Pearson’s r = 0.767 and MAE = 0.813, and very strong in ML with r = 0.953 and MAE = 0.333. Most observations lie close to the identity line: in CN, 86.1% of responses fall within ±0.5 points and 90.4% within ±1.0 points; in ML, 90.0% fall within ±0.5 points and 98.2% within ±1.0 points. The few larger gaps visible in CN are consistent with the heavier upper tail in its MPAD distribution.
Chief–panel concordance is summarized using complementary metrics on the 1–10 scale and reported in Table 5. In CN, n = 208. Pearson r is 0.767, and Spearman ρ is 0.619. The mean absolute error is 0.813, and the RMSE is 2.085. The chief scores lower than the panel on average, with a bias of −0.386. Agreement is tight for most responses, with 86.1% within ±0.5 points and 90.4% within ±1.0 points. In ML, n = 220. Concordance is stronger, with r = 0.953 and ρ = 0.861. The MAE is 0.333 and the RMSE is 0.445. The bias is 0.071. Proximity to the panel is higher, with 90.0% within ±0.5 points and 98.2% within ±1.0 points. These results are consistent with the scatter in Figure 6 and indicate a high alignment overall, with ML near the identity line and CN exhibiting a small subset of larger divergences.
Chief–panel alignment is high in both domains, as shown by Figure 6 and Table 5. Concordance is stronger in ML, with tighter clustering near the identity line and smaller errors, whereas CN exhibits a small negative chief bias and a longer upper tail concentrated in a few responses. Overall, scores are consistent across domains, with rare but more pronounced disagreements in the CN domain.
To situate the chief configuration relative to simpler LLM baselines, Table 6 compares the best single-model grader in each domain (openhermes:latest) to the chief, both evaluated against the panel mean. In CN, the best single LLM attains lower MAE and higher within-±1.0 agreement than the chief, while in ML the two configurations are nearly indistinguishable on these metrics. These results indicate that strong single-model graders can be competitive with the committee-and-chief configuration in terms of raw alignment to the panel mean, and that EvalCouncil’s added value lies primarily in structured arbitration, tolerance-aware triage, and auditable decision traces rather than in reducing average numeric error alone.

3.5. Sensitivity Checks

We evaluate robustness along two dimensions. First, we vary tolerance bands on the 1–10 scale at ±0.5, ±1.0, and ±1.5 to determine when panel grades form a majority, defined as at least three graders whose scores fall within the band. Second, we assess the sensitivity of within-panel dispersion to panel composition by removing one grader at a time and comparing the dispersion to that of the full panel. All summaries are presented side by side for Computer Networking and Machine Learning with common axes to enable direct comparison.
As the tolerance band widens, panels coalesce quickly into a single majority, and the need for arbitration collapses. At ±0.5, a majority forms for 80.3% of CN responses and 92.7% of ML responses, indicating tighter clustering in ML at strict tolerance. Increasing the band to ±1.0 drives consolidation in both domains, with the majority share reaching 97.6% in CN and 97.3% in ML. A further increase to ±1.5 yields little additional gain in CN, which remains at 97.6%, while ML climbs to 99.1%. The complementary arbitration share falls in a mirror image. CN drops from 19.7% at ±0.5 to 2.4% at ±1.0 and stays at 2.4% at ±1.5. ML drops from 7.3% to 2.7% to 0.9%. These trajectories indicate that most disagreements occur within a single point on the 1–10 scale and that ML panels reach a majority more readily at narrower tolerances. Figure 7 displays the two curves on matched axes, CN and ML side by side for direct comparison.
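A compact sketch of this tolerance sweep is given below; the pairwise reading of "within the band" and the toy panels are illustrative assumptions.

```python
def majority_share(panels: list[list[float]], band: float) -> float:
    """Fraction of eligible responses (at least three panel grades) whose
    panel forms a majority at the given tolerance band: at least three
    graders whose scores lie within the band of one another."""
    def has_majority(scores: list[float]) -> bool:
        return any(
            sum(1 for s in scores if abs(s - anchor) <= band) >= 3
            for anchor in scores
        )
    eligible = [p for p in panels if len(p) >= 3]
    return sum(has_majority(p) for p in eligible) / len(eligible)


# Toy example swept over the three policy settings reported in the text.
example_panels = [[8, 8, 7, 8], [9, 5, 9, 9], [6, 7, 8, 10]]
for band in (0.5, 1.0, 1.5):
    print(band, majority_share(example_panels, band))
```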
The counts underpinning these curves are reported in Table 7, which tabulates, for each tolerance and domain, the number and percentage of eligible items (k ≥ 3 panel grades) determined by majority versus arbitration_with_self. As the band widens, the majority absorbs nearly all decisions, while arbitration concentrates on the narrowest band.
Chief–panel concordance is tolerance-invariant, so we report domain-level summaries in Table 8. For each domain, we include the number of items with a chief grade, the average panel size, the chief–panel mean absolute error with a 95% bootstrap confidence interval, and the signed bias defined as chief minus panel mean. Concordance is high overall. CN exhibits a larger MAE than ML, while the signed bias remains modest, being negative in CN and positive in ML.
Robustness to panel composition is assessed with leave-one-out perturbations of within-panel dispersion. For each response, $\mathrm{MPAD}_{\mathrm{full}}$ is the mean pairwise absolute difference computed over all panel grades for that response. For each grader $k$, $\mathrm{MPAD}_{-k}$ is the same dispersion after removing grader $k$'s score. The perturbation is defined as $\Delta_k = \mathrm{MPAD}_{-k} - \mathrm{MPAD}_{\mathrm{full}}$. For every response with at least three panel grades, we summarize two quantities across single-grader removals: the mean absolute change, written mean|Δ|, and the maximum absolute change, written max|Δ|. Figure 8 uses matched y-limits with CN on the left and ML on the right to enable direct comparison. Medians are small: the median of mean|Δ| is 0.250 in CN and 0.250 in ML, and the median of max|Δ| is 0.333 in CN and 0.500 in ML. Upper tails remain bounded: the 95th percentile of mean|Δ| is 0.771 in CN and 0.508 in ML, and the 95th percentile of max|Δ| is 1.483 in CN and 1.000 in ML. Worst-case shifts are limited, with max|Δ| reaching 3.333 in CN and 2.833 in ML. These results indicate that dispersion estimates are stable to the removal of a single grader and that no individual rater drives the conclusions.
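For clarity, the leave-one-out perturbation can be expressed as the short sketch below, which recomputes MPAD after removing each grader in turn.

```python
from itertools import combinations


def mpad(scores: list[float]) -> float:
    """Mean absolute difference over all unordered pairs of panel scores."""
    pairs = list(combinations(scores, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def loo_mpad_perturbations(scores: list[float]) -> tuple[float, float]:
    """mean|Delta| and max|Delta| across single-grader removals, where
    Delta_k = MPAD(panel without grader k) - MPAD(full panel). Requires at
    least three panel grades so the reduced panel still contains a pair."""
    full = mpad(scores)
    deltas = [
        abs(mpad(scores[:k] + scores[k + 1:]) - full)
        for k in range(len(scores))
    ]
    return sum(deltas) / len(deltas), max(deltas)
```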
Taken together with Figure 7, the results show that majority adoption increases monotonically with the tolerance band, the ordering of domains is preserved, and leave-one-out shifts are small and bounded, so the main findings are stable and not driven by any single grader.

3.6. Human–LLM Alignment

Human–LLM alignment is evaluated per item via the signed difference $\delta_i = c_i - \bar{h}_i$, and we assess agreement through the absolute deviation $|\delta_i|$, which captures magnitude regardless of direction. Two tolerance thresholds at 0.5 and 1.0 points provide an interpretable scale for grading practice, and identical y-axis limits are enforced across domains to support direct comparison. Figure 9 displays the domain-wise distributions of $|\delta_i|$. In CN (n = 208), the median $|\delta|$ is 1.000 and the 95th percentile is 5.500; coverage within the tolerance bands is 25.5% for $|\delta| \le 0.5$ and 53.4% for $|\delta| \le 1.0$. In ML (n = 220), the median $|\delta|$ is 0.625 and the 95th percentile is 2.394; coverage is 45.5% within 0.5 and 75.9% within 1.0. These results indicate closer alignment in ML both at the center and in the upper tail, while CN concentrates a heavier tail with fewer but more pronounced discrepancies.
We summarize Human–LLM alignment by domain using the per-item absolute deviation $|\delta_i| = |c_i - \bar{h}_i|$, expressed in grading points. Table 9 reports, for each domain, the sample size n alongside the median, mean, and 95th percentile (P95) of $|\delta|$. In CN (n = 208), median = 1.00, mean = 1.94, and P95 = 5.50; in ML (n = 220), median = 0.62, mean = 0.85, and P95 = 2.39. These results indicate tighter alignment in ML and heavier upper-tail dispersion in CN.
Table 10 presents complementary agreement metrics by domain for the LLM Chief compared to the human reference. Spearman ρ captures rank agreement. Lin's concordance correlation coefficient (CCC) captures agreement on the scale, integrating correlation and bias. We also report the within-one-point coverage $p(|\delta| \le 1.0)$ and $\mathrm{MAE}_{|\delta|}$ in points. Higher is better for ρ, CCC, and coverage; lower is better for MAE. In CN (n = 208): ρ = 0.696, CCC = 0.622, $p(|\delta| \le 1.0)$ = 0.53, $\mathrm{MAE}_{|\delta|}$ = 1.94. In ML (n = 220): ρ = 0.717, CCC = 0.766, $p(|\delta| \le 1.0)$ = 0.76, $\mathrm{MAE}_{|\delta|}$ = 0.85. The metrics indicate tighter alignment in ML and a heavier upper tail in CN.

3.7. Human–LLM Disagreement

Human–LLM disagreement is measured as the tail beyond a grading tolerance $t$. Let $\delta_i = c_i - \bar{h}_i$ be the signed difference between the LLM chief and the human reference, and quantify the magnitude by $|\delta_i|$. Disagreement at tolerance $t$ is the share of items with $|\delta_i| > t$, which is the complement of coverage. The curves in Figure 10 are plotted on identical axes across domains so that lower curves indicate tighter alignment.
Disagreement decreases monotonically with t and is largest under strict tolerance. At t = 0.5 , disagreement is 74.5% in CN and 54.5% in ML, which shows a sizable gap at half a point. At t = 1.0 , disagreement falls to 46.6% in CN and 24.1% in ML. The gap narrows as tolerance increases, and both curves approach zero near three points. Across the full range the ML curve remains below the CN curve, indicating systematically fewer large errors in ML and a heavier upper tail in CN.
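The disagreement curves can be computed as the complement of coverage over a grid of tolerances, as in the sketch below; the arrays named chief_cn and human_mean_cn are placeholders, not released data.

```python
import numpy as np


def disagreement_curve(chief: np.ndarray, human_mean: np.ndarray,
                       tolerances: np.ndarray) -> np.ndarray:
    """Share of items with |delta_i| > t for each tolerance t, i.e. the
    complement of the coverage plotted in Figure 10."""
    abs_dev = np.abs(chief - human_mean)
    return np.array([(abs_dev > t).mean() for t in tolerances])


# Illustrative usage with placeholder arrays (not the study's data):
tolerances = np.arange(0.0, 3.5, 0.5)
# curve_cn = disagreement_curve(chief_cn, human_mean_cn, tolerances)
# curve_ml = disagreement_curve(chief_ml, human_mean_ml, tolerances)
```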
We relate Human–LLM disagreement to human panel dispersion by stratifying items on human MPAD. For this analysis, we operationalize difficulty using the human MPAD, i.e., the mean pairwise absolute difference between the two human grades, and items are divided into domain-specific tertiles labeled low, mid, and high. Within each stratum, we summarize the absolute Chief–human deviation $|\delta|$ through the median and the 95th percentile, the signed bias defined as $c - \bar{h}$, and the within-one-point coverage. This design tests whether large gaps cluster on intrinsically difficult items rather than reflecting a uniform calibration error, and the results are reported in Table 11.
Results show a clear gradient in CN. One-point coverage falls from 0.67 in low-dispersion items to 0.36 in high-dispersion items, while the median δ rises from 1.00 to 1.50 and the 95th percentile remains at 5.50. The median signed-bias stays near zero across strata. In ML, the metrics are stable across difficulty, with coverage at 0.76 for all tertiles, medians at or below 0.75, and median signed-bias near zero.
Taken together, the tolerance–disagreement curves and the dispersion-stratified summaries show that Human–LLM disagreement is limited on average and concentrated in high-dispersion CN items, whereas ML maintains high coverage and small tails across difficulty strata. These patterns support the use of the Chief LLM as adjudicator under routine conditions and indicate the need for targeted audits of CN items with high human dispersion.

4. Discussion

Our findings indicate that Human–LLM alignment is domain-dependent: in Machine Learning (ML), human score distributions are tighter with shorter upper tails, whereas Computer Networks (CNs) display broader dispersion with more high-deviation cases. These contrasts remain under sensitivity analyses that vary the tolerance t (the absolute score difference below which two grades are considered concordant) and under leave-one-out perturbations of panel composition. Disagreement increases with item difficulty, captured by the mean pairwise absolute deviation (MPAD) among human graders, which concentrates adjudication on a small subset of items. Taken together, the evidence supports more stable alignment in ML and domain-specific instability in CN, with implications for rubric fidelity, workload allocation, and escalation thresholds.

4.1. Key Findings

Human–LLM alignment exhibits a clear domain dependence. In Machine Learning (ML), human score distributions are more concentrated around the center with shorter upper tails, chief–panel concordance is higher (smaller absolute deviation between the chief grade and the panel aggregate), and extreme spreads are infrequent. In contrast, Computer Networks (CNs) show broader dispersion with a heavier right tail and larger maximum spreads, indicating pockets of persistent misalignment concentrated on specific items rather than a uniform shift across the distribution.
Disagreement tracks human panel dispersion. Using the mean pairwise absolute deviation (MPAD) among human graders as a dispersion-based disagreement measure that is sensitive to difficulty and ambiguity, high-MPAD items account for a disproportionate share of Human–LLM misalignment. This concentration implies that targeted adjudication on a relatively small subset of items can deliver most of the alignment gains, while low-MPAD items rarely require escalation. The pattern also suggests practical routes for rubric improvement: where heavy tails cluster, guidance can be refined to reduce ambiguity in scoring criteria.
These contrasts are stable under reasonable design choices. Varying the tolerance t –the absolute score difference below which two grades are considered concordant–preserves the ML–CN gap in both central tendency and upper-tail behavior. Likewise, leave-one-out perturbations of panel composition do not materially alter the rankings or the qualitative effect signs: central estimates and dispersion profiles remain similar when any single grader is removed. Taken together, the evidence supports more stable alignment in ML and domain-specific instability in CN, with operational implications for rubric calibration, grader guidance, workload allocation, and adjudication thresholds.

4.2. Robustness to Tolerance and Panel Composition

Varying the concordance tolerance t over a plausible range preserves the qualitative contrast between domains. As t increases, concordance curves rise monotonically, yet the relative ordering remains stable: ML consistently exhibits higher agreement and shorter upper tails than CN. This indicates that the ML–CN gap is not an artifact of a particular threshold and that reasonable grading policies–whether stricter or more permissive–lead to the same directional conclusions. Practically, the tolerance acts as a policy dial: tighter settings reduce false concordance at the cost of more adjudications; looser settings reduce workload while preserving the cross-domain ranking of alignment.
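A tolerance sweep of this kind reduces to a small loop over candidate thresholds. The sketch below assumes a simple tabular layout (columns domain, chief, reference) rather than the project’s actual pipeline.

```python
# Sketch only (assumed tabular layout, not the project pipeline): sweep the
# concordance tolerance t and report the per-domain share of concordant items,
# i.e., items whose chief grade is within t points of the reference grade.
import numpy as np
import pandas as pd

def tolerance_sweep(items: pd.DataFrame,
                    tolerances=np.arange(0.0, 3.01, 0.25)) -> pd.DataFrame:
    """items needs columns 'domain', 'chief', 'reference' (e.g., human or panel mean)."""
    gap = (items["chief"] - items["reference"]).abs()
    rows = []
    for domain, index in items.groupby("domain").groups.items():
        domain_gap = gap.loc[index]
        for t in tolerances:
            rows.append({"domain": domain,
                         "tolerance": round(float(t), 2),
                         "share_concordant": float((domain_gap <= t).mean()),
                         "share_discordant": float((domain_gap > t).mean())})
    return pd.DataFrame(rows)
```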
Panel-composition perturbations show similar stability. Leave-one-out removal of any single grader produces small fluctuations in central estimates and dispersion, with effect signs unchanged and domain ordering intact. The absence of large swings under single-grader removals suggests that conclusions do not hinge on idiosyncratic graders and that panels near the current size deliver diminishing returns in terms of central accuracy. What does change under perturbations is the spread on a minority of contentious items, which flags these items–not the panel–as the primary source of residual instability.
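The leave-one-out perturbation itself is straightforward to reproduce. The sketch below recomputes MPAD under each single-grader removal for one item; the mpad helper is repeated so the snippet stands alone, and the grades are illustrative.

```python
# Sketch only: leave-one-out sensitivity of within-panel dispersion for a single
# item, mirroring the analysis summarised in Figure 8.
from itertools import combinations

def mpad(grades):
    """Mean pairwise absolute deviation over a panel of grades (needs >= 2)."""
    pairs = list(combinations(grades, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def loo_mpad_shift(grades):
    """Mean and max |MPAD_k - MPAD_full| over single-grader removals (needs >= 3 grades)."""
    full = mpad(grades)
    shifts = [abs(mpad(grades[:k] + grades[k + 1:]) - full) for k in range(len(grades))]
    return sum(shifts) / len(shifts), max(shifts)

# Illustrative four-grader panel on the 1-10 scale (not study data):
mean_shift, max_shift = loo_mpad_shift([7.0, 8.0, 9.0, 7.5])
print(f"mean |Δ| = {mean_shift:.2f}, max |Δ| = {max_shift:.2f}")
```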
Taken together, tolerance variation and leave-one-out analyses support the interpretation that domain-level differences are structural rather than procedural. This justifies fixed, domain-aware adjudication rules and provides guardrails for operational policies: select t to balance concordance and workload, maintain panel sizes sufficient for stability, and concentrate adjudication where dispersion remains high despite these controls.

4.3. Disagreement vs. Dispersion (MPAD) and Audit Policy

Disagreement increases with human panel dispersion. We quantify this panel dispersion using the Mean Pairwise Absolute Deviation (MPAD) among human graders: a larger MPAD indicates an item on which graders disagree more, often because it is harder or more ambiguous under the rubric. In parallel, we compute the chief–panel difference, defined as the absolute difference between the chief’s grade and the panel aggregate. This measures how far the adjudicating chief is from the consensus of the panel. A third quantity is the tolerance, a policy threshold for absolute score differences. Whenever two grades differ by no more than the tolerance, they are considered concordant. Taken together, these three signals–MPAD, chief–panel difference, and tolerance–capture (i) how contentious an item is among humans and (ii) whether the chief is aligned with the panel. These are precisely the ingredients needed for an effective, risk-aware audit policy.
To make the policy comparable across domains, we interpret MPAD within each domain. Concretely, an item is said to have low dispersion if its MPAD falls below the domain median; moderate dispersion if it lies between the domain median and the domain 90th percentile; and high dispersion if it is at or above the domain 90th percentile. The chief–panel difference is read directly against the tolerance: differences at or below the tolerance indicate small policy-relevant disagreement; differences moderately above the tolerance indicate borderline cases; and differences far above the tolerance indicate substantial risk. This domain-aware scaling preserves empirical differences between domains while avoiding one-size-fits-all cutoffs.
The audit rules are summarized in Table 12 in a format that mirrors operational practice. The Outcome column names the triage decision that will be taken (accept as is; brief human check; or full adjudication). The Criteria (per domain) column states, in plain language, the precise conditions under which that outcome should be triggered. These conditions are computed inside each domain using the item’s MPAD position (below the domain median; between the domain median and the domain 90th percentile; at or above the domain 90th percentile). They also depend on how the chief–panel difference compares to the tolerance (at most the tolerance; moderately above it; or far above it). The Action column specifies the concrete step to execute once the criteria are met. Actions range from simply recording the grade to a quick independent pass, and up to a multi-grader adjudication that may include targeted rubric clarification.
Concretely, for any fixed concordance tolerance on the 1–10 scale and a domain-specific high-dispersion cutoff for MPAD (for example, the 90th percentile of MPAD within that domain), the domain-adaptive audit policy in Table 12 can be implemented as a simple triage rule based on the item’s MPAD and the absolute chief–panel difference (see the sketch after this list):
  • If the item’s MPAD is at or above the high-dispersion cutoff, or if the absolute chief–panel difference is greater than twice the tolerance, label the item as Adjudicate (multi-grader resolution and possible rubric clarification).
  • Else, if the absolute chief–panel difference is at most the tolerance and the item’s MPAD is below the domain median, label the item as Accept (record the grade without additional review).
  • Otherwise, label the item as Brief check (one quick, independent human pass).
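The three bullets above translate directly into a small routing function. The sketch below uses an illustrative API and placeholder cutoffs; the field names and values are assumptions, not the deployed EvalCouncil configuration.

```python
# Sketch of the triage rule listed above, with placeholder cutoffs.
from dataclasses import dataclass

@dataclass
class DomainCutoffs:
    mpad_median: float   # domain-specific MPAD median
    mpad_high: float     # domain-specific high-dispersion cutoff (e.g., 90th percentile)

def triage(item_mpad: float, chief_panel_gap: float,
           cutoffs: DomainCutoffs, tolerance: float = 1.0) -> str:
    """Return 'Adjudicate', 'Accept', or 'Brief check' following the rule above."""
    if item_mpad >= cutoffs.mpad_high or chief_panel_gap > 2 * tolerance:
        return "Adjudicate"      # high dispersion or large chief-panel gap
    if chief_panel_gap <= tolerance and item_mpad < cutoffs.mpad_median:
        return "Accept"          # low dispersion and small gap
    return "Brief check"         # everything in between

# Illustrative routing of a few hypothetical items (placeholder cutoffs):
ml_cutoffs = DomainCutoffs(mpad_median=0.7, mpad_high=2.0)
for m, g in [(0.3, 0.4), (1.2, 0.8), (2.5, 0.2), (0.5, 2.6)]:
    print(f"MPAD={m}, |chief-panel|={g} -> {triage(m, g, ml_cutoffs)}")
```

Tallying the returned labels over a domain yields bin shares of the kind reported below for ML, and the high-dispersion cutoff can be moved to a different percentile when reviewing capacity is constrained.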
This narrative policy maintains a workload proportional to the risk. Items with both low dispersion and small chief–panel differences proceed without additional effort. Borderline cases–either moderate dispersion or modest chief–panel gaps–receive a brief human pass that resolves most uncertainties at low cost. High-dispersion items or those with large chief–panel gaps are escalated for coordinated resolution because they drive most residual misalignment and benefit from explicit rubric guidance. When capacity is limited, the “high dispersion” cutoff can be tuned by replacing the domain 90th percentile with a different domain percentile to match available resources, while prioritization within outcomes can use the pair (MPAD, chief–panel difference) to surface the most contentious items first. For instructors, this means that they do not need to interpret the full set of curves and tables; they can simply read MPAD and chief–panel gaps through the three bins of the audit policy. Low-dispersion, small-gap items can be accepted as is, borderline cases warrant a brief check, and high-dispersion or large-gap items trigger adjudication and potential rubric refinement.
As an illustration, in the Machine Learning domain, where 220 responses received chief grades, with a tolerance of ±1.0 on the 1–10 scale the audit rule routes 90 responses (40.9%) to the Accept bin, 104 responses (47.3%) to the Brief check bin, and 26 responses (11.8%) to the Adjudicate bin, concentrating full adjudication on a small, high-risk subset of items.

4.4. Limitations

Interpretation is bound by several scope conditions. First, domain coverage is narrow: results are drawn from two technical domains with different grading cultures and item formats, so external validity beyond these settings remains uncertain. Sample sizes and item mixes differ across domains, which may inflate or attenuate dispersion estimates in ways that do not reflect intrinsic difficulty. Second, rubric granularity constrains measurement: coarse criteria can compress disagreement at the center while amplifying upper-tail behavior in ambiguous regions. The tolerance parameter is policy-defined rather than empirically identified; alternative calibrations of concordance thresholds would shift operating points without necessarily changing the qualitative ordering. Moreover, we do not collect explicit student attributes (e.g., gender, language background, performance level), so we cannot analyze subgroup fairness or systematic differences in chief behavior across student groups in this deployment.
Analyses in this paper are primarily descriptive and focus on characterizing domain structure, tolerance robustness, and audit policies. We include a single non-parametric test comparing response-level MPAD between CN and ML to illustrate one formal cross-domain comparison, but we do not pursue a broader hypothesis-testing program. Domain differences should therefore be read as exploratory patterns rather than as exhaustively tested effects.
Panel composition and grader training introduce additional uncertainty. Leave-one-out stability mitigates–but does not eliminate–the possibility of correlated grader errors, anchoring, or local rubric drift. The chief’s role is operationally meaningful yet not interchangeable with an average human grader; chief–panel differences are therefore interpretable as workflow signals rather than ground truth. MPAD serves as a dispersion proxy that is sensitive to difficulty and ambiguity, but it does not partition sources of variation (content difficulty vs. rubric fuzziness vs. grader inconsistency); we therefore interpret MPAD as a disagreement signal rather than as a direct measure of latent difficulty, and attribution remains indirect. Finally, aggregation choices (e.g., the specific panel summary) and bootstrap uncertainty quantification rely on standard assumptions that may understate uncertainty when item dependencies or heavy tails are present.

4.5. Future Work

Several extensions can strengthen the generality and operational value of these results. First, broader domain coverage is needed to separate domain idiosyncrasies from structural patterns of Human–LLM alignment. Replications across additional STEM and non-STEM subjects, alternative item formats (open response, multi-step derivations, code), and diverse cohorts would clarify external validity and expose where dispersion stems from content difficulty versus rubric ambiguity. Future deployments could incorporate labeled student subgroups (e.g., gender, language background, performance bands) to enable explicit fairness analyses of chief and panel behavior across strata. Second, rubric refinement and calibration merit systematic study: anchoring rubrics with worked examples at multiple score levels, auditing rubric language for latent ambiguity, and measuring inter-grader effects before and after rubric updates would test whether heavy-tail behavior in CN is reducible through guidance rather than panel size alone.
On the workflow side, policy calibration can be formalized by treating the tolerance t and adjudication rules as operating points on a cost–risk frontier. Estimating curves that map t and MPAD cutoffs to expected discordance, adjudication volume, and residual error would allow explicit optimization under capacity constraints. Adaptive triage is a complementary approach: using early signals (e.g., MPAD from a subset of graders, chief–panel preliminary gaps, or lightweight content features) to dynamically route items to acceptance, brief check, or adjudication, with online updates as evidence accumulates. Within adjudication, comparing alternative aggregation rules (median, trimmed mean, M-estimators) and chief roles (pre- vs. post-panel) would quantify trade-offs between stability and responsiveness.
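As a starting point for such a calibration, one could tabulate adjudication volume, acceptance volume, and residual error over a small grid of tolerances and high-dispersion percentiles. The sketch below is a naive illustration under an assumed column layout; it is not an implemented EvalCouncil feature, and it expects one domain at a time so that medians and percentiles remain domain-specific.

```python
# Naive sketch of a cost-risk calibration grid (assumed columns, single domain).
import numpy as np
import pandas as pd

def calibration_grid(items: pd.DataFrame,
                     tolerances=(0.5, 1.0, 1.5),
                     high_percentiles=(0.80, 0.90, 0.95)) -> pd.DataFrame:
    gap = (items["chief"] - items["human_mean"]).abs()
    mpad_median = items["human_mpad"].median()
    rows = []
    for t in tolerances:
        for p in high_percentiles:
            cutoff = items["human_mpad"].quantile(p)
            adjudicate = (items["human_mpad"] >= cutoff) | (gap > 2 * t)
            accept = (~adjudicate) & (gap <= t) & (items["human_mpad"] < mpad_median)
            rows.append({
                "tolerance": t,
                "high_percentile": p,
                "share_adjudicated": float(adjudicate.mean()),
                "share_accepted": float(accept.mean()),
                "residual_mae_accepted": float(gap[accept].mean()) if accept.any() else np.nan,
            })
    return pd.DataFrame(rows)
```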
From a deployment standpoint, future work should relax the single-pass deterministic assumption and study stability under stochastic sampling and model updates. Running committee members and the chief multiple times with non-zero temperature and aggregating via averaging or self-consistency schemes could reduce single-run bias when larger commercial models are used. It would also be natural to incorporate LLMs fine-tuned specifically for grading or rubric following as committee members or chiefs and to compare their behavior against the general-purpose evaluators used here in order to quantify how much targeted fine-tuning improves alignment and reduces disagreement.
Methodologically, future work should decompose sources of dispersion. MPAD aggregates difficulty, rubric fuzziness, and grader inconsistency; hierarchical or multitask models that attribute variance components at the item, rubric-criterion, and grader levels could inform targeted interventions. Longitudinal designs could assess the temporal stability of alignment and detect drift as content or cohorts change. Finally, evaluating LLM-assisted grading under the same framework–e.g., prompting strategies, rubric-aware verifiers, or committee-of-models–would test whether machine support reduces human dispersion without masking true difficulty, and whether the domain-adaptive policy remains robust when humans and models co-produce grades.

5. Conclusions

In this study, human–LLM alignment shows a consistent domain structure across the two evaluated domains. In Machine Learning (ML), human score distributions are more concentrated with shorter upper tails, chief–panel concordance is higher, and extreme spreads are rare; in Computer Networks (CNs), dispersion is broader with heavier upper tails and larger maximum spreads. Disagreement increases with human panel dispersion, as captured by the Mean Pairwise Absolute Deviation (MPAD), a dispersion-based disagreement signal that is sensitive to difficulty and ambiguity, which concentrates misalignment on a relatively small subset of items. These patterns remain under reasonable variations of the concordance tolerance and under leave-one-out changes to panel composition, indicating that conclusions are not artifacts of specific operating points or individual graders.
The results support domain-aware grading policies for LLM-assisted evaluation in these two technical university courses. Interpreting MPAD within the domain and combining it with the chief–panel difference yields transparent triage rules: accept low-dispersion, small-gap items; apply a brief check to borderline cases; and adjudicate high-dispersion or large-gap items with targeted rubric clarification when needed. Because thresholds are defined relative to domain distributions, the policy preserves cross-domain ordering while remaining adaptable to capacity via percentile tuning. In our deployment, this concentrates effort where risk is highest and reduces unnecessary escalation elsewhere.
Beyond immediate workflow benefits, the framework provides a template for calibrating cost–risk trade-offs in LLM-assisted grading in higher-education settings. MPAD functions as a dispersion-based disagreement signal that is sensitive to difficulty and ambiguity, the tolerance as a policy dial, and the chief–panel difference as a safety check on aggregation. Together, they offer a replicable, data-driven approach to improve reliability without obscuring genuine difficulty. Extensions to additional domains, item formats, and cohorts–and formal decomposition of dispersion sources–can strengthen external validity and further refine threshold selection and adjudication design.

Author Contributions

Conceptualization, C.A., M.V.C., A.A.A., A.S.B. and C.S.; methodology, C.A. and M.V.C.; software, C.A., A.A.A., C.M., C.A.A., A.S.B. and A.C.; validation, C.A., M.V.C., A.C., A.S.B., C.A.A. and C.M.; data curation, A.C., A.S.B., S.D., D.-G.N. and C.S.; writing–original draft preparation, C.A., M.V.C., A.A.A., A.C., A.S.B., C.A.A., C.M., S.D., D.-G.N. and C.S.; writing–review and editing, A.A.A., C.A., A.C., C.A.A. and S.D.; visualization, M.V.C., A.C., C.M., D.-G.N., S.D. and C.S.; supervision, C.A., M.V.C., A.A.A., D.-G.N. and C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The source code of the main modules is available at: https://github.com/anghelcata/eval_council.git (accessed on 3 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BLEU: Bilingual Evaluation Understudy
CI: Confidence Interval
CN: Computer Networking
EMNLP: Empirical Methods in Natural Language Processing
ICC: Intraclass Correlation Coefficient
IQR: Interquartile Range
LLM: Large Language Model
MAE: Mean Absolute Error
METEOR: Metric for Evaluation of Translation with Explicit ORdering
ML: Machine Learning
MPAD: Mean Pairwise Absolute Difference
QWK: Quadratic-weighted Cohen’s Kappa
ROUGE: Recall-Oriented Understudy for Gisting Evaluation

Figure 1. EvalCouncil evaluation pipeline. Each student’s response is graded independently by four evaluators (committee). The outputs are aggregated by a chief arbiter, who produces the final score.
Figure 2. EvalCouncil graph data model in Neo4j. Nodes represent runs, items, responses, scores, panel evaluations, panel members, and chief decisions. Edges encode evaluation flow and provenance.
Figure 3. Response-level histograms of final grades by domain on the 1–10 scale. Binning and axis limits are identical across panels. Solid vertical lines denote medians, and dashed lines denote means.
Figure 4. Per-item box-and-whisker plots of final grades (after chief adjudication), ordered by each item’s median. The top panel shows Computer Networking, and the bottom panel shows Machine Learning. Identical y-axis limits are used across domains. Orange lines show item medians.
Figure 5. Within-panel agreement by domain on the 1–10 scale. Vertical box-and-whisker plots display the mean pairwise absolute difference (MPAD) for each response in Computer Networking (left) and Machine Learning (right). The y-axis uses identical limits (0–5.5) to facilitate comparison. Orange lines show item medians.
Figure 6. Chief–panel concordance. Scatter of chief score against the panel mean on the 1–10 scale, with the identity line and identical axis limits across domains, CN on the left and ML on the right. Points near the line indicate agreement, and the vertical distance encodes the chief–panel absolute difference. Sample sizes: CN n = 208, ML n = 220.
Figure 7. Tolerance sweep for panel decisions. The left shows the majority share. The right shows the arbitration share. Bands are ±0.5, ±1.0, and ±1.5 on the 1–10 scale. Axes are matched, CN on the left and ML on the right. Majorities approach saturation by ±1.0, with ML higher at ±0.5.
Figure 8. Leave-one-out sensitivity of within-panel dispersion. For each item with at least three panel grades, we compute mean|Δ| and max|Δ| across single-grader removals, where Δ = MPAD_k − MPAD_full and MPAD_full is computed over all panel grades. CN and ML use identical y-limits to enable direct comparison. Orange lines show item medians.
Figure 9. Distribution of per-item Human–LLM alignment, δ_i = |c_i − h̄_i|, by domain (identical y-axes). Dotted lines mark 0.5- and 1.0-point tolerances. Top labels report n, median, and P95; bottom labels report coverage within each tolerance.
Figure 10. Human–LLM disagreement by tolerance. For each tolerance t ∈ [0, 3] grading points, the curves show the proportion of items with |c_i − h̄_i| > t in CN and ML, drawn on common axes with the y-axis ranging from 0 to 1. Vertical guidelines at t = 0.5, 1.0, and 2.0 mark typical grading thresholds. Lower curves indicate less disagreement and therefore tighter alignment.
Table 1. Summary of datasets, task categories, and counts of items and responses.
Dataset | Students | Questions | Task Types
CN (Computer Networking) | 26 | 8 | Classification, numerical (IPv4 subnetting), representation (IPv6), conceptual verification, protocol-function matching
ML (Machine Learning) | 22 | 10 | 5 technical open-ended, 5 argumentative open-ended
Table 2. Large Language Model implementations used in this study.
Role | Model (Ollama Tag) | Developer | Size
Chief Evaluator | llama3:instruct | Meta | ~7B
Evaluator 1 | mistral:7b-instruct | Mistral | ~7B
Evaluator 2 | gemma:7b-instruct | Google DeepMind | ~7B
Evaluator 3 | zephyr:7b-beta | HuggingFace | ~7B
Evaluator 4 | openhermes:latest | Teknium/Nous | ~7B
Table 3. Evaluation rubrics for technical and argumentative tasks.
Rubric Type | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4
Technical | Accuracy | Clarity | Completeness | Terminology
Argumentative | Clarity | Coherence | Originality | Dialecticality
Table 4. Within-panel agreement by domain. MPAD summaries (median, IQR, mean, P95, max), % MPAD = 0 and ≥2, and Krippendorff’s α (interval). CN n = 208; ML n = 220.
Domain | n | MPAD Median | IQR (P25–P75) | MPAD Mean | P95 (MPAD) | Max | % MPAD = 0 | % MPAD ≥ 2 | Krippendorff’s α (Interval)
CN | 208 | 0.50 | 0.19–0.67 | 0.78 | 3.67 | 5.50 | 25.00 | 6.30 | 0.892
ML | 220 | 0.67 | 0.50–1.00 | 0.84 | 2.17 | 4.33 | 9.00 | 6.80 | 0.633
Table 5. Chief–panel concordance by domain. Metrics reported: Pearson r, Spearman ρ, MAE, RMSE, bias, and shares of responses within ±0.5 and ±1.0 points on the 1–10 scale. Sample sizes: CN n = 208, ML n = 220.
Domain | n | Pearson r | Spearman ρ | MAE | RMSE | Bias | % Within ±0.5 | % Within ±1.0
CN | 208 | 0.767 | 0.619 | 0.813 | 2.085 | −0.386 | 86.1 | 90.4
ML | 220 | 0.953 | 0.861 | 0.333 | 0.445 | 0.071 | 90.0 | 98.2
Table 6. Comparison between the best single-model grader (openhermes:latest) and the chief configuration, both evaluated against the panel mean in each domain.
Domain | Model | MAE vs. Panel Mean | % Within ±1.0 vs. Panel Mean
CN | Best single LLM (openhermes:latest) | 0.40 | 94.7
CN | Chief (committee + chief) | 0.81 | 90.4
ML | Best single LLM (openhermes:latest) | 0.32 | 98.6
ML | Chief (committee + chief) | 0.33 | 98.2
Table 7. Chief decision by method across tolerance bands (±0.5, ±1.0, ±1.5). Entries report counts and percentages over eligible items (k ≥ 3), split by domain (CN, ML).
Domain | Tolerance (±) | Method | Count | Percent (%)
CN | ±0.5 | majority | 167 | 80.3
CN | ±0.5 | arbitration_with_self | 41 | 19.7
CN | ±1.0 | majority | 203 | 97.6
CN | ±1.0 | arbitration_with_self | 5 | 2.4
CN | ±1.5 | majority | 203 | 97.6
CN | ±1.5 | arbitration_with_self | 5 | 2.4
ML | ±0.5 | majority | 204 | 92.7
ML | ±0.5 | arbitration_with_self | 16 | 7.3
ML | ±1.0 | majority | 214 | 97.3
ML | ±1.0 | arbitration_with_self | 6 | 2.7
ML | ±1.5 | majority | 218 | 99.1
ML | ±1.5 | arbitration_with_self | 2 | 0.9
Table 8. Chief–panel concordance by domain (tolerance-invariant). Reported for items with a chief grade: sample size n, average panel size k̄, chief–panel MAE with bootstrap 95% CI, and signed bias (chief − panel mean).
Domain | n | k̄ (Panel Size) | Chief–Panel MAE | 95% CI (MAE) | Bias (Chief − Panel)
CN | 208 | 4.000 | 0.813 | 0.569–1.093 | −0.386
ML | 220 | 4.000 | 0.333 | 0.295–0.373 | 0.071
Table 9. Human–LLM alignment by domain. Summary statistics of the per-item absolute deviation δ_i = |c_i − h̄_i| (points): sample size n, median, mean, and 95th percentile (P95). Lower values indicate closer agreement; identical rounding is used across domains.
Domain | n | Median δ | Mean δ | P95 δ
CN | 208 | 1.00 | 1.94 | 5.50
ML | 220 | 0.62 | 0.85 | 2.39
Table 10. Agreement metrics by domain for the LLM Chief versus the human reference. Columns: sample size n, Spearman ρ, Lin’s concordance CCC, within-one-point coverage p(δ ≤ 1.0), and MAE(δ) in points. Rounding: ρ and CCC to three decimals; coverage and MAE to two.
Domain | n | Spearman ρ | Lin’s CCC | p(δ ≤ 1.0) | MAE(δ)
CN | 208 | 0.696 | 0.622 | 0.53 | 1.94
ML | 220 | 0.717 | 0.766 | 0.76 | 0.85
Table 11. Human–LLM disagreement by human panel dispersion, stratified by MPAD tertiles.
Domain | MPAD Stratum | #Items | Median |δ| | P95 |δ| | Median Bias (c − h̄) | Coverage |δ| ≤ 1
CN | T1 (low) | 3 | 1 | 5.5 | −0.5 | 0.67
CN | T2 (mid) | 2 | 1 | 5 | 0.25 | 0.6
CN | T3 (high) | 3 | 1.5 | 5.5 | 0 | 0.36
ML | T1 (low) | 3 | 0.75 | 2.94 | 0 | 0.76
ML | T2 (mid) | 4 | 0.62 | 1.91 | −0.25 | 0.76
ML | T3 (high) | 3 | 0.62 | 3.09 | 0 | 0.76
Table 12. Domain-adaptive audit policy (plain-language criteria).
Outcome | Criteria (per Domain) | Action
Accept | The chief–panel difference is at most the tolerance and the item’s MPAD is below the domain median. | Record the grade; no further review.
Brief check | Either the chief–panel difference is at most the tolerance and the item’s MPAD is between the domain median and the domain 90th percentile; or the chief–panel difference is greater than the tolerance but not more than twice the tolerance. | One quick, independent human check.
Adjudicate | Either the item’s MPAD is at or above the domain 90th percentile; or the chief–panel difference is greater than twice the tolerance. | Multi-grader adjudication; add a focused rubric clarification if needed.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
