Author Contributions
Conceptualization, C.A., M.V.C., A.A.A., A.S.B. and C.S.; methodology, C.A. and M.V.C.; software, C.A., A.A.A., C.M., C.A.A., A.S.B. and A.C.; validation, C.A., M.V.C., A.C., A.S.B., C.A.A. and C.M.; data curation, A.C., A.S.B., S.D., D.-G.N. and C.S.; writing–original draft preparation, C.A., M.V.C., A.A.A., A.C., A.S.B., C.A.A., C.M., S.D., D.-G.N. and C.S.; writing–review and editing, A.A.A., C.A., A.C., C.A.A. and S.D.; visualization, M.V.C., A.C., C.M., D.-G.N., S.D. and C.S.; supervision, C.A., M.V.C., A.A.A., D.-G.N. and C.S. All authors have read and agreed to the published version of the manuscript.
Figure 1.
EvalCouncil evaluation pipeline. Each student’s response is graded independently by four evaluators (committee). The outputs are aggregated by a chief arbiter, who produces the final score.
Figure 2.
EvalCouncil graph data model in Neo4j. Nodes represent runs, items, responses, scores, panel evaluations, panel members, and chief decisions. Edges encode evaluation flow and provenance.
Figure 3.
Response-level histograms of final grades by domain on the 1–10 scale. Binning and axis limits are identical across panels. Solid vertical lines denote medians, and dashed lines denote means.
Figure 4.
Per-item box-and-whisker plots of final grades (after chief adjudication), ordered by each item’s median. The top panel shows Computer Networking, and the bottom panel shows Machine Learning. Identical y-axis limits are used across domains. Orange lines show item medians.
Figure 5.
Within-panel agreement by domain on the 1–10 scale. Vertical box-and-whisker plots display the mean pairwise absolute difference (MPAD) for each response in Computer Networking (left) and Machine Learning (right). The y-axis uses identical limits (0–5.5) to facilitate comparison. Orange lines show item medians.
Figure 6.
Chief–panel concordance. Scatter of chief score against the panel mean on the 1–10 scale, with the identity line and identical axis limits across domains, CN on the left and ML on the right. Points near the line indicate agreement, and the vertical distance encodes the chief–panel absolute difference. Sample sizes: CN n = 208, ML n = 220.
Figure 7.
Tolerance sweep for panel decisions. The left shows the majority share. The right shows the arbitration share. Bands are ±0.5, ±1.0, and ±1.5 on the 1–10 scale. Axes are matched, CN on the left and ML on the right. Majorities approach saturation by ±1.0, with ML higher at ±0.5.
Figure 8.
Leave-one-out sensitivity of within-panel dispersion. For each item with at least three panel grades, we compute mean|Δ| and max|Δ| across single-grader removals, where Δ is the change in MPAD when a single grader is removed and MPAD is computed over all panel grades. CN and ML use identical y-limits to enable direct comparison. Orange lines show item medians.
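A minimal sketch of this leave-one-out computation, under the assumption that Δ is the change in MPAD after removing a single grader; variable names are illustrative and not taken from the EvalCouncil code base.

```python
# Leave-one-out sensitivity sketch (assumed definition: Delta = change in MPAD
# after removing one grader); illustrative data, not the study's.
from itertools import combinations

import numpy as np


def mpad(grades):
    """Mean pairwise absolute difference among one item's panel grades."""
    return float(np.mean([abs(a - b) for a, b in combinations(grades, 2)]))


def loo_sensitivity(grades):
    """Return (mean|Delta|, max|Delta|) over single-grader removals."""
    full = mpad(grades)
    deltas = [abs(mpad(np.delete(grades, i)) - full) for i in range(len(grades))]
    return float(np.mean(deltas)), float(np.max(deltas))


grades = np.array([8.0, 8.5, 7.0, 9.5])   # one item, four panel grades
print(loo_sensitivity(grades))            # -> (mean|Delta|, max|Delta|)
```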
Figure 9.
Distribution of per-item Human–LLM alignment, |δ|, by domain (identical y-axes). Dotted lines mark 0.5- and 1.0-point tolerances. Top labels report n, median, and P95; bottom labels report coverage within each tolerance.
Figure 10.
Human–LLM disagreement by tolerance. For each tolerance τ (in grading points), the curves show the proportion of items with |δ| > τ in CN and ML, drawn on common axes with the y-axis ranging from 0 to 1. Vertical guidelines at 0.5 and 1.0 points mark typical grading thresholds. Lower curves indicate less disagreement and therefore tighter alignment.
Table 1.
Summary of datasets, task categories, and counts of items and responses.
| Dataset | Students | Questions | Task Types |
|---|---|---|---|
| CN (Computer Networking) | 26 | 8 | Classification, numerical (IPv4 subnetting), representation (IPv6), conceptual verification, protocol-function matching |
| ML (Machine Learning) | 22 | 10 | 5 technical open-ended, 5 argumentative open-ended |
Table 2.
Large Language Model implementations used in this study.
| Role | Model (Ollama Tag) | Developer | Size |
|---|---|---|---|
| Chief Evaluator | llama3:instruct | Meta | ~8B |
| Evaluator 1 | mistral:7b-instruct | Mistral | ~7B |
| Evaluator 2 | gemma:7b-instruct | Google DeepMind | ~7B |
| Evaluator 3 | zephyr:7b-beta | HuggingFace | ~7B |
| Evaluator 4 | openhermes:latest | Teknium/Nous | ~7B |
Table 3.
Evaluation rubrics for technical and argumentative tasks.
| Rubric Type | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4 |
|---|---|---|---|---|
| Technical | Accuracy | Clarity | Completeness | Terminology |
| Argumentative | Clarity | Coherence | Originality | Dialecticality |
Table 4.
Within-panel agreement by domain. MPAD summaries (median, IQR, mean, P95, max), % MPAD = 0 and ≥2, and Krippendorff’s α (interval). CN n = 208; ML n = 220.
| Domain | n | MPAD Median | IQR (P25–P75) | MPAD Mean | P95 (MPAD) | Max | % MPAD = 0 | % MPAD ≥ 2 | Krippendorff’s α (Interval) |
|---|---|---|---|---|---|---|---|---|---|
| CN | 208 | 0.50 | 0.19–0.67 | 0.78 | 3.67 | 5.50 | 25.00 | 6.30 | 0.892 |
| ML | 220 | 0.67 | 0.50–1.00 | 0.84 | 2.17 | 4.33 | 9.00 | 6.80 | 0.633 |
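As a minimal, illustrative sketch (not the study's implementation), the MPAD summaries and interval-level Krippendorff's α of Table 4 could be computed as follows, assuming panel grades are available as a responses × graders array; the krippendorff package and the variable names are assumptions.

```python
# MPAD per response and interval-level Krippendorff's alpha (illustrative data).
from itertools import combinations

import numpy as np
import krippendorff  # pip install krippendorff


def mpad(grades):
    """Mean pairwise absolute difference among one response's panel grades."""
    return float(np.mean([abs(a - b) for a, b in combinations(grades, 2)]))


# Rows are responses, columns are the four panel evaluators (1-10 scale).
panel = np.array([
    [8.0, 8.5, 8.0, 7.5],
    [6.0, 7.0, 6.5, 6.0],
    [9.0, 9.0, 8.5, 9.5],
])

per_response = np.array([mpad(row) for row in panel])
print("MPAD median:", np.median(per_response))
print("MPAD P95:", np.percentile(per_response, 95))
print("% MPAD = 0:", np.mean(per_response == 0) * 100)

# krippendorff.alpha expects raters as rows and units (responses) as columns.
alpha = krippendorff.alpha(reliability_data=panel.T, level_of_measurement="interval")
print("Krippendorff's alpha (interval):", round(alpha, 3))
```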
Table 5.
Chief–panel concordance by domain. Metrics reported: Pearson r, Spearman ρ, MAE, RMSE, bias, and shares of responses within ±0.5 and ±1.0 points on the 1–10 scale. Sample sizes: CN n = 208, ML n = 220.
| Domain | n | Pearson r | Spearman ρ | MAE | RMSE | Bias | % Within ±0.5 | % Within ±1.0 |
|---|---|---|---|---|---|---|---|---|
| CN | 208 | 0.767 | 0.619 | 0.813 | 2.085 | −0.386 | 86.1 | 90.4 |
| ML | 220 | 0.953 | 0.861 | 0.333 | 0.445 | 0.071 | 90.0 | 98.2 |
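The chief–panel concordance metrics in Table 5 follow from two aligned score vectors; the sketch below uses illustrative values, not the study's data.

```python
# Chief-panel concordance metrics (Pearson r, Spearman rho, MAE, RMSE, bias,
# and within-tolerance shares) on illustrative data.
import numpy as np
from scipy.stats import pearsonr, spearmanr

chief = np.array([8.0, 6.5, 9.0, 5.0, 7.5])
panel_mean = np.array([8.0, 7.0, 8.5, 5.5, 7.5])

diff = chief - panel_mean
mae = np.mean(np.abs(diff))
rmse = np.sqrt(np.mean(diff ** 2))
bias = np.mean(diff)                       # signed: chief minus panel mean
within_05 = np.mean(np.abs(diff) <= 0.5)   # share within +/-0.5 points
within_10 = np.mean(np.abs(diff) <= 1.0)   # share within +/-1.0 points

print("Pearson r:", round(pearsonr(chief, panel_mean)[0], 3))
print("Spearman rho:", round(spearmanr(chief, panel_mean)[0], 3))
print(f"MAE={mae:.3f} RMSE={rmse:.3f} bias={bias:+.3f} "
      f"within 0.5={within_05:.1%} within 1.0={within_10:.1%}")
```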
Table 6.
Comparison between the best single-model grader (openhermes:latest) and the chief configuration, both evaluated against the panel mean in each domain.
| Domain | Model | MAE vs. Panel Mean | % Within ±1.0 vs. Panel Mean |
|---|---|---|---|
| CN | Best single LLM (openhermes:latest) | 0.40 | 94.7 |
| CN | Chief (committee + chief) | 0.81 | 90.4 |
| ML | Best single LLM (openhermes:latest) | 0.32 | 98.6 |
| ML | Chief (committee + chief) | 0.33 | 98.2 |
Table 7.
Chief decision by method across tolerance bands (±0.5, ±1.0, ±1.5). Entries report counts and percentages over eligible items (k ≥ 3), split by domain (CN, ML).
| Domain | Tolerance (±) | Method | Count | Percent (%) |
|---|---|---|---|---|
| CN | ±0.5 | majority | 167 | 80.3 |
| CN | ±0.5 | arbitration_with_self | 41 | 19.7 |
| CN | ±1 | majority | 203 | 97.6 |
| CN | ±1 | arbitration_with_self | 5 | 2.4 |
| CN | ±1.5 | majority | 203 | 97.6 |
| CN | ±1.5 | arbitration_with_self | 5 | 2.4 |
| ML | ±0.5 | majority | 204 | 92.7 |
| ML | ±0.5 | arbitration_with_self | 16 | 7.3 |
| ML | ±1 | majority | 214 | 97.3 |
| ML | ±1 | arbitration_with_self | 6 | 2.7 |
| ML | ±1.5 | majority | 218 | 99.1 |
| ML | ±1.5 | arbitration_with_self | 2 | 0.9 |
Table 8.
Chief–panel concordance by domain (tolerance-invariant). Reported for items with a chief grade: sample size n, average panel size, chief–panel MAE with bootstrap 95% CI, and signed bias (chief − panel mean).
| Domain | n | Mean Panel Size | Chief–Panel MAE | 95% CI (MAE) | Bias (Chief–Panel) |
|---|---|---|---|---|---|
| CN | 208 | 4.000 | 0.813 | 0.569–1.093 | −0.386 |
| ML | 220 | 4.000 | 0.333 | 0.295–0.373 | 0.071 |
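The bootstrap 95% CI on the chief–panel MAE in Table 8 can be obtained with a standard nonparametric (percentile) bootstrap over the per-item absolute differences; the sketch below uses illustrative values.

```python
# Percentile bootstrap of the chief-panel MAE (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
abs_diff = np.array([0.2, 1.0, 0.5, 0.0, 2.5, 0.3, 0.4, 1.2])  # per-item |chief - panel mean|

boot_maes = np.array([
    rng.choice(abs_diff, size=abs_diff.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_maes, [2.5, 97.5])
print(f"MAE={abs_diff.mean():.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```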
Table 9.
Human–LLM alignment by domain. Summary statistics of the per-item absolute deviation |δ| (points): sample size n, median, mean, and 95th percentile (P95). Lower values indicate closer agreement; identical rounding is used across domains.
| Domain | n | Median \|δ\| | Mean \|δ\| | P95 \|δ\| |
|---|---|---|---|---|
| CN | 208 | 1.00 | 1.94 | 5.50 |
| ML | 220 | 0.62 | 0.85 | 2.39 |
Table 10.
Agreement metrics by domain for the LLM Chief versus the human reference. Columns: sample size n, Spearman ρ, Lin’s concordance CCC, within-one-point coverage (share of items with |δ| ≤ 1), and MAE in points. Rounding: ρ and CCC to three decimals; coverage and MAE to two.
| Domain | n | Spearman ρ | Lin’s CCC | Coverage \|δ\| ≤ 1 | MAE (Points) |
|---|---|---|---|---|---|
| CN | 208 | 0.696 | 0.622 | 0.53 | 1.94 |
| ML | 220 | 0.717 | 0.766 | 0.76 | 0.85 |
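The alignment statistics in Tables 9 and 10 follow directly from the per-item signed deviation δ between the LLM chief and the human reference; the sketch below (illustrative data) includes a direct computation of Lin's concordance correlation coefficient.

```python
# Per-item |delta| summaries, within-one-point coverage, Spearman rho, and
# Lin's CCC between chief and human scores (illustrative data).
import numpy as np
from scipy.stats import spearmanr

chief = np.array([8.0, 6.0, 9.0, 4.5, 7.0])
human = np.array([8.5, 6.0, 7.5, 5.0, 7.0])

delta = chief - human
abs_delta = np.abs(delta)
print("median |delta|:", np.median(abs_delta))
print("mean |delta| (MAE):", np.mean(abs_delta))
print("P95 |delta|:", np.percentile(abs_delta, 95))
print("coverage |delta| <= 1:", np.mean(abs_delta <= 1.0))
print("Spearman rho:", round(spearmanr(chief, human)[0], 3))

# Lin's CCC = 2*cov / (var_x + var_y + (mean_x - mean_y)^2), population moments.
mx, my = chief.mean(), human.mean()
sxy = np.mean((chief - mx) * (human - my))
ccc = 2 * sxy / (chief.var() + human.var() + (mx - my) ** 2)
print("Lin's CCC:", round(ccc, 3))
```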
Table 11.
Human–LLM disagreement by human panel dispersion, stratified by MPAD tertiles.
| Domain | MPAD Stratum | # Items | Median \|δ\| | P95 \|δ\| | Bias | Coverage \|δ\| ≤ 1 |
|---|---|---|---|---|---|---|
| CN | T1 (low) | 3 | 1 | 5.5 | −0.5 | 0.67 |
| CN | T2 (mid) | 2 | 1 | 5 | 0.25 | 0.6 |
| CN | T3 (high) | 3 | 1.5 | 5.5 | 0 | 0.36 |
| ML | T1 (low) | 3 | 0.75 | 2.94 | 0 | 0.76 |
| ML | T2 (mid) | 4 | 0.62 | 1.91 | −0.25 | 0.76 |
| ML | T3 (high) | 3 | 0.62 | 3.09 | 0 | 0.76 |
Table 12.
Domain-adaptive audit policy (plain-language criteria).
| Outcome | Criteria (per Domain) | Action |
|---|---|---|
| Accept | The chief–panel difference is at most the tolerance and the item’s MPAD is below the domain median. | Record the grade; no further review. |
| Brief check | Either the chief–panel difference is at most the tolerance and the item’s MPAD is between the domain median and the domain 90th percentile; or the chief–panel difference is greater than the tolerance but not more than twice the tolerance. | One quick, independent human check. |
| Adjudicate | Either the item’s MPAD is at or above the domain 90th percentile; or the chief–panel difference is greater than twice the tolerance. | Multi-grader adjudication; add a focused rubric clarification if needed. |
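The audit policy in Table 12 maps onto a simple triage function; the sketch below restates the plain-language criteria, with the per-domain tolerance, median MPAD, and 90th-percentile MPAD passed in as parameters (function and parameter names are illustrative).

```python
# Domain-adaptive audit triage, restating the criteria in Table 12.
def audit_outcome(chief_panel_diff, item_mpad, tolerance, mpad_median, mpad_p90):
    # Adjudicate: highly dispersed panel, or a chief-panel gap beyond twice the tolerance.
    if item_mpad >= mpad_p90 or chief_panel_diff > 2 * tolerance:
        return "adjudicate"
    # Accept: small chief-panel gap and a panel tighter than the domain median.
    if chief_panel_diff <= tolerance and item_mpad < mpad_median:
        return "accept"
    # Otherwise: one quick, independent human check.
    return "brief_check"


# Hypothetical domain parameters, for illustration only.
print(audit_outcome(chief_panel_diff=0.3, item_mpad=0.4,
                    tolerance=0.5, mpad_median=0.5, mpad_p90=1.8))   # accept
print(audit_outcome(chief_panel_diff=1.4, item_mpad=0.6,
                    tolerance=0.5, mpad_median=0.5, mpad_p90=1.8))   # adjudicate
```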