Article

Towards Reliable LLM Grading Through Self-Consistency and Selective Human Review: Higher Accuracy, Less Work

University of Amsterdam, 1018 WB Amsterdam, The Netherlands
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(3), 74; https://doi.org/10.3390/make8030074
Submission received: 11 December 2025 / Revised: 14 February 2026 / Accepted: 2 March 2026 / Published: 16 March 2026
(This article belongs to the Section Learning)

Abstract

Large language models (LLMs) show promise for grading open-ended assessments but still exhibit inconsistent accuracy, systematic biases, and limited reliability across assignments. To address these concerns, we introduce SURE (Selective Uncertainty-based Re-Evaluation), a human-in-the-loop pipeline that combines repeated LLM prompting, uncertainty-based flagging, and selective human regrading. Three LLMs (gpt-4.1-nano, gpt-5-nano, and the open-source gpt-oss-20b) graded the answers of 46 students to 130 open questions and coding exercises across five assignments. Each student answer was scored 20 times to derive majority-voted predictions and self-consistency-based certainty estimates. We simulated human regrading by flagging low-certainty cases and replacing them with scores from four human graders. We used the first assignment as a training set for tuning certainty thresholds and for exploring LLM output diversification via sampling parameters, rubric shuffling, varied personas, multilingual prompts, and post hoc ensembles. We then evaluated the effectiveness and efficiency of SURE on the other four assignments using a fixed certainty threshold. Across assignments, fully automated grading with a single prompt resulted in substantial underscoring, and majority voting based on 20 prompts reduced but did not eliminate this bias. Low certainty (i.e., high output diversity) was diagnostic of incorrect LLM scores, enabling targeted human regrading that improved grading accuracy while reducing manual grading time by 40–90%. Aggregating responses from all three LLMs in an ensemble improved certainty-based flagging and most consistently approached human-level accuracy, with 70–90% of the grades students would receive falling inside human-grader ranges.
A reanalysis based on outputs from a more diversified LLM ensemble comprising gpt-5, codestral-25.01, and llama-3.3-70b-instruct replicated these findings but also suggested that large reasoning models such as gpt-5 might eliminate the need for human oversight of LLM grading entirely. These findings demonstrate that self-consistency-based uncertainty estimation and selective human oversight can substantially improve the reliability and efficiency of AI-assisted grading.
Keywords: large language models; automatic grading; human in the loop; self-consistency; uncertainty estimation
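The core mechanism described in the abstract (majority voting over repeated LLM gradings, with a self-consistency certainty score used to flag answers for human review) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the 0.8 threshold are assumptions for demonstration; the paper tunes its certainty threshold on the first assignment.

```python
from collections import Counter

def majority_vote_with_certainty(scores):
    """Return the majority-voted score and a self-consistency certainty,
    i.e. the fraction of the repeated gradings that agree with the mode."""
    counts = Counter(scores)
    score, n_agree = counts.most_common(1)[0]
    return score, n_agree / len(scores)

def flag_for_review(all_scores, threshold=0.8):
    """Flag answers whose certainty falls below the threshold for human
    regrading. The 0.8 value is illustrative only; in SURE the threshold
    is tuned on a training assignment."""
    flagged = []
    for answer_id, scores in all_scores.items():
        _, certainty = majority_vote_with_certainty(scores)
        if certainty < threshold:
            flagged.append(answer_id)
    return flagged

# Example: answer "a" is graded consistently, answer "b" is not.
repeated_grades = {
    "a": [1] * 20,             # 20/20 agreement -> certainty 1.0
    "b": [1] * 12 + [0] * 8,   # 12/20 agreement -> certainty 0.6
}
print(flag_for_review(repeated_grades))  # only "b" is sent to a human
```

Under this scheme, the amount of human work is controlled directly by the threshold: raising it flags more answers (higher accuracy, more regrading time), lowering it flags fewer, which matches the 40–90% time savings reported in the abstract.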

Share and Cite

MDPI and ACS Style

Korthals, L.; Akrong, E.; Geller, G.; Rosenbusch, H.; Grasman, R.; Visser, I. Towards Reliable LLM Grading Through Self-Consistency and Selective Human Review: Higher Accuracy, Less Work. Mach. Learn. Knowl. Extr. 2026, 8, 74. https://doi.org/10.3390/make8030074

AMA Style

Korthals L, Akrong E, Geller G, Rosenbusch H, Grasman R, Visser I. Towards Reliable LLM Grading Through Self-Consistency and Selective Human Review: Higher Accuracy, Less Work. Machine Learning and Knowledge Extraction. 2026; 8(3):74. https://doi.org/10.3390/make8030074

Chicago/Turabian Style

Korthals, Luke, Emma Akrong, Gali Geller, Hannes Rosenbusch, Raoul Grasman, and Ingmar Visser. 2026. "Towards Reliable LLM Grading Through Self-Consistency and Selective Human Review: Higher Accuracy, Less Work" Machine Learning and Knowledge Extraction 8, no. 3: 74. https://doi.org/10.3390/make8030074

APA Style

Korthals, L., Akrong, E., Geller, G., Rosenbusch, H., Grasman, R., & Visser, I. (2026). Towards Reliable LLM Grading Through Self-Consistency and Selective Human Review: Higher Accuracy, Less Work. Machine Learning and Knowledge Extraction, 8(3), 74. https://doi.org/10.3390/make8030074
