LLM-as-a-Grader: Practical Insights from Large Language Models for Short-Answer and Report Evaluation
Abstract
1. Introduction
- RQ1: How well do LLM-generated grades align with human evaluations?
- RQ2: What are the most common reasons for disagreement between LLM and human graders?
- RQ3: Can an open-source grading toolkit be developed to support LLM-based assessment in real-world educational settings?
2. Related Work
3. Dataset
3.1. Quiz Dataset
- Quiz 1: Text Processing. Covers basic string manipulation, tokenization, normalization (lowercasing, stemming, lemmatization), and regular expressions.
- Quiz 2: Language Models. Focuses on n-gram language models, smoothing techniques (e.g., Laplace), and evaluation metrics such as perplexity.
- Quiz 3: Vector Space Models. Includes term weighting (TF-IDF), cosine similarity, and document classification using vector representations.
- Quiz 4: Distributional Semantics. Tests understanding of co-occurrence matrices, word-context windows, dimensionality reduction, and the distributional hypothesis.
- Quiz 5: Contextual Encoding. Addresses contextual word representations using pre-trained models such as BERT, including their architectural differences and contextualization mechanisms.
3.2. Project Report Dataset
- CAAP (Capture Assistant in Academic Papers): An LLM-powered tool that extracts keyphrases and definitions from academic papers, combining textual and visual information to enhance comprehension of technical content.
- GitFolio: A web-based platform that converts static resumes into dynamic online portfolios using NLP, OCR, and chatbot assistance, targeting early-career professionals with limited web development experience.
- FutureFetch: A personalized job/internship recommendation system that extracts information from resumes using GPT-4o and filters opportunities via a custom Python 3.9 backend, evaluated through both quantitative metrics and live user feedback.
- CareerAi: A resume enhancement and job-matching system using NLP, regular expressions, and the Gemini API, designed to format resumes and recommend relevant job listings through web scraping and structured parsing.
- MarketGuardian: A real-time scam detection system for e-commerce listings that analyzes textual and visual cues using LLMs, reverse image search, and keyphrase extraction to flag potentially fraudulent eBay listings.
- ATAP (Application Tracking Automation Program): An NLP-based tool that automates email parsing and application tracking for students, aiming to reduce inequality in opportunity access through centralized, real-time updates and support for diverse application types.
4. Methodology
Prompting and Scoring Strategy
5. Results
5.1. Grading Comparison on Quizzes
5.2. Grading Comparison on Project Reports
6. Discussion
7. Limitations and Future Work
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| LLM | Large Language Model |
| TA | Teaching Assistant |
| NLP | Natural Language Processing |
| MCQ | Multiple-Choice Question |
| TF-IDF | Term Frequency-Inverse Document Frequency |
| BERT | Bidirectional Encoder Representations from Transformers |
Appendix A. Short-Answer Quiz
Appendix A.1. Quiz Questions
- Quiz 1: Text Processing
  1. What is the difference between a word token and a word type?
  2. How can we interpret the most frequent words in a text?
  3. What is the difference between a word and a token?
  4. All delimiters used in our implementation are punctuation marks. What types of tokens should not be split by such delimiters?
  5. Our tokenizer uses hard-coded rules to handle specific cases. What would be a scalable approach to handling more diverse cases?
  6. The use of a more advanced tokenizer mitigates the issue of sparsity. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?
  7. What is the difference between a lemmatizer and a stemmer?
  8. What are the key differences between inflectional and derivational morphology?
  9. In which tasks can lemmatization negatively impact performance?
  10. What are the benefits and limitations of using regular expressions for tokenization vs. the rule-based tokenization approach discussed in the previous section?
- Quiz 2: Language Models
  1. What are the advantages of splitting “I” and “m” as two separate tokens, versus recognizing “I’m” as one token?
  2. What advantages do unigram probabilities have over word frequencies?
  3. What NLP tasks can benefit from bigram estimation over unigram estimation?
  4. When applying Laplace smoothing, do unigram probabilities always decrease? If not, what conditions can cause a unigram’s probability to increase?
  5. What does the Laplace smoothed bigram probability of represent when is unknown, and what is a potential problem with this estimation?
  6. Why is it problematic when bigram probabilities following a given word don’t sum to 1?
  7. What are the key differences between conditional and joint probabilities in sequence modeling, and how are they practically applied?
  8. How do the Chain Rule and Markov Assumption simplify the estimation of sequence probability?
  9. Is it worth considering the end of the text by introducing another artificial token, , to improve last-word prediction by multiplying the above product with ?
  10. Why is a logarithmic scale used to measure self-information in entropy calculations?
  11. What indicates high entropy in a text corpus?
  12. What is the relationship between corpus entropy and language model perplexity?
- Quiz 3: Vector Space Models
  1. One limitation of the bag-of-words model is its inability to handle unknown words. Is there a method to enhance the bag-of-words model, allowing it to handle unknown words?
  2. Another limitation of the bag-of-words model is its inability to capture word order. Is there a method to enhance the bag-of-words model, allowing it to preserve the word order?
  3. If term frequency does not necessarily indicate semantic importance, what kind of significance does it convey?
  4. Stop words can be filtered either during the creation of the vocabulary dictionary or when generating the bag-of-words representations. Which approach is preferable and why?
  5. What are the implications when a term has a high document frequency?
  6. Why do both Euclidean Distance and Cosine Similarity metrics consider that D1 is more similar to D2 than to D3?
  7. Why is Cosine Similarity generally preferred over Euclidean Distance in most NLP applications?
  8. What potential problems might arise from the above data splitting approach, and what alternative method could mitigate these issues?
  9. Why do we use only the training set to collect the vocabulary?
  10. What are the primary weaknesses and limitations of the K-Nearest Neighbors (KNN) classification model when applied to document classification?
- Quiz 4: Distributional Semantics
  1. Assuming that your corpus has only the following three sentences, what context would influence the meaning of the word “chair” according to the distributional hypothesis?
     1. I sat on a chair.
     2. I will chair the meeting.
     3. I am the chair of my department.
  2. What are the drawbacks of using one-hot encoding to represent word vectors?
  3. Why is the performance of document_term_matrix() significantly slower than document_term_matrix_np()?
  4. What is the maximum number of topics that LSA can identify? What are the limitations associated with discovering topics using this approach?
  5. By discarding the first topic, you can observe document embeddings that are opposite (e.g., documents 4 and 5). What are the characteristics of these documents that are opposite to each other?
  6. is not transposed in L3 of the above code. Should we use S.transpose() instead?
  7. What role does the sigmoid function play in the logistic regression model?
  8. Under what circumstances would the bias b be negative in the above example? Additionally, when might neutral terms such as “this” or “movie” exhibit non-neutral weights?
  9. What is the role of the softmax function in the softmax regression model? How does it differ from the sigmoid function?
  10. Notice that the above equation for MLP does not include bias terms. How are biases handled in light of this formulation?
  11. What would be the weight assigned to the feature “truly” learned by softmax regression for the above example?
  12. What are the limitations of a multilayer perceptron?
  13. What are the advantages of using discriminative models like CBOW for constructing language models compared to generative models like n-gram models?
  14. What are the advantages of CBOW models compared to Skip-gram models, and vice versa?
  15. What are the implications of the weight matrices and in the Skip-gram model?
  16. What limitations does the Word2Vec model have, and how can these limitations be addressed?
- Quiz 5: Contextual Encoding
  1. How can document-level vector representations be derived from Word2Vec word embeddings?
  2. How did the embedding representation facilitate the adaptation of Neural Networks in Natural Language Processing?
  3. How are embedding representations for Natural Language Processing fundamentally different from ones for Computer Vision?
  4. The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?
  5. What are the disadvantages of using BPE-based tokenization instead of rule-based tokenization? What are the potential issues with the implementation of BPE above?
  6. How does each hidden state in an RNN encode information relevant to sequence tagging tasks?
  7. In text classification tasks, what specific information is captured by the final hidden state of an RNN?
  8. What are the advantages and limitations of implementing bidirectional RNNs for text classification and sequence tagging tasks?
  9. How does self-attention operate given an embedding matrix representing a document, where n is the number of words and d is the embedding dimension?
  10. Given an embedding matrix representing a document, how does multi-head attention function? What advantages does multi-head attention offer over self-attention?
  11. What are the outputs of each layer in the Transformer model? How do the embeddings learned in the upper layers of the Transformer differ from those in the lower layers?
  12. How is a Masked Language Model used in training a language model with a transformer?
  13. How can one train a document-level embedding using a transformer?
  14. What are the advantages of embeddings generated by BERT compared to those generated by Word2Vec?



| Quiz | n | GPT Mean | Manual Mean | Mean Abs Diff | Wilcoxon W | p-Value | Corr | Corr p-Value |
|---|---|---|---|---|---|---|---|---|
| 1. Text Processing | 53 | 1.85 | 1.93 | 0.12 | 121 | 0.62 | | |
| 2. Language Models | 49 | 2.08 | 2.18 | 0.10 | 0 | 0.94 | | |
| 3. Vector Space Models | 52 | 1.71 | 1.74 | 0.03 | 0 | 0.97 | | |
| 4. Distributional Semantics | 52 | 2.94 | 3.00 | 0.06 | 7.5 | 0.88 | | |
| 5. Contextual Encoding | 52 | 2.69 | 2.70 | 0.03 | 23.5 | 0.92 | | |
| Overall | 258 | 2.26 | 2.31 | 0.07 | 710 | 0.98 | | |
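
The columns above summarize agreement between paired GPT and manual quiz scores. As a minimal sketch of how such statistics can be computed (not the authors' analysis code), the example below uses only the Python standard library on made-up scores. The function names `mean_abs_diff`, `wilcoxon_w`, and `pearson_r` are hypothetical, and the use of Pearson correlation is an assumption, since the table header only says "Corr".

```python
import math
from statistics import mean


def mean_abs_diff(a, b):
    """Average absolute gap between paired scores."""
    return mean(abs(x - y) for x, y in zip(a, b))


def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic W = min(W+, W-).

    Assumed conventions: zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus)


def pearson_r(a, b):
    """Pearson correlation coefficient (assumed choice of 'Corr')."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up paired scores for illustration only; not data from the study.
gpt = [2.0, 1.5, 3.0, 2.5, 1.0, 2.0]
manual = [2.0, 2.0, 3.0, 2.0, 1.5, 3.0]

print("mean abs diff:", round(mean_abs_diff(gpt, manual), 3))
print("Wilcoxon W:", wilcoxon_w(gpt, manual))
print("Pearson r:", round(pearson_r(gpt, manual), 3))
```

This sketch returns only the W statistic. Obtaining a p-value requires the null distribution of W, which a statistics package such as SciPy (`scipy.stats.wilcoxon`) provides.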

| Section | GPT Mean | TA Mean | Mean Diff | Holm p-Value | Significant (Holm) |
|---|---|---|---|---|---|
| Abstract | 1.000 | 1.000 | 0.000 | N/A | All scores identical |
| Introduction | 0.986 | 0.979 | 0.007 | 1.0000 | False |
| Related Work | 0.943 | 0.943 | 0.000 | 1.0000 | False |
| Approach | 1.886 | 1.986 | −0.100 | 0.0332 | True |
| Results | 1.750 | 1.957 | −0.207 | 0.0102 | True |
| Conclusion | 1.000 | 1.000 | 0.000 | N/A | All scores identical |
| References | 0.500 | 0.500 | 0.000 | N/A | All scores identical |
| Format | 0.500 | 0.486 | 0.014 | 0.4719 | False |
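
The "Holm p-Value" column refers to the Holm step-down correction for multiple comparisons. The table reports only adjusted values, so the raw p-values below are invented for illustration, and the 0.05 threshold is an assumption. The helper name `holm_adjust` is hypothetical. A minimal sketch of the adjustment:

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The i-th smallest raw p-value (0-based rank i) is multiplied by
    (m - i), capped at 1, and adjusted values are forced to be
    non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values; not taken from the study.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = holm_adjust(raw)
alpha = 0.05  # assumed significance level
significant = [p < alpha for p in adjusted]

for r, a, s in zip(raw, adjusted, significant):
    print(f"raw={r:.2f}  holm={a:.2f}  significant={s}")
```

Sections where every score is identical yield no test statistic, which is consistent with the "N/A" entries above; such sections would simply be left out of the list passed to the correction.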

| Deduction Reason Category | GPT (n = 13) | Human TA (n = 20) |
|---|---|---|
| Insufficient quantitative results | 4 (30.8%) | 3 (15.0%) |
| Superficial Related Work | 3 (23.1%) | 5 (25.0%) |
| Missing limitations discussion | 2 (15.4%) | – |
| Formatting issue | 1 (7.7%) | 5 (25.0%) |
| Weak novelty justification | 1 (7.7%) | 2 (10.0%) |
| Lack of detail in methods | 1 (7.7%) | 1 (5.0%) |
| Weak introduction or motivation | 1 (7.7%) | – |
| Writing quality / clarity issue | – | 2 (10.0%) |
| Missing/inadequate conclusion | – | 2 (10.0%) |
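
The percentages in this table are each category's share of that grader's total deductions (13 for GPT, 20 for the human TA). As a quick arithmetic check, the sketch below recomputes them from the counts shown above; the helper name `shares` is hypothetical.

```python
def shares(counts):
    """Map each category to its percentage of the total, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts copied from the table; a dash in the table means no entry.
gpt_counts = {
    "Insufficient quantitative results": 4,
    "Superficial Related Work": 3,
    "Missing limitations discussion": 2,
    "Formatting issue": 1,
    "Weak novelty justification": 1,
    "Lack of detail in methods": 1,
    "Weak introduction or motivation": 1,
}
ta_counts = {
    "Insufficient quantitative results": 3,
    "Superficial Related Work": 5,
    "Formatting issue": 5,
    "Weak novelty justification": 2,
    "Lack of detail in methods": 1,
    "Writing quality / clarity issue": 2,
    "Missing/inadequate conclusion": 2,
}

for label, counts in (("GPT", gpt_counts), ("Human TA", ta_counts)):
    print(label, "total deductions:", sum(counts.values()))
    for name, pct in shares(counts).items():
        print(f"  {name}: {pct}%")
```

For example, 4 of 13 GPT deductions gives 30.8%, and 5 of 20 TA deductions gives 25.0%, matching the table.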

| Section | GPT | TA | GPT Feedback | TA Feedback |
|---|---|---|---|---|
| Abstract | 1.0 | 1.0 | The abstract is concise and informative, clearly summarizing the project’s objectives, intellectual merit, and broader societal impact. | |
| Introduction | 1.0 | 1.0 | The introduction clearly states the project goals and motivation, providing a well-defined problem statement. | |
| Related Work | 1.0 | 1.0 | Reviews relevant prior research and distinguishes the proposed work from existing solutions. | |
| Approach | 2.0 | 2.0 | Clearly explained and technically sound; detailed descriptions of steps and justified novelty. | |
| Results | 1.8 | 1.8 | Evaluation metrics and benchmarks are described properly, but more quantitative data would help. | Section 5 should be improved. For instance, the report states that “ADAPT was evaluated using internal benchmarks,” but no details are provided about these benchmarks or any results tables for the reader to reference. Including more detailed preliminary results would strengthen the section. (−0.2) |
| Conclusion | 1.0 | 1.0 | Summarizes findings and future work effectively, reflecting on project impact. | |
| References | 0.5 | 0.5 | References are complete, recent, and properly formatted. | |
| Format | 0.5 | 0.5 | The report is well-organized and easy to read. | |
| GPT Summary | - | - | ADAPT is an AI-powered platform designed to enhance email composition via large language models. It introduces a user-centric interface for context-aware revisions, integrating with email services for seamless communication. The project aims to improve accessibility and efficiency in digital communication, offering a novel approach to human-AI co-writing. | - |

| Section | GPT | TA | GPT Feedback | TA Feedback |
|---|---|---|---|---|
| Abstract | 1.0 | 1.0 | The abstract is concise and informative, clearly summarizing the project’s objectives, intellectual merit, and societal impact. | |
| Introduction | 1.0 | 1.0 | Clearly states the project goals and motivation with originality and well-defined context. | |
| Related Work | 1.0 | 1.0 | Reviews prior research and distinguishes the proposed work from existing solutions effectively. | |
| Approach | 1.8 | 2.0 | The proposed method is clearly explained and technically sound, with detailed descriptions of steps, models, and algorithms. However, the novelty claim could be more explicitly justified with specific examples of how ATAP’s approach differs from existing methods. | |
| Results | 1.8 | 1.8 | Preliminary results are presented with a clear evaluation plan, and metrics, datasets, and benchmarks are described properly. However, the discussion could benefit from more detailed analysis of the results and their implications. | The results show that the system achieved an overall precision of 92.31%, a perfect recall of 100%, and an F1 score of 96%. An error analysis (i.e., cases where system predictions were wrong) would have helped not only readers but also the team to improve. For instance, I see they tried to interpret the results, but including some examples (failed prediction cases) from the actual evaluation set would have been much more helpful. |
| Conclusion | 1.0 | 1.0 | Summarizes findings and reflects on impact effectively. | |
| References | 0.5 | 0.5 | References are complete, recent, and properly formatted. | |
| Format | 0.5 | 0.5 | Well-organized and easy to read, with proper formatting. | |
| GPT Summary | - | - | ATAP is an innovative application tracking tool that automates email retrieval, content classification, and status updates. It addresses inequities and improves usability for students. | - |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Byun, G.; Rajwal, S.; Choi, J.D. LLM-as-a-Grader: Practical Insights from Large Language Models for Short-Answer and Report Evaluation. Information 2026, 17, 505. https://doi.org/10.3390/info17050505

