Can ChatGPT Replace the Teacher in Assessment? A Review of Research on the Use of Large Language Models in Grading and Providing Feedback
Abstract
1. Introduction
- RQ1:
- In what educational contexts, subjects, and time periods have GenAI models been used for automated grading and feedback, and which concrete models have been deployed? Information on where and how GenAI is applied will provide us with a better understanding of whether there are any specific fields of education in which GenAI use is already more widespread, and whether some AI tools are more popular than others.
- RQ2:
- How are GenAIs operationalized in grading workflows, including their roles (automatic grader, feedback generator, co-grader), the types of student work they assess, and the prompting or other strategies used? Information on the details of GenAI implementation in assessment workflows is crucial, as these details are likely to be a significant factor affecting the quality and relevance of algorithmic outputs, as well as the degree of model autonomy.
- RQ3:
- How does the grading and feedback quality of GenAI compare with human assessment across different tasks, domains, and educational levels? This is a central question concerning the value of using GenAI in assessment contexts.
- RQ4:
- Which technical, task-related, and contextual factors influence the performance and reliability of GenAI-based grading and feedback? This question addresses the degree to which technical factors (e.g., prompting techniques), task-related factors (e.g., assessment of closed versus open-ended student responses), and contextual factors (e.g., class subject) affect the helpfulness of GenAI in assessment contexts.
- RQ5:
- What educational and pedagogical implications, opportunities, and risks are reported for integrating GenAI-based grading and feedback into assessment practice? This question helps gauge the validity of the expectations and hopes associated with the use of GenAI in education.
2. Methods
2.1. Eligibility Criteria and the Search
- Focused on students at any educational level, from elementary through higher education, including specific educational contexts such as studies covering only medical students or only English as a Foreign Language (EFL) students.
- Involved the use of the LLM type of generative AI, whether commercially available (like ChatGPT, Gemini, or Claude) or open-source or custom LLMs. The GenAI had to be used in the context of student grading (e.g., automatic scoring, response sheets, or descriptive grades). For instance, studies that used GenAI solely to provide feedback to students outside any grading context were not of interest.
- Were empirical. Other reviews and purely opinion-based pieces were excluded from our review.
- Were published in peer-reviewed journals or conference proceedings after the official launch of ChatGPT in late 2022.
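The eligibility criteria above can be sketched as a simple programmatic screen over candidate records. The `Record` fields below are hypothetical names introduced for illustration only; the review itself applied these criteria manually.

```python
from dataclasses import dataclass

@dataclass
class Record:
    year: int              # publication year
    empirical: bool        # excludes reviews and opinion pieces
    peer_reviewed: bool    # journal or conference proceedings
    uses_llm: bool         # LLM-type generative AI involved
    grading_context: bool  # feedback-only studies are excluded

def passes_screen(r: Record) -> bool:
    """Apply the eligibility criteria listed above (illustrative sketch)."""
    return (r.year > 2022 and r.empirical and r.peer_reviewed
            and r.uses_llm and r.grading_context)
```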
2.2. Screening and Quality Assessment
3. Results
3.1. RQ1: In What Educational Contexts, Subjects, and Time Periods Have GenAI Models Been Used for Automated Grading and Feedback, and Which Concrete Models Have Been Deployed?
- Computer science—7 studies (mainly automatic assessment of programming tasks and code);
- Foreign languages—7 studies (assessment of essays and written assignments in foreign languages);
- Mathematics—5 studies (calculation tasks, mathematical reasoning);
- Medicine—5 studies (exam questions, open-ended tasks in medical education);
- Engineering—3 studies (engineering projects, open-ended tasks);
- Pedagogy, social sciences, and humanities—single studies (assessment of essays, reflections, creativity).
3.2. RQ2: How Are GenAIs Operationalized in Grading Workflows, Including Their Roles, the Types of Student Work They Assess, and the Prompting or Other Strategies Used?
3.3. RQ3: How Does the Grading and Feedback Quality of GenAI Compare with Human Assessment Across Different Tasks, Domains, and Educational Levels?
3.4. RQ4: Which Technical, Task-Related, and Contextual Factors Influence the Performance and Reliability of GenAI-Based Grading and Feedback?
3.5. RQ5: What Educational and Pedagogical Implications, Opportunities, and Risks Are Reported for Integrating GenAI-Based Grading and Feedback into Assessment Practice?
4. Discussion
5. Conclusions
- Adopt common reporting standards and stronger sampling beyond convenience cohorts;
- Track model generations using preregistered protocols;
- Expand multilingual and modality-diverse evaluations;
- Connect grading metrics to downstream learning outcomes and equity impacts.
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Jukiewicz, M. How generative artificial intelligence transforms teaching and influences student wellbeing in future education. Front. Educ. 2025, 10, 1594572.
- Gill, S.S.; Xu, M.; Patros, P.; Wu, H.; Kaur, R.; Kaur, K.; Fuller, S.; Singh, M.; Arora, P.; Parlikad, A.K.; et al. Transformative effects of chatgpt on modern education: Emerging era of ai chatbots. Internet Things Cyber-Phys. Syst. 2024, 4, 19–23.
- Al Murshidi, G.; Shulgina, G.; Kapuza, A.; Costley, J. How understanding the limitations and risks of using chatgpt can contribute to willingness to use. Smart Learn. Environ. 2024, 11, 36.
- Castillo, A.; Silva, G.; Arocutipa, J.; Berrios, H.; Marcos Rodriguez, M.; Reyes, G.; Lopez, H.; Teves, R.; Rivera, H.; Arias-Gonzáles, J. Effect of chat gpt on the digitized learning process of university students. J. Namib. Stud. Hist. Politics Cult. 2023, 33, 1–15.
- Hung, J.; Chen, J. The benefits, risks and regulation of using chatgpt in chinese academia: A content analysis. Soc. Sci. 2023, 12, 380.
- Raile, P. The usefulness of chatgpt for psychotherapists and patients. Humanit. Soc. Sci. Commun. 2024, 11, 47.
- De Duro, E.S.; Improta, R.; Stella, M. Introducing counsellme: A dataset of simulated mental health dialogues for comparing llms like haiku, llamantino and chatgpt against humans. Emerg. Trends Drugs Addict. Health 2025, 5, 100170.
- Maurya, R.K.; Montesinos, S.; Bogomaz, M.; DeDiego, A.C. Assessing the use of chatgpt as a psychoeducational tool for mental health practice. Couns. Psychother. Res. 2025, 25, 12759.
- Memarian, B.; Doleck, T. Chatgpt in education: Methods, potentials, and limitations. Comput. Hum. Behav. Artif. Humans 2023, 1, 100022.
- Diab Idris, M.; Feng, X.; Dyo, V. Revolutionizing higher education: Unleashing the potential of large language models for strategic transformation. IEEE Access 2024, 12, 67738–67757.
- Hadi Mogavi, R.; Deng, C.; Juho Kim, J.; Zhou, P.; Kwon, Y.D.; Hosny Saleh Metwally, A.; Tlili, A.; Bassanelli, S.; Bucchiarone, A.; Gujar, S.; et al. Chatgpt in education: A blessing or a curse? A qualitative study exploring early adopters’ utilization and perceptions. Comput. Hum. Behav. Artif. Humans 2024, 2, 100027.
- Davis, R.O.; Lee, Y.J. Prompt: Chatgpt, create my course, please! Educ. Sci. 2024, 14, 24.
- Onal, S.; Kulavuz-Onal, D. A cross-disciplinary examination of the instructional uses of chatgpt in higher education. J. Educ. Technol. Syst. 2024, 52, 301–324.
- Li, M. The impact of chatgpt on teaching and learning in higher education: Challenges, opportunities, and future scope. In Encyclopedia of Information Science and Technology, 6th ed.; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 1–20.
- Lucas, H.C.; Upperman, J.S.; Robinson, J.R. A systematic review of large language models and their implications in medical education. Med Educ. 2024, 58, 1276–1285.
- Jukiewicz, M. The future of grading programming assignments in education: The role of chatgpt in automating the assessment and feedback process. Think. Ski. Creat. 2024, 52, 101522.
- Kıyak, Y.S.; Emekli, E. Chatgpt prompts for generating multiple-choice questions in medical education and evidence on their validity: A literature review. Postgrad. Med. J. 2024, 100, 858–865.
- Ali, M.M.; Wafik, H.A.; Mahbub, S.; Das, J. Gen Z and generative AI: Shaping the future of learning and creativity. Cogniz. J. Multidiscip. Stud. 2024, 4, 1–18.
- Ahmed, A.R. Navigating the integration of generative artificial intelligence in higher education: Opportunities, challenges, and strategies for fostering ethical learning. Adv. Biomed. Health Sci. 2025, 4, 1–2.
- Reimer, E.C. Examining the role of generative ai in enhancing social work education: An analysis of curriculum and assessment design. Soc. Sci. 2024, 13, 648.
- Tzirides, A.O.O.; Zapata, G.; Kastania, N.P.; Saini, A.K.; Castro, V.; Ismael, S.A.; You, Y.-l.; Santos, T.A.; Searsmith, D.; O’Brien, C.; et al. Combining human and artificial intelligence for enhanced ai literacy in higher education. Comput. Educ. Open 2024, 6, 100184.
- Wu, D.; Chen, M.; Chen, X.; Liu, X. Analyzing k-12 ai education: A large language model study of classroom instruction on learning theories, pedagogy, tools, and ai literacy. Comput. Educ. Artif. Intell. 2024, 7, 100295.
- Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A comprehensive overview of large language models. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–72.
- Ramesh, D.; Sanampudi, S.K. An automated essay scoring systems: A systematic literature review. Artif. Intell. Rev. 2022, 55, 2495–2527.
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The prisma 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
- Page, M.J.; Moher, D.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. Prisma 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ 2021, 372, n160.
- Rethlefsen, M.L.; Kirtley, S.; Waffenschmidt, S.; Ayala, A.P.; Moher, D.; Page, M.J.; Koffel, J.B.; Blunt, H.; Brigham, T.; Chang, S.; et al. Prisma-s: An extension to the prisma statement for reporting literature searches in systematic reviews. Syst. Rev. 2021, 10, 39.
- Hong, Q.N.; Fàbregues, S.; Bartlett, G.; Boardman, F.; Cargo, M.; Dagenais, P.; Gagnon, M.-P.; Griffiths, F.; Nicolau, B.; O’Cathain, A.; et al. The mixed methods appraisal tool (mmat) version 2018 for information professionals and researchers. Educ. Inf. 2018, 34, 285–291.
- Grévisse, C. Llm-based automatic short answer grading in undergraduate medical education. BMC Med Educ. 2024, 24, 1060.
- Jiang, Z.; Xu, Z.; Pan, Z.; He, J.; Xie, K. Exploring the role of artificial intelligence in facilitating assessment of writing performance in second language learning. Languages 2023, 8, 247.
- Li, Z.; Lloyd, S.; Beckman, M.; Passonneau, R.J. Answer-state recurrent relational network (asrrn) for constructed response assessment and feedback grouping. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3879–3891.
- Hackl, V.; Müller, A.E.; Granitzer, M.; Sailer, M. Is gpt-4 a reliable rater? Evaluating consistency in gpt-4’s text ratings. Front. Educ. 2023, 8, 1272229.
- Azaiz, I.; Deckarm, O.; Strickroth, S. Ai-enhanced auto-correction of programming exercises: How effective is gpt-3.5? arXiv 2023, arXiv:2311.10737.
- Acar, S.; Dumas, D.; Organisciak, P.; Berthiaume, K. Measuring original thinking in elementary school: Development and validation of a computational psychometric approach. J. Educ. Psychol. 2024, 116, 953–981.
- Schleifer, A.G.; Klebanov, B.B.; Ariely, M.; Alexandron, G. Transformer-based hebrew nlp models for short answer scoring in biology. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), Toronto, ON, Canada, 13 July 2023; pp. 550–555.
- Li, J.; Huang, J.; Wu, W.; Whipple, P.B. Evaluating the role of chatgpt in enhancing efl writing assessments in classroom settings: A preliminary investigation. Humanit. Soc. Sci. Commun. 2024, 11, 1268.
- Chang, L.-H.; Ginter, F. Automatic short answer grading for finnish with chatgpt. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 23173–23181.
- Dong, D. Tapping into the pedagogical potential of infinigochatic: Evidence from iwrite scoring and comments and lu & ai’s linguistic complexity analyzer. Arab. World Engl. J. (AWEJ) Spec. Issue Chatgpt 2024, 124–137.
- Thomas, D.R.; Lin, J.; Bhushan, S.; Abboud, R.; Gatz, E.; Gupta, S.; Koedinger, K.R. Learning and ai evaluation of tutors responding to students engaging in negative self-talk. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, Atlanta, GA, USA, 18–20 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 481–485.
- Gao, R.; Guo, X.; Li, X.; Narayanan, A.B.L.; Thomas, N.; Srinivasa, A.R. Towards scalable automated grading: Leveraging large language models for conceptual question evaluation in engineering. In Proceedings of the Machine Learning Research, FM-EduAssess at NeurIPS 2024 Workshop, Vancouver Convention Center, Vancouver, BC, Canada, 15 December 2024.
- Jackaria, P.M.; Hajan, B.H.; Mastul, A.-R.H. A comparative analysis of the rating of college students’ essays by chatgpt versus human raters. Int. J. Learn. Teach. Educ. Res. 2024, 23, 478–492.
- Jin, H.; Kim, Y.; Park, Y.S.; Tilekbay, B.; Son, J.; Kim, J. Using large language models to diagnose math problem-solving skills at scale. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, Atlanta, GA, USA, 18–20 July 2024; pp. 471–475.
- Li, J.; Jangamreddy, N.K.; Hisamoto, R.; Bhansali, R.; Dyda, A.; Zaphir, L.; Glencross, M. Ai-assisted marking: Functionality and limitations of chatgpt in written assessment evaluation. Australas. J. Educ. Technol. 2024, 40, 56–72.
- Cohn, C.; Hutchins, N.; Le, T.; Biswas, G. A chain-of-thought prompting approach with llms for evaluating students’ formative assessment responses in science. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 23182–23190.
- Zhang, M.; Johnson, M.; Ruan, C. Investigating sampling impacts on an llm-based ai scoring approach: Prediction accuracy and fairness. J. Meas. Eval. Educ. Psychol. 2024, 15, 348–360.
- Shamim, M.S.; Zaidi, S.J.A.; Rehman, A. The revival of essay-type questions in medical education: Harnessing artificial intelligence and machine learning. J. Coll. Phys. Surg. Pak. 2024, 34, 595–599.
- Sreedhar, R.; Chang, L.; Gangopadhyaya, A.; Shiels, P.W.; Loza, J.; Chi, E.; Gabel, E.; Park, Y.S. Comparing scoring consistency of large language models with faculty for formative assessments in medical education. J. Gen. Intern. Med. 2025, 40, 127–134.
- Lee, U.; Kim, Y.; Lee, S.; Park, J.; Mun, J.; Lee, E.; Kim, H.; Lim, C.; Yoo, Y.J. Can we use gpt-4 as a mathematics evaluator in education? Exploring the efficacy and limitation of llm-based automatic assessment system for open-ended mathematics question. Int. J. Artif. Intell. Educ. 2024, 35, 1560–1596.
- Grandel, S.; Schmidt, D.C.; Leach, K. Applying large language models to enhance the assessment of parallel functional programming assignments. In Proceedings of the 1st International Workshop on Large Language Models for Code, LLM4Code ’24, Lisbon, Portugal, 20 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 102–110.
- Martin, P.P.; Graulich, N. Navigating the data frontier in science assessment: Advancing data augmentation strategies for machine learning applications with generative artificial intelligence. Comput. Educ. Artif. Intell. 2024, 7, 100265.
- Menezes, T.; Egherman, L.; Garg, N. Ai-grading standup updates to improve project-based learning outcomes. In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, Milan, Italy, 8–10 July 2024; pp. 17–23.
- Zhang, C.; Hofmann, F.; Plößl, L.; Gläser-Zikuda, M. Classification of reflective writing: A comparative analysis with shallow machine learning and pre-trained language models. Educ. Inf. Technol. 2024, 29, 21593–21619.
- Zhuang, X.; Wu, H.; Shen, X.; Yu, P.; Yi, G.; Chen, X.; Hu, T.; Chen, Y.; Ren, Y.; Zhang, Y.; et al. Toree: Evaluating topic relevance of student essays for chinese primary and middle school education. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 5749–5765.
- Sonkar, S.; Ni, K.; Tran Lu, L.; Kincaid, K.; Hutchinson, J.S.; Baraniuk, R.G. Automated long answer grading with ricechem dataset. In Proceedings of the International Conference on Artificial Intelligence in Education, Recife, Brazil, 8–12 July 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 163–176.
- Huang, C.-W.; Coleman, M.; Gachago, D.; Van Belle, J.-P. Using chatgpt to encourage critical ai literacy skills and for assessment in higher education. In Proceedings of the ICT Education, Gauteng, South Africa, 19–21 July 2023; Van Rensburg, H.E., Snyman, D.P., Drevin, L., Drevin, G.R., Eds.; Springer: Cham, Switzerland, 2024; pp. 105–118.
- Lu, Q.; Yao, Y.; Xiao, L.; Yuan, M.; Wang, J.; Zhu, X. Can chatgpt effectively complement teacher assessment of undergraduate students’ academic writing? Assess. Eval. High. Educ. 2024, 49, 616–633.
- Agostini, D.; Picasso, F.; Ballardini, H. Large language models for the assessment of students’ authentic tasks: A replication study in higher education. In Proceedings of the CEUR Workshop Proceedings, Bolzano, Italy, 25–28 November 2024; Volume 3879.
- Tapia-Mandiola, S.; Araya, R. From play to understanding: Large language models in logic and spatial reasoning coloring activities for children. AI 2024, 5, 1870–1892.
- Lin, N.; Wu, H.; Zheng, W.; Liao, X.; Jiang, S.; Yang, A.; Xiao, L. Indocl: Benchmarking indonesian language development assessment. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 4873–4885.
- Quah, B.; Zheng, L.; Sng, T.J.H.; Yong, C.W.; Islam, I. Reliability of chatgpt in automated essay scoring for dental undergraduate examinations. BMC Med Educ. 2024, 24, 962.
- Henkel, O.; Hills, L.; Boxer, A.; Roberts, B.; Levonian, Z. Can large language models make the grade? An empirical study evaluating llms ability to mark short answer questions in k-12 education. In Proceedings of the Eleventh ACM Conference on Learning @ Scale, Atlanta, GA, USA, 18–20 July 2024; pp. 300–304.
- Latif, E.; Zhai, X. Fine-tuning chatgpt for automatic scoring. Comput. Educ. Artif. Intell. 2024, 6, 100210.
- Xiao, X.; Li, Y.; He, X.; Fang, J.; Yan, Z.; Xie, C. An assessment framework of higher-order thinking skills based on fine-tuned large language models. Expert Syst. Appl. 2025, 272, 126531.
- Fokides, E.; Peristeraki, E. Comparing chatgpt’s correction and feedback comments with that of educators in the context of primary students’ short essays written in english and greek. Educ. Inf. Technol. 2025, 30, 2577–2621.
- Flodén, J. Grading exams using large language models: A comparison between human and ai grading of exams in higher education using chatgpt. Br. Educ. Res. J. 2025, 51, 201–224.
- Gonzalez-Maldonado, D.; Liu, J.; Franklin, D. Evaluating gpt for use in k-12 block based cs instruction using a transpiler and prompt engineering. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1, SIGCSETS 2025, Pittsburgh, PA, USA, 26 February–1 March 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 388–394.
- Qiu, H.; White, B.; Ding, A.; Costa, R.; Hachem, A.; Ding, W.; Chen, P. Stella: A structured grading system using llms with rag. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: New York, NY, USA, 2024; pp. 8154–8163.
- Gandolfi, A. Gpt-4 in education: Evaluating aptness, reliability, and loss of coherence in solving calculus problems and grading submissions. Int. J. Artif. Intell. Educ. 2025, 35, 367–397.
- Yavuz, F.; Çelik, Ö.; Yavaş Çelik, G. Utilizing large language models for efl essay grading: An examination of reliability and validity in rubric-based assessments. Br. J. Educ. Technol. 2025, 56, 150–166.
- Manning, J.; Baldwin, J.; Powell, N. Human versus machine: The effectiveness of chatgpt in automated essay scoring. Innov. Educ. Teach. Int. 2025, 62, 1500–1513.
- Zhu, M.; Zhang, K. Artificial intelligence for computer science education in higher education: A systematic review of empirical research published in 2003–2023. In Technology, Knowledge and Learning; Springer: Berlin/Heidelberg, Germany, 2025; pp. 1–25.
- Pereira, A.F.; Mello, R.F. A systematic literature review on large language models applications in computer programming teaching evaluation process. IEEE Access 2025, 13, 113449–113460.
- Albadarin, Y.; Saqr, M.; Pope, N.; Tukiainen, M. A systematic literature review of empirical research on chatgpt in education. Discov. Educ. 2024, 3, 60.
- Lo, C.K.; Yu, P.L.H.; Xu, S.; Ng, D.T.K.; Jong, M.S.-y. Exploring the application of chatgpt in esl/efl education and related research issues: A systematic review of empirical studies. Smart Learn. Environ. 2024, 11, 50.
- Dikilitas, K.; Klippen, M.I.F.; Keles, S. A systematic rapid review of empirical research on students’ use of chatgpt in higher education. Nord. J. Syst. Rev. Educ. 2024, 2, 103–125.
- Jukiewicz, M. A systematic comparison of Large Language Models for automated assignment assessment in programming education: Exploring the importance of architecture and vendor. arXiv 2025, arXiv:2509.26483.

| Category | Search Terms/Filters |
|---|---|
| Inclusion: AI Technologies | “ChatGPT” OR “Claude” OR “Gemini” OR “Copilot” OR “Deepseek” OR “generative AI” OR “generative artificial intelligence” OR “large language model*” OR “LLM*” OR “text-to-text model*” OR “multimodal AI” OR “transformer-based model*” |
| Inclusion: Education Contexts | “adult education” OR “college*” OR “elementary school*” OR “graduate” OR “higher education” OR “high school*” OR “K-12” OR “kindergarten*” OR “middle school*” OR “postgrad*” OR “primary school*” OR “undergrad*” OR “vocational education” |
| Inclusion: Assessment Contexts | “assessment” OR “grading” OR “scoring” OR “feedback” OR “AI grading” OR “automated assessment” OR “automated scoring” OR “automated grading” OR “automated feedback” OR “automatic assessment” OR “automatic scoring” OR “automatic grading” OR “automatic feedback” OR “machine-generated feedback” |
| Exclusion: Study Types | NOT “bibliometrics analysis” OR “literature review” OR “rapid review” OR “scoping review” OR “systematic review” |
| Exclusion: Time Frame | AND PUBYEAR > 2022 |
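The search-string blocks in the table above combine quoted terms with OR within a category and AND across categories. A minimal sketch of how such a query string can be assembled programmatically follows; the term lists are abbreviated excerpts of the table, and the Scopus-style syntax shown is an assumption about the target database.

```python
def or_block(terms):
    """Join quoted search terms with OR and wrap the block in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Abbreviated excerpts of the categories in the table above.
ai_terms = ["ChatGPT", "Claude", "Gemini", "generative AI", "large language model*"]
edu_terms = ["higher education", "K-12", "undergrad*"]
assess_terms = ["grading", "automated assessment", "automatic scoring"]
excluded = ["systematic review", "scoping review"]

query = (
    " AND ".join(or_block(t) for t in (ai_terms, edu_terms, assess_terms))
    + " AND NOT " + or_block(excluded)
    + " AND PUBYEAR > 2022"
)
```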
| Study | Level | Field | LLM Type | Task Type | Key Findings |
|---|---|---|---|---|---|
| [29] | university or college | medicine | gpt4, gemini 1.0 pro | short answers | GPT-4 assigns significantly lower grades than the teacher, but it rarely incorrectly gives maximum scores to incorrect answers (low false positives). Gemini 1.0 Pro gives grades similar to those given by teachers—their grades did not differ significantly from human scores. Both GPT-4 and Gemini achieve moderate agreement with teacher assessments, but GPT-4 demonstrates high precision in identifying fully correct answers. |
| [30] | secondary school | foreign language | gpt3.5, gpt4, iFLYTEK, and Baidu Cloud | essay | LLM models (GPT-4, GPT-3.5, iFLYTEK) achieved results very close to human raters in identifying correct and incorrect segments of text. Their accuracy at the T-unit level (syntactic segments) was about 81%, and at the sentence level 71–77%. GPT-4 stood out with the highest precision (fewest false positives), meaning it rarely erroneously marked incorrect answers as correct. |
| [31] | university or college | math | other | calculation tasks | Even with advanced prompting (zero/few-shot), GPT-3.5 performs noticeably worse than AsRRN. GPT-3.5 improves when reference examples are provided, but it does not reach the effectiveness of the relational model. |
| [32] | university or college | economics | gpt4 | short answers | GPT-4 grades very consistently and reliably: agreement indices (ICC) for content and style scores range from 0.94 to 0.99, indicating nearly “perfect” consistency between repeated ratings on different sets and at different times. |
| [33] | university or college | computer science | gpt3.5 | coding tasks | The model correctly classified 73% of solutions as correct or incorrect. Its precision for correct solutions was 80%, and its specificity in detecting incorrect solutions was as high as 88%. However, recall—the ability to identify all correct solutions—was only 57%. In 89–100% of cases, AI-generated feedback was personalized, addressing the specific student’s code. This was much more detailed than traditional e-assessment systems, which usually rely on simple tests. |
| [34] | primary school | na | gpt3 | other | AI grades were in very high agreement with the ratings of a panel of human experts (r = 0.79–0.91 depending on the task type), meaning that AI can generate scores comparable to subjective human judgments. The model demonstrates high internal consistency of ratings (reliability at H = 0.82–0.89 for different subscales and overall originality), making AI a reliable tool in educational applications. |
| [35] | secondary school | biology | bert | short answers | The alephBERT-based system clearly outperforms classic CNN models in automatic grading of students’ short biology answers, both in overall and detailed assessment (for each category separately). The model showed surprisingly good performance even in “zero-shot” mode—that is, when grading answers to new questions (about the same biological concepts) that it had not seen during training. This means that AI can be used to automatically grade new tasks, as long as they relate to similar concepts/rubrics. |
| [36] | university or college | foreign language | gpt4, gpt3.5 | essay | ChatGPT 4 achieved higher reliability coefficients for holistic assessment (both G-coefficient and Phi-coefficient) than academic teachers evaluating English as a Foreign Language (EFL) essays. ChatGPT 3.5 had lower reliability than the teachers but was still a supportive tool. In practice, ChatGPT 4 was more consistent and predictable in its scoring than humans. Both ChatGPT 3.5 and 4 generated more feedback, and more relevant feedback, on language, content, and text organization compared to the teachers. |
| [37] | university or college | mixed | gpt4, gpt3.5 | short answers | GPT-4 performs well in grading short student answers (QWK ≥ 0.6, indicating good agreement with human scores) in 44% of cases in the one-shot setting, significantly better than GPT-3.5 (21%). The best results occur for simple, narrowly defined questions and short answers; open and longer, more elaborate answers are more challenging. ChatGPT (especially GPT-4) tends to give higher grades than teachers—less often awarding failing grades and more often scoring higher than human examiners. |
| [38] | university or college | foreign language | Infinigo ChatIC | essay | Texts “polished” by ChatGPT (infinigoChatIC) receive much higher scores in automated grading systems (iWrite) than texts written independently or with classic machine translators. The average score increase was about 17%. AI effectively eliminates grammatical errors and significantly increases the diversity and sophistication of vocabulary (e.g., more complex and precise expressions). This results in higher scores for both language and technical/structural aspects of the text. |
| [39] | university or college | na | gpt4 | multiple choice tasks | Generative AI (GPT-4) achieved high agreement with human expert ratings when grading open tutor responses in training scenarios (F1 = 0.85 for selecting the best strategy, F1 = 0.83 for justifying the approach). The Kappa coefficient indicates good agreement (0.65–0.69), although the AI scored somewhat more strictly than humans and was more sensitive to answer length (short answers were more often marked incorrect). Overall, AI grading was slightly more demanding than human grading (average scores were about 25% lower when graded by AI). |
| [40] | university or college | engineering | gpt4o | short answers | GPT-4o (ChatGPT) showed strong agreement with teaching assistants (TAs) in grading conceptual questions in mechanical engineering, especially where grading criteria were clear and unambiguous (Spearman’s ρ exceeded 0.6 in 7 out of 10 tasks, up to 0.94; low RMSE). |
| [41] | k12 | mixed | gpt3.5 | essay | Essay scores assigned by ChatGPT 3.5 show poor agreement with those assigned by experienced teachers. Agreement was low in three out of four criteria (content, organization, language) and only moderate for mechanics (e.g., punctuation, spelling). On average, ChatGPT 3.5 gave slightly higher grades than humans in each category, but the differences were not statistically significant (except for mechanics, where AI was noticeably more “lenient”). |
| [42] | primary school | math | gpt3.5, gpt4 | calculation tasks | A system based on GPT-3.5 (fine-tuned) achieved higher agreement with teacher ratings than “base” GPT-3.5 and GPT-4, especially for clear, unambiguous criteria. |
| [43] | university or college | computer science, medicine | gpt4 | essay, coding tasks | ChatGPT grading is more consistent and predictable for high-quality work. With weaker or average work, there is greater variability in scores (wider spread), especially in programming tasks. |
| [44] | secondary school | mixed | gpt4 | short answers | GPT-4 using the “chain-of-thought” (CoT) method and few-shot prompting achieved high agreement with human examiners when grading short open-ended answers in science subjects (QWK up to 0.95; Macro F1 up to 1.00). The CoT method allowed AI not only to assign a score but also to generate justification (explanation of the grading decision) in accordance with the criteria (rubrics). |
| [45] | k12 | mixed | bert | short answers | LLM models (here: BERT) achieve high agreement with human ratings in short open-ended response tasks in standardized educational tests (QWK ≥ 0.7 for 11 out of 13 tasks). |
| [46] | university or college | medicine | gpt3.5 | essay | ChatGPT 3.5 could grade and classify medical essays according to established criteria, while also generating detailed, individualized feedback for students. |
| [47] | university or college | medicine | gpt3.5 | essay | ChatGPT 3.5 achieves comparable grading performance to teachers in formative tasks (including critical analysis of medical literature). Overall agreement between ChatGPT and teachers was 67%, and AUCPR (area under the precision–recall curve) was 0.69 (range: 0.61–0.76). The reliability of grading (Cronbach’s alpha) was 0.64, considered acceptable for formative assessment. |
| [48] | primary school | math | gpt4 | calculation tasks | GPT-4, used as an Automated Assessment System (AAS) for open-ended math responses, achieved high agreement with the ratings of three expert teachers for most questions, demonstrating the potential of AI to automate the grading of complex assignments, not just short answers. |
| [49] | university or college | computer science | gpt4 | coding tasks | ChatGPT-4, used as part of the semi-automatic GreAIter tool, achieves very high agreement with human examiners (98.2% accuracy, 100% recall, meaning no real errors in code are missed). Despite this high effectiveness, AI can be overly critical (precision 59.3%)—it generates more false alarms (incorrectly detected errors) than human examiners. |
| [50] | university or college | chemistry | gpt3.5, bard | short answers | Generative AI (e.g., ChatGPT, Bard) can significantly improve the grading of open-ended student work in natural sciences, especially thanks to data augmentation and more precise mapping of students’ reasoning levels. |
| [51] | university or college | computer science | gpt3 | short answers | AI (GPT-3, after fine-tuning) achieved very high agreement with human ratings in classifying the specificity of statements (97% accuracy) and sufficient accuracy in assessing effort (83%). Zero-shot models (without training) were much less accurate—automatic grading performance increases after fine-tuning on human-graded data. |
| [52] | university or college | education | bert | essay | Models pretrained on large datasets, such as BigBird and Longformer, achieved the highest effectiveness in classifying the level of reflection in student assignments (77.2% and 76.3% accuracy, respectively), clearly outperforming traditional “shallow” ML models (e.g., SVM, Random Forest), which did not exceed 60% accuracy. |
| [53] | k12 | mixed | gpt3.5turbo, claude3.5, gpt4turbo, spark | essay | Generative models such as GPT-4 often incorrectly assess the topic relevance of essays and generate feedback that is too general, impractical, or even misleading. This phenomenon results partly from “sycophancy” (excessive leniency) and LLM hallucinations—the models tend to give overly optimistic evaluations. |
| [54] | university or college | chemistry | bert, gpt3.5, gpt4 | other | Language models (LLMs, e.g., GPT-4) perform worse in grading long, multi-faceted, factual answers than in grading short responses—even with advanced techniques and rubrics, F1 for GPT-4 was only 0.69 (for short responses, 0.74). Using a detailed rubric (i.e., breaking grading into many small criteria) improves AI’s results by 9% (accuracy) and 15% (F1) compared to classical, general scoring. Rubrics help the model better identify which aspects of the answer are correct or incorrect. GPT-4 achieves the highest results among the tested models (F1 = 0.689, accuracy 70.9%) but still does not perform as well as for short answers and often misses nuances in complex, multi-faceted student answers. |
| [55] | university or college | computer science | other | essay | ChatGPT should not be used as a standalone grading tool. AI-generated grades are unpredictable, can be inconsistent, and do not always match the criteria used by teachers. |
| [56] | university or college | education | other | essay | ChatGPT 3.5 achieved moderate or good agreement with teacher grades (ICC from 0.6 to 0.8 in various categories), meaning it can effectively score student work based on the same criteria as humans. ChatGPT generated more feedback, often giving general suggestions and summaries, but less frequently providing specific solutions and explanations. AI feedback was less precise, less detailed, and less emotionally supportive, but more “inspiring” for some students. Students implemented teacher suggestions more often (80.2%) than ChatGPT’s (59.9%), and AI feedback was more frequently rejected (40% of cases). |
| [57] | university or college | mixed | gpt4o, claude3.5, gemini1.5pro | other | The newest models (Claude 3.5 Sonnet, GPT-4o, Qwen2 72B) can be a valuable support in grading complex, open-ended tasks (authentic assignments) in higher education, using rubric-based assessment. GPT-4o and Claude 3.5 Sonnet produce results closest to human grading in terms of score distribution and consistency. |
| [58] | primary school | math | gpt4o | calculation tasks | GPT-4o can effectively support the assessment of tasks requiring logical and spatial reasoning in children, provided that advanced prompting techniques are used. Simple prompting (zero-shot/few-shot) is insufficient—grading effectiveness by LLMs increases significantly when using advanced methods, such as visualization of thought and self-consistency (multiple generations with majority voting). In the best settings, GPT-4o achieved 93–100% correct grades on tasks testing logical and spatial skills, such as coloring objects according to math-logical-spatial rules. |
| [59] | university or college | foreign language | bert, gpt3.5turbo | essay | ChatGPT (GPT-3.5-turbo) performs much worse than dedicated BERT or SIAM models in the task of assessing whether a pair of student essays represents progression (improvement) or regression in language skills. ChatGPT’s effectiveness in this task was significantly lower (accuracy ∼56%) than the best models based on BERT architecture. |
| [60] | university or college | medicine | gpt4 | essay | The results obtained by ChatGPT strongly correlate with human ratings, especially for questions with a clearer scope. AI enables immediate grading of large numbers of assignments and provides individual, detailed, rubric-based feedback—which is hard for a single teacher to deliver to large groups. |
| [61] | k12 | mixed | gpt4, gpt3.5 | short answers | The GPT-4 model using few-shot prompting achieved human-level agreement (Cohen’s Kappa 0.70), very close to expert level (Kappa 0.75). AI grades were stable regardless of subject (history/science), task difficulty, and student age. |
| [62] | secondary school | na | bert, gpt3.5turbo | multiple choice tasks | Tuned ChatGPT outperforms the state-of-the-art BERT model in both multi-label and multi-class tasks. The average grading performance improvement for ChatGPT was about 9.1% compared to BERT, and for more complex (multi-class) tasks even over 10%. The “base” version of GPT-3.5 is not effective enough—fine-tuning on subject-, class-, and task-specific data is necessary. |
| [63] | university or college | computer science | bert | coding tasks | A fine-tuned LLM can automatically assess student work in terms of higher-order thinking skills (HOTS) without expert involvement, with greater accuracy and scalability than traditional systems. |
| [64] | primary school | foreign language | gpt4turbo | essay | Scores assigned by ChatGPT 3.5 were generally comparable to those given by academic staff. The average exact agreement between AI and humans was 67%. For individual criteria, agreement ranged from 59% to 81%. Using ChatGPT for grading reduced grading time fivefold (from about 7 min to 2 min per assignment), potentially saving hundreds of teacher work hours. |
| [65] | university or college | engineering | gpt3.5 | short answers | ChatGPT 3.5 more often assigns average grades (e.g., C, D, E) and less frequently than humans gives the highest or lowest marks. ChatGPT is consistent—repeated grading of the same work yielded similar results. |
| [66] | k12 | computer science | other | coding tasks | The average effectiveness of the GPT autograder for simple tasks was 91.5%, and for complex ones only 51.3%. There was a strong correlation between LLM and official autograder scores (r = 0.84), but GPT-4 tended to underestimate scores, i.e., it gave students lower results than they actually deserved (a prevalence of false negatives). |
| [67] | university or college | biology | gpt4 | short answers | SteLLA achieved “substantial agreement” with human grading—Cohen’s Kappa was 0.67 (compared to 0.83 for two human graders), and raw agreement was 0.84 (versus 0.92 for humans). This means the automated grading system is close to human-level agreement, though not equal to it yet. |
| [68] | university or college | math | gpt4, gpt3.5 | calculation tasks | The average grade given by GPT-4 for student solutions is very close to grades assigned by human examiners, but detailed accuracy is lower—random errors occur, equivalent solutions can be missed, and calculation mistakes appear. |
| [69] | university or college | foreign language | gpt4, bard | essay | The highest reliability was obtained for the fine-tuned ChatGPT version (ICC = 0.972), with slightly lower reliability for ChatGPT Default (ICC = 0.947) and Bard (ICC = 0.919). The reliability of LLMs exceeded even some human examiners in terms of repeatability. The best match between LLM and human examiners was for “objective” criteria such as grammar and mechanics. Both ChatGPT and Bard almost perfectly matched human scores for grammar and mechanics (punctuation, spelling). |
| [70] | university or college | engineering, foreign language | gpt4, gpt3.5 | essay | ChatGPT (both 3.5 and 4) grades student essays more “generously” than human examiners. The average grades assigned by ChatGPT are higher than those given by teachers, regardless of model version. AI often gives higher grades even if the work quality is not high, leading to a “ceiling effect” (grades clustering near the top end). The agreement between human examiners was high (“good”), while for ChatGPT-3.5 it was only “moderate” and for ChatGPT-4 even “low”. |
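Many of the rows above report chance-corrected agreement between model and human grades as Cohen’s Kappa or quadratic weighted kappa (QWK, e.g., [44,45,61,67]). As a reference for how such figures are obtained, the sketch below implements the generic weighted-kappa formula using only the standard library; it is an illustration of the metric itself, not code taken from any of the reviewed studies.

```python
def weighted_kappa(rater_a, rater_b, categories, scheme="quadratic"):
    """Chance-corrected agreement between two raters on ordinal grades.

    scheme="quadratic" yields QWK; scheme="unweighted" yields plain
    Cohen's kappa (0/1 disagreement weights).
    """
    n = len(rater_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    def w(i, j):
        # Disagreement weight: grows with distance for QWK.
        if scheme == "quadratic":
            return (i - j) ** 2
        return 0.0 if i == j else 1.0

    # Observed joint proportions and per-rater marginals.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1.0 / n
    pa = [sum(row) for row in obs]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    # Kappa = 1 - (observed disagreement) / (chance disagreement).
    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - d_obs / d_exp
```

Perfect agreement gives 1.0 under either scheme; QWK penalizes a two-step grade discrepancy four times as heavily as a one-step one, which is why it is the usual choice for ordinal rubric scores.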
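Several rows also report Macro F1 (e.g., [44,54]). Macro-averaging computes F1 per grade label and averages the results with equal weight, so rare labels count as much as frequent ones. The function below is a generic stdlib illustration of that computation, not drawn from the reviewed papers.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: per-label F1 scores averaged with equal weight."""
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```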
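Study [58] above credits self-consistency (multiple generations with majority voting) for GPT-4o’s jump in grading accuracy. The voting mechanism itself is simple; the sketch below assumes a hypothetical `grade_once()` callable standing in for one sampled LLM grading run (one completion parsed into a grade label).

```python
from collections import Counter

def self_consistent_grade(grade_once, n_samples=5):
    """Sample several independent gradings and keep the majority verdict.

    grade_once: hypothetical callable returning one grade per call,
    e.g., one sampled LLM completion parsed into a grade label.
    Returns the majority grade and its vote share (a rough confidence cue).
    """
    votes = Counter(grade_once() for _ in range(n_samples))
    grade, count = votes.most_common(1)[0]
    return grade, count / n_samples
```

A low vote share can also serve as a flag for routing the submission to a human grader, which fits the co-grader role several of the studies above recommend.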
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Jukiewicz, M.; Wyrwa, M. Can ChatGPT Replace the Teacher in Assessment? A Review of Research on the Use of Large Language Models in Grading and Providing Feedback. Appl. Sci. 2026, 16, 680. https://doi.org/10.3390/app16020680

