A Comprehensive Review on Automated Grading Systems in STEM Using AI Techniques
Abstract
1. Introduction
1.1. Manual vs. Automated Grading
1.2. AI vs. Non-AI Grading
1.3. Evaluating AGSs
2. Methods
1. What are the key AI techniques used in AGSs?
2. How are AGSs being implemented in STEM education and what are their implications?
3. What are some challenges faced during implementation and the corresponding solutions?
3. Key AI Techniques Used in Automated Grading
3.1. Feature Engineering
3.2. Machine Learning
3.3. GenAI Approaches
4. AGS Implementation
4.1. Phase 1: Dataset Creation—Expert Annotation and Ground Truth Establishment
4.2. Phase 2: Data Preprocessing—Multimodal Feature Engineering
4.3. Phase 3: Machine Learning—Model Training and Optimization
4.4. Phase 4: Application and Deployment—System Integration and User Acceptance
5. Challenges and Solutions
- Student Trust and Perceived Fairness: Concerns about the accuracy and fairness of automated assessments
- Teacher Acceptance: Resistance due to algorithm aversion and fear of replacement
- Vulnerability to System Gaming: Students exploiting system weaknesses for higher grades
- Over-reliance on Immediate Feedback: Encouraging superficial learning approaches
- Limited Expert-Scored Data: Insufficient training datasets for algorithm development
- Bias in Algorithmic Assessment: Unfair treatment of minority groups and disabled students
- Institutional Readiness: Infrastructure and support requirements
5.1. Student Trust and Perceived Fairness
5.2. Teacher Acceptance
5.3. Vulnerability to System Gaming
5.4. Over-Reliance on Immediate Feedback
5.5. Limited Availability of Expert-Scored Data
5.6. Bias in Algorithmic Assessment
5.7. Institutional Readiness
6. Future Outlook
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Haller, S.; Aldea, A.; Seifert, C.; Strisciuglio, N. Survey on Automated Short Answer Grading with Deep Learning: From Word Embeddings to Transformers. arXiv 2022, arXiv:2204.03503. [Google Scholar] [CrossRef]
- Magalhães, P.; Ferreira, D.; Cunha, J.; Rosário, P. Online vs traditional homework: A systematic review on the benefits to students’ performance. Comput. Educ. 2020, 152, 103869. [Google Scholar] [CrossRef]
- Zupanc, K.; Bosnic, Z. Advances in the Field of Automated Essay Evaluation. Informatica 2015, 39, 383–395. [Google Scholar]
- Staubitz, T.; Klement, H.; Renz, J.; Teusner, R.; Meinel, C. Towards practical programming exercises and automated assessment in Massive Open Online Courses. In Proceedings of the 2015 IEEE International Conference on Teaching, Assessment, and Learning for Engineering (TALE), Zhuhai, China, 10–12 December 2015; pp. 23–30. [Google Scholar] [CrossRef]
- Zhu, M.; Bonk, C.J.; Sari, A.R. Instructor Experiences Designing MOOCs in Higher Education: Pedagogical, Resource, and Logistical Considerations and Challenges. Online Learn. 2018, 22, 203–241. [Google Scholar]
- Kumar, D.; Haque, A.; Mishra, K.; Islam, F.; Kumar Mishra, B.; Ahmad, S. Exploring the Transformative Role of Artificial Intelligence and Metaverse in Education: A Comprehensive Review. Metaverse Basic Appl. Res. 2023, 2, 55. [Google Scholar] [CrossRef]
- Verma, M. Artificial Intelligence and Its Scope in Different Areas with Special Reference to the Field of Education. Int. J. Adv. Educ. Res. 2018, 3, 5–10. [Google Scholar]
- Barana, A.; Marchisio, M. Ten Good Reasons to Adopt an Automated Formative Assessment Model for Learning and Teaching Mathematics and Scientific Disciplines. Procedia-Soc. Behav. Sci. 2016, 228, 608–613. [Google Scholar] [CrossRef]
- Ahmad, S.F.; Alam, M.M.; Rahmat, M.K.; Mubarik, M.S.; Hyder, S.I. Academic and Administrative Role of Artificial Intelligence in Education. Sustainability 2022, 14, 1101. [Google Scholar] [CrossRef]
- Gordon, C.L.; Lysecky, R.; Vahid, F. The Rise of Program Auto-grading in Introductory CS Courses: A Case Study of zyLabs. In Proceedings of the 2021 ASEE Virtual Annual Conference Content Access, Online, 26–29 July 2021; Available online: https://peer.asee.org/37887 (accessed on 26 July 2025).
- Combéfis, S. Automated Code Assessment for Education: Review, Classification and Perspectives on Techniques and Tools. Software 2022, 1, 3–30. [Google Scholar] [CrossRef]
- Valenti, S.; Neri, F.; Cucchiarelli, A. An Overview of Current Research on Automated Essay Grading. J. Inf. Technol. Educ. Res. 2003, 2, 319–330. [Google Scholar] [CrossRef]
- Dennis, I.; Newstead, S.E.; Wright, D.E. A new approach to exploring biases in educational assessment. Br. J. Psychol. 1996, 87, 515–534. [Google Scholar] [CrossRef] [PubMed]
- Barra, E.; López-Pernas, S.; Alonso, A.; Sánchez-Rada, J.F.; Gordillo, A.; Quemada, J. Automated Assessment in Programming Courses: A Case Study during the COVID-19 Era. Sustainability 2020, 12, 7451. [Google Scholar] [CrossRef]
- Manzoor, H.; Naik, A.; Shaffer, C.A.; North, C.; Edwards, S.H. Auto-Grading Jupyter Notebooks. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, SIGCSE ’20, Portland, OR, USA, 11–14 March 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1139–1144. [Google Scholar] [CrossRef]
- Hagerer, G.; Lahesoo, L.; Anschutz, M.; Krusche, S.; Groh, G. An Analysis of Programming Course Evaluations Before and After the Introduction of an Autograder. In Proceedings of the 2021 19th International Conference on Information Technology Based Higher Education and Training (ITHET), Sydney, Australia, 4–6 November 2021; pp. 1–9. [Google Scholar] [CrossRef]
- Bey, A.; Jermann, P.; Dillenbourg, P. A Comparison between Two Automatic Assessment Approaches for Programming: An Empirical Study on MOOCs. J. Educ. Technol. Soc. 2018, 21, 259–272. [Google Scholar]
- Chen, X.; Breslow, L.; DeBoer, J. Analyzing productive learning behaviors for students using immediate corrective feedback in a blended learning environment. Comput. Educ. 2018, 117, 59–74. [Google Scholar] [CrossRef]
- Mekterović, I.; Brkić, L.; Milašinović, B.; Baranović, M. Building a Comprehensive Automated Programming Assessment System. IEEE Access 2020, 8, 81154–81172. [Google Scholar] [CrossRef]
- Macha Babitha, M.; Sushama, D.C.; Vijaya Kumar Gudivada, D.; Kutubuddin Sayyad Liyakat Kazi, D.; Srinivasa Rao Bandaru, D. Trends of Artificial Intelligence for Online Exams in Education. Int. J. Eng. Comput. Sci. 2023, 14, 2457–2463. [Google Scholar] [CrossRef]
- Cope, B.; Kalantzis, M. Sources of Evidence-of-Learning: Learning and assessment in the era of big data. Open Rev. Educ. Res. 2015, 2, 194–217. [Google Scholar] [CrossRef]
- Zhai, X. Practices and Theories: How Can Machine Learning Assist in Innovative Assessment Practices in Science Education. J. Sci. Educ. Technol. 2021, 30, 139–149. [Google Scholar] [CrossRef]
- Rao, D.M. Experiences With Auto-Grading in a Systems Course. In Proceedings of the 2019 IEEE Frontiers in Education Conference (FIE), Covington, KY, USA, 16–19 October 2019; pp. 1–8. [Google Scholar] [CrossRef]
- Kitaya, H.; Inoue, U. An Online Automated Scoring System for Java Programming Assignments. Int. J. Inf. Educ. Technol. 2016, 6, 275–279. [Google Scholar] [CrossRef]
- Riera, J.; Ardid, M.; Gómez-Tejedor, J.A.; Vidaurre, A.; Meseguer-Dueñas, J.M. Students’ perception of auto-scored online exams in blended assessment: Feedback for improvement. Educ. XX1 2018, 21, 79–103. [Google Scholar]
- Rokade, A.; Patil, B.; Rajani, S.; Revandkar, S.; Shedge, R. Automated Grading System Using Natural Language Processing. In Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 20–21 April 2018; pp. 1123–1127. [Google Scholar] [CrossRef]
- Hooda, M.; Rana, C.; Dahiya, O.; Rizwan, A.; Hossain, M.S. Artificial Intelligence for Assessment and Feedback to Enhance Student Success in Higher Education. Math. Probl. Eng. 2022, 2022, 5215722. [Google Scholar] [CrossRef]
- Braiki, B.; Harous, S.; Zaki, N.; Alnajjar, F. Artificial intelligence in education and assessment methods. Bull. Electr. Eng. Inform. 2020, 9, 1998–2007. [Google Scholar] [CrossRef]
- Bian, W.; Alam, O.; Kienzle, J. Automated Grading of Class Diagrams. In Proceedings of the 2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), Munich, Germany, 15–20 September 2019; pp. 700–709. [Google Scholar] [CrossRef]
- Riordan, B.; Bichler, S.; Bradford, A.; King Chen, J.; Wiley, K.; Gerard, L.; Linn, M.C. An empirical investigation of neural methods for content scoring of science explanations. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Seattle, WA, USA, Online, 10 July 2020; pp. 135–144. [Google Scholar] [CrossRef]
- Zhai, X.; Shi, L.; Nehm, R. A Meta-Analysis of Machine Learning-Based Science Assessments: Factors Impacting Machine-Human Score Agreements. J. Sci. Educ. Technol. 2021, 30, 361–379. [Google Scholar] [CrossRef]
- von Davier, M.; Tyack, L.; Khorramdel, L. Scoring Graphical Responses in TIMSS 2019 Using Artificial Neural Networks. Educ. Psychol. Meas. 2022, 83, 556–585. [Google Scholar] [CrossRef] [PubMed]
- Condor, A. Exploring Automatic Short Answer Grading as a Tool to Assist in Human Rating. In Proceedings of the Artificial Intelligence in Education, Ifrane, Morocco, 6–10 July 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 74–79. [Google Scholar]
- García-Gorrostieta, J.M.; López-López, A.; González-López, S. Automatic argument assessment of final project reports of computer engineering students. Comput. Appl. Eng. Educ. 2018, 26, 1217–1226. [Google Scholar] [CrossRef]
- Lee, H.S.; Pallant, A.; Pryputniewicz, S.; Lord, T.; Mulholland, M.; Liu, O.L. Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Sci. Educ. 2019, 103, 590–622. [Google Scholar] [CrossRef]
- DiCerbo, K. Assessment for Learning with Diverse Learners in a Digital World. Educ. Meas. Issues Pract. 2020, 39, 90–93. [Google Scholar] [CrossRef]
- Zhang, Y.; Shah, R.; Chi, M. Deep Learning + Student Modeling + Clustering: A Recipe for Effective Automatic Short Answer Grading. In Proceedings of the 9th International Conference on Educational Data Mining (EDM), Raleigh, NC, USA, 29 June–2 July 2016. [Google Scholar]
- Angelone, A.M.; Vittorini, P. The Automated Grading of R Code Snippets: Preliminary Results in a Course of Health Informatics. In Proceedings of the Methodologies and Intelligent Systems for Technology Enhanced Learning, 9th International Conference, Ávila, Spain, 26–28 June 2019; Springer: Cham, Switzerland, 2020; pp. 19–27. [Google Scholar]
- Çınar, A.; Ince, E.; Gezer, M.; Yılmaz, Ö. Machine learning algorithm for grading open-ended physics questions in Turkish. Educ. Inf. Technol. 2020, 25, 3821–3844. [Google Scholar] [CrossRef]
- Julca-Aguilar, F.; Mouchère, H.; Viard-Gaudin, C.; Hirata, N. A general framework for the recognition of online handwritten graphics. IJDAR 2020, 23, 143–160. [Google Scholar] [CrossRef]
- Rowtula, V.; Oota, S.R. Towards Automated Evaluation of Handwritten Assessments. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 426–433. [Google Scholar] [CrossRef]
- Levin, M.; McKechnie, T.; Khalid, S.; Grantcharov, T.P.; Goldenberg, M. Automated Methods of Technical Skill Assessment in Surgery: A Systematic Review. J. Surg. Educ. 2019, 76, 1629–1639. [Google Scholar] [CrossRef]
- Bian, W.; Alam, O.; Kienzle, J. Is automated grading of models effective? assessing automated grading of class diagrams. In Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, MODELS ’20, Virtual, 16–23 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 365–376. [Google Scholar] [CrossRef]
- Vajjala, S. Automated Assessment of Non-Native Learner Essays: Investigating the Role of Linguistic Features. Int. J. Artif. Intell. Educ. 2018, 28, 79–105. [Google Scholar] [CrossRef]
- Štajduhar, I.; Mauša, G. Using string similarity metrics for automated grading of SQL statements. In Proceedings of the 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 25–29 May 2015; pp. 1250–1255. [Google Scholar] [CrossRef]
- Fowler, M.; Chen, B.; Azad, S.; West, M.; Zilles, C. Autograding “Explain in Plain English” questions using NLP. In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education, SIGCSE ’21, Virtual, 13–20 March 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1163–1169. [Google Scholar] [CrossRef]
- Latifi, S.; Gierl, M.; Boulais, A.; De Champlain, A. Using Automated Scoring to Evaluate Written Responses in English and French on a High-Stakes Clinical Competency Examination. Eval. Health Prof. 2016, 39, 100–113. [Google Scholar] [CrossRef]
- Oquendo, Y.; Riddle, E.; Hiller, D.; Blinman, T.A.; Kuchenbecker, K.J. Automatically rating trainee skill at a pediatric laparoscopic suturing task. Surg. Endosc. 2018, 32, 1840–1857. [Google Scholar] [CrossRef]
- Lan, A.S.; Vats, D.; Waters, A.E.; Baraniuk, R.G. Mathematical Language Processing: Automatic Grading and Feedback for Open Response Mathematical Questions. arXiv 2015, arXiv:1501.04346. [Google Scholar] [CrossRef]
- Hoblos, J. Experimenting with Latent Semantic Analysis and Latent Dirichlet Allocation on Automated Essay Grading. In Proceedings of the 2020 Seventh International Conference on Social Networks Analysis, Management and Security (SNAMS), Paris, France, 14–16 December 2020; pp. 1–7. [Google Scholar] [CrossRef]
- Sung, C.; Dhamecha, T.I.; Mukhi, N. Improving Short Answer Grading Using Transformer-Based Pre-training. In Proceedings of the Artificial Intelligence in Education, Chicago, IL, USA, 25–29 June 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 469–481. [Google Scholar]
- Prabhudesai, A.; Duong, T.N.B. Automatic Short Answer Grading using Siamese Bidirectional LSTM Based Regression. In Proceedings of the 2019 IEEE International Conference on Engineering, Technology and Education (TALE), Yogyakarta, Indonesia, 10–13 December 2019; pp. 1–6. [Google Scholar] [CrossRef]
- Riordan, B.; Horbach, A.; Cahill, A.; Zesch, T.; Lee, C.M. Investigating neural architectures for short answer scoring. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications; Association for Computational Linguistics: Copenhagen, Denmark, 2017; pp. 159–168. [Google Scholar] [CrossRef]
- Baral, S.; Botelho, A.F.; Erickson, J.A.; Benachamardi, P.; Heffernan, N.T. Improving Automated Scoring of Student Open Responses in Mathematics. In Proceedings of the 14th International Conference on Educational Data Mining (EDM), Online, 29 June–2 July 2021. [Google Scholar]
- Fagbohun, O.; Iduwe, N.; Abdullahi, M.; Ifaturoti, A.; Nwanna, O. Beyond Traditional Assessment: Exploring the Impact of Large Language Models on Grading Practices. J. Artif. Intell. Mach. Learn. Data Sci. 2024, 2, 1–8. [Google Scholar] [CrossRef]
- Qadir, J. Engineering Education in the Era of ChatGPT: Promise and Pitfalls of Generative AI for Education. In Proceedings of the 2023 IEEE Global Engineering Education Conference (EDUCON), Kuwait, Kuwait, 1–4 May 2023; pp. 1–9. [Google Scholar] [CrossRef]
- Rawas, S. ChatGPT: Empowering lifelong learning in the digital age of higher education. Educ. Inf. Technol. 2024, 29, 6895–6908. [Google Scholar] [CrossRef]
- Lee, S.S.; Moore, R.L. Harnessing Generative AI (GenAI) for Automated Feedback in Higher Education: A Systematic Review. Online Learn. 2024, 28, 82–106. [Google Scholar] [CrossRef]
- Lagakis, P.; Demetriadis, S.; Psathas, G. Automated Grading in Coding Exercises Using Large Language Models. In Proceedings of the Smart Mobile Communication & Artificial Intelligence; Springer Nature: Cham, Switzerland, 2024; pp. 363–373. [Google Scholar]
- Lee, G.G.; Latif, E.; Wu, X.; Liu, N.; Zhai, X. Applying large language models and chain-of-thought for automatic scoring. Comput. Educ. Artif. Intell. 2024, 6, 100213. [Google Scholar] [CrossRef]
- Chen, Z.; Wan, T. Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering. arXiv 2024, arXiv:2407.15251. [Google Scholar] [CrossRef]
- Latif, E.; Zhai, X. Fine-tuning ChatGPT for automatic scoring. Comput. Educ. Artif. Intell. 2024, 6, 100210. [Google Scholar] [CrossRef]
- Smith, A.; Leeman-Munk, S.; Shelton, A.; Mott, B.; Wiebe, E.; Lester, J. A Multimodal Assessment Framework for Integrating Student Writing and Drawing in Elementary Science Learning. IEEE Trans. Learn. Technol. 2019, 12, 3–15. [Google Scholar] [CrossRef]
- Bernius, J.P.; Krusche, S.; Bruegge, B. A Machine Learning Approach for Suggesting Feedback in Textual Exercises in Large Courses. In Proceedings of the Eighth ACM Conference on Learning @ Scale, L@S ’21, Virtual, 22–25 June 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 173–182. [Google Scholar] [CrossRef]
- Messer, M.; Brown, N.C.C.; Kölling, M.; Shi, M. Machine Learning-Based Automated Grading and Feedback Tools for Programming: A Meta-Analysis. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, ITiCSE 2023, Turku, Finland, 7–12 July 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 491–497. [Google Scholar] [CrossRef]
- Ariely, M.; Nazaretsky, T.; Alexandron, G. Machine Learning and Hebrew NLP for Automated Assessment of Open-Ended Questions in Biology. Int. J. Artif. Intell. Educ. 2023, 33, 1–34. [Google Scholar] [CrossRef]
- Weegar, R.; Idestam-Almquist, P. Reducing Workload in Short Answer Grading Using Machine Learning. Int. J. Artif. Intell. Educ. 2024, 34, 247–273. [Google Scholar] [CrossRef]
- Zhu, M.; Lee, H.S.; Wang, T.; Liu, O.L.; Belur, V.; Pallant, A. Investigating the impact of automated feedback on students’ scientific argumentation. Int. J. Sci. Educ. 2017, 39, 1648–1668. [Google Scholar] [CrossRef]
- Vittorini, P.; Menini, S.; Tonelli, S. An AI-Based System for Formative and Summative Assessment in Data Science Courses. Int. J. Artif. Intell. Educ. 2021, 31, 159–185. [Google Scholar] [CrossRef]
- Tisha, S.M.; Oregon, R.A.; Baumgartner, G.; Alegre, F.; Moreno, J. An Automatic Grading System for a High School-level Computational Thinking Course. In Proceedings of the 2022 IEEE/ACM 4th International Workshop on Software Engineering Education for the Next Generation (SEENG), Pittsburgh, PA, USA, 17 May 2022; pp. 20–27. [Google Scholar] [CrossRef]
- Nunes, A.; Cordeiro, C.; Limpo, T.; Castro, S.L. Effectiveness of automated writing evaluation systems in school settings: A systematic review of studies from 2000 to 2020. J. Comput. Assist. Learn. 2022, 38, 599–620. [Google Scholar] [CrossRef]
- Wan, T.; Chen, Z. Exploring generative AI assisted feedback writing for students’ written responses to a physics conceptual question with prompt engineering and few-shot learning. Phys. Rev. Phys. Educ. Res. 2024, 20, 010152. [Google Scholar] [CrossRef]
- Hsu, S.; Li, T.W.; Zhang, Z.; Fowler, M.; Zilles, C.; Karahalios, K. Attitudes Surrounding an Imperfect AI Autograder. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Nazaretsky, T.; Ariely, M.; Cukurova, M.; Alexandron, G. Teachers’ trust in AI-powered educational technology and a professional development program to improve it. Br. J. Educ. Technol. 2022, 53, 914–931. [Google Scholar] [CrossRef]
- Bond, M.; Khosravi, H.; De Laat, M.; Bergdahl, N.; Negrea, V.; Oxley, E.; Pham, P.; Chong, S.W.; Siemens, G. A meta systematic review of artificial intelligence in higher education: A call for increased ethics, collaboration, and rigour. Int. J. Educ. Technol. High. Educ. 2024, 21, 4. [Google Scholar] [CrossRef]
- Selwyn, N. Less Work for Teacher? The Ironies of Automated Decision-Making in Schools. In Everyday Automation, 1st ed.; Routledge: London, UK, 2022. [Google Scholar]
- Wang, P.-l. Effects of an automated writing evaluation program: Student experiences and perceptions. Electron. J. Foreign Lang. Teach. 2015, 12, 79–100. [Google Scholar]
- Wilson, J.; Ahrendt, C.; Fudge, E.A.; Raiche, A.; Beard, G.; MacArthur, C. Elementary teachers’ perceptions of automated feedback and automated scoring: Transforming the teaching and learning of writing using automated writing evaluation. Comput. Educ. 2021, 168, 104208. [Google Scholar] [CrossRef]
- Gordillo, A. Effect of an Instructor-Centered Tool for Automatic Assessment of Programming Assignments on Students’ Perceptions and Performance. Sustainability 2019, 11, 5568. [Google Scholar] [CrossRef]
- Buffardi, K.; Edwards, S.H. Reconsidering Automated Feedback: A Test-Driven Approach. In Proceedings of the 46th ACM Technical Symposium on Computer Science Education, SIGCSE ’15, Kansas City, MO, USA, 4–7 March 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 416–420. [Google Scholar] [CrossRef]
- Baniassad, E.; Zamprogno, L.; Hall, B.; Holmes, R. STOP THE (AUTOGRADER) INSANITY: Regression Penalties to Deter Autograder Overreliance. In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education, SIGCSE ’21, Virtual, 13–20 March 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1062–1068. [Google Scholar] [CrossRef]
- Lawrence, R.; Foss, S.; Urazova, T. Evaluation of Submission Limits and Regression Penalties to Improve Student Behavior with Automatic Assessment Systems. ACM Trans. Comput. Educ. 2023, 23, 1–24. [Google Scholar] [CrossRef]
- Anjum, G.; Choubey, J.; Kushwaha, S.; Patkar, V. AI in Education: Evaluating the Efficacy and Fairness of Automated Grading Systems. Int. J. Innov. Res. Sci. Eng. Technol. (IJIRSET) 2023, 12, 9043–9050. [Google Scholar] [CrossRef]
- Ma, Y.; Hu, S.; Li, X.; Wang, Y.; Chen, Y.; Liu, S.; Cheong, K.H. When LLMs Learn to be Students: The SOEI Framework for Modeling and Evaluating Virtual Student Agents in Educational Interaction. arXiv 2024, arXiv:2410.15701. [Google Scholar]
- Morris, W.; Holmes, L.; Choi, J.S.; Crossley, S. Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models. Int. J. Artif. Intell. Educ. 2024, 35, 559–586. [Google Scholar] [CrossRef]
- Condor, A.; Litster, M.; Pardos, Z. Automatic Short Answer Grading with SBERT on Out-of-Sample Questions. In Proceedings of the International Conference on Educational Data Mining (EDM), Online, 29 June–2 July 2021. [Google Scholar]
- Akgun, S.; Greenhow, C. Artificial intelligence in education: Addressing ethical challenges in K-12 settings. AI Ethics 2022, 2, 431–440. [Google Scholar] [CrossRef]
- Reilly, E.D.; Williams, K.M.; Stafford, R.E.; Corliss, S.B.; Walkow, J.C.; Kidwell, D.K. Global Times Call for Global Measures: Investigating Automated Essay Scoring in Linguistically-Diverse MOOCs. Online Learn. 2016, 20, 217–229. [Google Scholar] [CrossRef]
- Tuah, N.A.A.; Naing, L. Is Online Assessment in Higher Education Institutions during COVID-19 Pandemic Reliable? Siriraj Med. J. 2021, 73, 61. [Google Scholar] [CrossRef]
- Teo, Y.H.; Yap, J.H.; An, H.; Yu, S.C.M.; Zhang, L.; Chang, J.; Cheong, K.H. Enhancing the MEP coordination process with BIM technology and management strategies. Sensors 2022, 22, 4936. [Google Scholar] [CrossRef]
- Cheong, K.H.; Lai, J.W.; Yap, J.H.; Cheong, G.S.W.; Budiman, S.V.; Ortiz, O.; Mishra, A.; Yeo, D.J. Utilizing Google cardboard virtual reality for visualization in multivariable calculus. IEEE Access 2023, 11, 75398–75406. [Google Scholar] [CrossRef]
- Bogar, P.Z.; Virag, M.; Bene, M.; Hardi, P.; Matuz, A.; Schlegl, A.T.; Toth, L.; Molnar, F.; Nagy, B.; Rendeki, S.; et al. Validation of a novel, low-fidelity virtual reality simulator and an artificial intelligence assessment approach for peg transfer laparoscopic training. Sci. Rep. 2024, 14, 16702. [Google Scholar] [CrossRef] [PubMed]
- Tan, L.Y.; Hu, S.; Yeo, D.J.; Cheong, K.H. Artificial Intelligence-Enabled Adaptive Learning Platforms: A Review. Comput. Educ. Artif. Intell. 2025, 9, 100429. [Google Scholar] [CrossRef]
- Drijvers, P.; Ball, L.; Barzel, B.; Heid, M.K.; Cao, Y.; Maschietto, M. Uses of Technology in Lower Secondary Mathematics Education, 1st ed.; Number 1 in ICME-13 Topical Surveys; Springer: Cham, Switzerland, 2016; pp. 1–34. ISBN 978-3-319-33666-4. [Google Scholar] [CrossRef]
- Tomić, B.B.; Kijevčanin, A.D.; Ševarac, Z.V.; Jovanović, J.M. An AI-based Approach for Grading Students’ Collaboration. IEEE Trans. Learn. Technol. 2023, 16, 292–305. [Google Scholar] [CrossRef]
- Som, A.; Kim, S.; Lopez-Prado, B.; Dhamija, S.; Alozie, N.; Tamrakar, A. Automated Student Group Collaboration Assessment and Recommendation System Using Individual Role and Behavioral Cues. Front. Comput. Sci. 2021, 3, 728801. [Google Scholar] [CrossRef]
- Guo, S.; Zheng, Y.; Zhai, X. Artificial Intelligence in Education Research During 2013–2023: A Review Based on Bibliometric Analysis. Educ. Inf. Technol. 2024, 29, 16387–16409. [Google Scholar] [CrossRef]
- Bozkurt, A.; Sharma, R.C. Challenging the Status Quo and Exploring the New Boundaries in the Age of Algorithms: Reimagining the Role of Generative AI in Distance Education and Online Learning. Asian J. Distance Educ. 2023, 18, i–viii. [Google Scholar]
| Algorithm | Subject | Data Type | Performance | Description & Reference |
|---|---|---|---|---|
| Logistic regression | Programming | Text | 78% accuracy | SQL statement correctness using string similarity metrics [45] |
| Logistic regression | Programming | Text | 88% accuracy | “Explain in Plain English” questions using word/bigram features [46] |
| Ensemble learning algorithms | Physics | Text | >80% accuracy | Multiple supervised models (SVM, k-NN, AdaBoost) for Turkish physics questions [39] |
| Decision tree | Medicine | Text | 95.4% agreement | Bilingual clinical decision-making assessment [47] |
| Regression tree | Medicine | Sensor data | 71% accuracy | Medical suturing skills using motion sensor features [48] |
| Clustering | Mathematics | Text | 0.04 MAE | Hybrid human-AI approach for algebraic problems [49] |
| Semantic similarity | Programming | Text | 84% correlation | LSA/LDA for essay grading in programming concepts [50] |
| CNNs | Science | Image | 0.82 F1 | Handwritten assessment using word spotting and NER [41] |
| CNNs | Math/Science | Image | 97.53% accuracy | TIMSS 2019 graphical responses, outperformed humans [32] |
| Transformers | Physics | Text | 76% accuracy | BERT for Newtonian physics conceptual questions [33] |
| Transformers | Science | Text | 10% F1 increase | BERT for short-answer grading across multiple domains [51] |
| LSTM | Programming | Text | 0.655 Pearson correlation | Siamese BiLSTM with handcrafted features [52] |
| LSTM | Science | Text | 0.732 QWK | Bidirectional LSTM with attention for the ASAP-SAS dataset [53] |
| Transformers + ranking | Mathematics | Text | 0.856 AUC | SBERT embeddings for open-ended math solutions [54] |
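To illustrate the classical feature-based approaches summarized in the table above, the sketch below trains a TF-IDF plus logistic regression scorer on expert-graded short answers and reports accuracy and Cohen's kappa. The responses, labels, and binary rubric are hypothetical placeholders rather than data from any study cited here, and scikit-learn is assumed only for illustration.

```python
# Minimal sketch of a feature-based short-answer scorer (hypothetical data).
# Mirrors the classical pipeline in the table: vectorize text, fit a
# supervised classifier on expert-assigned scores, then measure agreement.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Expert-scored responses (placeholder examples; real corpora hold thousands).
responses = [
    "The object accelerates because the net force acting on it is nonzero.",
    "It moves because things just move.",
    "Acceleration is proportional to the net force and inversely proportional to mass.",
    "Gravity makes everything speed up forever no matter what.",
]
scores = [1, 0, 1, 0]  # 1 = meets the rubric, 0 = does not (binary for brevity)

# Stratified split so both grade labels appear in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    responses, scores, test_size=0.5, random_state=0, stratify=scores
)

# Word/bigram TF-IDF features feeding a logistic regression classifier,
# in the same spirit as the lexical-feature approaches cited above.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
print("Cohen's kappa:", cohen_kappa_score(y_test, predictions))
```

In practice, the same pipeline structure scales by swapping in richer features or a stronger classifier while keeping the expert-scored split and the human-machine agreement metric unchanged.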
| GenAI Approach | Subject | Solution Type | Performance | Description |
|---|---|---|---|---|
| Zero-shot learning with GPT-4 | Programming | Text (code) | 0.74–0.98 F1 score | Lagakis et al. [59] used GPT-4 to predict grades for Python coding assignments in a Greek MOOC, using a zero-shot approach via OpenAI’s ChatCompletion API. In ordinal classification tasks (categorizing grades into tiers such as “low”, “mediocre”, or “high”), both minimal and detailed prompts, which added increasingly specific grading instructions, outperformed a basic prompt containing only the task and assignment description. |
| Chain-of-thought (CoT) reasoning with detailed context, rubric-based evaluation, and few-shot learning with GPT-4 | Science | Text | 0.6975 accuracy | Lee et al.’s [60] AGS scored middle school science assessments. The prompt was structured to invoke CoT reasoning with five elements: (1) a clear role instruction for GPT to act as an impartial science teacher, (2) detailed contextual information describing the task and scientific phenomenon, (3) a rubric with specific criteria for evaluation, (4) explicit instructions to compare student responses to rubric components, and (5) multiple few-shot examples showing how to evaluate responses using the rubric. |
| Scaffolded CoT prompting with GPT-3.5 | Physics | Text | 20–30% accuracy gain over conventional CoT; 70–80% human agreement | Chen and Wan [61] employed scaffolded CoT prompting for a calculus-based physics course, guiding GPT-3.5 to explicitly compare student responses to a detailed rubric by requiring it to identify relevant portions of the response, evaluate their alignment with the rubric, and provide a step-by-step grading rationale. |
| Fine-tuning GPT-3.5 on past student solutions and scores | Science | Text | 9.1% higher average scoring accuracy than a state-of-the-art fine-tuned BERT | Latif and Zhai [62] fine-tuned GPT-3.5 on a diverse dataset of middle and high school students’ written responses to science questions with ground-truth expert scores. The fine-tuned model then assessed students on questions that required them to interpret data and explain their answers using scientific concepts (e.g., gases, heat) and mathematical reasoning. |
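As a concrete illustration of the rubric-grounded CoT prompting pattern summarized in the table above, the sketch below assembles a grading prompt and sends it through a chat-completion call. The question, rubric, model identifier, and output format are hypothetical, and the OpenAI Python client is assumed purely for demonstration; it is not the implementation of any cited study.

```python
# Sketch of rubric-based chain-of-thought grading via a chat-completion API.
# The prompt loosely follows the role + context + rubric + comparison +
# step-by-step rationale structure described in the table above.
from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "Why does a ball thrown upward slow down as it rises?"
RUBRIC = (
    "1 point: identifies gravity as a downward net force.\n"
    "1 point: states that this force opposes the ball's velocity, so it decelerates."
)

def grade_response(student_response: str) -> str:
    """Ask the model to compare the response to the rubric and justify a score."""
    messages = [
        {
            "role": "system",
            "content": "You are an impartial physics teacher grading short answers.",
        },
        {
            "role": "user",
            "content": (
                f"Question: {QUESTION}\n\n"
                f"Rubric:\n{RUBRIC}\n\n"
                f"Student response: {student_response}\n\n"
                "Compare the response to each rubric item, reason step by step, "
                "then output a final line formatted as 'Score: X/2'."
            ),
        },
    ]
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model identifier
        messages=messages,
        temperature=0,  # deterministic grading runs are easier to audit
    )
    return completion.choices[0].message.content

print(grade_response("Gravity pulls it down, so it keeps losing speed on the way up."))
```

Setting the temperature to zero and requesting an explicit "Score: X/2" line keeps grading runs reproducible and machine-parsable, which matters when scores must be logged, audited, or compared against human raters.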