Educational Evaluation with MLLMs: Framework, Dataset, and Comprehensive Assessment
Abstract
1. Introduction
- (1) We shift the application of MLLMs from content generation to competency-based assessment and design a lightweight, extensible framework for evaluating student outcomes in non-exam tasks. This framework provides structured support for applying MLLMs to automated assessment across five key educational dimensions.
- (2) We introduce a multimodal evaluation dataset comprising student essays, slide decks, and presentation videos, rated by human experts across five educational dimensions: Format Compliance, Content Quality, Slide Design, Verbal Expression, and Nonverbal Performance.
- (3) We conduct a comprehensive evaluation of leading MLLMs (GPT-4o, Gemini 2.5, Doubao 1.6, and Kimi 1.5) across five educational dimensions and three technology dimensions. This study presents the performance of each model and provides an in-depth analysis of the strengths and limitations of MLLMs in learning assessment.
2. Literature Review
2.1. MLLMs in Education
2.2. Learning Assessment
3. Methodology
3.1. Research Design
3.2. Scoring Framework Design
3.3. Dataset Construction
3.4. MLLM Setting and Evaluation
4. Results Analysis
4.1. Model Performance Across Educational Dimensions
4.2. Model Performance Across Stability and Interpretability
5. Discussion
5.1. Potential and Performance Variations of MLLMs in Educational Assessment
5.2. Discussion and Insights on Scoring Stability and Interpretability of MLLMs
5.3. Ethical Considerations and Risk Mitigation in MLLM Use
6. Conclusions
6.1. Limitations
6.2. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Dimension | Format | Prompt Design | Prompt Example
---|---|---|---
Format Compliance & Content Quality | Essay (Text) | Evaluator Role + Student Background + Work Type + Task Goal + Evaluation Dimension + Output Criteria | You are a university professor teaching the course “Introduction to Computer Science”. Please evaluate a first-year computer science student’s short paper based on the two dimensions below. Dimension 1: Formatting Standards. The grading criteria are as follows: A: Formatting with almost no errors; structure is clear, providing an excellent reading experience. B: Basically standardized, with a few minor issues; structure is relatively clear. C: Contains some noticeable formatting problems; structure is somewhat disorganized. D: Formatting is non-standard and affects readability; structure is disorganized. E: Seriously lacking in formatting standards; difficult to review. Dimension 2: Content Quality. The grading criteria are as follows: A: In-depth content with a clear topic focus, strong logical coherence, and close relevance to the course. B: Solid content with a certain level of depth and logic, and good relevance. C: Average content; some issues with logic and structure; some parts deviate from the topic. D: Shallow or disorganized content, incomplete information, weak course relevance. E: Severely lacking in content, disorganized structure, unclear or unrelated topic. Please output the evaluation results for the two dimensions in the following format: Formatting Grade: X. Brief Justification: … Content Quality Grade: X. Brief Justification: …
Slide Design | PPT | Evaluator Role + Student Background + Work Type + Task Goal + Evaluation Dimension + Output Criteria | You are a university professor teaching the course “Introduction to Computer Science”. Please evaluate the PPT used by a first-year computer science student during a recorded speech based on the following grading criteria: A: Clear content, appealing design, strongly supports the speech. B: Well-structured content, decent design with minor flaws, effectively supports the speech. C: Mostly clear content; design may be cluttered or simplistic but still supports understanding. D: Noticeable issues in content or design that hinder communication; only main ideas come through. E: Disorganized or poorly designed slides that severely impair communication and support for the speech. Please present your evaluation in the following format: PPT Quality Rating: X. Brief Justification: …
Verbal Expression & Nonverbal Performance | Presentation (Video) | Evaluator Role + Student Background + Work Type + Task Goal + Evaluation Dimension + Output Criteria | You are a university professor teaching the course “Introduction to Computer Science”, tasked with evaluating a first-year student’s 3-minute presentation. Please assess the presentation on the following two dimensions: Dimension 1: Verbal Expression. The grading criteria are as follows: A: Clear and fluent, professional language, strong logic and persuasiveness. B: Mostly clear, appropriate terminology, well-structured logic. C: Moderate clarity, minor fluency or logic issues. D: Unclear, disorganized, or poor terminology use. E: Very poor expression, hard to understand, lacks basic verbal skills. Dimension 2: Nonverbal Performance. With only limited visuals (e.g., headshots), focus your evaluation on observable cues like facial expression, gaze, and emotional engagement; do not penalize for missing body gestures. The grading criteria are as follows: A: Natural, confident, expressive face and visible gestures, highly engaging. B: Generally confident with natural facial expressions and some emotional involvement. C: Average performance; slightly tense or monotonous, with moderate engagement. D: Weak performance; lacks confidence, with stiff expressions or limited emotional input. Please provide your evaluation in the following format: Verbal Expression: X. Brief Justification: … Nonverbal Performance: X. Brief Justification: …
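The prompts in the table above all compose the same six components (Evaluator Role + Student Background + Work Type + Task Goal + Evaluation Dimension + Output Criteria). The following is a minimal sketch of how such a prompt could be assembled programmatically; the helper name, parameter names, and the abbreviated rubric text are illustrative, not the paper's actual implementation:

```python
# Sketch of assembling a rubric-based evaluation prompt from the six
# components in the table. All names and rubric strings are illustrative.

def build_prompt(role, background, work_type, task_goal, dimensions, output_format):
    """Concatenate the six prompt components into one instruction string."""
    parts = [
        f"You are {role}.",
        f"Please evaluate a {background}'s {work_type}.",
        task_goal,
    ]
    for name, criteria in dimensions.items():
        parts.append(f"Dimension: {name}. The grading criteria are as follows:")
        parts.extend(f"{grade}: {desc}" for grade, desc in criteria.items())
    parts.append("Please output the evaluation results in the following format:")
    parts.append(output_format)
    return "\n".join(parts)

prompt = build_prompt(
    role="a university professor teaching the course 'Introduction to Computer Science'",
    background="first-year computer science student",
    work_type="short paper",
    task_goal="Please assess the paper on the dimension below.",
    dimensions={
        "Formatting Standards": {
            "A": "Formatting with almost no errors; structure is clear.",
            "C": "Contains some noticeable formatting problems.",
            "E": "Seriously lacking in formatting standards.",
        }
    },
    output_format="Formatting Grade: X. Brief Justification: ...",
)
print(prompt)
```

Keeping the components separate makes it easy to swap one dimension's rubric in or out while holding the evaluator role and output format fixed across all five dimensions.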
Model | FC | CQ | SD | VE | NP
---|---|---|---|---|---
Gemini 2.5 | 45.71% | 50.00% | 52.86% | 52.86% | 44.29%
GPT-4o | 47.14% | 61.43% | 54.29% | 48.57% | 55.71%
Doubao 1.6 | 52.86% | 72.86% | 45.71% | 58.57% | 54.29%
Kimi 1.5 | 84.29% | 42.86% | 58.57% | 57.14% | 77.14%

FC = Format Compliance; CQ = Content Quality; SD = Slide Design; VE = Verbal Expression; NP = Nonverbal Performance.
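Percentages like those in the table above are typically exact-agreement rates between model-assigned and human-assigned letter grades. A minimal sketch of the computation, using made-up grade lists rather than the study's data:

```python
# Exact-agreement rate between model and human grades on the A-E scale.
# The grade lists below are made-up illustrations, not the study's data.

def agreement_rate(model_grades, human_grades):
    """Fraction of items where the model's grade matches the human grade."""
    assert len(model_grades) == len(human_grades)
    matches = sum(m == h for m, h in zip(model_grades, human_grades))
    return matches / len(model_grades)

human = ["A", "B", "B", "C", "A", "D", "B"]
model = ["A", "B", "C", "C", "B", "D", "B"]
print(f"{agreement_rate(model, human):.2%}")  # 5 of 7 grades match -> 71.43%
```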
Model | FC | CQ | SD | VE | NP
---|---|---|---|---|---
Gemini 2.5 | 0.355 | 0.304 | 0.116 | 0.372 | 0.292
GPT-4o | 0.477 | 0.372 | 0.437 | 0.429 | 0.381
Doubao 1.6 | 0.436 | 0.231 | 0.427 | 0.381 | 0.176
Kimi 1.5 | 0.262 | 0.413 | 0.272 | 0.242 | 0.257
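For ordinal A–E grades, model–human agreement is often summarized with a chance-corrected statistic; one standard choice is quadratic weighted kappa, which penalizes disagreements by the squared distance between grades. Whether that is exactly the statistic reported above is not restated here, so the sketch below is purely illustrative, with made-up ratings:

```python
# Quadratic weighted kappa for ordinal grades (A-E mapped to 0-4).
# Pure-Python sketch on illustrative data, not the study's ratings.

GRADE = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}

def quadratic_weighted_kappa(rater1, rater2, n_levels=5):
    """1 - (weighted observed disagreement) / (weighted expected disagreement)."""
    n = len(rater1)
    observed = [[0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater1, rater2):
        observed[a][b] += 1
    hist1 = [sum(row) for row in observed]
    hist2 = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist1[i] * hist2[j] / n  # expected under independence
    return 1.0 - num / den

human = [GRADE[g] for g in ["A", "B", "B", "C", "A", "D", "B", "C"]]
model = [GRADE[g] for g in ["A", "B", "C", "C", "B", "D", "B", "B"]]
print(round(quadratic_weighted_kappa(human, model), 3))
```

A value of 1.0 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.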
Model | FC | CQ | SD | VE | NP
---|---|---|---|---|---
Gemini 2.5 | 4.48 | 4.50 | 4.33 | 3.95 | 3.63
GPT-4o | 4.63 | 4.55 | 4.08 | 4.03 | 3.83
Doubao 1.6 | 4.53 | 4.68 | 3.98 | 3.98 | 3.73
Kimi 1.5 | 4.38 | 4.60 | 4.23 | 3.90 | 3.43
Dimension | Examples |
---|---|
Format Compliance | Score: B. Reasoning: The paper’s layout is tidy, and the paragraph structure is clear, with an overall format that is fairly standardized. The formatting of references generally meets requirements but has some minor issues. For example, some citations and references are inconsistent, particularly for sources like the CSDN website and journal articles, where citation rules are not strictly followed. Additionally, incomplete publication years, DOIs, and source details affect formatting consistency. Overall, the structure is clear, offering an acceptable reading experience, though details could be further refined. (From GPT-4o) |
Content Quality | Score: B. Reasoning: The paper focuses clearly on “Natural Language Processing”, covering various aspects such as the concept of NLP, its development history, project workflows, implementation methods, applications, and challenges, providing comprehensive information. The logic is coherent, gradually progressing from basic concepts to specific applications and future trends. It closely relates to the knowledge of artificial intelligence and natural language processing taught in the “Introduction to Computer Science” course, demonstrating a basic understanding of the field with a certain level of depth, meeting the criteria for a B grade. (From Doubao1.6) |
Slide Design | Rating: C. Reasoning: The PPT is well-organized in terms of content, dividing the topic of “Natural Language Processing (NLP)” into several aspects including definitions, development history, project processes, core technologies, application areas, and future challenges. The logical structure is reasonable and effectively guides the audience’s understanding of the subject. The PPT appropriately uses flowcharts, word clouds, and technical diagrams (such as SVM, RNN illustrations, and comparison tables) to help explain complex concepts, aiding in the visual presentation of information. However, there are some notable shortcomings in the overall design quality of the PPT: Dense text layout: Several slides (e.g., “NLP Development History” and “Future Development and Challenges of NLP”) contain large blocks of text without emphasizing key points, which may make it difficult for the audience to quickly grasp important information during the presentation, thereby affecting the effective communication of information. Formatting errors: The page numbers on all content slides incorrectly display as “01/14,” which is a basic error that should not appear in a formal presentation and reflects a design oversight. Visual effects need improvement: The slide background features a consistent design with campus scenery and geometric patterns on the cover and closing slides, but most of the content slides use a plain white background. While this switch is acceptable, it fails to create a cohesive visual style throughout, making the design appear somewhat simplistic. In summary, although the PPT’s content is generally clear, its dense text layout, obvious formatting errors, and room for improvement in visual design prevent it from reaching an “aesthetically pleasing and harmonious” level. Therefore, it is rated as a C. (From Gemini) |
Verbal Expression | Rating: B. Reasoning: Based on the text content, the student’s expression is relatively clear, with appropriate use of terminology and an accurate explanation of topics related to Natural Language Processing (NLP) technology. The narrative structure demonstrates logical progression, beginning with an introduction to the basic concepts of NLP, followed by a review of its development history, then explaining relevant technologies and models, and concluding with current challenges and future directions. Nonetheless, some details lack in-depth analysis, and a few sentence transitions are somewhat plain, slightly affecting the overall fluency of the expression. (From GPT-4o) |
Nonverbal Performance | Rating: C. Reasoning: From the screenshots, the student’s facial expressions appear relatively natural; however, in some images, the expressions seem somewhat flat, lacking emotional engagement. Eye contact is limited, and there is no noticeable body language to observe. Therefore, although the student maintained a moderate level of engagement, the overall confidence and involvement were insufficient, which may have affected the speech’s impact and expressiveness. (From GPT-4o) |
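The example outputs above follow a "Score/Rating: X. Reasoning: …" pattern, so the letter grade can be extracted for quantitative analysis with a small regular expression. A sketch; the label variants handled here are only those visible in the table, and other phrasings would need additional patterns:

```python
import re

# Extract the letter grade and justification from a model's free-text verdict.
# Handles the label variants visible in the examples table ("Score", "Rating",
# "... Grade"); anything else returns None.
PATTERN = re.compile(
    r"(?:Score|Rating|Grade)\s*:\s*([A-E])\b[.\s]*"
    r"(?:Reasoning|Brief Justification)?\s*:?\s*(.*)",
    re.IGNORECASE | re.DOTALL,
)

def parse_verdict(text):
    """Return (grade, justification) or None if no grade is found."""
    match = PATTERN.search(text)
    if not match:
        return None
    return match.group(1).upper(), match.group(2).strip()

print(parse_verdict("Score: B. Reasoning: The paper's layout is tidy."))
```

Falling back to `None` rather than raising makes it easy to count how often a model deviates from the requested output format, which is itself a signal of instruction-following quality.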
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, Y.; Li, Y.; Ren, Y.; Liu, Y.; Ma, Y. Educational Evaluation with MLLMs: Framework, Dataset, and Comprehensive Assessment. Electronics 2025, 14, 3713. https://doi.org/10.3390/electronics14183713