A Survey on MLLMs in Education: Application and Future Directions
Abstract
1. Introduction
2. Preliminaries of MLLMs
2.1. Overview of Large Language Models (LLMs)
2.2. Evolution to Multimodal Large Language Models (MLLMs)
2.3. Key Technologies and Architectures
2.3.1. Transformer Architectures and Attention Mechanisms
2.3.2. Multimodal Fusion Techniques
- Early fusion (single-stream): Integrates modalities at the input level, allowing the model to learn cross-modal representations from the beginning. This method processes all modalities simultaneously, enabling the model to capture interactions between different types of data early in the processing pipeline [6].
- Late fusion (dual-stream): Processes each modality separately before combining their representations at a higher level. This approach allows for specialized processing of each modality, which can be beneficial when modalities have very different characteristics or when pre-trained unimodal models are used [6].
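
The difference between the two strategies can be illustrated with a minimal sketch. It assumes toy two-dimensional text and image feature vectors, with a bias-free linear layer followed by a ReLU standing in for each encoder. All names (`early_fusion`, `late_fusion`, `joint_weights`, and so on) and all numeric values are hypothetical and are not taken from any model discussed in this survey.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, x: Sequence[float]) -> Vector:
    """Bias-free linear layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x: Sequence[float]) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def early_fusion(text: Vector, image: Vector, joint_weights: Matrix) -> Vector:
    """Single-stream: concatenate raw features, then apply ONE shared encoder."""
    fused_input = list(text) + list(image)
    return relu(linear(joint_weights, fused_input))


def late_fusion(text: Vector, image: Vector,
                text_weights: Matrix, image_weights: Matrix) -> Vector:
    """Dual-stream: encode each modality on its own, then join the results."""
    text_repr = relu(linear(text_weights, text))
    image_repr = relu(linear(image_weights, image))
    return text_repr + image_repr  # concatenate the two representations


# Hypothetical toy features and weights (not taken from any real model).
text_features = [1.0, 2.0]
image_features = [0.5, -1.0]

early = early_fusion(text_features, image_features,
                     joint_weights=[[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
late = late_fusion(text_features, image_features,
                   text_weights=[[1.0, 1.0]], image_weights=[[2.0, 0.0]])

print(early)  # [1.5, 1.0]
print(late)   # [3.0, 1.0]
```

In the early-fusion output, every unit is computed from both text and image features, so cross-modal interactions are available from the first layer. In the late-fusion output, each half depends on a single modality, and any interaction has to be learned by whatever layer consumes the joined representation.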
2.3.3. Pre-Training and Fine-Tuning
2.3.4. Encoder and Decoder Architectures
- Encoder-only models: Models like CLIP focus on creating embeddings for different modalities that can be compared or combined [14]. The encoder processes input data to generate a fixed-size representation, capturing the essential features of the input regardless of its modality. This approach is effective for tasks that require matching or retrieving information across modalities.
- Encoder–decoder models: Models used in tasks like image captioning process input data through an encoder and generate outputs via a decoder, allowing for generative tasks [5]. The encoder transforms the input data into a latent representation, which the decoder then uses to generate a sequence of outputs in another modality. This architecture is well suited for tasks that involve translation between modalities, such as generating descriptive text from images.
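
The matching and retrieval use case described for encoder-only models can be sketched as a nearest-neighbour search over a shared embedding space. The example below is illustrative only: the embeddings are hand-written three-dimensional vectors standing in for the outputs of an image encoder and a text encoder, and the file names, the query and the helper `retrieve` are hypothetical. It does not call CLIP or any other real model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float],
             candidates: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank candidate items by similarity to the query, best match first."""
    scored = [(name, cosine_similarity(query, emb))
              for name, emb in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical image embeddings, as if produced by an image encoder.
image_embeddings = {
    "cell_diagram.png": [0.9, 0.1, 0.0],
    "circuit_photo.png": [0.1, 0.9, 0.2],
    "world_map.png": [0.0, 0.2, 0.9],
}

# Hypothetical text embedding, standing in for "diagram of a plant cell".
text_query_embedding = [0.8, 0.2, 0.1]

ranking = retrieve(text_query_embedding, image_embeddings)
best_match = ranking[0][0]
print(best_match)  # cell_diagram.png
```

Because both modalities map into the same fixed-size space, retrieval reduces to a similarity comparison and no output sequence is generated. An encoder–decoder model would instead pass the encoded representation to a decoder that produces text token by token.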
2.4. Examples of Prominent MLLMs
2.4.1. Open-Source MLLMs
2.4.2. Proprietary MLLMs
3. Applications of MLLMs in Education
- Adaptive learning platforms: Examining how MLLMs enable personalized learning experiences by dynamically adjusting instructional content to meet individual learners’ needs, preferences, and performance levels.
- Virtual tutors and chatbots: Exploring the role of MLLMs in developing intelligent virtual assistants that provide personalized support, guidance, and interactive learning opportunities through natural language conversations.
- Intelligent content creation: Investigating how MLLMs automate and enhance the development of educational materials, including textbooks, lesson plans, assessments, and multimedia resources, thereby increasing efficiency and accessibility.
- AI-powered learning management systems (LMSs): Analyzing the integration of MLLMs into LMS platforms to enhance content delivery, personalize learning paths, and facilitate more engaging and interactive educational experiences.
- AI-based insight and predictive analytics for educators: Discussing how MLLMs provide educators with actionable insights by analyzing multimodal educational data, enabling early identification of at-risk students and informed decision making.
- Grading and assessment tools: Assessing the application of MLLMs in automating grading processes, providing objective evaluations, and delivering detailed, personalized feedback across various types of student work.
3.1. Adaptive Learning Platforms
3.1.1. Introduction to Adaptive Learning Platforms
3.1.2. Technology Used: MLLMs in Adaptive Learning
3.1.3. Examples of Applications
3.1.4. Case Study: Integration of MLLMs in Duolingo
3.2. Virtual Tutors and Chatbots
3.2.1. Introduction to Virtual Tutors and Chatbots
3.2.2. Technology Used: MLLMs in Virtual Tutors and Chatbots
3.2.3. Examples of Applications
3.2.4. Case Study: Implementation of LOVA3 in Virtual Tutoring Systems
3.3. Intelligent Content Creation
3.3.1. Introduction to Intelligent Content Creation
3.3.2. Technology Used: MLLMs in Intelligent Content Creation
3.3.3. Examples of Applications
3.3.4. Case Study: Automated Quiz Generation Using GPT-3
3.4. AI-Powered Learning Management Systems (LMSs)
3.4.1. Introduction to AI-Powered Learning Management Systems
3.4.2. Technology Used: MLLMs in AI-Powered LMSs
3.4.3. Examples of Applications
3.4.4. Case Study: Integration of MLLMs in Coursera
3.5. AI-Based Insight and Predictive Analytics for Educators
3.5.1. Introduction to AI-Based Insight and Predictive Analytics for Educators
3.5.2. Technology Used: MLLMs in AI-Based Insight and Predictive Analytics
3.5.3. Examples of Applications
3.5.4. Case Study: Early Warning Systems Using MLLMs
3.6. Grading and Assessment Tools
3.6.1. Introduction to Grading and Assessment Tools
3.6.2. Technology Used: MLLMs in Grading and Assessment
3.6.3. Examples of Applications
3.6.4. Case Study: Reducing the Cost of Short-Answer Scoring with MLLMs
4. Discussion and Future Directions
4.1. Benefits of Using MLLMs in Educational Applications
4.2. Limitations of Using MLLMs in Educational Applications
4.3. Future Directions: Towards a Unified AI-Powered Educational Ecosystem
4.4. Shift of Tides in Future Education
5. Conclusions
5.1. Summary of Key Insights
5.2. Final Thoughts
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| LLM | Large language model |
| MLLM | Multimodal large language model |
| ADS | Advanced dialogue system |
References
- Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274.
- Zhou, W.; Zhu, X.; Han, Q.-L.; Li, L.; Chen, X.; Wen, S.; Xiang, Y. The Security of Using Large Language Models—A Survey with Emphasis on ChatGPT. IEEE/CAA J. Autom. Sin. 2024.
- Luckin, R.; Holmes, W.; Griffiths, M.; Forcier, L.B. Intelligence Unleashed: An Argument for AI in Education; Pearson Education: London, UK, 2016.
- OpenAI. OpenAI o1 System Card. OpenAI. 12 September 2024. [Online]. Available online: https://openai.com/index/openai-o1-system-card/ (accessed on 3 October 2024).
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2023, arXiv:2301.12597.
- Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–445.
- Williamson, B.; Eynon, R. Historical threads, missing links, and future directions in AI in education. Learn. Media Technol. 2020, 45, 223–235.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- Anthropic. Claude 3.5 Sonnet Model Card Addendum. Anthropic. 2023. [Online]. Available online: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf (accessed on 5 October 2024).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 689–696.
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning; DeepMind: London, UK, 2022.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8821–8831.
- Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
- Mialon, G.; Dessì, R.; Lomeli, M.; Nalmpantis, C.; Pasunuru, R.; Raileanu, R.; Rozière, B.; Schick, T.; Dwivedi-Yu, J.; Celikyilmaz, A.; et al. Augmented Language Models: A Survey. arXiv 2023, arXiv:2302.07842.
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL, Florence, Italy, 28 July–2 August 2019; pp. 4171–4186.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
- Gong, Y.; Chung, Y.-A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 571–575.
- Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
- Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Universal Image-Text Representation Learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 104–120.
- Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 5100–5111.
- Su, W.; Zhu, X.; Cao, Y. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Dai, W.; Lee, N.; Wang, B.; Yang, Z.; Liu, Z.; Barker, J.; Rintamaki, T.; Shoeybi, M.; Catanzaro, B.; Ping, W. NVLM: Open Frontier-Class Multimodal LLMs. arXiv 2024, arXiv:2409.11402.
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485.
- Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv 2023, arXiv:2303.04671.
- Huang, S.; Dong, L.; Wang, W.; Hao, Y.; Singhal, S.; Ma, S.; Lv, T.; Cui, L.; Mohammed, O.K.; Patra, B.; et al. Language Is Not All You Need: Aligning Perception with Language Models. arXiv 2023, arXiv:2302.14045.
- McKinzie, B.; Gan, Z.; Fauconnier, J.; Dodge, S.; Zhang, B.; Dufter, P.; Shah, D.; Du, X.; Peng, F.; Weers, F.; et al. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training. arXiv 2024, arXiv:2403.09611.
- Madsen, S.; Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; van Keulen, M.; Seifert, C. Evaluating the Explainability of Machine Learning Models in Education. IEEE Trans. Learn. Technol. 2023, 16, 1–14.
- Fitzpatrick, K.K.; Darcy, A.; Vierhile, M. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Ment. Health 2017, 4, e19.
- Murray, T. An Overview of Intelligent Tutoring System Authoring Tools: Updated Analysis of the State of the Art. In Authoring Tools for Advanced Technology Learning Environments; Springer: Berlin/Heidelberg, Germany, 2003; pp. 491–544.
- Park, Y.; Lee, G.M. Adaptive Learning Systems. In Encyclopedia of Education and Information Technologies; Springer: Berlin/Heidelberg, Germany, 2019.
- Durlach, P.J.; Lesgold, A.M. Adaptive Technologies for Training and Education; Cambridge University Press: Cambridge, UK, 2012.
- Maycock, K. Multimodal Learning. In International Handbook of the Learning Sciences; Routledge: Oxfordshire, UK, 2019; pp. 261–271.
- Cognii. AI and Education. 2020. [Online]. Available online: https://www.cognii.com/ (accessed on 14 October 2024).
- Carnegie Learning. MATHia: Personalized Math Learning Software. 2020. [Online]. Available online: https://www.carnegielearning.com/mathia/ (accessed on 14 October 2024).
- Knewton. Adaptive Learning Technology. 2018. [Online]. Available online: https://www.knewton.com/ (accessed on 14 October 2024).
- Settles, B.; Laurel, T.; Briggs, A. Machine Learning–Driven Language Education. Trans. Assoc. Comput. Linguist. 2020, 8, 451–466.
- Von Ahn, L. Duolingo: Learn a Language for Free While Helping to Translate the Web. In Proceedings of the International Conference on Intelligent User Interfaces, Santa Monica, CA, USA, 19–22 March 2013; pp. 1–2.
- Smart Sparrow. Adaptive Learning Platform. 2018. [Online]. Available online: https://www.smartsparrow.com/ (accessed on 14 October 2024).
- Duolingo Team. Introducing Duolingo Max, a Learning Experience Powered by GPT-4. Duolingo Blog. 2024. Available online: https://blog.duolingo.com/duolingo-max/ (accessed on 11 November 2024).
- Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499.
- Zheng, L.; Long, M.; Zhong, L.; Gyasi, J.F. The Effectiveness of Technology-Facilitated Personalized Learning on Learning Achievements and Learning Perceptions: A Meta-Analysis. Educ. Inf. Technol. 2022, 27, 11807–11830.
- Panigrahi, S.; Rath, P.K.; Sahoo, B. Intelligent Tutoring Systems Using Large Language Models: A Review. J. Educ. Technol. Syst. 2023, 51, 5–27.
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 121–137.
- Zhao, H.H.; Zhou, P.; Gao, D.; Bai, Z.; Shou, M.Z. LOVA3: Learning to Visual Question Answering, Asking and Assessment. arXiv 2024, arXiv:2405.14974v2.
- Yi, Z.; Ouyang, J.; Liu, Y.; Liao, T.; Xu, Z.; Shen, Y. A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems. arXiv 2024, arXiv:2402.18013.
- Sun, X.; Panda, R.; Feris, R.; Saenko, K. AdaShare: Learning What to Share for Efficient Deep Multi-Task Learning. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; Available online: https://cs-people.bu.edu/sunxm/AdaShare/project.html (accessed on 1 November 2024).
- Zhao, Y.; Qu, Y.; Xiang, Y.; Uddin, M.P.; Peng, D.; Gao, L. A Comprehensive Survey on Edge Data Integrity Verification: Fundamentals and Future Trends. ACM Comput. Surv. 2024, 57, 8:1–8:34.
- Roschelle, J.; Lester, J.; Fusco, J. (Eds.) AI and the Future of Learning: Expert Panel Report; [Report]. Digital Promise; 2020. Available online: https://circls.org/reports/ai-report (accessed on 1 November 2024).
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
- Gligorea, I.; Cioca, M.; Oancea, R.; Gorski, A.-T.; Gorski, H.; Tudorache, P. Adaptive Learning Using Artificial Intelligence in e-Learning: A Literature Review. Educ. Sci. 2023, 13, 1216.
- Abdelghani, R.; Wang, Y.-H.; Yuan, X.; Wang, T.; Lucas, P.; Sauzéon, H.; Oudeyer, P.-Y. GPT-3-Driven Pedagogical Agents for Training Children’s Curious Question-Asking Skills. In Proceedings of the 14th International Conference on Computer Supported Education (CSEDU), Online, 22–24 April 2022.
- Al-Ansi, A.M.; Jaboob, M.; Garad, A.; Al-Ansi, A. Analyzing Augmented Reality (AR) and Virtual Reality (VR) Recent Development in Education. Soc. Sci. Humanit. Open 2023, 8, 100532.
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.J.; Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13041–13049.
- Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Baines, M.; Celebi, O.; Wenzek, G.; Chaudhary, V.; et al. Beyond English-Centric Multilingual Machine Translation. J. Mach. Learn. Res. 2021, 22, 1–48.
- Holmes, W.; Bialik, M.; Fadel, C. Artificial Intelligence in Education: Promises and Implications for Teaching and Learning; Center for Curriculum Redesign: Jamaica Plain, MA, USA, 2019.
- Dwivedi, S.K.; Bharadwaj, A.K.; Jha, S.K. Role of Artificial Intelligence in Empowering Teaching and Learning. In Proceedings of the International Conference on Advances in Computing and Data Sciences, Ghaziabad, India, 12–13 April 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 12–24.
- Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep Learning-Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 2021, 52, 1–38.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
- Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.J.; et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783.
- Liao, C.-H.; Wu, J.-Y. Deploying Multimodal Learning Analytics Models to Explore the Impact of Digital Distraction and Peer Learning on Student Performance. Comput. Educ. 2022, 190, 104599.
- Ouyang, F.; Wu, M.; Zheng, L.; Zhang, L.; Jiao, P. Integration of Artificial Intelligence Performance Prediction and Learning Analytics to Improve Student Learning in Online Engineering Course. Int. J. Educ. Technol. High. Educ. 2023, 20, 4.
- edX Team. edX Debuts Two AI-Powered Learning Assistants Built on ChatGPT. edX Press Release. 12 May 2023. [Online]. Available online: https://press.edx.org/edx-debuts-two-ai-powered-learning-assistants-built-on-chatgpt (accessed on 15 October 2024).
- Udemy. Udemy’s AI-Powered Learning. 2020. [Online]. Available online: https://about.udemy.com/ (accessed on 17 October 2024).
- Knewton. Knewton Alta. 2020. [Online]. Available online: https://japan.knewton.com/news/n2020112401.html (accessed on 17 October 2024).
- Shah, D. By The Numbers: MOOCs in 2021. Class Central. 2021. [Online]. Available online: https://www.classcentral.com/report/mooc-stats-2021/ (accessed on 19 October 2024).
- Zawacki-Richter, O.; Marín, V.I.; Bond, M.; Gouverneur, F. Systematic Review of Research on Artificial Intelligence Applications in Higher Education—Where Are the Educators? Int. J. Educ. Technol. High. Educ. 2019, 16, 39.
- Piech, C.; Spencer, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.; Sohl-Dickstein, J. Deep Knowledge Tracing. arXiv 2015, arXiv:1506.05908.
- Lakew, S.M.; Federico, M.; Negri, M.; Turchi, M. Multilingual Neural Machine Translation for Zero-Resource Languages. arXiv 2018, arXiv:1909.07342.
- Molina, J.P.; Turchi, V.M. GDPR challenges for leveraging big data in the education and research sectors. In Proceedings of the 14th International Conference on Web Information Systems and Technologies, Seville, Spain, 18–20 September 2018; pp. 659–666.
- Blodgett, S.L.; Barocas, S.; Daumé, H., III; Wallach, H. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5454–5476.
- Holzinger, A.; Biemann, C.; Pattichis, C.S.; Kell, D.B. What Do We Need to Build Explainable AI Systems for the Medical Domain? arXiv 2017, arXiv:1712.09923.
- Baker, R.S.; Siemens, G. Educational Data Mining and Learning Analytics. In Learning Analytics; Springer: Berlin/Heidelberg, Germany, 2019; pp. 61–75.
- Baltrušaitis, T.; Robinson, P.; Morency, L.-P. OpenFace: An Open Source Facial Behavior Analysis Toolkit. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1–10.
- Alwahaby, H.; Cukurova, M.; Papamitsiou, Z.; Giannakos, M. The Evidence of Impact and Ethical Considerations of Multimodal Learning Analytics: A Systematic Literature Review. In The Multimodal Learning Analytics Handbook; Springer: Berlin/Heidelberg, Germany, 2022; pp. 289–325.
- Arnold, K.E.; Pistilli, M.D. Course Signals at Purdue: Using Learning Analytics to Increase Student Success. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Vancouver, BC, Canada, 29 April–2 May 2012; pp. 267–270.
- Shoumy, N.J.; Ang, L.-M.; Seng, K.P.; Rahaman, D.M.M.; Zia, T. Multimodal Big Data Affective Analytics: A Comprehensive Survey Using Text, Audio, Visual and Physiological Signals. J. Netw. Comput. Appl. 2020, 149, 102447.
- Papamitsiou, Z.; Economides, A.A. Learning Analytics and Educational Data Mining in Practice: A Systematic Literature Review of Empirical Evidence. Educ. Technol. Soc. 2014, 17, 49–64.
- Blanchard, E.G.; Bousbia, D.; Franceschini, B. Identifying Group Dynamics and Emotion in E-Learning: An Integrated Approach. In Intelligent Tutoring Systems; Springer: Berlin/Heidelberg, Germany, 2016; pp. 354–359.
- Giannakos, M.; Cukurova, M. The Role of Learning Theory in Multimodal Learning Analytics. Br. J. Educ. Technol. 2023, 54, 1246–1267.
- Ellis, R.A.; Goodyear, P. Developing and Using a Learning Analytics Framework: A Case Study. Teach. High. Educ. 2019, 24, 394–407.
- Çeken, B.; Taşkın, N. Multimedia Learning Principles in Different Learning Environments: A Systematic Review. Smart Learn. Environ. 2022, 9, 19.
- Romero, C.; Ventura, S. Educational Data Mining and Learning Analytics: An Updated Survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1355.
- Giannakos, M.N.; Sharma, K.; Pappas, I.O.; Kostakos, V.; Velloso, E. Multimodal Data as a Means to Understand the Learning Experience. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge, Tempe, AZ, USA, 4–8 March 2019; pp. 639–640.
- Zhao, J.; Wu, M.; Zhou, L.; Wang, X.; Jia, J. Cognitive Psychology-Based Artificial Intelligence Review. Front. Neurosci. 2022, 16, 1024316.
- Ke, Z.; Ng, V. Automated Essay Scoring: A Survey of the State of the Art. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), Macao, China, 10–16 August 2019; pp. 6300–6308.
- Prinsloo, P. Of ‘Black Boxes’ and Algorithmic Decision-Making in (Higher) Education—A Commentary. Big Data Soc. 2020, 7, 2053951720933994.
- Zhu, X.; Zhou, W.; Han, Q.-L.; Ma, W.; Wen, S.; Xiang, Y. When Software Security Meets Large Language Models: A Survey. IEEE/CAA J. Autom. Sin. 2024. accepted.
- Gierl, M.J.; Zhang, H. Automated Scoring in the Classroom. In Handbook of Automated Essay Evaluation: Current Applications and New Directions; Routledge: Oxfordshire, UK, 2018; pp. 136–154.
- Gašević, D.; Dawson, S.; Siemens, G. Let’s Not Forget: Learning Analytics are about Learning. TechTrends 2015, 59, 64–71.
- Mizumoto, A.; Eguchi, M. Exploring the potential of using an AI language model for automated essay scoring. Res. Methods Appl. Linguist. 2023, 3, 100050.
- Latif, S.; Zaidi, A.; Cuayahuitl, H.; Shamshad, F.; Shoukat, M.; Qadir, J. Transformers in Speech Processing: A Survey. arXiv 2023, arXiv:2303.11607.
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; Chang, K.-W. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557.
- Schneider, J.; Schenk, B.; Niklaus, C. Towards LLM-based Autograding for Short Textual Answers. In Proceedings of the 16th International Conference on Computer Supported Education (CSEDU 2024), Angers, France, 2–4 May 2024.
- MOSS (Measure of Software Similarity). Available online: https://theory.stanford.edu/~aiken/moss/ (accessed on 13 November 2022).
- Ramesh, D.; Sanampudi, S.K. An Automated Essay Scoring Systems: A Systematic Literature Review. Artif. Intell. Rev. 2022, 55, 2495–2527.
- Chen, J.; Guo, H.; Yi, K.; Li, B.; Elhoseiny, M. VisualGPT: Data-Efficient Adaptation of Pretrained Language Models for Image Captioning. arXiv 2021, arXiv:2102.10407.
- Funayama, H.; Asazuma, Y.; Matsubayashi, Y.; Mizumoto, T.; Inui, K. Reducing the Cost: Cross-Prompt Pre-Finetuning for Short Answer Scoring. In Artificial Intelligence in Education (AIED 2023); Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2023; Volume 13916, pp. 78–89.
- Gupta, S.; Sharda, N. Content Generation for Serious Games in Education: The ANFIS Approach. IEEE Trans. Learn. Technol. 2018, 11, 493–507.
- Rose, D.H.; Meyer, A. A Practical Reader in Universal Design for Learning; Harvard Education Press: London, UK, 2006.
- Graesser, A.C.; Cai, Z.; Morgan, B.; Wang, L. Assessment with Computer Agents That Engage in Conversational Dialogues and Trialogues with Learners. Comput. Hum. Behav. 2018, 76, 607–616.
- Williamson, B.; Hogan, A. Commercialisation and Privatisation in/of Education in the Context of COVID-19; Education International: Brussels, Belgium, 2020.
- Regan, P.M.; Jesse, J. Ethical Challenges of EdTech, Big Data and Personalized Learning: Twenty-First Century Student Sorting and Tracking. Ethics Inf. Technol. 2019, 21, 167–179.
- Tian, H.; Liu, B.; Zhu, T.; Zhou, W.; Yu, P.S. MultiFair: Model Fairness with Multiple Sensitive Attributes. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–14.
- Tian, H.; Liu, B.; Zhu, T.; Zhou, W.; Yu, P.S. Distilling Fair Representations From Fair Teachers. IEEE Trans. Big Data 2024, 1–14.
- Chen, H.; Zhu, T.; Zhang, T.; Zhou, W.; Yu, P.S. Privacy and Fairness in Federated Learning: On the Perspective of Tradeoff. ACM Comput. Surv. 2023, 56, 39:1–39:37.
- Kumar, V.; Sharma, D.K.; Singh, H. Interoperability Issues in e-Learning: A Review. Int. J. Recent Technol. Eng. 2019, 8, 115–121.
- Lipton, Z.C. The Mythos of Model Interpretability: In Machine Learning, the Concept of Interpretability is Both Important and Slippery. Queue 2018, 16, 31–57.
- Driess, D.; Xia, F.; Sajjadi, M.S.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. arXiv 2023, arXiv:2303.03378.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401.
- Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. REALM: Retrieval-Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 3929–3938.
- Popenici, S.A.D.; Kerr, S. Exploring the Impact of Artificial Intelligence on Teaching and Learning in Higher Education. Res. Pract. Technol. Enhanc. Learn. 2017, 12, 22.
- Villaronga, E.F.; Kieseberg, P.; Li, T. Humans Forget, Machines Remember: Artificial Intelligence and the Right to Be Forgotten. Comput. Law Secur. Rev. 2018, 34, 304–313.
- Knight, E.; Cook, S. Educating Global Citizens in a Digital Age: The Role of MOOCs. J. Glob. Educ. Res. 2020, 4, 97–111.
- Amin, S.; Uddin, M.I.; Alarood, A.A.; Mashwani, W.K.; Alzahrani, A.; Alzahrani, A.O. Smart E-Learning Framework for Personalized Adaptive Learning and Sequential Path Recommendations Using Reinforcement Learning. 2024. Available online: https://ieeexplore.ieee.org/document/10220065 (accessed on 10 November 2024).
- Strielkowski, W.; Grebennikova, V.; Lisovskiy, A.; Rakhimova, G.; Vasileva, T. AI-driven adaptive learning for sustainable educational transformation. Sustain. Dev. 2024.
- Khan Academy. Introducing Khanmigo: AI for Education. 2023. Available online: https://www.microsoft.com/en-us/education/blog/2024/08/khanmigo-for-teachers-your-free-ai-powered-teaching-tool/ (accessed on 10 October 2024).
- Maiti, P.; Goel, A.K. How Do Students Interact with an LLM-powered Virtual Teaching Assistant in Different Educational Settings? In Proceedings of the Seventeenth International Conference on Educational Data Mining (EDM) Workshop: Leveraging LLMs for Next Generation Educational Technologies, Atlanta, GA, USA, 14–17 July 2024.
- Liu, B.; Ding, M.; Shaham, S.; Rahayu, W.; Farokhi, F.; Lin, Z. When Machine Learning Meets Privacy: A Survey and Outlook. ACM Comput. Surv. 2021, 54, 31:1–31:36.
- Chen, L.; Chen, P.; Lin, Z. Artificial intelligence in education: A review. IEEE Access 2020, 8, 75264–75278.




| Model | Key Capabilities | Educational Applications | Open Source | 
|---|---|---|---|
| PaLM-E | Integrates vision and language for robotics and embodied AI. | Enhances interactive, physical learning environments, especially in STEM and robotics education. | No | 
| LLaVA | Combines vision and language for general-purpose understanding. | Visual question answering, image captioning, supporting visually enriched content in learning platforms. | Yes | 
| Kosmos-G | Processes text and images for multimodal comprehension. | Facilitates interactive content, supports collaborative and visual learning tools. | No | 
| GPT-4o | Extends LLM capabilities to visual data, enabling conversational responses to images. | Supports interactive learning with text and visual input, such as image-based Q&A and description generation. | No | 
| MM1 | Achieves state-of-the-art performance in multimodal tasks by combining high-resolution visual processing and language models. | Supports tasks like in-context learning, multi-image reasoning, and few-shot learning, useful for assessments and exploratory learning. | No | 
| Llama 3-V | Combines vision, language, coding, reasoning, and tool usage with multilingual support. | Enables adaptive and multilingual learning, coding education, and collaborative projects. | Yes | 
| NVLM 1.0 | Frontier-class multimodal model with exceptional vision-language reasoning and text-only improvements. | Supports OCR tasks, multimodal math reasoning, and document analysis in educational environments. | Yes | 
| BLIP | Pre-trained for text-image retrieval and multimodal content generation. | Facilitates creative content development and storytelling in visual education. | Yes |

| Application Area | Technologies Used | Examples of Applications | Case Study |
|---|---|---|---|
| Adaptive Learning Platforms | MLLMs with transformer architectures (ViT, AST) | Cognii, Carnegie Learning’s MATHia, Knewton, Duolingo, Smart Sparrow | Integration of MLLMs in Duolingo | 
| Virtual Tutors and Chatbots | Transformer-based language models (GPT-3, GPT-4), VisualGPT, DeepSpeech, Tacotron 2, ADS | Squirrel AI Learning, Duolingo, Watson Tutor, Woebot [35] | Implementation of LOVA3 in Virtual Tutoring Systems | 
| Intelligent Content Creation | Transformer-based multimodal models (GPT-4, BLIP-2), DALL·E and DALL·E 2, NLG (GPT-3, T5) | Automated textbook generation, quiz and assessment generation, interactive simulations, multimedia content creation, language translation and localization | Automated Quiz Generation Using GPT-3 | 
| AI-powered LMS | GPT-4 and CLIP, natural language interfaces, Wav2Vec 2.0, Tacotron 2 | Coursera, edX, Udemy, Knewton’s Alta | Integration of MLLMs in Coursera | 
| Insight and Predictive Analytics | GPT-4, CLIP, computer vision models (OpenFace), speech and audio processing models (Wav2Vec 2.0) | Early warning systems, sentiment and emotion analysis, adaptive feedback generation, curriculum and instructional design insights, collaborative skills assessment | Early Warning System Using MLLMs | 
| Grading and Assessment Tools | NLP and NLU (GPT-4, BERT), computer vision and image recognition (CLIP, ViT), speech and audio processing (Wav2Vec 2.0) | E-Rater by ETS, MOSS, CodeRunner, Duolingo English Test, tools for multimodal assignment evaluation | Reducing the Cost of Short-Answer Scoring with MLLMs |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Xing, W.; Zhu, T.; Wang, J.; Liu, B. A Survey on MLLMs in Education: Application and Future Directions. Future Internet 2024, 16, 467. https://doi.org/10.3390/fi16120467