Vision-Language Models in Teaching and Learning: A Systematic Literature Review

Jing Tian

doi:10.3390/educsci16010123

Abstract

Vision-language models (VLMs) integrate visual and textual information and are increasingly being used as innovative tools in educational applications. However, there is a lack of evidence regarding current practices for integrating VLMs into teaching and learning. To address this research gap and identify the opportunities and challenges associated with the integration of VLMs in education, this paper presents a systematic review of VLM use in formal educational contexts. Peer-reviewed articles published between 2020 and 2025 were retrieved from five major databases: ACM Digital Library, Scopus, Web of Science, Engineering Village, and IEEE Xplore. Following the PRISMA-guided framework, 42 articles were selected for inclusion. Data were extracted and analyzed against six research questions: (1) where VLMs are applied across academic disciplines and educational levels; (2) what types of VLM solutions are deployed and which image–text modalities they infer and generate; (3) the pedagogical roles of VLMs within teaching workflows; (4) reported outcomes and benefits for learners and instructors; (5) challenges and risks identified in practice, together with corresponding mitigation strategies; and (6) reported evaluation methods. The included studies span K-12 through higher education and cover diverse disciplines, with deployments dominated by pre-trained models and a smaller number of domain-adapted approaches. VLM-supported pedagogical functions cluster into five roles: analyst, assessor, content curator, simulator, and tutor. This review concludes by discussing implications for VLM adoption in educational settings and offering recommendations for future research.

Keywords:

vision–language models; AI in education; systematic literature review

1. Introduction

Digital technologies have reshaped teaching and learning, changing how curricula are delivered, assessed, and experienced. Artificial intelligence (AI) now plays a central role in educational practice, for example through adaptive learning management, intelligent tutors, and automated assistance that tailors instruction to learner profiles and supports diverse needs (L. Chen et al., 2020; Zhan et al., 2024). In addition, AI tools have broadened access to resources, enabled personalized pathways, and provided interactive methods to enrich teaching and learning (Ng et al., 2025; Tian, 2025).

Vision–language models (VLMs) combine visual and textual understanding to tackle tasks ranging from image captioning and visual question answering to object localization (Bommasani et al., 2022; Zhou et al., 2025). A pivotal breakthrough was OpenAI’s CLIP (Radford et al., 2021), which was trained on 400 million image–caption pairs. Its contrastive training strategy enables zero-shot recognition: the model can classify images via textual prompts without requiring task-specific examples of the target classes in its training dataset.
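The zero-shot mechanism described above can be illustrated with a minimal sketch. It assumes that an image encoder and a text encoder have already mapped the image and each candidate prompt into a shared embedding space; the three-dimensional vectors, the prompt wording, and the temperature value below are hypothetical toy values chosen for illustration, not outputs or settings of any real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Score an image embedding against text-prompt embeddings.

    Returns the best-matching prompt and a softmax distribution over
    all prompts. The temperature is an illustrative assumption.
    """
    labels = list(prompt_embeddings)
    logits = [cosine(image_embedding, prompt_embeddings[label]) / temperature
              for label in labels]
    # Subtract the maximum logit for a numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical toy embeddings standing in for real encoder outputs.
prompts = {
    "a photo of a bar chart": [0.9, 0.1, 0.0],
    "a photo of a circuit diagram": [0.1, 0.9, 0.1],
    "a photo of a handwritten essay": [0.0, 0.2, 0.9],
}
image = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image, prompts)
print(label)
```

The class set is defined entirely by the text prompts, so adding a new class in this sketch only requires adding a new prompt embedding rather than retraining.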

Recently, there has been rapid progress in both open-weight and proprietary VLMs. The open-weight VLMs release their model configurations and weights, so users can deploy them on local or private infrastructure, while proprietary VLMs are typically only accessed through APIs. Firstly, on the open-weight side, LLaVA-v1.5 (Liu et al., 2024) applied visual instruction tuning by connecting an image understanding component to a language model and fine-tuning using instruction-style question–answer pairs. Qwen2-VL (P. Wang et al., 2024) is another representative open-weight VLM solution that supports image-and-text inputs and produces language outputs. Secondly, on the proprietary side, OpenAI’s GPT-4V (Yang et al., 2023) was released as a multimodal model accepting both image inputs and textual inputs, and demonstrated capabilities on reasoning over common educational visuals such as charts and figures. Claude-3 (Anthropic, 2024) is a proprietary multimodal model that also supports image-and-text interaction. Gemini 2.5 (Comanici et al., 2025) is Google’s multimodal model line; it can interpret images together with accompanying prompts and return structured text outputs. A brief description of these VLMs is provided in Table 1. By jointly interpreting images and text, VLMs can be viewed as multimodal assistants for educators. They can read visual teaching artifacts (e.g., diagrams and slides) with text prompts (e.g., “explain this chart” or “generate quiz questions from this slide”) and respond with explanations or feedback in language.

Table 1. A summary of typical VLMs that are used in teaching and learning. The symbol − means that the information is undisclosed in the article.

This paper aims to provide a systematic review of applying VLMs in the context of enhancing teaching and learning. The motivation of this review is highlighted in the following three aspects.

Educational motivation: Learners typically gain understanding from more than one channel, such as linguistic input (e.g., speech and text) and imagery input (e.g., pictures and diagrams) (Paivio, 2013). The linguistic pathway excels at sequential and symbolic processing, whereas the visual pathway supports holistic, spatial reasoning. From a constructivist perspective, VLMs act as scaffolds that help learners actively build knowledge and that provide feedback. From a connectivist lens, VLMs retrieve, filter, and translate multimodal resources (e.g., slides and lecture video recordings), enabling just-in-time connections among learners, instructors, and learning materials.
Technological motivation: Beyond text-only large language models, VLMs combine inputs across formats such as text, images, and video (Danish et al., 2026; Shinde et al., 2025; Zhang et al., 2024). This integration allows systems to interpret and generate pedagogy-relevant artifacts (e.g., slides and figures) within one integrated framework, aligning naturally with how students learn. Traditionally, these abilities were pursued in separate communities, including computer vision for understanding images and natural language processing for reasoning over language. However, the multimodal architectures used in VLMs can integrate both modalities, enabling new practical opportunities for teaching, assessment, and feedback.
Research gap: A few systematic review articles have studied the application of large language models (LLMs) in education (Agbo et al., 2025; Ali et al., 2024; Kostopoulos et al., 2025; H. Y. Lee et al., 2025; Raihan et al., 2025; P. Wang et al., 2025). However, they address only single-modality models (i.e., language only), whereas this paper focuses on two modalities (i.e., both vision and language), as shown in Table 2. On the other hand, multimodal large language models (MLLMs) have been reported to improve learner engagement, support personalized pathways, and deepen understanding by processing and producing context-aware content (G. Lee et al., 2025). The integration of diverse cognitive channels motivates the use of MLLMs to enhance learners’ educational experience (Xing et al., 2024). Küchemann et al. (2025) outline both the opportunities and the challenges created by bringing MLLMs into educational practice. Different from these studies, this review focuses on a set of new research questions curated for VLMs, including where VLMs are applied, which VLM solutions are applied, what the input–output modalities are, what the pedagogical roles of VLMs and their associated outcomes for learners and instructors are, and how they are evaluated.

Table 2. Comparison of this paper with existing review papers.

In view of both opportunities and challenges offered by VLMs for educational applications, the contribution of this paper is to address that gap by presenting a systematic literature review of the impact of VLMs on teaching and learning in formal educational contexts. It is essential to understand their impact on the learner experience for improving curriculum design, assessment practices, and instructional support. We organize the review around six research questions designed to explain where and how VLMs are used, the roles of humans and systems in the teaching workflow, the benefits and outcomes reported, and the challenges and mitigation strategies identified. We followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework (Page et al., 2021) and applied a multi-database search strategy with predefined inclusion and exclusion criteria to ensure relevance and quality of studied articles.

The rest of this review paper is organized as follows. The literature search process and six research questions are presented in Section 2. The findings are presented in Section 3. A discussion is provided in Section 4 to present the recommendations and the limitations of this study. Finally, Section 5 concludes this paper.

2. Methodology

2.1. Literature Search Process

Following the PRISMA framework (Page et al., 2021), we conducted a structured, multi-stage process to identify, screen, and select studies for inclusion in the review.

Firstly, in the identification stage, we searched five major databases commonly used for publishing and indexing computing and educational technology research: Scopus, Web of Science, IEEE Xplore, Engineering Village, and ACM Digital Library. To retrieve relevant studies, we developed a set of search strings constructed with logical operators. The query combined technical terms related to vision–language models, as shown in Table 3. The search was conducted on 22 December 2025. We retrieved a total of 538 records: Scopus (n = 253), ACM Digital Library (n = 20), IEEE Xplore (n = 36), Web of Science (n = 75), and Engineering Village (n = 154). After importing all records into Excel, 256 duplicate records were removed based on the title and Digital Object Identifier (DOI); 282 records remained for the subsequent screening.
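The duplicate removal step can be sketched as follows. This is a minimal illustration that assumes each record is a dictionary with hypothetical `title` and `doi` fields and that matching is done on the case-insensitive DOI or on the whitespace-normalized, lowercased title; the exact matching rules applied in Excel are not specified in this paper, and the sample records are invented.

```python
def normalize_title(title):
    """Lowercase a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())

def deduplicate(records):
    """Keep the first occurrence of each record, matching on DOI or title."""
    seen_dois, seen_titles = set(), set()
    unique = []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        title = normalize_title(rec.get("title") or "")
        # A record is a duplicate if either key has been seen before.
        if (doi and doi in seen_dois) or (title and title in seen_titles):
            continue
        if doi:
            seen_dois.add(doi)
        if title:
            seen_titles.add(title)
        unique.append(rec)
    return unique

# Invented example records; the DOIs and titles are placeholders.
records = [
    {"title": "Example Study A", "doi": "10.0000/example.a", "source": "Scopus"},
    {"title": "Example study a ", "doi": "10.0000/EXAMPLE.A", "source": "Web of Science"},
    {"title": "Example Study B", "doi": "", "source": "IEEE Xplore"},
    {"title": "example  study B", "doi": None, "source": "Engineering Village"},
    {"title": "Example Study C", "doi": "10.0000/example.c", "source": "ACM Digital Library"},
]

unique = deduplicate(records)
print(len(unique), [rec["source"] for rec in unique])
```

Matching on the title as well as the DOI catches records that some databases export without a DOI, at the cost of merging distinct works that happen to share a title.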

Table 3. The search syntax used in the article identification process in this paper. The asterisk (*) represents any group of characters, including no character.

Next, in the screening stage, titles and abstracts were assessed against the inclusion and exclusion criteria in Table 4: 190 records were excluded as not relevant to teaching and learning, and 19 records were excluded because they were not research articles (e.g., reviews, tutorials, and abstracts). The remaining 73 records proceeded to the eligibility stage, in which 31 studies were excluded because they focused solely on data evaluation rather than enhancing teaching and learning. Finally, 42 studies met all criteria and were included in the final synthesis.
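The record counts reported for the identification, screening, and eligibility stages can be checked with a short calculation. The numbers below are taken from the text of this section; the variable names are illustrative only.

```python
# Records retrieved per database in the identification stage.
identified = {
    "Scopus": 253,
    "ACM Digital Library": 20,
    "IEEE Xplore": 36,
    "Web of Science": 75,
    "Engineering Village": 154,
}

total_identified = sum(identified.values())
after_dedup = total_identified - 256        # duplicates removed
after_screening = after_dedup - 190 - 19    # not relevant; not research articles
included = after_screening - 31             # excluded at the eligibility stage

print(total_identified, after_dedup, after_screening, included)
```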

Table 4. The inclusion and exclusion criteria used in the article selection in this paper.

The complete flow of records through each PRISMA stage, including identification, screening, eligibility, and inclusion, is illustrated in Figure 1. The annual distribution of these articles is provided in Table 5. The full list of articles during the identification stage and the screening stage is provided in Supplementary Tables S1–S6, as shown in Supplementary Materials. Their associated brief descriptions are summarized in Table A1.

Figure 1. The PRISMA flow diagram used in this paper.

Table 5. The annual distribution of articles covered in this paper.

2.2. Quality Assessment

To evaluate the methodological quality of the 42 studies included in this review, we applied the Mixed Methods Appraisal Tool (MMAT) checklist (Hong et al., 2018). Firstly, each article was classified into one of MMAT’s five design categories: (i) qualitative study, (ii) quantitative randomized controlled trial, (iii) quantitative non-randomized study, (iv) quantitative descriptive study, and (v) mixed-methods study. Then, each article was evaluated against the corresponding MMAT criteria. Each criterion was rated using a three-point scale (“yes”, “no”, or “can’t tell”). Finally, each study was assigned an overall quality rating based on the number of criteria met: “High” (5 “yes”), “Moderate” (3–4 “yes”), “Low” (≤2 “yes”), or “Not applicable” (where MMAT is not suitable). This three-step approach supports consistent and transparent evaluation of studies included in the review. A full list of MMAT ratings for all articles is provided in Supplementary Table S7, as shown in Supplementary Materials.
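The overall rating rule can be written as a small function. This sketch assumes five criteria per design category, which is consistent with the thresholds stated above, and represents a study for which MMAT is not suitable as `None`; the function name and input format are hypothetical.

```python
def overall_rating(criteria):
    """Map five MMAT criterion ratings to an overall quality label.

    `criteria` is a list of five strings, each "yes", "no", or
    "can't tell", or None when MMAT is not suitable for the study.
    """
    if criteria is None:
        return "Not applicable"
    allowed = {"yes", "no", "can't tell"}
    if len(criteria) != 5 or any(c not in allowed for c in criteria):
        raise ValueError("expected five ratings of yes / no / can't tell")
    yes_count = sum(1 for c in criteria if c == "yes")
    if yes_count == 5:
        return "High"
    if yes_count >= 3:
        return "Moderate"
    return "Low"

# Only "yes" counts toward the rating; "can't tell" is treated like "no".
print(overall_rating(["yes", "yes", "yes", "can't tell", "no"]))
```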

2.3. Research Questions

The objective of this paper is to provide a systematic literature review of the impact of VLMs on teaching and learning in formal educational contexts. To achieve this goal, we crafted six research questions (RQs). The field is rapidly evolving across diverse subjects, educational levels, and deployment settings. Therefore, the first two RQs establish the landscape: in which areas VLMs are being used and what kinds of solutions are actually deployed. Then, to study how VLMs can be integrated into pedagogy, the next two RQs examine the specific roles of VLMs and interpret effectiveness by collating learner and instructor outcomes. Finally, the last two RQs compile challenges and mitigation strategies, as well as the quality and validity of the evaluations. The six RQs and their respective motivations are described as follows.

RQ1. In which educational levels and academic disciplines have VLMs been applied to teaching? It is important to map the deployment landscape to clarify where VLMs are already used. Therefore, we need to extract discipline and education level from each article.
RQ2. What types of VLM solutions are used, what modalities do they handle, and what do they generate? A user-facing taxonomy (off-the-shelf pre-trained vs. tuned) supports practical adoption decisions, and the input/output modalities indicate how VLMs fit with real teaching artifacts (e.g., diagrams and slides). Therefore, we need to record the solution type, input modalities, and outputs from each article.
RQ3. What is the role of the VLMs in the teaching workflow? It is essential to clarify the role of VLMs in the workflow for effective deployment. Therefore, we need to record the VLM role from each article.
RQ4. What benefits are reported for learners and instructors? VLMs might introduce different benefits for different stakeholders (e.g., learners and instructors). Therefore, we need to extract learner and instructor benefits from each article.
RQ5. What challenges and risks are reported for learners and instructors, and what mitigation strategies are described or evaluated? The adoption challenges and tested mitigation strategies would benefit other educators in integrating VLMs into their practice. Therefore, we need to record technical, pedagogical, and ethical issues, and identify which mitigation is empirically evaluated in each article.
RQ6. How are studies designed and evaluated? To verify the educational impacts of VLMs on teaching and learning, it is required to assess methodological quality. Therefore, we need to record the dataset and reproducibility from each article.

The findings of the six research questions are reported in the following section.

3. Results

3.1. RQ1: Education Levels and Disciplines

This question examines the educational levels and disciplinary contexts represented in the reviewed studies. As summarized in Table 6, VLM applications span a wide range of educational levels, from K-12 (5 articles) to higher education (14 articles). Higher education serves as a common testbed for emerging educational technologies because institutions tend to have greater access to computational infrastructure (for proprietary VLM solutions), established learning platforms that record multimodal instructional artifacts (slides, recorded lectures, and digital submissions), or teaching teams with stronger AI skills (for open-weight VLMs). On the other hand, K-12 studies often focus on engagement and age-appropriate content generation, reflecting strong demand for learner support.

Table 6. Findings of RQ1. Summary of educational levels and disciplinary contexts represented in the articles included in the review. The symbol − means that it is undisclosed in the article.

The disciplinary distribution is also diverse. Reported applications include computer science, engineering, and several non-STEM fields, indicating that interest in VLM-supported teaching is emerging across multiple domains. This might be due to the unique capability of VLMs, which can operate on the visual materials that are ubiquitous across disciplines, such as figures, charts, slides, sketches, photos of physical work, and scanned handwritten responses. In summary, the findings suggest that VLM-supported teaching is emerging as a cross-disciplinary capability.

3.2. RQ2: VLMs Solution and Modality

This question examines which VLM solutions are used, the modalities they accept, and what they generate. As shown in Figure 2, the distribution is led by the GPT family (26 studies; Yang et al., 2023), followed by the LLaVA family (8 studies; Liu et al., 2024), the Gemini family (6 studies; Comanici et al., 2025), the Claude family (5 studies; Anthropic, 2024), and the Qwen family (5 studies; P. Wang et al., 2024). In terms of configuration, pre-trained models are used in 35 studies, while fine-tuned models are used in 6 studies.

Figure 2. VLM solutions and configurations in the included studies (RQ2). (a) Summary of VLM solutions used (models appearing once grouped as Others; studies may use multiple solutions). (b) Distribution of VLM configurations, including pre-trained and tuned.

Table 7 summarizes the input and output modalities used by VLMs. In most articles, VLMs are applied to visual inputs, such as static images (e.g., classroom photos, slides, scanned worksheets, or scanned transcripts) or video feeds (lecture videos or skills training recordings). Several studies pair visuals with text prompts or course materials. The major types of outputs are textual: answers, feedback, assessment items (quiz questions), descriptions, and lecture scripts. A second subset yields structured labels and analytics results (e.g., engagement/attention states, and recognized content) to support monitoring or grading. A smaller set produces visual artifacts, such as matched images for textbooks or generated visualizations in tutoring.

Table 7. The input and output modalities of VLM solutions in the included studies (RQ2).

3.3. RQ3: Pedagogical Functions

VLMs play a range of roles across the educational settings reported in the included studies. For example, they can support instructional material development and guide learners through structured feedback. These roles are similar to those in intelligent tutoring systems (Zawacki-Richter et al., 2019), but VLMs differ in their ability to work directly with multimodal artifacts and to generate multimodal outputs. In this review paper, we define the following five roles with distinct pedagogical functions, as summarized in Table 8.

Table 8. A summary of pedagogical functions of VLMs in the studies (RQ3).

Analyst (13 articles). In this role, VLMs analyze educational data and provide insights to augment instructors’ decisions, rather than making decisions automatically. For example, as classroom analytics tools, VLMs analyze visual classroom data (images/video) to infer engagement and participation, and can also support academic integrity monitoring in online settings (e.g., activity detection). These functions provide instructors with timely and actionable insights.
Assessor (10 articles). In this role, VLMs support assessment by interpreting learners’ submissions, such as scanned worksheets, and automatically producing scores.
Content curator (9 articles). VLMs support content authoring and augmentation by transforming raw instructional assets (e.g., slides, textbooks, and lecture videos) into richer learning materials. They can also generate assessment content from multimodal inputs, for example by turning lecture videos into structured questions (MCQs and short-answer open-ended questions).
Simulator (1 article). VLMs help craft interactive, persona-driven experiences that simulate learning in concrete visual contexts. They combine visual perception with dialogue to create situated practice opportunities for learners.
Tutor (9 articles). In this role, VLMs deliver just-in-time guidance grounded in visual context, such as interpreting diagrams or student sketches and responding with hints or clarifications. Such approaches can be deployed synchronously (during class) or asynchronously (homework support). The tutor role differs from the analyst role because the tutor provides feedback directly to learners, while the analyst only supports users’ decision-making through analyzed insights.

3.4. RQ4: Learners and Instructors’ Benefits from VLMs

This question studies different benefits brought by VLMs for different stakeholders (e.g., learners and instructors) and summarizes them in Table 9 and Table 10.

Table 9. A summary of learners’ benefits reported in articles (RQ4).

Table 10. A summary of instructors’ benefits reported in articles (RQ4).

For learners, VLMs are reported to improve motivation, attention, and participation by providing a responsive, supportive teaching and learning environment (4 articles). VLMs can provide actionable guidance and timely answers that help learners know what to do next, covering formative feedback, explanations, and on-demand question answering (10 articles). VLMs are also reported to improve understanding and learning gains through multimodal inputs/outputs (e.g., images, diagrams, and captions) that clarify concepts and connect representations (8 articles).

For instructors, VLMs help reduce routine workload (e.g., transcript processing) and streamline operations, freeing time for higher-value teaching activities (2 articles). VLMs help instructors evaluate learner work or understanding by grading artifacts (text, diagrams, and videos) and generating assessment items for testing and formative feedback (10 articles). VLMs are also reported to analyze classroom behaviors and exam conditions (e.g., engagement and attention) to surface actionable insights for monitoring learning and safeguarding integrity (3 articles). They support the creation of teaching materials, such as lecture content, slides, scripts, and question–answer materials, saving preparation time and improving instructional resources (5 articles). VLMs also help educators adapt instruction by refining explanations, personalizing supports, and responding to learner needs during sessions (6 articles).

3.5. RQ5: Challenge and Mitigation

This question studies what challenges and risks are reported for learners and instructors, and what mitigation strategies are tested, from three perspectives: technical, pedagogical, and ethical.

Technical challenges. VLMs face several technical difficulties in educational settings. Low-quality inputs can degrade performance. For example, poor images lead to inaccurate color identification and object counting (Tapia-Mandiola & Araya, 2025), and heterogeneous transcript layouts cause recognition errors (Bhaskaran & Pardos, 2025). Robust pre-processing (e.g., segmentation and layout normalization) is recommended to improve accuracy. The inherent hallucination risk of VLMs might produce irrelevant content; targeted fine-tuning was used to improve question relevance (Nguyen & Park, 2025; Stamatakis et al., 2025). The performance of VLMs also depends on how they are prompted; Xie et al. (2025) report specific prompt designs for enhancing grading accuracy and consistency. Finally, VLMs can be computationally expensive to use; J. Chen et al. (2024) highlight the need for computationally efficient fine-tuning methods.
Pedagogical challenges. In content generation, Kunuku and Dehbozorgi (2025) note that outputs must align with higher-order thinking in Bloom’s taxonomy (e.g., application, analysis, and evaluation) and propose an LLM-as-Judge framework to automate evaluation of cognitive alignment. The type of feedback significantly influences students’ motivation to revise their work: direct and informative feedback conditions were found to be more effective in encouraging students’ revisions than general feedback conditions (Zhuang et al., 2025). Therefore, VLMs should provide more detailed and guided feedback.
Ethical challenges. Many ethical concerns have been reported in the reviewed articles, and ethical safeguards are essential for VLM-enabled applications in teaching (Marquez-Carpintero et al., 2025). These authors emphasize the need for informed consent and recommend establishing regulatory frameworks and preventing surveillance or misuse of sensitive information. For example, one reported concern is the risk of unauthorized disclosure of student information and the privacy risks when learner data are processed with proprietary VLM services (U. Lee et al., 2024). Furthermore, cultural misalignment is also reported; VLMs often struggle to interpret visual content in non-Western contexts, potentially misrepresenting under-resourced cultures (Tan et al., 2025). To mitigate these risks, several strategies are reported. One typical strategy is to implement a human-in-the-loop approach, where instructors provide essential oversight to ensure pedagogical validity (Shu et al., 2025). Retrieval-Augmented Generation (RAG) is also implemented to reduce hallucinations and ensure factual, context-aware outputs (Tan et al., 2025). Lastly, compared with proprietary VLMs, open-weight models might enhance privacy by allowing for on-device deployment (U. Lee et al., 2024). These measures should be integrated into implementation plans to ensure responsible use of VLMs in educational settings.

3.6. RQ6: Validation and Evaluation

The included studies verify the educational impacts of VLMs using various experiments and datasets. The detailed experimental setups are summarized in Table 11. As seen in Table 11, only 6 articles made their datasets (Edwards et al., 2025; Singh et al., 2023; Zheng et al., 2025) or code (Bhaskaran & Pardos, 2025; J. Lee et al., 2025; Stamatakis et al., 2025; Zheng et al., 2025) available for reproduction.

Table 11. Various performance evaluation datasets used in the reviewed articles.

Furthermore, we applied the Mixed Methods Appraisal Tool (MMAT) checklist (Hong et al., 2018) to evaluate the quality of each article, as described in Section 2.2. The distribution of quality rating of these articles is illustrated in Figure 3. A full list of MMAT ratings for all articles is provided in Supplementary Table S7, as shown in Supplementary Materials. As shown in Figure 3, most included studies were rated as high quality (35 articles), while five were rated as moderate quality. In terms of the study design, quantitative descriptive studies form the largest group (18 articles), followed by mixed-methods studies that combine quantitative and qualitative methods (14 articles). The remaining studies are quantitative non-randomized (7 articles) and qualitative (2 articles).

Figure 3. The quality analysis of 42 articles using the MMAT checklist (Hong et al., 2018). (a) The distribution of the study design category. (b) The distribution of study quality.

4. Discussions

4.1. Future Research Directions

Based on the studied articles in this review paper, several key issues have emerged regarding the use of VLMs in education for future studies.

Firstly, VLMs might face challenges in tasks involving complex visual reasoning, such as interpreting graphs versus photographs, or multi-step diagrammatic reasoning (Bhaskaran & Pardos, 2025; Tapia-Mandiola & Araya, 2025). Therefore, it would be valuable to pinpoint the hard cases for VLMs in educational settings and to develop specialized models to address them. Addressing this issue requires new multimodal datasets that capture a wide range of educational visuals.

Secondly, it is essential to ensure the pedagogical quality of AI-generated content (questions and feedback) and align it with learning objectives (Kunuku & Dehbozorgi, 2025; Zhuang et al., 2025). As VLMs start generating questions and teaching materials, research must address how to evaluate and constrain these outputs. This includes maintaining appropriate difficulty levels and pedagogical soundness of AI-generated content.

Lastly, several practices were reported in integrating VLMs into classrooms to support instructors and learners. It is important to study human–AI interaction in educational settings: understanding how instructors’ roles might shift, and how students respond to AI-generated feedback (Mittal et al., 2025).

4.2. Limitations

This research area is evolving quickly; this review paper searched peer-reviewed articles published between 2020 and 2025. Therefore, some recent relevant studies may have been missed during retrieval and screening, which constrains the generalizability of our findings. In addition, our exclusion criteria in Table 4 may introduce selection bias. Firstly, non-peer-reviewed articles (e.g., preprints on arXiv.org) were excluded from this review paper; however, they might report state-of-the-art research works and trials. Secondly, limiting the scope to English-language publications omits work from non-English contexts, even though VLMs are often used in multilingual settings. These limitations point to clear opportunities for broader language coverage and validations.

5. Conclusions

This review synthesizes recent research on the use of VLMs in formal educational applications. It maps where they are deployed, which solutions and platforms are used, how they are integrated into teaching workflows, the benefits reported for learners and instructors, the challenges and mitigation strategies identified, and the evaluation methods. The review provides a structured framework, consisting of disciplines, solution types, pedagogical roles, outcomes, risks, and methodological validation, which can guide educators in selecting and implementing VLM tools.

The findings suggest significant opportunities for VLMs in teaching and learning, while also highlighting practical challenges for their responsible adoption.

Opportunities. VLMs are shown in the reviewed articles to provide opportunities to transform teaching and learning practices by enabling more interactive and tailored educational environments, even for instructors and learners without computer science backgrounds. For learners, VLMs offer personalized support and interactive learning activities. Instructors benefit from VLMs through automatic content generation and analytics insights from educational data.
Challenges. It is important to understand the new challenges that are inherent in this emerging technology. Learners face the challenges of verifying the accuracy of VLM outputs and managing their potential biases, which could affect critical thinking and independent learning. On the other hand, instructors face the challenges of developing effective prompting strategies and ensuring that the generated content aligns with learning outcomes.

In the future, effective and ethical integration of VLMs in teaching and learning will require establishing computing infrastructure, setting performance evaluation benchmarks and efficient prompting practices, and developing a regulatory framework for responsible usage.

Supplementary Materials

The following supporting information and Supplementary Materials from this paper can be downloaded at https://www.mdpi.com/article/10.3390/educsci16010123/s1; Table S1: List of articles from Scopus; Table S2: List of articles from Engineering Village; Table S3: List of articles from Web of Science; Table S4: List of articles from IEEE Xplore; Table S5: List of articles from ACM Digital Library; Table S6: List of articles included in review; Table S7: List of MMAT ratings of all articles.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Dataset available on request from the author.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

The full list of articles and their associated descriptions is summarized in Table A1.

Table A1. A list of the 42 studied articles in this review paper.

Reference	Description
Abdelhadi et al. (2025)	A real-time classroom and examination system verifies identity, monitors behaviors, and recommends teacher actions.
Ambali Parambil et al. (2025)	It classifies student emotions from online-classroom images.
Anderer et al. (2024)	It analyzes lecture slides for visually impaired access by converting visual elements to multilayer textual descriptions.
Anderer et al. (2025)	It provides a lecture assistant to support accessible lecture video navigation and visual question answering.
Asseri et al. (2025)	It applies VLMs on emotion recognition for Arabic children’s storybook images under multiple prompting schemes.
Bhaskaran and Pardos (2025)	It compares OCR- and VLM-based pipelines to extract structured course and grade information from heterogeneous student transcript documents.
Bossema et al. (2025)	A course with older adults to explore human–robot co-creativity with a multimodal LLM.
Busic et al. (2024)	It applies multimodal LLMs to automate grading in GUI design courses.
Cao et al. (2025)	A classroom-video analytics system that integrates LLM prompting to identify teaching behaviors and produce interpretable feedback.
J. Chen et al. (2024)	A VLM-based approach generates primary-school multiple-choice questions from cartoon images by combining image captioning with text generation.
Dang et al. (2025)	It studies learner–agent interaction patterns with an embodied AI assistant in mixed reality.
Edwards et al. (2025)	A statistical procedure to evaluate whether VLMs grade engineering sketches equivalently to human experts.
Fahmi and Bousmah (2025)	An AI virtual teacher assistant for student diagnosis and remediation.
Han et al. (2025)	A tangible storytelling pipeline where multimodal LLMs generate paper-cut style assets and support the crafting of stage-based narratives.
Hang and Man Ho (2025)	It applies multimodal LLM workflows to generate personalized vocabulary flashcards for early childhood education.
Ibanez et al. (2025)	It applies multimodal LLMs as graders of UML class diagrams.
Kunuku and Dehbozorgi (2025)	A multimodal framework creates quiz questions from video lectures by fusing text, visuals, and audio.
U. Lee et al. (2024)	It develops a personalized art-appreciation tutor, including the creation of a GPT-generated dialogue dataset and benchmarking.
G. G. Lee and Zhai (2025)	It provides educational visual question answering for researchers to query and analyze image data through textual prompts.
J. Lee et al. (2025)	A multimodal tutoring system offers step-by-step textual and visual guidance by generating diagrams through code-assisted reasoning.
Marquez-Carpintero et al. (2025)	A two-phase method leverages VLMs to estimate student attention and emotions from classroom imagery for STEM education.
Mittal et al. (2025)	A VLM-enabled system provides real-time, context-aware answers to student questions during live lectures using live content and retrieved course materials.
Nguyen and Hayward (2025)	It uses a multimodal LLM to annotate K-12 science assessments and suggest revisions.
Nguyen and Park (2025)	It applies multimodal LLMs for scoring and feedback on multimodal science assessments.
Pang et al. (2026)	A tuned VLM to recognize student engagement cues in still images.
Picard et al. (2025)	It evaluates VLMs across engineering design tasks and provides benchmark datasets for continued assessment.
Rahmanian et al. (2025)	It applies VLMs to assess student ER diagrams under different input contexts and prompting strategies.
Sheng et al. (2025)	It combines digital-pen traces with a multimodal LLM to reconstruct students’ step-by-step reasoning chains.
Shu et al. (2025)	It extracts learning outcomes from lecture notes and uses a multimodal LLM to generate multiple-choice questions with solutions and explanations.
Singh et al. (2023)	A VLM-driven pipeline retrieves and assigns web images to e-textbooks via a text–image matching optimization.
Stamatakis et al. (2025)	It applies VLMs to generate learning-oriented questions from educational videos.
Su et al. (2025)	An intelligent tutoring system for learning Chinese characters that leverages a multimodal LLM to deliver corrective feedback.
Tan et al. (2025)	A picture-guided conversational chatbot for early childhood language learning.
Tapia-Mandiola and Araya (2025)	A two-step approach segments key regions in students’ coloring-task images and then applies a VLM to analyze the cropped sections for automated grading support.
Teotia et al. (2024)	It evaluates VLMs on classroom learning-engagement detection using behavior and emotion datasets.
Tschope et al. (2025)	It applies VLMs for recognizing activities in nursing-procedure training videos.
Y. Wang et al. (2025)	A modular system generates lecture scripts from multimodal slide inputs using instruction-guided VLM workflows.
X. Wang et al. (2025)	It uses a multimodal LLM to perform automated essay scoring.
X. Wang et al. (2025)	A wearable system uses computer vision, VLMs, and speech technologies to equip everyday objects with conversational personas for interactive guidance.
Xie et al. (2025)	It studies prompt-engineering strategies for automated K-12 exam grading with a VLM across six question types and proposes an evaluation framework for grading behavior.
Zheng et al. (2025)	It conducts art-evaluation dialogues with multimodal LLMs for teacher support.
Zhuang et al. (2025)	It scores picture-cued student writing against images and provides feedback for middle-school language learning.

References

Abdelhadi, Z., Naseif, M., Alhejali, W., & Elhayek, A. (2025, January 15–16). TeacherEye: An AI-powered system for monitoring student engagement in online education. International Learning and Technology Conference (pp. 25–30), Jeddah, Saudi Arabia.
Agbo, F. J., Olivia, C., Oguibe, G., Sanusi, I. T., & Sani, G. (2025). Computing education using generative artificial intelligence tools: A systematic literature review. Computers and Education Open, 9, 100266.
Ali, D., Fatemi, Y., Boskabadi, E., Nikfar, M., Ugwuoke, J., & Ali, H. B. (2024). ChatGPT in teaching and learning: A systematic review. Education Sciences, 14(6), 643.
Ambali Parambil, M. M., Bouktif, S., Gochoo, M., & Alnajjar, F. S. K. (2025, April 22–25). Comparing emotion detection methods in online classrooms: YOLO models, multimodal LLM, and human baseline. IEEE Global Engineering Education Conference (pp. 1–7), London, UK.
Anderer, K., Muller, K. E., Strobel, L., Wölfel, M., Niehues, J. M., & Gerling, K. M. (2025, October 26–29). Making lecture videos accessible for students who are blind or have low vision through AI-assisted navigation and visual question answering. International ACM SIGACCESS Conference on Computers and Accessibility, Denver, CO, USA.
Anderer, K., Wölfel, M., & Niehues, J. M. (2024, September 2–6). Identifying the information gap for visually impaired students during lecture talks. IEEE Symposium on Visual Languages and Human-Centric Computing (pp. 168–173), Liverpool, UK.
Anthropic. (2024). The Claude 3 model family: Opus, Sonnet, Haiku. Available online: https://www.anthropic.com/claude-3-model-card (accessed on 24 December 2025).
Asseri, B., Abaker, E., Al Mogren, M., Alhefdhi, T., & Al-Wabil, A. (2025). Deciphering emotions in children’s storybooks: A comparative analysis of multimodal LLMs in educational applications. AI, 6(9), 211.
Bhaskaran, M., & Pardos, Z. A. (2025, July 21–23). Automating academic transcript evaluation: A comparative study of OCR techniques for course and grade evaluation. ACM Conference on Learning and Scale (pp. 366–370), Palermo, Italy.
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2022). On the opportunities and risks of foundation models. arXiv.
Bossema, M., Ben Allouch, S., Plaat, A., & Saunders, R. (2025, August 25–29). LLM-enhanced interactions in human-robot collaborative drawing with older adults. IEEE International Conference on Robot and Human Interactive Communication (pp. 700–707), Eindhoven, The Netherlands.
Busic, B., Leventic, H., Romic, K., & Habijan, M. (2024, September 16–18). Towards using multimodal LLMs as graders in a GUI design course. International Symposium ELMAR (pp. 97–100), Zadar, Croatia.
Cao, Y., Xiong, X., Shao, X., Chen, R., Hou, Y., Li, B., Zhao, P., & Guo, K. (2025, February 21–23). Research on teaching video monitoring platform based on large language model prompt engineering. International Conference on Computer Science, Engineering, and Education (pp. 118–125), Nanjing, China.
Chen, J., Atmosukarto, I., & Bin Abbas, M. F. (2024, December 1–4). Image question-distractors generation as a conversational model. IEEE Region 10 Annual International Conference (pp. 39–42), Singapore.
Chen, L., Chen, P., & Lin, Z. (2020). Artificial Intelligence in education: A review. IEEE Access, 8, 75264–75278.
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., Marris, L., Petulla, S., Gaffney, C., Aharoni, A., Lintz, N., Pais, T. C., Jacobsson, H., Szpektor, I., Jiang, N.-J., … Helmholz, W. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. Available online: https://arxiv.org/abs/2507.06261 (accessed on 24 December 2025).
Dang, B., Huynh, L., Gul, F., Rose, C. P., Jarvela, S. M., & Nguyen, A. (2025). Human–AI collaborative learning in mixed reality: Examining the cognitive and socio-emotional interactions. British Journal of Educational Technology, 56(5), 2078–2101.
Danish, S., Sadeghi-Niaraki, A., Khan, S. U., Dang, L. M., Tightiz, L., & Moon, H. (2026). A comprehensive survey of Vision–Language Models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets. Information Fusion, 126, 103623.
Edwards, K. M., Tehranchi, F., Miller, S. R., & Ahmed, F. (2025, August 17–20). AI judges in design: Statistical perspectives on achieving human expert equivalence with vision-language models. International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Anaheim, CA, USA.
Fahmi, Y., & Bousmah, M. (2025, October 4–10). Designing an educational virtual assistant based on agentic AI for student diagnosis and individualized remediation. IEEE Congress on Information Science and Technology (pp. 406–415), Marrakech, Morocco.
Han, K., Tang, K., & Wang, M. (2025, March 4–7). Stage wizard: Enhancing tangible storytelling with multimodal LLMs. International Conference on Tangible, Embedded, and Embodied Interaction (pp. 1–13), Bordeaux/Talence, France.
Hang, C. N., & Man Ho, S. (2025, March 15). Personalized vocabulary learning through images: Harnessing multimodal large language models for early childhood education. IEEE Integrated STEM Education Conference (pp. 1–7), Princeton, NJ, USA.
Hong, Q. N., Pluye, P., Fabregues, S., Bartlett, G., Boardman, F., Cargo, M., Dagenais, P., Gagnon, M., Griffiths, F., Nicolau, B., O’Cathain, A., Rousseau, M., & Vedel, I. (2018). Mixed methods appraisal tool (MMAT), version 2018. Available online: https://medschool.cuanschutz.edu/docs/librariesprovider94/di-docs/methods-%28design%29-docx/mmat_2018_criteria-manual_2018-08-01_eng.pdf (accessed on 24 December 2025).
Ibanez, M. B., Barron-Estrada, M. L., & Zatarain-Cabada, R. (2025). Can multimodal large language models grade like an expert? A study on UML class diagram assessment accuracy. Computer Applications in Engineering Education, 33(5), e70080.
Kostopoulos, G., Vasileios, G., Rigou, M., & Kotsiantis, S. B. (2025). Agentic AI in education: State of the art and future directions. IEEE Access, 13, 177467–177491.
Kunuku, M. T., & Dehbozorgi, N. (2025, July 22–26). Exploring multimodal quiz generation and evaluation aligned with higher-order learning objectives in Bloom’s taxonomy. International Conference on Artificial Intelligence in Education (pp. 433–438), Palermo, Italy.
Küchemann, S., Avila, K. E., Dinc, Y., Hortmann, C., Revenga, N., Ruf, V., Stausberg, N., Steinert, S., Fischer, F., Fischer, M. R., Kasneci, E., Kasneci, G., Kuhr, T., Kutyniok, G., Malone, S., Sailer, M., Schmidt, A., Stadler, M. J., Weller, J., & Kuhn, J. (2025). On opportunities and challenges of large multimodal foundation models in education. npj Science of Learning, 10(1), 11.
Lee, G., Shi, L., Latif, E., Gao, Y., Bewersdorff, A., Nyaaba, M., Guo, S., Liu, Z., Mai, G., Liu, T., & Zhai, X. (2025). Multimodality of AI for education: Toward artificial general intelligence. IEEE Transactions on Learning Technologies, 18, 666–683.
Lee, G. G., & Zhai, X. (2025). Realizing visual question answering for education: GPT-4V as a multimodal AI. TechTrends, 69(2), 271–287.
Lee, H. Y., Huang, Y. M., & Wu, T. T. (2025). ChatGPT in education: A systematic review of current landscape, limitations and future directions through general system theory lens. European Journal of Education, 60(4), e70262.
Lee, J., Chen, S. S., & Liang, P. P. (2025, April 26–May 1). Interactive sketchpad: A multimodal tutoring system for collaborative, visual problem-solving. CHI Conference on Human Factors in Computing Systems (pp. 1–14), Yokohama, Japan.
Lee, U., Jeon, M., Lee, Y., Byun, G., Son, Y., Shin, J., Ko, H., & Kim, H. (2024). LLaVA-docent: Instruction tuning with multimodal large language model to support art appreciation education. Computers and Education: Artificial Intelligence, 7, 100297.
Liu, H., Li, C., Li, Y., & Lee, Y.-J. (2024, June 16–22). Improved baselines with visual instruction tuning. IEEE Conference on Computer Vision and Pattern Recognition (pp. 26286–26296), Seattle, WA, USA.
Marquez-Carpintero, L., Viejo, D., & Cazorla, M. (2025). Enhancing engineering and STEM education with vision and multimodal large language models to predict student attention. IEEE Access, 13, 114681–114695.
Mittal, M., Tyagi, G., Bailey, A., Ranade, G. V., & Norouzi, N. (2025, July 22–26). Askademia: A real-time AI system for automatic responses to student questions. International Conference on Artificial Intelligence in Education (pp. 105–118), Palermo, Italy.
Ng, D. T. K., Chan, E. K. C., & Lo, C. K. (2025). Opportunities, challenges and school strategies for integrating generative AI in education. Computers and Education: Artificial Intelligence, 8, 100373.
Nguyen, H., & Hayward, J. (2025). Applying generative artificial intelligence to critiquing science assessments. Journal of Science Education and Technology, 34(1), 199–214.
Nguyen, H., & Park, S. (2025, March 3–7). Providing automated feedback on formative science assessments: Uses of multimodal large language models. International Conference on Learning Analytics and Knowledge (pp. 803–809), Dublin, Ireland.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Systematic Reviews, 10, 89.
Paivio, A. (2013). Imagery and verbal processes. Taylor & Francis.
Pang, L., Siu, T., Alazzawe, A., Kant, K., & Latecki, L. J. (2026, September 22–25). Generalizable detection of student engagement in online learning environments. International Conference on Computer Analysis of Images and Patterns (pp. 207–219), Las Palmas de Gran Canaria, Spain.
Picard, C., Edwards, K. M., Doris, A. C., Man, B., Giannone, G., Alam, M. F., & Ahmed, F. (2025). From concept to manufacturing: Evaluating vision-language models for engineering design. Artificial Intelligence Review, 58(9), 288.
Radford, A., Kim, J.-W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021, July 18–24). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (pp. 8748–8763), Virtual. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 25 November 2025).
Rahmanian, M., Sami, A., & Yu, Y. (2025). Challenges and feasibility of multimodal LLMs in ER diagram evaluation. Cogent Education, 12(1), 2590901.
Raihan, M. N., Siddiq, M. L., Santos, J. C., & Zampieri, M. (2025, February 26–March 1). Large language models in computer science education: A systematic literature review. ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 938–944), Pittsburgh, PA, USA.
Sheng, Z., Shen, S., Shen, L., Duan, Q., Tang, N., Hui, P., Qu, H., & Luo, Y. (2025, July 22–26). Automatic modeling and analysis of students’ problem-solving handwriting trajectories. International Conference on Artificial Intelligence in Education (pp. 221–235), Palermo, Italy.
Shinde, G., Ravi, A., Dey, E., Sakib, S., Rampure, M., & Roy, N. (2025). A survey on efficient vision-language models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 15(3), e70036.
Shu, C., Yao, N., Chen, Y., Wijeratne, V., Ma, L., Loo, J. K. K., Chai, K. K., Alam, A. S., & Abuelmaatti, A. (2025, April 22–25). AI-assisted multiple-choice questions generation with multimodal large language models in engineering higher education. IEEE Global Engineering Education Conference (pp. 1–9), London, UK.
Singh, J., Zouhar, V., & Sachan, M. (2023, December 6–10). Enhancing textbooks with visuals from the web for improved learning. International Conference on Empirical Methods in Natural Language Processing (pp. 11931–11944), Singapore.
Stamatakis, M., Berger, J., Wartena, C., Ewerth, R., & Hoppe, A. (2025, July 22–26). Enhancing the learning experience: Using vision-language models to generate questions for educational videos. International Conference on Artificial Intelligence in Education (pp. 305–319), Palermo, Italy.
Su, B., Chen, Q., Peng, J., Tan, W., & Wang, L. (2025, May 14–16). Enhancing Chinese character writing learning: The role of MLLM-based intelligent tutoring systems. International Conference on Artificial Intelligence and Education (pp. 194–200), Suzhou, China.
Tan, H., Gu, Y., Li, L., Leong, M. C., & Chen, N. F. (2025, October 13–17). Contextualized visual storytelling for conversational chatbot in education. International Conference on Multimodal Interaction (pp. 185–189), Canberra, Australia.
Tapia-Mandiola, S., & Araya, R. (2025, June 25–27). From palette to reasoning: Improving LLM’s visual recognition capabilities in children’s coloring tasks. International Conference in Methodologies and Intelligent Systems for Technology Enhanced Learning (pp. 172–183), Lille, France.
Teotia, J., Zhang, X., Mao, R., & Cambria, E. (2024, December 9). Evaluating vision language models in detecting learning engagement. IEEE International Conference on Data Mining (pp. 496–502), Abu Dhabi, United Arab Emirates.
Tian, J. (2025). Integrating artificial intelligence into the cybersecurity curriculum in higher education: A systematic literature review. Education Sciences, 15(11), 1540.
Tschope, M., Fritsch, S. G., Fortes Rey, V., Nandurkar, N. N., Trevenna, S., Monger, E. J., & Lukowicz, P. (2025, April 21–25). NEEDLE: Nurse education enhanced by vision-based deep learning evaluation. International Conference on Activity and Behavior Computing (pp. 1–10), Al Ain, United Arab Emirates.
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., & Lin, J. (2024). Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. Available online: https://arxiv.org/abs/2409.12191 (accessed on 25 November 2025).
Wang, P., Jing, Y., & Shen, S. (2025). A systematic literature review on the application of generative artificial intelligence (GAI) in teaching within higher education: Instructional contexts, process, and strategies. The Internet and Higher Education, 65, 100996.
Wang, X., Pang, C. C., & Hui, P. (2025, September 28–October 1). Talking spell: A wearable system enabling real-time anthropomorphic voice interaction with everyday objects. ACM Symposium on User Interface Software and Technology, Busan, Republic of Korea.
Wang, X., Yu, R., Zhang, Y., & Xu, Y. (2025, November 1–2). English composition image automatic scoring based on multi-modal large language models. International Conference on Artificial Intelligence and Future Education (pp. 247–254), Shanghai, China.
Wang, Y., Yu, J., Zhang-Li, D., Lim, J. J. Y., Tu, S., Li, H., Liu, Z., Liu, H., Hou, L., Li, J., & Xu, B. (2025, November 10–14). EduCraft: A system for generating pedagogical lecture scripts from long-context multimodal presentations. ACM International Conference on Information and Knowledge Management (pp. 6153–6160), Seoul, Republic of Korea.
Xie, T., Wang, X., & Li, J. (2025, July 11–13). A study on prompt engineering for K12 exam paper correction using Qwen2.5-VL-72B-Instruct. International Conference on Educational Knowledge and Informatization (pp. 138–142), Chongqing, China.
Xing, W., Zhu, T., Wang, J., & Liu, B. (2024). A survey on MLLMs in education: Application and future directions. Future Internet, 16(12), 467.
Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., & Wang, L. (2023). The dawn of LMMs: Preliminary explorations with GPT-4V(ision). Available online: https://arxiv.org/abs/2309.17421 (accessed on 25 November 2025).
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education—Where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 39.
Zhan, Z., Tong, Y., Lan, X., & Zhong, B. (2024). A systematic literature review of game-based learning in Artificial Intelligence education. Interactive Learning Environments, 32(3), 1137–1158.
Zhang, J., Huang, J., Jin, S., & Lu, S. (2024). Vision-Language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8), 5625–5644.
Zheng, C., Yu, Z., Jiang, Y., Zhang, M., Lu, X., Jin, J., & Gao, L. (2025, April 26–May 1). ArtMentor: AI-assisted evaluation of artworks to explore multimodal large language models capabilities. CHI Conference on Human Factors in Computing Systems, Yokohama, Japan.
Zhou, K., Liu, Z., & Gao, P. (2025). Large vision-language models: Pre-training, prompting, and applications. Springer Nature.
Zhuang, Y., Zhao, R., Xie, Z., & Yu, P. L. H. (2025). Enhancing language learning through generative AI feedback on picture-cued writing tasks. Computers and Education: Artificial Intelligence, 9, 100450.

Figure 1. The PRISMA flow diagram used in this paper.

Figure 2. VLM solutions and configurations in the included studies (RQ2). (a) Summary of VLM solutions used (models appearing once grouped as Others; studies may use multiple solutions). (b) Distribution of VLM configurations, including pre-trained and tuned.

Figure 3. The quality analysis of 42 articles using the MMAT checklist (Hong et al., 2018). (a) The distribution of the study design category. (b) The distribution of study quality.

Table 1. A summary of typical VLMs that are used in teaching and learning. The symbol − means that the information is undisclosed in the article.

Table 2. Comparison of this paper with existing review papers.