Article

Retrieval-Augmented Vision–Language Agents for Child-Centered Encyclopedia Learning

Jing Du, Wenhao Liu, Jingyi Ye, Dibin Zhou and Fuchang Liu
1 Department of Media & Communication, Kangwon National University, Chuncheon 24341, Republic of Korea
2 School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
3 Center for Engineering and Scientific Computation, Zhejiang University, Hangzhou 310058, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10821; https://doi.org/10.3390/app151910821
Submission received: 14 September 2025 / Revised: 6 October 2025 / Accepted: 7 October 2025 / Published: 9 October 2025
(This article belongs to the Special Issue Applications of Digital Technology and AI in Educational Settings)

Abstract

This study introduces an Encyclopedic Agent for children’s learning that integrates multimodal retrieval with retrieval-augmented generation (RAG). To support this framework, we construct a dataset of 9524 Wikipedia pages covering 935 encyclopedia topics, each converted into images with associated topical queries and explanations. Based on this dataset, we fine-tune SigLIP, a vision–language retrieval model, using LoRA adaptation on 8484 training pairs, with 1040 reserved for testing. Experimental results show that the fine-tuned SigLIP significantly outperforms baseline models such as ColPali in both accuracy and latency, enabling efficient and precise document-image retrieval. Combined with GPT-5 for response generation, the Encyclopedic Agent delivers illustrated, interactive Q&A that is more accessible and engaging for children compared to traditional text-only methods. These findings highlight the feasibility of applying multimodal retrieval and RAG to educational agents, offering new possibilities for personalized, child-centered learning in domains such as science, history, and the arts.

1. Introduction

In recent years, the rapid advancement of digital technology has profoundly transformed how children access and engage with educational resources. Online encyclopedias, interactive e-books, and multimedia platforms now provide abundant opportunities for self-directed learning. Yet, despite this wealth of content, children often struggle to comprehend abstract or domain-specific concepts when presented in predominantly text-based formats. For young learners in particular, traditional encyclopedia materials may lack the interactivity, personalization, and multimodal scaffolding required to foster deep conceptual understanding and sustained engagement.
The integration of Artificial Intelligence (AI) into education has become a transformative force, driven by the rapid progress of multimodal large language models (MLLMs), such as GPT-4V [1] and Gemini [2]. These advances highlight the promise of multimodality in combining textual, visual, and interactive inputs to enrich the teaching–learning process. Since children naturally engage with information through multiple channels—reading, observing, listening, and hands-on exploration [3]—educational technologies must support this diversity to foster deeper comprehension. In parallel, AI-powered educational agents [4,5] have emerged as promising tools for supporting learning, as demonstrated by research on intelligent tutoring systems, conversational agents, and adaptive learning platforms. Such agents personalize content delivery, monitor learner progress, and provide real-time feedback. The advent of MLLMs further expands these possibilities: by processing and reasoning over both textual and visual information, they can function not only as information retrievers but also as explainers, visualizers, and interactive tutors. This aligns closely with constructivist theories of learning, which emphasize active engagement, dialogue, and multimodal support in the development of critical thinking [6].
Building on these developments, this study introduces a multimodal encyclopedia-style AI agent designed specifically for children’s education. The agent leverages richly illustrated materials and retrieval-augmented generation to transform static encyclopedia entries into interactive Q&A experiences, as depicted in Figure 1. Beyond answering queries, the agent supports adaptive explanation, visual augmentation, and follow-up questioning, bridging the gap between static digital resources and dynamic, child-centered learning experiences. We release models, data, code and benchmarks under open licenses at https://huggingface.co/dj86/siglip-ft-enpedia, accessed on 14 September 2025.
Our contributions are as follows:
  • We construct a specialized children-oriented encyclopedia dataset that integrates multimodal content (webpage screenshots and queries), providing a valuable resource for evaluating retrieval-augmented generation in educational settings.
  • We fine-tune state-of-the-art vision–language retrieval models using this dataset, demonstrating significant improvements in retrieval accuracy and efficiency, which are essential for building reliable educational AI systems.
  • We design an Encyclopedia Agent that combines document retrieval, RAG-based answer generation, and interactive multimodal explanation. This framework highlights the scalability of our approach and its potential applicability to diverse educational domains, such as science, history, and the arts.

2. Literature Review

In this section, we review the relevant literature surrounding the development of encyclopedia-style educational agents. The prior work highlights the growing role of retrieval-augmented systems and multimodal interaction in transforming learning experiences. Accordingly, we summarize the related studies from three perspectives: textual and multimodal retrieval methods, vision–language models (VLMs) for knowledge grounding, and agent-based educational systems that integrate these technologies to support child-centered learning.

2.1. Textual Retrieval Methods

Textual retrieval has traditionally relied on statistical methods based on word frequency, such as TF-IDF [7] and BM25 [8]. These approaches remain widely adopted due to their simplicity, interpretability, and computational efficiency, and they continue to serve as strong baselines in many retrieval benchmarks [9]. However, such methods are inherently limited in their ability to capture semantic similarity, as they rely heavily on surface-level lexical overlap.
Recent advances in neural embedding models, particularly those built on fine-tuned large language models (LLMs), have demonstrated state-of-the-art performance across a wide range of text embedding and retrieval tasks. These models map queries and documents into a continuous vector space, allowing for semantic similarity to be computed efficiently using vector distance metrics. In the bi-encoder paradigm [10,11,12], documents are independently encoded offline into dense embeddings, while queries are embedded at the inference time and matched to documents through fast similarity search operations, such as cosine similarity or inner product.
Beyond bi-encoder architectures, late interaction models, such as ColBERT [13], introduce a more fine-grained matching mechanism. Instead of compressing a document into a single dense vector, ColBERT retains token-level embeddings and performs late interaction during retrieval. This design enables richer semantic alignment between query tokens and document tokens, balancing efficiency with retrieval accuracy. Such developments highlight the trajectory from frequency-based statistical approaches to embedding-based and interaction-aware neural retrieval methods, which underpin modern retrieval-augmented generation systems.
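To make the contrast concrete, the following sketch (illustrative PyTorch code; the tensor shapes and function names are ours, not taken from any specific library) shows single-vector bi-encoder scoring next to ColBERT-style late-interaction (MaxSim) scoring:

```python
import torch
import torch.nn.functional as F

def bi_encoder_score(query_vec: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """Single-vector scoring: cosine similarity between one query embedding [d]
    and a matrix of document embeddings [N, d]."""
    q = F.normalize(query_vec, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    return d @ q                          # [N] similarity scores

def late_interaction_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, take its best-matching document
    token and sum the maxima. query_tokens: [Tq, d], doc_tokens: [Td, d]."""
    q = F.normalize(query_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                         # [Tq, Td] token-level similarities
    return sim.max(dim=1).values.sum()    # scalar relevance score
```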

2.2. Vision–Language Models

Recent years have witnessed remarkable progress in language modeling, with LLMs such as LLaMA and ChatGPT achieving strong performance across a wide variety of tasks. While initially limited to text-only inputs, these models have increasingly been extended with visual modalities, giving rise to VLMs. By bridging vision and language, VLMs enable a range of applications in multimodal reasoning, retrieval, and content generation, thereby playing a central role in the ongoing AI-driven technological transformation.
VLMs have been developed under several major training paradigms. One widely adopted strategy is contrastive learning [14,15], in which the model learns to align paired image–text representations by pulling positive examples closer in the embedding space while pushing away negative examples. Another common paradigm is masking [16,17], where either parts of an image or tokens in a caption are masked, and the model is trained to reconstruct the missing elements using the unmasked modality as context. Beyond these partially reconstructive approaches, generative VLMs [18,19,20] are trained to produce full captions or even entire images, offering high flexibility but typically requiring significantly more computational resources.
In addition to these paradigms, many state-of-the-art VLMs adopt a pretrained backbone [21,22] strategy, where a large-scale image encoder is aligned with an open-source LLM to build multimodal capabilities. Representative models in this category include BLIP-2 [23], Qwen-VL, and Qwen-VL-Chat [24]. Such designs leverage the strengths of existing unimodal models while enabling efficient multimodal adaptation, laying the foundation for the development of more specialized and application-oriented agents. Bridging vision and language remains an active area of research, with approaches ranging from contrastive learning to fully generative training. Despite their effectiveness, these methods typically demand substantial computational resources and large-scale datasets, which limits accessibility for many researchers. As a result, a common strategy is to rely on pretrained language models or image encoders, focusing the training process on learning an efficient mapping between the two modalities.

2.3. LLM-Based Learning Systems and Agents

LLMs have recently shown extraordinary progress, exhibiting reasoning, planning, and decision-making abilities that approximate human-level intelligence. These advances have fueled interest in building autonomous agents that not only process information but also interact with their environment and respond adaptively [25,26]. Extending beyond single-agent systems, researchers have begun to explore multi-agent frameworks where multiple LLM-powered agents collaborate, each equipped with specialized roles and skills. Such systems are able to generate richer and more dynamic behaviors by enabling inter-agent communication and coordination, thereby offering a more realistic approximation of complex real-world scenarios. Early work [27] has demonstrated the potential of multi-agent LLM systems in diverse application domains, including robotics, policy modeling, software engineering, large-scale simulations, and human behavior emulation. Prominent examples include Generative Agents [28], Ghost in the Minecraft [29], and GPT-Bargaining [30], each showcasing different aspects of emergent behavior and interactive intelligence.
With these developments, methodological debates have intensified regarding how best to leverage LLMs for collective intelligence and learning [31,32]. A particularly promising line of inquiry concerns the use of multi-agent models in education. Simulation-based classroom environments, where multiple AI agents interact with students, have demonstrated the capacity to mimic authentic teaching and learning dynamics while enriching student engagement [33]. This represents a novel paradigm in the digital education landscape, where the benefits of AI extend beyond static knowledge delivery toward fostering more interactive, dialogic experiences.
In the broader educational context, scholars have investigated multiple approaches to integrating LLMs into learning. For instance, Huber et al. [34] explored game-based learning enhanced by conversational AI, while other works [35,36] proposed personalized tutoring systems grounded in LLM prompting. Yet, despite these advances, concerns persist about over-reliance on AI systems and their potential to weaken student initiative, critical reasoning, and knowledge retention [37,38,39]. Recent analyses point to challenges such as the spread of misinformation, decreased creativity, and diminished independent problem-solving, raising fundamental questions about the long-term role of AI in education. In parallel, issues of assessment validity and reliability in online learning environments continue to demand careful scrutiny [40].
Equally pressing are the ethical and social considerations that accompany AI integration. The adoption of tools like ChatGPT in classrooms highlights unresolved tensions regarding privacy, bias, academic integrity, and the potential for technology overuse [41]. While LLM-powered educational systems show great promise, achieving a balanced and responsible deployment requires continued research into personalization, adaptability, and interactive features—such as debate-driven dialogue or quiz-based tutoring—that can cultivate higher-order thinking skills. Future work must therefore address not only technological innovation but also pedagogical design and ethical safeguards to ensure that AI-enhanced learning serves both cognitive and humanistic educational goals.

3. EncAgent

3.1. Construction of Encyclopedia Dataset

We collected 9524 Wikipedia pages covering 935 commonly encountered topics in children’s encyclopedias—such as celestial bodies, animals, history, art, archaeology, and everyday objects—with reference to the DK Encyclopedia [42]. Each page was rendered into an image of 595 × 842 pixels and annotated with several attributes, including the corresponding topic keywords, a broad topical query with its explanation, and a specific detail query with its explanation. The dataset was partitioned into 8484 samples for training and 1040 samples for testing, as shown in Figure 2. To reduce the cost and subjectivity of manual annotation, we leveraged an LLM to automatically generate structured annotations. Specifically, for each image, the LLM was prompted to produce a broad topical query and a specific detail query, along with their corresponding explanations. These automatically generated queries and explanations served as high-quality annotations for the training set, enabling the model to learn both general topic understanding and fine-grained knowledge reasoning.
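The annotation step can be sketched as follows (a schematic illustration only: the client call, JSON schema, and model identifier are assumptions, while the system prompt is the one reported in Section 4.1):

```python
# Schematic annotation step: for each rendered page image, an LLM is prompted for
# a broad topical query and a specific detail query, each with an explanation.
import base64
import json
from openai import OpenAI

client = OpenAI()

def annotate_page(image_path: str, topic: str) -> dict:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-5",  # model identifier assumed
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are a helpful AI that generates encyclopedia-style "
                        "educational Q&A for children."},
            {"role": "user",
             "content": [
                 {"type": "text",
                  "text": f"This encyclopedia page covers '{topic}'. Return JSON with keys "
                          "broad_query, broad_explanation, detail_query, detail_explanation."},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
             ]},
        ],
    )
    return json.loads(response.choices[0].message.content)
```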

3.2. Fine-Tuning VLM and Evaluation Metrics

To adapt the VLM to our encyclopedia dataset, we apply parameter-efficient fine-tuning using the Low-Rank Adaptation (LoRA) technique. Specifically, we inject LoRA modules into the attention layers of the SigLIP backbone, targeting the query and value projection matrices. LoRA significantly reduces the number of trainable parameters while preserving the representational capacity of the original model, making it suitable for large-scale fine-tuning under limited computational resources.
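A minimal sketch of this adaptation with the HuggingFace peft library is given below; the base checkpoint name is an assumption, and the hyperparameter values mirror those reported in Section 4.1:

```python
from transformers import SiglipModel
from peft import LoraConfig, get_peft_model

# Load the SigLIP backbone and attach low-rank adapters to the attention
# projections, leaving the original weights frozen.
model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # query and value projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA parameters are trainable
```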
For evaluation, we employ standard information retrieval metrics to measure the alignment between model predictions and ground-truth relevance. Precision at rank $k$ ($P@k$) is defined as follows:
$$P@k = \frac{\#\{\text{relevant documents in top } k\}}{k}$$
The Average Precision (AP) for a query $q$ is computed as follows:
$$\mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{N} P@k \cdot \mathrm{rel}(k)$$
where $|R_q|$ is the total number of relevant documents for query $q$, and $\mathrm{rel}(k)$ is an indicator function denoting whether the $k$-th retrieved result is relevant. Finally, the overall performance across all $M$ queries is reported using the Mean Average Precision (MAP):
$$\mathrm{MAP} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{AP}(q_i)$$
These metrics provide a comprehensive evaluation of retrieval quality, balancing both precision and ranking order, thereby reflecting the model’s ability to retrieve semantically and visually relevant knowledge for children’s encyclopedia queries.
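These definitions translate directly into a short reference implementation (ours, for illustration):

```python
from typing import List, Set

def average_precision(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """AP for one query: average P@k over the ranks k where a relevant document
    appears, normalized by the number of relevant documents |R_q|."""
    if not relevant_ids:
        return 0.0
    hits, ap_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            ap_sum += hits / k            # P@k at this relevant rank
    return ap_sum / len(relevant_ids)

def mean_average_precision(all_rankings, all_relevant) -> float:
    """MAP: mean of per-query AP over all queries."""
    aps = [average_precision(r, rel) for r, rel in zip(all_rankings, all_relevant)]
    return sum(aps) / len(aps)

# Example: relevant pages retrieved at ranks 1 and 3
print(average_precision(["p1", "p7", "p3"], {"p1", "p3"}))  # (1/1 + 2/3) / 2 ≈ 0.833
```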

3.3. Retrieval and Chat

To enable encyclopedia-style question answering for children, we design a retrieval-augmented pipeline that combines image-based document retrieval with multimodal response generation. First, we fine-tune the SigLIP vision–language model on our Wikipedia-based dataset to align visual features of encyclopedia images with the semantic representations of user queries. This allows the model to retrieve the most semantically relevant image–text pairs from the collection based on both visual and linguistic cues.
Once the relevant images are retrieved, the system leverages a retrieval-augmented generation (RAG) mechanism to produce contextually grounded responses. Specifically, the retrieved images serve as visual evidence, and their associated textual knowledge is incorporated into the language model to generate explanatory answers. In this way, the Encyclopedia Agent not only provides factually accurate responses but also enhances interpretability by grounding answers in concrete visual content.
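The retrieval step of this pipeline can be sketched as follows (an outline under stated assumptions: the processor name follows the public SigLIP checkpoints, and whether the released fine-tuned weights load directly or as a LoRA adapter is assumed here):

```python
import torch
from transformers import SiglipProcessor, SiglipModel

processor = SiglipProcessor.from_pretrained("google/siglip-base-patch16-224")
retriever = SiglipModel.from_pretrained("dj86/siglip-ft-enpedia")  # released fine-tuned weights

@torch.no_grad()
def retrieve(query: str, page_embeddings: torch.Tensor, top_k: int = 1) -> list:
    """Embed the child's question and return indices of the most similar page images.

    page_embeddings: precomputed, L2-normalized image embeddings of shape [N, d],
    obtained offline with retriever.get_image_features().
    """
    inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    q = retriever.get_text_features(**inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = page_embeddings @ q.squeeze(0)      # [N] cosine similarities
    return scores.topk(top_k).indices.tolist()

# The retrieved page images (and their associated text) are then passed to GPT-5
# as visual evidence for retrieval-augmented answer generation.
```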
Compared with traditional encyclopedia reading, which requires children to independently process large amounts of text and lacks interactivity, our method transforms knowledge learning into a dialogic, image-supported process. The integration of multimodal retrieval and conversational generation offers a more engaging and accessible form of knowledge acquisition, allowing children to interact with the encyclopedia in a question–answer format that is both informative and entertaining.

4. Results

4.1. Experimental Settings

Baseline Models.    We evaluate our approach against two state-of-the-art document retrieval models: ColPali [43] and SigLIP [44]. Both models have recently achieved strong performance in large-scale document and multimodal retrieval tasks, making them suitable baselines for our encyclopedia-retrieval scenario. ColPali is a cross-modal retriever designed for dense document understanding, while SigLIP is a lightweight but highly effective vision–language model optimized for semantic alignment. We choose these two models as baselines because they represent the state of the art in document retrieval and provide complementary characteristics in terms of accuracy and efficiency.
Fine-tuning Strategy.   For model adaptation, we fine-tune SigLIP using the LoRA [45] method, injecting low-rank adaptation layers into the attention modules (q_proj and v_proj). For ColPali, given its relatively larger size, we employ QLoRA [46] with 4-bit quantization to reduce memory overhead while maintaining competitive performance. Both models are fine-tuned for a maximum of 30 epochs, and the best-performing checkpoints are selected based on the validation performance.
Dataset. Our training dataset consists of 8484 image–text pairs, each constructed from a Wikipedia encyclopedia webpage screenshot paired with its corresponding broad topical query generated by GPT-5 [47]. The test set contains 1040 image–text pairs in the same format. The prompt used in our experiments was “You are a helpful AI that generates encyclopedia-style educational Q&A for children.” The dataset thus provides aligned visual and textual knowledge representations that are well-suited for retrieval-augmented generation (RAG) in the encyclopedia domain.
Training Details. All experiments are conducted with 30 training epochs. The ColPali model is fine-tuned using QLoRA to handle its larger parameter size efficiently, whereas the SigLIP model is fine-tuned with LoRA due to its relatively smaller footprint. The hyperparameters (rank r = 8, α = 16, dropout = 0.05) follow standard settings for LoRA-based fine-tuning of vision–language models. All experiments were carried out on an NVIDIA L40 GPU (48 GB), with the learning rate set to 5 × 10⁻⁵.
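For completeness, the quantization and adapter configuration for the ColPali baseline can be sketched as follows (the NF4/bfloat16 quantization settings are common QLoRA defaults assumed here, not values reported in the paper):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization for the larger ColPali backbone (assumed QLoRA defaults).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA hyperparameters as reported: rank 8, alpha 16, dropout 0.05.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

# The quantized backbone is loaded with quantization_config=bnb_config in its
# from_pretrained() call, wrapped with get_peft_model(), and trained for up to
# 30 epochs at a learning rate of 5e-5 on a single NVIDIA L40 (48 GB).
```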

4.2. Performance Analysis

Table 1 reports the performance comparison between baseline models and our fine-tuned variants on the encyclopedia retrieval task. The “Vanilla ColPali” and “Vanilla SigLIP” rows correspond to the original model weights released on HuggingFace, without any task-specific adaptation. In contrast, “ColPali-our-8k” refers to ColPali fine-tuned on our Wikipedia-based dataset for 10 epochs, which was the best checkpoint within 30 epochs, while “SigLIP-our-8k” denotes SigLIP fine-tuned for 5 epochs, also selected as the best-performing checkpoint within 30 epochs.
The accuracy metric in Table 1 corresponds to MAP, which evaluates whether the retrieved document images contain ground-truth keywords consistent with the query keywords. Latency measures the average inference time of the retrieval process. As the results show, SigLIP consistently outperforms ColPali in both retrieval precision and inference efficiency. Vanilla ColPali achieves a MAP of only 80.60% at 0.10 s latency, whereas Vanilla SigLIP already reaches 86.12% MAP at a faster 0.06 s. After fine-tuning, the gap widens further: ColPali-our-8k reaches 86.05% MAP at 0.13 s, while SigLIP-our-8k attains the best overall result, 93.97% MAP at only 0.07 s.
These findings suggest that SigLIP is both more accurate and computationally efficient for encyclopedia-style retrieval tasks, making it the preferred backbone for our Encyclopedia Agent. Consequently, we adopt SigLIP for the retrieval component while relying on GPT-5 to generate conversational responses based on the retrieved visual knowledge.

4.3. Case Analysis

To further highlight the limitations of existing baselines and how our approach addresses these challenges, we present two case studies in Figure 3. On the left, large language models such as GPT-5 generate lengthy text-based explanations when answering questions related to children’s encyclopedic knowledge. While informative, such responses tend to be monotonous and insufficiently intuitive for young learners. In contrast, our Encyclopedia Agent first employs SigLIP to retrieve the most relevant document images from the training set. It then performs RAG grounded in the retrieved images, producing responses that are more targeted and context-specific. The final outputs combine text and images in a visually enriched format that closely resembles a children’s encyclopedia, while also introducing interactivity through agent–child question–answer exchanges, making the learning experience more engaging and enjoyable.
To further validate the effectiveness of the proposed method, 10 primary school students in Grades 1–2 (5 boys and 5 girls) were invited to take part in a survey using Likert-scale questions. The students first proposed encyclopedia topics they wanted to learn about; they were then assisted in operating the Encyclopedia Agent, the returned images and text were presented to them, and finally they rated the returned results.
Table 2 presents the statistical results. The Encyclopedia Agent’s score is close to 5, indicating that the students gave relatively positive evaluations, and the variance is below 1, suggesting little discrepancy among their subjective ratings. GPT-5, by contrast, provides only plain text and fails to engage children’s interest as effectively.

5. Discussion

Our proposed Encyclopedia Agent demonstrates promising potential in both methodology and application, yet several aspects merit further discussion.
Technical extensibility: Although the current study focuses on encyclopedia-style educational Q&A, the underlying technical pipeline is highly generalizable. With appropriate training datasets, this approach could be extended to a wide range of educational domains, including science, history, and art education for children. At present, the dataset employed is primarily sourced from Wikipedia due to intellectual property considerations. While Wikipedia provides a broad knowledge base, its text is not specifically tailored for children, often being lengthy and complex. Nevertheless, the agent’s technical framework remains feasible and can be further optimized by curating high-quality, child-oriented multimodal datasets.
Limitations of current LLMs: Most existing large language models are trained on general-purpose corpora, which makes them less suitable for specialized domains such as children’s education. While prompt engineering allows the models to role-play and generate more age-appropriate outputs, the responses are still limited in clarity and engagement. In contrast, multimodal responses that combine text with retrieved visual content offer children a more intuitive and interactive learning experience. Furthermore, hallucinations remain a well-documented issue in LLMs. By grounding the agent’s responses in retrieved documents through RAG, we can significantly improve factual accuracy, which is especially critical in educational settings.
Toward vertical educational agents: The development of domain-specific, or “vertical,” intelligent agents represents an important trend in AI. In the education sector, teaching requirements vary widely across subjects, domains of knowledge, target student groups, and learning environments. This diversity underscores the need for specialized agents capable of assisting teachers in both content delivery and personalized tutoring. Our Encyclopedia Agent provides an initial example of how a vertical AI system can be designed for children’s learning, laying the groundwork for future agents that support a broader range of educational contexts.
Scalability: Data and model scalability are both critical factors influencing application deployment. The proposed approach is highly scalable from both the data and model perspectives. First, the construction of the encyclopedia dataset can be easily expanded to additional domains such as science, art, and history by automatically collecting web pages from Wikipedia or other open educational resources. Although the current data annotation process involves limited manual verification, most steps—including webpage capture, keyword extraction, and prompt-based query generation—are automated, allowing for efficient large-scale data production. Second, since the retrieval component is based on SigLIP, a vision–language model that supports large-batch inference and efficient embedding computation, the system can be scaled to millions of image–text pairs without significant degradation in retrieval latency or accuracy. Moreover, the modular design of the agent allows for fine-tuning and deployment across different educational subjects with minimal modification, further demonstrating the scalability of the framework.

6. Conclusions

This study investigates multimodal knowledge learning for children, comparing the traditional “book-reading only” encyclopedia approach with methods incorporating VLMs, specifically “image-based RAG Q&A” and “multimodal agent-assisted explanation.” Traditional approaches, which rely on children reading encyclopedia texts independently, impose high literacy demands and provide little interactivity. In contrast, LLM-based Q&A methods rely predominantly on text, often overlooking multimodal retrieval information, and their responses without RAG tend to be less targeted and context-specific. The multimodal RAG-based method proposed in this study integrates visual and textual knowledge sources, offering more engaging, interactive, and contextually grounded explanations for children. Rather than relying primarily on role-playing strategies, our approach emphasizes RAG-based generation to support encyclopedia-style Q&A for children.
However, the proposed method primarily focuses on improving the richness and relevance of responses through multimodal retrieval and fine-tuning. While the system achieves basic personalization via prompt-based role adaptation, it does not yet incorporate deeper adaptive mechanisms that account for a child’s age, knowledge level, or learning style. Future work will aim to implement dynamic personalization strategies and broader data collection to better align the agent’s responses with individual learners’ educational needs.

Author Contributions

Conceptualization, F.L. and J.D.; methodology, J.D. and D.Z.; software, J.D.; validation, F.L.; formal analysis, W.L.; investigation, J.D.; resources, W.L.; data curation, J.Y.; writing—original draft preparation, F.L. and J.D.; writing—review and editing, F.L.; visualization, J.D.; supervision, F.L.; project administration, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The collected data are publicly available at the following link: https://huggingface.co/datasets/dj86/wiki_dataset, accessed on 14 September 2025.

Conflicts of Interest

The authors declare no conflicts of interest; the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  2. Google Gemini Team. Gemini: A Family of Highly Capable Multimodal Models. 2023. Available online: https://storage.googleapis.com/deepmindmedia/gemini/gemini_1_report.pdf (accessed on 14 September 2025).
  3. Mayer, R.; Sims, V. For whom is a picture worth a thousand words? Extensions of a dual-coding theory of multimedia learning. J. Educ. Psychol. 1994, 86, 389–401. [Google Scholar] [CrossRef]
  4. Lee, G.; Shi, L.; Latif, E.; Gao, Y.; Bewersdorff, A.; Nyaaba, M.; Guo, S.; Liu, Z.; Mai, G.; Liu, T.; et al. Multimodality of AI for Education: Toward Artificial General Intelligence. IEEE Trans. Learn. Technol. 2025, 18, 666–683. [Google Scholar] [CrossRef]
  5. Ji, H.; Qiu, S.; Xin, S.; Han, S.; Chen, Z.; Zhang, D.; Wang, H.; Yao, H. From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization. arXiv 2025, arXiv:2505.16832. [Google Scholar]
  6. Sharma, K.; Papamitsiou, Z.; Giannakos, M. Building pipelines for educational data using AI and multimodal analytics: A “grey-box” approach. Br. J. Educ. Technol. 2019, 50, 3004–3031. [Google Scholar] [CrossRef]
  7. Jones, K. A statistical interpretation of term specificity and its application in retrieval. In Document Retrieval Systems; Taylor Graham Publishing: London, UK, 1988; pp. 132–142. [Google Scholar]
  8. Robertson, S.; Walker, S.; Beaulieu, M. Experimentation as a way of life: Okapi at TREC. Inf. Process. Manag. 2000, 36, 95–108. [Google Scholar] [CrossRef]
  9. Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (ACL), Dubrovnik, Croatia, 2–6 May 2023; pp. 2014–2037. [Google Scholar]
  10. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
  11. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online Event, 16–20 November 2020; pp. 6769–6781. [Google Scholar]
  12. Wang, L.; Yang, N.; Huang, X.; Jiao, B.; Yang, L.; Jiang, D.; Majumder, R.; Wei, F. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv 2022, arXiv:2212.03533. [Google Scholar]
  13. Khattab, O.; Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), Virtual Event, 25–30 July 2020; pp. 39–48. [Google Scholar]
  14. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; Huang, F. A Tutorial on Energy-Based Learning. In Predicting Structured Data; Bakir, G., Hofman, T., Schölkopf, B., Smola, A., Taskar, B., Eds.; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  15. Radford, A.; Kim, J.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  16. Singh, A.; Hu, R.; Goswami, V.; Couairon, G.; Galuba, W.; Rohrbach, M.; Kiela, D. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 15638–15650. [Google Scholar]
  17. Kwon, G.; Cai, Z.; Ravichran, A.; Bas, E.; Bhotika, R.; Soatto, S. Masked vision and language modeling for multi-modal representation learning. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  18. Rombach, R.; Blattman, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  19. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, K.; Lopes, R.; Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  20. Yu, J.; Xu, Y.; Koh, J.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.; et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv 2022, arXiv:2206.10789. [Google Scholar]
  21. Tsimpoukelli, M.; Menick, J.; Cabi, S.; Eslami, S.; Vinyals, O.; Hill, F. Multimodal few-shot learning with frozen language models. Adv. Neural Inf. Process. Syst. 2021, 34, 200–212. [Google Scholar]
  22. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
  23. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 19730–19742. [Google Scholar]
  24. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  25. Wooldridge, M.; Jennings, N. Intelligent Agents: Theory and Practice. Knowl. Eng. Rev. 1995, 10, 115–152. [Google Scholar] [CrossRef]
  26. Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E. The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv 2023, arXiv:2309.07864. [Google Scholar] [CrossRef]
  27. Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.; Wiest, O.; Zhang, X. Large Language Model based Multi-Agents: A Survey of Progress and Challenges. arXiv 2024, arXiv:2402.01680. [Google Scholar] [CrossRef]
  28. Park, J.; O’Brien, J.; Cai, C.; Morris, M.; Liang, P.; Bernstein, M. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium On User Interface Software And Technology (UIST ’23), San Francisco, CA, USA, 29 October–1 November 2023. [Google Scholar]
  29. Zhu, X.; Chen, Y.; Tian, H.; Tao, C.; Su, W.; Yang, C.; Huang, G.; Li, B.; Lu, L.; Wang, X.; et al. Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory. arXiv 2023, arXiv:2305.17144. [Google Scholar]
  30. Fu, Y.; Peng, H.; Khot, T.; Lapata, M. Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback. arXiv 2023, arXiv:2305.10142. [Google Scholar]
  31. Extance, A. ChatGPT has Entered the Classroom: How LLMs Could Transform Education. Nature 2023, 623, 474–477. [Google Scholar] [CrossRef]
  32. Yue, M.; Mifdal, W.; Zhang, Y.; Suh, J.; Yao, Z. MathVC: An LLM-Simulated Multi-Character Virtual Classroom for Mathematics Education. arXiv 2024, arXiv:2404.06711. [Google Scholar]
  33. Zhang, Z.; Zhang-li, D.; Yu, J.; Gong, L.; Zhou, J.; Liu, Z.; Hou, L.; Li, J. Simulating Classroom Education with LLM-Empowered Agents. arXiv 2024, arXiv:2406.19226. [Google Scholar] [CrossRef]
  34. Huber, S.; Kiili, K.; Nebel, S.; Ryan, R.; Sailer, M.; Ninaus, M. Leveraging the Potential of Large Language Models in Education Through Playful and Game-Based Learning. Educ. Psychol. Rev. 2024, 36, 1–20. [Google Scholar] [CrossRef]
  35. Baillifard, A.; Gabella, M.; Lavenex, P.; Martarelli, C. Effective Learning with a Personal AI Tutor: A Case Study. Educ. Inf. Technol. 2025, 30, 297–312. [Google Scholar] [CrossRef]
  36. Park, M.; Kim, S.; Lee, S.; Kwon, S.; Kim, K. Empowering Personalized Learning through a Conversation-based Tutoring System with Student Modeling. In Proceedings of the CHI EA ’24: Extended Abstracts of the CHI Conference on Human Factors In Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–10. [Google Scholar]
  37. Mohamed, A. Exploring the Potential of an AI-based Chatbot (ChatGPT) in Enhancing English as a Foreign Language (EFL) Teaching: Perceptions of EFL Faculty Members. Educ. Inf. Technol. 2024, 29, 3195–3217. [Google Scholar] [CrossRef]
  38. Tlili, A.; Shehata, B.; Adarkwah, M.; Bozkurt, A.; Hickey, D.; Huang, R.; Agyemang, B. What if the Devil Is My Guardian Angel: ChatGPT as a Case Study of Using Chatbots in Education. Smart Learn. Environ. 2023, 10, 1–24. [Google Scholar] [CrossRef]
  39. Zhang, S.; Zhao, X.; Zhou, T.; Kim, J. Do You Have AI Dependency? The Roles of Academic Self-efficacy, Academic Stress, and Performance Expectations on Problematic AI Usage Behavior. Int. J. Educ. Technol. High. Educ. 2024, 21, 2–14. [Google Scholar] [CrossRef]
  40. Siu, O.; Lui, K.; Huang, Y.; Ng, T.; Yeung, W. An Efficient, Reliable and Valid Assessment for Affective States during Online Learning. Sci. Rep. 2024, 14, 15768. [Google Scholar] [CrossRef]
  41. Amina, A.; Ahmad, A.; Raghad, A.; Mohamed, E.; Said, S. Ethical Implications of Using ChatGPT in Educational Environments: A Comprehensive Review. In Artificial Intelligence In Education: The Power And Dangers Of ChatGPT In The Classroom; Springer: Cham, Switzerland, 2024; pp. 185–199. [Google Scholar]
  42. Davey, L. DK Children’s Encyclopedia: The Book That Explains Everything; Dorling Kindersley: New York, NY, USA, 2017. [Google Scholar]
  43. Faysse, M.; Sibille, H.; Wu, T.; Omrani, B.; Viaud, G.; Hudelot, C.; Colombo, P. ColPali: Efficient Document Retrieval with Vision Language Models. In Proceedings of the 13th International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar]
  44. Zhai, X.; Mustafa, B.; Kolesnikov, A.; Beyer, L. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 11975–11986. [Google Scholar]
  45. Hu, E.; Shen, Y.; Wallis, P.; Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the 2022 International Conference on Learning Representations (ICLR), Virtually, 25–29 April 2022. [Google Scholar]
  46. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023, arXiv:2305.14314. [Google Scholar] [CrossRef]
  47. OpenAI. Introducing GPT-5. 2025. Available online: https://openai.com/index/introducing-gpt-5/ (accessed on 14 September 2025).
Figure 1. VLM-based Encyclopedic Agents. When a user submits an encyclopedia query, the proposed Encyclopedia Agent first employs a fine-tuned model to embed the input text into a feature vector and computes its similarity with the image feature vectors in the library, thereby retrieving the most similar images to return to the user. The returned images are then used as input for retrieval-augmented generation (RAG) and fed into GPT, which generates text to answer the user’s question. This workflow yields an encyclopedia answer that integrates both images and text.
Figure 2. Data curation pipeline. Encyclopedia pages were collected from Wikipedia using keywords corresponding to those in the DK encyclopedia and subsequently converted into images. To construct the annotated training set, we further employed a large language model to generate broad topical queries and specific detail queries, together with their associated explanations, based on the collected images; all generated content has undergone manual verification.
Figure 3. Baseline models versus our Encyclopedia Agent. These examples clearly illustrate that baseline models such as GPT-5 are not well suited for children, as their lengthy textual explanations lack vividness. In sharp contrast, the Encyclopedia Agent produces richly illustrated, multimodal explanations that are far more engaging and accessible.
Table 1. Performance of baseline models and fine-tuned models on the Encyclopedia Dataset.
Method | Latency (s) | MAP (%)
Vanilla ColPali | 0.10 | 80.60
ColPali-our-8k (+finetuning) | 0.13 | 86.05
Vanilla SigLIP | 0.06 | 86.12
SigLIP-our-8k (+finetuning) | 0.07 | 93.97
Table 2. Quantitative results of scores from elementary school students. 1—worst; 5—best.
Method | Score | δ
Encyclopedia Agent | 4.3 | 0.82
GPT-5 | 2.7 | 0.48

