Retrieval-Augmented Vision–Language Agents for Child-Centered Encyclopedia Learning
Abstract
1. Introduction
- We construct a specialized child-oriented encyclopedia dataset that integrates multimodal content (webpage screenshots paired with natural-language queries), providing a valuable resource for evaluating retrieval-augmented generation in educational settings.
- We fine-tune state-of-the-art vision–language retrieval models using this dataset, demonstrating significant improvements in retrieval accuracy and efficiency, which are essential for building reliable educational AI systems.
- We design an Encyclopedia Agent that combines document retrieval, RAG-based answer generation, and interactive multimodal explanation. This framework highlights the scalability of our approach and its potential applicability to diverse educational domains, such as science, history, and arts.
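The pipeline described in these contributions can be sketched as a minimal retrieve-then-answer loop. The `embed`, `retrieve`, and `build_prompt` helpers below are illustrative stand-ins (assumptions for this sketch), not the paper's actual SigLIP/ColPali encoders or answer generator; a real system would embed page screenshots with a fine-tuned vision–language retriever and pass the retrieved context to an LLM.

```python
import math

def embed(text: str, dim: int = 16) -> list[float]:
    """Toy deterministic text embedding via character hashing (illustrative only)."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, pages: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank encyclopedia pages by cosine similarity to the query embedding."""
    q = embed(query)
    scored = sorted(
        pages,
        key=lambda pid: -sum(a * b for a, b in zip(q, embed(pages[pid]))),
    )
    return scored[:top_k]

def build_prompt(query: str, pages: dict[str, str], hits: list[str]) -> str:
    """Assemble a grounded prompt for the answer-generation model."""
    context = "\n".join(pages[pid] for pid in hits)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer for a child:"

# Two toy "encyclopedia pages"; in the paper these are page screenshots.
pages = {
    "p1": "Volcanoes are mountains that can erupt hot melted rock called lava.",
    "p2": "Dolphins are friendly sea mammals that breathe air through a blowhole.",
}
hits = retrieve("Why do volcanoes erupt?", pages)
prompt = build_prompt("Why do volcanoes erupt?", pages, hits)
```

The design point this illustrates is the separation of concerns: retrieval narrows the corpus to a few candidate pages, and generation is conditioned only on that retrieved context, which is what keeps answers grounded in the encyclopedia rather than in the model's parametric knowledge.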
2. Literature Review
2.1. Textual Retrieval Methods
2.2. Vision–Language Models
2.3. LLM-Based Learning Systems and Agents
3. EncAgent
3.1. Construction of Encyclopedia Dataset
3.2. Fine-Tuning VLM and Evaluation Metrics
3.3. Retrieval and Chat
4. Results
4.1. Experimental Settings
4.2. Performance Analysis
4.3. Case Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Method | Latency (s) | MAP (%) |
|---|---|---|
| Vanilla ColPali | 0.10 | 80.60 |
| ColPali-our-8k (+finetuning) | 0.13 | 86.05 |
| Vanilla SigLIP | 0.06 | 86.12 |
| SigLIP-our-8k (+finetuning) | 0.07 | 93.97 |
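For reference, MAP figures like those in the table follow the standard mean-average-precision definition. A minimal sketch is below, assuming binary relevance with one relevance set per query (our reading of the usual evaluation protocol, not a detail stated in the table):

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """AP: mean of precision@k over the ranks k where a relevant page appears."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """MAP: average of per-query AP values."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Two toy queries: the first ranks its relevant page first (AP = 1.0),
# the second at rank 2 (AP = 0.5), so MAP = 0.75.
runs = [
    (["p1", "p2", "p3"], {"p1"}),
    (["p2", "p1", "p3"], {"p1"}),
]
print(mean_average_precision(runs))  # → 0.75
```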
| Method | Score |  |
|---|---|---|
| Encyclopedia Agent | 4.3 | 0.82 |
| GPT-5 | 2.7 | 0.48 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Du, J.; Liu, W.; Ye, J.; Zhou, D.; Liu, F. Retrieval-Augmented Vision–Language Agents for Child-Centered Encyclopedia Learning. Appl. Sci. 2025, 15, 10821. https://doi.org/10.3390/app151910821