Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes
Abstract
1. Introduction
1.1. Motivation
1.2. Research Gap
1.3. Purpose and Guiding Questions
- Question 1 (Educational Use & Impact): How are open-source LLMs being used across educational contexts, and what impacts on teaching practices and student learning are reported?
- Question 2 (Learning Outcomes & Evidence): What kinds of learning and perception outcomes have been evaluated in studies using open-source LLMs, and what evidence is reported about their effectiveness for learning?
- Question 3 (Human–AI Collaboration): What roles do teachers play in human–AI collaboration with open-source LLMs, and how do these roles shape the design and outcomes of learning activities?
2. Background
2.1. Prior Uses of AI/LLMs in Education
2.2. Open-Source vs. Closed-Source LLMs
- Licensing. “Open” releases (often open weights under community/custom licenses) allow local use and some adaptation, yet may restrict activities such as using model outputs for retraining or competitive purposes. Closed models rely on proprietary terms and API access, with capabilities and uses controlled by providers.
- Transparency. Open releases include model weights, inference code, and model cards, enabling subgroup auditing and error analysis; closed systems offer only limited documentation such as “system cards,” with restricted access to data or training recipes (Mitchell et al., 2019).
- Adaptability. Open weights support parameter-efficient finetuning (Hu et al., 2022) and pedagogical integration, whereas closed APIs allow only prompt-level adjustments with no deep modification.
- Deployment control. Open models can run on-premises or in sovereign clouds for privacy/compliance and version pinning; closed providers simplify operations and offer limited data-residency options, but usage remains bound by provider policy and lifecycle changes.
- Cost structure. Open deployments require infrastructure investment but offer low ongoing inference costs; closed APIs reverse this, replacing capital expense with usage-based fees. Overall cost depends on institutional scale, technical capacity, and reliability needs (Pan & Wang, 2025).
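The cost-structure trade-off above can be made concrete with a simple break-even calculation: self-hosting an open model trades a fixed infrastructure cost for a low marginal inference cost, while a closed API has no fixed cost but a higher per-token fee. The sketch below is purely illustrative; all dollar figures and token volumes are hypothetical assumptions, not values reported by Pan and Wang (2025) or any study reviewed here.

```python
def monthly_cost_self_hosted(fixed_infra: float, tokens_m: float,
                             marginal_per_m: float) -> float:
    """Fixed infrastructure cost plus a small marginal inference cost.

    tokens_m is monthly volume in millions of tokens.
    """
    return fixed_infra + tokens_m * marginal_per_m


def monthly_cost_api(tokens_m: float, fee_per_m: float) -> float:
    """Pure usage-based pricing: no fixed cost, higher marginal cost."""
    return tokens_m * fee_per_m


def break_even_tokens_m(fixed_infra: float, self_marginal_per_m: float,
                        api_fee_per_m: float) -> float:
    """Monthly token volume (millions) at which the two cost curves cross."""
    return fixed_infra / (api_fee_per_m - self_marginal_per_m)


# Hypothetical figures: a $2,000/month GPU server with $0.10 marginal cost
# per million tokens, versus an API charging $2.00 per million tokens.
volume = break_even_tokens_m(2000.0, 0.10, 2.00)
```

Under these assumed numbers, self-hosting pays off only above roughly one billion tokens per month, which is why the bullet above ties the overall verdict to institutional scale and sustained usage rather than to either deployment model in the abstract.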
2.3. Existing Reviews and Contribution of This Review
3. Approach to the Narrative Review
3.1. Literature Search Strategy
3.2. Identification and Refinement of Relevant Studies
3.3. Analytic Orientation
4. Results
4.1. Open-Source Deployment Mechanisms & Reporting Density
4.2. Human–AI Collaboration and Instructional Roles
4.3. Educational Use & Impact
4.4. Learning Outcomes & Evidence
5. Discussion & Future Directions
5.1. From Introduction to Integration
5.2. From Perceptions to Performance
5.3. From Taxonomy to Orchestration as a Design Goal
A Minimal Orchestration Specification for Early Implementations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Abbas, N., & Atwell, E. (2025). Cognitive computing with large language models for student assessment feedback. Big Data and Cognitive Computing, 9(5), 112. [Google Scholar] [CrossRef]
- Abdallah, N., Katmah, R., Khalaf, K., & Jelinek, H. F. (2025). Systematic review of ChatGPT in higher education: Navigating impact on learning, wellbeing, and collaboration. Social Sciences & Humanities Open, 12, 101866. [Google Scholar] [CrossRef]
- Albadarin, Y., Saqr, M., Pope, N., & Tukiainen, M. (2024). A systematic literature review of empirical research on ChatGPT in education. Discover Education, 3(1), 60. [Google Scholar] [CrossRef]
- Attali, Y., & Burstein, J. (2004). Automated essay scoring with e-rater® v.2.0. Journal of Technology, Learning, and Assessment, 2004(2), i-21. [Google Scholar] [CrossRef]
- Bauer, E., Greiff, S., Graesser, A. C., Scheiter, K., & Sailer, M. (2025). Looking beyond the hype: Understanding the effects of AI on learning. Educational Psychology Review, 37(2), 45. [Google Scholar] [CrossRef]
- Bond, M., Khosravi, H., De Laat, M., Bergdahl, N., Negrea, V., Oxley, E., Pham, P., Chong, S. W., & Siemens, G. (2024). A meta systematic review of artificial intelligence in higher education: A call for increased ethics, collaboration, and rigour. International Journal of Educational Technology in Higher Education, 21(1), 4. [Google Scholar] [CrossRef]
- Chui, C. K., Yang, L., & Kao, B. (2024). Empowering students in emerging technology: A framework for developing hands-on competency in generative AI with ethical considerations. In 2024 ASEE annual conference & exposition. American Society for Engineering Education (ASEE). [Google Scholar]
- Clark, R. E. (1983). Reconsidering research on learning from media. Review of Educational Research, 53(4), 445–459. [Google Scholar] [CrossRef]
- Dahal, R., Murray, G., Chataut, R., Hefeida, M., Srivastava, A., & Gyawali, P. (2025). AutoTA: A dynamic intent-based virtual teaching assistant for students using open source LLMs. IEEE Access, 13, 118122–118134. [Google Scholar] [CrossRef]
- Demiris, G., Oliver, D. P., & Washington, K. T. (2018). Behavioral intervention research in hospice and palliative care: Building an evidence base. Academic Press. [Google Scholar]
- Deng, R., Jiang, M., Yu, X., Lu, Y., & Liu, S. (2024). Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Computers & Education, 227, 105224. [Google Scholar] [CrossRef]
- Dillenbourg, P. (2013). Design for classroom orchestration. Computers & Education, 69, 485–492. [Google Scholar] [CrossRef]
- Ferrari, R. (2015). Writing narrative style literature reviews. Medical Writing, 24(4), 230–235. [Google Scholar] [CrossRef]
- Gao, X., Karumbaiah, S., Dalal, A., Dey, I., Gnesdilow, D., & Puntambekar, S. (2025). A comparative analysis of LLM and specialized NLP system for automated assessment of science content. In 26th international conference on artificial intelligence in education (AIED 2025) (Vol. 15882, pp. 76–82). Springer Nature. [Google Scholar] [CrossRef]
- Hochmair, H. H. (2025). Use and effectiveness of chatbots as support tools in GIS programming course assignments. ISPRS International Journal of Geo-Information, 14(4), 156. [Google Scholar] [CrossRef]
- Holstein, K., & Aleven, V. (2022). Designing for human–AI complementarity in K-12 education. AI Magazine, 43(2), 239–248. [Google Scholar] [CrossRef]
- Holstein, K., Aleven, V., & Rummel, N. (2020). A conceptual framework for human–AI hybrid adaptivity in education. In G. Biswas, T. Barnes, & H. Baker (Eds.), Artificial intelligence in education (Vol. 12163, pp. 294–307). Springer. [Google Scholar] [CrossRef]
- Holstein, K., McLaren, B. M., & Aleven, V. (2019a). Co-designing a real-time classroom orchestration tool to support teacher–AI complementarity. Journal of Learning Analytics, 6(2), 27–52. [Google Scholar] [CrossRef]
- Holstein, K., McLaren, B. M., & Aleven, V. (2019b). Designing for complementarity: Teacher and student needs for orchestration support in AI-enhanced classrooms. In Artificial intelligence in education: 20th international conference, AIED 2019, proceedings, part I. Springer. [Google Scholar] [CrossRef]
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations, 1(2022), 3. [Google Scholar]
- Hussain, Z., Binz, M., Mata, R., & Wulff, D. U. (2024). A tutorial on open-source large language models for behavioral science. Behavior Research Methods, 56, 8214–8237. [Google Scholar] [CrossRef]
- Jošt, G., Taneski, V., & Karakatič, S. (2024). The impact of large language models on programming education and student learning outcomes. Applied Sciences, 14(10), 4115. [Google Scholar] [CrossRef]
- Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., & Krusche, S. (2023). ChatGPT for good? On opportunities and challenges of large language models. Learning and Individual Differences, 103, 102274. [Google Scholar] [CrossRef]
- Koedinger, K. R., & Aleven, V. (2007). Exploring the assistance dilemma in experiments with cognitive tutors. Educational Psychology Review, 19(3), 239–264. [Google Scholar] [CrossRef]
- Lawrence, L., Echeverria, V., Yang, K., Aleven, V., & Rummel, N. (2023). How teachers conceptualise shared control with an AI co-orchestration tool: A multiyear teacher-centred design process. British Journal of Educational Technology, 55(3), 823–844. [Google Scholar] [CrossRef]
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NEURIPS 2020), 33, 9459–9474. Available online: https://dl.acm.org/doi/abs/10.5555/3495724.3496517 (accessed on 16 December 2025).
- Li, Y., Wu, B., Huang, Y., & Luan, S. (2024). Developing trustworthy artificial intelligence: Insights from research on interpersonal, human-automation, and human-AI trust. Frontiers in Psychology, 15, 1382693. [Google Scholar] [CrossRef] [PubMed]
- Lim, T., Gottipati, S., & Cheong, M. (2025). What students really think: Unpacking AI ethics in educational assessments through a triadic framework. International Journal of Educational Technology in Higher Education, 22(1), 56. [Google Scholar] [CrossRef]
- Lin, M. P.-C., Chang, D., Hall, S., & Jhajj, G. (2024). Preliminary systematic review of open-source large language models in education. In A. Sifaleras, & F. Lin (Eds.), Generative intelligence and intelligent tutoring systems (Vol. 14798). Springer. [Google Scholar] [CrossRef]
- Lin, Y., Khan, M. F. F., & Sakamura, K. (2025). Athena: A GenAI-powered programming tutor based on open-source LLM. In 2025 1st international conference on consumer technology (ICCT-PACIFIC) (pp. 1–4). IEEE. [Google Scholar] [CrossRef]
- Lucas, H. C., Upperman, J. S., & Robinson, J. R. (2024). A systematic review of large language models and their applications in medical education. Medical Education, 58(11), 1276–1285. [Google Scholar] [CrossRef]
- Ma, W., Adesope, O. O., Nesbit, J. C., & Liu, Q. (2014). Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4), 901–918. [Google Scholar] [CrossRef]
- Machado, J. (2025). Toward a public and secure generative AI: A comparative analysis of open and closed LLMs. arXiv, arXiv:2505.10603. [Google Scholar] [CrossRef]
- Mai, D. T. T., Da, C. V., & Hanh, N. V. (2024). The use of ChatGPT in teaching and learning: A systematic review through SWOT analysis approach. Frontiers in Education, 9, 1328769. [Google Scholar] [CrossRef]
- Mendonça, P. C., Quintal, F., & Mendonça, F. (2025). Evaluating LLMs for automated scoring in formative assessments. Applied Sciences, 15(5), 2787. [Google Scholar] [CrossRef]
- Meyer, A., Bleckmann, T., & Friege, G. (2025). Automatic feedback on physics tasks using open-source generative artificial intelligence. International Journal of Science Education, 1–26. [Google Scholar] [CrossRef]
- Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019, January 29–31). Model cards for model reporting. Conference on fairness, accountability, and transparency (pp. 220–229), Atlanta, GA, USA. [Google Scholar]
- Nye, B. D., Graesser, A. C., & Hu, X. (2014). AutoTutor and family: A review of 17 years of natural language tutoring. International Journal of Artificial Intelligence in Education, 24(4), 427–469. [Google Scholar] [CrossRef]
- Pan, G., & Wang, H. (2025). A cost-benefit analysis of on-premise large language model deployment: Breaking even with commercial LLM services. arXiv, arXiv:2509.18101. [Google Scholar]
- Parasuraman, R., & Manzey, D. H. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381–410. [Google Scholar] [CrossRef]
- Pareek, S., van Berkel, N., Velloso, E., & Goncalves, J. (2024). Effect of explanation conceptualisations on trust in AI-assisted credibility assessment. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW2), 1–31. [Google Scholar] [CrossRef]
- Poitras, E., Crane, B. G. C., Dempsey, D., Bragg, T. A., Siegel, A. A., & Lin, M. P.-C. (2024). Cognitive apprenticeship and artificial intelligence coding assistants. In Navigating computer science education in the 21st century (pp. 261–281). IGI Global Scientific Publishing. [Google Scholar]
- Rodrigues, L., Pereira, F. D., Toda, A. M., Palomino, P. T., Pessoa, M., Carvalho, L. S. G., Fernandes, D., Oliveira, E. H., Cristea, A. I., & Isotani, S. (2022). Gamification suffers from the novelty effect but benefits from the familiarization effect: Findings from a longitudinal study. International Journal of Educational Technology in Higher Education, 19(1), 1–25. [Google Scholar] [CrossRef]
- Shashidhar, S., Chinta, A., Sahai, V., Wang, Z., & Ji, H. (2023). Democratizing LLMs: An exploration of cost-performance trade-offs in self-refined open-source models. In Findings of the association for computational linguistics: EMNLP 2023 (pp. 9070–9084). Association for Computational Linguistics. [Google Scholar] [CrossRef]
- Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge. [Google Scholar] [CrossRef]
- Shu, Z., Zhang, J., & Li, Z. (2023). Design of pedagogical agent based on open-source large language model in online learning. In 2023 twelfth international conference of educational innovation through technology (EITT) (pp. 71–74). IEEE. [Google Scholar] [CrossRef]
- Siemens, G. (2013). Learning analytics: The emergence of a discipline. American Behavioral Scientist, 57(10), 1380–1400. [Google Scholar] [CrossRef]
- Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, 67, 101178. [Google Scholar]
- Snyder, H. (2019). Literature review as a research methodology: An overview and guidelines. Journal of Business Research, 104, 333–339. [Google Scholar] [CrossRef]
- Steenbergen-Hu, S., & Cooper, H. (2014). A meta-analysis of the effectiveness of intelligent tutoring systems on college students’ academic learning. Journal of Educational Psychology, 106(2), 331–347. [Google Scholar] [CrossRef]
- Sukhera, J. (2022). Narrative reviews: Flexible, rigorous, and practical. Journal of Graduate Medical Education, 14(4), 414–417. [Google Scholar] [CrossRef] [PubMed]
- VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221. [Google Scholar] [CrossRef]
- Wang, X., Niu, J., Fang, B., Han, G., & He, J. (2025). Empowering teachers’ professional development with LLMs: An empirical study of developing teachers’ competency for instructional design in blended learning. Teaching and Teacher Education, 165, 105091. [Google Scholar] [CrossRef]
- Winne, P. H. (2020). A proposed remedy for grievances about self-report methodologies. Frontline Learning Research, 8(3), 164–173. [Google Scholar] [CrossRef]
- Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2023). Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55(1), 90–112. [Google Scholar] [CrossRef]
- Yee-King, M., & Fiorucci, A. (2025). Deploying language model-based assessment support technology in a computer science degree: How do the academics feel about it? In 2025 IEEE global engineering education conference (EDUCON) (pp. 1–8). IEEE. [Google Scholar] [CrossRef]
- Zhai, C., Wibowo, S., & Li, L. D. (2024). The effects of over-reliance on AI dialogue systems on students’ cognitive abilities: A systematic review. Smart Learning Environments, 11(1), 28. [Google Scholar] [CrossRef]


| Study | Designers | Facilitators | Monitors | Evaluators |
|---|---|---|---|---|
| (Shu et al., 2023) | ✓ | – | – | – |
| (Y. Lin et al., 2025) | ✓ | – | – | – |
| (Dahal et al., 2025) | ✓ | ✓ | ✓ | ✓ |
| (Chui et al., 2024) | ✓ | ✓ | ✓ | – |
| (Abbas & Atwell, 2025) | ✓ | – | ✓ | ✓ |
| (Hochmair, 2025) | ✓ | ✓ | – | – |
| (Mendonça et al., 2025) | ✓ | – | ✓ | ✓ |
| (Gao et al., 2025) | ✓ | – | ✓ | – |
| (Yee-King & Fiorucci, 2025) | ✓ | – | ✓ | ✓ |
| (Meyer et al., 2025) | ✓ | – | ✓ | ✓ |
| Task Category | LLM Task Type | Ref. |
|---|---|---|
| Tutoring & Guidance | T1—AI Tutor/Teaching Assistant | (Dahal et al., 2025; Y. Lin et al., 2025; Shu et al., 2023) |
| Tutoring & Guidance | T2—General-purpose Chatbot for Assignments | (Hochmair, 2025) |
| Automated Assessment | A1—Automated Scoring/Grading | (Mendonça et al., 2025) |
| Automated Assessment | A2—Formative Feedback (non-grading) | (Abbas & Atwell, 2025; Gao et al., 2025; Meyer et al., 2025) |
| Instructional Content Preparation | – | (Yee-King & Fiorucci, 2025) |
| Not Applicable | – | (Chui et al., 2024) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Lin, M.P.-C.; Huang, J.-Y.; Chang, D.H.; Tembrevilla, G.; Bowen, G.M.; Poitras, E.; Janarthanan, V.; Ryoo, J. Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes. AI Educ. 2026, 2, 4. https://doi.org/10.3390/aieduc2010004

