Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes

Lin, Michael Pin-Chuan; Huang, Jing-Yuan; Chang, Daniel H.; Tembrevilla, Gerald; Bowen, G. Michael; Poitras, Eric; Janarthanan, Vasudevan; Ryoo, Jeeho

doi:10.3390/aieduc2010004

Open AccessReview

Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes

by

Michael Pin-Chuan Lin

^1,*

,

Jing-Yuan Huang

²

,

Daniel H. Chang

²

,

Gerald Tembrevilla

¹

,

G. Michael Bowen

¹

,

Eric Poitras

³

,

Vasudevan Janarthanan

⁴

and

Jeeho Ryoo

⁴

¹

Faculty of Education, Mount Saint Vincent University, Halifax, NS B3M 2J6, Canada

²

Faculty of Education, Simon Fraser University, Burnaby, BC V5A 1S6, Canada

³

Faculty of Computer Science, Dalhousie University, Halifax, NS B3H 4R2, Canada

⁴

Olsen College of Engineering and Science, Fairleigh Dickinson University, Vancouver, BC V6B 2P6, Canada

^*

Author to whom correspondence should be addressed.

AI Educ. 2026, 2(1), 4; https://doi.org/10.3390/aieduc2010004

Submission received: 17 December 2025 / Revised: 16 February 2026 / Accepted: 23 February 2026 / Published: 27 February 2026

Download

Browse Figures

Versions Notes

Abstract

Open-source large language models (LLMs) are increasingly explored in educational contexts due to their transparency, adaptability, and alignment with institutional governance and equity considerations. Despite growing interest, empirical research on how open-source LLMs are deployed in education and what evidence currently supports their integration remains limited and fragmented. This paper presents a state-of-the-art narrative review of peer-reviewed, human empirical studies examining the use of open-source LLMs in education. Guided by three questions, the review synthesizes how open-source LLMs are deployed across instructional contexts, what learner-related evidence is reported, and how teachers engage in human–AI collaboration. The reviewed literature is concentrated in higher education, particularly within computer science and programming domains, with applications focused on post-class tutoring, guidance, and formative feedback. Learner perceptions are generally positive, but evidence linking open-source LLM use to measurable learning outcomes remains emerging and inconsistent. Through interpretive synthesis, the review articulates a four-role model—Designer, Facilitator, Monitor, and Evaluator—that captures how teacher agency is enacted across AI-supported instructional workflows. This review maps recurring orchestration dimensions, decision points, and tensions that characterize early implementations, and it proposes a minimal orchestration reporting scaffold (configuration, boundaries, logging, adjudication) intended to support auditability and cross-study comparison as the empirical base develops.

Keywords:

large language models; open source; education; pedagogy

1. Introduction

1.1. Motivation

Open-source large language models (LLMs) have attracted increasing attention in education due to their potential to support transparency, adaptability, and institutional accountability in the use of instructional AI (M. P.-C. Lin et al., 2024; Yan et al., 2023). With open-source LLMs’ flexibility and control, institutions can tailor models to curricula, restrict retrieval to trusted sources, and deploy them on-premises or in sovereign cloud environments to ensure privacy and regulatory compliance, thereby transforming general-purpose model capabilities into course-specific and assessable instructional workflows (Hussain et al., 2024). They also offer cost flexibility for resource-constrained programs and research groups, reducing vendor lock-in and enabling comparative evaluation of models, prompts, and governance protocols across institutional contexts (Machado, 2025; Shashidhar et al., 2023). These characteristics distinguish open-source LLMs from proprietary systems that dominate current educational deployments. While closed-source models have enabled rapid experimentation through scalable APIs, their limited transparency and restricted control over data, model behavior, and lifecycle changes raise concerns for long-term pedagogical alignment, assessment validity, and institutional governance (Bond et al., 2024; Yan et al., 2023). Therefore, open-source LLMs are increasingly viewed as auditable collaborators in tutoring, assessment, and instructional design, aligning with broader open-science principles of reproducibility and accountability (Lim et al., 2025; M. P.-C. Lin et al., 2024).

1.2. Research Gap

While scholarship on LLMs in education has expanded rapidly, most existing syntheses foreground general-purpose or closed-source systems, particularly commercial tools (e.g., ChatGPT) (Lucas et al., 2024; Mai et al., 2024). These reviews typically examine LLMs at a broad capability level without distinguishing between proprietary and open-source implementations (Abdallah et al., 2025; Albadarin et al., 2024; Bauer et al., 2025; Deng et al., 2024; Yan et al., 2023). Consequently, they offer limited visibility into whether, where, and how open-source LLMs are being adopted in classrooms, laboratories, and assessment settings.

Although a small number of studies have begun to explore open-source LLMs in specific educational contexts, this work remains fragmented and heterogeneous in both design and reporting. Many studies are exploratory, focus on narrow use cases, or emphasize system performance rather than instructional practice or learner outcomes (M. P.-C. Lin et al., 2024). Additionally, a coherent understanding of adoption patterns, pedagogical roles, and educational implications of open-source LLMs is still emerging.

Moreover, while open-source LLM deployments in education inherently involve human guidance, few studies explicitly examine how teachers shape, constrain, and evaluate these systems as part of everyday instructional practice (Holstein et al., 2020; Lawrence et al., 2023). This lack of attention to teacher roles limits the field’s ability to move from proof-of-concept implementations toward reproducible and pedagogically grounded integration. For educators, research on human and intelligent tutoring systems (ITS) shows that learning gains are achievable under guided support (Steenbergen-Hu & Cooper, 2014; VanLehn, 2011). Similarly, studies of AI-supported classrooms suggest that educational impact hinges on how teachers coordinate human–AI interactions (Holstein & Aleven, 2022; Holstein et al., 2019b).

In short, existing reviews provide valuable overviews of LLMs in education but offer limited insight into open-source implementations, learner-level evidence, and teacher-led human–AI collaboration. An interpretive synthesis of emerging human empirical work is therefore necessary to clarify what has been studied, the types of evidence currently available, and where critical gaps remain.

1.3. Purpose and Guiding Questions

Motivated by these gaps, this narrative review synthesizes human empirical research on the use of open-source LLMs in education. The review seeks to clarify how these models are currently used, what forms of evidence have been reported, and how educators shape human–AI collaboration in practice. To organize this synthesis, the review is guided by three questions:

Question 1 (Educational Use & Impact): How are open-source LLMs being used across educational contexts, and what impacts on teaching practices and student learning are reported?
Question 2 (Learning Outcomes & Evidence): What kinds of learning and perception outcomes have been evaluated in studies using open-source LLMs, and what evidence is reported about their effectiveness for learning?
Question 3 (Human–AI Collaboration): What roles do teachers play in human–AI collaboration with open-source LLMs, and how do these roles shape the design and outcomes of learning activities?

Together, these questions support a structured narrative synthesis that maps current practices, highlights limitations in the existing evidence base, and informs future research on reproducible and responsible integration of open-source LLMs in education.

2. Background

2.1. Prior Uses of AI/LLMs in Education

Prior to the rise of LLMs, AI in education primarily centered on ITS, which model domain knowledge, learner progress, and pedagogical strategies to deliver step-by-step or dialogue-based guidance emulating one-on-one tutoring (Nye et al., 2014; Steenbergen-Hu & Cooper, 2014; VanLehn, 2011). Over several decades, ITS research has demonstrated the potential of such adaptive, feedback-driven environments to support individualized learning and improve student outcomes across diverse subjects (Ma et al., 2014). Another branch of work simultaneously advanced automated assessment, from early statistical scoring to NLP-based essay and short-answer evaluation, enabling scalable formative feedback (Attali & Burstein, 2004; Shermis & Burstein, 2013). A third branch emphasized learning analytics to power adaptive practice and instructor dashboards (Siemens, 2013). Across these strands, systems were task-specific and authoring-intensive, yielding strong results in narrow domains.

In this era, LLMs introduced a qualitative shift. Rather than developing separate dialogue or scoring systems for each task, a single general-purpose model can be prompted or lightly adapted to provide tutoring, writing feedback, code support, rubric generation, and on-demand explanations, all with relatively low development cost. In educational workflow, this shift enables rapid prototyping of tutor-like interactions, dynamic hinting and exemplars, and bootstrapping of curriculum-aligned materials via retrieval-augmented generation (RAG) and parameter-efficient tuning (Hu et al., 2022; Lewis et al., 2020). LLMs also expand the locus of support from learners to instructors—assisting lesson planning, assessment design, and differentiation—thereby moving AI from a niche tool embedded in a single activity to preparation, instruction, and assessment (Kasneci et al., 2023).

2.2. Open-Source vs. Closed-Source LLMs

In this review, we adopt a practical, deployment oriented notion of open-source LLMs as an umbrella category encompassing both fully open-source models and open-weight models, in which model parameters are released and permit local execution and modification for instructional use, even if some training details or data are not fully disclosed. This operational choice centers on educational needs such as governance, auditability, and curriculum alignment, rather than legal formalism alone.

For educational settings, the distinction between open and closed models matters along five axes that shape pedagogy and institutional policy:

Licensing. “Open” releases (often open weights under community/custom licenses) allow local use and some adaptation, yet may restrict activities such as using model outputs for retraining or competitive purposes. Closed models rely on proprietary terms and API access, with capabilities and uses controlled by providers.
Transparency. Open releases include model weights, inference code, and model cards, enabling subgroup auditing and error analysis; closed systems offer only limited documentation such as “system cards,” with restricted access to data or training recipes (Mitchell et al., 2019).
Adaptability. Open weights support parameter-efficient finetuning (Hu et al., 2022) and pedagogical integration, whereas closed APIs allow only prompt-level adjustments with no deep modification.
Deployment control. Open models can run on-premises or in sovereign clouds for privacy/compliance and version pinning; closed providers simplify operations and offer limited data-residency options, but usage remains bound by provider policy and lifecycle changes.
Cost structure. Open deployments require infrastructure investment but offer low ongoing inference costs; closed APIs reverse this, replacing capital expense with usage-based fees. Overall cost depends on institutional scale, technical capacity, and reliability needs (Pan & Wang, 2025).

Representative open-source families shown in research include Llama, Mistral/Mixtral, Phi, and DeepSeek.

2.3. Existing Reviews and Contribution of This Review

As outlined in Section 1.1 and Section 1.2, prior reviews of LLMs in education have largely emphasized proprietary systems and broad capability trends, giving limited attention to open-source deployments, teacher roles, and human empirical learning outcomes. Instead of re-summarizing that literature, this review offers a focused synthesis of empirical studies that explicitly examine open-source LLMs in educational contexts. Given the emerging and heterogeneous nature of this research area, the review adopts a narrative, state-of-the-art approach that prioritizes interpretive clarity over exhaustive coverage (Demiris et al., 2018; Sukhera, 2022). Structured database searches were used to identify relevant peer-reviewed studies, but the purpose of the synthesis is not to claim completeness. Instead, the review seeks to clarify patterns of use, evidentiary trends, and pedagogical roles reported in existing human empirical work, consistent with narrative review approaches outlined by (Snyder, 2019). By situating these studies in relation to one another, the review highlights what has been examined to date, where empirical evidence remains limited, and how future research can move toward more reproducible, pedagogically grounded, and transparent integration of open-source LLMs in education.

Importantly, the contribution of this review does not rest on demonstrating that open-source LLMs currently produce fundamentally different learning outcomes than proprietary systems. Rather, its distinct value lies in examining how open-source deployment reshapes the constraints, affordances, and decision spaces within which educators design, govern, and reproduce AI-supported instruction. Even where pedagogical usage patterns resemble those reported in broader LLM-in-education reviews, the open-source context foregrounds issues of orchestration, accountability, and reproducibility that remain largely invisible in syntheses centered on closed, API-mediated systems.

3. Approach to the Narrative Review

This review adopts a narrative, state-of-the-art approach to synthesize emerging human empirical research on open-source LLMs in educational contexts. Narrative reviews are particularly appropriate for areas characterized by rapid technological change, conceptual diversity, and limited empirical consolidation, where the goal is to interpret patterns, tensions, and trajectories in the literature rather than to exhaustively enumerate or statistically aggregate studies (Sukhera, 2022). Following established guidance for narrative reviews, this review was guided by clearly articulated research questions, explicit conceptual boundaries, and a transparent yet flexible approach to literature identification and analysis (Demiris et al., 2018; Ferrari, 2015). The intent was not to identify all possible studies on open-source LLMs in education, but to assemble a sufficient and informative corpus of peer-reviewed human empirical work capable of supporting interpretive synthesis and critical analysis.

For the purposes of this review, inclusion was bounded to peer-reviewed empirical studies conducted in formal educational settings that examined the use of open-source or open-weight LLMs in relation to instructional activities, learning processes, or learner outcomes. Conceptual papers, technical system descriptions without educational implementation, and studies focused exclusively on proprietary models were not considered within the analytic scope.

The three guiding questions focused on the educational applications of open-source LLMs, evidence related to student learning and perceptions, and the instructional roles educators play in human–AI collaboration. This orientation supports integrative analysis across diverse study designs, instructional settings, and outcome measures, foregrounding conceptual coherence and pedagogical insight within an evolving research landscape.

3.1. Literature Search Strategy

Multiple scholarly databases, spanning both education and computer science, were consulted, including the ACM Digital Library, IEEE Xplore, EBSCOhost, Web of Science, and Scopus. Prior to the final search session, preliminary trials of search terms were conducted to ensure that the combinations yielded conceptually relevant and interpretable results. Once appropriate terms were established, database searches were conducted within a defined time window.

Search terms focused on three core concepts: large language models, open-source or locally deployable systems, and educational contexts. Consistent with narrative review practices, the search strategy was designed to support interpretive synthesis rather than exhaustive coverage, to identify a sufficient body of literature to inform conceptual analysis rather than to enumerate all available publications on the topic (Ferrari, 2015). Accordingly, the search was designed to prioritize relevance to the guiding questions and pedagogical interpretability, rather than exhaustive retrieval. We anticipated that the resulting corpus would be necessarily small, given the current state of empirical research on open-source LLMs in education.

3.2. Identification and Refinement of Relevant Studies

The combined database searches returned approximately 441 records. After removing duplicates, titles and abstracts were reviewed for relevance, and potentially relevant records were examined at full-text. Through progressive refinement and close reading, ten studies were retained for synthesis.

Because the corpus is small and the purpose is interpretive synthesis, selection emphasized transparency and coherence of contribution rather than exhaustive coverage (Demiris et al., 2018). In practice, full texts were retained when they described an implemented educational use case involving an open-source or open-weight LLM and provided sufficient methodological and contextual detail to interpret how the system was used and what was examined. Ambiguous candidates that partially met the scope boundaries were discussed within the author team and retained only when they contributed a distinct mechanism relevant to open source educational deployment, such as implementation constraints, oversight practices, traceability features, or a learner facing role not already represented in the developing corpus. Consistent with the narrative state-of-the-art orientation described above, refinement prioritized corpus sufficiency for the guiding questions rather than exhaustive enumeration.

Full-text exclusions largely reflected the scope boundaries stated above, typically because candidate papers either did not report an implemented educational use case with an open-source or open-weight LLM or did not provide sufficient detail for interpretive analysis.

3.3. Analytic Orientation

Given the diversity of research designs, educational settings, and outcome measures represented in the reviewed studies, and the exploratory character of much of this work, findings were synthesized narratively. The analysis focused on identifying recurring patterns in instructional use, learner engagement and perceptions, and instructor involvement in human–AI collaboration. The synthesis emphasizes conceptual interpretation and pedagogical implications, acknowledging that different review teams or contexts may yield alternative yet equally valid interpretations of the same body of literature. Accordingly, the analytic aim is to integrate concepts by mapping mechanisms, tensions, and role functions; the review is not intended to be exhaustive or meta-analytic.

4. Results

This section synthesizes patterns across the reviewed studies, organizing the synthesis around deployment-level mechanisms distinctive to open-source or locally hosted LLM use in education: governance constraints, orchestration choices, and reproducibility practices. Pedagogical contexts, task orientations, and outcome evidence are then interpreted through these mechanisms and treated as secondary descriptive patterns. Given the small and heterogeneous empirical corpus, and the uneven reporting of governance and reproducibility in primary studies, this synthesis relies partly on interpretive integration of sparse reporting, as implementation descriptions are not consistently comprehensive or directly comparable (Ferrari, 2015; Sukhera, 2022). Accordingly, the patterns described below should be interpreted as emerging tendencies rather than definitive or exhaustive characterizations of the field.

4.1. Open-Source Deployment Mechanisms & Reporting Density

Across the reviewed studies, the use of open-source LLM is most consistently justified at the deployment layer rather than the interaction layer. Authors cite institutional control over data handling, model access, and system configuration as benefits of open-source models, along with claims about transparency and reproducibility. However, descriptions of these mechanisms are uneven across the corpus. In several studies (e.g., Abbas & Atwell, 2025; Meyer et al., 2025; Yee-King & Fiorucci, 2025), governance and orchestration are stated as motivations or design principles but are only partially specified in terms of concrete implementation details (e.g., what is logged, how model versions are fixed, what human review thresholds trigger escalation, or how retrieval sources are constrained). This inconsistent reporting substantially limits cumulative inference based on the current evidence and motivates the mechanism-focused synthesis that follows.

In the remainder of this section, three deployment-level mechanisms recur across studies when they are described: (1) governance constraints that shape what data and outputs can be stored, audited, or shared; (2) orchestration choices that determine how teacher oversight is configured, including what the AI is permitted to do and when human review is required; and (3) reproducibility practices, including the extent to which prompts, model versions, retrieval sources, and decision thresholds are stabilized and documented across use.

4.2. Human–AI Collaboration and Instructional Roles

Across the reviewed studies, the clearest locus of open-source distinctiveness is not learner-facing pedagogy but how instructors are positioned to configure, constrain, supervise, and adjudicate AI-supported activity. These instructional moves operationalize the deployment-level mechanisms introduced above. Governance constraints become visible when teachers determine what data, outputs, and records can be stored and audited. Orchestration choices become visible when teachers specify what the system is permitted to do, when students may use it, and when human review is required. Reproducibility practices become visible when teachers stabilize and document prompts, model versions, retrieval sources, and acceptance thresholds. Because none of the reviewed studies theorize teacher involvement using a shared role framework, the four roles below are offered as an interpretive synthesis that consolidates recurrent patterns across heterogeneous designs. Role characterizations were inferred from author descriptions of system configuration, protocols, and reported instructor responsibilities, because direct analyses of teacher practice were rarely provided. Table 1 summarizes how these roles are manifested across studies, while the discussion below synthesizes how they function in practice.

(1) Teachers as Designers: This role primarily instantiates orchestration by specifying system boundaries, and it can support reproducibility when configuration choices are stabilized and documented. The designer role foregrounds the sociotechnical foundations of AI-supported instruction. Rather than beginning solely from pedagogical intentions, this role involves defining what the AI system can and cannot do through model selection, prompt engineering, and system configuration. Across studies, teachers acting as designers selected or fine-tuned open-source models such as ChatGLM-6B, Llama/LLaVA, Mistral, and OpenChat, and developed prompt templates, rubrics, and task constraints aligned with instructional goals (Abbas & Atwell, 2025; Dahal et al., 2025; Y. Lin et al., 2025; Meyer et al., 2025; Shu et al., 2023). In tutoring contexts, design decisions frequently involved constraining models to course materials and predefined task structures. In programming education, “no direct answers” prompting strategies were commonly employed to shift AI support toward learning-to-learn processes rather than answer provision (Dahal et al., 2025; Y. Lin et al., 2025). In assessment-oriented applications, the designer role was expressed through alignment of rubrics, workflows, and acceptance thresholds, including decisions about which assessment components the AI could draft or modify, when human review was required, and which quantitative metrics (e.g., Cohen’s

κ

, precision, recall, F1, adoption thresholds, or spot-check rates) determined acceptable performance (Mendonça et al., 2025; Meyer et al., 2025; Yee-King & Fiorucci, 2025). Across studies, open-source deployment did not imply reduced instructor involvement; instead, it redistributed control and transparency to educators, increasing the importance of clearly defined boundaries and protocols prior to classroom implementation.

(2) Teachers as Facilitators: This role instantiates orchestration at the activity level by regulating access, pacing, and scaffolding, including decisions about when AI support is available and what forms of help are encouraged. Once instruction is underway, teachers assume the role of facilitators who regulate when, how, and to what extent AI support is introduced into the learning process. Facilitation involves maintaining instructional pace, managing cognitive load, and preserving learner agency through deliberate scaffolding strategies (Dahal et al., 2025; Hochmair, 2025). In tutoring and guidance applications, facilitation commonly includes structured in-class clarification turns, intent-routed virtual teaching assistants grounded in course materials, and usage rules such as “think first, then ask” to discourage premature reliance on AI assistance during practice or assignments (Dahal et al., 2025; Hochmair, 2025; Y. Lin et al., 2025). Several studies also described workshops and live demonstrations that modeled responsible AI use and highlighted how to recognize and respond to system failures (Chui et al., 2024; Hochmair, 2025). Across contexts, facilitation emphasized selective activation of AI support rather than continuous availability. The pedagogical aim was to sustain student engagement and progress while avoiding scenarios in which AI either supplanted learning processes or left students unsupported when encountering difficulties (Dahal et al., 2025; Hochmair, 2025; Y. Lin et al., 2025).

(3) Teachers as Monitors: This role most directly operationalizes governance and traceability by imposing oversight protocols, documentation practices, and reliability thresholds that determine what outputs may enter instruction or assessment workflows. Monitoring emerges as a critical role in ensuring the reliability and credibility of AI-supported instruction. Given persistent risks such as hallucinations or inconsistent outputs, all reviewed implementations retained humans in the loop (Mendonça et al., 2025; Yee-King & Fiorucci, 2025). Teachers acting as monitors translated AI efficiency into instructional reliability through several mechanisms. These monitoring mechanisms are described in only a subset of the reviewed studies and are presented here as design features reported in early implementations, not as established norms across educational settings.

First, scope containment strategies were employed to limit AI outputs to designated sources, thereby reducing hallucinations through techniques such as intent routing, entity mapping, and course-curated retrieval-augmented generation (Dahal et al., 2025). Second, traceability was emphasized by logging prompts, model versions, sources, and outputs, enabling retrospective review and student self-justification (Dahal et al., 2025; Yee-King & Fiorucci, 2025). Third, studies established eligibility thresholds to determine whether AI outputs were sufficiently reliable for instructional use. These included inter-rater agreement measures such as Cohen’s

κ

to assess alignment between AI and human judgments (Gao et al., 2025; Meyer et al., 2025), as well as predefined tolerance ranges for “practical equivalence” between AI-generated and human-assigned scores (Mendonça et al., 2025). Through these mechanisms, LLMs were entrusted with high-volume routine tasks, while teachers retained responsibility for complex judgments requiring contextual or subjective interpretation (Abbas & Atwell, 2025; Gao et al., 2025; Mendonça et al., 2025; Meyer et al., 2025). Monitoring also extended to academic integrity, where similarity or collusion detection supported assessment workflows, but flagged cases were consistently escalated to human review and final decision-making to ensure procedural fairness (Yee-King & Fiorucci, 2025).

(4) Teachers as Evaluators: This role instantiates governance through accountability for final decisions, and it can support reproducibility when adoption, revision, or rejection decisions are recorded and reused as stable decision rules. The evaluator role underscores that final authority over instructional and assessment decisions remains with human instructors. Whereas monitoring focuses on ongoing reliability and boundary enforcement, evaluation involves establishing normative standards and rendering final judgments about AI outputs. Teachers decide which outputs to adopt, revise, or reject, and how AI-generated contributions align with pedagogical and institutional expectations. In automated scoring contexts, rubrics and consistency tests were used to establish baselines for alignment with human raters, with grading policies ultimately reflecting human judgment (Mendonça et al., 2025). In formative feedback applications, multi-tiered rating systems distinguished between feedback that could be used directly, revised prior to use, or discarded altogether (Meyer et al., 2025). For instructional content preparation and academic integrity investigations, teachers validated item quality and interpreted investigative outcomes, with insights feeding back into iterative refinement of prompts, rubrics, and retrieval sources (Abbas & Atwell, 2025; Meyer et al., 2025; Yee-King & Fiorucci, 2025). This write-back loop transformed individual evaluative actions into traceable institutional records, enabling reuse, auditing, and refinement across courses and terms (Abbas & Atwell, 2025; Yee-King & Fiorucci, 2025).

4.3. Educational Use & Impact

This subsection describes the contexts in which open-source LLM deployments have been situated and the task orientations they have served. These patterns are reported to contextualize the mechanism-focused synthesis above, as learner-facing uses often resemble those in proprietary settings even when governance and orchestration constraints differ.

Across the reviewed studies, the use of open-source LLMs appears to be predominantly situated in higher education contexts, with a smaller but emerging presence in K-12 settings. Most higher education studies focus on computer science and programming-related domains, including undergraduate and graduate programming courses as well as postgraduate data science programs (Abbas & Atwell, 2025; Chui et al., 2024; Dahal et al., 2025; Hochmair, 2025; Y. Lin et al., 2025; Mendonça et al., 2025; Shu et al., 2023; Yee-King & Fiorucci, 2025). In contrast, the two K–12 studies were conducted in secondary-school physics classrooms, where open-source LLMs were used to support science writing and problem-solving activities (Gao et al., 2025; Meyer et al., 2025). One additional study explored video-based learning in a health education context, representing a non-traditional subject area relative to the dominant focus on computing disciplines (Shu et al., 2023).

In terms of instructional timing, LLM use is most frequently reported as being positioned in post-class settings, where systems support assignments, self-study, and revision activities (Abbas & Atwell, 2025; Gao et al., 2025; Hochmair, 2025; Y. Lin et al., 2025; Mendonça et al., 2025; Meyer et al., 2025). A smaller number of studies report in-class deployments, in which LLMs function as virtual teaching assistants or tutors (Dahal et al., 2025; Shu et al., 2023). Additional applications include instructional support for generating starter drafts of exam questions (Yee-King & Fiorucci, 2025), as well as a multi-stage pedagogical design in which students progressed through phases of adoption, development, and application of generative AI tools (Chui et al., 2024). Taken together, these patterns suggest a strong concentration of research in higher education, computer science, and post-class learning contexts, with comparatively limited but emerging exploration in K-12 environments and other disciplines.

The educational uses of open-source LLMs identified in the reviewed studies can be provisionally organized into three broad task orientations, summarized in Table 2. Two dominant application types emerge: Tutoring and Guidance and Automated Assessment. Within Tutoring and Guidance, most systems are designed to function as virtual teaching assistants or mentors that provide real-time or personalized support, such as Athena (Y. Lin et al., 2025), AutoTA (Dahal et al., 2025), and a pedagogical agent described by (Shu et al., 2023). These applications emphasize stepwise hints, exemplars, and adaptive scaffolding, particularly to support novice learners. One study adopts a more general-purpose chatbot approach to assist with assignments (Hochmair, 2025). Across tutoring-oriented applications, descriptions of teacher orchestration and quality control are often implicit. This reporting gap reinforces the need to treat governance and orchestration as analytic objects rather than as assumed implementation details.

Automated Assessment applications are oriented toward either automated scoring and grading or formative feedback. One study evaluates the feasibility of using open-source LLMs, such as Llama 3.2, for automated scoring by comparing their performance with GPT-4o and human graders (Mendonça et al., 2025). The remaining studies emphasize formative feedback rather than grading, with two situated in secondary-school physics contexts (Gao et al., 2025; Meyer et al., 2025) and one in a higher education data science program (Abbas & Atwell, 2025). These systems provide structured, short-cycle feedback intended to guide revision rather than assign scores. Examples include the use of Llama-3 to detect key content units in middle school science writing (Gao et al., 2025), a quantized OpenChat 3.6 model to generate German-language feedback for physics problem-solving (Meyer et al., 2025), and the use of Mistral-7B and CodeLlama-7B to support feedback on programming and narrative components of graduate-level reports (Abbas & Atwell, 2025).

A smaller set of applications focuses on Instructional Content Preparation, where open-source LLMs assist instructors in generating or checking assessment materials while leaving final decisions under human control (Yee-King & Fiorucci, 2025). Although some studies report slightly lower performance or consistency of open-source models in specific quantitative tasks when compared to proprietary systems (Gao et al., 2025; Mendonça et al., 2025; Meyer et al., 2025), authors frequently emphasize advantages related to privacy, transparency, and institutional control.

One included study focuses on developing students’ proficiency with generative AI tools rather than deploying open-source LLMs as instructional supports (Chui et al., 2024). In this case, the open-source LLM serves primarily as an object of learning rather than a pedagogical agent. While the study reports on instructional activities involving generative AI, it does not present evidence regarding the effectiveness or processes of LLM-supported instruction. It is therefore treated as conceptually distinct for the purposes of analyzing educational use in this subsection.

Mapping task orientations onto instructional contexts reveals several recurring tendencies (Figure 1). The reviewed studies are primarily comparative and evaluative, with adoption concentrated in post-class higher education settings, particularly for automated assessment and feedback. For instructors, reported impacts mainly involve examining the feasibility and workflow implications of delegating routine instructional tasks to open-source LLMs. For students, tutoring and feedback-oriented designs seem to enhance self-study and revision by increasing the density and immediacy of guidance, which may lower entry barriers for novices in technical domains (Abbas & Atwell, 2025; Gao et al., 2025; Hochmair, 2025; Y. Lin et al., 2025; Meyer et al., 2025).

At the same time, direct causal evidence of learning or achievement gains remains limited, as discussed further in Section 4.2. Most studies prioritize system alignment, usability, or agreement with human judgments, alongside student and instructor perceptions. As a result, reported benefits are best interpreted as process-level affordances whose educational impact is contingent on instructional design, teacher oversight, and human-in-the-loop implementation policies.

4.4. Learning Outcomes & Evidence

This subsection synthesizes patterns in the reviewed literature regarding learner-level outcomes associated with the use of open-source LLMs in educational settings. To organize the evidence, outcomes are interpreted along two broad families, as illustrated in Figure 2: student perception outcomes and objective learning outcomes. Student perception outcomes refer to learners’ self-reported attitudes, beliefs, and experiences while engaging with open-source LLMs, typically captured through Likert-type surveys, interviews, or reflective writing. These include constructs such as perceived usefulness, motivation, engagement, confidence or self-efficacy, satisfaction, trust, and awareness of ethical or AI literacy. Objective learning outcomes, in contrast, refer to demonstrable indicators of knowledge or skill development attributable to LLM-supported activities, such as test or quiz scores, rubric-based assessments, performance tasks, or measures of retention and transfer. Given the limited learner-level evidence, outcome claims are interpreted cautiously and read in relation to how oversight and scaffolding are described, since orchestration choices can shape whether positive perceptions translate into performance. System-level metrics, including model accuracy, automated–human alignment, and feedback quality, along with instructor-only perceptions, are treated as contextual indicators. These measures inform feasibility and implementation considerations but do not directly reflect changes in learners’ perceptions or achievement. Accordingly, they are reported as background evidence for interpreting implementations rather than as learner outcomes.

Across the ten reviewed studies, learner-level outcomes are reported inconsistently. Four studies report student perception outcomes, spanning online video-based learning with a pedagogical agent (Shu et al., 2023), a multi-week deployment of a retrieval-augmented programming tutor (Y. Lin et al., 2025), a content-plus-ethics instructional framework focused on generative AI literacy (Chui et al., 2024), and chatbot-supported learning in a GIS programming course (Hochmair, 2025). Only one study reports objective learning outcomes with statistical analysis (Hochmair, 2025). The remaining studies primarily focus on system performance or instructor perspectives and do not include learner-level perception or achievement measures.

Student perception outcomes: Across studies that examine learner perceptions, reported experiences are generally positive. In the video-based learning study employing a pedagogical agent, students reported an improved overall learning experience and a greater willingness to continue using the system compared to a control condition, with perceived personalization emerging as a salient contributor to these effects (Shu et al., 2023). In programming-focused contexts, open-source LLMs are associated with increased motivation, confidence, and perceived usefulness. Survey responses frequently exceeded mid-to-high scale thresholds for motivation, engagement, helpfulness, and intention to use similar tools in the future. Students highlighted the value of receiving multiple perspectives and on-demand problem-solving assistance from the tutor (Y. Lin et al., 2025). Similarly, participants in the GIS programming course described positive experiences related to the interactive nature and constant availability of chatbot support (Hochmair, 2025). In the ethics-oriented instructional design, reflective analyses suggested growth in learners’ awareness of LLM capabilities and limitations, alongside increased sensitivity to responsible-use considerations. Students articulated more calibrated attitudes toward AI assistance, recognizing both its affordances and constraints (Chui et al., 2024). Together, these studies suggest that open-source LLMs may enhance learners’ perceived support and engagement, particularly when positioned as accessible, responsive resources.

Moreover, several studies point to vulnerabilities in trust calibration and dependency. In the GIS programming context, students reported confusion, concerns about trust, and anxiety regarding over-reliance on chatbot output. Notably, the introduction of LLM support was negatively correlated with quiz performance, raising concerns about potential interference with code comprehension (Hochmair, 2025). Reflections from the ethics-focused study similarly cautioned against excessive dependence on AI-generated content, with students identifying risks such as diminished critical engagement, inappropriate emotional attachment, or habitual reliance on the AI agent (Chui et al., 2024). These findings highlight that positive perceptions do not uniformly translate into effective learning behaviors.

Objective learning outcomes: Evidence related to objective learning outcomes is limited. In the GIS programming study, students performed worse on a short code-interpretation quiz following the introduction of LLM support, even after controlling for prior programming experience (Hochmair, 2025). Another study planned a pre-/post-test design but did not report post-test results, leaving the impact on learning outcomes indeterminate (Shu et al., 2023). Collectively, these findings suggest that introducing LLMs without explicit instructional guidance or scaffolding may not lead to immediate performance gains and may, in some cases, coincide with short-term declines.

Out of scope for learner outcomes: Although learner-level evidence is sparse, several studies report system- or staff-oriented findings that inform implementation considerations. These include alignment between automated and human scoring, accuracy in concept detection, and perceived usefulness of LLM-generated feedback or instructional support (Abbas & Atwell, 2025; Dahal et al., 2025; Gao et al., 2025; Mendonça et al., 2025; Meyer et al., 2025; Yee-King & Fiorucci, 2025). While such results support the feasibility and practicality of open-source LLM deployment, they do not constitute direct evidence of changes in learners’ perceptions or achievement and are therefore treated as contextual rather than outcome-focused.

5. Discussion & Future Directions

5.1. From Introduction to Integration

The reviewed studies indicate that open-source LLM research in education remains at an introductory stage, with a small empirical base that signals a field still in formation. Within this evidence base, the significance of open-source LLMs is less visible in differential learning outcomes than in how instructional workflows are configured, monitored, and governed under conditions of local control and transparency. Empirical work has concentrated primarily on post-class applications in higher education, particularly within computer science and programming contexts, where open-source LLMs are deployed for tutoring and guidance (T1/T2) or formative feedback (A2). In contrast, in-class uses and real-time instructional integration remain comparatively underexplored. Even within tutoring-oriented designs, many studies provide limited articulation of teacher involvement frameworks, such as handoff rules, sampling strategies, or mappings between rubrics and operational procedures. As a result, instructional roles are often implied rather than explicitly stated, which constrains the transferability of these designs to other contexts.

At the level of learner experience and immediate instructional interaction, the reviewed studies do not consistently demonstrate that open-source LLMs function differently from proprietary models. In many cases, open-source systems appear to operate as functional substitutes, producing similar patterns of interaction, perceived usefulness, and short-term limitations. Therefore, the educational significance of open-source LLMs emerges less through distinct pedagogical effects than through the institutional and infrastructural conditions they enable, including local deployment, auditability, version control, and alignment with governance and assessment practices. Consequently, any pedagogical consequences of open-source deployment are best understood as indirect and contingent, shaped by how instructors and institutions leverage these infrastructural affordances rather than by model openness alone.

The reviewed literature also varies considerably in methodological robustness. Many studies employ exploratory pilot designs to probe feasibility, usability, or initial learner perceptions, often through short-term implementations and descriptive or correlational analyses. A smaller subset uses more structured quasi-experimental or mixed-methods approaches that attempt to link instructional design decisions with observable learning outcomes. This heterogeneity reflects the early stage of the research field, where exploratory studies appropriately dominate efforts to surface design possibilities, constraints, and risks, while more robust evaluative designs remain comparatively rare.

Most of the included studies are also characterized by single-site, short-term implementations. While such pilot designs are appropriate for early-stage exploration, they limit insight into longer-term knowledge development, skill transfer, and retention, and reduce the extent to which findings can inform sustained instructional practice across settings. Viewed collectively, the current corpus motivates a set of design hypotheses about how teacher-supervised open-source LLMs might be integrated into instructional workflows under conditions of local control. In particular, a small number of studies describe workflow arrangements in which open-source systems offload routine tasks while maintaining human judgment through oversight checkpoints, logging, and review protocols (Wang et al., 2025). These configurations should be read as localized implementation examples, as most evidence comes from short-term, single-site feasibility or system-focused evaluations and does not demonstrate cross-context generalizability. Accordingly, this review maps an emerging design space and associated instructional risks by surfacing candidate mechanisms and constraints, and it frames LLMs as components embedded within human-governed educational systems, not autonomous instructional agents.

It is important to note that the current evidence base is heavily concentrated in computer science and programming-related domains, where instructors’ technical expertise and the alignment between LLM capabilities and disciplinary tasks likely facilitate early adoption. While these contexts provide valuable testbeds for experimentation, this concentration constrains the extent to which findings can be readily generalized to other disciplinary settings. Building on these patterns, future research would benefit from moving beyond post-class deployments in which open-source LLMs function as auxiliary tools attached to academic tasks. Greater attention to classroom-integrated designs, including real-time, small-group, and whole-class interventions, could advance understanding of how human–AI collaboration unfolds during instruction and how teachers manage error detection, recovery, and adaptation in live settings (Skantze, 2021). Although current research priorities have tended to emphasize productivity-oriented applications that support “work” rather than “learning,” several studies point toward exploratory uses in non-computer science and learning-centered contexts (Dahal et al., 2025; Shu et al., 2023). Expanding this line of inquiry across disciplines would help clarify how teachers develop pedagogical frameworks that meaningfully integrate open-source AI into subject-matter instruction.

Methodologically, multi-site and longitudinal studies are needed to examine how instructional designs involving open-source LLMs evolve over time and across institutional contexts. Equally important is greater transparency in reporting implementation practices. Rather than focusing solely on system performance or learner perceptions, future studies could document instructional scripts that specify when teachers intervene, how AI outputs are sampled and reviewed, and how prompts and rubrics are enacted in practice. Such documentation would enhance interpretive transparency and pedagogical reuse, supporting the development of shared design knowledge for integrating open-source LLMs into curricula.

5.2. From Perceptions to Performance

Synthesizing the available evidence related to learner outcomes suggests a recurring tension between perceived benefits and demonstrated performance in the contexts where open-source LLMs have been integrated into educational activities. Across studies, learners frequently report increased motivation, engagement, confidence, satisfaction, and ethical awareness while using these tools. In contrast, objective evidence of learning gains remains sparse, with most studies relying on self-reported data. Because self-report measures are susceptible to response biases and may not reliably index cognitive change, inferences about learning improvement should be interpreted cautiously (Winne, 2020). Notably, the only study employing a controlled experimental design to assess objective outcomes reported lower quiz scores following the introduction of LLM support, which cautions against assuming short-term performance benefits under current designs. Taken together, these findings indicate that, in some instructional contexts, current uses of open-source LLMs may be neutral or even counterproductive for learning in the absence of careful pedagogical design and boundary-setting.

Several explanatory mechanisms may account for this perception–performance gap. One possibility is effort displacement, whereby the availability of LLMs provides shorter paths to task completion that reduce engagement in deeper cognitive processes such as self-explanation and strategic problem solving (Koedinger & Aleven, 2007). Closely related is automation bias, in which heightened confidence and satisfaction with AI assistance diminish learners’ critical scrutiny of model outputs (Li et al., 2024; Pareek et al., 2024). In addition, early positive reactions may reflect a novelty effect, where learners’ enthusiasm is driven by unfamiliarity rather than sustained instructional value, with effects that attenuate as the technology becomes routine (Clark, 1983; Rodrigues et al., 2022).

A further tension arises from task–tool misalignment, particularly when assessment emphasizes explanation, interpretation, or conceptual reasoning. In such cases, educational goals may prioritize sense-making and justification, whereas AI-supported workflows tend to optimize efficiency and output production. When instructional tasks and assessments are not explicitly designed to accommodate or constrain AI use, the presence of LLMs may inadvertently undermine the very competencies being evaluated (Jošt et al., 2024; Zhai et al., 2024).

Compounding these challenges, several studies report limited guidance regarding AI usage protocols, including when learners should engage with AI, how much reliance is appropriate, and how reflection should occur after AI-supported work. Without explicit instructional boundaries, perceived usability and efficiency may not translate into durable understanding. Design strategies such as designated no-AI phases, structured self-explanations, reflective comparisons between human- and AI-generated solutions, and guided review cycles are plausible candidates for mitigating over-reliance and for redirecting attention toward sense-making, although the current corpus provides limited direct tests of these strategies under controlled conditions (Chui et al., 2024; Hochmair, 2025; Y. Lin et al., 2025).

Building on these insights, future evaluations would benefit from incorporating longer-term outcome measures, including delayed retention and transfer assessments, to examine whether perceived gains persist over time. In parallel, greater attention to process-level data, such as prompt trajectories, revision attempts, and interaction patterns, could clarify whether LLMs are supplementing or supplanting learners’ cognitive effort. Aligning assessment designs more closely with learning activities is also critical. For example, when instruction emphasizes code generation or debugging, evaluations should directly measure those competencies rather than rely exclusively on interpretation-focused quizzes.

Finally, the ways in which teachers structure AI use appear central to whether learner perceptions translate into measurable performance gains. A human-first, AI-assisted approach, in which learners articulate their reasoning through drafting or outlining before engaging with AI feedback, may mitigate over-reliance and discourage premature answer acquisition (Dahal et al., 2025; Y. Lin et al., 2025). Limiting AI access during early phases of concept formation can promote deeper processing, while introducing AI tools later supports comparison, refinement, and revision. Encouraging learners to document AI feedback and their responses further supports metacognitive monitoring and helps counter automation bias by sustaining critical engagement with AI-generated outputs (Hochmair, 2025; Poitras et al., 2024).

5.3. From Taxonomy to Orchestration as a Design Goal

The four-role model—Designer, Facilitator, Monitor, and Evaluator—provides an interpretive organizing lens for teacher agency in AI-supported instruction. The reviewed studies do not theorize teacher roles using a shared framework, and the role sequence is not presented here as an inductive taxonomy emerging from explicit analyses of teacher practice. Instead, the model consolidates recurrent, often implicit, descriptions of how teachers configure AI use, scaffold learner interaction, oversee reliability and integrity, and adjudicate outputs against pedagogical standards. Read in this way, the model motivates an orchestration hypothesis about where teacher agency enters AI-supported activity across instructional phases. It also clarifies a future-oriented design goal: reproducibility becomes more attainable when orchestration parameters such as prompts, model versions, retrieval sources, and review thresholds are documented and governed, although such features are unevenly reported and not consistently demonstrated as stable routines across contexts in the current corpus.

This role progression aligns with prior research on classroom orchestration and human–AI collaboration, which emphasizes teachers’ responsibility for coordinating learning activities while managing technological supports (Dillenbourg, 2013; Holstein et al., 2019a). Across contexts, teachers adapt their roles to balance competing demands such as efficiency and consistency on the one hand, and learner agency and interpretive judgment on the other. In this sense, the reviewed implementations illustrate how open-source LLMs can be treated as governed components of instructional workflows, with teachers coordinating when AI is used, what is monitored, and how outputs are adjudicated against instructional standards.

Notably, however, the reviewed evidence remains concentrated at the level of individual courses, instructors, or small instructional teams. Empirical illustrations of institution-wide adoption, including governance structures, policy enactment, infrastructure integration, and cross-unit coordination, are largely absent from the current literature. This gap underscores the early stage at which open-source LLM use in education is being explored, with most studies prioritizing local feasibility and instructional experimentation while institutional implementation remains largely unexamined.

Furthermore, the synthesis highlights persistent structural tensions that accompany such orchestration. These include trade-offs between rapid feedback and deep learning, transparency and maintenance cost, open exploration and academic integrity, local deployment and team capacity, and role clarity and reproducibility. Similar tensions have been identified in prior work on automation bias and complacency in human–machine systems, particularly when protocols are specified in principle but inconsistently enacted in practice (Parasuraman & Manzey, 2010; Pareek et al., 2024). Across the reviewed studies, risks emerged when responsibilities were not explicitly assigned, when prompts and task conditions were insufficiently differentiated, or when the timing and sequencing of teacher interventions were left implicit. These gaps limit the extent to which orchestration practices can be reliably transferred or audited across contexts.

Building on these insights, a priority for future research is to test whether the role sequence can be implemented as auditable and portable orchestration routines. Studies could report explicit operational protocols for facilitation and monitoring, specifying who sets performance thresholds, what data is logged, how outputs are sampled for review, and when human adjudication is required. Comparative and multi-site work could then examine whether common parameterizations emerge for approval thresholds, quality-check frequencies, and logging practices, and under what instructional and institutional conditions these parameterizations remain pedagogically viable. Formal procedures for human intervention and escalation when system outputs deviate from pedagogical or ethical expectations also warrant empirical attention, as these procedures shape both accountability and learning conditions in practice.

At this stage of the literature, the review does not warrant prescriptive design rules or transferable decision algorithms for “responsible integration.” Instead, its practical contribution is to surface orchestration dimensions that recur across early implementations and to clarify the trade-offs they expose, including efficiency versus deep processing and openness versus maintenance cost, which require targeted empirical testing across contexts before stable design principles can be claimed.

A Minimal Orchestration Specification for Early Implementations

To make the above decision points more concrete, we offer a minimal orchestration specification that translates the four-role lens into a short set of implementation details that can be stated explicitly and compared across studies. The aim is pragmatic documentation, not prescription, to support future tests of which parameterizations matter across contexts.

Designer decisions (configuration): What model and version are used; what retrieval sources are permitted; what prompt templates are provided; what outputs are disallowed (e.g., direct answers); what data are stored and for how long.

Facilitator decisions (timing and boundaries): When AI use is permitted (phase of learning, in-class versus post-class); when AI use is restricted (no-AI phases); what students must do before consulting AI (e.g., attempt, outline, self-explanation); what students must produce after AI use (e.g., justification, comparison, revision rationale).

Monitor decisions (quality control): What is logged (prompts, outputs, sources, model parameters); how outputs are sampled for review; what triggers escalation (hallucination indicators, rule violations, integrity flags); what reliability checks are applied (spot checks, agreement checks, threshold rules).

Evaluator decisions (adjudication): What criteria determine adoption, revision, or rejection of AI outputs; who has final authority; how disputes or edge cases are resolved; how decisions feed back into prompt and rubric updates.

This specification functions as a shared reporting scaffold. It does not guarantee learning benefits, but it makes orchestration choices explicit, supports accountability, and enables more precise cross-study comparison in a field currently dominated by localized pilots. We recommend reporting it alongside basic study descriptors (course, task, assessment, duration) so later work can link orchestration parameterizations to learner outcomes.

Reproducibility is best approached as a forward-looking design goal that open-source deployments can make more attainable, provided there is sufficient governance capacity, practical logging infrastructure, and well-documented orchestration parameters. Open-source LLMs can foster more transparent instructional decision-making when model versions, prompts, retrieval sources, and review thresholds are recorded and open to inspection. However, the current corpus offers only early illustrations of such features and does not demonstrate reproducible orchestration routines across institutions, nor does it show that efficiency gains reliably lead to learning gains. Fully realizing the potential of local control for auditability and adaptation requires sustained empirical work that examines governance capacity, maintenance costs, and the consequences of orchestration choices for learning, equity, and accountability.

6. Conclusions

This paper presented a state-of-the-art narrative review of human-empirical research on open-source LLMs in education, synthesizing evidence on educational use, learner outcomes, and teacher-led human–AI collaboration. Current studies are concentrated in higher-education computer science contexts, with open-source LLMs most often applied to post-class tutoring, guidance, and formative assessment. Learners commonly report increased motivation, confidence, and perceived usefulness, while evidence of objective learning gains remains limited and, in some cases, short-term performance declines are observed. These findings highlight a gap between perceived support and demonstrable learning outcomes, showing the importance of instructional designs that translate process-level efficiencies into sustained learning.

A key contribution of this review is the articulation of a four-role teacher framework—Designer, Facilitator, Monitor, and Evaluator—that positions open-source LLMs as governed collaborators within instructional design. Teachers enact these roles across instructional phases to shape AI configuration, guide learning activity, ensure reliability and integrity, and retain final evaluative authority. Future research would benefit from in-class, multi-site, and longitudinal designs, alongside transparent reporting of orchestration routines and educator scripts, to support pedagogical reuse and meaningful integration of open-source LLMs across educational contexts.

As a narrative, state-of-the-art review, this synthesis prioritizes conceptual integration and interpretive insight over exhaustive coverage or procedural replicability. While this approach supports theory building and sense-making in an emerging research area, it also entails inherent limitations, including reliance on author judgment in study selection and synthesis. Accordingly, the patterns and frameworks proposed here should be understood as analytically grounded but provisional, intended to inform ongoing empirical inquiry rather than to offer definitive or comprehensive claims about the field.

Author Contributions

Conceptualization: M.P.-C.L. and J.R.; Investigation: M.P.-C.L. and J.R.; methodology: M.P.-C.L., D.H.C. and J.R.; Visualization: J.-Y.H.; Formal analysis: J.-Y.H.; Writing—Original Draft Preparation: M.P.-C.L. and J.R.; Writing—review and editing: M.P.-C.L., J.-Y.H., D.H.C., G.T., G.M.B., E.P., V.J. and J.R.; Supervision: M.P.-C.L. and J.R.; Funding acquisition: M.P.-C.L., E.P., and D.H.C.; Validation: M.P.-C.L. and J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Social Sciences and Humanities Research Council of Canada (SSHRC), grant number 430-2024-00269 (PI-Michael Lin).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were generated or analyzed in this study. All information synthesized in this narrative review was derived from peer-reviewed publications accessed through scholarly databases, including ACM Digital Library, IEEE Xplore, EBSCOhost, Web of Science, and Scopus.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Abbas, N., & Atwell, E. (2025). Cognitive computing with large language models for student assessment feedback. Big Data and Cognitive Computing, 9(5), 112. [Google Scholar] [CrossRef]
Abdallah, N., Katmah, R., Khalaf, K., & Jelinek, H. F. (2025). Systematic review of ChatGPT in higher education: Navigating impact on learning, wellbeing, and collaboration. Social Sciences & Humanities Open, 12, 101866. [Google Scholar] [CrossRef]
Albadarin, Y., Saqr, M., Pope, N., & Tukiainen, M. (2024). A systematic literature review of empirical research on ChatGPT in education. Discover Education, 3(1), 60. [Google Scholar] [CrossRef]
Attali, Y., & Burstein, J. (2004). Automated essay scoring with e-rater^® v.2.0. Journal of Technology, Learning, and Assessment, 2004(2), i-21. [Google Scholar] [CrossRef]
Bauer, E., Greiff, S., Graesser, A. C., Scheiter, K., & Sailer, M. (2025). Looking beyond the hype: Understanding the effects of AI on learning. Educational Psychology Review, 37(2), 45. [Google Scholar] [CrossRef]
Bond, M., Khosravi, H., De Laat, M., Bergdahl, N., Negrea, V., Oxley, E., Pham, P., Chong, S. W., & Siemens, G. (2024). A meta systematic review of artificial intelligence in higher education: A call for increased ethics, collaboration, and rigour. International Journal of Educational Technology in Higher Education, 21(1), 4. [Google Scholar] [CrossRef]
Chui, C. K., Yang, L., & Kao, B. (2024). Empowering students in emerging technology: A framework for developing hands-on competency in generative AI with ethical considerations. In 2024 ASEE annual conference & exposition. American Society for Engineering Education (ASEE). [Google Scholar]
Clark, R. E. (1983). Reconsidering research on learning from media. Review of Educational Research, 53(4), 445–459. [Google Scholar] [CrossRef]
Dahal, R., Murray, G., Chataut, R., Hefeida, M., Srivastava, A., & Gyawali, P. (2025). AutoTA: A dynamic intent-based virtual teaching assistant for students using open source LLMs. IEEE Access, 13, 118122–118134. [Google Scholar] [CrossRef]
Demiris, G., Oliver, D. P., & Washington, K. T. (2018). Behavioral intervention research in hospice and palliative care: Building an evidence base. Academic Press. [Google Scholar]
Deng, R., Jiang, M., Yu, X., Lu, Y., & Liu, S. (2024). Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Computers & Education, 227, 105224. [Google Scholar] [CrossRef]
Dillenbourg, P. (2013). Design for classroom orchestration. Computers & Education, 69, 485–492. [Google Scholar] [CrossRef]
Ferrari, R. (2015). Writing narrative style literature reviews. Medical Writing, 24(4), 230–235. [Google Scholar] [CrossRef]
Gao, X., Karumbaiah, S., Dalal, A., Dey, I., Gnesdilow, D., & Puntambekar, S. (2025). A comparative analysis of LLM and specialized NLP system for automated assessment of science content. In 26th international conference on artificial intelligence in education (AIED 2025) (Vol. 15882, pp. 76–82). Springer Nature. [Google Scholar] [CrossRef]
Hochmair, H. H. (2025). Use and effectiveness of chatbots as support tools in GIS programming course assignments. ISPRS International Journal of Geo-Information, 14(4), 156. [Google Scholar] [CrossRef]
Holstein, K., & Aleven, V. (2022). Designing for human–AI complementarity in K-12 education. AI Magazine, 43(2), 239–248. [Google Scholar] [CrossRef]
Holstein, K., Aleven, V., & Rummel, N. (2020). A conceptual framework for human–AI hybrid adaptivity in education. In G. Biswas, T. Barnes, & H. Baker (Eds.), Artificial intelligence in education (Vol. 12163, pp. 294–307). Springer. [Google Scholar] [CrossRef]
Holstein, K., McLaren, B. M., & Aleven, V. (2019a). Co-designing a real-time classroom orchestration tool to support teacher–AI complementarity. Journal of Learning Analytics, 6(2), 27–52. [Google Scholar] [CrossRef]
Holstein, K., McLaren, B. M., & Aleven, V. (2019b). Designing for complementarity: Teacher and student needs for orchestration support in AI-enhanced classrooms. In Artificial intelligence in education: 20th international conference, AIED 2019, proceedings, part I. Springer. [Google Scholar] [CrossRef]
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations, 1(2022), 3. [Google Scholar]
Hussain, Z., Binz, M., Mata, R., & Wulff, D. U. (2024). A tutorial on open-source large language models for behavioral science. Behavior Research Methods, 56, 8214–8237. [Google Scholar] [CrossRef]
Jošt, G., Taneski, V., & Karakatič, S. (2024). The impact of large language models on programming education and student learning outcomes. Applied Sciences, 14(10), 4115. [Google Scholar] [CrossRef]
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., & Krusche, S. (2023). ChatGPT for good? On opportunities and challenges of large language models. Learning and Individual Differences, 103, 102274. [Google Scholar] [CrossRef]
Koedinger, K. R., & Aleven, V. (2007). Exploring the assistance dilemma in experiments with cognitive tutors. Educational Psychology Review, 19(3), 239–264. [Google Scholar] [CrossRef]
Lawrence, L., Echeverria, V., Yang, K., Aleven, V., & Rummel, N. (2023). How teachers conceptualise shared control with an AI co-orchestration tool: A multiyear teacher-centred design process. British Journal of Educational Technology, 55(3), 823–844. [Google Scholar] [CrossRef]
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NEURIPS 2020), 33, 9459–9474. Available online: https://dl.acm.org/doi/abs/10.5555/3495724.3496517 (accessed on 16 December 2025).
Li, Y., Wu, B., Huang, Y., & Luan, S. (2024). Developing trustworthy artificial intelligence: Insights from research on interpersonal, human-automation, and human-AI trust. Frontiers in Psychology, 15, 1382693. [Google Scholar] [CrossRef] [PubMed]
Lim, T., Gottipati, S., & Cheong, M. (2025). What students really think: Unpacking AI ethics in educational assessments through a triadic framework. International Journal of Educational Technology in Higher Education, 22(1), 56. [Google Scholar] [CrossRef]
Lin, M. P.-C., Chang, D., Hall, S., & Jhajj, G. (2024). Preliminary systematic review of open-source large language models in education. In A. Sifaleras, & F. Lin (Eds.), Generative intelligence and intelligent tutoring systems (Vol. 14798). Springer. [Google Scholar] [CrossRef]
Lin, Y., Khan, M. F. F., & Sakamura, K. (2025). Athena: A GenAI-powered programming tutor based on open-source LLM. In 2025 1st international conference on consumer technology (ICCT-PACIFIC) (pp. 1–4). IEEE. [Google Scholar] [CrossRef]
Lucas, H. C., Upperman, J. S., & Robinson, J. R. (2024). A systematic review of large language models and their applications in medical education. Medical Education, 58(11), 1276–1285. [Google Scholar] [CrossRef]
Ma, W., Adesope, O. O., Nesbit, J. C., & Liu, Q. (2014). Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4), 901–918. [Google Scholar] [CrossRef]
Machado, J. (2025). Toward a public and secure generative AI: A comparative analysis of open and closed llms. arXiv, arXiv:2505.10603. [Google Scholar] [CrossRef]
Mai, D. T. T., Da, C. V., & Hanh, N. V. (2024). The use of ChatGPT in teaching and learning: A systematic review through SWOT analysis approach. Frontiers in Education, 9, 1328769. [Google Scholar] [CrossRef]
Mendonça, P. C., Quintal, F., & Mendonça, F. (2025). Evaluating LLMs for automated scoring in formative assessments. Applied Sciences, 15(5), 2787. [Google Scholar] [CrossRef]
Meyer, A., Bleckmann, T., & Friege, G. (2025). Automatic feedback on physics tasks using open-source generative artificial intelligence. International Journal of Science Education, 1–26. [Google Scholar] [CrossRef]
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019, January 29–31). Model cards for model reporting. Conference on fairness, accountability, and transparency (pp. 220–229), Atlanta, GA, USA. [Google Scholar]
Nye, B. D., Graesser, A. C., & Hu, X. (2014). AutoTutor and family: A review of 17 years of natural language tutoring. International Journal of Artificial Intelligence in Education, 24(4), 427–469. [Google Scholar] [CrossRef]
Pan, G., & Wang, H. (2025). A cost-benefit analysis of on-premise large language model deployment: Breaking even with commercial llm services. arXiv, arXiv:2509.18101. [Google Scholar]
Parasuraman, R., & Manzey, D. H. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381–410. [Google Scholar] [CrossRef]
Pareek, S., van Berkel, N., Velloso, E., & Goncalves, J. (2024). Effect of explanation conceptualisations on trust in AI-assisted credibility assessment. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW2), 1–31. [Google Scholar] [CrossRef]
Poitras, E., Crane, B. G. C., Dempsey, D., Bragg, T. A., Siegel, A. A., & Lin, M. P.-C. (2024). Cognitive apprenticeship and artificial intelligence coding assistants. In Navigating computer science education in the 21st century (pp. 261–281). IGI Global Scientific Publishing. [Google Scholar]
Rodrigues, L., Pereira, F. D., Toda, A. M., Palomino, P. T., Pessoa, M., Carvalho, L. S. G., Fernandes, D., Oliveira, E. H., Cristea, A. I., & Isotani, S. (2022). Gamification suffers from the novelty effect but benefits from the familiarization effect: Findings from a longitudinal study. International Journal of Educational Technology in Higher Education, 19(1), 1–25. [Google Scholar] [CrossRef]
Shashidhar, S., Chinta, A., Sahai, V., Wang, Z., & Ji, H. (2023). Democratizing LLMs: An exploration of cost-performance trade-offs in self-refined open-source models. In Findings of the association for computational linguistics: EMNLP 2023 (pp. 9070–9084). Association for Computational Linguistics. [Google Scholar] [CrossRef]
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge. [Google Scholar] [CrossRef]
Shu, Z., Zhang, J., & Li, Z. (2023). Design of pedagogical agent based on open-source large language model in online learning. In 2023 twelfth international conference of educational innovation through technology (EITT) (pp. 71–74). IEEE. [Google Scholar] [CrossRef]
Siemens, G. (2013). Learning analytics: The emergence of a discipline. American Behavioral Scientist, 57(10), 1380–1400. [Google Scholar] [CrossRef]
Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, 67, 101178. [Google Scholar]
Snyder, H. (2019). Literature review as a research methodology: An overview and guidelines. Journal of Business Research, 104, 333–339. [Google Scholar] [CrossRef]
Steenbergen-Hu, S., & Cooper, H. (2014). A meta-analysis of the effectiveness of intelligent tutoring systems on college students’ academic learning. Journal of Educational Psychology, 106(2), 331–347. [Google Scholar] [CrossRef]
Sukhera, J. (2022). Narrative reviews: Flexible, rigorous, and practical. Journal of Graduate Medical Education, 14(4), 414–417. [Google Scholar] [CrossRef] [PubMed]
VanLehn, K. (2011). The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4), 197–221. [Google Scholar] [CrossRef]
Wang, X., Niu, J., Fang, B., Han, G., & He, J. (2025). Empowering teachers’ professional development with LLMs: An empirical study of developing teachers’ competency for instructional design in blended learning. Teaching and Teacher Education, 165, 105091. [Google Scholar] [CrossRef]
Winne, P. H. (2020). A proposed remedy for grievances about self-report methodologies. Frontline Learning Research, 8(3), 164–173. [Google Scholar] [CrossRef]
Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2023). Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55(1), 90–112. [Google Scholar] [CrossRef]
Yee-King, M., & Fiorucci, A. (2025). Deploying language model-based assessment support technology in a computer science degree: How do the academics feel about it? In 2025 IEEE global engineering education conference (EDUCON) (pp. 1–8). IEEE. [Google Scholar] [CrossRef]
Zhai, C., Wibowo, S., & Li, L. D. (2024). The effects of over-reliance on AI dialogue systems on students’ cognitive abilities: A systematic review. Smart Learning Environments, 11(1), 28. [Google Scholar] [CrossRef]

Figure 1. LLM Use Cases Across Instructional Stages.

Figure 2. Distribution of Outcome Types in Reviewed Studies.

Table 1. Teacher roles across studies.

Study	Designers	Facilitators	Monitors	Evaluators
(Shu et al., 2023)	✓	–	–	–
(Y. Lin et al., 2025)	✓	–	–	–
(Dahal et al., 2025)	✓	✓	✓	✓
(Chui et al., 2024)	✓	✓	✓	–
(Abbas & Atwell, 2025)	✓	–	✓	✓
(Hochmair, 2025)	✓	✓	–	–
(Mendonça et al., 2025)	✓	–	✓	✓
(Gao et al., 2025)	✓	–	✓	–
(Yee-King & Fiorucci, 2025)	✓	–	✓	✓
(Meyer et al., 2025)	✓	–	✓	✓

Table 2. LLM application types in education.

LLM Task Type		Ref.
Tutoring & Guidance	T1—AI Tutor/Teaching Assistant	(Dahal et al., 2025; Y. Lin et al., 2025; Shu et al., 2023)
Tutoring & Guidance	T2—General-purpose Chatbot for Assignments	(Hochmair, 2025)
Automated Assessment	A1—Automated Scoring/Grading	(Mendonça et al., 2025)
Automated Assessment	A2—Formative Feedback (non-grading)	(Abbas & Atwell, 2025; Gao et al., 2025; Meyer et al., 2025)
Instructional Content Preparation		(Yee-King & Fiorucci, 2025)
Not Applicable		(Chui et al., 2024)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lin, M.P.-C.; Huang, J.-Y.; Chang, D.H.; Tembrevilla, G.; Bowen, G.M.; Poitras, E.; Janarthanan, V.; Ryoo, J. Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes. AI Educ. 2026, 2, 4. https://doi.org/10.3390/aieduc2010004

AMA Style

Lin MP-C, Huang J-Y, Chang DH, Tembrevilla G, Bowen GM, Poitras E, Janarthanan V, Ryoo J. Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes. AI in Education. 2026; 2(1):4. https://doi.org/10.3390/aieduc2010004

Chicago/Turabian Style

Lin, Michael Pin-Chuan, Jing-Yuan Huang, Daniel H. Chang, Gerald Tembrevilla, G. Michael Bowen, Eric Poitras, Vasudevan Janarthanan, and Jeeho Ryoo. 2026. "Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes" AI in Education 2, no. 1: 4. https://doi.org/10.3390/aieduc2010004

APA Style

Lin, M. P.-C., Huang, J.-Y., Chang, D. H., Tembrevilla, G., Bowen, G. M., Poitras, E., Janarthanan, V., & Ryoo, J. (2026). Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes. AI in Education, 2(1), 4. https://doi.org/10.3390/aieduc2010004

Article Menu

Open-Source Large Language Models in Education: A Narrative Review of Evidence, Pedagogical Roles, and Learning Outcomes

Abstract

1. Introduction

1.1. Motivation

1.2. Research Gap

1.3. Purpose and Guiding Questions

2. Background

2.1. Prior Uses of AI/LLMs in Education

2.2. Open-Source vs. Closed-Source LLMs

2.3. Existing Reviews and Contribution of This Review

3. Approach to the Narrative Review

3.1. Literature Search Strategy

3.2. Identification and Refinement of Relevant Studies

3.3. Analytic Orientation

4. Results

4.1. Open-Source Deployment Mechanisms & Reporting Density

4.2. Human–AI Collaboration and Instructional Roles

4.3. Educational Use & Impact

4.4. Learning Outcomes & Evidence

5. Discussion & Future Directions

5.1. From Introduction to Integration

5.2. From Perceptions to Performance

5.3. From Taxonomy to Orchestration as a Design Goal

A Minimal Orchestration Specification for Early Implementations

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI