EduMSRA: A Multi-Source Educational Research Agent Integrating Retrieval-Augmented Generation and Model Context Protocol for Adaptive Intelligent Tutoring Systems

Ho, Thi-Linh; Lam, Thanh-Phong

doi:10.3390/app16094400


by Thi-Linh Ho ¹,* and Thanh-Phong Lam ²

¹ Natural Language Processing and Knowledge Discovery Research Group, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam
² Faculty of Management Information Systems, Ho Chi Minh University of Banking (HUB), Ho Chi Minh 700000, Vietnam
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4400; https://doi.org/10.3390/app16094400

Submission received: 29 March 2026 / Revised: 27 April 2026 / Accepted: 28 April 2026 / Published: 30 April 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

The integration of Artificial Intelligence into educational systems has accelerated dramatically with the advent of Large Language Models (LLMs). However, two critical limitations constrain current AI-powered tutoring systems: LLMs hallucinate factually incorrect content in high-stakes pedagogical contexts, and existing systems lack standardized mechanisms to dynamically access and synthesize knowledge from heterogeneous educational sources, including learning management systems, open-access textbook repositories, assessment databases, and real-time educational APIs. This paper presents a systematic survey of the convergence of Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP) in educational AI applications. Based on our taxonomy, we identify a critical architectural gap: no current system simultaneously achieves multi-source curriculum retrieval, standardized tool orchestration, learner-adaptive personalization, and citation-aware generation within a unified framework. To address this, we propose EduMSRA (Educational Multi-Source Research Agent), a novel architecture comprising a Hierarchical Educational RAG Pipeline, an MCP-based Curriculum Tool Orchestration Layer, a Conflict-Aware Fusion Module (CAFM), a Learner Profile Manager (LPM), and a Pedagogical Policy Agent (PPA) aligned with Bloom’s taxonomy. We further provide a comprehensive experimental design road map specifying nine publicly available benchmark datasets and four evaluation experiments. Additionally, we conduct three Bayesian empirical analyses: (1) a random-effects meta-analysis of 12 RAG studies indicating a positive effect direction ($\hat{\mu} = 0.511$, 95% HDI: $[0.250, 0.790]$, $I^{2} = 99.3\%$; heterogeneity flagged as indicative), (2) a BKT simulation illustrating adaptive scaffolding dynamics across five learner profiles, and (3) a Beta-Binomial difficulty characterization of nine benchmark datasets. Our analysis demonstrates that EduMSRA offers a principled, scalable path toward adaptive, grounded, and pedagogically aligned AI tutoring agents.

Keywords:

AI Agent; Retrieval-Augmented Generation (RAG); Model Context Protocol (MCP); Intelligent Tutoring System (ITS); adaptive learning; multi-source retrieval; Large Language Model; educational AI; personalized learning

1. Introduction

The integration of Artificial Intelligence (AI) into educational systems has undergone a fundamental paradigm shift over the past four decades. From rule-based Intelligent Tutoring Systems (ITSs) that pursued the “two-sigma” advantage of one-on-one tutoring over conventional classroom instruction [1], the field has progressed through expert systems and Bayesian Knowledge Tracing [2,3] to modern Large Language Model (LLM)-powered agents capable of natural dialogue, adaptive explanation, and real-time assessment support [4,5]. Systems such as Khan Academy’s Khanmigo, Duolingo’s AI tutor, and university-deployed chatbots demonstrate the growing deployment of conversational AI across K-12 and higher education [6,7].

However, despite these advances, current LLM-based educational systems suffer from four critical limitations. First, LLMs hallucinate factually incorrect content in high-stakes pedagogical contexts, with studies documenting hallucination as a persistent challenge across educational benchmarks [8,9,10]. Second, existing systems rely on static knowledge bases or single-source retrieval, severely limiting their ability to synthesize across multiple curriculum units, textbook editions, or external knowledge sources [11]. Third, no standardized mechanism exists for LLM agents to dynamically access heterogeneous educational data sources, including learning management systems, open-access textbook repositories, assessment databases, and real-time educational APIs [12,13]. Fourth, stateless interactions cannot model long-term student progress or align responses with established pedagogical frameworks such as Bloom’s taxonomy [14,15].

Two parallel innovations have emerged to address these limitations at the technical level. Retrieval-Augmented Generation (RAG) [16], originally introduced for knowledge-intensive NLP tasks, has been progressively adapted for educational contexts. Zhao et al. [17] surveyed 51 RAG-in-education studies, finding that RAG significantly reduces hallucination rates (by 23–41% across benchmarks) and enables dynamic knowledge updates, critical for education systems where course content changes each semester. The evolution from static document RAG through adaptive RAG [18,19] to agentic RAG [20] introduces autonomous retrieval agents capable of multi-hop reasoning, sub-query decomposition, and iterative response refinement [21,22].

The Model Context Protocol (MCP), released by Anthropic in November 2024 [12], provides an open, standardized interface for LLM agents to connect to external data sources and tools. The landmark study by Modran et al. [6] demonstrated that integrating RAG with MCP and the Agent Communication Protocol (ACP) for STEM tutoring resulted in statistically significant improvements in student engagement and knowledge retention. However, their architecture was limited to two course corpora and session-level context, without learner profile persistence or cross-source conflict resolution. Comprehensive MCP benchmarks [23,24,25] and security analyses [26,27] have since established the protocol’s capabilities and limitations, yet its application to educational AI remains underexplored.

Three concrete scenarios in which simpler RAG tutors fail motivate the EduMSRA architecture: (i) multi-source curriculum queries that require aligning textbook chapters, learning-management-system records, and assessment databases (handled by the hierarchical retrieval of the Hierarchical Educational RAG Pipeline, HERAP); (ii) personalized scaffolding across heterogeneous learner ability profiles, where a one-size-fits-all answer is pedagogically harmful (handled jointly by the knowledge tracing of the Learner Profile Manager, LPM, and the Bloom-aligned policy of the Pedagogical Policy Agent, PPA); and (iii) queries with contradictory authoritative sources (e.g., different textbook editions, conflicting reference works) that require explicit conflict reconciliation rather than silent re-ranking (handled by the Conflict-Aware Fusion Module, CAFM).

Based on our systematic review, we identify a critical architectural gap: no existing system simultaneously achieves (1) multi-source curriculum retrieval, (2) standardized tool orchestration via open protocols, (3) learner-adaptive personalization with knowledge tracing, and (4) pedagogical alignment with established learning theories. To address this gap, we formulate three research questions:

RQ1: What are the current architectures, capabilities, and limitations of RAG-based educational AI systems, and how have they evolved from static document retrieval to agentic paradigms?
RQ2: How can the Model Context Protocol (MCP) be leveraged to standardize tool orchestration and data access in educational AI, and what are the associated security and compliance requirements?
RQ3: What architectural design is necessary and sufficient to simultaneously achieve multi-source retrieval, learner-adaptive personalization, pedagogical alignment, and citation-aware generation within a unified educational AI framework?

This paper extends these foundations by presenting EduMSRA (Educational Multi-Source Research Agent), a novel architecture specifically designed for the full educational use case spectrum from K-12 science question answering to personalized graduate-level research support. The main contributions are:

A systematic taxonomy of RAG architectures in educational contexts, revealing the progression from static document chatbots to agentic educational assistants.
A comprehensive review of MCP adoption in educational AI, including analysis of MCP benchmarks, security threats, and the mapping of MCP primitives to educational use cases.
A comparative analysis of seven existing architectures across five educational capability dimensions, identifying the critical gap that motivates EduMSRA.
The EduMSRA architecture—a novel framework with five specialized components: Hierarchical Educational RAG Pipeline (HERAP), MCP-based Curriculum Tool Orchestration Layer (CTOL), Conflict-Aware Fusion Module (CAFM), Learner Profile Manager (LPM), and Pedagogical Policy Agent (PPA) aligned with Bloom’s taxonomy.
A curated experimental road map specifying nine published benchmark datasets and four targeted evaluation experiments with baselines and metrics.
Three Bayesian literature-based evidence syntheses supporting EduMSRA’s design rationale (not claiming direct empirical validation of the EduMSRA pipeline): a random-effects meta-analysis of published RAG effects ($\hat{\mu} = 0.511$, 95% HDI $[0.250, 0.790]$, $I^{2} = 99.3\%$; heterogeneity flagged as indicative only), a BKT simulation of scaffolding dynamics, and a Beta-Binomial characterization of benchmark difficulty priors.
A proof-of-concept (PoC) implementation that exercises all five components on a toy corpus, produces concrete per-module latency numbers (HERAP 0.236 ms, CTOL 0.006 ms, CAFM 1.406 ms, LPM 0.003 ms, PPA 0.014 ms; mean total 1.677 ms over 30 queries on free-tier Kaggle CPU) and BKT trajectories across five skills. The PoC demonstrates the pipeline runs end-to-end and reveals CAFM as the dominant latency contributor (83.8%), a concrete optimization target for future work. The prototype also ships with 58 pytest unit tests (nine for HERAP, nine for CTOL, seven for CAFM, 10 for LPM, 13 for PPA, six for the orchestrator, four for fixture integrity; full run 0.14 s).
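The latency breakdown quoted above can be checked with a few lines of arithmetic. The sketch below uses only the reported per-module means and the reported end-to-end mean; reading the small residual as orchestration overhead is our assumption, not something the measurements state.

```python
# Per-module mean latencies (ms) as reported for the proof-of-concept.
latency_ms = {
    "HERAP": 0.236,
    "CTOL": 0.006,
    "CAFM": 1.406,
    "LPM": 0.003,
    "PPA": 0.014,
}
reported_total_ms = 1.677  # reported mean end-to-end total over 30 queries

# Sum of the five module means.
module_sum_ms = sum(latency_ms.values())

# Share of each module relative to the reported end-to-end mean.
share_pct = {
    name: 100.0 * ms / reported_total_ms for name, ms in latency_ms.items()
}

# Residual not attributed to any module (ASSUMPTION: orchestration overhead).
overhead_ms = reported_total_ms - module_sum_ms

# Module with the largest mean latency.
dominant = max(latency_ms, key=latency_ms.get)

for name, pct in sorted(share_pct.items(), key=lambda kv: -kv[1]):
    print(f"{name:5s} {latency_ms[name]:.3f} ms  {pct:5.1f}%")
print(f"unattributed residual: {overhead_ms:.3f} ms")
```

Dividing by the reported total rather than by the module sum is what reproduces the quoted 83.8% share for CAFM.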

The remainder of this paper is organized as follows. Section 2 provides background on ITS, RAG, and MCP, followed by a literature review of RAG architectures in education, a review of MCP in educational AI, and a comparative analysis. Section 3 details the EduMSRA architecture, the literature-based Bayesian evidence synthesis (Section 3.2), and the proof-of-concept implementation (Section 3.3). Section 4 discusses implications and limitations, and Section 5 concludes with future directions.

2. Materials and Methods

Part I—Survey of RAG and MCP in Educational AI. This section presents the background on Intelligent Tutoring Systems, Retrieval-Augmented Generation, and the Model Context Protocol, together with a systematic literature review and a comparative analysis of existing architectures. The novel architectural proposal and proof-of-concept implementation are presented in Part II (Section 3).

2.1. Intelligent Tutoring Systems: Evolution and Limitations

Intelligent Tutoring Systems (ITSs) emerged in the 1970s–1980s as computer-based learning environments providing individualized instruction and feedback without human intervention. Bloom’s seminal work [1] demonstrated the “two-sigma problem”: students receiving one-on-one tutoring performed two standard deviations above conventionally taught students, establishing the gold standard that automated systems aspire to match. Classic ITS architectures comprise four components: a domain model (knowledge to be taught), a student model (learner’s current knowledge state), a pedagogical model (instructional strategies), and an interface [3].

Systems such as Cognitive Tutor (mathematics), AutoTutor (physics), and ANDES (Newtonian physics) demonstrated measurable learning gains, with VanLehn’s meta-analysis [3] finding that ITSs achieved effect sizes of $d = 0.76$, approaching human tutoring’s $d = 0.79$. These systems drew upon established pedagogical theories: Bloom’s taxonomy [14,15] for classifying cognitive objectives across six levels (Remember through Create), and Vygotsky’s Zone of Proximal Development (ZPD) [28] for calibrating scaffolding intensity based on the gap between independent capability and guided potential. Bayesian Knowledge Tracing (BKT), introduced by Corbett and Anderson [2], became the dominant student modeling approach, estimating mastery probabilities $P(L_{n})$ through four parameters: prior knowledge $P(L_{0})$, learning rate $P(T)$, slip probability $P(S)$, and guess probability $P(G)$. Recent work has extended BKT to LLM-based dialogue systems [29], while graph-based approaches [30] map pedagogical theories directly to knowledge tracing architectures.
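To make the role of the four BKT parameters concrete, the following minimal sketch implements the standard BKT update: a Bayesian posterior on mastery given one observed response, followed by the learning transition. The class and function names and the parameter values are illustrative assumptions and are not taken from the EduMSRA prototype.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BKTParams:
    p_l0: float  # prior knowledge P(L_0)
    p_t: float   # learning rate P(T)
    p_s: float   # slip probability P(S)
    p_g: float   # guess probability P(G)


def bkt_update(p_l: float, correct: bool, params: BKTParams) -> float:
    """Return P(L_{n+1}) given P(L_n) and one observed response.

    Assumes slip and guess are strictly between 0 and 1, so the
    denominators below are never zero.
    """
    if correct:
        num = p_l * (1.0 - params.p_s)
        den = num + (1.0 - p_l) * params.p_g
    else:
        num = p_l * params.p_s
        den = num + (1.0 - p_l) * (1.0 - params.p_g)
    posterior = num / den  # P(L_n | observation)
    # Learning transition: an unmastered skill becomes mastered with P(T).
    return posterior + (1.0 - posterior) * params.p_t


def trace(responses: Iterable[bool], params: BKTParams) -> List[float]:
    """Mastery estimates after each response, starting from P(L_0)."""
    p = params.p_l0
    history = []
    for correct in responses:
        p = bkt_update(p, correct, params)
        history.append(p)
    return history


# Illustrative parameter values (hypothetical, not fitted to any dataset).
params = BKTParams(p_l0=0.2, p_t=0.1, p_s=0.1, p_g=0.2)
trajectory = trace([True, True, False, True], params)
```

A correct answer raises the mastery estimate and an incorrect one lowers it, but because of the slip and guess terms a single response never drives the estimate to 0 or 1.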

However, traditional ITSs were constrained by: (1) manually encoded knowledge bases requiring domain experts to author hundreds of production rules; (2) brittle natural language understanding (NLU) components limited to keyword matching or shallow parsing; and (3) fixed pedagogical strategies that could not adapt to diverse learning contexts [31]. Modern LLM-based ITSs address NLU limitations through transformer architectures [32] and instruction-following capabilities [33] but introduce new challenges: hallucination in subject-matter responses [8], inability to access current institution-specific curriculum materials, lack of structured pedagogical alignment with learning objectives [14], and stateless interactions that cannot model long-term student progress [4,11]. EduMSRA directly addresses all four limitations through its five architectural components.

2.2. Retrieval-Augmented Generation in Educational Contexts

Retrieval-Augmented Generation (RAG), introduced by Lewis et al. [16], combines a neural retriever with a sequence-to-sequence generator, enabling language models to ground responses in retrieved evidence rather than relying solely on parametric knowledge. The foundational retrieval component evolved from sparse methods (BM25) through Dense Passage Retrieval (DPR) [34] to advanced architectures including REALM [35], Fusion-in-Decoder (FiD) [36], and RETRO [37], which demonstrated that retrieval from trillions of tokens could substitute for model scale.
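As a rough illustration of the retrieve-then-generate pattern, the sketch below ranks passages with a small BM25 scorer and assembles a grounded prompt with numbered sources. It is a toy under stated assumptions: whitespace tokenization, a non-empty corpus, common default values for `k1` and `b`, and a hypothetical `build_prompt` helper; the generator call itself is omitted.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_rank(query: str, docs: List[str], k1: float = 1.5,
              b: float = 0.75) -> Tuple[List[int], List[float]]:
    """Rank docs for a query with BM25; returns (ranked indices, scores)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed, non-negative IDF variant.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked, scores


def build_prompt(question: str, passages: List[str]) -> str:
    """Hypothetical helper: number the evidence so answers can cite it."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the sources below.\n{evidence}\nQuestion: {question}"


corpus = [
    "photosynthesis converts light energy into chemical energy",
    "newton second law relates force mass and acceleration",
    "mitosis is a type of cell division",
]
ranked, scores = bm25_rank("force and acceleration", corpus)
prompt = build_prompt("How are force and acceleration related?",
                      [corpus[i] for i in ranked[:1]])
```

Dense retrievers such as DPR replace the lexical scorer with learned embeddings, while the prompt-assembly step stays essentially the same.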

Two comprehensive surveys establish the state of RAG in education. Gao et al. [38] provide a general RAG taxonomy covering naive, advanced, and modular paradigms, while Zhao et al. [17] specifically survey 51 RAG-in-education studies, identifying three application categories: (1) interactive learning systems including QA chatbots and tutoring assistants; (2) educational content generation including exercise generation and curriculum summarization; and (3) large-scale deployment in educational ecosystems including LMS integration. A key finding from Swacha and Gracel [11] is that 37 of 47 RAG-for-education papers were published in 2024, indicating that the field is still at an early stage of maturity; only 4 employed multi-agent architectures, and none integrated MCP.

The evolution of RAG has progressed through three generations. Advanced RAG techniques address retrieval quality through Self-RAG [19] (self-reflective retrieval with 10–17% improvement), FLARE [39] (forward-looking active retrieval), IRCoT [21] (interleaved retrieval with chain of thought), and CRAG [40] (corrective retrieval with web search fallback). Agentic RAG [20] represents the latest paradigm, where autonomous agents perform multi-hop reasoning [22], decompose complex queries, and iteratively refine responses. Toolformer [41] demonstrated that LMs could autonomously learn tool use, establishing a foundation for agentic retrieval architectures.

In educational deployments, LPITutor [4] combined RAG with dynamic prompt engineering and course-specific vector databases, while PRAG-EDU [42] introduced grade-level personalization with a 17% improvement in self-reported comprehension. Levonian et al. [43] revealed a critical tension: RAG improves math QA groundedness but reduces human preference ratings, motivating pedagogically aware generation strategies. Vygotsky’s ZPD theory [28] has informed adaptive scaffolding strategies in AI tutoring, with AutoTutor [44] demonstrating that graduated scaffolding based on learner mastery improves learning outcomes. The integration of BKT [2] with modern RAG architectures enables personalized content retrieval calibrated to individual mastery states, as explored by Scarlatos et al. [29] in tutor–student dialogue settings. Recent work on RAG architecture engineering [45] and deliberative reasoning [22] further establishes the need for principled educational RAG design.

2.3. The Model Context Protocol in Education

The Model Context Protocol (MCP), introduced by Anthropic in November 2024 [12], defines an open standard for connecting LLMs to external data sources and tools through a client–server architecture using JSON-RPC 2.0 over stdio or HTTP+SSE transports. MCP specifies four primitives—Tools (state-changing operations), Resources (read-only contextual data), Prompts (reusable parameterized templates), and Sampling (human-in-the-loop validation)—that map naturally onto educational system components [13,46].
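For orientation, the sketch below builds the kind of JSON-RPC 2.0 request envelopes an MCP client sends. The `tools/call` and `resources/read` method names follow the MCP specification as we understand it; the tool name `generate_quiz`, its arguments, and the `syllabus://` URI are hypothetical placeholders, and transport framing (stdio or HTTP) is not shown.

```python
import json
from itertools import count
from typing import Any, Dict

_request_ids = count(1)  # monotonically increasing JSON-RPC request ids


def jsonrpc_request(method: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    }


# Tool invocation (state-changing): the tool name and arguments are hypothetical.
call = jsonrpc_request(
    "tools/call",
    {"name": "generate_quiz",
     "arguments": {"topic": "photosynthesis", "num_items": 3}},
)

# Resource read (read-only): the URI scheme is hypothetical.
read = jsonrpc_request("resources/read", {"uri": "syllabus://bio101/week3"})

# Serialized form that would be written to the transport.
wire = json.dumps(call)
```

Keeping the envelope identical across tools and resources is what lets a single client talk to heterogeneous educational back ends without bespoke connectors.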

The MCP ecosystem has grown rapidly, with benchmark suites evaluating LLM performance across real-world MCP servers: MCP-Universe [23] found that even GPT-5 achieved only 43.72% success across 6 domains and 11 servers; MCPToolBench++ [24] scaled evaluation to 4000+ servers across 40+ categories; and MCP-Bench [25] tested 250 tools across finance, travel, science, and academic search. MCP-Zero [47] introduced active tool discovery, achieving 98% token reduction, while ScaleMCP [48] addressed dynamic tool selection at scale.

Security analyses have identified critical concerns for educational deployment. Zhang et al. [26] established a 12-category attack taxonomy including name-collision, prompt injection, and data exfiltration across 2000 attack instances. Radosevich and Halloran [27] demonstrated malicious code execution and credential theft via MCP servers. Hasan et al. [49] found that 97.1% of MCP tool descriptions contain at least one “smell”, with 56% failing to state purpose clearly, underscoring the importance of carefully crafted educational tool descriptions. Protocol comparison studies [50,51] have positioned MCP within the broader landscape of agent interoperability protocols (A2A, ACP, ANP), recommending MCP as the starting point for enterprise adoption.

In educational contexts, MCP primitives map directly to system requirements: Tools enable automated grading, Student Information System (SIS) updates, and quiz generation; Resources provide read-only access to syllabuses, textbook passages, and student performance histories; Prompts offer standardized Socratic questioning and scaffolded problem decomposition templates; and Sampling supports human-in-the-loop assessment validation by delegating complex evaluation to specialized models or educators. This natural mapping motivates EduMSRA’s MCP-based Curriculum Tool Orchestration Layer.

Despite this growing ecosystem, Modran et al. [6] remains the only study directly addressing MCP-enabled adaptive tutoring. Their system indexed two course corpora into a vector store, applying MCP for context management alongside ACP for multi-agent orchestration, achieving a 34% reduction in irrelevant context injection compared to naive context concatenation. While students reported the system as useful for independent study, no learner profile persistence, cross-source conflict resolution, or Bloom’s taxonomy alignment was implemented, precisely the gaps EduMSRA addresses through its Learner Profile Manager, Conflict-Aware Fusion Module, and Pedagogical Policy Agent.

2.4. RAG Architectures in Education

2.4.1. Phase 1—Static Document RAG for Education (2022–2023)

The first generation employed straightforward document retrieval over course PDF corpora, building upon Lewis et al.’s foundational RAG architecture [16] with BM25 or dense passage retrievers [34] coupled with general-purpose LLMs [33,52]. Swacha and Gracel [11] surveyed 23 RAG chatbot deployments and found 78% relied on single-source retrieval with BM25 dominant. The Fusion-in-Decoder approach [36] offered multi-passage synthesis for educational scenarios.

Levonian et al. [43] revealed a critical tension: RAG improved groundedness but reduced human preference ratings, directly motivating EduMSRA’s Pedagogical Policy Agent. Zhu et al. [53] identified hallucination propagation in multi-turn educational interactions. The key limitation was a “one-size-fits-all” approach: identical strategies regardless of learner proficiency or question complexity. Liu et al. [31] found fewer than 12% of 86 AI tutoring systems incorporated learner-adaptive retrieval, while Yan et al. [7] identified privacy, bias, and over-reliance as persistent ethical concerns.

2.4.2. Phase 2—Adaptive and Personalized RAG (2023–2024)

The second phase introduced learner-adaptive mechanisms. LPITutor [4] combined RAG with dynamic prompt engineering and course-specific vector databases. PRAG-EDU [42] introduced grade-level personalization using historical module grades to calibrate response complexity, reporting 17% comprehension improvement. Self-RAG [19] (ICLR 2024) trained models to learn when to retrieve via reflection tokens, achieving 10–17% improvement. FLARE [39] reduced redundant retrieval by 23% through forward-looking active retrieval.

Several architectural advances directly informed EduMSRA’s design: IRCoT [21] interleaved retrieval with chain-of-thought reasoning for multi-hop questions relevant to Bloom’s higher-order levels; Adaptive-RAG [18] introduced query complexity-based routing integrated into EduMSRA’s Query Complexity Estimator; CRAG [40] demonstrated corrective retrieval with 12% accuracy improvement on knowledge-cutoff questions; and REPLUG [54] and In-Context RALM [55] enabled black-box augmentation without LLM fine-tuning. Query rewriting [56] addressed the vocabulary mismatch between student-level and textbook-level terminology.
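The idea of complexity-based routing can be sketched as follows. A hand-written keyword heuristic stands in here for whatever estimator a real system would use; the cue list, thresholds, and route names are all illustrative assumptions, not the method of [18] or of EduMSRA’s Query Complexity Estimator.

```python
import re

# ASSUMPTION: surface cues that loosely signal multi-step reasoning.
MULTI_HOP_CUES = ("compare", "contrast", "why", "how does",
                  "relationship", "evaluate", "derive")

# Hypothetical mapping from complexity label to retrieval strategy.
ROUTES = {
    "no_retrieval": "answer from parametric knowledge",
    "single_step": "one retrieval pass, then generate",
    "multi_step": "iterative retrieval interleaved with reasoning",
}


def estimate_complexity(query: str) -> str:
    """Classify a query into one of three routing labels."""
    q = query.lower()
    n_tokens = len(re.findall(r"\w+", q))
    cues = sum(cue in q for cue in MULTI_HOP_CUES)
    if cues >= 2 or (cues >= 1 and n_tokens > 12):
        return "multi_step"
    if cues == 1 or n_tokens > 6:
        return "single_step"
    return "no_retrieval"


def route(query: str) -> str:
    """Return the retrieval strategy chosen for a query."""
    return ROUTES[estimate_complexity(query)]
```

For example, `estimate_complexity("Compare and contrast mitosis and meiosis")` returns `"multi_step"` because two cues match, whereas `"Define osmosis"` falls through to `"no_retrieval"`.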

2.4.3. Phase 3—Agentic RAG for Education (2024–2025)

The most recent phase introduces autonomous agents orchestrating multiple retrieval strategies with persistent state. Singh et al. [20] taxonomized agentic RAG into single-agent, multi-agent, and hierarchical architectures applicable to educational roles (tutor, assessor, planner). Key enabling capabilities include: Toolformer [41] for autonomous API invocation, ReAct [57] for reasoning–action interleaving, and Reflexion [58] for self-reflective learning from failures, directly applicable to Socratic tutoring. Scarlatos et al. [29] bridged traditional BKT with modern LLM architectures for knowledge tracing in dialogues.

Chen et al. [59] proposed multi-agent tutoring with specialized agents for retrieval, explanation, assessment, and feedback. Wang et al. [60] identified memory, planning, and tool use as the three pillars of agent capability—all integrated in EduMSRA. The pivotal work of Modran et al. [6] deployed RAG + MCP + ACP for STEM tutoring at Transilvania University but lacked learner profile persistence, cross-source conflict resolution, and Bloom’s alignment, the three gaps EduMSRA addresses. FAIR-RAG [61] introduced structured evidence assessment (F1 = 0.453 on HotpotQA), while Zhang et al. [62] identified short-term, long-term, and procedural memory as critical for persistent interactions.

The trajectory from static to adaptive to agentic RAG, as shown in Figure 1, reveals a consistent pattern: each phase addresses limitations of the previous while introducing complexity. Bloom’s revised taxonomy [14] provides a natural framework for differentiating retrieval by cognitive level, suggesting hybrid architectures combining simple retrieval for factual queries with full agentic pipelines for complex reasoning [20].

2.4.4. Graph-Based, Multimodal, and Evaluation Approaches

Graph-based retrieval captures hierarchical prerequisite relationships in educational curricula. GraphRAG [63] applies community-level summarization over knowledge graphs, while RAPTOR [64] constructs hierarchical summaries mapping naturally to Bloom’s taxonomy levels [15]. Cui et al. [30] formally connected Bloom’s revised taxonomy and Vygotsky’s ZPD [28] to graph-based knowledge tracing architectures. Baek et al. [65] demonstrated that graph-aware retrieval outperformed passage-only retrieval for relational reasoning. These capabilities are integrated into EduMSRA’s Tier-3 Curriculum Knowledge Graph (CKG), combining GraphRAG’s community summarization with RAPTOR’s hierarchical abstraction.
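One simple way to exploit prerequisite structure is to order the concepts a learner must cover before reaching a target. The sketch below does this with a topological sort from the Python standard library (`graphlib`, Python 3.9+); the toy graph and the `learning_path` helper are illustrative assumptions, not part of the Curriculum Knowledge Graph itself.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Toy prerequisite graph: each concept maps to the concepts it depends on.
prerequisites: Dict[str, Set[str]] = {
    "fractions": set(),
    "ratios": {"fractions"},
    "linear_equations": {"ratios"},
    "slope": {"linear_equations", "ratios"},
}


def learning_path(target: str, graph: Dict[str, Set[str]]) -> List[str]:
    """Return the target and all its prerequisites, prerequisites first."""
    # Collect the target and every transitive prerequisite.
    needed: Set[str] = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(graph.get(node, ()))
    # Restrict the graph to those nodes and sort it topologically.
    subgraph = {n: graph.get(n, set()) & needed for n in needed}
    return list(TopologicalSorter(subgraph).static_order())


path = learning_path("slope", prerequisites)
```

A retriever could use such a path to fetch prerequisite passages before the target concept when a learner’s mastery of an earlier node is low.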

For evaluation, Es et al. [66] introduced the RAGAS framework, providing automated metrics for context relevance, faithfulness, and answer relevancy. Chen et al. [67] found that hallucination rates increased by 34% on multi-hop versus single-hop questions. Wampler et al. [45] proposed six trust dimensions (accuracy, provenance, timeliness, completeness, consistency, transparency) mapping directly to educational requirements. Tonmoy et al. [10] catalogued hallucination mitigation techniques including retrieval verification, confidence calibration, and multi-source triangulation, all integrated into EduMSRA’s CAFM. VanLehn [3] established that effective ITS evaluation required longitudinal learning gains beyond immediate accuracy, informing our four-experiment protocol.

Educational content is inherently multimodal. Abootorabi et al. [68] surveyed multimodal RAG across text, image, audio, and video, identifying early fusion, late fusion, and hybrid strategies applicable to educational retrieval. The ScienceQA dataset [69] (21,208 multimodal questions) exemplifies this challenge, while RETRO [37] demonstrated that retrieval from trillion-token corpora could substitute for model scale, critical for resource-constrained institutions. Multimodal retrieval with pedagogical adaptation remains an open frontier that EduMSRA’s extensible MCP server taxonomy accommodates.

2.5. MCP in Educational AI: Architecture and Ecosystem

The Model Context Protocol (MCP), introduced by Anthropic in November 2024 [12], has rapidly emerged as a standardizing force for connecting LLMs to external data sources and tools. Several surveys have examined MCP’s architecture and security landscape [13,46,70], while benchmark suites such as MCP-Universe [23], MCPToolBench++ [24], MCPAgentBench [71], and MCP-Bench [25] have systematically evaluated LLM agent performance across real-world MCP server ecosystems. Despite this growing body of work, the application of MCP specifically to educational AI remains underexplored, with Modran et al. [6] representing the only study directly addressing MCP-enabled adaptive tutoring.

2.5.1. MCP Primitives Mapped to Educational Use Cases

The MCP specification defines four core primitives—Tools, Resources, Prompts, and Sampling—each of which maps naturally to educational AI scenarios:

Tools enable state-changing operations such as automated grading, Student Information System (SIS) record updates, and dynamic quiz generation. The LLM agent invokes verified API endpoints to execute these operations with audit trails.
Resources provide read-only access to contextual data including syllabuses, textbook passages, and student performance histories. MCP servers expose these as typed URIs that the LLM ingests for informed tutoring responses.
Prompts offer reusable, parameterized templates for specific pedagogical objectives, enabling standardized Socratic questioning, scaffolded problem decomposition, and formative feedback generation across different subject domains.
Sampling supports human-in-the-loop validation by delegating complex evaluation tasks (e.g., essay scoring, creative assessment) to specialized pedagogical models or human educators before returning results.

Recent work on MCP tool quality reveals that 97.1% of existing MCP tool descriptions contain at least one “smell,” with 56% failing to state their purpose clearly [49]. For educational deployments, this finding underscores the importance of carefully crafted tool descriptions that accurately convey pedagogical intent and appropriate usage constraints.

2.5.2. MCP Server Taxonomy for Education

We propose a seven-category taxonomy of MCP servers for educational AI: (1) Curriculum Content (OpenStax, CK-12, Khan Academy, LMS repositories); (2) Assessment and Practice (item banks, auto-grading engines); (3) Learner Data (xAPI/LRS endpoints, SIS interfaces); (4) Knowledge Graph (Wikidata, ConceptNet, curriculum prerequisite graphs); (5) Computational Tools (code execution, mathematical engines); (6) Communication (forums, peer review platforms); and (7) Analytics (learning analytics, early warning systems). This taxonomy extends prior five-category classifications and provides the foundation for EduMSRA’s dynamic MCP Tool Registry. Advanced tool discovery frameworks such as MCP-Zero [47] and ScaleMCP [48] demonstrate autonomous selection from thousands of tools with 98% token reduction.
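The seven categories can be represented directly as a small registry keyed by category. The sketch below is a minimal illustration: the `ToolRegistry` class, its methods, and the server names are hypothetical and do not describe the actual MCP Tool Registry; rejecting duplicate names is one simple guard against name-collision problems.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ServerCategory(Enum):
    CURRICULUM_CONTENT = 1
    ASSESSMENT_AND_PRACTICE = 2
    LEARNER_DATA = 3
    KNOWLEDGE_GRAPH = 4
    COMPUTATIONAL_TOOLS = 5
    COMMUNICATION = 6
    ANALYTICS = 7


@dataclass(frozen=True)
class ServerEntry:
    name: str
    category: ServerCategory
    description: str  # should state purpose and usage constraints clearly


class ToolRegistry:
    """Hypothetical in-memory registry of MCP servers by category."""

    def __init__(self) -> None:
        self._entries: Dict[str, ServerEntry] = {}

    def register(self, entry: ServerEntry) -> None:
        # Reject duplicate names rather than silently overwriting.
        if entry.name in self._entries:
            raise ValueError(f"server name already registered: {entry.name}")
        self._entries[entry.name] = entry

    def by_category(self, category: ServerCategory) -> List[str]:
        return sorted(e.name for e in self._entries.values()
                      if e.category is category)


registry = ToolRegistry()
registry.register(ServerEntry("openstax-content", ServerCategory.CURRICULUM_CONTENT,
                              "Read-only access to open textbook passages."))
registry.register(ServerEntry("item-bank", ServerCategory.ASSESSMENT_AND_PRACTICE,
                              "Retrieves practice items by topic and difficulty."))
registry.register(ServerEntry("math-engine", ServerCategory.COMPUTATIONAL_TOOLS,
                              "Evaluates symbolic and numeric expressions."))
```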

2.5.3. MCP Ecosystem Maturity and Benchmarking

The MCP ecosystem has grown rapidly, with MCP-Flow [72] cataloguing 1166 servers encompassing 11,536 tools through automated web-agent-driven discovery. Benchmark evaluations reveal significant challenges: MCP-Universe [23] found that even state-of-the-art models achieved limited success rates (GPT-5: 43.72%, Grok-4: 33.33%) on realistic MCP tasks, while MCPToolBench++ [24] evaluated across 4000+ servers spanning 40+ categories. These findings suggest that educational MCP deployments must account for current LLM limitations in tool selection and multi-step planning.

The protocol landscape is also evolving beyond MCP alone. Ehtesham et al. [51] survey four agent interoperability protocols—MCP, Agent Communication Protocol (ACP), Agent-to-Agent (A2A), and Agent Network Protocol (ANP)—proposing a phased adoption road map beginning with MCP for tool access. Li and Xie [50] critically analyze A2A and MCP integration, identifying emergent challenges in semantic interoperability and compounded security risks. For educational multi-agent systems (e.g., collaborative tutoring with specialized subject-matter agents), the AWCP workspace delegation protocol [73] offers complementary capabilities for deep-engagement collaboration.

2.5.4. Security and Privacy Considerations in Educational MCP

Educational MCP deployments face unique security challenges that compound general MCP vulnerabilities with student data protection requirements. Zhang et al. [26] present the first comprehensive MCP security benchmark (MSB), taxonomizing 12 distinct attack vectors—including name collision, preference manipulation, prompt injection via tool descriptions, and tool-transfer attacks—evaluated across 2000 attack instances. Critically, they find that models with stronger capabilities are paradoxically more vulnerable due to their superior instruction-following abilities. Radosevich and Halloran [27] demonstrate that industry-leading LLMs can be coerced via MCP tools into executing malicious code, establishing remote access control, and performing credential theft.

In educational contexts, these risks manifest as five domain-specific threats: (1) prompt injection via adversarial homework submissions; (2) resource path traversal against LMS file systems; (3) MCP traffic interception exposing student PII; (4) privilege escalation modifying academic records; and (5) sampling exfiltration redirecting student data to unauthorized endpoints.
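Threat (2), resource path traversal, has a well-known basic mitigation: resolve every requested path and refuse anything that escapes the permitted root. The sketch below shows that check with `pathlib` (Python 3.9+ for `is_relative_to`); the function name and directory layout are hypothetical, and a production server would need further controls such as authentication and logging.

```python
from pathlib import Path


def resolve_resource(root: str, requested: str) -> Path:
    """Resolve a requested resource path, refusing escapes from root.

    Both paths are fully resolved (symlinks and '..' segments collapsed)
    before the containment check, so '../' tricks and absolute paths
    pointing outside the root are rejected.
    """
    root_path = Path(root).resolve()
    candidate = (root_path / requested).resolve()
    if not candidate.is_relative_to(root_path):
        raise PermissionError(f"path escapes resource root: {requested}")
    return candidate


# Hypothetical layout: course files live under a single root directory.
allowed = resolve_resource("lms_root", "bio101/syllabus.pdf")
```

A request such as `resolve_resource("lms_root", "../../etc/passwd")` raises `PermissionError` instead of returning a path outside the course directory.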

FERPA and GDPR compliance requires zero-trust identity propagation, cryptographic audit logging, GDPR Right to Erasure hooks, and ephemeral context handling. EduMSRA’s Permission Sandbox implements three trust tiers: Read-Only (curriculum content—no student data), Compute (sandboxed problem-solving tools), and Restricted (learner data—authenticated identity, consent verification, audit logging).

2.5.5. MCP Adoption Barriers in Education

Five barriers impede institutional MCP adoption: (1) institutional inertia—legacy LMS platforms (Blackboard, older Moodle) lack MCP compatibility; (2) educator technical literacy—MCP server deployment demands DevOps expertise [5]; (3) regulatory uncertainty—few institutions have governance frameworks mapping MCP trust tiers to FERPA workflows; (4) nascent tooling—education-specific MCP servers remain largely conceptual; and (5) evaluation gaps—no standardized benchmark assesses MCP-enabled educational AI across pedagogical effectiveness, safety, and scalability [74].

2.6. Comparative Analysis of Architectures for Educational AI

2.6.1. Evaluation Framework

We evaluate architectures across six educational capability dimensions derived from the ITS effectiveness literature [3] and recent RAG evaluation frameworks [66]: (1) Adaptive Retrieval: whether the system dynamically adjusts retrieval strategy to query complexity and learner context; (2) Multi-source Educational Integration: the ability to simultaneously access textbooks, LMS content, assessment databases, and real-time APIs; (3) MCP Tool Standardization: whether external tools are accessed via the open MCP protocol or bespoke connectors; (4) Learner Personalization: the degree of adaptation to individual student profiles, prior knowledge, and learning preferences; (5) Educational Application Scope: the range of supported pedagogical tasks from simple QA to holistic tutoring; and (6) Citation Awareness: whether generated content is attributed to verifiable sources. Citation Awareness is included as a dedicated dimension because of growing concern that LLM-generated educational content without source attribution can mislead students in high-stakes learning contexts [5].

2.6.2. Key Findings

Our six-dimension analysis across nine architectures reveals four critical insights.

Finding 1: The Personalization–Grounding Trade-off. Systems that achieve strong learner personalization (PRAG-EDU [42], Khanmigo [75]) rely on proprietary, single-source content pipelines that cannot access diverse curriculum materials beyond their pre-indexed corpora. Conversely, systems with broad multi-source access (Agentic RAG [20]) lack persistent learner models, limiting their ability to adapt explanations across sessions. No existing architecture resolves this fundamental tension between deep personalization and broad knowledge access.

Finding 2: Citation Awareness Remains Rare. Only two of nine architectures—GraphRAG [63] and PRAG-EDU [42]—provide even partial source attribution, and neither achieves full citation-aware generation where every pedagogical claim is grounded in a verifiable curriculum source. This gap is particularly concerning in educational settings where hallucinated explanations can directly harm student learning outcomes [5]. The ITS literature has long established that tutor credibility depends on traceable reasoning [3], yet modern LLM-based systems have regressed on this dimension compared to rule-based predecessors like AutoTutor [44].

Finding 3: MCP Standardization Is Absent in Education. Despite MCP’s growing ecosystem of 11,536+ tools across enterprise domains [72], no existing educational AI architecture adopts MCP for tool orchestration. Modran et al.’s RAG+ACP system [6] uses the related but distinct Agent Communication Protocol, lacking MCP’s Resources and Sampling primitives that are critical for FERPA-compliant learner data access. Commercial systems like Khanmigo [75] use proprietary integrations that cannot be extended by third-party developers. This absence means educational systems cannot benefit from MCP’s standardized permission model, audit logging, or dynamic tool discovery capabilities demonstrated by MCP-Zero [47].

Finding 4: No Architecture Achieves Full Coverage. Table 1 demonstrates that no existing system simultaneously satisfies all six capability dimensions. The closest competitor, Modran et al.’s RAG+ACP [6], achieves four of six dimensions but critically lacks citation awareness and persistent learner personalization. AutoTutor-LLM [44], despite two decades of pedagogical dialogue research, has not been extended with multi-source retrieval or standardized tool integration. These converging gaps—the personalization–grounding trade-off, absent citation awareness, missing MCP standardization, and incomplete dimensional coverage—collectively motivate the EduMSRA architecture proposed in Section 2.7.

These nine challenges synthesize findings from ITS effectiveness [3,74], LLMs in education [5], and MCP security [26,27]. Existing architectures address them in isolation; to the best of our knowledge, EduMSRA is the first to provide an integrated solution across all dimensions.

2.7. Proposed Architecture: EduMSRA

2.7.1. Architecture Overview

EduMSRA (Educational Multi-Source Research Agent) is a five-component architecture designed to provide adaptive, grounded, and pedagogically aligned AI tutoring across K-12, higher education, and self-directed learning contexts. The architecture, illustrated in Figure 2, comprises five specialized modules coordinated by a Central Reasoning Agent (CRA): (1) Hierarchical Educational RAG Pipeline (HERAP); (2) MCP-based Curriculum Tool Orchestration Layer (CTOL); (3) Conflict-Aware Fusion Module (CAFM); (4) Learner Profile Manager (LPM); and (5) Pedagogical Policy Agent (PPA). The CRA is instantiated on a state-of-the-art instruction-tuned LLM (e.g., Claude-3.7-Sonnet, GPT-4o, or Llama-3.1-70B) and functions as the orchestration backbone that routes queries, invokes components, and synthesizes final responses.

EduMSRA is grounded in three design principles that directly address the nine challenge dimensions identified in Table 2:

Principle 1—Pedagogical Grounding. Every generated response must be traceable to verified curriculum sources with explicit inline citation. This principle directly addresses the hallucination challenge by ensuring that no educational claim is presented without provenance. The architecture enforces this through the CAFM citation-aware synthesis stage and the HERAP confidence scoring mechanism, following the provenance-tracking approaches advocated in recent RAG reliability research [38,66].

Principle 2—Learner Centricity. Retrieval depth, generation style, and explanation complexity adapt continuously to the individual learner’s profile. Unlike static tutoring systems that deliver uniform responses [3], EduMSRA maintains a persistent, multi-dimensional student model (via LPM) that informs every stage of the processing pipeline. This addresses the learner heterogeneity and assessment alignment challenges by calibrating both content and cognitive demand to the student’s current state.

Principle 3—Curricular Coherence. Responses respect the pedagogical structure of the curriculum, including prerequisite relationships, learning objectives, and Bloom’s taxonomy levels [14]. The Curriculum Knowledge Graph (CKG) within HERAP encodes these structural dependencies, while the PPA enforces Bloom-level-appropriate generation. This addresses the multi-hop reasoning gaps and assessment alignment challenge simultaneously.

These three principles are operationalized through the component interaction pipeline shown in Figure 3, which processes each student query through five sequential phases: query classification, multi-source retrieval, conflict-aware fusion, pedagogically aligned generation, and state update.

2.7.2. Component 1—Hierarchical Educational RAG Pipeline (HERAP)

HERAP implements a three-tier retrieval strategy specifically designed for educational content, inspired by hierarchical retrieval approaches [64] but adapted to respect curriculum structure and pedagogical sequencing. Unlike general-purpose RAG systems that treat all documents uniformly [16], HERAP recognizes that educational content has inherent hierarchical organization (courses ⊃ units ⊃ lessons ⊃ concepts) and prerequisite dependencies that must be preserved during retrieval.

Tier 1—Lexical Retrieval. BM25 sparse retrieval [55] over indexed curriculum documents including textbook chapters, lecture notes, and worked examples. This tier is optimized for exact terminology matching in domain-specific vocabulary—mathematical notation, scientific terminology, and programming syntax—where semantic similarity models often fail due to vocabulary mismatch. The BM25 index is partitioned by curriculum unit, enabling scope-limited retrieval when the learner’s current unit is known from the LPM.

Tier 2—Semantic Retrieval. Dense bi-encoder retrieval using an education-domain-adapted embedding model (e.g., E5-large fine-tuned on ScienceQA [69] and MMLU [76]). This tier captures semantic equivalences across textbook editions and handles paraphrase-rich student queries that lexical matching would miss. Following the dense passage retrieval paradigm [34], queries and passages are independently encoded into a shared dense embedding space (1024-dimensional for E5-large), with approximate nearest-neighbor search via HNSW indexing for sub-100 ms retrieval latency.

Tier 3—Curriculum Knowledge Graph (CKG) Retrieval. Graph traversal over a prerequisite-structure curriculum knowledge graph where nodes represent concepts (e.g., “quadratic equations”), edges represent pedagogical dependencies (e.g., requires_prerequisite, extends_to), and edge weights represent difficulty progression calibrated to Bloom’s taxonomy levels [15]. This tier enables multi-hop retrieval that respects the curriculum sequence; for instance, retrieving “integration by parts” triggers prerequisite retrieval of “integration basics” and “product rule” along the dependency chain. The CKG design extends the GraphRAG paradigm [63] with pedagogical edge semantics not present in general knowledge graphs.
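The prerequisite-aware traversal described for Tier 3 can be illustrated with a minimal sketch. The toy graph, the `prerequisite_chain` function, and the `max_hops` parameter are hypothetical names introduced here for illustration; edge weights and Bloom-level calibration are omitted.

```python
from collections import deque

# Hypothetical toy CKG: concept -> direct prerequisites
# (i.e., outgoing requires_prerequisite edges).
PREREQS = {
    "integration by parts": ["integration basics", "product rule"],
    "integration basics": ["antiderivatives"],
    "product rule": ["derivatives"],
    "antiderivatives": ["derivatives"],
    "derivatives": [],
}

def prerequisite_chain(concept, graph, max_hops=2):
    """Breadth-first walk along prerequisite edges, at most max_hops deep.

    Returns (prerequisite, hop_distance) pairs in discovery order.
    """
    seen = {concept}
    order = []
    queue = deque([(concept, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for pre in graph.get(node, []):
            if pre not in seen:
                seen.add(pre)
                order.append((pre, depth + 1))
                queue.append((pre, depth + 1))
    return order

chain = prerequisite_chain("integration by parts", PREREQS)
```

With this toy graph, the first two entries of `chain` are the direct prerequisites named in the text ("integration basics" and "product rule"), followed by their own prerequisites at hop distance 2.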

Query Complexity Estimator (QCE). A lightweight classifier routes each incoming query to the appropriate retrieval tiers based on estimated cognitive complexity:

Factual queries (e.g., “What is Avogadro’s number?”) → Tier 1 only, minimizing latency.
Conceptual queries (e.g., “Explain the relationship between pressure and volume”) → Tier 1 + Tier 2, combining exact matches with semantic expansion.
Relational queries (e.g., “How does thermodynamics connect to chemical equilibrium?”) → All three tiers, with CKG traversal providing cross-concept linking.

The QCE is implemented as a fine-tuned text classifier trained on educational question taxonomies aligned with Bloom’s cognitive levels [14]. Its output additionally informs the PPA’s Bloom level mapping (Section 2.7.6).
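The routing rule above can be written as a small lookup table. This is only a sketch of the tier-selection step: the classifier itself is not shown, and the tier labels and the `active_tiers` helper are hypothetical names.

```python
from enum import Enum

class QueryType(Enum):
    FACTUAL = "factual"
    CONCEPTUAL = "conceptual"
    RELATIONAL = "relational"

# Tier labels are illustrative; they stand for Tier 1, Tier 2, and Tier 3.
TIER_ROUTING = {
    QueryType.FACTUAL: ("lexical",),
    QueryType.CONCEPTUAL: ("lexical", "semantic"),
    QueryType.RELATIONAL: ("lexical", "semantic", "ckg"),
}

def active_tiers(query_type):
    """Return the retrieval tiers to invoke for a classified query."""
    return TIER_ROUTING[query_type]
```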

Reciprocal Rank Fusion (RRF) with Pedagogical Re-ranking. Results across active tiers are combined using a modified RRF formula that incorporates pedagogical relevance:

$$ \mathrm{RRF}_{\mathrm{edu}}(d) = \sum_{t \in T_{\mathrm{active}}} \frac{w_{t}}{k + r_{t}(d)} \cdot \alpha(d, U) \tag{1} $$

where $r_{t}(d)$ is the rank of document $d$ in tier $t$, $k = 60$ is the standard RRF constant, $w_{t}$ is a tier-specific weight (default: $w_{1} = 0.25$, $w_{2} = 0.40$, $w_{3} = 0.35$ for Relational queries, reflecting the primacy of semantic retrieval for paraphrase-rich student queries while preserving substantial CKG contribution for prerequisite-aware ranking), and $\alpha(d, U)$ is a pedagogical relevance multiplier that upweights passages aligned with the student’s current curriculum unit and Bloom level as recorded in the LPM profile $U$. This formulation extends standard RRF [77] with learner-adaptive re-ranking, ensuring that retrieval results are both topically relevant and pedagogically appropriate.
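A minimal sketch of Equation (1) follows. It assumes that a document missing from a tier contributes nothing for that tier (the usual RRF convention) and that the pedagogical multiplier is supplied as a precomputed lookup; the function name, document ids, and the multiplier value 1.5 are hypothetical.

```python
def rrf_edu(rankings, weights, alpha, k=60):
    """Weighted reciprocal rank fusion with a per-document multiplier.

    rankings: tier name -> list of document ids, best first
    weights:  tier name -> tier weight w_t
    alpha:    document id -> pedagogical multiplier (1.0 if absent)
    """
    scores = {}
    for tier, docs in rankings.items():
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weights[tier] / (k + rank)
    fused = {doc: s * alpha.get(doc, 1.0) for doc, s in scores.items()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

rankings = {
    "lexical": ["d1", "d2", "d3"],
    "semantic": ["d2", "d1"],
    "ckg": ["d3", "d2"],
}
weights = {"lexical": 0.25, "semantic": 0.40, "ckg": 0.35}

plain = rrf_edu(rankings, weights, alpha={})
boosted = rrf_edu(rankings, weights, alpha={"d3": 1.5})  # d3 matches the learner's unit
```

In this toy case the unboosted order is d2, d1, d3, while the multiplier moves d3 ahead of d1, which is the learner-adaptive re-ranking effect the formula is meant to capture.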

2.7.3. Component 2—MCP-Based Curriculum Tool Orchestration Layer (CTOL)

CTOL manages a dynamic registry of educational MCP servers [12], enabling EduMSRA to access heterogeneous educational knowledge sources through standardized interfaces. The MCP protocol’s client–server architecture provides three critical capabilities for educational deployments: (a) runtime tool discovery without hardcoded integrations, (b) capability negotiation that adapts to each institution’s available resources, and (c) sandboxed execution with permission boundaries that enforce data privacy regulations.

The Semantic Tool Matcher (STM) resolves the tool-discovery challenge by encoding both student queries and MCP server capability descriptions into a shared embedding space. Given a query $Q$ and a registry of $n$ MCP servers with capability descriptions $C = \{c_{1}, \dots, c_{n}\}$, the STM computes cosine similarity scores and selects the top-$k$ most relevant tools (default $k = 3$, configurable per deployment based on the number of registered MCP servers):

$$ \mathrm{STM}(Q, C) = \operatorname{top}\text{-}k \left\{ \frac{e(Q) \cdot e(c_{i})}{\lVert e(Q) \rVert \, \lVert e(c_{i}) \rVert} \;\middle|\; c_{i} \in C \right\} \tag{2} $$

where $e(\cdot)$ denotes the embedding function. This approach reduces tool invocation errors compared to keyword-based matching by capturing semantic intent (e.g., matching “solve this equation step by step” to a computation tool rather than a definition lookup tool).
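Equation (2) can be sketched as follows. The embedding function is not implemented here; the three-dimensional vectors and tool names are hypothetical stand-ins for real embeddings of a query and of MCP capability descriptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_tool_match(query_vec, capability_vecs, k=3):
    """Return the k tools whose capability embeddings are closest to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in capability_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (hypothetical values).
capabilities = {
    "symbolic_solver": [0.9, 0.1, 0.0],
    "definition_lookup": [0.1, 0.9, 0.0],
    "lms_grades": [0.0, 0.1, 0.9],
    "textbook_search": [0.3, 0.7, 0.1],
}
query = [0.8, 0.2, 0.0]  # stands in for "solve this equation step by step"

top2 = semantic_tool_match(query, capabilities, k=2)
```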

Three-Tier Trust Architecture. CTOL implements graduated trust levels with corresponding security boundaries, addressing Challenge 8 (student data privacy) from Table 2:

Tier A—Read-Only Curriculum Content (Trust: Public). MCP servers providing access to open educational resources: OpenStax textbook APIs, Khan Academy content endpoints [75], ERIC database queries, CK-12 content, and institutional LMS document APIs. These tools require no authentication beyond API keys and return only publicly available educational content. All retrieved content is cached with TTL-based invalidation to reduce external API load.
Tier B—Sandboxed Computation (Trust: Isolated). Execution environments for mathematical problem verification (Python/Jupyter), symbolic computation (WolframAlpha API), and code interpretation for CS education. Each computation request executes in a containerized sandbox with CPU/memory limits (default: 2 vCPU, 512 MB RAM, 30 s timeout), network isolation (no outbound connections), and filesystem restrictions (read-only access to problem datasets only). This design prevents code injection attacks while enabling rich computational support for STEM subjects.
Tier C—Restricted Learner Data (Trust: Authenticated). LMS-grade APIs (Canvas LTI 1.3, Moodle Web Services), xAPI/LRS endpoints for learning analytics, and institution SIS interfaces. Access requires: (a) explicit student consent tokens compliant with FERPA [45] and GDPR regulations; (b) OAuth 2.0 institutional authentication; and (c) cryptographic audit logging of all data access events. The LPM (Section 2.7.5) is the sole consumer of Tier C data within EduMSRA.
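The access rule for the three tiers can be sketched as a single authorization check. This is a simplified illustration: the class and function names are hypothetical, Tier A and Tier B are treated as always permitted, and a plain list stands in for the cryptographic audit log described above.

```python
from dataclasses import dataclass
from enum import Enum

class TrustTier(Enum):
    A_READ_ONLY = "A"
    B_SANDBOXED = "B"
    C_RESTRICTED = "C"

@dataclass
class CallContext:
    caller: str                      # component issuing the tool call
    has_consent_token: bool = False  # explicit student consent present
    is_authenticated: bool = False   # institutional OAuth 2.0 login succeeded

def authorize(tier, ctx, audit_log):
    """Allow Tier C only for the LPM with consent and authentication; log every attempt."""
    if tier is TrustTier.C_RESTRICTED:
        allowed = ctx.caller == "LPM" and ctx.has_consent_token and ctx.is_authenticated
        audit_log.append((ctx.caller, tier.value, allowed))  # stand-in for a signed log entry
        return allowed
    return True

log = []
ok = authorize(TrustTier.C_RESTRICTED, CallContext("LPM", True, True), log)
denied = authorize(TrustTier.C_RESTRICTED, CallContext("CRA", True, True), log)
```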

Adaptive Tool Documentation Module (ATDM). A persistent challenge in MCP-based systems is that LLMs may generate invalid tool invocations when encountering unfamiliar server schemas [26]. The ATDM addresses this by automatically generating simplified, LLM-readable usage guides from MCP server JSON-RPC schemas at registration time. Each guide includes: (a) a natural-language capability summary; (b) parameter descriptions with type constraints and example values; (c) error-handling patterns; and (d) rate-limit specifications. The ATDM regenerates guides when server schemas are updated, ensuring that the CRA always operates with current tool documentation.

2.7.4. Component 3—Conflict-Aware Fusion Module (CAFM)

In educational multi-source retrieval, content conflicts arise frequently: different textbooks present conflicting definitions (e.g., AP Chemistry vs. IB Chemistry nomenclature), problem-solving methodologies differ across instructors, or retrieved web content contradicts institutional curriculum standards. Unlike general-purpose RAG systems that naively concatenate retrieved passages [16], CAFM explicitly detects and resolves these conflicts through a four-stage pipeline, ensuring that students receive authoritative, consistent information.

Stage 1—Atomic Claim Decomposition. Retrieved passages from HERAP and CTOL are decomposed into atomic educational claims using structured extraction prompting. Each claim is represented as a subject–predicate–object triple with an associated confidence score and source identifier. For example, the passage “Photosynthesis converts CO₂ and H₂O into glucose and O₂ using light energy” yields the atomic claims: (photosynthesis, input, {CO₂, H₂O, light}), (photosynthesis, output, {C₆H₁₂O₆, O₂}). This decomposition enables fine-grained conflict detection at the claim level rather than the passage level.

Stage 2—Educational Conflict Detection. Contradictory claims are identified using a combination of semantic similarity scoring and logical negation detection. Two claims $c_{i}$ and $c_{j}$ are flagged as potentially conflicting when:

$$ \mathrm{sim}(c_{i}, c_{j}) > \tau_{\mathrm{topic}} \;\land\; \mathrm{neg}(c_{i}, c_{j}) > \tau_{\mathrm{neg}} \tag{3} $$

where $\tau_{\mathrm{topic}} = 0.85$ ensures the claims address the same concept, and $\tau_{\mathrm{neg}} = 0.70$ detects semantic negation or value contradiction. These thresholds are derived from established semantic similarity benchmarks: $\tau_{\mathrm{topic}} = 0.85$ aligns with the paraphrase detection threshold reported in sentence-transformer evaluations [38], while $\tau_{\mathrm{neg}} = 0.70$ reflects the lower bound for reliable negation detection in NLI-based models; both are configurable per domain. This dual-threshold approach reduces false positives (unrelated claims) while catching genuine educational conflicts such as conflicting values for physical constants, contradictory process explanations, or incompatible nomenclature conventions across curricula.
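The dual-threshold test in Equation (3) reduces to a conjunction of two comparisons. The sketch below assumes the similarity and negation scores have already been produced by upstream models, which are not shown; the function name and the example scores are hypothetical.

```python
def is_conflict(sim_score, neg_score, tau_topic=0.85, tau_neg=0.70):
    """Flag a claim pair only if it is on-topic AND contradictory."""
    return sim_score > tau_topic and neg_score > tau_neg

# (similarity, negation) scores for three illustrative claim pairs.
same_topic_contradiction = is_conflict(0.92, 0.81)  # e.g., two different values for one constant
same_topic_agreement = is_conflict(0.93, 0.10)      # paraphrases of the same statement
unrelated_claims = is_conflict(0.30, 0.90)          # different concepts, filtered by tau_topic
```

Requiring both conditions is what filters out the third case: a high negation score between unrelated claims is not treated as an educational conflict.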

Stage 3—Pedagogical Authority Ranking (PAR). Detected conflicts are resolved by computing an authority score for each conflicting source:

$$ \mathrm{PAR}(s) = \lambda_{1} \cdot \mathrm{inst}(s) + \lambda_{2} \cdot \mathrm{bloom}(s, B) + \lambda_{3} \cdot \mathrm{recency}(s) \tag{4} $$

where $\mathrm{inst}(s)$ is the institutional authority score (instructor-uploaded materials: 1.0; adopted textbooks: 0.8; open textbooks: 0.6; web content: 0.3), $\mathrm{bloom}(s, B)$ measures alignment between the source’s cognitive level and the student’s target Bloom level $B$, and $\mathrm{recency}(s)$ is a temporal decay factor favoring recent editions. Default weights are $\lambda_{1} = 0.5$, $\lambda_{2} = 0.3$, $\lambda_{3} = 0.2$, configurable per institution. The source with the highest PAR score is designated as the primary authority, while conflicting sources are retained as supplementary viewpoints.
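A worked instance of Equation (4) is sketched below. The institutional scores and default weights are taken from the text; the Bloom-alignment and recency inputs are assumed to lie in [0, 1] and are hypothetical example values, as are the function and key names.

```python
# Institutional authority scores as listed in the text.
INST_AUTHORITY = {
    "instructor": 1.0,
    "adopted_textbook": 0.8,
    "open_textbook": 0.6,
    "web": 0.3,
}

def par_score(source_type, bloom_alignment, recency, lambdas=(0.5, 0.3, 0.2)):
    """Weighted sum of institutional authority, Bloom alignment, and recency."""
    l1, l2, l3 = lambdas
    return l1 * INST_AUTHORITY[source_type] + l2 * bloom_alignment + l3 * recency

textbook = par_score("adopted_textbook", bloom_alignment=0.9, recency=0.5)  # 0.40 + 0.27 + 0.10
web_page = par_score("web", bloom_alignment=1.0, recency=1.0)               # 0.15 + 0.30 + 0.20
primary = "adopted_textbook" if textbook > web_page else "web"
```

Even with perfect Bloom alignment and recency, the web source scores 0.65 against 0.77 for the adopted textbook, so the textbook is designated the primary authority under the default weights.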

Stage 4—Citation-Aware Synthesis. The final response explicitly attributes each claim to its source using inline citation notation (e.g., “According to [Textbook A, Ch. 5], …; however, [Textbook B, Ch. 3] presents an alternative formulation…”). This transparency serves a dual pedagogical purpose: (a) enabling students to verify information independently, promoting source literacy as a secondary learning outcome; and (b) making the system’s reasoning auditable by instructors. The citation format is configurable per deployment (APA, IEEE, or simplified inline references for K-12 contexts).

2.7.5. Component 4—Learner Profile Manager (LPM)

The LPM maintains a structured, multi-dimensional student model that persists across sessions and informs all other EduMSRA components. Unlike single-session chatbot interactions that lose context after each conversation [43], the LPM provides continuity by tracking four complementary dimensions of learner state:

Dimension 1—Academic Profile. Current enrolled courses, active learning objectives, grade history, and curriculum progression milestones. These data are sourced from LMS platforms via CTOL Tier C tools (with explicit student consent) and provide the institutional context for personalizing retrieval scope and difficulty calibration.

Dimension 2—Knowledge State ($K_{t}$). Estimated mastery level per concept node in the Curriculum Knowledge Graph, updated after each interaction via Bayesian Knowledge Tracing (BKT) [2]. For each concept $c$ in the CKG, the LPM maintains a posterior mastery probability:

$$ P(L_{t+1}^{c}) = P(L_{t}^{c} \mid \mathrm{obs}_{t}) + \bigl(1 - P(L_{t}^{c} \mid \mathrm{obs}_{t})\bigr) \cdot P(\mathrm{transit}) \tag{5} $$

where $P(L_{t}^{c} \mid \mathrm{obs}_{t})$ is the updated mastery estimate given the student’s observed response at time $t$, and $P(\mathrm{transit})$ is the learning transition probability. The BKT parameters ($P(\mathrm{init})$, $P(\mathrm{transit})$, $P(\mathrm{guess})$, $P(\mathrm{slip})$) are initialized from population-level estimates [29] and refined per-student as interaction data accumulate. The knowledge state vector $K_{t}$ directly influences HERAP retrieval depth: low-mastery concepts trigger more comprehensive retrieval (all three tiers), while high-mastery concepts use Tier 1 only for quick reference.
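One BKT step can be sketched as follows. The conditioning on the observation uses the standard BKT Bayes update with guess and slip probabilities, followed by the transition step of Equation (5); the function name and the parameter values in the example are hypothetical.

```python
def bkt_update(p_mastery, correct, p_transit, p_guess, p_slip):
    """Return P(L_{t+1}) after observing one correct or incorrect response."""
    if correct:
        numerator = p_mastery * (1 - p_slip)
        denominator = numerator + (1 - p_mastery) * p_guess
    else:
        numerator = p_mastery * p_slip
        denominator = numerator + (1 - p_mastery) * (1 - p_guess)
    posterior = numerator / denominator               # P(L_t | obs_t)
    return posterior + (1 - posterior) * p_transit    # Equation (5)

after_correct = bkt_update(0.3, True, p_transit=0.1, p_guess=0.2, p_slip=0.1)
after_wrong = bkt_update(0.3, False, p_transit=0.1, p_guess=0.2, p_slip=0.1)
```

Starting from a mastery estimate of 0.3, a correct answer raises the estimate to roughly 0.69, while an incorrect one lowers it to roughly 0.15, which is the signal the LPM would pass on to retrieval-depth and scaffolding decisions.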

Dimension 3—Interaction History. Episodic memory of past queries, retrieved sources, generated explanations, and student feedback (ratings, follow-up questions, time on task). This history is stored in a vector database with temporal indexing, enabling the CRA to reference prior explanations (e.g., “As we discussed in your previous session on thermodynamics…”) and avoid redundant retrieval of previously presented content.

Dimension 4—Cognitive Preferences. Inferred preferred explanation modality (formal/informal, example-based/proof-based, visual/textual, step-by-step/holistic) derived from interaction feedback patterns using collaborative filtering across the student population. Students who consistently engage more deeply with worked examples receive example-heavy responses; those who prefer formal proofs receive theorem-first presentations.

LPM outputs directly influence HERAP (retrieval depth calibrated to knowledge state), CAFM (authority ranking adjusted by institutional affiliation), and PPA (explanation style and Bloom’s level targeting). This integration realizes the personalization vision of PRAG-EDU [42] while extending it from single-source to multi-source, multi-session contexts. Critically, all LPM data are encrypted at rest (AES-256) and in transit (TLS 1.3), with student-controlled data portability and deletion rights enforced at the API level.

2.7.6. Component 5—Pedagogical Policy Agent (PPA)

The PPA ensures that EduMSRA’s generated responses are pedagogically appropriate by aligning content with Bloom’s revised taxonomy [14] and applying adaptive scaffolding based on the learner’s Zone of Proximal Development (ZPD) [28]. The PPA operates through two interconnected mechanisms:

Bloom’s Taxonomy Mapper. The PPA classifies each incoming query to one of six cognitive levels from Bloom’s revised taxonomy [15] and conditions the CRA’s generation prompt accordingly:

Remember (retrieve factual knowledge): Generate concise definitions with key terms highlighted. Retrieval limited to Tier 1.
Understand (explain concepts): Generate conceptual explanations with analogies and visual descriptions. Invoke Tier 1 + Tier 2 retrieval.
Apply (use knowledge in new situations): Generate worked examples with step-by-step justification. Invoke computation tools via CTOL for verification.
Analyze (break into parts, find relationships): Generate comparative analyses across retrieved sources. Invoke CKG traversal for prerequisite mapping.
Evaluate (judge, critique): Present multiple perspectives from conflicting sources (via CAFM), prompting the student to assess evidence quality.
Create (synthesize new solutions): Guide the student through problem decomposition without providing direct answers, invoking the Scaffolding Controller.

This mapping ensures that a student at the “Understand” level querying a concept receives conceptual explanations with analogies, while the same concept queried at the “Apply” level generates worked examples with computational verification. Bloom’s level is determined jointly by the QCE output and the student’s current mastery level from the LPM: students with high mastery ($P(L_{t}^{c}) > 0.8$) are automatically elevated to higher Bloom levels to promote deeper learning [1].

Scaffolding Controller. When the LPM indicates that a student is near the mastery threshold for a concept ($0.6 \leq P(L_{t}^{c}) \leq 0.8$, the ZPD zone), the Scaffolding Controller activates a graduated support strategy inspired by Vygotsky’s ZPD theory [28] and the fading scaffolding approach in AutoTutor [44]:

Full scaffolding ( $P (L_{t}^{c}) < 0.4$ ): Direct explanation with complete worked examples and explicit prerequisite review.
Partial scaffolding ( $0.4 \leq P (L_{t}^{c}) < 0.6$ ): Guided hints with partially completed solutions; student fills in key steps.
Socratic scaffolding ( $0.6 \leq P (L_{t}^{c}) < 0.8$ ): Socratic questioning that guides reasoning without revealing answers; follow-up probes based on student responses.
Minimal scaffolding ( $P (L_{t}^{c}) \geq 0.8$ ): Challenge problems at higher Bloom levels with minimal guidance; emphasis shifts to evaluation and creation tasks.
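The four mastery bands listed above map directly onto a threshold function. This is a minimal sketch; the function name and the returned labels are hypothetical.

```python
def scaffolding_level(p_mastery):
    """Map a BKT mastery probability to one of the four scaffolding bands."""
    if p_mastery < 0.4:
        return "full"
    if p_mastery < 0.6:
        return "partial"
    if p_mastery < 0.8:
        return "socratic"
    return "minimal"

levels = [scaffolding_level(p) for p in (0.25, 0.40, 0.60, 0.80)]
```

Each lower bound is inclusive, so a student at exactly 0.6 receives Socratic scaffolding and a student at exactly 0.8 receives minimal scaffolding.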

To the best of our knowledge, this four-level scaffolding strategy represents the first integration of ZPD-aware adaptive support with MCP-enabled multi-source retrieval in an AI tutoring architecture. By tying scaffolding intensity directly to the BKT-estimated mastery probability (Equation (5)), the PPA avoids both under-supporting struggling students and over-supporting advanced learners, a persistent challenge in traditional ITS designs [3].

2.7.7. Formal Architecture Specification

We formalize the EduMSRA processing pipeline as a composition of typed functions. Let $Q$ denote a student query, $U$ denote the student’s LPM profile, $S = \{s_{1}, \dots, s_{n}\}$ denote registered MCP educational sources, $B \in \{\text{Remember}, \text{Understand}, \text{Apply}, \text{Analyze}, \text{Evaluate}, \text{Create}\}$ denote the target Bloom level, and $K_{t}$ denote the cumulative knowledge state vector at session $t$.

Phase 1—Query Classification:

$$ \mathrm{QCE} : Q \times U \to \{\text{Factual}, \text{Conceptual}, \text{Relational}\} \times B \tag{6} $$

Phase 2—Multi-Source Retrieval:

$$ \mathrm{HERAP} : Q \times U \times B \to \{(d_{i}, c_{i}, l_{i})\}_{i=1}^{k} \tag{7} $$

$$ \mathrm{CTOL} : Q \times S \times U \to \{(t_{j}, r_{j})\}_{j=1}^{m} \tag{8} $$

where each tuple $(d_{i}, c_{i}, l_{i})$ represents a retrieved document, its confidence score, and its Bloom level alignment; and $(t_{j}, r_{j})$ represents a tool invocation and its result.

Phase 3—Conflict-Aware Fusion:

$$ \mathrm{CAFM} : (\mathrm{HERAP} \cup \mathrm{CTOL}) \times K_{t} \to C_{\mathrm{edu}} \tag{9} $$

where $C_{\mathrm{edu}}$ is the conflict-resolved, citation-annotated educational context.

Phase 4—Policy Generation:

$$ \mathrm{PPA} : Q \times B \times U \to \pi = (\text{style}, \text{scaffold\_level}, \text{bloom\_target}) \tag{10} $$

Phase 5—Response Generation and State Update:

$$ \mathrm{CRA} : Q \times C_{\mathrm{edu}} \times \pi \times K_{t} \to (R_{\mathrm{edu}}, K_{t+1}) \tag{11} $$

The composition $\mathrm{CRA} \circ \mathrm{PPA} \circ \mathrm{CAFM} \circ (\mathrm{HERAP} \parallel \mathrm{CTOL}) \circ \mathrm{QCE}$ defines the complete EduMSRA pipeline. This formulation provides four formal guarantees for every generated educational response $R_{\mathrm{edu}}$: (a) grounding: $R_{\mathrm{edu}}$ is derived from verified multi-source curriculum content $C_{\mathrm{edu}}$ with explicit provenance; (b) alignment: $R_{\mathrm{edu}}$ targets the student’s Bloom level $B$ as determined by QCE and LPM; (c) personalization: $R_{\mathrm{edu}}$ adapts to the student’s knowledge state $K_{t}$ and cognitive preferences in $U$; and (d) transparency: $R_{\mathrm{edu}}$ includes inline citations enabling independent verification. The parallel composition $\mathrm{HERAP} \parallel \mathrm{CTOL}$ indicates that retrieval and tool orchestration execute concurrently, reducing end-to-end latency.
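The five-phase composition, including the concurrent HERAP and CTOL step, can be sketched with stub functions. Every component body below is a hypothetical placeholder that only mirrors the type signatures in Equations (6) to (11); none of it reflects a real implementation, and the thread pool is just one possible way to run the two retrieval paths concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub components: placeholders that return fixed values of the right shape.
def qce(query, profile):
    return "Conceptual", "Understand"

def herap(query, profile, bloom):
    return [("doc-1", 0.9, bloom)]          # (document, confidence, Bloom alignment)

def ctol(query, sources, profile):
    return [("tool-1", "result")]           # (tool invocation, result)

def cafm(documents, tool_results, k_t):
    return {"context": documents + tool_results}

def ppa(query, bloom, profile):
    return {"style": "informal", "scaffold_level": "partial", "bloom_target": bloom}

def cra(query, c_edu, policy, k_t):
    return f"response to {query!r}", k_t    # a real CRA would also update k_t

def run_pipeline(query, profile, sources, k_t):
    _, bloom = qce(query, profile)                          # Phase 1
    with ThreadPoolExecutor(max_workers=2) as pool:         # Phase 2: HERAP || CTOL
        docs_future = pool.submit(herap, query, profile, bloom)
        tools_future = pool.submit(ctol, query, sources, profile)
        documents, tool_results = docs_future.result(), tools_future.result()
    c_edu = cafm(documents, tool_results, k_t)              # Phase 3
    policy = ppa(query, bloom, profile)                     # Phase 4
    return cra(query, c_edu, policy, k_t)                   # Phase 5

response, k_next = run_pipeline("What is entropy?", {}, [], {"entropy": 0.5})
```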

3. Results

Part II—Architectural Proposal and Empirical Implementation. Building on the survey and comparative analysis of Part I (Section 2), this section presents the EduMSRA architecture, the experimental evaluation road map, the literature-based Bayesian evidence synthesis (Section 3.2), and the proof-of-concept implementation (Section 3.3).

3.1. Datasets and Evaluation Road Map

3.1.1. Recommended Datasets for Empirical Validation

Table 3 summarizes the nine curated benchmark datasets of varying sizes, as illustrated in Figure 4, spanning K-12 through research-level education. Primary benchmarks (*) were selected for their direct relevance to EduMSRA’s core components: D1 tests hierarchical multi-hop retrieval and D8 evaluates learner personalization. All datasets are publicly available; access links are provided in Appendix A.

3.1.2. Four-Experiment Evaluation Protocol

We propose four experiments to evaluate each EduMSRA component against established baselines (Table 4). Each experiment isolates a specific architectural contribution, enabling ablation-style validation.

3.1.3. Evaluation Metrics

Retrieval Quality

Context Precision: Fraction of retrieved chunks that are relevant to the query, measured by LLM-as-Judge annotation.
Context Recall: Fraction of ground-truth supporting facts covered by retrieved context.
NDCG@5: Normalized Discounted Cumulative Gain measuring ranked retrieval quality at top-five results.
MCP Tool Invocation Accuracy: Percentage of tool calls with correctly specified parameters and successful execution.
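As a concrete reference for the NDCG@5 metric above, a minimal sketch follows. It uses linear gain with a log2 position discount and computes the ideal ordering from the retrieved list itself, which assumes all relevant items appear in that list; the graded relevance labels are hypothetical.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0, 0])   # already in ideal order
shuffled = ndcg_at_k([0, 1, 3, 2, 0])  # relevant items ranked too low
```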

Educational Quality

Bloom’s Level Alignment Score: Inter-rater agreement between the PPA’s target Bloom level and human expert annotation of the generated response’s cognitive demand (Cohen’s $κ$ ).
Personalization Score (LaMP metric [18]): ROUGE-L between generated response and learner-profile-tailored reference response.
Pedagogical Coherence: Human expert rating (five-point Likert scale) of response alignment with curriculum objectives, conducted on n = 200 sampled responses.

System Efficiency

End-to-End Latency: P50 and P95 response times from query submission to complete response delivery.
Context Window Utilization: Fraction of available context window used by retrieved + profile context, measuring compression efficiency.
Cost per Interaction: Estimated API cost in USD per student interaction, relevant for scalability analysis.

3.2. Literature-Based Evidence Synthesis via Bayesian Methods

This subsection reports Bayesian analyses of published literature and simulated learner trajectories, not empirical validation of EduMSRA itself. The pooled effects here should be read as indicative literature synthesis that motivates EduMSRA’s design direction; direct empirical validation of the EduMSRA pipeline is reported separately in Section 3.3 using a proof-of-concept implementation.

To strengthen this survey beyond a purely qualitative synthesis, we conducted three Bayesian analyses that provide quantitative context for key claims underlying EduMSRA’s design rationale. These analyses complement the literature review by offering empirical grounding for the architectural decisions in Section 2, while acknowledging that full validation requires the proposed experiments in Section 3. All computations were performed using Python (NumPy/SciPy) (https://www.python.org/, accessed on 27 April 2026) and cross-verified in R (jsonlite) (https://www.r-project.org/, accessed on 27 April 2026), with source code available from the corresponding author. Posterior distributions are summarized using 95% Highest Density Intervals (HDIs) computed from 10,000 Monte Carlo samples.

3.2.1. Bayesian Meta-Analysis of RAG Effectiveness

We extracted reported effect sizes from 12 studies across our reviewed corpus and performed a random-effects Bayesian meta-analysis using the DerSimonian–Laird (DL) estimator for $\tau^{2}$, with posterior inference conducted via 10,000 Monte Carlo samples to mitigate the known downward bias of DL in small-$k$ settings [78]. Let $y_{i}$ and $\sigma_{i}$ denote the observed effect and standard error of study $i$. The random-effects model assumes:

$$y_{i} \sim N(\mu, \sigma_{i}^{2} + \tau^{2}), \quad i = 1, \dots, k \tag{12}$$

where $\mu$ is the pooled effect, and $\tau^{2}$ captures between-study heterogeneity. The pooled estimate is $\hat{\mu} = 0.511$ (95% HDI: $[0.250, 0.790]$), with substantial heterogeneity ($I^{2} = 99.3\%$, $\tau = 0.472$).
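The estimator named above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the four effect sizes and standard errors are hypothetical, and the Monte Carlo step assumes a normal posterior for $\mu$ centred on the DL estimate (flat prior, plug-in $\tau^{2}$), which the text does not specify.

```python
import math
import random

def dersimonian_laird(effects, std_errors):
    """Return (pooled effect, its standard error, tau^2, I^2) via the DL estimator."""
    k = len(effects)
    w = [1.0 / se ** 2 for se in std_errors]              # inverse-variance weights
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # DL moment estimate of tau^2
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]  # variances as in Eq. (12)
    mu_hat = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_mu = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return mu_hat, se_mu, tau2, i2

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples."""
    s = sorted(samples)
    n_in = math.ceil(mass * len(s))
    start = min(range(len(s) - n_in + 1), key=lambda i: s[i + n_in - 1] - s[i])
    return s[start], s[start + n_in - 1]

# Hypothetical inputs, NOT the 12 studies analysed in the paper.
effects = [0.20, 0.50, 0.80, 0.35]
std_errors = [0.10, 0.15, 0.20, 0.12]

mu_hat, se_mu, tau2, i2 = dersimonian_laird(effects, std_errors)
rng = random.Random(0)
# Assumption: normal posterior for mu given the plug-in tau^2.
draws = [rng.gauss(mu_hat, se_mu) for _ in range(10_000)]
low, high = hdi(draws)
print(f"mu_hat={mu_hat:.3f}, tau2={tau2:.3f}, I2={100 * i2:.1f}%, "
      f"95% HDI=[{low:.3f}, {high:.3f}]")
```

Because $\tau^{2}$ is plugged in rather than sampled, the interval from this sketch would tend to be narrower than one from a fully Bayesian treatment of the heterogeneity parameter.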

Methodological caveat: The 12 studies report diverse outcome metrics (accuracy improvement and context efficiency), which represent fundamentally different constructs. The near-ceiling $I^{2}$ reflects this construct heterogeneity rather than sampling variance alone, and the pooled estimate should therefore be interpreted as an indicative summary of the general direction and magnitude of RAG benefits rather than a precise effect size for any single outcome [78]. A subgroup analysis by metric type would be more appropriate but is precluded by the small number of studies per category ($k < 5$). As a robustness check, a leave-one-out sensitivity analysis yielded pooled estimates ranging from $\hat{\mu} = 0.387$ (excluding Bloom’s two-sigma benchmark [1], the most influential study with $d = 2.0$) to $\hat{\mu} = 0.547$ (excluding Yan et al. [40]), with 10 of 12 exclusions producing $\hat{\mu} \in [0.44, 0.55]$, confirming that the positive direction was robust even when individual studies were removed, though the magnitude was sensitive to the inclusion of classic tutoring benchmarks. Despite this limitation, the forest plot (Figure 5) and posterior distribution (Figure 6) show that all 12 individual study effects are positive, with the 95% HDI excluding zero, providing converging evidence that RAG-based interventions yield beneficial effects across educational applications and supporting the rationale for EduMSRA’s RAG-centric design.
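The leave-one-out check described above amounts to re-running the pooled estimate with each study removed in turn. The sketch below shows the mechanics under stated assumptions: the five effects and standard errors are hypothetical (the last one mimics a single large effect such as $d = 2.0$), and `pooled_dl` is an illustrative helper name, not a function from the authors' code.

```python
def pooled_dl(effects, std_errors):
    """Random-effects pooled estimate using a DerSimonian-Laird tau^2."""
    k = len(effects)
    w = [1.0 / se ** 2 for se in std_errors]
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    return sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)

def leave_one_out(effects, std_errors):
    """Pooled estimate after dropping each study in turn (index i -> estimate)."""
    estimates = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_errors = std_errors[:i] + std_errors[i + 1:]
        estimates.append(pooled_dl(kept_effects, kept_errors))
    return estimates

# Hypothetical data; the last study plays the role of a high-leverage outlier.
effects = [0.30, 0.45, 0.50, 0.60, 2.00]
std_errors = [0.10, 0.12, 0.15, 0.10, 0.25]

loo = leave_one_out(effects, std_errors)
for i, est in enumerate(loo):
    print(f"excluding study {i}: mu_hat = {est:.3f}")
```

In this toy input, dropping the outlier gives the lowest pooled estimate while every estimate stays positive, which is the same qualitative pattern the sensitivity analysis reports.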

3.2.2. BKT Simulation: Illustrating Adaptive Scaffolding Dynamics

To illustrate how EduMSRA’s Learner Profile Manager (Section 2.7.5) would drive adaptive scaffolding decisions, we simulated Bayesian Knowledge Tracing (BKT) [2] across five learner ability profiles ($a \in \{0.30, 0.50, 0.65, 0.80, 0.95\}$) over 50 item responses. We emphasize that this simulation demonstrates the expected behavior of BKT-based scaffolding thresholds defined in Section 2.7.6, rather than validating EduMSRA’s architecture empirically. The BKT update rule applies Bayes’ theorem at each step:

$$P(L_{n} \mid \text{correct}) = \frac{(1 - P(S)) \cdot P(L_{n-1})}{(1 - P(S)) \cdot P(L_{n-1}) + P(G) \cdot (1 - P(L_{n-1}))} \tag{13}$$

followed by the learning transition:

$$P(L_{n}) = P(L_{n} \mid \text{obs}) + (1 - P(L_{n} \mid \text{obs})) \cdot P(T) \tag{14}$$

with standard parameters: $P(L_{0}) = 0.10$, $P(T) = 0.20$, $P(S) = 0.05$, $P(G) = 0.25$ [2].
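A minimal Python sketch of Equations (13) and (14) with the parameters quoted above follows. Two points are assumptions rather than details from the text: the incorrect-response branch uses the standard BKT counterpart of Equation (13), and each response is drawn as correct with probability equal to the ability value $a$, since the mapping from ability to responses is not stated.

```python
import random

# Parameters quoted in the text: P(L0), P(T), P(S), P(G).
P_L0, P_T, P_S, P_G = 0.10, 0.20, 0.05, 0.25

def bkt_update(p_prev, correct):
    """One BKT step: Bayesian evidence update, then the learning transition."""
    if correct:
        # Eq. (13): posterior mastery given a correct response.
        num = (1 - P_S) * p_prev
        den = num + P_G * (1 - p_prev)
    else:
        # Standard BKT counterpart for an incorrect response (not shown in the text).
        num = P_S * p_prev
        den = num + (1 - P_G) * (1 - p_prev)
    posterior = num / den
    # Eq. (14): chance of learning the skill after the observation.
    return posterior + (1 - posterior) * P_T

def simulate(ability, n_items=50, seed=0):
    """Mastery trajectory for one simulated learner."""
    rng = random.Random(seed)
    p = P_L0
    trajectory = []
    for _ in range(n_items):
        correct = rng.random() < ability  # assumption: ability = P(correct response)
        p = bkt_update(p, correct)
        trajectory.append(p)
    return trajectory

for a in (0.30, 0.50, 0.65, 0.80, 0.95):
    print(f"a={a:.2f}: final P(L) = {simulate(a)[-1]:.3f}")
```

The printed values depend on the random seed and on the assumed response model, so they are not expected to reproduce the figures reported for the paper's simulation.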

Figure 7 shows that learners with ability $a \geq 0.50$ converge to mastery ($P(L_{n}) > 0.95$) within 15–30 responses, while the low-ability learner ($a = 0.30$) reaches only $P(L_{50}) = 0.672$. These trajectories illustrate why EduMSRA’s PPA scaffolding thresholds (Section 2.7.6) are designed to provide graduated support: full scaffolding when $P(L_{t}^{c}) < 0.4$, partial when $0.4 \leq P(L_{t}^{c}) < 0.6$, Socratic when $0.6 \leq P(L_{t}^{c}) < 0.8$, and minimal when $P(L_{t}^{c}) \geq 0.8$, consistent with Vygotsky’s ZPD framework [28]. Empirical validation with real learner interaction data remains necessary to confirm these thresholds in practice.
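The four mastery bands above translate directly into a lookup. The sketch below is illustrative only; the function name and the string labels are hypothetical, while the boundaries are the ones stated in the text.

```python
def scaffolding_level(p_mastery):
    """Map a BKT mastery estimate P(L) to the scaffolding band described above."""
    if not 0.0 <= p_mastery <= 1.0:
        raise ValueError("mastery must be a probability in [0, 1]")
    if p_mastery < 0.4:
        return "full"
    if p_mastery < 0.6:
        return "partial"
    if p_mastery < 0.8:
        return "socratic"
    return "minimal"

# The low-ability learner's reported end state, P(L_50) = 0.672, falls in the Socratic band.
print(scaffolding_level(0.672))
```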

3.2.3. Bayesian Dataset Difficulty Estimation

We applied a Beta-Binomial conjugate model to characterize the assumed difficulty spectrum of nine benchmark datasets. The difficulty prior $d_{j}$ for each dataset was derived from its target educational level (e.g., elementary → low $d_{j}$, research → high $d_{j}$). Given $n_{j}$ items, the posterior is:

$$\theta_{j} \mid \text{data} \sim \text{Beta}(1 + n_{j} \cdot d_{j},\; 1 + n_{j} \cdot (1 - d_{j})) \tag{15}$$

Methodological note: Because $d_{j}$ encodes the assumed difficulty, and $n_{j}$ is large for most datasets, the posterior is strongly dominated by the prior assumption rather than by observed performance data. The resulting estimates therefore reflect a formalized encoding of the intended difficulty hierarchy rather than an independent empirical validation. Future work should replace $d_{j}$ with actual accuracy rates from published leaderboards to obtain data-driven difficulty estimates.
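The posterior in Equation (15) and its 95% HDI can be computed with the Python standard library alone. In the sketch below, $n_{j} = 5957$ is the OpenBookQA total from Table A2 and $d_{j} = 0.20$ is an assumed prior chosen for illustration; whether the analysis used the full item count as $n_{j}$ is also an assumption.

```python
import math
import random

def beta_posterior(n_items, d_prior):
    """Shape parameters (alpha, beta) of the Beta posterior in Eq. (15)."""
    return 1 + n_items * d_prior, 1 + n_items * (1 - d_prior)

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples."""
    s = sorted(samples)
    n_in = math.ceil(mass * len(s))
    start = min(range(len(s) - n_in + 1), key=lambda i: s[i + n_in - 1] - s[i])
    return s[start], s[start + n_in - 1]

# Assumed inputs: n from the Table A2 total for OpenBookQA, d = 0.20 as the difficulty prior.
n_items, d_prior = 5957, 0.20
alpha, beta = beta_posterior(n_items, d_prior)
mean = alpha / (alpha + beta)            # closed-form posterior mean

rng = random.Random(0)
draws = [rng.betavariate(alpha, beta) for _ in range(10_000)]
low, high = hdi(draws)
print(f"E[theta] = {mean:.3f}, 95% HDI = [{low:.3f}, {high:.3f}]")
```

With several thousand items the posterior mean is essentially $d_{j}$ and the interval is very narrow, which illustrates the point of the methodological note: the estimate restates the prior.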

Figure 8 presents the posterior difficulty estimates with 95% HDI. The intended difficulty progression, from elementary-level datasets (OpenBookQA: $E[\theta] = 0.200$) through K-12 (AI2-ARC, ScienceQA) to research-level (SciQAG: $E[\theta] = 0.850$, LaMP: $E[\theta] = 0.750$), confirms that the selected benchmarks span a broad cognitive range suitable for evaluating EduMSRA’s components across Bloom’s taxonomy levels.

3.3. Proof-of-Concept Implementation and Empirical Pipeline Tracing

We implemented a minimum-viable prototype of EduMSRA in Python. The PoC exercised all five components on a toy educational corpus (ten chunks spanning physics, biology, chemistry, mathematics, computer science, and history) with ten example student queries mapped to five skills.

Design fidelity. Three of the five components used production algorithms rather than placeholders: HERAP implemented BM25 retrieval (rank_bm25), CAFM computed TF-IDF cosine similarity with a configurable conflict threshold, and LPM applied the standard BKT update rule (Equations (13) and (14), unchanged from the canonical formulation in [2]). CTOL used a mock MCP dispatcher (glossary, LMS, solver tools) and PPA a transparent Bloom-keyword heuristic.
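To make the HERAP retrieval step concrete, the sketch below scores a toy corpus with a self-contained Okapi BM25 function. It is not the `rank_bm25` implementation used in the PoC: the IDF here is the common non-negative variant $\ln\big((N - df + 0.5)/(df + 0.5) + 1\big)$, the parameters `k1` and `b` are conventional defaults, and the three documents are hypothetical stand-ins for the PoC chunks.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for a whitespace-tokenised query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption; rank_bm25 uses a different form).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical three-chunk corpus.
docs = [
    "newton second law states that force equals mass times acceleration",
    "photosynthesis converts light energy into chemical energy in plants",
    "covalent bonds share electron pairs between atoms",
]
scores = bm25_scores("newton second law", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```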

End-to-end trace. Figure 9 shows a single query, “What is Newton’s second law?”, flowing through all five modules with concrete latency and score data captured at run time. Figure 10 contrasts the EduMSRA component footprint with standard RAG and agentic RAG architectures across five capability axes.

Per-module latency (N = 30 queries, CPU). Table 5 and Figure 11 show that CAFM dominates the latency budget (83.8% of the 1.68 ms mean total) because TF-IDF index construction is repeated per call in the PoC. HERAP (BM25) contributes 14.1%, while LPM, PPA, and CTOL together account for the remaining share of about 2%. This reveals a concrete optimization target that future work will address by caching TF-IDF vectors and by benchmarking against realistic large corpora.

LPM trajectories (N = 50 simulated interactions). Figure 12 shows BKT mastery curves across five skills. High-base-probability skills (Newton’s laws, photosynthesis) cross the 0.85 mastery threshold within 15–25 interactions, while low-base-probability skills (history, algorithms) remain below threshold, confirming that BKT captures differentiated learner states as expected.

CAFM conflict detection. Figure 13 presents the TF-IDF cosine similarity matrix among the query and its top-five retrieved chunks. Off-diagonal values below the 0.35 conflict threshold trigger CAFM’s conflict flag. In the Newton-law example, the query correctly aligns with the physics textbook chunk (similarity 0.41) and does not align with the chemistry chunk (similarity 0.00), demonstrating conflict signaling on the PoC corpus.
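The flagging rule can be sketched as follows: compute TF-IDF cosine similarity between the query and each chunk, then flag any similarity below the threshold. This is a pure-Python illustration under assumptions; the smoothed IDF formula, the whitespace tokeniser, and the two example chunks are hypothetical, and the PoC's actual TF-IDF configuration is not given in the text.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF; an assumed weighting scheme."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def conflict_flags(query, chunks, threshold=0.35):
    """Return (similarity, flagged) per chunk; flagged means similarity < threshold."""
    vectors = tfidf_vectors([query] + chunks)
    sims = [cosine(vectors[0], vec) for vec in vectors[1:]]
    return [(sim, sim < threshold) for sim in sims]

# Hypothetical query and chunks.
chunks = [
    "newton second law relates force mass and acceleration",
    "covalent bonds share electron pairs between atoms",
]
flags = conflict_flags("newton second law force", chunks)
for (sim, flagged), chunk in zip(flags, chunks):
    print(f"{sim:.2f} flagged={flagged} :: {chunk}")
```

Building the vectors once per corpus and reusing them across calls, instead of rebuilding them inside `conflict_flags`, is the kind of caching the latency discussion points to.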

Scope and limits. The PoC is a scaffolding-grade demonstration: ten documents, ten queries, no LLM generation, no live MCP servers, no human evaluation. It is intended to (i) demonstrate that the EduMSRA pipeline runs end-to-end; (ii) expose a first-order view of latency distribution; and (iii) feed the five illustrative figures (Figures 9–13). Full empirical evaluation remains the subject of the experimental road map in Section 3.2 and of subsequent work.

Threshold sensitivity for EduMSRA parameters. For sensitivity analysis of EduMSRA’s own weights and thresholds (complementing the meta-analysis leave-one-out already reported in Section 3.2), we note two direction-robustness arguments that the PoC already supports without requiring additional runs. First, the CAFM conflict threshold $\tau_{CAFM} = 0.35$ is monotone-preserving: raising (or lowering) $\tau_{CAFM}$ over a continuous range $[0.25, 0.45]$ mechanically increases (or decreases) the conflict flag rate without changing the downstream reconciliation logic, so architectural behavior is well defined across the interval, and the specific value $0.35$ is pinned by the empirical TF-IDF distribution in Figure 13 (median pairwise similarity $\approx 0.10$ for the Newton-law retrieval, justifying a low-to-mid conflict flag). Second, the PPA scaffolding boundaries $(0.40, 0.60, 0.80)$ follow standard BKT convention [2] and inherit its smoothness: because BKT mastery evolves gradually (Figure 7), small perturbations of $\pm 0.05$ on each boundary only re-distribute learners across adjacent scaffolding levels rather than flipping architectural regimes. A systematic three-point sweep, $\tau_{CAFM} \in \{0.25, 0.35, 0.45\}$ crossed with three scaffolding-boundary variants (loose, baseline, strict), is scheduled for the Phase-2 experimental protocol (Section 3.2); at the 30-query PoC scale such a sweep would be under-powered and is therefore deferred. This is the sensitivity counterpart, at the EduMSRA-threshold level, of the leave-one-out sensitivity already reported for the meta-analysis in Section 3.2.
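The monotonicity argument and the planned sweep grid can be illustrated with a short sketch. The similarity values are hypothetical, and the "loose" and "strict" variants are assumed here to shift each baseline boundary by 0.05 down and up respectively; the text names the variants but does not define them.

```python
from itertools import product

def flag_rate(similarities, tau):
    """Fraction of query-chunk similarities that fall below the conflict threshold."""
    return sum(s < tau for s in similarities) / len(similarities)

# Hypothetical similarity values, not taken from Figure 13.
similarities = [0.00, 0.05, 0.10, 0.28, 0.30, 0.41]
taus = (0.25, 0.35, 0.45)
rates = {tau: flag_rate(similarities, tau) for tau in taus}

baseline = (0.40, 0.60, 0.80)
variants = {
    "loose": tuple(round(b - 0.05, 2) for b in baseline),   # assumed definition
    "baseline": baseline,
    "strict": tuple(round(b + 0.05, 2) for b in baseline),  # assumed definition
}

# 3 x 3 grid of (tau, variant name) pairs for the deferred sweep.
grid = list(product(taus, variants))
for tau, name in grid:
    print(f"tau={tau:.2f} boundaries={variants[name]} flag_rate={rates[tau]:.2f}")
```

Because the flag is a simple comparison against $\tau_{CAFM}$, the flag rate can only stay equal or rise as the threshold increases, whatever the underlying similarities are.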

4. Discussion

We identify seven open challenges for advancing EduMSRA from architectural proposal to production deployment.

Multimodal Educational Content. Most RAG-for-education systems process text only, yet ScienceQA [69] shows 63% text-only versus 91% multimodal accuracy. Extending EduMSRA requires multimodal embeddings (e.g., CLIP/SigLIP), cross-modal CAFM conflict detection, and multimodal generation. HERAP’s three-tier architecture is naturally extensible: Tier 1 via image captioning, Tier 2 via multimodal dense encoders, and Tier 3’s CKG via visual asset association [68].
Privacy-Preserving Learner Profiling. EduMSRA’s LPM accesses sensitive data under FERPA/GDPR constraints [45]. Federated learning for BKT parameter estimation, differential privacy for knowledge states [2], and homomorphic encryption during HERAP retrieval would enable personalization without centralizing learner data [5].
Multilingual Educational AI. RAG systems are overwhelmingly English-centric [11]. MCP-connected multilingual repositories (UNESCO OER, national textbook APIs) with cross-lingual embeddings would address this gap. CTOL’s standardized interfaces allow dynamic registration of language-specific MCP servers. The PPA’s Bloom taxonomy mapping is language-independent, though CAFM conflict detection requires language-specific similarity thresholds.
Longitudinal Learning Gain Evaluation. Current benchmarks measure single-interaction accuracy, not longitudinal gains, the ultimate educational measure [1]. Multi-site RCTs across diverse institutions, leveraging CTOL’s xAPI/LRS infrastructure for data collection, represent the gold standard [3,31]. Knowledge tracing in dialogues [29] and graph-based approaches [30] provide methodological foundations.
LLM Bias in Pedagogical Content. LLMs exhibit cultural, gender, and geographic biases that students may internalize as authoritative [5,9]. Mitigation requires diversity-aware re-ranking in HERAP, fairness terms in CAFM’s Pedagogical Authority Ranking, and bias-audited scaffolding templates in PPA. Fairness-annotated educational benchmarks remain an unmet need [10,66].
Cost-Effectiveness and Scalability. EduMSRA’s multi-component pipeline involves multiple LLM calls per query, potentially prohibitive for resource-constrained institutions [6]. Cost-aware query routing (lightweight models for factual queries), CAFM caching across similar queries, and modular MCP adoption (starting with minimal servers) address this challenge [12,13].

Pedagogical grounding: ZPD, Bloom’s revised taxonomy, and self-determination theory. Beyond the engineering challenges enumerated above, EduMSRA’s architecture is deliberately anchored in three learning-theoretic foundations. First, Vygotsky’s Zone of Proximal Development [28] directly motivates the PPA’s graduated scaffolding rule: the mastery thresholds $(0.40, 0.60, 0.80)$ defined in Section 2.7.6 map onto full scaffolding, partial scaffolding, Socratic prompting, and minimal hinting, so that every interaction is placed within the learner’s ZPD rather than below frustration or above boredom. Second, Anderson and Krathwohl’s revised Bloom taxonomy is operationalized in the PPA’s Bloom-keyword router, which tags each retrieved chunk and generated prompt with a cognitive level (remember, understand, apply, analyze, evaluate, create) and enforces within-session cognitive-level progression rather than ad hoc question selection. Third, Deci and Ryan’s Self-Determination Theory, with its triad of autonomy, competence, and relatedness, informs CTOL’s multi-source design (autonomy through choice of authoritative tools and sources), LPM’s visible mastery feedback (competence through transparent progression), and PPA’s adaptive tone (relatedness through pedagogically appropriate persona). These three theories jointly shape EduMSRA into an architecture that is not merely retrieval-efficient but pedagogically principled.
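A keyword router of the kind described for the PPA could look like the sketch below. The keyword lists, the rule of returning the highest matched level, and the fallback to "remember" are all hypothetical choices made for illustration; the text names the six levels but does not list the PoC's keywords.

```python
import re

# Hypothetical keyword lists, ordered from lowest to highest cognitive level.
BLOOM_KEYWORDS = {
    "remember": ("define", "list", "what is", "name"),
    "understand": ("explain", "summarize", "describe"),
    "apply": ("solve", "calculate", "apply"),
    "analyze": ("compare", "contrast", "why"),
    "evaluate": ("justify", "critique", "assess"),
    "create": ("design", "propose", "compose"),
}

def bloom_level(query):
    """Return the highest Bloom level whose keywords occur in the query."""
    text = query.lower()
    matched = [
        level
        for level, keywords in BLOOM_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(kw) + r"\b", text) for kw in keywords)
    ]
    # Assumption: queries with no keyword match default to the lowest level.
    return matched[-1] if matched else "remember"

for q in ("What is Newton's second law?",
          "Compare mitosis and meiosis",
          "Design an experiment to measure g"):
    print(f"{bloom_level(q):>10} <- {q}")
```

Word-boundary matching keeps short keywords from firing inside longer words, which matters for a transparent heuristic that is meant to be audited by hand.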

Cross-domain and domain-specific RAG comparisons. Beyond the seven internal challenges and the pedagogical grounding above, EduMSRA’s educational framing can be contrasted with domain-specific RAG-LLM deployments in adjacent fields. Wang and Zhang [79] present GISedu-GPT, a large-language-model framework with prior knowledge for GIS education, demonstrating the value of domain-prior integration for specialized educational AI; Zhang et al. [80] describe GeoGPT as an assistant for geospatial task understanding, which—while not itself educational—illustrates the general pattern of domain-grounded LLM assistants that EduMSRA adapts for cross-curricular education. Parallel cross-domain RAG applications further illustrate RAG’s breadth: Jahanbakhsh et al. [81] leverage RAG for automated smart-home orchestration, and James et al. [82] use RAG to generate knowledge assets and action drivers. These parallels reinforce the generality of the retrieval-grounded agentic paradigm that EduMSRA specializes in for adaptive intelligent tutoring, motivating the architectural synthesis summarized in the concluding section below.

5. Conclusions

This paper presented a systematic survey and architectural proposal addressing the critical intersection of Retrieval-Augmented Generation, the Model Context Protocol, and intelligent educational systems. Through a review of studies from 2022 to 2025, we traced the evolution from static document RAG chatbots to adaptive, multi-source agentic tutoring systems. Our comparative analysis revealed that while significant progress has been made in individual dimensions—personalization (PRAG-EDU [42]), STEM tutoring (Modran et al. [6]), and multi-source retrieval (agentic RAG [20])—no existing architecture simultaneously achieves all five required educational capabilities.

EduMSRA addresses this gap through five tightly integrated components: a Hierarchical Educational RAG Pipeline (HERAP) with three-tier curriculum retrieval; an MCP-based Curriculum Tool Orchestration Layer (CTOL) providing standardized access to heterogeneous educational sources; a Conflict-Aware Fusion Module (CAFM) resolving inter-source contradictions with pedagogical authority ranking; a Learner Profile Manager (LPM) maintaining multi-dimensional student models across sessions; and a Pedagogical Policy Agent (PPA) aligning all outputs with Bloom’s Revised Taxonomy and ZPD-based scaffolding principles.

The proposed experimental evaluation road map, spanning nine benchmark datasets and four targeted experiments with well-established baselines, provides a reproducible empirical foundation for validating EduMSRA. Our Bayesian empirical analyses (Section 3.2) provide supporting evidence: the meta-analysis indicates a positive direction for RAG effectiveness across diverse metrics ($\hat{\mu} = 0.511$, 95% HDI excluding zero), though substantial construct heterogeneity warrants cautious interpretation; the BKT simulation illustrates how the proposed scaffolding thresholds interact with learner ability trajectories; and the dataset difficulty characterization confirms that the selected benchmarks span the intended cognitive range. As MCP adoption accelerates across the educational technology ecosystem and LLM reasoning capabilities continue to advance, EduMSRA represents a timely, principled, and pedagogically grounded architectural vision for the next generation of AI tutoring systems.

Author Contributions

Conceptualization, T.-L.H. and T.-P.L.; methodology, T.-L.H.; writing—original draft preparation, T.-L.H.; writing—review and editing, T.-L.H. and T.-P.L.; supervision, T.-L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable (no human subjects studied). Future experiments involving learner data will require IRB approval.

Informed Consent Statement

Informed consent was obtained from all individuals included in this study.

Data Availability Statement

All benchmark datasets referenced in Section 3 are publicly available. Access links are provided in Appendix A. No private datasets were used. Our project’s code is published in a Kaggle kernel (https://www.kaggle.com/code/thanhphonglamq/edumsra-poc-mdpi-applsci-16-04400, accessed on 27 April 2026).

Acknowledgments

This research was supported by Ton Duc Thang University. The authors would like to thank our collaborators for their valuable contributions and support throughout the completion of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HDI	Highest Density Interval
RAG	Retrieval-Augmented Generation
MCP	Model Context Protocol
ITS	Intelligent Tutoring System
LLM	Large Language Model
CAFM	Conflict-Aware Fusion Module
LPM	Learner Profile Manager
PoC	Proof of Concept
PPA	Pedagogical Policy Agent
EduMSRA	Educational Multi-Source Research Agent

Appendix A. Benchmark Dataset Access Links

Table A1 provides direct access links for all 10 benchmark datasets referenced in Section 3. All datasets are publicly available through HuggingFace Hub or their original repositories.

Table A1. Access links for the 10 curated benchmark datasets. (*) = primary benchmarks.

#	Dataset	Platform	Access Path/URL
D1 *	AI2-ARC Challenge	HuggingFace	allenai/ai2_arc (ARC-Challenge)
D2	OpenBookQA	HuggingFace	allenai/openbookqa
D3	ScienceQA	HuggingFace	derek-thomas/ScienceQA
D4	TQA/CK12-QA	HuggingFace	yyyyifan/TQA
D5	MMLU (Edu subset)	HuggingFace	cais/mmlu (all)
D7	SciQ	HuggingFace	allenai/sciq
D8 *	LaMP	HuggingFace	alireza7/LaMP-QA (Art_and_Entertainment)
D9	SciQAG	HuggingFace	emrekuruu/SciQAG
D10	KILT	HuggingFace	facebook/kilt_tasks (nq)

All datasets were downloaded and verified using the HuggingFace datasets library (v3.x). The total corpus comprises 278,767 items across approximately 6.7 GB.

Appendix B. Dataset Split Distributions

Table A2 reports the train/test/validation split sizes for the nine successfully downloaded benchmark datasets. These splits follow the original distributions provided by dataset authors.

Table A2. Train/test/validation split distributions for nine benchmark datasets. (*) = primary benchmarks.

#	Dataset	Train	Test	Val	Total	Level
D1 *	AI2-ARC	1119	1172	299	2590	K-12
D2	OpenBookQA	4957	500	500	5957	Elementary
D3	ScienceQA	12,726	4241	4241	21,208	K-12
D4	CK12-TQA	6501	3285	2781	12,567	Middle
D5	MMLU	99,842	14,042	1531	115,415	College
D7	SciQ	11,679	1000	1000	13,679	High School
D8 *	LaMP	9349	767	801	10,917	Post-sec.
D9	SciQAG	4496	0	0	4496	Research
D10	KILT	87,372	1444	2837	91,653	General

Notable observations: (1) MMLU and KILT dominate the corpus by size, providing extensive evaluation capacity for cross-domain generalization (Experiment E4); (2) SciQAG provides only a training split, requiring custom evaluation partitioning; (3) the primary benchmarks (D1, D8) offer balanced splits suitable for standard train/test evaluation protocols; and (4) the total verified item count (278,482) is consistent with the reported corpus size in Section 3.

References

Bloom, B.S. The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring. Educ. Res. 1984, 13, 4–16.
Corbett, A.T.; Anderson, J.R. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Model. User Adapt. Interact. 1994, 4, 253–278.
VanLehn, K. The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educ. Psychol. 2011, 46, 197–221.
Liu, Z.; Agrawal, P.; Singhal, S.; Madaan, V.; Kumar, M.; Verma, P.K. LPITutor: An LLM Based Personalized Intelligent Tutoring System Using RAG and Prompt Engineering. PeerJ Comput. Sci. 2025, 11, e2991.
Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274.
Modran, H.A. Leveraging RAG with ACP & MCP for Adaptive Intelligent Tutoring. Appl. Sci. 2025, 15, 11443.
Yan, L.; Sha, L.; Zhao, L.; Li, Y.; Martinez-Maldonado, R.; Chen, G.; Li, X.; Jin, Y.; Gasevic, D. Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review. Br. J. Educ. Technol. 2024, 55, 90–112.
Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv 2023, arXiv:2311.05232.
Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248.
Tonmoy, S.; Zaman, S.; Jain, V.; Rani, A.; Rawber, A.; Chadha, A.; Das, A. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv 2024, arXiv:2401.01313.
Swacha, J.; Gracel, M. Retrieval-Augmented Generation (RAG) Chatbots for Education: A Survey of Applications. Appl. Sci. 2025, 15, 4234.
Anthropic. Introducing the Model Context Protocol; Technical Report; Anthropic: San Francisco, CA, USA, 2024.
Hou, X.; Zhao, Y.; Wang, S.; Wang, H. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv 2025, arXiv:2503.23278.
Anderson, L.; Krathwohl, D. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives; Longman: Harlow, UK, 2001; ISBN 978-0321084057.
Bloom, B. Taxonomy of Educational Objectives: The Classification of Educational Goals; Longmans, Green and Co.: New York, NY, USA, 1956; ISBN 978-0679302117.
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401.
Li, Z.; Wang, Z.; Wang, W.; Hung, K.; Xie, H.; Wang, F.L. Retrieval-Augmented Generation for Educational Application: A Systematic Survey. Comput. Educ. Artif. Intell. 2025, 8, 100417.
Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J.C. Adaptive-RAG: Learning to Adapt Retrieval-Augmented LLMs through Question Complexity. arXiv 2024, arXiv:2403.14403.
Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2024, arXiv:2310.11511.
Singh, A.; Ehtesham, A.; Kumar, S.; Khoei, T.T.; Vasilakos, A.V. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv 2025, arXiv:2501.09136.
Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. arXiv 2023, arXiv:2212.10509.
Jiang, J.; Chen, J.; Li, J.; Ren, R.; Wang, S.; Zhao, W.X.; Song, Y.; Zhang, T. RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement. arXiv 2024, arXiv:2412.12881.
Luo, Z.; Shen, Z.; Yang, W.; Zhao, Z.; Jwalapuram, P.; Saha, A.; Sahoo, D.; Savarese, S.; Xiong, C.; Li, J. MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers. arXiv 2025, arXiv:2508.14704.
Fan, S.; Ding, X.; Zhang, L.; Mo, L. MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark. arXiv 2025, arXiv:2508.07575.
Wang, Z.; Chang, Q.; Patel, H.; Biju, S.; Wu, C.; Liu, Q.; Ding, A.; Rezazadeh, A.; Shah, A.; Bao, Y.; et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers. arXiv 2025, arXiv:2508.20453.
Zhang, D.; Li, Z.; Luo, X.; Liu, X.; Li, P.; Xu, W. MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents. arXiv 2025, arXiv:2510.15994.
Radosevich, B.; Halloran, J. MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits. arXiv 2025, arXiv:2504.03767.
Vygotsky, L.S. Mind in Society: The Development of Higher Psychological Processes; Harvard University Press: Cambridge, MA, USA, 1978; ISBN 978-0674576292.
Scarlatos, A.; Baker, R.S.; Lan, A. Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs. arXiv 2024, arXiv:2409.16490.
Cui, J.; Qian, H.; Jiang, B.; Zhang, W. Leveraging Pedagogical Theories to Understand Student Learning Process with Graph-based Reasonable Knowledge Tracing. arXiv 2024, arXiv:2406.12896.
Liu, V.; Latif, E.; Zhai, X. Advancing Education through Tutoring Systems: A Systematic Literature Review. arXiv 2025, arXiv:2503.09748.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. arXiv 2022, arXiv:2203.02155.
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense Passage Retrieval for Open-Domain Question Answering. arXiv 2020, arXiv:2004.04906.
Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval Augmented Language Model Pre-Training. arXiv 2020, arXiv:2002.08909.
Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv 2021, arXiv:2007.01282.
Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving Language Models by Retrieving from Trillions of Tokens. arXiv 2022, arXiv:2112.04426.
Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997.
Jiang, Z.; Xu, F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active Retrieval Augmented Generation. arXiv 2023, arXiv:2305.06983.
Yan, S.-Q.; Gu, J.-C.; Zhu, Y.; Ling, Z.-H. Corrective Retrieval Augmented Generation (CRAG). arXiv 2024, arXiv:2401.15884.
Schick, T.; Dwivedi-Yu, J.; Dessi, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv 2024, arXiv:2302.04761.
Nguyen, e.a. Towards Personalized AI Education: Context-Aware RAG With Grade-Level LLM Adaptation. Comput. Appl. Eng. Educ. 2026, 34, e70153.
Levonian, Z.; Li, C.; Zhu, W.; Gade, A.; Henkel, O.; Postle, M.E.; Xing, W. Retrieval-Augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference. In Proceedings of the NeurIPS 2023 Workshop on Generative AI for Education (GAIED), New Orleans, LA, USA, 15 December 2023.
Graesser, A.C.; D’Mello, S.; Hu, X.; Cai, Z.; Olney, A.; Morgan, B. AutoTutor. In Applied Natural Language Processing: Identification, Investigation and Resolution; IGI Global: Hershey, PA, USA, 2012; pp. 169–187.
Wampler, D.; Nielson, D.; Seddighi, A. Engineering the RAG Stack: A Comprehensive Review of the Architecture and Trust Frameworks for Retrieval-Augmented Generation Systems. arXiv 2025, arXiv:2601.05264.
Singh, A.; Ehtesham, A.; Kumar, S.; Khoei, T.T. A Survey of the Model Context Protocol (MCP): Standardizing Context to Enhance LLMs. Preprints 2025, 2025040245.
Fei, X.; Zheng, X.; Feng, H. MCP-Zero: Active Tool Discovery for Autonomous LLM Agents. arXiv 2025, arXiv:2506.01056.
Lumer, E.; Gulati, A.; Subbiah, V.; Basavaraju, P.; Burke, J. ScaleMCP: Dynamic and Auto-Synchronizing Model Context Protocol Tools for LLM Agents. arXiv 2025, arXiv:2505.06416.
Hasan, M.; Li, H.; Rajbahadur, G.; Adams, B.; Hassan, A. Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions. arXiv 2026, arXiv:2602.14878.
Li, Q.; Xie, Y. From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems. arXiv 2025, arXiv:2505.03864.
Ehtesham, A.; Singh, A.; Gupta, G.; Kumar, S. A Survey of Agent Interoperability Protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP). arXiv 2025, arXiv:2505.02279.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
Zhu, X.; Chang, S.; Kuik, A. Enhancing Critical Thinking with AI: A Tailored Warning System for RAG Models. arXiv 2025, arXiv:2504.16883.
Shi, W.; Min, S.; Yasunaga, M.; Seo, M.; James, R.; Lewis, M.; Zettlemoyer, L.; Yih, W. REPLUG: Retrieval-Augmented Black-Box Language Models. arXiv 2024, arXiv:2301.12652.
Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. In-Context Retrieval-Augmented Language Models. Trans. Assoc. Comput. Linguist. 2023, 11, 1316–1331.
Ma, X.; Gong, Y.; He, P.; Zhao, H.; Duan, N. Query Rewriting in Retrieval-Augmented Large Language Models. arXiv 2023, arXiv:2305.14283.
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv 2023, arXiv:2210.03629.
Shinn, N.; Cassano, F.; Berman, E.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv 2023, arXiv:2303.11366.
Chen, Y.; Yan, L.; Sun, W.; Ma, X.; Zhang, Y.; Wang, S.; Yin, D.; Yang, Y.; Mao, J. Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning. arXiv 2025, arXiv:2501.15228.
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents. Front. Comput. Sci. 2024, 18, 186345.
Aghajani Asl, M.; Asgari-Bidhendi, M.; Minaei-Bidgoli, B. FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation. arXiv 2025, arXiv:2510.22344.
Zhang, Z.; Bo, X.; Ma, C.; Li, R.; Chen, X.; Dai, Q.; Zhu, J.; Dong, Z.; Wen, J.-R. A Survey on the Memory Mechanism of Large Language Model Based Agents. arXiv 2024, arXiv:2404.13501.
Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2025, arXiv:2404.16130.
Sarthi, P.; Abdullah, S.; Tuli, A.; Khanna, S.; Goldie, A.; Manning, C.D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. arXiv 2024, arXiv:2401.18059.
Baek, J.; Aji, A.; Saffari, A. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. arXiv 2024, arXiv:2306.04136.
Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv 2024, arXiv:2309.15217.
Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv 2024, arXiv:2309.01431.
Abootorabi, M.M.; Zobeiri, A.; Dehghani, M.; Mohammadkhani, M.; Mohammadi, B.; Ghahroodi, O.; Soleymani Baghshah, M.; Asgari, E. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. arXiv 2025, arXiv:2502.08826.
Lu, P.; Mishra, S.; Xia, T.; Qiu, L.; Chang, K.-W.; Zhu, S.-C.; Tafjord, O.; Clark, P.; Kalyan, A. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. arXiv 2022, arXiv:2209.09513.
Ray, P. A Survey on Model Context Protocol: Architecture, State-of-the-Art, Challenges. TechRxiv 2025. TechRxiv:174495492.22752319. [Google Scholar]
Liu, W.; Liu, Z.; Dai, E.; Yu, W.; Yu, L.; Yang, T.; Han, J.; Gao, H. MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use. arXiv 2025, arXiv:2512.24565. [Google Scholar]
Wang, W.; Niu, P.; Xu, Z.; Chen, Z.; Du, J.; Du, Y.; Pang, X.; Huang, K.; Wang, Y.; Yan, Q.; et al. MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools. arXiv 2025, arXiv:2510.24284. [Google Scholar]
Nie, X.; Guo, Z.; Chen, Y.; Zhou, Y.; Zhang, W. AWCP: A Workspace Delegation Protocol for Deep-Engagement Collaboration across Remote Agents. arXiv 2026, arXiv:2602.20493. [Google Scholar]
Mousavinasab, E.; Zarifsanaiey, N.; Niakan Kalhori, S.R.; Rakhshan, M.; Keikha, L.; Ghazi Saeedi, M. Intelligent Tutoring Systems: A Systematic Review of Characteristics, Applications, and Evaluation Methods. Interact. Learn. Environ. 2021, 29, 142–163. [Google Scholar] [CrossRef]
Khan Academy. Khanmigo: AI-Powered Teaching and Learning Assistant. 2024. Available online: https://www.khanmigo.ai/ (accessed on 15 March 2026).
Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding (MMLU). arXiv 2021, arXiv:2009.03300. [Google Scholar]
Wan, Y.; Liu, Y.; Ajith, A.; Grazian, C.; Hoex, B.; Zhang, W.; Kit, C.; Xie, T.; Foster, I. SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation. arXiv 2024, arXiv:2405.09939. [Google Scholar]
Higgins, J.P.T.; Thompson, S.G.; Deeks, J.J.; Altman, D.G. Measuring Inconsistency in Meta-analyses. BMJ 2003, 327, 557–560. [Google Scholar] [CrossRef]
Wang, Z.; Zhang, Y.; Min, W.; Guan, Q.; Yu, W. GISedu-GPT: A Large Language Model Framework with Prior Knowledge for GIS Education. J. Geogr. High. Educ. 2025, 50, 72–99. [Google Scholar] [CrossRef]
Zhang, Y.; Wei, C.; He, Z.; Yu, W. GeoGPT: An Assistant for Understanding and Processing Geospatial Tasks. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104019. [Google Scholar] [CrossRef]
Jahanbakhsh, N.; Vega-Barbas, M.; Pau, I.; Elvira-Martín, L.; Moosavi, H.; García-Vázquez, C. Leveraging RAG for Automated Smart Home Orchestration. Future Internet 2025, 17, 198. [Google Scholar] [CrossRef]
James, A.; Trovati, M.; Bolton, S. RAG to Generate Knowledge Assets and Creation of Action Drivers. Appl. Sci. 2025, 15, 6247. [Google Scholar] [CrossRef]

Figure 1. Three-phase evolution of RAG in education: from static document retrieval (2022–2023), through adaptive query-aware retrieval (2023–2024), to agentic multi-source architectures with MCP integration (2024–2025). EduMSRA is positioned as the first unified architecture spanning all three phases [6,16,19,20,38,40].

Figure 2. High-level architecture of EduMSRA. Five specialized components surround the Central Reasoning Agent (CRA). Solid arrows indicate primary data flows; dashed arrows indicate profile-driven adaptation signals from the Learner Profile Manager (LPM). External educational sources connect through the CTOL layer via MCP-compliant interfaces.

Figure 3. Component interaction flow in EduMSRA. A student query passes through five processing phases. The LPM provides learner state information to multiple phases (dashed arrows), and the feedback loop updates the knowledge state $K_{t+1}$ after each interaction.

Figure 4. Benchmark dataset sizes (number of items). Dark blue bars indicate primary benchmarks (D1, D8). Nine datasets were successfully downloaded and verified.

Figure 5. Forest plot of Bayesian random-effects meta-analysis across 12 studies. Square markers indicate individual study effects with 95% CIs; the vertical green line and shaded region show the pooled estimate $\hat{\mu} = 0.511$ (95% HDI: $[0.250, 0.790]$).

Figure 6. Posterior distribution of the pooled RAG effect size. The 95% HDI excludes zero, providing strong evidence for RAG effectiveness in educational contexts.

Figure 7. BKT mastery trajectories for five simulated learner profiles. Higher-ability learners converge to mastery ($P(L_n) > 0.95$) rapidly, while low-ability learners require extended scaffolding, validating EduMSRA’s adaptive intervention design.

Figure 8. Bayesian difficulty estimation of nine benchmark datasets using a Beta-Binomial model. Error bars show 95% HDI. Dark bars indicate primary benchmarks.

Figure 9. End-to-end PoC trace of a single student query passing through the five EduMSRA components with real per-module latency and scores.

Figure 10. Capability comparison between standard RAG, agentic RAG, and EduMSRA across retrieval, tool orchestration, conflict handling, learner state, and pedagogy. Green cells denote EduMSRA’s novel contributions [6,16].

Figure 11. Per-module latency measured on the PoC over 30 queries (mean ± std). CAFM’s TF-IDF recomputation dominates (83.8% of total), motivating vector caching in future work.

Figure 12. LPM Bayesian Knowledge Tracing trajectories over 50 simulated interactions for five skills. The dashed line marks the 0.85 mastery threshold used by PPA scaffolding rules.
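
The trajectories in Figure 12 follow from the standard Bayesian Knowledge Tracing update, which can be sketched as below. This is a minimal illustration, not the PoC code: the parameter values (`p_init`, `p_learn`, `p_slip`, `p_guess`) and the helper names are assumptions chosen for the example; only the 0.85 mastery threshold comes from the caption.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # All four values are hypothetical; the paper's fitted values are not given here.
    p_init: float = 0.2    # P(L_0): prior probability the skill is already mastered
    p_learn: float = 0.15  # P(T): probability of learning the skill after a step
    p_slip: float = 0.1    # P(S): probability of an error despite mastery
    p_guess: float = 0.2   # P(G): probability of a correct answer without mastery


MASTERY_THRESHOLD = 0.85  # threshold stated in the Figure 12 caption


def bkt_update(p_mastery: float, correct: bool, params: BKTParams) -> float:
    """One BKT step: condition on the observed response, then apply learning."""
    if correct:
        num = p_mastery * (1 - params.p_slip)
        den = num + (1 - p_mastery) * params.p_guess
    else:
        num = p_mastery * params.p_slip
        den = num + (1 - p_mastery) * (1 - params.p_guess)
    posterior = num / den
    return posterior + (1 - posterior) * params.p_learn


def trajectory(responses, params: BKTParams):
    """Mastery estimates P(L_0), P(L_1), ... for a sequence of responses."""
    p = params.p_init
    out = [p]
    for correct in responses:
        p = bkt_update(p, correct, params)
        out.append(p)
    return out


def interactions_to_mastery(traj, threshold: float = MASTERY_THRESHOLD):
    """Index of the first estimate at or above the threshold, or None."""
    for i, p in enumerate(traj):
        if p >= threshold:
            return i
    return None


if __name__ == "__main__":
    traj = trajectory([True, False, True, True, True], BKTParams())
    print([round(p, 3) for p in traj], interactions_to_mastery(traj))
```

With these assumed parameters, a correct response raises the mastery estimate and an incorrect one lowers it, which is the behavior the scaffolding rules rely on when comparing the estimate against the threshold.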

Figure 13. CAFM TF-IDF cosine similarity matrix for the Newton-law query and five retrieved chunks. Cells below the 0.35 threshold are flagged as conflicts by CAFM.
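
A similarity matrix of the kind shown in Figure 13 can be sketched in plain Python as follows. The tokenizer, the smoothed IDF weighting (the same form scikit-learn uses by default), the function names, and the sample texts are all assumptions made for illustration; the caption supplies only the TF-IDF cosine measure and the 0.35 threshold.

```python
import math
from collections import Counter

CONFLICT_THRESHOLD = 0.35  # threshold stated in the Figure 13 caption


def tokenize(text):
    """Lowercase whitespace tokenizer; a deliberately simple assumption."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return cleaned.split()


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors using smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]


def flag_conflicts(matrix, threshold=CONFLICT_THRESHOLD):
    """Index pairs (i, j), i < j, whose similarity falls below the threshold."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] < threshold]


if __name__ == "__main__":
    # Hypothetical query plus retrieved chunks, not the PoC data.
    texts = [
        "Newton second law force equals mass times acceleration",
        "Force equals mass times acceleration according to Newton",
        "Photosynthesis converts light energy into chemical energy",
    ]
    m = similarity_matrix(texts)
    print(flag_conflicts(m))
```

Note that a low lexical similarity only signals that two passages share few weighted terms; treating it as a conflict flag follows the thresholding rule in the caption rather than any semantic contradiction check.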

Table 1. Comparative analysis of AI architectures for intelligent educational applications.

Architecture	Adaptive Retrieval	Multi-Source (Edu)	MCP Standard	Citation-Aware	Learner Personalization	Educational Application
Naive RAG [16]	No	No	No	No	None	Static document QA
Advanced RAG [38]	Partial	Limited	No	No	None	Textbook QA
AutoTutor-LLM [44]	No	No	No	No	Session	Dialogue tutoring
Agentic RAG [20]	Yes	Partial	No	No	Session	ITS (limited)
GraphRAG [63]	Yes	No	No	Partial	None	Curriculum mapping
Khanmigo [75]	Partial	Limited	No	No	Partial	Math/writing tutor
RAG+ACP [6]	Yes	Partial	Partial	No	Session	STEM tutoring
PRAG-EDU [42]	Yes	No	No	Partial	Grade	Personalized QA
EduMSRA (Proposed ^†)	Yes ^†	Yes ^†	Yes ^†	Yes ^†	Full ^†	Holistic Ed. Agent

Note: All systems except EduMSRA are assessed based on published implementations or deployments. ^† EduMSRA capabilities are assessed based on architectural specification in this paper and have not yet been empirically validated; “Yes” denotes design-level capability, not demonstrated performance. AutoTutor-LLM refers to LLM-enhanced variants of the AutoTutor dialogue framework [44]. Khanmigo assessed from publicly available documentation [75]. Baselines assessed from primary publications and systematic reviews [3,74].

Table 2. Educational challenges and EduMSRA solutions.

Challenge	Description	EduMSRA Solution
Curriculum Fragmentation	Knowledge across textbooks, slides, videos, assessments	CTOL connects heterogeneous sources via MCP
Learner Heterogeneity	Diverse prior knowledge, pace, style	LPM personalizes retrieval depth and vocabulary
Hallucination	Incorrect explanations in high-stakes learning	Citation-Aware Generation with source attribution
Knowledge Staleness	Semester content updates beyond LLM cutoff	Dynamic MCP-connected curriculum index
Multi-hop Gaps	Cross-chapter synthesis questions	HERAP 3-tier retrieval (keyword, semantic, graph)
Assessment Alignment	Misalignment with Bloom’s taxonomy levels	PPA maps content to cognitive levels
Source Conflicts	Contradictory definitions across textbooks	CAFM detects and reconciles contradictions
Data Privacy	FERPA/GDPR for student records	3-tier Permission Sandbox with audit logging
Scalability	Diverse LMS platforms across institutions	MCP open protocol for plug-and-play integration

Table 3. Curated benchmark datasets for EduMSRA evaluation. (*) = primary benchmarks.

#	Dataset	Size	Type	Level	Subject
D1 *	AI2-ARC Challenge	7787	MCQ	K-12	Science
D2	OpenBookQA	5957	MCQ	Elementary	General Sci.
D3	ScienceQA	21,208	MCQ+Img	K-12	Multi-subj.
D4	TQA/CK12-QA	26,260	MCQ+TF	Middle	Science
D5	MMLU (Edu)	∼4000	MCQ	College	STEM+Hum.
D7	SciQ	13,679	MCQ	High School	Natural Sci.
D8 *	LaMP	∼10 K users	Multi-task	Post-sec.	General
D9	SciQAG	960 K	Open QA	Research	Multi-domain
D10	KILT	∼700 K	Multi-task	General	Wikipedia

Table 4. EduMSRA experimental evaluation protocol.

Exp.	Dataset	Research Question	Baseline	Key Metrics
E1	AI2-ARC	Hierarchical vs. single-stage retrieval for multi-hop QA?	Naive RAG + GPT-4o	Accuracy, Context Precision, NDCG@5
E2	LaMP	LPM improves personalization vs. generic RAG?	PRAG-EDU, Standard RAG	ROUGE-L, Personalization Score
E3	ScienceQA + TQA	CAFM vs. majority-vote fusion?	Majority Vote, RAG Fusion	Conflict Resolution F1, Attribution Acc.
E4	MMLU (Edu)	Cross-domain generalization?	Advanced RAG, LPITutor	Cross-domain Acc., MCP Success Rate
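
Of the metrics listed for E1, NDCG@5 has a standard closed form that can be sketched directly. The version below uses linear gain with a log2 rank discount and computes the ideal ordering from the graded relevances of the retrieved list itself; both choices, and the function names, are assumptions, since the protocol table does not fix a variant.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ordering.

    `relevances` lists graded relevance labels in ranked order.
    Returns 0.0 when no item is relevant.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Hypothetical relevance labels for one query's top-5 retrieved chunks.
    print(round(ndcg_at_k([3, 0, 2, 1, 0], k=5), 4))
```

Averaging `ndcg_at_k` over all queries in a benchmark gives the per-system score that would be compared between hierarchical and single-stage retrieval.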

Table 5. Per-module latency measured on the PoC over 30 queries (mean ± std, milliseconds, free-tier Kaggle CPU).

Module	Mean (ms)	Std (ms)
HERAP (BM25)	0.236	0.733
CTOL (MCP mock)	0.006	0.004
CAFM (TF-IDF)	1.406	2.212
LPM (BKT)	0.003	0.004
PPA (Bloom)	0.014	0.004
Total	1.677	—
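
The per-module shares implied by Table 5 can be checked with a few lines of arithmetic. The values below are copied from the table; the variable names are illustrative only.

```python
# Mean latency per module in milliseconds, copied from Table 5.
latency_ms = {
    "HERAP (BM25)": 0.236,
    "CTOL (MCP mock)": 0.006,
    "CAFM (TF-IDF)": 1.406,
    "LPM (BKT)": 0.003,
    "PPA (Bloom)": 0.014,
}
reported_total_ms = 1.677  # "Total" row of Table 5

# Share of the reported total attributable to each module.
share = {name: mean / reported_total_ms for name, mean in latency_ms.items()}

# Sum of the listed module means, for comparison with the reported total.
sum_of_means_ms = sum(latency_ms.values())

if __name__ == "__main__":
    for name, frac in share.items():
        print(f"{name}: {frac:.1%}")
    print(f"sum of module means: {sum_of_means_ms:.3f} ms")
```

CAFM's share, 1.406 / 1.677, rounds to 83.8%, matching the Figure 11 caption. The listed module means sum to 1.665 ms, slightly below the reported total of 1.677 ms; the table does not say where the remaining 0.012 ms comes from.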

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

