EduMSRA: A Multi-Source Educational Research Agent Integrating Retrieval-Augmented Generation and Model Context Protocol for Adaptive Intelligent Tutoring Systems
Abstract
1. Introduction
- RQ1: What are the current architectures, capabilities, and limitations of RAG-based educational AI systems, and how have they evolved from static document retrieval to agentic paradigms?
- RQ2: How can the Model Context Protocol (MCP) be leveraged to standardize tool orchestration and data access in educational AI, and what are the associated security and compliance requirements?
- RQ3: What architectural design is necessary and sufficient to simultaneously achieve multi-source retrieval, learner-adaptive personalization, pedagogical alignment, and citation-aware generation within a unified educational AI framework?
- A systematic taxonomy of RAG architectures in educational contexts, revealing the progression from static document chatbots to agentic educational assistants.
- A comprehensive review of MCP adoption in educational AI, including analysis of MCP benchmarks, security threats, and the mapping of MCP primitives to educational use cases.
- A comparative analysis of seven existing architectures across five educational capability dimensions, identifying the critical gap that motivates EduMSRA.
- The EduMSRA architecture—a novel framework with five specialized components: Hierarchical Educational RAG Pipeline (HERAP), MCP-based Curriculum Tool Orchestration Layer (CTOL), Conflict-Aware Fusion Module (CAFM), Learner Profile Manager (LPM), and Pedagogical Policy Agent (PPA) aligned with Bloom’s taxonomy.
- A curated experimental road map specifying nine published benchmark datasets and four targeted evaluation experiments with baselines and metrics.
- Three Bayesian literature-based evidence syntheses supporting EduMSRA’s design rationale (not claiming direct empirical validation of the EduMSRA pipeline): a random-effects meta-analysis of published RAG effects (, 95% HDI , heterogeneity flagged as indicative only), a BKT simulation of scaffolding dynamics, and a Beta-Binomial characterization of benchmark difficulty priors.
- A proof-of-concept (PoC) implementation that exercises all five components on a toy corpus, produces concrete per-module latency numbers (HERAP 0.236 ms, CTOL 0.006 ms, CAFM 1.406 ms, LPM 0.003 ms, PPA 0.014 ms; mean total 1.677 ms over 30 queries on free-tier Kaggle CPU) and BKT trajectories across five skills. The PoC demonstrates the pipeline runs end-to-end and reveals CAFM as the dominant latency contributor (83.8%), a concrete optimization target for future work. The prototype also ships with 58 pytest unit tests (nine for HERAP, nine for CTOL, seven for CAFM, 10 for LPM, 13 for PPA, six for the orchestrator, four for fixture integrity; full run 0.14 s).
2. Materials and Methods
2.1. Intelligent Tutoring Systems: Evolution and Limitations
2.2. Retrieval-Augmented Generation in Educational Contexts
2.3. The Model Context Protocol in Education
2.4. RAG Architectures in Education
2.4.1. Phase 1—Static Document RAG for Education (2022–2023)
2.4.2. Phase 2—Adaptive and Personalized RAG (2023–2024)
2.4.3. Phase 3—Agentic RAG for Education (2024–2025)
2.4.4. Graph-Based, Multimodal, and Evaluation Approaches
2.5. MCP in Educational AI: Architecture and Ecosystem
2.5.1. MCP Primitives Mapped to Educational Use Cases
- Tools enable state-changing operations such as automated grading, Student Information System (SIS) record updates, and dynamic quiz generation. The LLM agent invokes verified API endpoints to execute these operations with audit trails.
- Resources provide read-only access to contextual data including syllabuses, textbook passages, and student performance histories. MCP servers expose these as typed URIs that the LLM ingests for informed tutoring responses.
- Prompts offer reusable, parameterized templates for specific pedagogical objectives, enabling standardized Socratic questioning, scaffolded problem decomposition, and formative feedback generation across different subject domains.
- Sampling supports human-in-the-loop validation by delegating complex evaluation tasks (e.g., essay scoring, creative assessment) to specialized pedagogical models or human educators before returning results.
2.5.2. MCP Server Taxonomy for Education
2.5.3. MCP Ecosystem Maturity and Benchmarking
2.5.4. Security and Privacy Considerations in Educational MCP
2.5.5. MCP Adoption Barriers in Education
2.6. Comparative Analysis of Architectures for Educational AI
2.6.1. Evaluation Framework
2.6.2. Key Findings
2.7. Proposed Architecture: EduMSRA
2.7.1. Architecture Overview
2.7.2. Component 1—Hierarchical Educational RAG Pipeline (HERAP)
- Factual queries (e.g., “What is Avogadro’s number?”) → Tier 1 only, minimizing latency.
- Conceptual queries (e.g., “Explain the relationship between pressure and volume”) → Tier 1 + Tier 2, combining exact matches with semantic expansion.
- Relational queries (e.g., “How does thermodynamics connect to chemical equilibrium?”) → All three tiers, with CKG traversal providing cross-concept linking.
2.7.3. Component 2—MCP-Based Curriculum Tool Orchestration Layer (CTOL)
- Tier A—Read-Only Curriculum Content (Trust: Public). MCP servers providing access to open educational resources: OpenStax textbook APIs, Khan Academy content endpoints [75], ERIC database queries, CK-12 content, and institutional LMS document APIs. These tools require no authentication beyond API keys and return only publicly available educational content. All retrieved content is cached with TTL-based invalidation to reduce external API load.
- Tier B—Sandboxed Computation (Trust: Isolated). Execution environments for mathematical problem verification (Python/Jupyter), symbolic computation (WolframAlpha API), and code interpretation for CS education. Each computation request executes in a containerized sandbox with CPU/memory limits (default: 2 vCPU, 512 MB RAM, 30 s timeout), network isolation (no outbound connections), and filesystem restrictions (read-only access to problem datasets only). This design prevents code injection attacks while enabling rich computational support for STEM subjects.
- Tier C—Restricted Learner Data (Trust: Authenticated). LMS-grade APIs (Canvas LTI 1.3, Moodle Web Services), xAPI/LRS endpoints for learning analytics, and institution SIS interfaces. Access requires: (a) explicit student consent tokens compliant with FERPA [45] and GDPR regulations; (b) OAuth 2.0 institutional authentication; and (c) cryptographic audit logging of all data access events. The LPM (Section 2.7.5) is the sole consumer of Tier C data within EduMSRA.
2.7.4. Component 3—Conflict-Aware Fusion Module (CAFM)
2.7.5. Component 4—Learner Profile Manager (LPM)
2.7.6. Component 5—Pedagogical Policy Agent (PPA)
- Remember (retrieve factual knowledge): Generate concise definitions with key terms highlighted. Retrieval limited to Tier 1.
- Understand (explain concepts): Generate conceptual explanations with analogies and visual descriptions. Invoke Tier 1 + Tier 2 retrieval.
- Apply (use knowledge in new situations): Generate worked examples with step-by-step justification. Invoke computation tools via CTOL for verification.
- Analyze (break into parts, find relationships): Generate comparative analyses across retrieved sources. Invoke CKG traversal for prerequisite mapping.
- Evaluate (judge, critique): Present multiple perspectives from conflicting sources (via CAFM), prompting the student to assess evidence quality.
- Create (synthesize new solutions): Guide the student through problem decomposition without providing direct answers, invoking the Scaffolding Controller.
- Full scaffolding (): Direct explanation with complete worked examples and explicit prerequisite review.
- Partial scaffolding (): Guided hints with partially completed solutions; student fills in key steps.
- Socratic scaffolding (): Socratic questioning that guides reasoning without revealing answers; follow-up probes based on student responses.
- Minimal scaffolding (): Challenge problems at higher Bloom levels with minimal guidance; emphasis shifts to evaluation and creation tasks.
2.7.7. Formal Architecture Specification
3. Results
3.1. Datasets and Evaluation Road Map
3.1.1. Recommended Datasets for Empirical Validation
3.1.2. Four-Experiment Evaluation Protocol
3.1.3. Evaluation Metrics
Retrieval Quality
- Context Precision: Fraction of retrieved chunks that are relevant to the query, measured by LLM-as-Judge annotation.
- Context Recall: Fraction of ground-truth supporting facts covered by retrieved context.
- NDCG@5: Normalized Discounted Cumulative Gain measuring ranked retrieval quality at top-five results.
- MCP Tool Invocation Accuracy: Percentage of tool calls with correctly specified parameters and successful execution.
Educational Quality
- Bloom’s Level Alignment Score: Inter-rater agreement between the PPA’s target Bloom level and human expert annotation of the generated response’s cognitive demand (Cohen’s ).
- Personalization Score (LaMP metric [18]): ROUGE-L between generated response and learner-profile-tailored reference response.
- Pedagogical Coherence: Human expert rating (five-point Likert scale) of response alignment with curriculum objectives, conducted on n = 200 sampled responses.
System Efficiency
- End-to-End Latency: P50 and P95 response times from query submission to complete response delivery.
- Context Window Utilization: Fraction of available context window used by retrieved + profile context, measuring compression efficiency.
- Cost per Interaction: Estimated API cost in USD per student interaction, relevant for scalability analysis.
3.2. Literature-Based Evidence Synthesis via Bayesian Methods
3.2.1. Bayesian Meta-Analysis of RAG Effectiveness
3.2.2. BKT Simulation: Illustrating Adaptive Scaffolding Dynamics
3.2.3. Bayesian Dataset Difficulty Estimation
3.3. Proof-of-Concept Implementation and Empirical Pipeline Tracing
4. Discussion
- Multimodal Educational Content. Most RAG-for-education systems process text only, yet ScienceQA [69] shows 63% text-only versus 91% multimodal accuracy. Extending EduMSRA requires multimodal embeddings (e.g., CLIP/SigLIP), cross-modal CAFM conflict detection, and multimodal generation. HERAP’s three-tier architecture is naturally extensible: Tier 1 via image captioning, Tier 2 via multimodal dense encoders, and Tier 3’s CKG via visual asset association [68].
- Privacy-Preserving Learner Profiling. EduMSRA’s LPM accesses sensitive data under FERPA/GDPR constraints [45]. Federated learning for BKT parameter estimation, differential privacy for knowledge states [2], and homomorphic encryption during HERAP retrieval would enable personalization without centralizing learner data [5].
- Multilingual Educational AI. RAG systems are overwhelmingly English-centric [11]. MCP-connected multilingual repositories (UNESCO OER, national textbook APIs) with cross-lingual embeddings would address this gap. CTOL’s standardized interfaces allow dynamic registration of language-specific MCP servers. The PPA’s Bloom taxonomy mapping is language-independent, though CAFM conflict detection requires language-specific similarity thresholds.
- Longitudinal Learning Gain Evaluation. Current benchmarks measure single-interaction accuracy, not longitudinal gains, the ultimate educational measure [1]. Multi-site RCTs across diverse institutions, leveraging CTOL’s xAPI/LRS infrastructure for data collection, represent the gold standard [3,31]. Knowledge tracing in dialogues [29] and graph-based approaches [30] provide methodological foundations.
- LLM Bias in Pedagogical Content. LLMs exhibit cultural, gender, and geographic biases that students may internalize as authoritative [5,9]. Mitigation requires diversity-aware re-ranking in HERAP, fairness terms in CAFM’s Pedagogical Authority Ranking, and bias-audited scaffolding templates in PPA. Fairness-annotated educational benchmarks remain an unmet need [10,66].
- Cost-Effectiveness and Scalability. EduMSRA’s multi-component pipeline involves multiple LLM calls per query, potentially prohibitive for resource-constrained institutions [6]. Cost-aware query routing (lightweight models for factual queries), CAFM caching across similar queries, and modular MCP adoption (starting with minimal servers) address this challenge [12,13].
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| HDI | Highest Density Interval |
| RAG | Retrieval-Augmented Generation |
| MCP | Model Context Protocol |
| ITS | Intelligent Tutoring System |
| LLM | Large Language Model |
| CAFM | Conflict-Aware Fusion Module |
| LPM | Learner Profile Manager |
| PoC | Proof of Concept |
| PPA | Pedagogical Policy Agent |
| EduMSRA | Educational Multi-Source Research Agent |
Appendix A. Benchmark Dataset Access Links
| # | Dataset | Platform | Access Path/URL |
|---|---|---|---|
| D1 * | AI2-ARC Challenge | HuggingFace | allenai/ai2_arc (ARC-Challenge) |
| D2 | OpenBookQA | HuggingFace | allenai/openbookqa |
| D3 | ScienceQA | HuggingFace | derek-thomas/ScienceQA |
| D4 | TQA/CK12-QA | HuggingFace | yyyyifan/TQA |
| D5 | MMLU (Edu subset) | HuggingFace | cais/mmlu (all) |
| D7 | SciQ | HuggingFace | allenai/sciq |
| D8 * | LaMP | HuggingFace | alireza7/LaMP-QA (Art_and_Entertainment) |
| D9 | SciQAG | HuggingFace | emrekuruu/SciQAG |
| D10 | KILT | HuggingFace | facebook/kilt_tasks (nq) |
Appendix B. Dataset Split Distributions
| # | Dataset | Train | Test | Val | Total | Level |
|---|---|---|---|---|---|---|
| D1 * | AI2-ARC | 1119 | 1172 | 299 | 2590 | K-12 |
| D2 | OpenBookQA | 4957 | 500 | 500 | 5957 | Elementary |
| D3 | ScienceQA | 12,726 | 4241 | 4241 | 21,208 | K-12 |
| D4 | CK12-TQA | 6501 | 3285 | 2781 | 12,567 | Middle |
| D5 | MMLU | 99,842 | 14,042 | 1531 | 115,415 | College |
| D7 | SciQ | 11,679 | 1000 | 1000 | 13,679 | High School |
| D8 * | LaMP | 9349 | 767 | 801 | 10,917 | Post-sec. |
| D9 | SciQAG | 4496 | 0 | 0 | 4496 | Research |
| D10 | KILT | 87,372 | 1444 | 2837 | 91,653 | General |
References
- Bloom, B.S. The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring. Educ. Res. 1984, 13, 4–16. [Google Scholar] [CrossRef]
- Corbett, A.T.; Anderson, J.R. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Model. User Adapt. Interact. 1994, 4, 253–278. [Google Scholar] [CrossRef]
- VanLehn, K. The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educ. Psychol. 2011, 46, 197–221. [Google Scholar] [CrossRef]
- Liu, Z.; Agrawal, P.; Singhal, S.; Madaan, V.; Kumar, M.; Verma, P.K. LPITutor: An LLM Based Personalized Intelligent Tutoring System Using RAG and Prompt Engineering. PeerJ Comput. Sci. 2025, 11, e2991. [Google Scholar] [CrossRef] [PubMed]
- Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
- Modran, H.A. Leveraging RAG with ACP & MCP for Adaptive Intelligent Tutoring. Appl. Sci. 2025, 15, 11443. [Google Scholar] [CrossRef]
- Yan, L.; Sha, L.; Zhao, L.; Li, Y.; Martinez-Maldonado, R.; Chen, G.; Li, X.; Jin, Y.; Gasevic, D. Practical and Ethical Challenges of Large Language Models in Education: A Systematic Scoping Review. Br. J. Educ. Technol. 2024, 55, 90–112. [Google Scholar] [CrossRef]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv 2023, arXiv:2311.05232. [Google Scholar] [CrossRef]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
- Tonmoy, S.; Zaman, S.; Jain, V.; Rani, A.; Rawber, A.; Chadha, A.; Das, A. A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. arXiv 2024, arXiv:2401.01313. [Google Scholar] [CrossRef]
- Swacha, J.; Gracel, M. Retrieval-Augmented Generation (RAG) Chatbots for Education: A Survey of Applications. Appl. Sci. 2025, 15, 4234. [Google Scholar] [CrossRef]
- Anthropic. Introducing the Model Context Protocol; Technical Report; Anthropic: San Francisco, CA, USA, 2024. [Google Scholar]
- Hou, X.; Zhao, Y.; Wang, S.; Wang, H. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv 2025, arXiv:2503.23278. [Google Scholar] [CrossRef]
- Anderson, L.; Krathwohl, D. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives; Longman: Harlow, UK, 2001; ISBN 978-0321084057. [Google Scholar]
- Bloom, B. Taxonomy of Educational Objectives: The Classification of Educational Goals; Longmans, Green and Co.: New York, NY, USA, 1956; ISBN 978-0679302117. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. [Google Scholar]
- Li, Z.; Wang, Z.; Wang, W.; Hung, K.; Xie, H.; Wang, F.L. Retrieval-Augmented Generation for Educational Application: A Systematic Survey. Comput. Educ. Artif. Intell. 2025, 8, 100417. [Google Scholar] [CrossRef]
- Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J.C. Adaptive-RAG: Learning to Adapt Retrieval-Augmented LLMs through Question Complexity. arXiv 2024, arXiv:2403.14403. [Google Scholar]
- Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2024, arXiv:2310.11511. [Google Scholar]
- Singh, A.; Ehtesham, A.; Kumar, S.; Khoei, T.T.; Vasilakos, A.V. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv 2025, arXiv:2501.09136. [Google Scholar] [CrossRef]
- Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. arXiv 2023, arXiv:2212.10509. [Google Scholar]
- Jiang, J.; Chen, J.; Li, J.; Ren, R.; Wang, S.; Zhao, W.X.; Song, Y.; Zhang, T. RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement. arXiv 2024, arXiv:2412.12881. [Google Scholar] [CrossRef]
- Luo, Z.; Shen, Z.; Yang, W.; Zhao, Z.; Jwalapuram, P.; Saha, A.; Sahoo, D.; Savarese, S.; Xiong, C.; Li, J. MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers. arXiv 2025, arXiv:2508.14704. [Google Scholar]
- Fan, S.; Ding, X.; Zhang, L.; Mo, L. MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark. arXiv 2025, arXiv:2508.07575. [Google Scholar]
- Wang, Z.; Chang, Q.; Patel, H.; Biju, S.; Wu, C.; Liu, Q.; Ding, A.; Rezazadeh, A.; Shah, A.; Bao, Y.; et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers. arXiv 2025, arXiv:2508.20453. [Google Scholar]
- Zhang, D.; Li, Z.; Luo, X.; Liu, X.; Li, P.; Xu, W. MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents. arXiv 2025, arXiv:2510.15994. [Google Scholar]
- Radosevich, B.; Halloran, J. MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits. arXiv 2025, arXiv:2504.03767. [Google Scholar]
- Vygotsky, L.S. Mind in Society: The Development of Higher Psychological Processes; Harvard University Press: Cambridge, MA, USA, 1978; ISBN 978-0674576292. [Google Scholar]
- Scarlatos, A.; Baker, R.S.; Lan, A. Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs. arXiv 2024, arXiv:2409.16490. [Google Scholar]
- Cui, J.; Qian, H.; Jiang, B.; Zhang, W. Leveraging Pedagogical Theories to Understand Student Learning Process with Graph-based Reasonable Knowledge Tracing. arXiv 2024, arXiv:2406.12896. [Google Scholar]
- Liu, V.; Latif, E.; Zhai, X. Advancing Education through Tutoring Systems: A Systematic Literature Review. arXiv 2025, arXiv:2503.09748. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar] [CrossRef]
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense Passage Retrieval for Open-Domain Question Answering. arXiv 2020, arXiv:2004.04906. [Google Scholar] [CrossRef]
- Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval Augmented Language Model Pre-Training. arXiv 2020, arXiv:2002.08909. [Google Scholar] [CrossRef]
- Izacard, G.; Grave, E. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv 2021, arXiv:2007.01282. [Google Scholar] [CrossRef]
- Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving Language Models by Retrieving from Trillions of Tokens. arXiv 2022, arXiv:2112.04426. [Google Scholar] [CrossRef]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2024, arXiv:2312.10997. [Google Scholar] [CrossRef]
- Jiang, Z.; Xu, F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active Retrieval Augmented Generation. arXiv 2023, arXiv:2305.06983. [Google Scholar] [CrossRef]
- Yan, S.-Q.; Gu, J.-C.; Zhu, Y.; Ling, Z.-H. Corrective Retrieval Augmented Generation (CRAG). arXiv 2024, arXiv:2401.15884. [Google Scholar]
- Schick, T.; Dwivedi-Yu, J.; Dessi, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv 2024, arXiv:2302.04761. [Google Scholar]
- Nguyen, e.a. Towards Personalized AI Education: Context-Aware RAG With Grade-Level LLM Adaptation. Comput. Appl. Eng. Educ. 2026, 34, e70153. [Google Scholar] [CrossRef]
- Levonian, Z.; Li, C.; Zhu, W.; Gade, A.; Henkel, O.; Postle, M.E.; Xing, W. Retrieval-Augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference. In Proceedings of the NeurIPS 2023 Workshop on Generative AI for Education (GAIED), New Orleans, LA, USA, 15 December 2023. [Google Scholar]
- Graesser, A.C.; D’Mello, S.; Hu, X.; Cai, Z.; Olney, A.; Morgan, B. AutoTutor. In Applied Natural Language Processing: Identification, Investigation and Resolution; IGI Global: Hershey, PA, USA, 2012; pp. 169–187. [Google Scholar] [CrossRef]
- Wampler, D.; Nielson, D.; Seddighi, A. Engineering the RAG Stack: A Comprehensive Review of the Architecture and Trust Frameworks for Retrieval-Augmented Generation Systems. arXiv 2025, arXiv:2601.05264. [Google Scholar]
- Singh, A.; Ehtesham, A.; Kumar, S.; Khoei, T.T. A Survey of the Model Context Protocol (MCP): Standardizing Context to Enhance LLMs. Preprints 2025, 2025040245. [Google Scholar] [CrossRef]
- Fei, X.; Zheng, X.; Feng, H. MCP-Zero: Active Tool Discovery for Autonomous LLM Agents. arXiv 2025, arXiv:2506.01056. [Google Scholar]
- Lumer, E.; Gulati, A.; Subbiah, V.; Basavaraju, P.; Burke, J. ScaleMCP: Dynamic and Auto-Synchronizing Model Context Protocol Tools for LLM Agents. arXiv 2025, arXiv:2505.06416. [Google Scholar]
- Hasan, M.; Li, H.; Rajbahadur, G.; Adams, B.; Hassan, A. Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions. arXiv 2026, arXiv:2602.14878. [Google Scholar] [CrossRef]
- Li, Q.; Xie, Y. From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems. arXiv 2025, arXiv:2505.03864. [Google Scholar] [CrossRef]
- Ehtesham, A.; Singh, A.; Gupta, G.; Kumar, S. A Survey of Agent Interoperability Protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP). arXiv 2025, arXiv:2505.02279. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
- Zhu, X.; Chang, S.; Kuik, A. Enhancing Critical Thinking with AI: A Tailored Warning System for RAG Models. arXiv 2025, arXiv:2504.16883. [Google Scholar] [CrossRef]
- Shi, W.; Min, S.; Yasunaga, M.; Seo, M.; James, R.; Lewis, M.; Zettlemoyer, L.; Yih, W. REPLUG: Retrieval-Augmented Black-Box Language Models. arXiv 2024, arXiv:2301.12652. [Google Scholar]
- Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. In-Context Retrieval-Augmented Language Models. Trans. Assoc. Comput. Linguist. 2023, 11, 1316–1331. [Google Scholar] [CrossRef]
- Ma, X.; Gong, Y.; He, P.; Zhao, H.; Duan, N. Query Rewriting in Retrieval-Augmented Large Language Models. arXiv 2023, arXiv:2305.14283. [Google Scholar] [CrossRef]
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv 2023, arXiv:2210.03629. [Google Scholar] [CrossRef]
- Shinn, N.; Cassano, F.; Berman, E.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv 2023, arXiv:2303.11366. [Google Scholar] [CrossRef]
- Chen, Y.; Yan, L.; Sun, W.; Ma, X.; Zhang, Y.; Wang, S.; Yin, D.; Yang, Y.; Mao, J. Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning. arXiv 2025, arXiv:2501.15228. [Google Scholar]
- Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
- Aghajani Asl, M.; Asgari-Bidhendi, M.; Minaei-Bidgoli, B. FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation. arXiv 2025, arXiv:2510.22344. [Google Scholar]
- Zhang, Z.; Bo, X.; Ma, C.; Li, R.; Chen, X.; Dai, Q.; Zhu, J.; Dong, Z.; Wen, J.-R. A Survey on the Memory Mechanism of Large Language Model Based Agents. arXiv 2024, arXiv:2404.13501. [Google Scholar] [CrossRef]
- Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2025, arXiv:2404.16130. [Google Scholar]
- Sarthi, P.; Abdullah, S.; Tuli, A.; Khanna, S.; Goldie, A.; Manning, C.D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. arXiv 2024, arXiv:2401.18059. [Google Scholar] [CrossRef]
- Baek, J.; Aji, A.; Saffari, A. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. arXiv 2024, arXiv:2306.04136. [Google Scholar]
- Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv 2024, arXiv:2309.15217. [Google Scholar]
- Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. arXiv 2024, arXiv:2309.01431. [Google Scholar] [CrossRef]
- Abootorabi, M.M.; Zobeiri, A.; Dehghani, M.; Mohammadkhani, M.; Mohammadi, B.; Ghahroodi, O.; Soleymani Baghshah, M.; Asgari, E. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. arXiv 2025, arXiv:2502.08826. [Google Scholar] [CrossRef]
- Lu, P.; Mishra, S.; Xia, T.; Qiu, L.; Chang, K.-W.; Zhu, S.-C.; Tafjord, O.; Clark, P.; Kalyan, A. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. arXiv 2022, arXiv:2209.09513. [Google Scholar] [CrossRef]
- Ray, P. A Survey on Model Context Protocol: Architecture, State-of-the-Art, Challenges. TechRxiv 2025. TechRxiv:174495492.22752319. [Google Scholar]
- Liu, W.; Liu, Z.; Dai, E.; Yu, W.; Yu, L.; Yang, T.; Han, J.; Gao, H. MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use. arXiv 2025, arXiv:2512.24565. [Google Scholar]
- Wang, W.; Niu, P.; Xu, Z.; Chen, Z.; Du, J.; Du, Y.; Pang, X.; Huang, K.; Wang, Y.; Yan, Q.; et al. MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools. arXiv 2025, arXiv:2510.24284. [Google Scholar]
- Nie, X.; Guo, Z.; Chen, Y.; Zhou, Y.; Zhang, W. AWCP: A Workspace Delegation Protocol for Deep-Engagement Collaboration across Remote Agents. arXiv 2026, arXiv:2602.20493. [Google Scholar]
- Mousavinasab, E.; Zarifsanaiey, N.; Niakan Kalhori, S.R.; Rakhshan, M.; Keikha, L.; Ghazi Saeedi, M. Intelligent Tutoring Systems: A Systematic Review of Characteristics, Applications, and Evaluation Methods. Interact. Learn. Environ. 2021, 29, 142–163. [Google Scholar] [CrossRef]
- Khan Academy. Khanmigo: AI-Powered Teaching and Learning Assistant. 2024. Available online: https://www.khanmigo.ai/ (accessed on 15 March 2026).
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding (MMLU). arXiv 2021, arXiv:2009.03300. [Google Scholar]
- Wan, Y.; Liu, Y.; Ajith, A.; Grazian, C.; Hoex, B.; Zhang, W.; Kit, C.; Xie, T.; Foster, I. SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation. arXiv 2024, arXiv:2405.09939. [Google Scholar]
- Higgins, J.P.T.; Thompson, S.G.; Deeks, J.J.; Altman, D.G. Measuring Inconsistency in Meta-analyses. BMJ 2003, 327, 557–560. [Google Scholar] [CrossRef]
- Wang, Z.; Zhang, Y.; Min, W.; Guan, Q.; Yu, W. GISedu-GPT: A Large Language Model Framework with Prior Knowledge for GIS Education. J. Geogr. High. Educ. 2025, 50, 72–99. [Google Scholar] [CrossRef]
- Zhang, Y.; Wei, C.; He, Z.; Yu, W. GeoGPT: An Assistant for Understanding and Processing Geospatial Tasks. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104019. [Google Scholar] [CrossRef]
- Jahanbakhsh, N.; Vega-Barbas, M.; Pau, I.; Elvira-Martín, L.; Moosavi, H.; García-Vázquez, C. Leveraging RAG for Automated Smart Home Orchestration. Future Internet 2025, 17, 198. [Google Scholar] [CrossRef]
- James, A.; Trovati, M.; Bolton, S. RAG to Generate Knowledge Assets and Creation of Action Drivers. Appl. Sci. 2025, 15, 6247. [Google Scholar] [CrossRef]













| Architecture | Adapt. Retr. | Multi-Src Edu | MCP Std. | Citation Aware | Learner Pers. | Edu Application |
|---|---|---|---|---|---|---|
| Naive RAG [16] | No | No | No | No | None | Static document QA |
| Advanced RAG [38] | Partial | Limited | No | No | None | Textbook QA |
| AutoTutor-LLM [44] | No | No | No | No | Session | Dialogue tutoring |
| Agentic RAG [20] | Yes | Partial | No | No | Session | ITS (limited) |
| GraphRAG [63] | Yes | No | No | Partial | None | Curriculum mapping |
| Khanmigo [75] | Partial | Limited | No | No | Partial | Math/writing tutor |
| RAG+ACP [6] | Yes | Partial | Partial | No | Session | STEM tutoring |
| PRAG-EDU [42] | Yes | No | No | Partial | Grade | Personalized QA |
| EduMSRA (Proposed †) | Yes † | Yes † | Yes † | Yes † | Full † | Holistic Ed. Agent |
| Challenge | Description | EduMSRA Solution |
|---|---|---|
| Curriculum Fragmentation | Knowledge across textbooks, slides, videos, assessments | CTOL connects heterogeneous sources via MCP |
| Learner Heterogeneity | Diverse prior knowledge, pace, style | LPM personalizes retrieval depth and vocabulary |
| Hallucination | Incorrect explanations in high-stakes learning | Citation-Aware Generation with source attribution |
| Knowledge Staleness | Semester content updates beyond LLM cutoff | Dynamic MCP-connected curriculum index |
| Multi-hop Gaps | Cross-chapter synthesis questions | HERAP 3-tier retrieval (keyword, semantic, graph) |
| Assessment Alignment | Misalignment with Bloom’s taxonomy levels | PPA maps content to cognitive levels |
| Source Conflicts | Contradictory definitions across textbooks | CAFM detects and reconciles contradictions |
| Data Privacy | FERPA/GDPR for student records | 3-tier Permission Sandbox with audit logging |
| Scalability | Diverse LMS platforms across institutions | MCP open protocol for plug-and-play integration |
| # | Dataset | Size | Type | Level | Subject |
|---|---|---|---|---|---|
| D1 * | AI2-ARC Challenge | 7787 | MCQ | K-12 | Science |
| D2 | OpenBookQA | 5957 | MCQ | Elementary | General Sci. |
| D3 | ScienceQA | 21,208 | MCQ+Img | K-12 | Multi-subj. |
| D4 | TQA/CK12-QA | 26,260 | MCQ+TF | Middle | Science |
| D5 | MMLU (Edu) | ∼4000 | MCQ | College | STEM+Hum. |
| D7 | SciQ | 13,679 | MCQ | High School | Natural Sci. |
| D8 * | LaMP | ∼10 K users | Multi-task | Post-sec. | General |
| D9 | SciQAG | 960 K | Open QA | Research | Multi-domain |
| D10 | KILT | ∼700 K | Multi-task | General | Wikipedia |
| Exp. | Dataset | Research Question | Baseline | Key Metrics |
|---|---|---|---|---|
| E1 | AI2-ARC | Hierarchical vs. single-stage retrieval for multi-hop QA? | Naive RAG + GPT-4o | Accuracy, Context Precision, NDCG@5 |
| E2 | LaMP | LPM improves personalization vs. generic RAG? | PRAG-EDU, Standard RAG | ROUGE-L, Personalization Score |
| E3 | ScienceQA + TQA | CAFM vs. majority-vote fusion? | Majority Vote, RAG Fusion | Conflict Resolution F1, Attribution Acc. |
| E4 | MMLU (Edu) | Cross-domain generalization? | Advanced RAG, LPITutor | Cross-domain Acc., MCP Success Rate |
| Module | Mean (ms) | Std (ms) |
|---|---|---|
| HERAP (BM25) | 0.236 | 0.733 |
| CTOL (MCP mock) | 0.006 | 0.004 |
| CAFM (TF-IDF) | 1.406 | 2.212 |
| LPM (BKT) | 0.003 | 0.004 |
| PPA (Bloom) | 0.014 | 0.004 |
| Total | 1.677 | — |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Ho, T.-L.; Lam, T.-P. EduMSRA: A Multi-Source Educational Research Agent Integrating Retrieval-Augmented Generation and Model Context Protocol for Adaptive Intelligent Tutoring Systems. Appl. Sci. 2026, 16, 4400. https://doi.org/10.3390/app16094400
Ho T-L, Lam T-P. EduMSRA: A Multi-Source Educational Research Agent Integrating Retrieval-Augmented Generation and Model Context Protocol for Adaptive Intelligent Tutoring Systems. Applied Sciences. 2026; 16(9):4400. https://doi.org/10.3390/app16094400
Chicago/Turabian StyleHo, Thi-Linh, and Thanh-Phong Lam. 2026. "EduMSRA: A Multi-Source Educational Research Agent Integrating Retrieval-Augmented Generation and Model Context Protocol for Adaptive Intelligent Tutoring Systems" Applied Sciences 16, no. 9: 4400. https://doi.org/10.3390/app16094400
APA StyleHo, T.-L., & Lam, T.-P. (2026). EduMSRA: A Multi-Source Educational Research Agent Integrating Retrieval-Augmented Generation and Model Context Protocol for Adaptive Intelligent Tutoring Systems. Applied Sciences, 16(9), 4400. https://doi.org/10.3390/app16094400

