An Adaptive Multi-Agent Architecture with Reinforcement Learning and Generative AI for Intelligent Tutoring Systems: A Moodle-Based Case Study

López-Goyez, Juan P.; González-Briones, Alfonso; Demazeau, Yves

doi:10.3390/app16031323

by Juan P. López-Goyez ¹,²,*, Alfonso González-Briones ¹ and Yves Demazeau ³

¹ BISITE Research Group, Edificio I+D+i, University of Salamanca, Calle Espejo 2, 37007 Salamanca, Castile and León, Spain

² Faculty of Agricultural Industries and Environmental Sciences, Universidad Politécnica Estatal del Carchi, Calle Antisana, Tulcán 040101, Ecuador

³ Centre National de la Recherche Scientifique, Laboratoire d’Informatique de Grenoble (CNRS-LIG), University of Grenoble-Alps, 38000 Grenoble, France

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(3), 1323; https://doi.org/10.3390/app16031323

Submission received: 5 January 2026 / Revised: 26 January 2026 / Accepted: 27 January 2026 / Published: 28 January 2026

(This article belongs to the Special Issue Reinforcement Learning for Real-World Applications)

Abstract

Intelligent Tutoring Systems are increasingly used in higher education to support personalized learning and academic monitoring in large-scale digital environments. However, existing systems are predominantly based on static architectures and rigid rule-based mechanisms, which limit scalability and hinder effective adaptation to heterogeneous learners, evolving learning behaviors, and real-world educational contexts. This paper presents a self-adaptive multi-agent architecture based on Reinforcement Learning for autonomous decision-making in intelligent systems deployed in real environments. The proposal integrates an RL Meta-Agent that dynamically optimizes the selection of specialized agents through an intelligent switching mechanism, considering the user’s state, behavior, and interaction patterns. The architecture was implemented in Moodle using flows orchestrated in n8n, LLMs, databases, APIs developed in Django, and real academic data. For the empirical evaluation, a real and a simulated case study were designed. In the real case study, a questionnaire was administered to university students, covering the dimensions of usability, satisfaction and usefulness, and accessibility and interaction, to capture their perception of the system and identify possible improvements. The quantitative data were analyzed using descriptive statistics and nonparametric tests (Mann–Whitney U and Kruskal–Wallis), while the qualitative data were examined using thematic categorization. The simulated case study was conducted to analyze the behavior of the system. The results show that the RL Meta-Agent significantly improves system efficiency, response relevance, and adaptive agent selection, demonstrating that self-adaptive RL-based MAS architectures are a viable solution for intelligent systems applied in real-world contexts, providing empirical evidence of their performance and adaptability in complex scenarios such as higher education.

Keywords:

generative artificial intelligence; intelligent tutoring systems; learning management systems; multi-agent systems; reinforcement learning

1. Introduction

1.1. Digital Transformation, Smart Tutoring, and the Evolution of ITS

Digital transformation has driven the development of intelligent adaptive systems applied to real educational environments, geared toward automation, personalization, and autonomous decision-making. Universities face the challenge of integrating advanced technologies that enable them to respond dynamically to changing environments, where automated and adaptive tutoring is a representative example of the application of intelligent architectures in complex scenarios.

The incorporation of artificial intelligence and data analytics has favored the development of adaptive intelligent systems capable of optimizing autonomous decision-making and automated monitoring in real environments. Reinforcement learning (RL) emerges as a key approach for modeling dynamic adaptation and intelligent tutoring processes in complex digital platforms, allowing the system’s behavior to be adjusted based on user interaction and feedback from different educational environments and settings [1,2].

Research on educational trends identifies the central challenge of contemporary education as creating learning environments that are simultaneously personalized, automated, inclusive, and capable of dynamically adapting to each student’s actual progress [3]. Such studies emphasize that deep, AI-based personalization will be the main focus of modern teaching practices, but warn that its effective implementation requires overcoming technical, ethical, and operational obstacles [4], such as the creation of systems that not only process data but also learn, adapt, and evolve with student interactions, without requiring constant supervision by teachers [5].

Intelligent tutoring systems (ITS) enable immediate feedback and personalized adaptation in online educational environments as well as in other modes of study. Recent studies show their expansion, driven by the incorporation of advanced artificial intelligence techniques that overcome the limitations of traditional approaches [6].

These systems are mainly built on AI technologies and on web and mobile platforms, among others, and allow for real-time modeling of the student, the domain, and the interaction. The main models that stand out are rule-based and expert knowledge systems, Bayesian Knowledge Tracing (BKT) and Item Response Theory (IRT), probabilistic models and dynamic extensions with Kalman filters, Performance Factors Analysis (PFA) and Learning Factors Analysis (LFA), Markov models and Partially Observable Markov Decision Processes (POMDP), and machine learning (ML) and deep learning (DL) models for student modeling [1,7].

ITS also integrate natural language processing (NLP) techniques and dialogue systems that include speech recognition and synthesis to manage communication with students, as well as decision-making modules that can rely on reinforcement learning (RL) and educational data analysis to dynamically adapt activities and feedback to students through virtual learning environments (VLE) [8].

AI-based ITS have progressively incorporated techniques such as machine learning (ML), deep learning (DL), knowledge tracing, conversational agents, and, more recently, multi-agent systems (MAS). Although technological innovations promise to improve personalization, there is still a significant gap between theory and real-world applications in higher education. Most ITS continue to operate in controlled environments, with small-scale simulations or tests [9], and are rarely seamlessly integrated into widely used institutional platforms [10].

These systems have evolved from rule-based approaches to adaptive architectures oriented towards learning from data. However, limitations remain in terms of generalization, scalability, and continuous operation on real Learning Management Systems (LMS) such as Moodle, Chamilo, and Open edX, where multiple courses and heterogeneous user profiles coexist [4,11,12].

Although ITSs have been extensively studied, their integration with self-adaptive architectures based on Reinforcement Learning and generative AI remains limited, especially in real-world and scalable environments. Despite advances in LLMs, many proposals rely on static flows and rigid rules, which restrict dynamic adaptation to heterogeneous users and contexts.

1.2. Adaptive Learning in ITS: Reinforcement Learning and Deep Learning

Reinforcement learning (RL) emerges as a key strategy for improving the adaptability of agent-based ITS, favoring adaptive interventions in contexts with heterogeneous user behaviors. Unlike other ML approaches, RL allows an agent to learn optimal policies through continuous interaction with a dynamic environment, identifying actions that maximize cumulative rewards [13].

RL allows the sequencing, recommendation, and personalization of interventions to be optimized by learning adaptive strategies from user performance without direct supervision. This approach has proven viable in intelligent tutoring systems, where it takes into account the educational needs of users and their interaction with the system [14].

RL techniques have been strengthened by advanced methods such as Proximal Policy Optimization (PPO), which combines stability in policy updating, sample efficiency, and generalization ability. PPO is suitable for adaptive learning environments that require a balance between exploration and exploitation [15]. In the field of education, there is an urgent need to bring RL to real-world assessments in educational contexts, moving beyond laboratory tests and addressing complex variations in human behavior.

Advances in Deep Learning, such as Deep Knowledge Tracing (DKT), allow the student’s knowledge status to be modeled based on interaction sequences, improving performance prediction compared to classical approaches such as Bayesian Knowledge Tracing and facilitating the design of adaptive strategies without requiring expert labeling [16]. Although these approaches improve performance prediction, they have limitations in terms of interpretability, data dependency, and contextual adaptation, especially in real and heterogeneous environments such as LMSs, where the variability of users and scenarios requires more flexible and adaptive decision-making mechanisms. In real environments such as Moodle, these limitations are intensified due to the variability of courses and student behaviors [17].

1.3. Multi-Agent Systems, LLMs, and Agentic AI for Intelligent Tutoring

Given the limitations of traditional ITS, multi-agent architectures (MAS) are emerging as an alternative for managing the complexity and scalability of adaptive tutoring systems by distributing functions among specialized agents that collaborate in a coordinated manner [12]. They can improve online educational systems by incorporating pedagogical agents, recommendation agents, and monitoring agents that interact in real time to offer personalized adaptations.

MAS in the educational context have been used to solve problems related to recommending learning objects, evaluating student progress, adapting training itineraries, and providing personalized feedback [18]. Recent studies show that MAS make it possible to efficiently manage extensive repositories of educational resources and formulate dynamic recommendations based on student profiles and behaviors [19].

The different agents interact with each other and can build personalized learning paths based on student models, offering adapted teaching sequences [2]. These approaches highlight the importance of modularity and functional decomposition in complex ITS; however, they also reveal significant limitations: systems tend to operate with static rules or deterministic logic, which limits their capacity for autonomous learning and adaptation to real situations not anticipated by designers [20].

The convergence between ITS, MAS architectures, and RL is enhanced by the incorporation of LLMs, which enable the development of intelligent conversational agents capable of interpreting open-ended queries, generating contextual responses, and providing immediate feedback in real-world environments [21]. However, LLMs do not guarantee educational effectiveness on their own, as they must be aligned with pedagogical theories, include student assessment mechanisms, and fulfill structured tutoring functions [22].

AI research has shown a transition from reactive generative models to agent systems capable of autonomously planning, executing, and evaluating tasks through iterative loops of perception, reasoning, action, and learning. Agentic AI refers to autonomous systems that integrate LLMs, planning, memory, tools, and evaluation metrics [23]. In this context, agents are no longer simple conversational interfaces but become complete systems capable of operating without human intervention, which is particularly relevant for complex domains such as higher education.

In educational AI, there has been growth in LLM-based agents applied to tutoring and learning personalization, highlighting the need for hybrid human-agent interaction (HAI) flows that integrate ethical considerations and oversight mechanisms, where the teacher retains a central role in supervising and validating the tutoring process [24].

Various studies have proposed frameworks and models for structuring AI agents that transcend the simple prompt-response paradigm. Among these, agentic workflows stand out, which organize the interaction between LLMs, external tools, and memory through explicit and reusable processes [25]. These proposals identify common agent design patterns, such as task decomposition, the use of APIs and web servers, the incorporation of semantic and episodic memory, and reasoning verification mechanisms.

A representative example is Mentigo, an intelligent agent geared toward supporting creative problem solving through guided feedback [9]. These systems demonstrate the potential of agents to take on complex and adaptive tutorial roles, although challenges remain related to multi-agent coordination, the explainability of LLM-based reasoning, memory management, and the evaluation of human–agent interaction [26].

An emerging research direction involves the integration of multi-agent systems with agentic AI paradigms and LLMs, which significantly extends the traditional scope of intelligent educational systems. Within this framework, agents are no longer limited to executing predefined tasks; instead, they are capable of reasoning about academic goals, planning and coordinating pedagogical interventions, and dynamically adapting their behavior based on students’ performance and learning trajectories. This convergence supports the development of advanced academic assistants that can facilitate complex educational processes, such as academic advising, postgraduate progress monitoring, and highly personalized learning experiences in virtual environments, while operating under principles of teacher supervision, ethical governance, and algorithmic transparency [27].

The automation of agent flows using orchestration tools allows these architectures to be operationalized in real contexts, connecting agents, LLMs, RL modules, and educational services to facilitate governance, scalability, and continuous evaluation of the system [25]. Empirical studies show that the application of Q-learning in ITS improves personalized feedback and student performance compared to non-adaptive approaches. However, studies combining agents and RL focus mainly on recommendations, with little integration of specialized functional agents. Furthermore, the interaction between MAS, RL, and LLM remains largely unexplored in real platforms such as LMSs and in formal academic tutoring processes [8,10,28].

Despite the significant advances in reinforcement learning, deep learning, multi-agent architectures, and LLM-based tutoring systems, most existing approaches address these technologies in isolation or under static orchestration schemes. RL-based ITS often focus on recommendation or sequencing without integrating specialized tutoring agents, while MAS-based systems typically rely on deterministic logic that limits autonomous adaptation. Similarly, LLM-driven tutoring agents emphasize conversational capabilities but lack explicit decision-making mechanisms, experiential learning loops, and pedagogical supervision. Consequently, the integration of reinforcement learning, multi-agent coordination, and generative AI within a unified, self-adaptive architecture deployed in real LMS environments remains largely unexplored. This gap directly motivates the proposed ELA Tutor architecture.

1.4. Research Gap and Contribution

This work proposes ELA Tutor, a self-adaptive MAS architecture designed for deployment in LMSs, specifically Moodle, through the integration of LLMs, automated flows orchestrated with n8n, and autonomous decision-making mechanisms. The architecture is composed of specialized agents with differentiated functional responsibilities (pedagogical, technical, empathetic, and ethical), coordinated centrally to ensure system consistency, operational continuity, and coherence in decision-making throughout the interaction process.

At the core of the proposal is the Meta-Agent Orchestrator, responsible for coordinating the overall behavior of the system and dynamically selecting the most appropriate response strategy in each interaction. This component integrates an RL cycle, allowing the system to learn and adjust its decision policy based on accumulated experience. To do this, a simplified Q-learning model is used, in which the state of the environment is defined by combining the user’s level of knowledge and their emotional state inferred during the interaction, while the actions correspond to the different tutoring strategies managed by the specialized agents.
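As an illustration of the decision cycle described above, the following minimal sketch shows a tabular Q-learning update over states formed by a knowledge level and an emotional state, with tutoring strategies as actions. The level names, strategy names, and hyperparameters (ALPHA, GAMMA, EPSILON) are assumptions made for the example; the paper describes its model only as a simplified Q-learning scheme and does not report these values.

```python
import random
from collections import defaultdict

# Hypothetical discretization; the actual levels are not listed in the paper.
KNOWLEDGE_LEVELS = ("low", "medium", "high")
EMOTIONAL_STATES = ("frustrated", "neutral", "motivated")
STRATEGIES = ("pedagogical", "practical", "analytical", "empathetic")

ALPHA = 0.1    # learning rate (assumed)
GAMMA = 0.9    # discount factor (assumed)
EPSILON = 0.2  # exploration probability (assumed)

# Q-table keyed by (state, action); unseen pairs default to 0.0.
q_table = defaultdict(float)


def select_strategy(state, rng=random):
    """Epsilon-greedy choice among the tutoring strategies."""
    if rng.random() < EPSILON:
        return rng.choice(STRATEGIES)
    return max(STRATEGIES, key=lambda a: q_table[(state, a)])


def update(state, action, reward, next_state):
    """Standard tabular update: Q <- Q + alpha * (r + gamma * max Q' - Q)."""
    best_next = max(q_table[(next_state, a)] for a in STRATEGIES)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])
```

A state is a pair such as ("low", "frustrated"); after a rewarded interaction, calling update(...) raises the value of the strategy that was used, so later greedy selections in that state favor it.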

The architecture incorporates a persistent experiential memory, managed by the Meta-Agent Orchestrator and stored in a relational database, which is consulted before generating each response. When the historical experience associated with a strategy presents a sufficient level of confidence, the policy learned through RL guides decision-making; otherwise, the system resorts to deterministic mechanisms or inferential reasoning based on LLM. This hybrid decision logic, natively integrated into the architecture, allows for a balance between stability, adaptability, and generalization capacity of the system.
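The hybrid decision logic can be sketched as a confidence-gated lookup. This is an illustrative reading rather than the authors' implementation: the thresholds, the Experience record, the rule table, and the order of the two fallbacks (rules first, then LLM inference) are all assumptions, and llm_infer stands in for whatever LLM call the system makes.

```python
from dataclasses import dataclass

MIN_VISITS = 5        # assumed: experience required before trusting RL
MIN_CONFIDENCE = 0.6  # assumed threshold, not reported in the paper


@dataclass
class Experience:
    strategy: str
    visits: int         # times this strategy was tried in the current state
    mean_reward: float  # average observed reward, assumed to lie in [0, 1]


def rule_based_strategy(query_type):
    """Deterministic fallback; this mapping is illustrative only."""
    rules = {"conceptual": "pedagogical", "exercise": "practical"}
    return rules.get(query_type)


def decide(experiences, query_type, llm_infer):
    """Return (strategy, source), trying RL policy, then rules, then the LLM."""
    trusted = [e for e in experiences
               if e.visits >= MIN_VISITS and e.mean_reward >= MIN_CONFIDENCE]
    if trusted:
        best = max(trusted, key=lambda e: e.mean_reward)
        return best.strategy, "rl_policy"
    strategy = rule_based_strategy(query_type)
    if strategy is not None:
        return strategy, "rules"
    return llm_infer(query_type), "llm"
```

Returning the decision source alongside the strategy keeps each choice traceable, which is useful when analyzing how often the learned policy is actually trusted.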

The novelty of this work lies in the design of ELA Tutor as a self-adaptive multi-agent architecture that natively integrates reinforcement learning and generative artificial intelligence within a real Learning Management System environment. Unlike existing Intelligent Tutoring Systems that rely on static orchestration flows or rigid rule-based tutoring strategies, the proposed architecture introduces a Meta-Agent Orchestrator capable of dynamically selecting tutoring strategies based on accumulated experience. This is achieved through a reinforcement learning cycle that combines inferred learner knowledge level and affective indicators to inform decision-making, while preserving a clear separation between decision-making processes and generative response production. Furthermore, the incorporation of persistent experiential memory and hybrid decision logic enables the system to balance adaptability, stability, and scalability, demonstrating its applicability in real-world LMS deployments rather than controlled or purely simulated settings.

2. Materials and Methods

This study follows an applied research approach, focused on the construction and evaluation of an adaptive intelligent tutoring architecture implemented in the Moodle learning management system within a university production environment. The methodology combines architectural design, reinforcement learning (RL)-based decision modeling, and empirical validation through the implementation of real and simulated case studies.

The proposed system is evaluated using a mixed-methods strategy, integrating (i) quantitative system-level indicators derived from RL convergence and interaction logs, and (ii) quantitative and perceptual data collected from students through structured questionnaires. This methodological design allows for the validation of both the technical aspects of the architecture and its educational applicability in real-world contexts.

2.1. System Design and Architecture

ELA Tutor was designed as an agent-based ITS aimed at supporting intelligent teaching and tutoring processes, considering HAI principles, academic monitoring, and active teaching and learning methodologies. The architectural design adopts a service-oriented approach based on microservices, with the aim of facilitating interoperability with the Moodle LMS, flexible deployment, and scalability of the system in real-world contexts [29].

The architecture was implemented using Docker 4.57.0 containers, which allow the main components of the system (data management, orchestration, interaction, and analytics) to run independently. The system operates in courses enabled within Moodle 4.5.5+ (Build: 20250706), so the deployment can be replicated in other educational institutions that use this e-learning platform.

Under this architecture, the interaction, orchestration, MAS, RL, and data management layers are connected through well-defined communication channels. This scheme is particularly relevant in an ITS that integrates the Moodle LMS platform, an automation engine (n8n version 1.100.1), a relational database (PostgreSQL 8.14), and an LLM (OpenAI GPT-4.1), as it minimizes the coupling between these technologies through automated flows (Figure 1). Container-based deployment enables incremental updates, allowing specific components to be improved or replaced without interrupting the overall operation of the system.

2.1.1. Interaction and Communication Layer—User Interface

The architecture incorporates a layer of web interaction integrated into Moodle, which enables two-way communication between students, teachers, and ELA Tutor. The interaction is managed by middleware in Django 5.2.9 and Node.js v22.12.0, which exposes REST services for authentication, session management, and academic information synchronization, ensuring secure integration without additional credentials. Moodle acts as an academic context provider, supplying real data on courses, activities, and student progress, which allows for the generation of adaptive responses based on the training process.

2.1.2. Orchestration and Automation Layer

This layer constitutes the operational core of the system and is implemented using n8n as the orchestration engine, which is responsible for coordinating communication and control flows among the user interface, the multi-agent system (MAS), the reinforcement learning module, the database, and external AI services. The selection of n8n enables the modeling of complex interaction pipelines in a flexible and transparent manner, combining event-driven workflows with conditional logic and embedded Python 3.11.0 scripts that support agent coordination, decision-making, and interaction with artificial intelligence services.

Within this layer, each tutoring interaction is managed as a well-defined execution cycle that includes the reception of the user query, contextual enrichment using academic and historical data, evaluation by the Intelligent Switching Router, activation of the appropriate specialized agents, and validation of the generated response prior to delivery. Orchestration logic ensures that these processes are executed in a controlled and sequential manner, preserving consistency across interactions while allowing dynamic adaptation of tutoring strategies.
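The execution cycle can be pictured as a fixed sequence of stages. The sketch below is a plain-Python stand-in for what the paper implements as n8n workflows; every function name, the keyword-based routing rule, and the dictionary shapes are hypothetical.

```python
def enrich(query, context_store):
    """Attach academic and historical context to the raw query."""
    return {"query": query["text"],
            "context": context_store.get(query["user_id"], {})}


def route(request):
    """Placeholder router: a keyword heuristic stands in for the real one."""
    if "exercise" in request["query"].lower():
        return ["practical"]
    return ["pedagogical"]


def run_agents(agent_names, request, agents):
    """Activate each selected agent and collect its draft output."""
    return [agents[name](request) for name in agent_names]


def validate(drafts):
    """Drop empty drafts and join the rest; fail if nothing usable remains."""
    kept = [d.strip() for d in drafts if d and d.strip()]
    if not kept:
        raise ValueError("no valid agent output")
    return "\n".join(kept)


def handle_interaction(query, context_store, agents):
    """Reception -> enrichment -> routing -> agents -> validation."""
    request = enrich(query, context_store)
    selected = route(request)
    drafts = run_agents(selected, request, agents)
    return validate(drafts)


# Example wiring with stub agents (illustrative only).
agents = {
    "pedagogical": lambda r: f"Explanation for: {r['query']}",
    "practical": lambda r: f"Worked example for: {r['query']}",
}
context = {"u1": {"course": "Algorithms"}}
query = {"user_id": "u1", "text": "Give me an exercise on recursion"}
print(handle_interaction(query, context, agents))
```

Keeping each stage as a separate function mirrors the controlled, sequential execution described above while leaving the routing stage free to change its selection per request.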

In addition, the orchestration layer incorporates mechanisms for error detection, exception handling, and fallback execution paths. These mechanisms allow the system to gracefully recover from communication failures, service unavailability, or unexpected agent outputs by reverting to predefined safe policies or alternative processing routes. As a result, the system maintains operational robustness and reliability in real-world educational environments, where variability in user behavior, network conditions, and external services is unavoidable.
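A minimal sketch of such a fallback path is shown below, assuming a retry count, a linear backoff, and a fixed safe message that are not specified in the paper. The callables primary and fallback are placeholders for, say, an LLM-backed agent and a cached or rule-based alternative.

```python
import time

# Assumed wording for the last-resort reply.
SAFE_RESPONSE = "The tutor is temporarily unavailable. Please try again shortly."


def call_with_fallback(primary, fallback, retries=2, delay=0.0):
    """Try `primary` up to `retries` times, then `fallback`, then a safe default.

    Returns a (response, source) pair so the chosen path can be logged.
    """
    for attempt in range(retries):
        try:
            return primary(), "primary"
        except Exception:
            # Linear backoff between attempts; a delay of 0 disables waiting.
            time.sleep(delay * (attempt + 1))
    try:
        return fallback(), "fallback"
    except Exception:
        return SAFE_RESPONSE, "safe_default"
```

Catching broad exceptions is acceptable at this boundary because the goal is to always deliver some response; a production system would likely log each failure before falling back.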

2.1.3. Intelligent Switching Router

The Intelligent Switching Router is implemented within the orchestration layer, acting as a central decision node. This router operates as a decision agent, analyzing each incoming request considering multiple dimensions, such as the type of query, the academic context, the history of interactions, and signals derived from student behavior. Based on this analysis, it determines how the request should be processed and which strategies should be prioritized. The router represents a fundamental abstraction for the architecture, as it allows the decision logic to be decoupled from the internal workings of the MAS. As a result, ELA Tutor can manage diverse educational scenarios without relying on rigid flows, enabling dynamic tutoring that combines conceptual explanation, practical support, feedback, and emotional accompaniment.

2.1.4. Multi-Agent System

The MAS constitutes the intelligent core of the overall system and is composed of multiple specialized agents (pedagogical, technical or practical, analytical, empathetic, and ethical), which cooperate in an integrated manner within the system architecture. This design decision simplifies the representation of the system and emphasizes that tutoring is the result of internal collaboration between agents, rather than the isolated activation of independent components.

The MAS is responsible for interpreting user requests, generating contextualized responses, proposing learning resources, and offering adaptive feedback. Its design reflects the inherent complexity of tutoring in higher education, where student needs can vary significantly depending on domain, level of knowledge, learning styles, and emotional state. By centralizing these capabilities, the architecture promotes pedagogical consistency and facilitates the future evolution of the system.

2.1.5. Integration of Reinforcement Learning (RL)

A distinguishing feature of the ELA Tutor architecture is the explicit incorporation of RL as a mechanism for continuous adaptation (Figure 1). This layer introduces metacognitive capabilities that allow the system to learn from accumulated experience and progressively optimize its tutoring strategies. As shown in Figure 2, RL is implemented as an isolated module, clearly separated from the main interaction flow.

The RL Meta-Agent acts as a high-level controller, observing system interactions, evaluating results, and selecting strategies that maximize accumulated reward, allowing ELA Tutor to adapt to emerging patterns beyond static rules. Learning is supported by a Reward Calculator, which transforms interaction signals into numerical values stored together with the learned policies in an RL Policy Store, constituting the system’s experiential memory. The architecture implements hybrid decision logic, where the Intelligent Switching Router prioritizes RL policies with sufficient confidence and, otherwise, resorts to heuristics or inference based on language models.
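To make the role of the Reward Calculator concrete, the following sketch maps a few interaction signals to a discrete reward. The signals chosen (a 1 to 5 rating, a resolved flag, a follow-up flag), their weights, and the cut-offs are assumptions for illustration; the paper does not specify which signals or values are used.

```python
from dataclasses import dataclass


@dataclass
class InteractionSignals:
    rating: int             # hypothetical student rating from 1 to 5
    resolved: bool          # whether the query was marked as resolved
    follow_up_needed: bool  # whether the student had to ask again


def compute_reward(signals):
    """Map interaction signals to a discrete reward in {-1, 0, 1}."""
    score = (signals.rating - 3) / 2.0           # rescale 1..5 to -1..1
    score += 0.5 if signals.resolved else -0.5   # resolution bonus or penalty
    score -= 0.25 if signals.follow_up_needed else 0.0
    if score > 0.25:
        return 1
    if score < -0.25:
        return -1
    return 0
```

The resulting value would be stored with the state and strategy that produced it, so the policy store accumulates the experience the router later consults.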

2.1.6. Data Management and Knowledge Base Layer

The Data Management and Knowledge Base Layer supports the MAS and RL module through a PostgreSQL database that stores user profiles, interaction histories, and relevant metrics. It integrates a hybrid knowledge base with institutional resources and external sources, ensuring relevant, traceable, and contextualized information, as well as facilitating system analysis and evaluation.
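The kind of storage described here can be sketched with a small relational schema. The example uses Python's built-in sqlite3 module as an in-memory stand-in for PostgreSQL, and the table and column names are hypothetical; the actual schema is not given in the paper.

```python
import sqlite3

# Hypothetical schema: one table for profiles, one for logged interactions.
SCHEMA = """
CREATE TABLE user_profile (
    user_id INTEGER PRIMARY KEY,
    knowledge_level TEXT NOT NULL
);
CREATE TABLE interaction (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL REFERENCES user_profile(user_id),
    strategy TEXT NOT NULL,
    reward REAL NOT NULL
);
"""


def open_store():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def log_interaction(conn, user_id, strategy, reward):
    """Record the strategy used and the reward it received."""
    conn.execute(
        "INSERT INTO interaction (user_id, strategy, reward) VALUES (?, ?, ?)",
        (user_id, strategy, reward),
    )


def mean_reward_by_strategy(conn, user_id):
    """Aggregate a user's history into an average reward per strategy."""
    rows = conn.execute(
        "SELECT strategy, AVG(reward) FROM interaction "
        "WHERE user_id = ? GROUP BY strategy",
        (user_id,),
    )
    return dict(rows.fetchall())
```

An aggregate query of this kind is one simple way the interaction history could feed both the RL module and later system evaluation.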

Figure 2 illustrates the overall architecture of the ELA Tutor intelligent tutoring system, structured into clearly decoupled layers that support scalability, security, and pedagogical adaptation. At the top, a set of cross-cutting principles—educational objectives, instructional design, learning styles, privacy and security, trust and reliability, and bias and fairness—frame the operation of the ITS. User interaction begins with students and teachers through a web-based interface integrated with the Moodle LMS, which serves as an academic context provider. Communication is managed via a Django-based API connected to an orchestration engine (n8n), which coordinates interaction flows and activates the Intelligent Switching Router responsible for dynamically selecting the appropriate intelligent components for each request.

At the core of the system lies the Multi-Agent System (MAS), composed of specialized agents (pedagogical, practical/technical, analytical, adaptive empathetic, and ethical), along with support agents such as the prompt-receiving and translation agents. These agents collaborate to produce contextualized and safe responses, supported by a data layer that integrates databases, memory, knowledge bases, and external resources. Operating in a decoupled manner, the Reinforcement Learning (RL) Meta-Agent supervises the system at a strategic level by evaluating outcomes through a reward calculator and updating decision policies without directly generating content. This clear separation between decision-making, content generation, and evaluation enables progressive adaptation while preserving interpretability, ethical safeguards, and stability in real-world educational environments.

2.1.7. Considerations for the Design of the Architecture

The proposed architecture is based on a set of cross-cutting principles that guide its design and operation as an ITS. First, educational objectives, together with educational theories such as connectivism and constructivism, guide the definition of teaching strategies, ensuring that each recommendation, feedback, or explanation contributes directly to the achievement of learning outcomes through instructional design in Moodle [30].

Table 1 presents the alignment between the proposed architecture of ELA Tutor (Figure 2) and the principles established in the IEEE Ethically Aligned Design framework [31], with an emphasis on the criteria of trust, reliability, bias, and fairness. The architectural, functional, and orchestration mechanisms that enable ethical, transparent, and responsible intelligent tutoring in educational environments are described.

2.2. MAS and Intelligent Control Mechanism

ELA Tutor is based on a multi-agent architecture (MAS) designed to efficiently manage the complexity inherent in dynamic, heterogeneous, LMS-based learning environments. This approach allows the intelligent tutoring process to be broken down into specialized components that cooperate in a coordinated manner, promoting the scalability, adaptability, and traceability of the system.

The interaction flow begins with the Reception and Preprocessing Agent, which acts as the system’s entry point. This agent receives user requests via a webhook, performs textual normalization processes, and executes an initial classification of intent using heuristics and lightweight natural language processing techniques. Queries are categorized into different types, such as conceptual requests, practical problem solving, feedback on academic performance, or requests for motivational support. At the same time, the system retrieves relevant academic context from the LMS, including course information, activities, and student progress, and identifies the language of interaction to activate translation mechanisms when necessary.
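The normalization and heuristic intent classification performed at this entry point might look like the sketch below. The keyword lists and intent labels are illustrative assumptions; the paper mentions heuristics and lightweight NLP without listing them.

```python
import re
import unicodedata

# Hypothetical keyword heuristics; dictionary order defines priority.
INTENT_KEYWORDS = {
    "practical": ("exercise", "solve", "example", "problem"),
    "feedback": ("grade", "score", "progress", "performance"),
    "motivational": ("stressed", "give up", "motivation", "overwhelmed"),
}


def normalize(text):
    """Strip accents, collapse whitespace, and lowercase the text."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip().lower()


def classify_intent(text):
    """Return the first intent whose keywords appear; default to conceptual."""
    clean = normalize(text)
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in clean for keyword in keywords):
            return intent
    return "conceptual"
```

A cheap classifier like this is enough to give the router an initial query type; ambiguous cases could still be escalated to an LLM-based classification step.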

Centralized decision-making is carried out in the Intelligent Router, which operates as the cognitive control node of the MAS. This component evaluates each interaction considering three main factors: the type of query detected, the student profile inferred from their interaction history, and the level of cognitive complexity required. Based on this evaluation, the router dynamically determines which agent or combination of agents should be activated, avoiding rigid conversational flows and allowing for flexible and contextualized management of the tutorial process.
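The three routing factors can be illustrated with a small rule table. The profile labels, the 1 to 3 complexity scale, and the specific combination rules are assumptions for the example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingInput:
    query_type: str  # e.g. "conceptual", "practical", "feedback", "motivational"
    profile: str     # hypothetical labels: "novice" or "advanced"
    complexity: int  # hypothetical scale from 1 (low) to 3 (high)


# Primary agent per query type (illustrative mapping).
BASE_AGENT = {
    "conceptual": "pedagogical",
    "practical": "practical",
    "feedback": "analysis",
    "motivational": "empathetic",
}


def select_agents(req):
    """Pick a primary agent, then add supporting agents when warranted."""
    agents = [BASE_AGENT.get(req.query_type, "pedagogical")]
    if req.profile == "novice" and "empathetic" not in agents:
        agents.append("empathetic")   # extra encouragement for novices
    if req.complexity >= 3 and "practical" not in agents:
        agents.append("practical")    # worked examples for hard queries
    return agents
```

For instance, a conceptual question of high complexity from a novice would activate the pedagogical, empathetic, and practical agents together, whereas a simple practical query from an advanced student would activate only the practical agent.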

The pedagogical support and tutoring agents operate in a coordinated manner to address the different needs of the student. The Pedagogical Agent generates structured theoretical explanations aligned with the learning objectives of the course, relying on large language models (LLMs) and previously curated academic sources. The Practical Agent focuses on problem solving, generating applied examples, and developing procedural exercises. For its part, the Analysis Agent examines patterns of interaction and evolution of academic performance to produce formative feedback, while the Adaptive Empathic Agent adjusts the communication style and tone of the response to reinforce motivation, self-regulation, and continuity of learning.

Before being delivered to the user, the generated content passes through an Ethical-Pedagogical Filter, responsible for verifying academic consistency, curricular relevance, and information security, mitigating risks associated with bias or misinformation. Finally, a Synthesis and Validation Agent consolidates the outputs produced by the different agents, adjusts the depth and style of the response according to the student’s profile, and records the interaction in the system for later use in analysis and adaptation processes.

Figure 3 presents the complete workflow of the agents, conceptualized as an input-process-output system, which demonstrates the coordination between the components of the MAS and their integration with the virtual learning environment.

Within the multi-agent system, the reinforcement learning component is implemented as a meta-level decision agent responsible exclusively for strategic coordination. This RL Meta-Agent operates over the set of specialized tutoring agents rather than within them and does not participate in content generation or direct learner interaction. Its function is to observe the outcomes of agent activations and to learn an optimal policy for agent selection and orchestration under different contextual conditions. Architecturally, the RL Meta-Agent interfaces only with the Intelligent Router and the structured interaction logs, receiving low-dimensional, abstract state representations and discrete reward signals, while remaining fully decoupled from the internal reasoning processes and language generation mechanisms of individual agents. This design confines adaptive learning to the coordination layer of the MAS, ensuring modularity, interpretability, and operational safety, while enabling the tutoring system to incrementally refine its decision policies based on accumulated experience across heterogeneous learning scenarios.

2.3. Meta-Agent and Reinforcement Learning

ELA Tutor incorporates a Meta-Agent Orchestrator that integrates an RL cycle as an adaptive control mechanism. This component supervises the selection of tutorial strategies within the multi-agent system, learning from accumulated experience without interacting directly with the student. RL is implemented using a simplified formulation inspired by tabular Q-learning, where the state combines the student’s level of knowledge and inferred emotional state, and the actions correspond to tutorial strategies.

The Meta-Agent consults a persistent experiential memory and prioritizes the learned policy when there is sufficient confidence; otherwise, it resorts to rules or LLM-based inference. After each interaction, an incremental average of rewards updates the values, ensuring stability and interpretability. This cycle, shown in Figure 4, allows for progressive adaptation of the system’s behavior in real contexts.
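The consult-then-fall-back cycle described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `ExperienceMemory` and `select_strategy`, the thresholds `MIN_SUPPORT` and `MIN_MARGIN`, and the way confidence is measured (observation count plus lead over the next-best strategy) are all assumptions introduced here.

```python
from dataclasses import dataclass, field

MIN_SUPPORT = 3   # hypothetical: observations required before trusting an estimate
MIN_MARGIN = 0.2  # hypothetical: required lead of the best strategy over the next one


@dataclass
class ExperienceMemory:
    """Maps (state, strategy) to a (mean_reward, n_observations) pair."""
    table: dict = field(default_factory=dict)

    def update(self, state, strategy, reward):
        q, n = self.table.get((state, strategy), (0.0, 0))
        n += 1
        q = (q * (n - 1) + reward) / n  # incremental average of rewards
        self.table[(state, strategy)] = (q, n)
        return q


def select_strategy(memory, state, fallback):
    """Return (strategy, decision_basis) for the given state."""
    supported = sorted(
        (
            (q, strategy)
            for (s, strategy), (q, n) in memory.table.items()
            if s == state and n >= MIN_SUPPORT
        ),
        reverse=True,
    )
    if supported:
        best_q, best = supported[0]
        runner_up_q = supported[1][0] if len(supported) > 1 else 0.0
        if best_q > 0 and best_q - runner_up_q >= MIN_MARGIN:
            return best, "rl_policy"
    # Not enough evidence: defer to rules or LLM-based inference.
    return fallback(state), "fallback"
```

Here `fallback` stands in for the rule-based or LLM-based path; any callable that maps a state to a strategy name would do.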

2.4. Programming and Functional Logic of Agents

Tables 2–9 describe the functional architecture and logic of the system’s intelligent agents. Each agent fulfills a specific role within the interaction flow, and its behavior is modeled using pseudocode, allowing for a transparent representation of the input, decision, and output processes that underpin the system’s personalized and ethical tutoring. Each table is color-coded to match the corresponding agent in Figure 2, so that the agents can be told apart at a glance.

2.4.1. AI Translator Agent

Automatically detects the language of the user message (Spanish or English) and returns the result in JSON format.

Table 2. Functional logic of Translator Agent.

Element	Description	Pseudocode
Input	User message	Input: user_message Output: normalized_message, language, confidence_level START read user_message // Detect language language = DetectLanguage(user_message) // Normalize to Spanish for internal processing IF language != "es" THEN normalized_message = Translate(user_message, target_language = "es") ELSE normalized_message = user_message ENDIF // Estimate confidence confidence_level = EstimateConfidence(normalized_message) // Return structured output return {normalized_message, language, confidence_level} END
Process	Text analysis to identify the language
Output	JSON with detected language and confidence level; message translated into Spanish or English
Agent	Translator Agent
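The pseudocode in Table 2 could be turned into runnable code roughly as follows. This is a toy sketch: the stopword-based `detect_language` is a stand-in for whatever detector the system actually uses, the `translate` callable is a hypothetical hook for the translation service, and the confidence is taken from the detection step rather than from the normalized message as in the pseudocode.

```python
import json

# Tiny stopword lists: a toy stand-in for a real language detector (assumption).
STOPWORDS = {
    "es": {"el", "la", "de", "que", "y", "en", "los", "es", "por", "cómo", "qué"},
    "en": {"the", "of", "and", "to", "in", "is", "what", "how", "for", "this"},
}


def detect_language(text):
    """Return (language, confidence) from stopword hits; ties default to 'es'."""
    words = [w.strip("¿?¡!.,") for w in text.lower().split()]
    hits = {lang: sum(w in vocab for w in words) for lang, vocab in STOPWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return "es", 0.0
    language = max(hits, key=hits.get)
    return language, hits[language] / total


def translator_agent(user_message, translate):
    """Normalize the message to Spanish and return the structured JSON output."""
    language, confidence = detect_language(user_message)
    if language != "es":
        normalized = translate(user_message, target_language="es")
    else:
        normalized = user_message
    return json.dumps(
        {
            "normalized_message": normalized,
            "language": language,
            "confidence_level": round(confidence, 2),
        },
        ensure_ascii=False,
    )
```

Passing the translator in as a parameter keeps the sketch independent of any particular translation API.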

2.4.2. AI Prompt and Receiving Agent

Analyzes the student’s message by considering conversation history and academic context to classify the request.

Table 3. Functional logic of AI Prompt and Receiving Agent.

Element	Description	Pseudocode
Input	Message + conversation history + student data	START read current_message read conversation_history analyze context classify request_type determine urgency_level identify academic_topic estimate complexity_level return {type, urgency, topic, complexity} END
Process	Semantic and contextual classification
Output	JSON with type, urgency, topic, and complexity
Agent	Request Receiver/Classifier Agent
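As an illustration of the classification step in Table 3, the sketch below uses ordered keyword rules. The rules, the topic list, the urgency markers, and the word-count complexity heuristic are all hypothetical; the article does not list the actual heuristics or prompts.

```python
import re

# Hypothetical keyword rules, checked in order.
TYPE_RULES = [
    ("administrative", r"\b(grades?|enrolled|deadline|courses?)\b"),
    ("procedural", r"\b(how to|step by step|example|exercise|code)\b"),
    ("conceptual", r"\b(what is|define|explain|why)\b"),
]
URGENT = re.compile(r"\b(urgent|today|tomorrow|asap)\b")
KNOWN_TOPICS = ("cost accounting", "cash count", "python")  # illustrative only


def classify_request(current_message, conversation_history=()):
    """Return {type, urgency, topic, complexity} for one student message."""
    text = current_message.lower()
    request_type = next(
        (name for name, pattern in TYPE_RULES if re.search(pattern, text)),
        "unknown",
    )
    topic = next((t for t in KNOWN_TOPICS if t in text), None)
    if topic is None and conversation_history:
        # Fall back to the topic of the most recent classified turn.
        topic = conversation_history[-1].get("topic")
    return {
        "type": request_type,
        "urgency": "high" if URGENT.search(text) else "normal",
        "topic": topic,
        "complexity": "low" if len(text.split()) <= 6 else "medium",
    }
```

The conversation history is only used to carry a topic forward when the current message does not name one, which is one simple way to make the classification contextual.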

2.4.3. AI Prompt and Receiving Agent with Moodle

Interprets academic data retrieved from Moodle and communicates it to the student in natural language.

Table 4. Functional logic of AI Prompt and Receiving Agent with Moodle.

Element	Description	Pseudocode
Input	Academic data in JSON format	START read moodle_data interpret academic_information IF tasks_exist THEN list tasks clearly ELSE indicate no information found ENDIF return friendly_message END
Process	Conversion to natural language
Output	Friendly and clear response
Agent	Moodle Assistant Agent
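A minimal sketch of the branch in Table 4 is shown below. The JSON shape (a `tasks` list whose entries have `name` and `due` keys) is an assumption made for illustration and is not the schema of the Moodle web services.

```python
def moodle_assistant(moodle_data):
    """Turn retrieved academic data into a short, friendly message."""
    tasks = moodle_data.get("tasks") or []
    if not tasks:
        return "I could not find any pending activities in your courses."
    lines = [f"- {task['name']} (due {task['due']})" for task in tasks]
    return "These are your pending activities:\n" + "\n".join(lines)
```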

2.4.4. Pedagogical Agent

Generates clear, brief, and academic theoretical explanations adapted to the user’s language.

Table 5. Functional logic of Pedagogical Agent.

Element	Description	Pseudocode
Input	Classified request data	START read classified_data set response_language generate theoretical_explanation return response END
Process	Academic explanation generation
Output	Theoretical response
Agent	Pedagogical Agent

2.4.5. Technical Agent

Provides practical solutions such as examples, exercises, or code snippets focused on application rather than theory.

Table 6. Functional logic of Technical Agent.

Element	Description	Pseudocode
Input	Topic and user message	START read topic read user_message generate practical_solution include examples or code if required return response END
Process	Practical knowledge application
Output	Technical response
Agent	Technical Agent

2.4.6. Prepare Prompt—Analysis Agent

Evaluates the quality of the student’s response or performance and provides constructive feedback.

Table 7. Functional logic of Analysis Agent.

Element	Description	Pseudocode
Input	Student response	START read student_response evaluate clarity and correctness identify weaknesses suggest improvements return feedback END
Process	Qualitative evaluation
Output	Feedback and improvement suggestions
Agent	Performance Analysis Agent

2.4.7. Adaptive Empathic Agent

Adjusts the tone, level of detail, and empathy of the response according to the user’s profile and emotional state.

Table 8. Functional logic of Adaptive Empathic Agent.

Element	Description	Pseudocode
Input	User level and emotional context	START read user_level read requested_tone adapt response_style generate empathetic_message return response END
Process	Communication adaptation
Output	Adaptive and empathetic response
Agent	Adaptive Communication Agent

2.4.8. Ethical Agent

Ensures that the responses generated by the system comply with principles of fairness, bias mitigation, reliability, and pedagogical alignment, validating the content before it is delivered to the user.

Table 9. Functional logic of Ethical Agent.

Element	Description	Pseudocode
Input	Response generated by the agents and user context	START read generated_response read user_context read interaction_context evaluate bias_risk evaluate fairness_compliance evaluate pedagogical_alignment evaluate trust_and_reliability IF bias_risk == true OR fairness_compliance == false THEN refine generated_response ENDIF IF pedagogical_alignment == false THEN adjust educational_content ENDIF approve validated_response return validated_response END
Process	Ethical and pedagogical verification of content
Output	Ethically and pedagogically validated answer
Agent	Ethical Agent

2.5. User Interface Design and Scalability

The ELA Tutor UI was designed to support intuitive and efficient HAI, adopting a minimalist, web-based, chat-oriented design that emphasizes clarity, accessibility, and institutional consistency (Figure 5). Natural language input, structured visual feedback, and contextual prompts facilitate seamless interaction, reduce cognitive load, and enhance trust, usability, and engagement, enabling ELA Tutor to operate as an effective adaptive academic assistant.

Scalability and cross-platform integration in ELA Tutor are enabled by a middleware-centric architectural design that decouples intelligent tutoring services from the LMS core. All adaptive decision-making, multi-agent coordination, and reinforcement learning processes are executed within containerized services exposed through standardized APIs and webhooks, while Moodle is used solely as a contextual data provider (courses, activities, progress, and interaction events). This loose coupling eliminates dependency on Moodle’s internal logic, allowing the same middleware instance to be replicated or deployed across multiple Moodle installations without code modification. From a computational perspective, the reinforcement learning meta-agent operates with constant-time updates per interaction and does not introduce shared-state bottlenecks, supporting horizontal scaling under increasing user and course loads. As a result, the proposed architecture can be integrated into heterogeneous Moodle-based environments while preserving modularity, maintainability, and deployment portability.

3. Case Studies

3.1. User-Centered System Evaluation

A case study was conducted in a controlled real-world environment to evaluate the behavior, functionality, and usability of ELA Tutor integrated into Moodle, with the participation of Multimedia and Audiovisual Production students. The study included 150 students who gave informed consent; participation was voluntary and anonymized, in keeping with the principles of privacy, confidentiality, Human–AI Interaction, and Trustworthy AI.

The study was conducted in three phases: (i) initial training to familiarize students with the ITS and its interface; (ii) execution of a didactic activity integrated into Moodle, which included theoretical, practical, and academic monitoring components, during which students interacted with ELA Tutor; and (iii) data collection using a structured questionnaire, which assessed usability, satisfaction, and usefulness, as well as accessibility and interaction, using five-point Likert-type items and open-ended questions (Table 10).

3.2. Simulation of Adaptive Decision-Making Using Reinforcement Learning in ELA Tutor

A simulated case study was designed to evaluate the adaptive behavior of the ELA Tutor intelligent tutoring system, with particular emphasis on the functioning of the reinforcement learning–based Meta-Agent Orchestrator. The objective of this study was to analyze how the system progressively adjusts the selection of agents and tutorial strategies based on accumulated experience, following the proposed RL algorithm, under controlled and reproducible conditions.

For each simulated interaction, the system recorded the user query, the selected agent (tutorial strategy), the origin of the routing decision (heuristic baseline or RL-based policy), the inferred contextual state, and the observed outcome. This detailed logging enabled the assessment of decision traceability, policy consistency, and attribution of learning effects, allowing the isolation of the contribution of the RL Meta-Agent from the underlying orchestration logic and language generation components. The simulation phase therefore served as a functional validation step prior to evaluating adaptive behavior in real educational environments.

From a methodological perspective, the simulation models the user as a stochastic feedback generator conditioned on the contextual state and the selected strategy. This abstraction allows controlled experimentation while preserving variability and noise characteristic of real tutoring interactions.
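One way to realize such a stochastic feedback generator is sketched below. The probability table is entirely hypothetical (the article does not report the distributions used), and states are written as (knowledge level, sentiment) pairs for illustration.

```python
import random

# Hypothetical probabilities of (positive, neutral, negative) feedback
# for a given (state, strategy) pair.
FEEDBACK_PROBS = {
    (("basic", "negative"), "technical"): (0.7, 0.2, 0.1),
    (("basic", "negative"), "pedagogical"): (0.2, 0.3, 0.5),
}
DEFAULT_PROBS = (0.3, 0.4, 0.3)


def simulated_feedback(rng, state, strategy):
    """Draw a reward in {+1, 0, -1} conditioned on state and strategy."""
    probs = FEEDBACK_PROBS.get((state, strategy), DEFAULT_PROBS)
    return rng.choices([1, 0, -1], weights=probs, k=1)[0]
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps each run reproducible while preserving noise in the feedback.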

Table 11 describes the reinforcement scheme used by the Meta-Agent to evaluate the effectiveness of the tutorial strategies executed by the specialized agents. Each interaction generates a reward signal R ∈ {+1, 0, −1}, derived from explicit linguistic feedback or interaction outcomes detected in natural language. These signals are integrated into the learning process using an incremental average-based update rule, as defined in Equation (1):

$$Q_{new} = \frac{Q_{old} \times (N - 1) + R}{N}, \qquad (1)$$

where $Q_{old}$ represents the previous rating of the strategy, $N$ the number of times it has been used, and $R$ the current reward. This mechanism softens the impact of individual rewards, promoting model stability and avoiding abrupt changes in decision policy.
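Equation (1) can be written directly as a small function. The sketch below assumes that N counts the current use (so N is at least 1) and that an unused strategy starts at 0; the helper name `replay` is introduced here only to show the softening effect.

```python
def update_q(q_old, n, reward):
    """Equation (1): mean after n uses, given the mean over the first n - 1."""
    if n < 1:
        raise ValueError("n counts the current use, so it must be at least 1")
    return (q_old * (n - 1) + reward) / n


def replay(rewards):
    """Apply the update to a reward sequence, starting from an unused strategy."""
    q = 0.0
    for n, reward in enumerate(rewards, start=1):
        q = update_q(q, n, reward)
    return q
```

For instance, nine positive rewards followed by one negative reward give (1.0 × 9 − 1) / 10 = 0.8, so a single penalty lowers the estimate without reversing its sign.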

The decision-making process of the reinforcement learning–based Meta-Agent is formalized as a routing policy that selects the tutorial strategy with the highest estimated utility for a given contextual state. This policy is defined in Equation (2):

$$a^{*} = \arg\max_{a \in A} Q(s, a), \qquad (2)$$

where $s$ denotes the contextual state of the learner, defined as a combination of the inferred level of knowledge and the inferred emotional state. The set $A$ represents the space of available tutorial strategies, each corresponding to the activation of one or more specialized agents within the multi-agent system. The function $Q(s, a)$ expresses the estimated utility of applying strategy $a$ in state $s$, derived from accumulated interaction experience and used by the Meta-Agent to guide adaptive routing decisions.
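The greedy selection in Equation (2) amounts to a one-line lookup over a Q-table. In this sketch the dictionary layout, the default of 0.0 for unseen pairs, and the example values are assumptions for illustration.

```python
def greedy_strategy(q_table, state, strategies):
    """a* = argmax over a of Q(s, a); unseen pairs default to 0.0 (assumption)."""
    return max(strategies, key=lambda a: q_table.get((state, a), 0.0))


# Illustrative values only, not taken from the study.
q_table = {
    (("basic", "negative"), "technical"): 0.6,
    (("basic", "negative"), "pedagogical"): -0.4,
}
best = greedy_strategy(
    q_table, ("basic", "negative"), ["pedagogical", "technical", "adaptive"]
)
```

Ties are resolved in favor of the strategy listed first, since `max` returns the first maximal element; a deployed system would need an explicit tie-breaking rule.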

The estimated utility associated with each state–action pair is updated using an incremental sample-average rule, which allows stable learning from sparse and noisy feedback without relying on bootstrapping mechanisms. The update rule is defined in Equation (3), which is Equation (1) rewritten with N = n + 1:

$$Q_{n+1} = \frac{Q_{n} \cdot n + R}{n + 1}, \qquad (3)$$

where $Q_{n}$ is the current utility estimate after $n$ observations, $R$ is the reward obtained from the most recent interaction, and $n$ denotes the number of times the corresponding strategy has been previously selected under the same contextual state. This formulation computes the empirical mean of observed rewards, progressively smoothing the influence of individual feedback signals and ensuring interpretable and stable adaptation in educational settings.
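The claim that this rule yields the empirical mean can be checked in two lines. In the derivation below, $R_i$ denotes the reward of the i-th earlier selection; this symbol is introduced here and does not appear in the original text.

```latex
\begin{aligned}
Q_{n+1} &= \frac{n\,Q_{n} + R}{n+1}
         = Q_{n} + \frac{1}{n+1}\,\bigl(R - Q_{n}\bigr), \\
Q_{n} = \frac{1}{n}\sum_{i=1}^{n} R_i
\;&\Longrightarrow\;
Q_{n+1} = \frac{1}{n+1}\Bigl(\sum_{i=1}^{n} R_i + R\Bigr).
\end{aligned}
```

The first line shows that each update moves the estimate toward the new reward with step size 1/(n + 1), which shrinks as evidence accumulates. Setting N = n + 1 and Q_old = Q_n recovers Equation (1).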

The reward mechanism relies on a keyword-triggered polarity scheme interpreted conservatively to account for ambiguous feedback such as politeness markers, sarcasm, or requests for clarification. Neutral rewards are assigned under low-confidence conditions, while incremental averaging reduces the impact of isolated errors. A confidence-gated hybrid control mechanism ensures that reinforcement learning decisions are applied only when sufficient evidence is available; otherwise, the system falls back to deterministic heuristics or constrained LLM inference, guaranteeing safe and stable adaptation.

Table 11. Reward Assignment Matrix (Q-Learning).

Reinforcement Type	User Input	Assigned Value (R)	Impact on the Agent
Positive (Reward)	Keywords: “Thank you”, “Excellent”, “It works”, “Good job”	+1.0	Success validation: The selected strategy successfully addressed the user’s need.
Negative (Penalty)	Keywords: “I don’t understand”, “Bad”, “Error”, “Confusing”	−1.0	Error correction: The strategy was ineffective or the selected agent was not appropriate.
Neutral	Absence of explicit feedback or simple phatic interactions	0.0	State maintenance: There is insufficient evidence to modify the system’s behavior.
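The keyword-triggered scheme in Table 11 could look like the following sketch. The keyword lists come from the table; matching on word boundaries and treating mixed signals as neutral are assumptions made here to reflect the conservative interpretation described in the text.

```python
import re

POSITIVE = ("thank you", "excellent", "it works", "good job")
NEGATIVE = ("i don't understand", "bad", "error", "confusing")


def _matcher(keywords):
    # Word boundaries keep "bad" from matching inside words such as "badge".
    return re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")


_POS, _NEG = _matcher(POSITIVE), _matcher(NEGATIVE)


def reward_from_feedback(message):
    """Map a user message to a reward in {+1.0, 0.0, -1.0}."""
    text = message.lower().replace("\u2019", "'")  # normalize curly apostrophes
    positive = bool(_POS.search(text))
    negative = bool(_NEG.search(text))
    if positive and not negative:
        return 1.0
    if negative and not positive:
        return -1.0
    return 0.0  # no keyword, or conflicting keywords: stay neutral
```

A pure keyword match cannot detect sarcasm or politeness markers, so a production version would need an additional confidence estimate before assigning a non-neutral reward.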

Similarly, the state space is defined as the combination of the user’s inferred sentiment (positive, neutral, negative) and inferred knowledge level (basic, intermediate, advanced), generating composite contextual states that characterize each interaction. This abstraction enables the system to distinguish between recurring pedagogical situations, such as routine clarification requests, successful resolution scenarios, or alert cases associated with frustration among novice learners.
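Enumerating the composite states makes the size of the state space explicit. The sketch below assumes states are encoded as (knowledge level, sentiment) tuples and that "novice" corresponds to the basic level.

```python
from itertools import product

KNOWLEDGE_LEVELS = ("basic", "intermediate", "advanced")
SENTIMENTS = ("positive", "neutral", "negative")
STATES = tuple(product(KNOWLEDGE_LEVELS, SENTIMENTS))  # 3 x 3 = 9 composite states


def is_alert_state(state):
    """Frustration among novice learners, the alert case named in the text."""
    return state == ("basic", "negative")
```

With nine states and a handful of strategies, the full Q-table stays small enough to inspect by hand, which supports the interpretability argument made for the tabular design.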

During the simulation, representative interaction sequences were executed for different composite states, recording the selected action, the assigned reward, and the updated utility value. This procedure allowed observation of how the system reinforces effective strategies, penalizes systematically ineffective ones, and adapts routing decisions to contextual variation.

To ensure safe operation, the simulation does not include random exploration. The RL policy is applied only when its confidence exceeds a predefined margin; otherwise, routing decisions default to a deterministic heuristic baseline. This design mirrors the constraints of real educational environments, where unsafe exploration is unacceptable. Performance is therefore evaluated comparatively against the heuristic baseline, demonstrating that the RL-driven policy converges toward higher cumulative reward while maintaining pedagogical safety.

Convergence and robustness are supported through repeated simulation runs with multiple random seeds and extended interaction horizons, analyzing the stabilization of average reward and the reduction in variance in utility estimates. Sensitivity analyses were conducted by injecting controlled noise into the reward signal and varying initial state distributions, confirming that the learning process remains stable and that policy improvements over the baseline persist under noisy feedback conditions.

Overall, the simulated case study provides evidence of the correct coupling between the state representation, reward mechanism, and update rule, as well as the stability and interpretability of the adaptive behavior induced by the RL Meta-Agent prior to deployment in real LMS settings. Unsafe exploration is avoided through a confidence-gated hybrid policy: RL decisions are applied only when confidence exceeds a threshold; otherwise, the system reverts to a deterministic heuristic baseline. Premature convergence is prevented by minimum-support criteria, and superiority over the heuristic-only approach is shown via baseline comparison under sparse, noisy feedback.

4. Results

4.1. Students’ Perceptions of ELA Tutor

A descriptive statistical analysis was performed on the responses obtained in the questionnaire administered to the students. First, the responses to the Likert-type items were normalized on a numerical scale from 1 to 5, where 1 corresponds to “strongly disagree” and 5 to “strongly agree.” The items were grouped by dimension (usability; satisfaction and usefulness; accessibility and interaction), and the average of the items corresponding to each dimension was calculated for each student.

Descriptive statistics were estimated considering the student as the unit of analysis, in order to ensure the independence of observations and avoid pseudoreplication problems. For each dimension, the mean, standard deviation, minimum, and maximum values were calculated.
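The per-student aggregation can be sketched as follows. Reverse scoring of Q4 follows the article; assigning Q5 to the usability dimension and the dictionary layout of the answers are assumptions made for illustration.

```python
from statistics import mean, stdev

# Item-to-dimension mapping; placing Q5 under usability is an assumption.
DIMENSIONS = {
    "usability": ("Q1", "Q2", "Q3", "Q4", "Q5"),
    "satisfaction_usefulness": ("Q6", "Q7", "Q8"),
    "accessibility_interaction": ("Q9", "Q10"),
}
REVERSED_ITEMS = {"Q4"}  # negatively worded item, scored as 6 - value


def dimension_scores(answers):
    """Average one student's 1-5 Likert answers within each dimension."""
    scored = {q: 6 - v if q in REVERSED_ITEMS else v for q, v in answers.items()}
    return {dim: mean(scored[q] for q in items) for dim, items in DIMENSIONS.items()}


def describe(values):
    """Mean, sample standard deviation, minimum, and maximum across students."""
    return {"mean": mean(values), "sd": stdev(values), "min": min(values), "max": max(values)}
```

Computing one score per student and dimension first, and only then summarizing across students, is what keeps the student as the unit of analysis.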

Table 12 presents the descriptive statistics, which show a consistently positive perception of ELA Tutor in all dimensions evaluated. The means obtained for usability (M = 3.77), satisfaction and usefulness (M = 3.82), and accessibility and interaction (M = 3.90) are all above the midpoint (3) of the five-point Likert scale.

The descriptive analysis performed for each question shows that all the dimensions evaluated have averages above the midpoint of the Likert scale (Table 13). In the usability dimension, questions Q1, Q2, and Q3 achieve the highest ratings, while question Q4, which is negatively worded, has the lowest average, although it still reflects a positive perception after reverse scoring. In the satisfaction and usefulness dimension, questions Q6, Q7, and Q8 reflect a high acceptance of the intelligent tutor as a learning support tool. The questions associated with accessibility and interaction, Q9 and Q10, obtain the highest means, which shows that the system is perceived as accessible and as having adequate response times.

The perception of ELA Tutor was analyzed according to gender in the dimensions of usability, satisfaction and usefulness, and accessibility and interaction. The results in Table 14 show similar means between the male (n = 90) and female (n = 60) groups, with minimal differences (Δ ≤ 0.05). The nonparametric Mann–Whitney U test did not show statistically significant differences in any dimension (p > 0.05), indicating a homogeneous and equitable perception of the system.

Figure 6 shows the distribution of usability, satisfaction, and accessibility scores for ELA Tutor using Kernel Density Estimation (KDE) on a Likert scale from 1 to 5. The densities are concentrated at high values (4–5), indicating positive perceptions in all dimensions, with a more pronounced peak in usability. Although applied to ordinal data, KDE is used as a visual aid to compare the shape and dispersion of distributions.

The perception of the system was analyzed by age group in the study dimensions. The results (Table 15) show similar scores between groups, indicating a homogeneous and consistent user experience regardless of age.

Figure 7 presents the density estimates (KDE) by age range for the three dimensions evaluated. In all cases, unimodal distributions are observed, concentrated at high values on the Likert scale, with a marked overlap between age groups, which supports the absence of statistically significant differences by age.

To analyze possible differences in the perception of the system according to the students’ level of AI knowledge, the nonparametric Kruskal–Wallis test was applied to the study dimensions. No dimension reached statistical significance at the 0.05 level (Usability: H = 4.88, p = 0.181; Satisfaction and usefulness: H = 4.15, p = 0.246; Accessibility and interaction: H = 5.20, p = 0.158), suggesting that the perception of the system is consistent regardless of the level of prior knowledge in AI.
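The reported p-values can be cross-checked from the H statistics alone. The sketch below assumes four knowledge groups (none, low, medium, high, as in Table 16), hence 3 degrees of freedom, and the usual chi-square approximation for H; it uses the closed-form survival function for 3 degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf_df3(h):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(h / 2)) + math.sqrt(2 * h / math.pi) * math.exp(-h / 2)


# H statistics as reported in the text.
reported = {"usability": 4.88, "satisfaction_usefulness": 4.15, "accessibility_interaction": 5.20}
p_values = {dim: chi2_sf_df3(h) for dim, h in reported.items()}
```

Under these assumptions the computed values agree with the reported p = 0.181, 0.246, and 0.158 to within rounding, which is consistent with a four-group comparison.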

The descriptive analysis (Table 16) shows a positive trend in usability as the level of knowledge of artificial intelligence increases: the mean rises progressively from the group with no prior knowledge (M = 2.75) to the group with high knowledge (M = 3.95), suggesting that greater familiarity with AI could be associated with a slightly more favorable perception of the system. Because the Kruskal–Wallis test found no statistically significant differences, this trend should be read as descriptive only.

In the satisfaction and usefulness dimension, the means are remarkably similar between the groups with low, medium, and high knowledge (M: 3.83–3.85), indicating that the level of AI knowledge does not seem to substantially influence the perception of the system’s usefulness. In contrast, the group with no prior knowledge has a lower mean (M = 2.00), a result that should be considered exclusively descriptive because this group comprises only two students (n = 2).

With regard to accessibility and interaction, the means remain high and relatively stable in the groups with low, medium, and high knowledge (M: 3.89–4.21), with the low-knowledge group reporting the highest mean (M = 4.21). This result suggests that ELA Tutor is perceived as accessible and intuitive even by students with less experience in AI, reinforcing its inclusive nature.

Table 16. Dimensions of ELA Tutor according to level of AI knowledge.

Knowledge of AI	n	Usability (M ± SD)	Satisfaction & Usefulness (M ± SD)	Accessibility & Interaction (M ± SD)
None	2	2.75 ± 1.06	2.00 ± 1.41	2.00 ± 1.41
Low	12	3.69 ± 0.83	3.83 ± 0.78	4.21 ± 0.81
Medium	100	3.75 ± 0.82	3.85 ± 0.93	3.89 ± 1.06
High	33	3.95 ± 0.90	3.84 ± 0.93	3.97 ± 1.07

Note: M = mean, SD = standard deviation. The values correspond to means and standard deviations calculated from Likert-type scales (1–5).

Analysis of open-ended questions Q15 and Q16 allowed us to identify thematic categories that systematize students’ comments on ELA Tutor, differentiating between positive aspects and areas for improvement. These categories were linked to the quantitative dimensions of usability, satisfaction and usefulness, and accessibility and interaction, and were quantified using frequencies and percentages, presented in Table 17.

4.2. Adaptive Behavior Analysis of the Reinforcement Learning Mechanism

During the simulation, the Meta-Agent Orchestrator used an update scheme based on an incremental average of rewards, where each selected action was reinforced or penalized according to user feedback. Positive rewards (+1) were assigned when the strategy adequately addressed the user’s need, negative rewards (−1) when confusion or dissatisfaction was evident, and neutral rewards (0) when there was no explicit feedback. One of the interactions performed through the UI is shown in Figure 8.

Table 18 summarizes three representative interaction scenarios executed in a simulated and controlled environment to evaluate the adaptive behavior of the RL-based Meta-Agent. The table reports the detected intent, selected agent, decision rationale, reward signal, and updated Q-Score, illustrating how reinforcement learning progressively reinforces effective strategies and penalizes inadequate ones.

The results provide empirical evidence of the adaptive behavior of the RL-based Meta-Agent Orchestrator under multiple simulated scenarios. The analysis covers social, administrative, conceptual, procedural, technical, and ambiguous interactions, allowing the evaluation of both policy convergence and error correction mechanisms.

In Scenario 1, the system operates mainly under rule-based and heuristic-driven decisions, serving as a baseline for learning. Social and administrative interactions received positive rewards (+1.0), yielding a Q-Score of 1.00 for the corresponding agents after the first update. Procedural queries were also correctly routed to the technical agent, reinforcing the suitability of this strategy in practical contexts.

Conceptual queries processed by the Pedagogical Agent received negative feedback (R = −1.0), leading to a negative Q-Score (−1.00). This result indicates that purely theoretical responses were inadequate for the detected context, triggering a penalization that discourages future selection of this agent under similar states.

Scenario 2 demonstrates policy reuse, where previously learned strategies are preferred over heuristic decisions. Both technical and administrative queries were routed using the learned RL policy, receiving positive rewards and maintaining stable Q-Scores of 1.00. This behavior evidences policy convergence and confirms the Meta-Agent’s ability to generalize learned decisions across similar interaction states.

Scenario 3 evaluates system behavior under ambiguous input and negative feedback. The detected negative signal resulted in a reward of −1.0, triggering a policy adjustment rather than a full penalization. The resulting intermediate Q-Score (0.33) reflects a partial correction, indicating that the system adapts cautiously, balancing prior experience with new evidence instead of reacting abruptly.
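The intermediate value can be reproduced with the incremental average of Equation (1). The worked example below assumes that the selected strategy had already received two positive rewards in the same state, so that the negative signal is its third observation; this history is an assumption, since Table 18 reports only representative rows.

```latex
Q_{new} = \frac{Q_{old} \times (N - 1) + R}{N}
        = \frac{1.00 \times (3 - 1) + (-1.0)}{3}
        = \frac{1}{3} \approx 0.33
```

Under that assumption a single penalty lowers the estimate from 1.00 to 0.33 rather than to −1.00, which is the partial correction described above.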

Table 18. Simulated interaction scenarios and RL-Based decision outcomes in ELA Tutor.

Sn	Interaction Type	User Input	Detected Intent	Selected Agent	Decision Basis	R	Q-Score After Update	Observed Behavior
1	Social/Feedback	Thank you, it worked perfectly	Social feedback	AdaptiveAgent	Rule-based (Social Protocol)	+1.0	1.00	Positive reinforcement of adaptive communication.
1	Administrative	Query about enrolled courses	Administrative	Moodle Queries	Rule-based (Moodle API)	+1.0	1.00	Correct routing to LMS integration.
1	Conceptual	What is cost accounting?	Conceptual	Pedagogical Agent	Heuristic-based	−1.0	−1.00	Theoretical response penalized in practical context.
1	Procedural	How to perform a cash count step by step	Procedural	TechnicalContent	Heuristic-based	+1.0	1.00	Correct activation of technical guidance.
2	Technical	Code example in Python	Pure technical	TechnicalContent	RL policy preferred	+1.0	1.00	Policy reused successfully in similar state.
2	Administrative	Request for grades	Administrative	Moodle Queries	RL policy preferred	+1.0	1.00	Stable convergence in administrative routing.
3	Mixed/ Ambiguous	I don’t understand this part	Negative feedback	AdaptiveAgent	Fallback + RL update	−1.0	0.33	Policy adjusted after negative signal.

Note: The table reports representative simulated interaction scenarios used to evaluate the RL-based decision-making mechanism in ELA Tutor. Rewards reflect explicit user feedback (+1 positive, −1 negative), and Q-scores are updated using an incremental averaging strategy. The results illustrate how the Meta-Agent reinforces effective agent selection, penalizes inadequate strategies, and adjusts routing decisions under ambiguous interaction contexts. Scenario (Sn), Reward (R).

Table 19 summarizes the Meta-Agent’s behavior in response to different types of simulated interaction. For procedural queries, the system achieved a final Q-Score of 0.95 after 20 iterations, demonstrating stable convergence toward the technical agent, which confirms the suitability of this strategy for practical requests. For administrative queries (academic performance), the system showed a maximum Q-Score of 1.00, with a consistently positive average reward, indicating consistent and correct routing to the agent integrated with Moodle. Exclusively conceptual queries showed a negative average reward and a final Q-Score of −0.40, reflecting a process of progressively penalizing the pedagogical agent in contexts where more practical responses were required.

These results demonstrate that the reinforcement learning mechanism allows the system to reinforce successful strategies, correct inadequate decisions, and adapt its behavior in a stable manner, validating the effectiveness of the RL Meta-Agent in the dynamic selection of agents within ELA Tutor integrated into Moodle.

Table 19. RL outcomes by interaction type in the simulated case study.

Interaction Type	Iterations	Avg Reward	Final Q-Score	Behavior
Procedural queries	20	+0.95	0.95	Stable convergence to technical agent
Administrative queries	15	+1.00	1.00	Consistent routing to Moodle agent
Conceptual-only queries	10	−0.40	−0.40	Progressive avoidance of pedagogical agent
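Because the update rule is a sample average over rewards in {+1, 0, −1}, each final Q-Score in Table 19 must equal the average reward, and multiplying it by the number of iterations must give a whole number. The sketch below runs this consistency check on the tabulated values; the helper name `reward_sum` is introduced here for illustration.

```python
# (iterations, average reward, final Q-Score) as listed in Table 19.
TABLE_19 = {
    "procedural": (20, 0.95, 0.95),
    "administrative": (15, 1.00, 1.00),
    "conceptual_only": (10, -0.40, -0.40),
}


def reward_sum(iterations, final_q):
    """Recover the integer sum of rewards implied by a sample-average Q-Score."""
    total = final_q * iterations
    nearest = round(total)
    if abs(total - nearest) > 1e-9 or abs(nearest) > iterations:
        raise ValueError("Q-Score is not a valid average of rewards in {+1, 0, -1}")
    return nearest


sums = {name: reward_sum(n, q) for name, (n, _avg, q) in TABLE_19.items()}
```

The implied reward sums are 19, 15, and −4. For procedural queries, 19 positive rewards and one neutral reward would produce this sum, although other sequences are possible.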

5. Discussion and Future Work

The results obtained in this study show that ELA Tutor is a viable solution for intelligent tutoring in LMSs, effectively integrating a multi-agent architecture, RL mechanisms, and LLMs within a real institutional platform such as Moodle. The evaluation was approached at two complementary levels: users’ perception of the system and analysis of the tutor’s internal adaptive behavior.

From the HAI perspective, the high levels of usability, satisfaction, and accessibility observed indicate that the system manages to offer a consistent and equitable user experience, regardless of sociodemographic variables such as gender, age, or prior knowledge of AI. These results reinforce the idea that the adoption of conversational interfaces, combined with transparent and controlled routing logic, facilitates the acceptance of intelligent systems in real educational contexts.

In terms of adaptive behavior, the analysis of the Meta-Agent Orchestrator shows that the use of a simplified RL mechanism allows the system to progressively refine the selection of specialized agents based on the feedback received. The positive convergence of Q values in technical and administrative scenarios, together with the effective penalization of inappropriate strategies in practical contexts, confirms that the model is capable of differentiating intentions, correcting suboptimal decisions, and maintaining operational stability. The adoption of an incremental averaging scheme also contributes to the interpretability of learning and avoids erratic behavior, a critical aspect in educational applications.
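As a minimal illustration of the incremental averaging scheme discussed above, the following Python sketch keeps one Q value per (interaction type, agent) pair and updates it as a running mean of the rewards received, Q ← Q + (r − Q)/n. The class name, the agent labels, the greedy selection rule, and the reward sequence are assumptions made for illustration; they are not taken from the ELA Tutor implementation.

```python
from collections import defaultdict


class IncrementalQTable:
    """Running-mean Q estimates per (interaction_type, agent) pair (hypothetical helper)."""

    def __init__(self):
        self.q = defaultdict(float)     # current estimate, starts at 0.0
        self.counts = defaultdict(int)  # number of rewards seen per pair

    def update(self, interaction_type, agent, reward):
        key = (interaction_type, agent)
        self.counts[key] += 1
        # Incremental mean: Q <- Q + (r - Q) / n
        self.q[key] += (reward - self.q[key]) / self.counts[key]
        return self.q[key]

    def best_agent(self, interaction_type, agents):
        # Greedy choice; an assumption, since the selection rule is not given here.
        return max(agents, key=lambda a: self.q[(interaction_type, a)])


table = IncrementalQTable()

# Illustrative discrete rewards only, not data from the study.
rewards = [1.0, 1.0, -1.0, 1.0]
for r in rewards:
    final_q = table.update("procedural", "technical", r)

# One negative rating for a competing agent.
table.update("procedural", "pedagogical", -1.0)

chosen = table.best_agent("procedural", ["technical", "pedagogical"])
print(final_q, chosen)
```

Because the update is a plain running mean, the final Q value equals the average reward over the observed iterations, which is consistent with Table 19, where the Avg Reward and Final Q-Score columns coincide.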

A relevant element is the explicit separation between language generation and pedagogical decision-making, where LLMs act as communication support rather than autonomous decision-making agents. This design decision reduces algorithmic opacity, improves traceability, and facilitates teacher supervision, aligning with ethical and governance principles of educational AI.

In comparison with existing intelligent tutoring systems, the proposed ELA Tutor architecture extends prior approaches by explicitly integrating a multi-agent tutoring core with a reinforcement learning–based meta-agent for strategic orchestration. While many ITS solutions rely on static workflows, rule-based routing, or single-agent decision mechanisms, ELA Tutor introduces adaptive strategy selection driven by accumulated interaction experience, while preserving a strict separation between decision-making and content generation. Unlike comparable systems that embed adaptation directly within language models or fixed pedagogical rules, the proposed design emphasizes modularity, safety, and interpretability, enabling deployment in real LMS environments. This qualitative comparison highlights the relevance of the proposed approach for scalable and ethically aligned intelligent tutoring in higher education contexts.

However, the study has some limitations. The RL evaluation was conducted in a simulated and controlled environment, with a limited number of interactions and discrete rewards. While this allowed for validation of the algorithm’s operation and architectural integration, longitudinal studies in real-world scenarios are needed to analyze its behavior in the face of greater variability in states and strategies.

As a line of future work, we propose extending the evaluation of the RL Meta-Agent to real environments of prolonged use, incorporating continuous interactions throughout an entire academic period. This would allow us to analyze the evolution of the learned policies, the stability of the system, and its impact on long-term academic monitoring.

We plan to expand the space of states and rewards, incorporating additional indicators such as dropout patterns, response times, or cumulative performance, as well as exploring more advanced reinforcement learning schemes, such as hybrid approaches with approximate value functions or adaptive policies by disciplinary context. Another relevant line of research consists of deepening the integration of the teaching role, allowing teachers to supervise, adjust, or validate the Meta-Agent’s policies from the LMS, strengthening the HAI approach and the governance of the system.

6. Conclusions

This paper presented ELA Tutor, a system that integrates simplified RL and dynamically coordinates the selection of tutoring strategies through the cooperation of specialized agents and language models, maintaining a clear separation between language generation and pedagogical decision-making.

The results obtained in real and simulated environments show that the system effectively differentiates user intentions, reinforces successful strategies, and penalizes inappropriate decisions, achieving stable and consistent behavior in agent selection. The use of an incremental reward averaging mechanism proved adequate for ensuring stability, interpretability, and progressive adaptation in educational contexts with discrete and scarce feedback.

From the HAI perspective, usability, satisfaction, and accessibility analyses showed positive and homogeneous ratings, with no significant differences associated with gender, age, or prior level of AI knowledge. This suggests that the proposed architecture is robust, equitable, and accessible, reinforcing its viability as an institutional solution for automated academic monitoring.

This work contributes to the field of Intelligent Tutoring Systems by demonstrating that the combination of MAS, RL, and IAGen, integrated into real LMS platforms, allows for the construction of adaptive, traceable, and ethically governable tutoring systems. The proposed architecture provides a solid foundation for the development of scalable intelligent educational systems, aligned with real institutional needs and current requirements for transparency and control in AI applications.

Author Contributions

J.P.L.-G.: Writing original draft, Project administration, Methodology, Investigation, Conceptualization. A.G.-B.: Writing original draft, Methodology, Investigation, Formal analysis. Y.D.: Writing complements, review & editing, Validation, Methodology, Investigation, Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data will be made available on request. The survey data generated and analyzed during this study include anonymized quantitative responses and qualitative open-ended comments collected from university students. The anonymized dataset can be provided by the corresponding author upon reasonable request, subject to institutional approval and compliance with the informed consent agreement signed by participants.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
IAGen	Generative AI
RL	Reinforcement Learning
ITS	Intelligent Tutoring Systems
MAS	Multi-Agent System

References

Cibu, B.-R.; Crăciun, L.; Molănescu, A.G.; Cotfas, L.-A. Exploring the Educational Applications of Large Language Models: A Systematic Review and Topic Analysis. Electronics 2025, 14, 4683.
Riedmann, A.; D’Eramo, C.; Lugrin, B. Real-world testing for reinforcement learning in education. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Detroit, MI, USA, 19–23 May 2025; Available online: https://www.ifaamas.org/Proceedings/aamas2025/pdfs/p1764.pdf (accessed on 18 November 2025).
Zou, Y.; Wang, H.; Li, J. Digital learning in the 21st century: Trends, challenges, and innovations. Front. Educ. 2025, 10, 1562391.
Giuffra, L.; Soler, E.; Rossi, G. A multi-agent system model to integrate virtual learning environments and intelligent tutoring systems. Int. J. Interact. Multimed. Artif. Intell. 2013, 2, 6–16.
Hassouna, A.B.; Chaari, H.; Belhaj, I. LLM-Agent-UMF: LLM-based Agent Unified Modeling Framework for Seamless Design of Multi Active/Passive Core-Agent Architectures. Inf. Fusion 2026, 127, 103865.
Létourneau, A.; Robillard, P.N.; Léger, P.-M. A systematic review of AI-driven intelligent tutoring systems in K-12 education. Sci. Rep. 2025, 15, 8421.
Brohi, S.; Mastoi, Q.; Jhanjhi, N.Z.; Pillai, T.R. A Research Landscape of Agentic AI and Large Language Models: Applications, Challenges and Future Directions. Algorithms 2025, 18, 499.
Riedmann, A.; Schaper, P.; Lugrin, B. Reinforcement Learning in Education: A Systematic Literature Review. Int. J. Artif. Intell. Educ. 2025, 35, 2669–2723.
Zha, S.; Liu, Y.; Zheng, C.; Xu, J.; Yu, F.; Gong, J.; Xu, Y. Mentigo: An Intelligent Agent for Mentoring Students in the Creative Problem Solving Process. arXiv 2024, arXiv:2409.14228.
Zerkouk, M.; Mihoubi, M.; Chikhaoui, B. A Comprehensive Review of AI-based Intelligent Tutoring Systems: Applications and Challenges. arXiv 2025, arXiv:2507.18882.
Deshmukh, S.; Sen, V. Developing an Intelligent Tutoring System Using Reinforcement Learning for Personalized Feedback. Int. Acad. J. Sci. Eng. 2025, 12, 30–33.
Viswanathan, N.; Yin, Y.; Ramachandran, S. Enhancement of online education system by using a multi-agent intelligent tutoring system. Comput. Educ. Artif. Intell. 2022, 3, 100057.
Panagiotidis, P. LLM-based chatbots in language learning: A systematic literature review. Comput. Educ. 2024, 7, 102–123.
Silva, A.P.; Fernandes, J.; Rocha, A. A Recommendation Module based on Reinforcement Learning to an Intelligent Tutoring System. In Proceedings of the ICISSp 2022—8th International Conference on Information Systems Security and Privacy, Virtual Event, 9–11 February 2022; pp. 733–740.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Adaptive Learning Control via Proximal Policy Optimization. In Proceedings of the CEUR Workshop Proceedings, Odesa, Ukraine, 24–26 September 2025; Available online: https://ceur-ws.org/Vol-4048/paper37.pdf (accessed on 18 November 2025).
Piech, C.; Bassen, J.; Huang, J.; Ganguli, S.; Sahami, M.; Guibas, L.J.; Sohl-Dickstein, J. Deep Knowledge Tracing. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 7–12 December 2015; pp. 505–513. Available online: https://stanford.edu/~cpiech/bio/papers/deepKnowledgeTracing.pdf (accessed on 18 November 2025).
Wang, X.; Zhang, L.; Chen, W. Application of reinforcement learning in personalized learning path recommendation in secondary education. IEEE Trans. Learn. Technol. 2025, 18, 145–159.
Moncada-Ramirez, J.; Matez-Bandera, J.-L.; Gonzalez-Jimenez, J.; Ruiz-Sarmiento, J.-R. Agentic Workflows for Improving Large Language Model Reasoning in Robotic Object-Centered Planning. Robotics 2025, 14, 24.
Le Thanh, T. Towards multi-agent system for learning object recommendation in e-learning platforms. Heliyon 2024, 10, e35119.
Ivanova, M. A multi-agent architecture for learning paths-based personalized e-learning systems. Int. J. Inf. Technol. Syst. 2023, 15, 15–28.
Piccialli, F.; Chiaro, D.; Sarwar, S.; Cerciello, D.; Qi, P.; Mele, V. AgentAI: A comprehensive survey on autonomous agents in distributed AI for industry 4.0. Expert Syst. Appl. 2025, 291, 128404.
Beale, R. Aligning conversational AI with proven theories of learning. arXiv 2025, arXiv:2506.19484.
Bandi, A.; Kongari, B.; Naguru, R.; Pasnoor, S.; Vilipala, S.V. The Rise of Agentic AI: A Review of Definitions, Frameworks, Architectures, Applications, Evaluation Metrics, and Challenges. Future Internet 2025, 17, 404.
Córdova-Esparza, D.-M. AI-Powered Educational Agents: Opportunities, Innovations, and Ethical Challenges. Information 2025, 16, 469.
Jiang, Y.-H.; Lu, Y.; Dai, L.; Wang, J.; Li, R.; Jiang, B. Agentic Workflow for Education: Concepts and Applications. 2025. Available online: https://arxiv.org/abs/2509.01517 (accessed on 1 December 2025).
Sapkota, R.; Roumeliotis, K.I.; Karkee, M. AI Agents vs. Agentic AI: A Conceptual taxonomy, applications and challenges. Inf. Fusion 2026, 126, 103599.
Essa, S.G.; Celik, T.; Human-Hendricks, N.E. Personalized Adaptive Learning Technologies Based on Machine Learning Techniques to Identify Learning Styles: A Systematic Literature Review. IEEE Access 2023, 11, 48392–48409.
Moodle. Moodle Learning Management System. 2024. Available online: https://moodle.org (accessed on 21 November 2025).
Lepper, M.R.; Woolverton, M. The Wisdom of Practice: Lessons Learned from the Study of Highly Effective Tutors. Adv. Instr. Psychol. 2002, 1, 135–158.
Shahriari, K.; Shahriari, M. IEEE standard review—Ethically aligned design: A vision for prioritizing human wellbeing with artificial intelligence and autonomous systems. In Proceedings of the 2017 IEEE Canada International Humanitarian Technology Conference (IHTC), Toronto, ON, Canada, 21–22 July 2017; pp. 197–201.
IEEE. Ethically Aligned Design: A Vision for Prioritizing Human Well-Being with Autonomous and Intelligent Systems, 2nd ed.; IEEE: New York, NY, USA, 2017; Available online: https://standards.ieee.org/industry-connections/ec/autonomous-systems.html (accessed on 18 November 2025).

Figure 1. RL Cycle of the Meta-Agent Orchestrator in ELA Tutor.

Figure 2. Architecture of the proposed multi-agent intelligent tutoring system with RL.

Figure 3. Multi-Agent Workflow and Decision Flow of the ELA Tutor Architecture.

Figure 4. Simplified Q-Learning Decision Cycle for Meta-Agent Orchestration.

Figure 5. ELA Tutor UI.

Figure 6. KDE of ELA TUTOR usability, satisfaction, and accessibility scores. Note. Kernel density curves provide a smoothed approximation of the distribution of Likert-scale scores for each dimension and are used here for comparative, descriptive purposes only.

Figure 7. Kernel density estimations of ELA TUTOR dimensions by age range. Note: The density curves represent kernel density estimates that provide a smoothed approximation of the distribution of Likert-scale scores (1–5) for each dimension and are intended solely for descriptive comparison.

Figure 8. Interaction performed in the simulation. Note: This figure shows the interaction for obtaining feedback through the ELA Tutor UI. In this case, the user rates the generated response positively, allowing the RL Meta-Agent to store the rating and adapt.

Table 1. Ethical criteria for the design of ELA Tutor: IEEE Ethically Aligned Design [31].

IEEE EAD Principle	Criterion Applied	Implementation in ELA Tutor	Architectural Component
Human Well-being	Student-centered tutoring	Academic support and emotional accompaniment are prioritized.	MAS
Human Well-being	Indirect teacher supervision	Pedagogical decisions are based on academic data and can be supervised by instructors.	Moodle, Django API
Bias and Fairness	Equity in access	Enrolled students have access to the same functionalities.	Moodle, Web UI
Bias and Fairness	Performance-based adaptation	Personalization is grounded in academic and interaction indicators.	Intelligent Switching Router, RL Meta-Agent
Transparency	Decision traceability	Every decision is logged and traceable.	PostgreSQL, RL Policy Store
Transparency	Separation between decision-making and generation	The LLM generates language but does not decide pedagogical strategies, reducing algorithmic opacity.	n8n Orchestrator, Router
Accountability	Decision flow control	Tutoring logic is implemented through explicit flows and auditable rules.	n8n Orchestrator
Trust and Reliability	Consistency in tutoring	Students with similar academic backgrounds receive coherent strategies.	MAS
Trust and Reliability	Experience-validated learning	The system adjusts its behavior only when explicit evidence of positive or negative feedback exists.	RL Meta-Agent, Reward Calculator
Privacy and Data Governance	Data minimization	The system uses only the academic data necessary for tutoring.	Django API, PostgreSQL
Privacy and Data Governance	Isolation of sensitive information	Data are stored in separate layers with controlled access.	Docker, Data Layer
Awareness of Misuse	Ethical content filtering	Generated responses are validated.	Ethical Agent
Awareness of Misuse	Data minimization	The system uses only the academic data necessary for tutoring.	Django API, PostgreSQL
Societal and Cultural Awareness	Educational contextualization	The answers align with the institutional curriculum.	Pedagogical Agent, Moodle
Robustness and Security	Architectural resilience	The use of microservices and containers allows faults to be isolated and system operation to be maintained.	Docker, n8n Orchestrator

Table 10. Questionnaire items for the ELA Tutor evaluation.

Code	Dimension	Item
Q0	Demographic profile	Sex, age, degree program, prior use of AI
Q1	Usability	The intelligent tutoring system was easy to use.
Q2		I did not need much help to learn how to use the system.
Q3		The system’s functions are well integrated and consistent with each other.
Q4		At times, the system became confusing or difficult to understand.
Q5	Satisfaction & usefulness	I am satisfied with the responses provided by the intelligent tutor.
Q6		The intelligent tutor helped me better understand the course content.
Q7		The tutor’s responses were clear and useful.
Q8		I would like to continue using this intelligent tutor in other courses.
Q9	Accessibility & interaction	I was able to access the system without major technical difficulties.
Q10		The response time of the intelligent tutor was adequate.

Table 12. Descriptive statistics of ELA TUTOR dimensions.

Dimension	M	SD
Usability	3.77	1.21
Satisfaction & usefulness	3.82	1.02
Accessibility & interaction	3.90	1.15

Note: M = mean, SD = standard deviation.

Table 13. Descriptive statistics by question from the ELA Tutor evaluation.

Dimensions	Question	M	SD
Usability	Q1	4.03	1.09
	Q2	3.90	1.22
	Q3	3.84	1.06
	Q4 *	3.30	1.33
Satisfaction & usefulness	Q5	3.76	1.02
	Q6	3.85	0.97
	Q7	3.81	1.03
	Q8	3.85	1.06
Accessibility & interaction	Q9	3.89	1.25
	Q10	3.91	1.06

Note: * Q4 was phrased negatively, and its values were reversed for dimensional analysis so that higher scores indicate a more favorable perception of the system. The mean and SD of Q4 are presented in their original form for descriptive purposes only. M = mean, SD = standard deviation.
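The reversal described in the note can be sketched as follows, assuming the 1–5 Likert scale used in the questionnaire: a reversed score is obtained as (minimum + maximum) − original, so 1 becomes 5 and 3 stays 3. The function names, the aggregation of a dimension as the mean of its items, and the example respondent are assumptions for illustration only.

```python
from statistics import mean

LIKERT_MIN, LIKERT_MAX = 1, 5


def reverse_score(value, low=LIKERT_MIN, high=LIKERT_MAX):
    """Mirror a Likert rating so that higher always means more favorable."""
    if not low <= value <= high:
        raise ValueError(f"rating {value} outside {low}-{high}")
    return low + high - value


def usability_score(q1, q2, q3, q4):
    """Mean of the four usability items, with the negatively worded Q4 reversed.

    Averaging the items is an assumption; the aggregation rule is not stated here.
    """
    return mean([q1, q2, q3, reverse_score(q4)])


# Hypothetical single respondent, not data from the study.
example = usability_score(q1=4, q2=4, q3=5, q4=2)
print(example)
```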

Table 14. Descriptive statistics for the dimensions of ELA TUTOR according to student gender.

Dimensions	Male (n = 90) M ± SD	Female (n = 60) M ± SD	Δ M (F − M)
Usability	3.74 ± 1.18	3.78 ± 1.16	+0.04
Satisfaction & usefulness	3.80 ± 1.05	3.84 ± 1.02	+0.04
Accessibility & interaction	3.87 ± 1.22	3.92 ± 1.18	+0.05

Note: M = mean, SD = standard deviation.
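The abstract states that group differences were examined with nonparametric tests, including Mann–Whitney U. As a rough sketch of how the U statistic is obtained for two independent groups of Likert scores, the pure-Python example below ranks the pooled values (tied values share the mean of their positions) and derives U from the rank sum of the first group. The function names and the two small samples are hypothetical, and no p-value is computed; in practice a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would typically be used.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical Likert scores, not data from the study.
male = [4, 3, 5, 4, 2]
female = [4, 4, 3, 5, 5]
u_male, u_female = mann_whitney_u(male, female)
print(u_male, u_female)
```

The two statistics always sum to the product of the group sizes, which gives a quick sanity check on the ranking step.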

Table 15. Descriptive statistics for the dimensions of ELA TUTOR according to age ranges.

Dimensions	≤20 Years M ± SD	21–25 Years M ± SD	≥26 Years M ± SD
Usability	3.72 ± 1.17	3.77 ± 1.15	3.75 ± 1.18
Satisfaction & usefulness	3.79 ± 1.04	3.83 ± 1.02	3.81 ± 1.06
Accessibility & interaction	3.85 ± 1.21	3.91 ± 1.18	3.88 ± 1.20

Note: M = mean, SD = standard deviation.

Table 17. Thematic categories derived from open-ended responses and their association with ELA Tutor dimensions.

Question & Comment Type	Qualitative Category	Main Associated Dimension	Approximate Frequency of Mentions
Q15—Positive	Response speed (“quick answers”, “immediate”, “adequate response time”)	Usability/ Accessibility	High
Q15—Positive	Clarity and usefulness of answers (“clear”, “coherent”, “helps me understand topics better”)	Satisfaction/ Usefulness	High
Q15—Positive	Academic support and task management (“helps with assignments”, “shows pending tasks and virtual classroom status”)	Satisfaction/ Accessibility	Medium
Q15—Positive	Ease of use and simple interface (“easy to use”, “simple handling”, “simple/minimalist interface”)	Usability	Medium
Q15—Positive	Integration with degree/program or student context (“knows my major”, “linked to the university”)	Satisfaction	Low–medium
Q16—Improvement	Coherence and variety of answers (“more precise”, “does not always get it right”, “avoid repeating the same answer”)	Satisfaction/ Usefulness	High
Q16—Improvement	Depth and structure of explanations (“more detailed”, “step by step”, “not all in one paragraph”)	Satisfaction	Medium
Q16—Improvement	Interface and visual design (“improve design”, “more attractive/dynamic”, “change colors or logo”)	Usability	High
Q16—Improvement	Handling of files, images and mobile app (“upload PDF/Word/JPG”, “send images”, “mobile app”)	Accessibility	Medium

Note: Frequency levels indicate the approximate number of students referencing each category.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

