Article

Integrating AI-Driven Wearable Metaverse Technologies into Ubiquitous Blended Learning: A Framework Based on Embodied Interaction and Multi-Agent Collaboration

1 Graduate School of Education, Peking University, Beijing 100871, China
2 College of Education, Zhejiang University, Hangzhou 310058, China
3 School of Education, City University of Macau, Macau 999078, China
4 Program of Learning Sciences, National Taiwan Normal University, Taipei 100610, Taiwan
5 Department of Business Administration, Iqra University, Karachi 75500, Pakistan
6 Faculty of Education, University of Primorska, Cankarjeva 5, 6000 Koper, Slovenia
7 Faculty of Civil and Geodetic Engineering, University of Ljubljana, Jamova 2, 1000 Ljubljana, Slovenia
8 School of Information Engineering, Hangzhou Medical College, Hangzhou 311399, China
* Authors to whom correspondence should be addressed.
Educ. Sci. 2025, 15(7), 900; https://doi.org/10.3390/educsci15070900
Submission received: 9 April 2025 / Revised: 5 July 2025 / Accepted: 6 July 2025 / Published: 15 July 2025

Abstract

Ubiquitous blended learning, leveraging mobile devices, has democratized education by enabling autonomous and readily accessible knowledge acquisition. However, its reliance on traditional interfaces often limits learner immersion and meaningful interaction. The emergence of the wearable metaverse offers a compelling solution, promising enhanced multisensory experiences and adaptable learning environments that transcend the constraints of conventional ubiquitous learning. This research proposes a novel framework for ubiquitous blended learning in the wearable metaverse, aiming to address critical challenges, such as multi-source data fusion, effective human–computer collaboration, and efficient rendering on resource-constrained wearable devices, through the integration of embodied interaction and multi-agent collaboration. This framework leverages a real-time multi-modal data analysis architecture, powered by the MobileNetV4 and xLSTM neural networks, to facilitate the dynamic understanding of the learner’s context and environment. Furthermore, we introduce a multi-agent interaction model, utilizing CrewAI and spatio-temporal graph neural networks, to orchestrate collaborative learning experiences and provide personalized guidance. Finally, we incorporate lightweight SLAM algorithms, augmented using visual perception techniques, to enable accurate spatial awareness and seamless navigation within the metaverse environment. This innovative framework aims to create immersive, scalable, and cost-effective learning spaces within the wearable metaverse.

1. Introduction

In recent years, the rise of the metaverse has opened up immense opportunities for the field of education. As an advanced immersive environment that blends virtual and physical realities, the metaverse has the potential to revolutionize learning methodologies and reshape educational paradigms (Phakamach et al., 2022). Especially for K-16 learners, this technology promises to unlock educational experiences that are otherwise impossible, impractical, or unsafe (López-Belmonte et al., 2023). Wearable technology stands out as a key enabler of this vision, facilitating rich, multi-modal interactions within immersive ubiquitous learning environments. By supporting seamless, real-time interaction, wearable devices promise to deliver highly personalized, context-aware educational experiences that go beyond the limitations of traditional learning approaches (Zhou et al., 2024).
The wearable metaverse holds vast application potential while also facing challenges. Leading technology companies, such as Apple and Meta, have launched sophisticated wearable devices, like the Vision Pro and Orion, which integrate multisensory interactions and provide immersive experiences across industries, including education and tourism (Pan, 2024). Nevertheless, while the current research has extensively examined standalone applications of wearable devices in education, less attention has been paid to their potential as part of a cohesive ecosystem. The multi-source data collected by these wearable devices, such as physiological signals, environmental information, and user interaction data, remains underutilized due to a lack of effective integration and analysis (Chakma et al., 2021), hindering the development of high-level embodied interactions that rely on full-body perception. The current research has conducted preliminary explorations into the application of AI agents in the metaverse (X. Kang et al., 2024). However, the existing studies primarily focus on algorithm optimization or task simulation (Feng et al., 2025; Yu, 2023), while the specific mechanisms of human–agent collaboration remain largely unaddressed. As a result, intelligent agents tend to play a relatively passive role in metaverse learning environments, making it difficult to achieve truly intelligent and collaborative interactions. Furthermore, processing the multi-source data required to support complex embodied interactions demands substantial computational resources, which often exceed the capabilities of most wearable devices (X. Wang et al., 2024). Consequently, enabling low-computation-cost yet high-performance data analysis on these wearable devices has become a critical challenge.
This study proposes a conceptual framework for multi-source data analysis, low-computation-cost processing, and human–machine cooperation in wearable metaverse environments. By integrating embodied interaction and multi-agent collaboration, the proposed framework offers a comprehensive approach to designing wearable metaverse learning environments. It combines a lightweight real-time multi-modal data processing framework, a multi-agent cooperation framework, and a rendering framework that couples a lightweight SLAM algorithm with visual perception. This research aims to provide theoretical and technical support for the construction of wearable metaverse learning spaces and large-scale immersive and ubiquitous learning.

2. Literature Review

2.1. Wearable Devices in Ubiquitous Blended Learning

Wearable devices, with their portability, interactivity, and versatility, are an effective technological means for creating blended environments (Frisoli & Leonardis, 2024; Palermo et al., 2025). Wearable devices allow students to immerse themselves in a virtual space, mobilizing their vision, hearing, and even touch to achieve comprehensive interactions between themselves and the virtual environment. This learning method is consistent with the core principles of immersive and ubiquitous blended learning (Cárdenas-Robledo & Peña-Ayala, 2018). Furthermore, these devices provide students with continuous access to knowledge in various environments, offering greater flexibility, efficiency, and participation in the learning process. In recent years, the rapid development of wearable technologies such as smart watches, smart glasses, and smart clothing has further expanded the possibilities of metaverse-based learning. These devices help students stay engaged by enabling instantaneous and continuous learning across different settings. For example, smart glasses facilitate schoolchildren’s multi-modal story creation by combining 3D virtual objects and hologram elements to enable the children to visualize their invented stories (Mills & Brown, 2023). A study quantified the learner experience and usability of a VR game using data from smartwatch gestures, finding that participants felt comfortable with the system, used it easily, and felt empowered (Nascimento et al., 2023).
Wearable devices play an active role in analyzing and enhancing learners’ learning process. They can analyze learners’ behaviors and emotional states in real time, such as by using smart bracelets to record students’ attention levels and emotions and help build adaptive learning systems that dynamically respond to individual needs (Ba & Hu, 2023). This enables the development of personalized learning spaces in the education metaverse using heart rate signals to assess students’ emotional engagement and cognitive activity levels (Z. Zhao et al., 2022). In addition, wearable devices have been integrated into various learning activities to enhance their immersion and interactivity. For example, at a museum’s dinosaur exhibition presented in English, smart glasses were shown to significantly improve the learning efficiency and motivation compared to tablets (Chen et al., 2023). Similarly, wearable AR and hybrid AR/VR learning materials were also found to significantly improve high school students’ situational interest, engagement, and learning performance in physics laboratories, with hybrid AR/VR outperforming traditional learning methods (J. C. Y. Sun et al., 2023).
Despite the numerous advantages of using wearable devices in educational applications, their further development still faces some technical challenges. One of the major hurdles is effectively processing and analyzing the heterogeneous multi-source data collected by wearable devices. Wearable technologies typically contain a variety of sensors that capture various data types such as physiological signals (e.g., eye-tracking data, heart rates, electroencephalograms) and environmental information (e.g., location, temperature) (Heikenfeld et al., 2018). Integrating and making sense of these disparate data sources requires developing sophisticated data fusion techniques and mining algorithms specifically tailored to the unique characteristics of wearable data. Moreover, the limited computational power, storage capacity, and communication capabilities of wearable devices pose significant barriers to their ability to support advanced learning analytics and interactive functionalities (Nahavandi et al., 2022). This limitation hinders the implementation of real-time and immersive interactive features (Hazarika & Rahmati, 2023). Research on low-computation-cost technologies for wearable devices used within immersive metaverse learning environments has become an urgent direction to address these challenges.

2.2. Embodied Interaction in Ubiquitous Blended Learning

With the advancement of cognitive science, embodied cognition theory has attracted widespread attention in education research. This theoretical framework emphasizes the fundamental role of the body in cognitive processes, holding that cognition not only depends on the function of the brain but also arises through the dynamic interaction between the body and its environment (Foglia & Wilson, 2013). This view has opened up new directions for research and practical application in immersive learning environments. Embodied interaction, as an emerging paradigm in human–computer interaction, integrates whole-body sensory and motor systems to create more natural and intuitive interactive experiences (Crowell et al., 2018). In immersive education environments, the application of embodied interaction mainly occurs across three dimensions.
First, designing and applying diversified interactive devices, such as motion capture systems, tactile feedback equipment, and brain–computer interfaces, provides technical support for ubiquitous learning. These devices can track learners’ physical movement, physiological state, and nerve signals in real time, thereby providing corresponding immersive feedback (Crowell et al., 2018; Fleury et al., 2020). For example, VR and motion capture have been used to offer an interactive Tai Chi learning system with a virtual coach, real-time feedback, and avatar control, enhancing self-learning by overcoming the limitations of traditional and video-based methods (J. Liu et al., 2020). Innovative wearable rings with multi-modal sensors and haptic feedback have enhanced immersive social interactions in metaverse-based education by enabling tactile and thermal perception (Z. Sun et al., 2022). Additionally, utilizing a brain–computer interface (BCI) to monitor a student’s brain activity, an embodied robot can detect attention lapses in real time and provide immediate, adaptive responses, thereby improving learning efficacy (Vrins et al., 2022).
Second, embodied interaction has been implemented across various academic disciplines with various application patterns. For instance, Kinect sensors and gesture-based interactions have been used in physics education to create mixed-reality environments where students learn about electric fields through bodily movements and interactive gestures (Johnson-Glenberg & Megowan-Romanowicz, 2017). A study comparing traditional controls and 3D-printed haptic devices in a mixed-reality chemistry lesson found that while both groups exhibited improved knowledge, highly embodied interaction enhanced science identity and efficacy (Johnson-Glenberg et al., 2023). Virtual museums used in science education have been studied using eye-tracking technology to analyze students’ performance and mental effort, with the goal of enhancing virtual museum design and resource development (Wu et al., 2024).
Third, embodied interaction can greatly improve students’ motivation (Lindgren et al., 2016). An embodied interactive teaching model can be personalized and differentiated to meet students’ needs, attracting their attention through multisensory presentation methods and thereby enabling more effective knowledge transfer and skill development. Embodied interaction has taken root in gamified learning. In such environments, embodied interaction technology supports interactive learning by integrating educational content into virtual contexts, stimulating students’ intrinsic motivation and sparking their curiosity and interest (Abrahamson et al., 2020). The positive impact of embodied interaction on students’ learning outcomes is well-established (Mira et al., 2024). It not only enhances students’ cognitive abilities and motor skills but also fosters their emotional development and learning motivation (Kosmas & Zaphiris, 2023). However, in spite of the promising prospects of using embodied interaction in ubiquitous learning, its large-scale promotion and practical application still face multiple challenges. These include high costs and computational resource requirements, the need for more comprehensive principles and standards in interaction design, and the difficulty of accurately quantifying the comprehensive impact of embodied interaction on students’ cognition, emotions, and behavior.

2.3. AI Agents in the Metaverse

The rapid advancements in Large Language Models (LLMs) have led to significant breakthroughs in natural language understanding and generation by LLM-based agents, bringing revolutionary changes to education (Xi et al., 2023). LLMs empower AI agents with multidimensional capabilities such as perception, tool invocation, reasoning, planning, interaction, and self-evolution, enabling them to autonomously learn, make decisions, and act in complex, blended-reality environments (Gao et al., 2024). Through real-time interactions with the environment or humans, agents continuously optimize their behavioral strategies by receiving feedback (González-Briones et al., 2018), allowing them to learn continuously in real-world scenarios and enhance their intelligence, interactivity, and collaboration. However, because a single agent struggles with a high cognitive load and inefficient task division in complex educational settings, multi-agent systems (MASs) represent a promising solution (Amirkhani & Barshooi, 2022). By incorporating social attributes and defining roles and communication mechanisms, MASs can engage in cooperative and competitive social interactions to handle more complex educational tasks (Song et al., 2024). MASs can share parameters, knowledge, and decisions, enhancing the robustness and scalability of algorithms through communication (Janbi et al., 2023). Additionally, through interactive collaboration, MASs simulate complex social scenarios that reflect group cooperation dynamics, helping learners understand the behaviors and emotions associated with different roles, thereby enhancing social perception. To simplify the development of MASs, researchers have created frameworks based on LLMs, such as AutoGen, CrewAI, CAMEL, and MetaGPT (Arslan et al., 2024). These frameworks provide powerful tools for facilitating collaboration and competition among agents.
Leveraging MAS frameworks will enhance the performance, efficiency, robustness, and scalability of metaverse educational systems. In metaverse educational practice, LLM-based MASs demonstrate excellent human–computer collaboration capabilities (Xia et al., 2024). Unlike traditional human–computer cooperation processes, MASs can manage human resources proactively by designing socially interactive virtual–physical roles. These systems simulate complex social role interactions, understand learners’ social behavior, and dynamically adjust according to the social norms implied by users’ actions and the environment (Gatto et al., 2022). This adaptive flexibility enables MASs to be widely applied in immersive educational scenarios such as video games, virtual reality, and training simulations. Examples include Stanford University’s AI Agent Town (Park et al., 2023) and agent-based hospitals (Li et al., 2024). Despite the promising prospects of using LLM-based MASs in metaverse education, their development still faces numerous challenges. Specifically, these include the need to advance multi-agent collaboration algorithms, develop robust frameworks, and improve agents’ recognition of metaverse elements (Gatto et al., 2022).

3. A Conceptual Framework for Wearable Metaverse Environments

3.1. The Overall Framework of the Model

This study constructed a wearable metaverse learning environment framework, as shown in Figure 1, to enhance the learning experience in immersive ubiquitous learning. The model consists of four key modules: (1) an Embodied Interaction Module; (2) a Multi-Agent Collaboration Module; (3) a Multi-Source Data Fusion Module; and (4) a Low-Computation-Cost Optimization Module. Through the interconnection of these modules and their interaction with various components of the system, an immersive ubiquitous learning system is formed.

3.2. Embodied Interaction Module

3.2.1. Data Collection and Sensor Integration

In a wearable metaverse learning environment, the comprehensive and real-time collection of learners’ multi-modal data forms the foundation for achieving embodied interaction (Closser et al., 2022). In this study, we developed a modular system for embodied data collection, integrating diverse sensors and wearable devices. The key components include an eye-tracking sensor embedded in smart glasses to capture metrics like fixation points, the gaze duration, and the blink frequency for analyzing learners’ attention, cognitive load, and visual health; EEG sensors to assess learners’ emotional states through their brainwave patterns; and a positioning wristband with inertial sensors to monitor real-time positions, movement trajectories, and gestures. Additionally, haptic sensors measure environmental parameters such as the temperature and pressure, enabling context-aware haptic simulations.
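To make the data model concrete, the sketch below shows one way such a synchronized multi-modal sample could be represented in Python; the field names, units, and values are illustrative assumptions rather than a specification of any particular device API.

```python
# A minimal sketch with assumed field names/units, not a real device API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EmbodiedSample:
    """One synchronized multi-modal reading from the wearable sensor set."""
    timestamp_ms: int
    gaze_xy: Tuple[float, float]            # normalized fixation point from the smart-glasses eye tracker
    gaze_duration_ms: float
    blink_rate_hz: float
    eeg_band_power: List[float]             # e.g., alpha/beta/theta power used to infer emotional state
    imu_accel: Tuple[float, float, float]   # wristband accelerometer (m/s^2)
    imu_gyro: Tuple[float, float, float]    # wristband gyroscope (rad/s)
    ambient_temp_c: float                   # environmental temperature for context-aware haptics
    contact_pressure_kpa: float             # pressure reading for haptic simulation

sample = EmbodiedSample(0, (0.52, 0.47), 240.0, 0.3, [0.6, 0.2, 0.1],
                        (0.0, 0.1, 9.8), (0.0, 0.0, 0.01), 23.5, 101.3)
```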

3.2.2. Embodied Interaction Strategies

In the proposed wearable metaverse learning environment, we suggest implementing embodied interaction strategies organized across three dimensions: interaction between learners, interaction between learners and the metaverse environment, and interaction between learners and the real environment (Table 1). These strategies collectively create an integrated learning experience that bridges the virtual and real worlds.

3.3. Multi-Agent Collaboration Module

To enhance learners’ immersive interaction experience in wearable metaverse environments, the Multi-Agent Collaboration Module relies on multi-agent human–computer collaboration algorithms, enabling different intelligent agents to possess expertise in various domains and adapt to collaborative needs across multiple modalities and scenarios. This study proposes a multi-agent collaboration framework based on CrewAI and spatio-temporal graph neural networks (ST-GNNs).

3.3.1. Functions of Multi-Agent Module

The multi-agent module promotes deep human–machine collaboration, resource optimization, and the co-evolution of intelligence. On one hand, multiple agents can be flexibly defined and dynamically adjusted based on their identity, efficiently functioning in diverse interaction modes such as equal collaboration, the structured hierarchical division of labor, and spontaneous discussion. This not only provides users with multidimensional interactive experiences but also allows users to participate in collaborative problem-solving with intelligent agents, exploring the possibilities of using different cooperation models and responsibility sharing. On the other hand, multiple agents will filter and recommend high-quality resources, facilitate on-demand application and adaptive optimization, generate diverse content, act as virtual companions, and provide heuristic dialogue and metacognitive support through emotion perception and cognitive state adjustment, promoting the realization of deep collaboration modes. Furthermore, this study explores a mutually reinforcing mechanism involving multiple agents and human collaboration to achieve the co-evolution of technological capabilities and human intelligence. Within this framework, humans engage in knowledge co-construction and task resolution through deep collaboration with multiple agents, providing guidance and correction for the optimization of agent behavior and helping the agents continuously deepen their professional knowledge in vertical domains. Simultaneously, intelligent agents support the development of individuals’ cognitive and practical abilities through complex data analysis and adaptive behavioral feedback.

3.3.2. Intelligent Interaction Mechanisms

Intelligent interaction mechanisms govern how learners and virtual agents engage within the environment. The interaction modes between learners and virtual agents include proactive modes where agents anticipate needs, passive modes where agents respond upon request, hybrid modes combining both approaches, and group collaboration modes involving multiple agents for complex tasks. Additionally, real-time context-based adaptation enhances interactions through the analysis of learners’ behavior, awareness of environmental changes, optimization of interaction strategies based on feedback, and personalized tuning of agents’ parameters to ensure tailored and effective support.
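As a rough illustration of how these modes might be organized in software, the following sketch encodes the four interaction modes and a simple context-based selector; the thresholds and state fields are assumptions for illustration only.

```python
# Illustrative dispatch of the four interaction modes; thresholds and state keys are assumptions.
from enum import Enum, auto

class InteractionMode(Enum):
    PROACTIVE = auto()   # agent anticipates needs
    PASSIVE = auto()     # agent responds upon request
    HYBRID = auto()      # combines proactive and passive behavior
    GROUP = auto()       # multiple agents collaborate on a complex task

def select_mode(learner_state: dict) -> InteractionMode:
    """Pick an interaction mode from simple context signals."""
    if learner_state.get("task_complexity", 0.0) > 0.7:
        return InteractionMode.GROUP
    if learner_state.get("idle_seconds", 0) > 30:
        return InteractionMode.PROACTIVE
    return InteractionMode.PASSIVE if learner_state.get("has_request") else InteractionMode.HYBRID

mode = select_mode({"task_complexity": 0.4, "idle_seconds": 45})  # -> InteractionMode.PROACTIVE
```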

3.3.3. Collaboration Using CrewAI and ST-GNNs

The proposed model utilizes the CrewAI framework to structure the collaboration among agents. CrewAI allows us to define distinct roles, responsibilities, and goals for each agent, thereby forming a cohesive and mission-focused team dedicated to a specific learning task. This framework supports sophisticated workflows where agents can operate in parallel, delegate tasks, and communicate sequentially, mirroring real-world team dynamics, as sketched below.
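The sketch below assumes CrewAI's Agent/Task/Crew interface and a separately configured LLM backend; the role names, goals, and task descriptions are illustrative and not taken from the paper.

```python
# A minimal sketch assuming CrewAI's Agent/Task/Crew interface and a configured LLM backend.
# Role names, goals, and task descriptions are illustrative, not taken from the paper.
from crewai import Agent, Task, Crew, Process

tutor = Agent(
    role="Virtual Tutor",
    goal="Diagnose misconceptions and give step-by-step guidance",
    backstory="A subject tutor embedded in the wearable metaverse classroom.",
    allow_delegation=True,
)
companion = Agent(
    role="Learning Companion",
    goal="Keep the learner engaged and turn diagnoses into practice",
    backstory="A peer-like agent that mirrors the learner's pace.",
)

diagnose = Task(
    description="Analyze the learner's latest interaction log and list likely misconceptions.",
    expected_output="A short list of misconceptions with confidence notes.",
    agent=tutor,
)
scaffold = Task(
    description="Turn the diagnosed misconceptions into three scaffolded practice prompts.",
    expected_output="Three prompts ordered from easiest to hardest.",
    agent=companion,
)

crew = Crew(agents=[tutor, companion], tasks=[diagnose, scaffold], process=Process.sequential)
result = crew.kickoff()  # runs the tasks in order, passing each output downstream as context
```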
To achieve seamless coordination between agents’ actions and learners’ movements, ST-GNNs are employed. This technology is essential for processing and interpreting the complex, dynamic relationships between multiple entities (learners and agents) in both space and time.

3.4. Multi-Source Data Fusion Module

A lightweight and scalable solution is critical for efficient real-time multi-modal data processing in resource-constrained wearable metaverse environments. One promising approach is to integrate MobileNetV4-based lightweight feature extraction with xLSTM-based multi-source fusion.
The feature extraction module leverages MobileNetV4’s depthwise separable convolutions to balance computational efficiency and high performance (Qin et al., 2025). Each modality-specific branch can be independently trained, and the extracted features are integrated into a shared embedding space, enabling seamless integration. Thanks to the efficient architecture of MobileNetV4, this module can support low-latency, real-time inference on resource-constrained devices such as wearables, providing a solid foundation for immersive interactions.
Building on the extracted multi-modal features, an xLSTM network can be employed to fuse features from different modalities and model their temporal dependencies (Beck et al., 2024). Specifically, feature vectors from different branches are fed in parallel into the corresponding input gates of xLSTM. By introducing modality interaction units, xLSTM can explicitly learn the temporal correlation patterns across different modalities, capturing long-range cross-modal dependencies. Additionally, the gating mechanisms of xLSTM allow it to adaptively decide which modality information should be updated or retained at each time step, enhancing the flexibility and robustness of cross-modal information fusion.

3.5. Low-Computation-Cost Strategy Module

The low-computation-cost strategy module plays a critical role in this project, aiming to reduce computational complexity and deliver a smooth, high-quality visual embodied interaction experience.
Inspired by the spatial resolution distribution of the human visual system, we propose that rendering in wearable metaverse learning environments should adopt an adaptive resolution rendering method based on gaze tracking. Specifically, the density of cone cells in the retina peaks in the foveal region (the area surrounding the gaze point), providing the highest visual resolution, while progressively decreasing toward the peripheral areas (Reiniger et al., 2021). Based on this characteristic, the rendering engine of wearable metaverse learning environments needs to dynamically track the user’s gaze position in real time and adjust the rendering resolution accordingly, with regions closer to the gaze point rendered in higher detail and peripheral regions rendered at lower resolutions to optimize computational efficiency without compromising visual quality. Therefore, a low-computation-cost rendering strategy for wearable metaverse learning environments should follow these principles:
  • High fidelity in gaze-sensitive areas: Regions closer to the gaze point are rendered at higher resolutions to ensure high-fidelity viewing in the user’s focus area;
  • Optimized peripheral rendering: Regions further away from the viewpoint are rendered at lower resolutions, reducing the computational demands while maintaining acceptable visual quality.
This adaptive-resolution rendering technique balances computational cost and visual quality, maintaining a high-quality visual presentation even under resource-limited conditions, as sketched below.
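A minimal sketch of the gaze-based resolution mapping follows; the normalized coordinates, foveal radius, and minimum scale are illustrative parameters, not values prescribed by the framework.

```python
import numpy as np

def resolution_scale(tile_centers, gaze_xy, foveal_radius=0.1, min_scale=0.25):
    """Map screen tiles (normalized [0, 1] coords) to a render-resolution scale in [min_scale, 1]."""
    d = np.linalg.norm(np.asarray(tile_centers) - np.asarray(gaze_xy), axis=-1)
    return np.clip(1.0 - (d - foveal_radius), min_scale, 1.0)   # full detail inside the foveal radius

tiles = np.array([[0.50, 0.50], [0.90, 0.10]])
print(resolution_scale(tiles, gaze_xy=(0.52, 0.48)))  # near-gaze tile -> 1.0, peripheral tile reduced
```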
To enable efficient environment perception and modeling, this framework adopts lightweight SLAM algorithms, such as GS-SLAM + LoopSplat, to construct 3D environment maps and estimate device pose changes in real time on low-power devices. To enhance the learner’s experience, the framework includes a visual perception optimization module that dynamically adjusts the rendering parameters based on real-time quality assessment using algorithms such as CrossScore and GR-PSN. Perceptual mapping techniques like VDP (Visual Difference Prediction) are employed to identify areas of higher visual importance, ensuring that system resources are allocated to maximize the perceptual quality.
By employing this dynamic resource allocation strategy, the system achieves smooth interaction performance and improved overall resource utilization.

4. Technical Approaches for Implementing Wearable Metaverse Environments

The construction of a future immersive ubiquitous learning environment aims to promote educational equity, improve learning outcomes, and optimize resource allocation (X. S. Zhai et al., 2023). However, since the fully immersive ubiquitous learning era has not yet arrived, the concept of the educational metaverse is still in its exploratory phase, and large-scale experimental research is not yet feasible. Therefore, this study addresses specific educational challenges that may arise in future immersive ubiquitous learning environments and proposes three key technical pathways to address them. These pathways aim to overcome the limitations of traditional educational methods through embodied interaction and multi-agent collaboration, providing innovative solutions to challenges such as improving data collection precision and enhancing immersive learning experiences.

4.1. Enhancing Precision Through Multi-Source Data Fusion

In future immersive ubiquitous learning environments, traditional data collection methods, such as surveys and interviews, may prove insufficient for capturing the dynamic and embodied interactions between learners, intelligent agents, and the virtual–physical environment. The real-time analysis of multi-modal learner data is essential for supporting personalized and adaptive learning (Di Mitri et al., 2022; X. Zhai et al., 2023). However, the heterogeneous nature of data sources in wearable metaverse learning environments poses significant challenges in terms of data integration, synchronization, and interpretation.
This study introduces a lightweight neural network architecture combining MobileNetV4 and xLSTM for multi-source heterogeneous data fusion and analysis, as shown in Figure 2. This approach enables the efficient extraction and integration of features from various data modalities, such as text, images, speech, and sensor data, while maintaining the real-time performance on resource-constrained wearable devices.

4.1.1. Feature Extraction with MobileNetV4

The lightweight feature extraction module for use in wearable metaverse environments leverages MobileNetV4, which effectively balances computational efficiency and model performance using its universally efficient architecture designs for mobile devices. MobileNetV4 introduces the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges the Inverted Bottleneck (IB), ConvNext, the Feed-Forward Network (FFN), and a novel Extra-Depthwise (ExtraDW) variant, alongside the Mobile Multi-Query Attention (Mobile MQA) block (Qin et al., 2025).
  • Text Data: Discrete text data is transformed into continuous low-dimensional semantic vectors using an embedding layer. These vectors are then processed through modified UIB blocks, specifically adapted for text data, to extract high-level semantic features.
  • Image and Video Data: MobileNetV4’s depthwise separable convolution, as a key element of the UIB block, is leveraged to efficiently extract spatial features from image data. For video data, these spatial features are temporally aggregated using temporal modeling layers, such as the Mobile MQA attention block, enabling the capture of dynamic temporal dependencies.
  • Speech Data: High-level acoustic features are initially extracted using a pre-trained acoustic model (e.g., Wav2Vec or HuBERT). These features are subsequently compressed using MobileNetV4’s UIB blocks, which are fine-tuned for speech data, to reduce the dimensionality without losing essential information. The Mobile MQA attention block is then applied to capture long-range dependencies within the speech sequences.
Each modality-specific branch is independently trained to maximize its individual performance. The extracted features are then projected into a shared embedding space for cross-modality integration. By leveraging the efficient architecture of MobileNetV4, this module ensures real-time feature extraction and compatibility with resource-constrained devices, such as wearable hardware.
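The following PyTorch sketch illustrates one modality-specific branch built from depthwise separable convolutions, standing in for a MobileNetV4 UIB stack; it is a simplified illustration under assumed layer sizes, not the published MobileNetV4 implementation.

```python
# A simplified PyTorch stand-in for one MobileNetV4-style image branch built from depthwise
# separable convolutions; layer sizes and the embedding dimension are illustrative assumptions.
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ImageBranch(nn.Module):
    """Extracts a fixed-size embedding from an RGB frame for the shared embedding space."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.stem = nn.Conv2d(3, 16, 3, stride=2, padding=1)
        self.blocks = nn.Sequential(
            DepthwiseSeparableBlock(16, 32, stride=2),
            DepthwiseSeparableBlock(32, 64, stride=2),
            DepthwiseSeparableBlock(64, 128, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(128, embed_dim)       # projection into the shared embedding space

    def forward(self, frames):                      # frames: (B, 3, H, W)
        x = self.blocks(self.stem(frames))
        return self.proj(self.pool(x).flatten(1))   # (B, embed_dim)

emb = ImageBranch()(torch.rand(2, 3, 96, 96))       # -> torch.Size([2, 128])
```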

4.1.2. Dynamic Cross-Modality Fusion with xLSTM

After extracting feature maps from multi-modal data using MobileNetV4, fusion employing xLSTM involves the following steps:
  • Input Transformation: The extracted feature maps are pre-processed (e.g., normalization, dimensionality alignment) to ensure compatibility across the modalities.
  • Temporal Alignment Using Modality Interaction Units (MIUs): MIUs in xLSTM explicitly model the temporal relationships between the modalities.
  • Dynamic Modality Weighting: At each time step, xLSTM calculates the relative importance of each modality using learned weighting parameters.
  • Output Fusion for Downstream Tasks: The fused multi-modal representation is passed to task-specific layers (e.g., classification, regression, or decision-making modules).
This architecture supports a highly adaptive and scalable data fusion process by combining MobileNetV4’s efficient feature extraction with xLSTM’s advanced temporal modeling capabilities.
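The fusion stage can be sketched as below, where a standard LSTM plus a learned per-step modality gate stands in for xLSTM's gating and modality interaction units; the dimensions and task head are illustrative assumptions.

```python
# Hedged stand-in for the xLSTM fusion stage: a standard LSTM plus a learned per-step modality
# gate approximates the gating/modality-interaction behavior described above; dimensions are assumed.
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    def __init__(self, embed_dim=128, n_modalities=3, hidden=256, n_outputs=4):
        super().__init__()
        self.gate = nn.Linear(n_modalities * embed_dim, n_modalities)  # dynamic modality weighting
        self.temporal = nn.LSTM(embed_dim, hidden, batch_first=True)   # temporal modeling (xLSTM stand-in)
        self.head = nn.Linear(hidden, n_outputs)                       # task-specific output layer

    def forward(self, feats):                        # feats: (B, T, M, D), aligned per time step
        B, T, M, D = feats.shape
        w = torch.softmax(self.gate(feats.reshape(B, T, M * D)), dim=-1)  # (B, T, M) modality weights
        fused = (w.unsqueeze(-1) * feats).sum(dim=2)                      # weighted sum over modalities
        out, _ = self.temporal(fused)                                     # (B, T, hidden)
        return self.head(out[:, -1])                                      # prediction from the last step

logits = GatedModalityFusion()(torch.rand(2, 10, 3, 128))                 # -> torch.Size([2, 4])
```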

4.2. Agents’ Collaboration Based on Multi-Agent Framework and Graph Neural Networks

Effective collaboration between learners, virtual agents, and the environment is critical for achieving an interactive and adaptive learning experience in the wearable metaverse. However, ensuring seamless coordination between multiple intelligent agents and the dynamic virtual–physical environment is a significant challenge, particularly in terms of agent communication, task allocation, and context awareness.
This study proposes a multi-agent collaborative model based on CrewAI and spatio-temporal graph neural networks (ST-GNNs), as shown in Figure 3, integrating learners, intelligent agents, virtual environments, and real-world environments. CrewAI offers a decentralized network of communicating agents, providing the flexibility and scalability such systems require. ST-GNNs capture fine-grained spatio-temporal relationships between agents and their surroundings, supporting context-aware decision-making and adaptation to the environment. Combined in a hybrid learning setup, these technologies make the proposed model particularly useful for collective learning in highly virtual worlds, such as collaborative problem-solving, role-playing, and interactive simulations.

4.2.1. Spatio-Temporal Collaboration Modeling with ST-GNNs

To capture and model the complex interactions among learners, intelligent agents, and their environments, the framework employs spatio-temporal graph neural networks (ST-GNNs) (Sahili & Awad, 2023). This approach first constructs a spatio-temporal heterogeneous behavior graph, where learners, agents, and environments are represented as nodes with both static and dynamic features and their timestamped interactions form the edges. Within this graph, the framework models the spatial dimension using heterogeneous graph attention networks to aggregate information from neighboring nodes, while simultaneously addressing the temporal dimension using temporal convolutional networks to capture the evolution of features over time. Through this integrated analysis, the ST-GNNs generate low-dimensional collaborative embeddings for each agent. These embeddings encode their roles, states, and mutual dependencies, providing a rich feature space for downstream decision-making. By leveraging ST-GNNs, this framework identifies and models complex spatio-temporal dependencies, enabling adaptive and cooperative interactions that enhance learning outcomes.
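One way to assemble such a spatio-temporal heterogeneous behavior graph is sketched below, assuming PyTorch Geometric's HeteroData container; the node types, feature sizes, and edge relations are illustrative choices rather than the framework's exact schema.

```python
# Sketch of one snapshot of the spatio-temporal heterogeneous behavior graph, assuming
# PyTorch Geometric's HeteroData container; node types, feature sizes, and edge relations
# are illustrative choices, not the framework's exact schema.
import torch
from torch_geometric.data import HeteroData

def build_snapshot(learner_feats, agent_feats, env_feats, learner_agent_edges):
    snap = HeteroData()
    snap['learner'].x = learner_feats          # (n_learners, d), e.g., gaze/EEG embeddings
    snap['agent'].x = agent_feats              # (n_agents, d), e.g., role and state embeddings
    snap['env'].x = env_feats                  # (n_envs, d), e.g., scene descriptors
    snap['learner', 'interacts', 'agent'].edge_index = learner_agent_edges  # (2, n_edges)
    return snap

# One snapshot per time step; the spatial stage (graph attention) reads each snapshot and the
# temporal stage (temporal convolution) consumes the ordered sequence.
snapshots = [build_snapshot(torch.randn(4, 16), torch.randn(3, 16), torch.randn(1, 16),
                            torch.tensor([[0, 1, 2, 3], [0, 0, 1, 2]]))
             for _ in range(8)]
```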

4.2.2. Distributed and Hybrid Decision-Making with CrewAI

This framework adopts the CrewAI (Barbarroxa et al., 2025) paradigm to organize distributed decision-making and coordination among agents at both the macro and micro levels:
  • Macro-Level Coordination: A central platform agent serves as a global coordinator, aggregating information from all the agents and generating high-level decisions using graph neural networks. The platform agent evaluates the states of learners, virtual environments, and real-world contexts to identify optimal task–agent matches. For example, it might assign a specific virtual tutor to a struggling student or coordinate collaborative tasks among urban and rural students.
  • Micro-Level Distributed Decisions: Individual agents (e.g., virtual tutors, learning companions, or environment agents) independently generate localized decisions based on their private states. Using deep reinforcement learning, the agents express personalized preferences for scheduling or task execution, which are communicated back to the platform agent through CrewAI’s interaction mechanisms. This two-way communication ensures that global decisions are informed by local needs while maintaining the overall system coherence.
This hybrid decision-making approach balances centralized coordination with decentralized adaptability, ensuring scalability and responsiveness in complex learning scenarios.
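The macro/micro loop can be illustrated with the simple coordination sketch below, in which agents submit local preference scores and a platform-level coordinator resolves task assignments; the data structures and scoring rule are assumptions for illustration, not a CrewAI API.

```python
# Toy coordination sketch; data structures and the scoring rule are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Proposal:
    agent_id: str
    task_id: str
    preference: float   # micro-level score from the agent's local policy (e.g., an RL value estimate)

def coordinate(proposals: List[Proposal], learner_needs: Dict[str, float]) -> Dict[str, str]:
    """Macro level: pick one agent per task, weighting agent preference by learner need."""
    best = {}
    for p in proposals:
        score = p.preference * learner_needs.get(p.task_id, 1.0)
        if p.task_id not in best or score > best[p.task_id][1]:
            best[p.task_id] = (p.agent_id, score)
    return {task: agent for task, (agent, _) in best.items()}

assignment = coordinate(
    [Proposal("virtual_tutor", "remediate_fractions", 0.8),
     Proposal("learning_companion", "remediate_fractions", 0.5)],
    {"remediate_fractions": 1.2},
)  # -> {"remediate_fractions": "virtual_tutor"}
```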

4.3. Optimization of Visual Experiences Based on Low Computation Cost

As metaverse technologies have continued to evolve, the demand for immersive user experiences has grown significantly. From early Three-Degrees-of-Freedom (3DOF) head tracking to the current Six-Degrees-of-Freedom (6DOF) head and hand tracking, VR/AR devices are increasingly simulating interactions that closely resemble the real world (Manawadu & Park, 2024). However, achieving high-resolution, wide-field-of-view, and low-latency rendering in wearable devices poses significant computational challenges, particularly in resource-constrained environments.
This study argues that wearable metaverse learning environments should be based on low-computation-cost rendering technologies. By leveraging the characteristics of the human visual system, the framework dynamically adjusts the rendering strategies based on the learner’s gaze position, ensuring high-fidelity rendering in visually sensitive regions while reducing the computational load for peripheral areas. This approach not only improves the visual fidelity of wearable metaverse learning environments but also enhances user comfort and reduces power consumption, making it more feasible for prolonged use in educational settings. The framework integrates four key technical components, as shown in Figure 4.

4.3.1. Low-Computation-Cost Environment Perception and Modeling

To enable efficient environment perception and modeling, this framework adopts optimized, lightweight SLAM (Simultaneous Localization and Mapping) algorithms, such as GS-SLAM + LoopSplat (Zhu et al., 2024), to construct 3D environment maps and estimate device pose changes in real time on low-power devices. By utilizing sparse feature extraction and efficient graph optimization techniques, the framework significantly reduces the computational costs while maintaining effective performance.
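As a low-compute illustration of the sparse-feature idea (not of GS-SLAM or LoopSplat themselves), the sketch below estimates the relative camera pose between two frames from ORB features using standard OpenCV calls; the intrinsic matrix K is assumed to be known from calibration.

```python
# Generic sparse-feature pose estimation with standard OpenCV calls, illustrating the
# low-compute principle only; this is not GS-SLAM or LoopSplat. K is the camera intrinsic
# matrix, assumed known from calibration.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=500)                         # sparse features keep compute low
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def estimate_relative_pose(prev_gray, curr_gray, K):
    """Estimate rotation R and unit-scale translation t between two grayscale frames."""
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```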

4.3.2. Gaze Prediction Using Visual Attention Models

To improve the rendering efficiency and user experience, the framework incorporates real-time gaze tracking and visual attention modeling:
  • Visual Attention Models: Inspired by the human visual system, lightweight convolutional neural networks (e.g., boundary attention models) are used to predict potential regions of interest in images or videos (Polansky et al., 2024). These predictions guide rendering optimizations by focusing computational resources on areas the user is likely to attend to.
  • Real-Time Gaze Tracking: The system utilizes low-computation-cost gaze-tracking algorithms to identify the learner’s gaze position in real time, ensuring that the rendering priorities align with the user’s visual attention.
This gaze prediction mechanism provides precise data to dynamically optimize the rendering strategies while reducing unnecessary computational overheads.
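A toy version of the attention-prediction step is sketched below: a three-layer convolutional network outputs a coarse region-of-interest heatmap that could guide rendering priorities. It is a stand-in for the cited boundary attention models, with illustrative layer sizes.

```python
# A toy stand-in for the lightweight visual attention model (not the cited boundary attention
# network): a three-layer CNN predicting a coarse region-of-interest heatmap.
import torch
import torch.nn as nn

class TinySaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 1),                      # single-channel attention logits
        )

    def forward(self, frame):                         # frame: (B, 3, H, W)
        return torch.sigmoid(self.net(frame))         # coarse heatmap at (H/4, W/4)

heatmap = TinySaliencyNet()(torch.rand(1, 3, 96, 96))  # regions above a threshold get rendering priority
```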

4.3.3. Gaze Prediction and Dynamic Rendering Cache Mechanism

A core module of the framework integrates gaze prediction with a dynamic rendering cache mechanism to optimize the rendering efficiency:
  • Dynamic Resolution Adjustment: Based on gaze prediction, the rendering engine dynamically adjusts the resolution of different regions. Higher resolutions are prioritized for gaze-sensitive areas, while peripheral regions are rendered at lower resolutions. Techniques such as the Level of Detail (LOD) method and frustum culling (Su et al., 2017) are used to allocate resources effectively.
  • Rendering Cache Mechanism: Leveraging temporal coherence, previously rendered frames are stored and reused to avoid redundant computations. Frame difference encoding and result compression techniques are further applied to reduce the computational cost for static or minimally changing regions.
  • Predictive Gaze Modeling: Recurrent neural networks (e.g., RNNs) predict potential gaze shifts, allowing the system to pre-render areas of future interest and minimize the latency.
This module ensures the efficient utilization of computational resources while maintaining high-quality visual experiences in key areas of user attention.
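The temporal-coherence idea behind the rendering cache can be sketched as a per-tile change test: a tile is re-rendered only when it has changed more than a threshold since the cached frame. The tiling and threshold below are illustrative assumptions.

```python
import numpy as np

def reuse_mask(prev_frame, curr_frame, tile=32, threshold=2.0):
    """Return a per-tile boolean grid: True = reuse the cached tile, False = re-render it."""
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32)).mean(axis=-1)
    h, w = diff.shape
    H, W = h - h % tile, w - w % tile                       # crop to a whole number of tiles
    grid = diff[:H, :W].reshape(H // tile, tile, W // tile, tile)
    return grid.mean(axis=(1, 3)) < threshold               # mean per-pixel change within each tile

prev = np.zeros((96, 128, 3), dtype=np.uint8)
curr = prev.copy()
curr[:32, :32] = 255                                        # only the top-left tile changed
print(reuse_mask(prev, curr))                               # that tile is False (re-render), others True
```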

4.3.4. Visual Perception Optimization

To enhance the learner experience, this framework incorporates a visual perception optimization module that dynamically adjusts the rendering parameters based on real-time quality assessment:
  • Image Quality Evaluation: Algorithms such as CrossScore (Z. Wang et al., 2025) and GR-PSN (Ju et al., 2024) are used to assess the visual quality of the rendered frames in real time. These evaluations guide adjustments to the rendering parameters, such as the resolution and texture detail, to balance visual fidelity and computational efficiency.
  • Perceptual Mapping Techniques: Techniques like VDP (Visual Difference Prediction) (Mantiuk et al., 2023) are employed to identify areas of higher visual importance, ensuring that the system resources are allocated in a way that maximizes the perceptual quality.
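As a simple illustration of this quality-driven adjustment, the sketch below nudges a resolution-scale parameter from a per-frame quality score; the assess_quality callback is a placeholder for CrossScore-, GR-PSN-, or VDP-style metrics, which are not implemented here.

```python
# assess_quality is a hypothetical callback standing in for CrossScore/GR-PSN/VDP-style metrics.
def adjust_parameters(params, assess_quality, frame, target=0.85, step=0.05):
    """Raise the resolution scale when quality drops below target, lower it to free compute otherwise."""
    score = assess_quality(frame)                       # quality score assumed to lie in [0, 1]
    if score < target:
        params["resolution_scale"] = min(1.0, params["resolution_scale"] + step)
    else:
        params["resolution_scale"] = max(0.25, params["resolution_scale"] - step)
    return params

params = adjust_parameters({"resolution_scale": 0.5}, assess_quality=lambda f: 0.80, frame=None)
# -> {"resolution_scale": 0.55}
```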

5. Discussion

This study proposes a lightweight data processing solution tailored to the needs of ubiquitous wearable metaverse environments, effectively facilitating human–computer interaction and supporting ubiquitous learning. The integration of the MobileNetV4 and xLSTM algorithms helps improve the computational efficiency on resource-constrained wearable devices, thereby enhancing the model’s performance. On the one hand, MobileNetV4’s efficient architecture, including the Universal Inverted Bottleneck (UIB) block and Mobile MQA mechanism, provides high accuracy and computational efficiency for real-time feature extraction (Qin et al., 2025). Such model developments respond to the growing demand for lightweight yet powerful models in learning systems (G. Zhao et al., 2021). The xLSTM model, on the other hand, captures long-term dependencies and models the temporal dynamics of different modalities, making it particularly suitable for processing the varied time series data generated in metaverse learning environments (Alharthi & Mahmood, 2024). By reducing the need for computing resources, this framework enhances the feasibility of allowing learners to receive education on resource-limited wearable devices anytime and anywhere, promoting equitable access to metaverse-based education across different hardware platforms.
Secondly, this study proposes a human–machine collaboration framework that integrates multiple agents. The multi-agent system-based framework relies on coordination methods to link students, agents, the virtual environment, and the real world into a vibrant, collaborative learning system. Meanwhile, it applies spatio-temporal heterogeneous behavior graphs, allowing the varying behavioral parameters of learners, agents, and their environments to be observed and studied, which is essential for analyzing and optimizing interactive behaviors. Within this blended metaverse learning space, multi-agent collaboration not only adds novelty and depth to traditional cooperative learning activities but also represents a step toward a human–computer education model (Lin, 2015). This framework not only focuses on traditional collaborative learning between students but also explores the interaction between learners and agents and among agents, making it more suitable for ubiquitous learning in the metaverse environment. Multi-agent systems, supported by large language models, can bolster equity in learning, make education more personalized and context-aware, and bring individualized, resource-rich education closer to reality (Cheng et al., 2024). That is, the joint efforts of agents will enable personalized, context-aware, and emotionally responsive learning processes, in which recommendations are updated dynamically and task difficulty is adjusted based on the learner’s trajectory.
Finally, this study proposes the adoption of a low-computation-cost rendering strategy in wearable metaverse learning environments to achieve high-quality visual rendering under resource-constrained conditions. This strategy aims to solve the performance bottlenecks faced by mobile and wearable devices when processing complex metaverse scenes, ensuring that learners can experience smooth and realistic interactions in a dynamic, immersive environment. Specifically, the research focused on rendering methods based on lightweight SLAM algorithms and boundary attention frameworks. This method improves computing efficiency while optimizing resource allocation and preserves the smoothness and realism of educational scene rendering as far as possible. It is consistent with cutting-edge research in the field of mobile and wearable devices, which focuses on enhancing performance and image quality using deep learning algorithms (Suo et al., 2023). In educational applications, this strategy ensures the feasibility of running complex metaverse learning applications on resource-constrained wearable devices. In addition, this solution helps reduce the reliance of metaverse resources on high-performance hardware devices, thereby promoting the large-scale adoption of wearable ubiquitous learning.

6. Conclusions

This study proposes a framework and technical solution for developing wearable metaverse learning environments, aiming to achieve immersive and ubiquitous learning experiences through the innovative combination of lightweight data processing, multi-agent collaboration, and low-computation-cost technologies. While significant progress has been made in theoretical exploration and technical design, there remain certain limitations that need to be addressed in future research. On the one hand, as metaverse technology is still in its infancy, this study primarily focused on providing a conceptual framework and technical solution. However, large-scale empirical studies have yet to be conducted to validate the effectiveness and feasibility of the proposed framework and solutions in real-world scenarios. The framework also needs to be further integrated with educational practice by designing and testing specific learning scenarios for different K-16 subjects. On the other hand, aspects such as the computational efficiency and scalability of the lightweight data processing framework, the interaction modeling capabilities of the multi-agent collaboration framework, and the environmental adaptability of the low-computation-cost rendering strategy require further evaluation and optimization through practical system implementation and user studies.
As metaverse technology continues to evolve and mature, we anticipate the emergence of more prototype systems and application scenarios. It is important to explore the applicability and acceptability of metaverse technologies across different teaching applications, instructional models, and educational strategies to ensure their global inclusivity, universality, and sustainability (Y. Liu & Fu, 2024). Through interdisciplinary collaboration and iterative improvements, the wearable metaverse learning environment can be continuously optimized and refined, ultimately enabling ubiquitous and intelligent learning transformations.

Author Contributions

Conceptualization, J.X. (Jiaqi Xu), X.Z. and A.I.; Methodology, J.X. (Jiaqi Xu), X.Z. and N.-S.C.; Investigation, J.X. (Jiaqi Xu) and U.G.; Resources, X.Z., N.-S.C. and A.I.; Writing—original draft, J.X. (Jiaqi Xu); Writing—review & editing, J.X. (Jiaqi Xu), X.Z., N.-S.C., U.G., A.I. and J.X. (Junyi Xin); Visualization, J.X. (Jiaqi Xu); Supervision, X.Z. and J.X. (Junyi Xin); Project administration, X.Z.; Funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Zhejiang Provincial Natural Science Foundation of China, grant number: Y24F020009; Zhejiang Provincial Education Science Planning Project, grant number: 2024SCG247; China Association for Science and Technology (CAST) 2024 Graduate Student Science Popularization Competence Enhancement Program, grant number: KXYJS2024008; Major Project of Humanities and Social Sciences in Higher Education Institutions of Zhejiang Province, grant number: 2023QN075.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Abrahamson, D., Nathan, M. J., Williams-Pierce, C., Walkington, C., Ottmar, E. R., Soto, H., & Alibali, M. W. (2020). The future of embodied design for mathematics teaching and learning. Frontiers in Education, 5, 147. [Google Scholar] [CrossRef]
  2. Alharthi, M., & Mahmood, A. (2024). xLSTMTime: Long-term time series forecasting with xLSTM. AI, 5(3), 1482–1495. [Google Scholar] [CrossRef]
  3. Amirkhani, A., & Barshooi, A. H. (2022). Consensus in multi-agent systems: A review. Artificial Intelligence Review, 55(5), 3897–3935. [Google Scholar] [CrossRef]
  4. Arslan, M., Munawar, S., & Cruz, C. (2024). Sustainable digitalization of business with multi-agent RAG and LLM. Procedia Computer Science, 246, 4722–4731. [Google Scholar] [CrossRef]
  5. Ba, S., & Hu, X. (2023). Measuring emotions in education using wearable devices: A systematic review. Computers & Education, 200, 104797. [Google Scholar] [CrossRef]
  6. Barbarroxa, R., Gomes, L., & Vale, Z. (2025). Benchmarking large language models for multi-agent systems: A comparative analysis of AutoGen, CrewAI, and TaskWeaver. In P. Mathieu, & F. De La Prieta (Eds.), Advances in practical applications of agents, multi-agent systems, and digital twins: The PAAMS collection (Vol. 15157, pp. 39–48). Springer Nature. [Google Scholar] [CrossRef]
  7. Beck, M., Pöppel, K., Spanring, M., Auer, A., Prudnikova, O., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2024). xLSTM: Extended long short-term memory. arXiv, arXiv:2405.04517. [Google Scholar] [CrossRef]
  8. Cárdenas-Robledo, L. A., & Peña-Ayala, A. (2018). Ubiquitous learning: A systematic review. Telematics and Informatics, 35(5), 1097–1132. [Google Scholar] [CrossRef]
  9. Chakma, A., Faridee, A. Z. M., Khan, M. A. A. H., & Roy, N. (2021). Activity recognition in wearables using adversarial multi-source domain adaptation. Smart Health, 19, 100174. [Google Scholar] [CrossRef]
  10. Chen, H. R., Lin, W. S., Hsu, T. Y., Lin, T. C., & Chen, N. S. (2023). Applying smart glasses in situated exploration for learning English in a national science museum. IEEE Transactions on Learning Technologies, 16(5), 820–830. [Google Scholar] [CrossRef]
  11. Cheng, Y., Zhang, C., Zhang, Z., Meng, X., Hong, S., Li, W., Wang, Z., Wang, Z., Yin, F., Zhao, J., & He, X. (2024). Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv, arXiv:2401.03428. [Google Scholar] [CrossRef]
  12. Closser, A. H., Erickson, J. A., Smith, H., Varatharaj, A., & Botelho, A. F. (2022). Blending learning analytics and embodied design to model students’ comprehension of measurement using their actions, speech, and gestures. International Journal of Child-Computer Interaction, 32, 100391. [Google Scholar] [CrossRef]
  13. Crowell, C., Mora-Guiard, J., & Pares, N. (2018). Impact of interaction paradigms on full-body interaction collocated experiences for promoting social initiation and collaboration. Human–Computer Interaction, 33(5–6), 422–454. [Google Scholar] [CrossRef]
  14. Di Mitri, D., Schneider, J., & Drachsler, H. (2022). Keep me in the loop: Real-time feedback with multimodal data. International Journal of Artificial Intelligence in Education, 32(4), 1093–1118. [Google Scholar] [CrossRef]
  15. Feng, L., Jiang, X., Sun, Y., Niyato, D., Zhou, Y., Gu, S., Yang, Z., Yang, Y., & Zhou, F. (2025). Resource allocation for metaverse experience optimization: A multi-objective multi-agent evolutionary reinforcement learning approach. IEEE Transactions on Mobile Computing, 24(4), 3473–3488. [Google Scholar] [CrossRef]
  16. Fleury, M., Lioi, G., Barillot, C., & Lécuyer, A. (2020). A survey on the use of haptic feedback for brain-computer interfaces and neurofeedback. Frontiers in Neuroscience, 14, 528. [Google Scholar] [CrossRef] [PubMed]
  17. Foglia, L., & Wilson, R. A. (2013). Embodied cognition. WIREs Cognitive Science, 4(3), 319–325. [Google Scholar] [CrossRef] [PubMed]
  18. Frisoli, A., & Leonardis, D. (2024). Wearable haptics for virtual reality and beyond. Nature Reviews Electrical Engineering, 1(10), 666–679. [Google Scholar] [CrossRef]
  19. Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., & Li, Y. (2024). Large language models empowered agent-based modeling and simulation: A survey and perspectives. Humanities and Social Sciences Communications, 11(1), 1259. [Google Scholar] [CrossRef]
  20. Gatto, L., Fulvio Gaglio, G., Augello, A., Caggianese, G., Gallo, L., & La Cascia, M. (2022, October 19–21). MET-iquette: Enabling virtual agents to have a social compliant behavior in the Metaverse. 2022 16th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) (pp. 394–401), Dijon, France. [Google Scholar] [CrossRef]
  21. González-Briones, A., De La Prieta, F., Mohamad, M. S., Omatu, S., & Corchado, J. M. (2018). Multi-agent systems applications in energy optimization problems: A state-of-the-art review. Energies, 11(8), 1928. [Google Scholar] [CrossRef]
  22. Hazarika, A., & Rahmati, M. (2023). Towards an evolved immersive experience: Exploring 5G-and beyond-enabled ultra-low-latency communications for augmented and virtual reality. Sensors, 23(7), 3682. [Google Scholar] [CrossRef] [PubMed]
  23. Heikenfeld, J., Jajack, A., Rogers, J., Gutruf, P., Tian, L., Pan, T., Li, R., Khine, M., Kim, J., Wang, J., & Kim, J. (2018). Wearable sensors: Modalities, challenges, and prospects. Lab on a Chip, 18(2), 217–248. [Google Scholar] [CrossRef] [PubMed]
  24. Janbi, N., Katib, I., & Mehmood, R. (2023). Distributed artificial intelligence: Taxonomy, review, framework, and reference architecture. Intelligent Systems with Applications, 18, 200231. [Google Scholar] [CrossRef]
  25. Johnson-Glenberg, M. C., & Megowan-Romanowicz, C. (2017). Embodied science and mixed reality: How gesture and motion capture affect physics education. Cognitive Research: Principles and Implications, 2(1), 24. [Google Scholar] [CrossRef] [PubMed]
  26. Johnson-Glenberg, M. C., Yu, C. S. P., Liu, F., Amador, C., Bao, Y., Yu, S., & LiKamWa, R. (2023). Embodied mixed reality with passive haptics in STEM education: Randomized control study with chemistry titration. Frontiers in Virtual Reality, 4, 1047833. [Google Scholar] [CrossRef]
  27. Ju, Y., Shi, B., Chen, Y., Zhou, H., Dong, J., & Lam, K. M. (2024). GR-PSN: Learning to estimate surface normal and reconstruct photometric stereo images. IEEE Transactions on Visualization and Computer Graphics, 30(9), 6192–6207. [Google Scholar] [CrossRef] [PubMed]
  28. Kang, J., Chen, J., Xu, M., Xiong, Z., Jiao, Y., Han, L., Niyato, D., Tong, Y., & Xie, S. (2024). UAV-assisted dynamic avatar task migration for vehicular metaverse services: A multi-agent deep reinforcement learning approach. IEEE/CAA Journal of Automatica Sinica, 11(2), 430–445. [Google Scholar] [CrossRef]
  29. Kosmas, P., & Zaphiris, P. (2023). Improving students’ learning performance through Technology-Enhanced Embodied Learning: A four-year investigation in classrooms. Education and Information Technologies, 28(9), 11051–11074. [Google Scholar] [CrossRef]
  30. Li, J., Wang, S., Zhang, M., Li, W., Lai, Y., Kang, X., Ma, W., & Liu, Y. (2024). Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv, arXiv:2405.02957. [Google Scholar] [CrossRef]
  31. Lin, L. (2015). Exploring collaborative learning: Theoretical and conceptual perspectives. In L. Lin (Ed.), Investigating Chinese HE EFL classrooms (pp. 11–28). Springer. [Google Scholar] [CrossRef]
  32. Lindgren, R., Tscholl, M., Wang, S., & Johnson, E. (2016). Enhancing learning and engagement through embodied interaction within a mixed reality simulation. Computers & Education, 95, 174–187. [Google Scholar] [CrossRef]
  33. Liu, J., Zheng, Y., Wang, K., Bian, Y., Gai, W., & Gao, D. (2020). A real-time interactive tai chi learning system based on VR and motion capture technology. Procedia Computer Science, 174, 712–719. [Google Scholar] [CrossRef]
  34. Liu, Y., & Fu, Z. (2024). Hybrid intelligence: Design for sustainable multiverse via integrative cognitive creation model through human–computer collaboration. Applied Sciences, 14(11), 4662. [Google Scholar] [CrossRef]
  35. López-Belmonte, J., Pozo-Sánchez, S., Moreno-Guerrero, A.-J., & Lampropoulos, G. (2023). Metaverse in education: A systematic review. Revista De Educación a Distancia (RED), 23(73), 2252656. [Google Scholar] [CrossRef]
  36. Manawadu, M., & Park, S. Y. (2024). 6DoF object pose and focal length estimation from single rgb images in uncontrolled environments. Sensors, 24(17), 5474. [Google Scholar] [CrossRef] [PubMed]
  37. Mantiuk, R. K., Hammou, D., & Hanji, P. (2023). HDR-VDP-3: A multi-metric for predicting image differences, quality and contrast distortions in high dynamic range and regular content. arXiv, arXiv:2304.13625. [Google Scholar] [CrossRef]
  38. Mills, K. A., & Brown, A. (2023). Smart glasses for 3D multimodal composition. Learning, Media and Technology, 50(2), 156–177. [Google Scholar] [CrossRef]
  39. Mira, H. H., Chaker, R., Maria, I., & Nady, H. (2024). Review of research on the outcomes of embodied and collaborative learning in STEM in higher education with immersive technologies. Journal of Computing in Higher Education, 1–38. [Google Scholar] [CrossRef]
  40. Nahavandi, D., Alizadehsani, R., Khosravi, A., & Acharya, U. R. (2022). Application of artificial intelligence in wearable devices: Opportunities and challenges. Computer Methods and Programs in Biomedicine, 213, 106541. [Google Scholar] [CrossRef] [PubMed]
  41. Nascimento, T. H., Fernandes, D., Vieira, G., Felix, J., Castro, M., & Soares, F. (2023, October 9–11). MazeVR: Immersion and interaction using google cardboard and continuous gesture recognition on smartwatches. 28th International ACM Conference on 3D Web Technology (pp. 1–5), San Sebastian, Spain. [Google Scholar] [CrossRef]
  42. Palermo, F., Casciano, L., Demagh, L., Teliti, A., Antonello, N., Gervasoni, G., Shalby, H. H. Y., Paracchini, M. B., Mentasti, S., Quan, H., Santambrogio, R., Gilbert, C., Roveri, M., Matteucci, M., Marcon, M., & Trojaniello, D. (2025). Advancements in context recognition for edge devices and smart eyewear: Sensors and applications. IEEE Access, 13, 57062–57100. [Google Scholar] [CrossRef]
  43. Pan, A. (2024). How wearables like apple vision pro and orion are transforming human interactions with interfaces—…. Medium. Available online: https://medium.com/@alexanderpanboy/how-wearables-like-apple-vision-pro-and-orion-are-transforming-human-interactions-with-interfaces-95f3c390a77d (accessed on 10 December 2024).
  44. Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. arXiv, arXiv:2304.03442. [Google Scholar] [CrossRef]
  45. Phakamach, P., Senarith, P., & Wachirawongpaisarn, S. (2022). The metaverse in education: The future of immersive teaching & learning. RICE Journal of Creative Entrepreneurship and Management, 3(2), 75–88. [Google Scholar] [CrossRef]
  46. Polansky, M. G., Herrmann, C., Hur, J., Sun, D., Verbin, D., & Zickler, T. (2024). Boundary attention: Learning curves, corners, junctions and grouping. arXiv, arXiv:2401.00935. [Google Scholar] [CrossRef]
  47. Qin, D., Leichner, C., Delakis, M., Fornoni, M., Luo, S., Yang, F., Wang, W., Banbury, C., Ye, C., Akin, B., Aggarwal, V., Zhu, T., Moro, D., & Howard, A. (2025). MobileNetV4: Universal models for the mobile ecosystem. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Computer vision—ECCV 2024 (Vol. 15098, pp. 78–96). Springer Nature. [Google Scholar] [CrossRef]
  48. Reiniger, J. L., Domdei, N., Holz, F. G., & Harmening, W. M. (2021). Human gaze is systematically offset from the center of cone topography. Current Biology, 31(18), 4188–4193.e3. [Google Scholar] [CrossRef] [PubMed]
  49. Sahili, Z. A., & Awad, M. (2023). Spatio-temporal graph neural networks: A survey. arXiv, arXiv:2301.10569. [Google Scholar] [CrossRef]
  50. Song, T., Tan, Y., Zhu, Z., Feng, Y., & Lee, Y. C. (2024). Multi-agents are social groups: Investigating social influence of multiple agents in human-agent interactions. arXiv, arXiv:2411.04578. [Google Scholar] [CrossRef]
  51. Su, M., Guo, R., Wang, H., Wang, S., & Niu, P. (2017, July 18–20). View frustum culling algorithm based on optimized scene management structure. 2017 IEEE International Conference on Information and Automation (ICIA) (pp. 838–842), Macau, China. [Google Scholar] [CrossRef]
  52. Sun, J. C. Y., Ye, S. L., Yu, S. J., & Chiu, T. K. F. (2023). Effects of wearable hybrid AR/VR learning material on high school students’ situational interest, engagement, and learning performance: The case of a physics laboratory learning environment. Journal of Science Education and Technology, 32(1), 1–12. [Google Scholar] [CrossRef]
  53. Sun, Z., Zhu, M., Shan, X., & Lee, C. (2022). Augmented tactile-perception and haptic-feedback rings as human-machine interfaces aiming for immersive interactions. Nature Communications, 13(1), 5224. [Google Scholar] [CrossRef] [PubMed]
  54. Suo, J., Zhang, W., Gong, J., Yuan, X., Brady, D. J., & Dai, Q. (2023). Computational imaging and artificial intelligence: The next revolution of mobile vision. Proceedings of the IEEE, 111(12), 1607–1639. [Google Scholar] [CrossRef]
  55. Vrins, A., Pruss, E., Prinsen, J., Ceccato, C., & Alimardani, M. (2022). Are you paying attention? The effect of embodied interaction with an adaptive robot tutor on user engagement and learning performance. In F. Cavallo, J.-J. Cabibihan, L. Fiorini, A. Sorrentino, H. He, X. Liu, Y. Matsumoto, & S. S. Ge (Eds.), Social robotics (Vol. 13818, pp. 135–145). Springer Nature. [Google Scholar] [CrossRef]
  56. Wang, X., Wang, Y., Yang, J., Jia, X., Li, L., Ding, W., & Wang, F. Y. (2024). The survey on multi-source data fusion in cyber-physical-social systems: Foundational infrastructure for industrial metaverses and industries 5.0. Information Fusion, 107, 102321. [Google Scholar] [CrossRef]
  57. Wang, Z., Bian, W., & Prisacariu, V. A. (2025). CrossScore: Towards multi-view image evaluation and scoring. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Computer vision—ECCV 2024 (Vol. 15067, pp. 492–510). Springer Nature. [Google Scholar] [CrossRef]
  58. Wu, X., Chen, X., Zhao, J., & Xie, Y. (2024). Influences of design and knowledge type of interactive virtual museums on learning outcomes: An eye-tracking evidence-based study. Education and Information Technologies, 29(6), 7223–7258. [Google Scholar] [CrossRef]
  59. Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., … Gui, T. (2023). The rise and potential of large language model based agents: A survey. arXiv, arXiv:2309.07864. [Google Scholar] [CrossRef]
  60. Xia, Y., Shin, S.-Y., & Lee, H.-A. (2024). Adaptive learning in AI agents for the metaverse: The ALMAA framework. Applied Sciences, 14(23), 11410. [Google Scholar] [CrossRef]
  61. Yu, D. (2023, September 18–20). AI-empowered metaverse learning simulation technology application. 2023 International Conference on Intelligent Metaverse Technologies & Applications (iMETA) (pp. 1–6), Tartu, Estonia. [Google Scholar] [CrossRef]
  62. Zhai, X., Xu, J., Chen, N. S., Shen, J., Li, Y., Wang, Y., Chu, X., & Zhu, Y. (2023). The syncretic effect of dual-source data on affective computing in online learning contexts: A perspective from convolutional neural network with attention mechanism. Journal of Educational Computing Research, 61(2), 466–493. [Google Scholar] [CrossRef]
  63. Zhai, X. S., Chu, X. Y., Chen, M., Shen, J., & Lou, F. L. (2023). Can edu-metaverse reshape virtual teaching community (VTC) to promote educational equity? An exploratory study. IEEE Transactions on Learning Technologies, 16(6), 1130–1140. [Google Scholar] [CrossRef]
  64. Zhao, G., Liu, S., Zhu, W. J., & Qi, Y. H. (2021). A lightweight mobile outdoor augmented reality method using deep learning and knowledge modeling for scene perception to improve learning experience. International Journal of Human–Computer Interaction, 37(9), 884–901. [Google Scholar] [CrossRef]
  65. Zhao, Z., Zhao, B., Ji, Z., & Liang, Z. (2022). On the personalized learning space in educational metaverse based on heart rate signal. International Journal of Information and Communication Technology Education (IJICTE), 18(2), 1–12. [Google Scholar] [CrossRef]
  66. Zhou, X., Yang, Q., Zheng, X., Liang, W., Wang, K. I. K., Ma, J., Pan, Y., & Jin, Q. (2024). Personalized federated learning with model-contrastive learning for multi-modal user modeling in human-centric metaverse. IEEE Journal on Selected Areas in Communications, 42(4), 817–831. [Google Scholar] [CrossRef]
  67. Zhu, L., Li, Y., Sandström, E., Huang, S., Schindler, K., & Armeni, I. (2024). LoopSplat: Loop closure by registering 3D gaussian splats. arXiv, arXiv:2408.10154. [Google Scholar] [CrossRef]
Figure 1. The wearable metaverse learning environment conceptual framework.
Figure 2. Multi-source data fusion framework.
Figure 3. Multi-agent collaborative model.
Figure 4. Low-computation-cost rendering framework.
Table 1. Embodied interaction strategies in wearable metaverse environments.

Interaction Dimension | Contents | Description
Learner-to-Learner | Gesture Recognition | Capturing natural gestures for non-verbal communication and object manipulation in virtual spaces.
Learner-to-Learner | Synchronized Activities | Mapping shared physical activities (e.g., virtual sports) in real time to foster collaboration.
Learner-to-Learner | Haptic Feedback | Simulating remote physical touch, enhancing social presence and emotional connection.
Learner-to-Metaverse | Immersive Operations | Enabling direct and intuitive interaction with virtual objects through body movements.
Learner-to-Metaverse | Multisensory Feedback | Providing rich experiences through integrated visual, auditory, and haptic feedback.
Learner-to-Metaverse | Spatial Navigation | Allowing for natural navigation of virtual spaces using physical movements to enhance exploration.
Learner-to-Real-Environment | AR Annotations | Overlaying real-world objects with contextual learning information.
Learner-to-Real-Environment | Interaction Mapping | Mapping real-world actions to virtual environments for seamless learning.
Learner-to-Real-Environment | Environmental Adaptation | Dynamically adjusting learning content based on environmental data.
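To make the taxonomy in Table 1 concrete, the sketch below shows one possible way to represent the three interaction dimensions as a simple dispatch table that routes raw wearable events to their handlers. It is a minimal illustrative example under stated assumptions, not part of the framework described in this article; all names used here (InteractionStrategy, STRATEGIES, make_router, and the placeholder handlers) are hypothetical.

```python
# Minimal, illustrative sketch of the Table 1 taxonomy as a dispatch table.
# All names here (InteractionStrategy, STRATEGIES, make_router, ...) are
# hypothetical and not part of the framework described in this article.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class InteractionStrategy:
    dimension: str   # e.g., "learner-to-learner"
    content: str     # e.g., "gesture_recognition"
    description: str


# One entry per interaction dimension (abridged to one example row each).
STRATEGIES: Dict[str, InteractionStrategy] = {
    "gesture_recognition": InteractionStrategy(
        "learner-to-learner", "gesture_recognition",
        "Capture natural gestures for non-verbal communication and object manipulation."),
    "spatial_navigation": InteractionStrategy(
        "learner-to-metaverse", "spatial_navigation",
        "Navigate virtual spaces using physical movement."),
    "environmental_adaptation": InteractionStrategy(
        "learner-to-real-environment", "environmental_adaptation",
        "Adjust learning content based on sensed environmental data."),
}


def make_router(handlers: Dict[str, Callable[[dict], None]]) -> Callable[[str, dict], None]:
    """Return a dispatcher that routes a raw wearable event to its handler."""
    def route(event_type: str, payload: dict) -> None:
        strategy = STRATEGIES.get(event_type)
        handler = handlers.get(event_type)
        if strategy is None or handler is None:
            return  # Unknown event types are ignored in this sketch.
        print(f"[{strategy.dimension}] {strategy.content}")
        handler(payload)
    return route


if __name__ == "__main__":
    # Hypothetical handlers standing in for the rendering and agent layers.
    router = make_router({
        "gesture_recognition": lambda p: print("  manipulate object:", p.get("gesture")),
        "spatial_navigation": lambda p: print("  move avatar by:", p.get("delta")),
        "environmental_adaptation": lambda p: print("  adapt content to lux:", p.get("lux")),
    })
    router("gesture_recognition", {"gesture": "pinch"})
    router("environmental_adaptation", {"lux": 320})
```

In a real deployment the handlers would call into the multi-modal analysis, multi-agent, and rendering components discussed in this article; here they simply print, to keep the sketch self-contained.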
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

