Article

Subject-Independent Multimodal Interaction Modeling for Joint Emotion and Immersion Estimation in Virtual Reality

1 Design College, Shandong University of Arts, Jinan 250014, China
2 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(3), 451; https://doi.org/10.3390/sym18030451
Submission received: 14 January 2026 / Revised: 27 February 2026 / Accepted: 2 March 2026 / Published: 6 March 2026
(This article belongs to the Section Computer)

Abstract

Virtual Reality (VR) has emerged as a powerful medium for immersive human–computer interaction, where users’ emotional and experiential states play a pivotal role in shaping engagement and perception. However, existing affective computing approaches often model emotion recognition and immersion estimation as independent problems, overlooking their intrinsic coupling and the structured relationships underlying multimodal physiological signals. In this work, we propose a modality-aware multi-task learning framework that jointly models emotion recognition and immersion estimation from a graph-structured and symmetry-aware interaction perspective. Specifically, heterogeneous physiological and behavioral modalities—including eye-tracking, electrocardiogram (ECG), and galvanic skin response (GSR)—are treated as relational components with structurally symmetric encoding and fusion mechanisms, while their cross-modality dependencies are adaptively aggregated to preserve interaction symmetry at the representation level and introduce controlled asymmetry at the task-optimization level through weighted multi-task learning, without introducing explicit graph neural network architectures. To support reproducible evaluation, the VREED dataset is further extended with quantitative immersion annotations derived from presence-related self-reports via weighted aggregation and factor analysis. Extensive experiments demonstrate that the proposed framework consistently outperforms recurrent, convolutional, and Transformer-based baselines. Compared with the strongest Transformer baseline, the proposed framework yields consistent relative performance gains of approximately 3–7% for emotion recognition metrics and reduces immersion estimation errors by nearly 9%. Beyond empirical improvements, this study provides a structured interpretation of multimodal affective modeling that highlights symmetry, coupling, and controlled symmetry breaking in multi-task learning, offering a principled foundation for adaptive VR systems, emotion-driven personalization, and dynamic user experience optimization.

1. Introduction

With the rapid advancement of Virtual Reality (VR) technology, its applications in immersive art presentation, personalized recommendation, and educational training have expanded significantly. In virtual environments, users’ emotional states and immersion experiences not only directly affect content receptiveness but also provide vital insights for human–computer interaction system design [1,2]. Recent studies have shown that immersive VR can substantially modulate affective responses and esthetic judgments, highlighting the importance of emotion-aware modeling in virtual environments [3,4]. Traditional emotion recognition methods rely heavily on facial expressions or speech; however, such cues are often compromised in immersive VR scenarios due to headset occlusion or latency issues, resulting in reduced robustness and limited real-time applicability [5,6].
In recent years, physiological signals such as eye-tracking, electrocardiogram (ECG), and galvanic skin response (GSR) have attracted increasing attention as reliable and stable indicators of affective and experiential states. These signals capture fine-grained emotional fluctuations and subjective immersion more accurately under VR conditions and have been widely adopted in multimodal emotion recognition systems [3,7,8]. For example, Koelstra et al. constructed the DEAP dataset to demonstrate correlations between EEG signals and emotional responses [9], while Tabbaa et al. introduced the VREED dataset integrating eye-tracking and GSR signals for emotion evaluation in VR environments [10]. Electroencephalogram (EEG) has also been widely recognized as an informative modality for emotion recognition due to its sensitivity to neural dynamics. A recent survey by Chen et al. [11] summarized major advances in EEG-based affective modeling, while also noting practical challenges related to signal noise, preprocessing complexity, and real-time deployment. In contrast, eye-tracking, ECG, and GSR provide a more robust and deployment-friendly solution for continuous emotion and immersion assessment in VR environments. Beyond affective recognition, recent VR studies have also explored how advanced interaction paradigms, including large language model-powered guidance and cinematic VR simulations, influence user behavior, learning outcomes, and physiological responses [12,13]. Moreover, an emerging line of research has leveraged immersive VR not only as a sensing platform but also as an active intervention medium for emotional regulation and mental health support. Recent works have demonstrated the effectiveness of VR-based systems in treating emotional disorders, promoting psychological well-being, and alleviating chronic pain through embodied and interactive experiences [14,15,16]. These application-oriented findings further emphasize the necessity of accurate emotion and immersion modeling, as reliable affective perception constitutes a fundamental prerequisite for adaptive, personalized, and therapeutic VR systems. From a structural perspective, these multimodal signals form a set of heterogeneous yet interrelated components whose interactions exhibit inherent relational symmetry across modalities.
Meanwhile, psychological constructs such as sense of presence and flow state have been shown to play a mediating role in affective experience. Lønne et al. reported that higher immersion levels are associated with more positive emotional responses [1], while Ochs and Sonderegger revealed complex interdependencies between presence, engagement, and learning motivation [17,18]. These findings suggest that emotion and immersion should not be treated as isolated phenomena, but rather as coupled dimensions within a unified experiential structure. Such coupling naturally reflects a form of task-level symmetry, where both dimensions share latent representations while contributing asymmetrically to overall user experience. Multi-task learning (MTL) has therefore emerged as a promising paradigm for jointly modeling emotion recognition and immersion estimation by enabling representation sharing across related tasks. Nevertheless, conventional MTL approaches often suffer from task interference and gradient conflicts, particularly when handling heterogeneous objectives such as emotion classification and immersion regression. Recent studies have proposed conflict-aware optimization strategies to alleviate these issues and improve training stability [19,20]. From a learning-theoretic viewpoint, these challenges can be interpreted as manifestations of symmetry breaking during joint optimization, where balanced task coupling must be carefully controlled to avoid dominance by a single objective. Despite these advances, current studies on multimodal emotion and immersion assessment still face three key limitations:
(1)
Lack of a unified modeling framework that jointly captures emotion recognition and immersion estimation as structurally coupled tasks. Most existing methods address these problems independently, thereby failing to exploit their intrinsic relational symmetry [1,7,18].
(2)
Insufficient design of modality-aware feature extraction and fusion mechanisms for heterogeneous physiological signals. Many approaches overlook the balanced yet distinct roles of eye-tracking, ECG, and GSR modalities, resulting in suboptimal representation learning and weak cross-modality coordination [21,22,23].
(3)
Severe task interference in multi-task optimization, arising from differences in label structures and learning objectives, which can lead to unstable convergence and asymmetric task dominance during training [24,25].
From a structural reliability perspective, diagnosability theory in interconnection networks has provided a rigorous framework for characterizing fault tolerance under comparison-based models. Subsequent extensions systematically generalized these ideas through connectivity-aware and g-good-neighbor diagnosability analyses, establishing global reliability criteria for complex networks with structured symmetry and constrained fault propagation [26,27,28,29,30,31]. To address these challenges, we propose MMEA-Net, a modality-aware framework for joint emotion recognition and immersion estimation in VR environments. MMEA-Net adopts a structured interaction perspective in which multimodal physiological signals are treated as relational components and emotion–immersion objectives are modeled as coupled tasks with controlled symmetry and asymmetry. The main contributions of this work are summarized as follows:
(1)
We propose a unified multi-task framework, MMEA-Net, for jointly modeling emotion classification and immersion regression in VR environments. Compared with the best-performing Transformer-based baseline, MMEA-Net achieves a +2.55% improvement in test accuracy (75.42% vs. 72.87%), a +2.54% increase in F1-score (74.19% vs. 71.65%), and a +0.04 gain in Cohen’s Kappa (0.66 vs. 0.62) for emotion classification. For immersion estimation, the proposed model reduces RMSE by 0.09 (0.96 vs. 1.05), lowers MAE by 0.07 (0.74 vs. 0.81), and improves the coefficient of determination $R^2$ by +0.05 (0.63 vs. 0.58), demonstrating effective cross-task synergy.
(2)
We design a Hybrid-M modality-aware module and a Cross-Domain Fusion mechanism. Hybrid-M employs dedicated sub-networks to encode temporal characteristics of eye-tracking, ECG, and GSR signals, while the fusion mechanism facilitates structured and symmetry-aware interaction across modalities and tasks.
(3)
We extend the VREED dataset by annotating quantitative immersion scores, enabling dual-task benchmarking and supporting reproducible research in multimodal affective computing.

Organization of the Paper

The rest of this manuscript is structured as follows. Section 2 reviews recent advances in multimodal physiological affect modeling and multi-task learning in virtual reality scenarios. Section 3 presents the proposed MMEA-Net framework, including modality-specific encoding, structured cross-domain fusion, and the joint learning objective. Section 4 reports the experimental setup and quantitative evaluations, covering baseline comparisons, component, modality, and task ablation studies, as well as loss-weight sensitivity analysis. Section 5 discusses downstream task extensions enabled by the proposed framework. Section 6 provides an in-depth discussion on generalization, user variability, and implications of multi-task affective modeling. Finally, Section 7 concludes the paper by summarizing key findings, discussing limitations, and outlining future research directions.

2. Related Work

2.1. Emotion Recognition Using Physiological Signals

Emotion recognition using physiological signals has become an important research direction in affective computing. The DEAP dataset constructed by Koelstra et al. [9] is one of the earliest large-scale multimodal resources, comprising EEG, GSR, and facial signals annotated with affective states. This dataset has inspired extensive subsequent studies on robust representation learning from physiological inputs, particularly in settings where conventional behavioral cues are unreliable. Recent research has increasingly adopted deep neural architectures to capture temporal and inter-signal dependencies. For instance, Hu et al. [32] and Xiao et al. [33] proposed convolutional recurrent models augmented with self-attention mechanisms to model spatiotemporal patterns in EEG sequences, achieving strong performance on arousal and valence prediction. Lin and Li [21] provided a comprehensive survey of physiological modalities, including ECG and GSR, emphasizing their relative stability compared with facial expressions and speech, especially under constrained or immersive conditions.
More recently, information-theoretic and structure-aware modeling strategies have been introduced to further enhance EEG-based affective analysis. Chen et al. [34] demonstrated that normalized mutual information features can effectively capture nonlinear dependencies in EEG signals, enabling robust decoding of complex cognitive and affective driving states when combined with a self-optimized kernel-based learning framework. Building on this idea, Chen et al. [35] proposed an end-to-end EEG-based graph attention convolutional network, in which mutual information-driven connectivity was exploited to model inter-channel relationships for driver fatigue detection. These studies indicate a growing trend toward connectivity-aware and structure-preserving EEG representation learning, complementing conventional temporal feature extraction pipelines.
Beyond EEG-centric approaches, a growing body of work has explored multimodal fusion strategies. Fu et al. [7] combined eye-tracking, ECG, and GSR signals with transfer learning to enhance classification robustness. Katada and Okada [23] focused on user-independent modeling by incorporating modality importance weighting, while Moin et al. [6] developed a multimodal fusion pipeline for real-time human–machine interaction. Collectively, these studies suggest that physiological modalities can be regarded as heterogeneous yet structurally related components whose coordinated modeling is essential for reliable emotion recognition.
In related bio-signal recognition tasks, Wei et al. [36] showed that nonlinear feature decomposition combined with deep temporal–spatial learning can achieve competitive performance even with single-channel sEMG signals. Their study highlights the importance of multi-level feature representations and diverse evaluation metrics when analyzing noisy physiological signals, which is consistent with the multi-metric performance assessment strategy adopted in this work. Recent advances in temporal modeling further emphasize the importance of capturing long-term dependencies in sequential affective recognition tasks. Wang et al. [37] proposed ResLNet, a deep residual LSTM architecture designed to process longer temporal inputs through residual connections, thereby improving recognition performance under extended sequences. Similarly, Li et al. [38] introduced an LSTM network with neural plasticity mechanisms, enabling adaptive parameter modulation based on evolving temporal dynamics for robust driver fatigue recognition in real-world scenarios. In addition, adaptive temporal weighting strategies have been explored in multi-source domain adaptation settings, where Liu et al. [39] designed a weight-aware mechanism to dynamically balance heterogeneous temporal features for human motion intention recognition. Compared with recurrent residual architectures or plasticity-enhanced LSTM models, recent transformer-based frameworks capture long-range dependencies through global self-attention mechanisms that directly model temporal correlations without relying solely on recurrent state propagation. This evolution from recurrent stacking toward attention-based global modeling provides a broader context for understanding the temporal design choices adopted in the proposed framework.

2.2. Multimodal Emotion Modeling in Virtual Reality

Virtual Reality (VR) provides a natural environment for eliciting complex emotional responses through immersive and interactive stimuli. The VREED dataset introduced by Tabbaa et al. [10] collected eye-tracking and physiological signals during emotion-inducing VR experiences, offering a valuable benchmark for affective analysis in immersive settings. Building on this dataset, Alharbi [40] proposed a transparent deep learning framework for VR-based emotion recognition, emphasizing explainable feature selection across multiple physiological channels. A number of studies have further examined the relationships among presence, immersion, and emotional experience. Lønne et al. [1] showed that higher immersion levels are associated with enhanced affective responses in educational VR environments. Souza et al. [41] surveyed metrics for measuring presence in virtual environments, while Ochs and Sonderegger [18] identified links between immersion, cognitive engagement, and emotional outcomes. Related investigations by Yang et al. [2] and Liu et al. [42] explored how immersive art and educational visualization influence user flow, neural activity, and emotional engagement. These findings collectively indicate that emotion and immersion are not independent constructs, but rather interrelated dimensions that evolve jointly during immersive experiences. Beyond physiological signal analysis, recent advances in expressive modeling and human–computer interaction also provide relevant insights for immersive affective systems. Song et al. [43,44] investigated expressive 3D facial animation using local-to-global latent diffusion and personalized speech-driven modeling, highlighting the importance of emotion-consistent representations for realistic virtual avatars. In parallel, Hu et al. [45] demonstrated real-time human–computer interaction via wearable devices, emphasizing low-latency multimodal sensing and perception in immersive environments. From a relational modeling perspective, Meng et al. [46] proposed a heterogeneous graph-based framework for conversational emotion recognition, showing strong potential in capturing structured dependencies among multimodal cues. These studies further motivate the need for structured fusion and coordinated modeling in VR-oriented affective analysis. Recent advances in structured affective representation learning highlight the importance of pose awareness and expert-aware refinement in emotion-related tasks. Liu et al. [47] proposed a pose-aware contrastive framework that enhances facial representation robustness by modeling pose variations, demonstrating improved affective feature consistency across viewing conditions. Lin et al. [48] introduced a multi-expert ensemble framework (MLM-EOE) for automatic depression detection, where expert-specific sub-models collaboratively refine affective representations, emphasizing task-sensitive calibration and structured ensemble integration. Although MMEA-Net does not explicitly adopt pose-aware contrastive objectives or multi-expert branches, it shares a similar goal of structured and task-aware representation learning. Modality-specific encoders extract complementary behavioral and physiological cues, while cross-domain fusion and controlled task-level modulation enable coordinated refinement within a unified latent space. Beyond representation learning, advances in perception-layer modeling for VR and human–computer interaction further strengthen the linkage between low-level sensing and high-level affective inference. 
Head motion tracking via wireless earphones [45], multi-sensor fusion of IMU and acoustics [49], contactless eye blink detection using UWB radar [50], and robust interaction recognition with extended Kalman filtering [51] demonstrate how accurate and structured signal estimation supports reliable behavioral understanding in dynamic environments. Although primarily focused on perception rather than affective prediction, these studies illustrate the perception–representation–inference linkage that underpins immersive emotion modeling. In addition, immersive VR has been applied in clinical contexts such as prospective memory and eye fixation recovery following traumatic brain injury [52], underscoring the practical importance of accurately estimating immersion. By jointly modeling emotion and immersion, MMEA-Net provides a structured foundation for adaptive VR systems in rehabilitation, education, and immersive training scenarios.

2.3. Multi-Task Learning for Affective and Immersion-Aware Modeling

Multi-task learning (MTL) has been widely adopted to improve generalization by sharing representations across related tasks. Transformer-based architectures, including the original Transformer [53], T5 [54], and Vision Transformer (ViT) [55], have demonstrated strong capability in capturing shared structural patterns across modalities and objectives. In affective computing, several studies have applied MTL to mitigate overfitting and enhance task synergy. Liu and Yu [20] proposed MT2ST, which dynamically alternates between multi-task and single-task training phases. Liu et al. [19] introduced the Conflict-Averse Gradient Descent (CAGrad) algorithm to reduce gradient interference, an issue that becomes particularly pronounced when jointly optimizing heterogeneous objectives such as emotion classification and continuous state estimation. From a structural perspective, these approaches aim to balance task coupling while preventing dominance by a single objective during training. Beyond affective computing, recent advances in multi-modal perception and cross-modal representation learning further emphasize modality interaction and alignment. Almujally et al. [56] proposed a multi-modal remote perception framework that highlights coordinated feature extraction and modality-aware fusion for heterogeneous sensory streams. Shen et al. [57] introduced a vision-language cyclic interaction model (VLCIM) that achieves bidirectional cross-modal alignment through iterative refinement. In addition, hierarchical embedding strategies such as the word–phrase fusion mechanism with sparse SoftMax optimization [58] demonstrate how multi-level feature aggregation enhances structured representation consistency. Although developed in different domains, these studies collectively underscore two principles: (1) effective multi-modal modeling requires explicit or implicit cross-modal alignment, and (2) hierarchical or cyclic interaction mechanisms improve structured representation learning. Inspired by these insights, MMEA-Net adopts modality-aware encoding and structured cross-domain fusion to implicitly align heterogeneous physiological and behavioral signals within a unified latent space. Controlled task-level modulation under multi-task learning further enables coordinated emotion–immersion modeling without requiring explicit paired cross-modal supervision. Despite these advances, relatively few frameworks explicitly incorporate immersion or presence estimation alongside emotion recognition, overlooking the influence of immersive experience on affective processing [1,41]. Joint modeling of affective and immersive dimensions thus represents a promising direction for capturing structured dependencies in VR user experience. Similar challenges arise in other multi-source inference scenarios. For example, geophysical studies integrate triplicated seismic waveforms across multiple phases and scales to derive reliable conclusions from coordinated evidence [59]. This analogy highlights the importance of structured fusion and multi-scale consistency, reinforcing our design choice to jointly model emotion and immersion within a unified multimodal framework.

2.4. Structured Graph-Based Dependency Modeling

Recent studies have explored explicit graph construction strategies to model structured dependencies in affective and sequential learning tasks. For example, Meng et al. [46] proposed a heterogeneous graph message passing framework for conversational emotion recognition, where inter-speaker and temporal relationships are explicitly encoded through graph nodes and multi-step propagation. Similarly, Lv et al. [60] introduced a DAG-structured multi-granularity representation learning strategy that models hierarchical token dependencies within a directed acyclic graph topology. In addition, Wang et al. [61] developed a Flat-Lattice-CNN architecture that incorporates lattice-based graph structures to enhance structured token interactions. While these approaches rely on explicit graph adjacency construction and node-level message passing mechanisms, the proposed MMEA-Net adopts a fundamentally different modeling paradigm. Instead of predefined graph topologies or edge propagation procedures, relational symmetry across modalities is realized implicitly through parallel modality-specific encoders, sparse attention-based cross-domain fusion, and controlled task-level asymmetry during multi-task optimization. This implicit structured interaction strategy enables flexible multimodal coordination without introducing explicit graph parameters or rigid topological constraints.

3. Methodology

3.1. Overview of MMEA-Net

In this section, we present Multimodal Emotion and Engagement Assessment Network (MMEA-Net), a deep learning framework designed for the joint modeling of emotion state classification and immersion level estimation in virtual reality environments. Figure 1 provides an overview of the proposed architecture, highlighting its structured organization and task-coupled design.

Overview of Emotion and Immersion Modeling

The proposed MMEA-Net jointly models emotion recognition and immersion estimation from heterogeneous physiological, visual, and behavioral signals. Eye-tracking data encode visual attention and interaction patterns, while ECG and GSR signals capture internal physiological responses related to affective and immersive states. These multimodal inputs are first processed by modality-specific Hybrid-M encoders to extract structured temporal representations, which are subsequently integrated by the Cross-Domain Fusion module into a shared latent space. Unlike recurrent residual LSTM or neural plasticity-based architectures that rely on state propagation across time steps, the proposed framework models long-range temporal dependencies through multi-scale convolutional encoding combined with global self-attention, capturing global temporal correlations directly without recurrent stacking. Emotion classification and immersion regression are then performed using task-specific prediction heads on this unified representation, enabling both tasks to benefit from shared multimodal cues. The dependency between emotion and immersion is not explicitly predefined, but is implicitly captured through shared representation learning and joint multi-task optimization.
MMEA-Net adopts a multimodal formulation that integrates three heterogeneous yet structurally related data streams: visual and behavioral information derived from eye tracking, physiological dynamics captured by electrocardiogram (ECG), and autonomic responses measured through galvanic skin response (GSR). Treating these modalities as complementary components within a unified representation space enables the model to capture coordinated patterns of user responses that are difficult to observe through single-modality analysis. The overall architecture is organized around three tightly coupled components:
  • Hybrid-M modules, which perform modality-specific feature extraction to preserve the intrinsic temporal characteristics of each signal while maintaining a consistent structural interface across modalities.
  • A Cross-Domain Fusion module, which integrates modality-specific representations through coordinated interactions, allowing information to be exchanged while maintaining balance among modalities.
  • A Multi-scale Feature Extraction (MFE) mechanism, embedded within both the modality-specific and fusion stages, to capture temporal patterns at different resolutions in a structurally consistent manner.
This structured design allows MMEA-Net to address key challenges in multimodal emotion and immersion analysis, including heterogeneity across physiological signals and imbalance between learning objectives. By organizing modality processing and task prediction in a coordinated and symmetric fashion, the framework promotes stable representation sharing while avoiding dominance by any single modality or task. Furthermore, the dual-task learning formulation enables emotion recognition and immersion estimation to be optimized jointly. Rather than treating these objectives independently, MMEA-Net exploits their intrinsic coupling, encouraging the learning of shared representations that reflect the mutual influence between affective and immersive states. The model is trained and evaluated using the VREED dataset, which provides synchronized eye-tracking, ECG, and GSR recordings collected from participants exposed to diverse VR scenarios. The availability of aligned emotional labels and immersion-related self-reports makes this dataset particularly suitable for studying the structured relationship between emotion and immersion within a unified learning framework.

3.2. Hybrid-M: Modality-Specific Feature Extraction

The Hybrid-M module constitutes the foundational layer of the proposed framework and is responsible for processing each input modality independently prior to cross-domain fusion. This design follows a parallel and structurally consistent processing strategy, ensuring that heterogeneous physiological and behavioral signals are encoded through comparable transformation pipelines while preserving their modality-specific characteristics.
For each modality $m \in \{S, I, G\}$, where $S$ denotes visual and behavioral eye-tracking signals, $I$ corresponds to physiological ECG data, and $G$ represents physiological GSR data, the Hybrid-M module performs feature extraction according to
$$T_m = \text{Hybrid-M}(X_m),$$
where $X_m$ is the raw input of modality $m$, and $T_m$ is the resulting tensor representation. This formulation enforces a uniform mapping structure across modalities, allowing modality-specific information to be encoded within a shared representational framework.
As illustrated in Figure 2 (right), the Hybrid-M module is composed of several sequential processing stages:
  • Vectorization. Raw input signals are first transformed into vectorized representations suitable for neural processing. For eye-tracking data, this includes spatial gaze coordinates, pupil dilation, and fixation-related statistics. For ECG signals, features such as R–R intervals and heart rate variability are extracted, while GSR signals are decomposed into tonic and phasic components. This step establishes a common vector-level interface across modalities.
  • Multi-scale feature extraction. To capture temporal patterns occurring at different resolutions, a multi-scale feature extraction mechanism is applied:
    $$F_{\mathrm{MFE}} = \mathrm{MLP}\big(\mathrm{Conv}_{1\times 1}(F) \oplus \mathrm{Conv}_{3\times 3}(F) \oplus \mathrm{Conv}_{7\times 7}(F)\big),$$
    where $F$ denotes the input feature sequence, $\mathrm{Conv}_{k\times k}$ represents convolution with kernel size $k \times k$, and $\oplus$ denotes concatenation. This design enables the model to capture both short-term fluctuations and longer-term trends in a balanced manner.
  • Normalization and linear transformation. The extracted features are normalized and linearly transformed to stabilize optimization and align feature distributions across modalities:
    $$F_{\mathrm{norm}} = \mathrm{LayerNorm}\big(\mathrm{Linear}(F_{\mathrm{MFE}} \oplus F_{\mathrm{vec}})\big),$$
    where $F_{\mathrm{vec}}$ denotes the original vectorized features retained through a skip connection. This operation preserves modality-specific information while enforcing scale consistency.
  • Transformer-based temporal modeling. A transformer decoder is employed to model sequential dependencies:
    $$Z = \mathrm{TransformerDecoder}(F_{\mathrm{norm}}),$$
    enabling the extraction of long-range temporal relationships that are critical for modeling evolving emotional and physiological responses.
  • Linear projection and residual refinement. The transformer output is further processed by linear and fully connected layers:
    $$F_{\mathrm{FC}} = \mathrm{FC}\big(\mathrm{Linear}(Z)\big),$$
    followed by a residual connection and normalization:
    $$F_{\mathrm{res}} = \mathrm{LayerNorm}(F_{\mathrm{FC}} + F_{\mathrm{norm}}).$$
    This step refines the representation while maintaining structural consistency across modalities.
  • Multi-head attention. Finally, a multi-head attention mechanism is applied to emphasize informative temporal segments:
    $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O},$$
    where each attention head is computed as
    $$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$$
    and
    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V.$$
The output of the Hybrid-M module is a structured tensor representation  T m  that encodes modality-specific temporal dynamics within a unified processing framework. By enforcing parallel and structurally aligned transformations across modalities, Hybrid-M preserves signal heterogeneity while enabling balanced integration in subsequent fusion stages. This design provides a stable and symmetric foundation for cross-domain interaction and joint emotion–immersion modeling. Importantly, the Hybrid-M modules are shared by both emotion recognition and immersion estimation tasks and are jointly optimized in a unified multi-task learning framework. During training, gradients from the emotion classification loss and the immersion regression loss are simultaneously backpropagated through the Hybrid-M encoders, allowing both objectives to influence the modality-specific feature extraction process. As a result, the learned representations implicitly capture task-relevant dependencies at the feature level, reflecting physiological patterns that are informative for both affective states and immersive experience. This shared optimization mechanism establishes the first stage of cross-task coupling, without introducing explicit task conditioning or hierarchical task dependency.
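To make the Hybrid-M pipeline concrete, the sketch below shows one possible PyTorch realization of the stages above (multi-scale convolution, normalization with a skip connection, transformer-based temporal modeling, residual refinement, and multi-head attention). Layer sizes, the shape of the vectorized input, and the use of standard nn.TransformerDecoder and nn.MultiheadAttention components are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HybridM(nn.Module):
    """Minimal sketch of one modality-specific Hybrid-M encoder (assumed layout)."""
    def __init__(self, feat_dim=64, hidden_dim=256, n_heads=8, n_layers=4):
        super().__init__()
        # Multi-scale temporal convolutions (kernel sizes 1, 3, 7 as in the F_MFE equation).
        self.conv1 = nn.Conv1d(feat_dim, hidden_dim, kernel_size=1, padding=0)
        self.conv3 = nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1)
        self.conv7 = nn.Conv1d(feat_dim, hidden_dim, kernel_size=7, padding=3)
        self.mfe_mlp = nn.Linear(3 * hidden_dim, hidden_dim)
        # Skip connection with the vectorized input, followed by normalization.
        self.proj = nn.Linear(hidden_dim + feat_dim, hidden_dim)
        self.norm_in = nn.LayerNorm(hidden_dim)
        # Transformer-based temporal modeling.
        dec_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=n_heads,
                                               batch_first=True)
        self.temporal = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
        # Linear/FC projection, residual refinement, and final multi-head attention.
        self.fc = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, hidden_dim))
        self.norm_out = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)

    def forward(self, x_vec):
        # x_vec: (batch, seq_len, feat_dim), the already-vectorized modality input.
        c = x_vec.transpose(1, 2)                        # (batch, feat_dim, seq_len)
        f_mfe = torch.cat([self.conv1(c), self.conv3(c), self.conv7(c)], dim=1)
        f_mfe = self.mfe_mlp(f_mfe.transpose(1, 2))      # back to (batch, seq_len, hidden)
        f_norm = self.norm_in(self.proj(torch.cat([f_mfe, x_vec], dim=-1)))
        z = self.temporal(f_norm, f_norm)                # decoder used in self-attention mode
        f_res = self.norm_out(self.fc(z) + f_norm)       # residual refinement
        t_m, _ = self.attn(f_res, f_res, f_res)          # emphasize informative segments
        return t_m                                       # modality tensor T_m
```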

3.3. Cross-Domain Fusion

The Cross-Domain Fusion module, shown in Figure 2 (left), is designed to integrate modality-specific representations while maintaining their individual structural properties. This module aims to coordinate information from heterogeneous sources in a balanced manner, enabling structured interaction among physiological and behavioral signals for joint emotion and immersion modeling.
Given the tensor representations S, I, and G extracted from eye-tracking, ECG, and GSR modalities, respectively, the fusion process is formulated as
$$Z_{\mathrm{fused}} = \mathrm{CrossDomainFusion}(S, I, G).$$
Rather than collapsing modalities into a single representation at an early stage, this formulation preserves modality-specific structure while allowing controlled interaction across domains.
The fusion procedure consists of several sequential stages:
  • Initial tensor combination. The modality-specific tensors are first combined through learnable weighting:
    $$Z_{\mathrm{initial}} = \alpha S \oplus \beta I \oplus \gamma G,$$
    where $\alpha$, $\beta$, and $\gamma$ are trainable parameters and $\oplus$ denotes concatenation. This operation establishes a balanced aggregation scheme in which each modality contributes proportionally, preventing dominance by any single source.
  • Normalization and linear transformation. The combined representation is then normalized and linearly transformed:
    $$Z_{\mathrm{norm}} = \mathrm{Norm}\big(\mathrm{Linear}(Z_{\mathrm{initial}})\big),$$
    ensuring scale alignment and structural consistency across modalities before higher-order interaction.
  • Sparse attention layers. A stack of sparse attention layers refines the normalized representation:
    $$Z_i = \mathrm{SparseAttention}(Z_{i-1}), \quad i = 1, \ldots, N,$$
    with $Z_0 = Z_{\mathrm{norm}}$. The sparsity constraint restricts attention to a subset of informative relationships, promoting efficient and structured interaction among features while reducing redundancy.
  • Parallel feature projection. The output of the sparse attention layers is processed by multiple parallel multilayer perceptrons:
    $$F_j = \mathrm{MLP}_j(Z_N), \quad j = 1, 2, 3,$$
    where each MLP captures a distinct transformation perspective. This parallel design preserves symmetry in feature processing, allowing multiple coordinated views of the fused representation.
  • Cross-attention integration. The parallel feature projections are integrated through a cross-attention mechanism:
    $$Z_{\mathrm{fused}} = \mathrm{CrossAttention}(F_1, F_2, F_3),$$
    defined as
    $$\mathrm{CrossAttention}(F_1, F_2, F_3) = \mathrm{softmax}\!\left(\frac{F_1 F_2^{T}}{\sqrt{d_k}}\right) F_3.$$
    This operation enables structured information exchange across feature perspectives, reinforcing complementary patterns while maintaining internal consistency.
The resulting fused representation  Z f u s e d  encodes coordinated multimodal information in a unified yet structured form. By organizing fusion through balanced weighting, sparse interaction, parallel transformation, and cross-perspective integration, the Cross-Domain Fusion module captures inter-modal relationships without sacrificing modality-specific integrity. This design supports stable downstream prediction for both emotion classification and immersion estimation by preserving symmetry and coordination across modalities. Furthermore, the Cross-Domain Fusion module facilitates cross-task interaction by aligning multimodal information before task-specific decoding. Both emotion classification and immersion regression heads operate on the same fused latent representation, which encourages the network to preserve information that is jointly informative for both tasks. Through this shared representation space, emotion-related and immersion-related cues can mutually influence the feature learning process in a soft and symmetric manner. Notably, this interaction is realized at the representation level rather than through explicit task-to-task conditioning, thereby avoiding rigid directional assumptions while maintaining flexibility across different application scenarios.
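A minimal sketch of this fusion pipeline is given below. The sparse attention layers are approximated with standard multi-head attention for readability, and all dimensions, layer counts, and the scalar modality weights are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class CrossDomainFusion(nn.Module):
    """Minimal sketch of the Cross-Domain Fusion stage (assumed configuration)."""
    def __init__(self, dim=256, n_heads=8, n_sparse_layers=3):
        super().__init__()
        # Learnable modality weights alpha, beta, gamma.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        self.gamma = nn.Parameter(torch.ones(1))
        self.proj = nn.Linear(3 * dim, dim)
        self.norm = nn.LayerNorm(dim)
        # Attention stack standing in for the sparse attention layers.
        self.sparse_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True)
             for _ in range(n_sparse_layers)])
        # Three parallel MLP "views" of the fused representation.
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(3)])
        self.scale = dim ** 0.5

    def forward(self, s, i, g):
        # s, i, g: (batch, seq_len, dim) tensors from the eye-tracking,
        # ECG, and GSR Hybrid-M encoders, respectively.
        z = torch.cat([self.alpha * s, self.beta * i, self.gamma * g], dim=-1)
        z = self.norm(self.proj(z))
        for attn in self.sparse_layers:
            z, _ = attn(z, z, z)                       # refine with (sparse) attention
        f1, f2, f3 = [mlp(z) for mlp in self.mlps]     # parallel feature projections
        # Cross-attention integration: softmax(F1 F2^T / sqrt(d_k)) F3.
        weights = torch.softmax(f1 @ f2.transpose(1, 2) / self.scale, dim=-1)
        return weights @ f3                            # fused representation Z_fused
```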

3.4. Multi-Scale Feature Extraction (MFE)

The Multi-scale Feature Extraction (MFE) module is designed to capture temporal patterns at different resolutions, which is essential for modeling the complex dynamics of emotional responses and immersion levels in virtual reality environments. Physiological and behavioral signals inherently exhibit variations across multiple time scales, ranging from rapid transient reactions to slowly evolving trends. The MFE module addresses this characteristic by organizing feature extraction into parallel and structurally consistent branches operating at different temporal scales.
Given an input feature representation $x$, the MFE module computes the multi-scale output as
$$x' = \mathrm{MFE}(x).$$
This formulation allows features extracted at different temporal resolutions to be integrated within a single structured representation.
As illustrated in Figure 3, the MFE module consists of the following stages:
  • Initial feature projection. The input representation is first transformed using a multilayer perceptron:
    $$F_{\mathrm{initial}} = \mathrm{MLP}(x),$$
    which maps the input features into a space suitable for parallel multi-scale processing.
  • Parallel multi-scale convolutions. Three convolutional operations with different kernel sizes are applied in parallel:
    $$F_{\mathrm{small}} = \mathrm{Conv}_{1\times 1}(F_{\mathrm{initial}}),$$
    $$F_{\mathrm{medium}} = \mathrm{Conv}_{3\times 3}(F_{\mathrm{initial}}),$$
    $$F_{\mathrm{large}} = \mathrm{Conv}_{7\times 7}(F_{\mathrm{initial}}),$$
    where each branch captures patterns at a distinct temporal resolution. The parallel design ensures structural symmetry across scales, allowing fine-grained, intermediate, and coarse temporal dynamics to be modeled in a balanced manner.
  • Feature aggregation. The outputs from different scales are concatenated:
    $$F_{\mathrm{concat}} = F_{\mathrm{small}} \oplus F_{\mathrm{medium}} \oplus F_{\mathrm{large}},$$
    preserving information from all temporal resolutions while maintaining their individual contributions.
  • Integrated transformation. The aggregated representation is further processed through convolutional and nonlinear transformations:
    $$F_{\mathrm{conv}} = \mathrm{Conv}_{3\times 3}(F_{\mathrm{concat}}),$$
    $$F_{\mathrm{relu}} = \mathrm{ReLU}(F_{\mathrm{conv}}),$$
    followed by a final MLP:
    $$x' = \mathrm{MLP}(F_{\mathrm{relu}}),$$
    which integrates information across scales into a unified feature representation.
By organizing feature extraction through parallel branches and coordinated aggregation, the MFE module captures temporal dynamics at multiple resolutions without favoring any single scale. This structurally balanced design enables MMEA-Net to represent both rapid emotional reactions and gradual changes in immersion in a consistent manner, providing robust multi-scale features for downstream multimodal fusion and dual-task prediction.
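The following sketch illustrates one way the MFE block could be realized in PyTorch, following the equations above (parallel temporal kernels of size 1, 3, and 7, concatenation, a merging convolution with ReLU, and a final MLP); the channel sizes and function interface are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MFE(nn.Module):
    """Minimal sketch of the Multi-scale Feature Extraction block (assumed sizes)."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp_in = nn.Linear(dim, dim)
        # Parallel temporal convolutions at three scales.
        self.conv_small = nn.Conv1d(dim, dim, kernel_size=1, padding=0)
        self.conv_medium = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.conv_large = nn.Conv1d(dim, dim, kernel_size=7, padding=3)
        # Merging convolution and output MLP.
        self.conv_merge = nn.Conv1d(3 * dim, dim, kernel_size=3, padding=1)
        self.mlp_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        f = self.mlp_in(x).transpose(1, 2)                 # (batch, dim, seq_len)
        f_cat = torch.cat([self.conv_small(f),
                           self.conv_medium(f),
                           self.conv_large(f)], dim=1)     # concatenate all scales
        f_out = torch.relu(self.conv_merge(f_cat))         # integrated transformation
        return self.mlp_out(f_out.transpose(1, 2))         # back to (batch, seq_len, dim)
```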

3.4.1. Design Rationale of Kernel Sizes

The kernel sizes $3 \times 3$, $5 \times 5$, and $7 \times 7$ are selected to provide complementary temporal receptive fields with a compact and balanced design. Physiological and behavioral signals in VR environments exhibit heterogeneous dynamics, where rapid transient responses (e.g., short-term gaze shifts or autonomic spikes) coexist with medium-term modulation patterns and slowly evolving immersion trends. The $3 \times 3$ convolution focuses on local and fine-grained temporal variations, the $5 \times 5$ kernel captures intermediate-range dependencies, and the $7 \times 7$ kernel aggregates broader contextual information to model longer-term temporal correlations. This coarse-to-fine scale configuration enables the MFE module to encode short-, mid-, and long-range temporal patterns in parallel, while preserving structural symmetry across branches.

3.4.2. Comparison with Alternative Scale Configurations

An alternative strategy is to approximate large receptive fields by stacking multiple $3 \times 3$ convolutions. Although such designs are effective in deep vision models, they introduce increased network depth and additional nonlinear transformations, which may amplify optimization instability and overfitting when training data are limited, as is common in multimodal physiological datasets. In contrast, explicitly incorporating $5 \times 5$ and $7 \times 7$ kernels provides direct access to wider temporal contexts with fewer sequential operations, resulting in more stable feature aggregation under the proposed multi-task learning framework.

3.4.3. Impact of Using More Diverse Kernel Sizes

Extending the MFE module with additional kernel sizes (e.g., $1 \times 1$, $9 \times 9$, or dilated convolutions) may further increase receptive field diversity; however, such extensions also introduce higher computational overhead and feature redundancy. Excessively large receptive fields tend to over-smooth high-frequency physiological variations that are critical for emotion discrimination, while overly fine scales may be sensitive to noise. Therefore, the adopted three-scale configuration represents a practical trade-off between temporal coverage, representation diversity, and computational efficiency, which is further supported by the ablation analysis showing consistent performance degradation when the MFE component is removed.

3.5. Training and Optimization

MMEA-Net is trained in an end-to-end manner using a unified multi-task objective that jointly optimizes emotion state classification and immersion level estimation. The overall loss function is formulated as a weighted combination of task-specific objectives and regularization:
$$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{emotion}} + \lambda_2 \mathcal{L}_{\mathrm{immersion}} + \lambda_3 \mathcal{L}_{\mathrm{reg}},$$
where $\mathcal{L}_{\mathrm{emotion}}$ corresponds to the classification loss for emotion recognition, $\mathcal{L}_{\mathrm{immersion}}$ measures regression error for immersion estimation, $\mathcal{L}_{\mathrm{reg}}$ is a regularization term to prevent overfitting, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ control the relative contribution of each component. This formulation enables balanced optimization across heterogeneous objectives while avoiding dominance by any single task.
From an optimization perspective, the weighting coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ explicitly regulate the relative gradient contributions of different objectives during backpropagation. Accordingly, the parameter update at each training iteration can be expressed as
$$\nabla_{\theta} \mathcal{L}_{\mathrm{total}} = \lambda_1 \nabla_{\theta} \mathcal{L}_{\mathrm{emotion}} + \lambda_2 \nabla_{\theta} \mathcal{L}_{\mathrm{immersion}} + \lambda_3 \nabla_{\theta} \mathcal{L}_{\mathrm{reg}},$$
where $\theta$ denotes the learnable parameters of MMEA-Net. In this formulation, $\lambda_1$ controls the optimization pressure toward discriminative emotion-related representations, while $\lambda_2$ adjusts the influence of continuous immersion regression on the shared backbone. By tuning $\lambda_1$ and $\lambda_2$, the proposed framework achieves controlled asymmetry at the task level: although both tasks share a symmetric network structure and fusion mechanism, their gradients are intentionally weighted to prevent task dominance and negative interference during joint optimization. The regularization coefficient $\lambda_3$ serves as a global constraint that stabilizes training by penalizing excessive parameter magnitudes, thereby improving generalization and mitigating overfitting caused by task-specific gradient fluctuations. The emotion classification loss is defined using categorical cross-entropy:
$$\mathcal{L}_{\mathrm{emotion}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}),$$
where $N$ denotes the number of samples, $C$ is the number of emotion categories, $y_{i,c}$ indicates whether sample $i$ belongs to class $c$, and $p_{i,c}$ represents the predicted class probability. This loss encourages discriminative separation among emotional states while remaining compatible with joint optimization. Immersion level estimation is modeled as a regression task using mean squared error:
$$\mathcal{L}_{\mathrm{immersion}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$$
where $y_i$ and $\hat{y}_i$ denote the ground truth and predicted immersion levels for sample $i$, respectively. This objective captures continuous variations in immersive experience and complements the discrete emotion classification task. To improve generalization and stabilize training, L2 regularization is applied to all learnable parameters:
$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{2} \sum_{w \in \mathcal{W}} w^2,$$
where $\mathcal{W}$ denotes the set of model parameters. Regularization serves as a constraint that maintains symmetry in parameter updates across tasks during optimization. Model training is performed using the Adam optimizer with an initial learning rate of 0.001. A learning rate decay strategy is employed to promote stable convergence:
$$\mathrm{lr} = \mathrm{lr}_0 \cdot \mathrm{decay\_rate}^{\,\mathrm{epoch}/\mathrm{decay\_steps}},$$
where $\mathrm{lr}_0$ is the initial learning rate, $\mathrm{decay\_rate}$ is set to 0.95, and $\mathrm{decay\_steps}$ is set to 2. To further reduce overfitting, dropout with a rate of 0.3 is applied to fully connected layers, and weight decay with a coefficient of 0.0001 is used. Early stopping based on validation performance is adopted to prevent unnecessary overtraining. Mini-batch gradient descent is conducted with a batch size of 32, and training is performed for up to 100 epochs. The dataset is partitioned into training (70%), validation (15%), and test (15%) subsets, with stratified sampling to preserve balanced class distributions. This training protocol ensures consistent evaluation and supports stable joint learning of emotion and immersion under a unified optimization framework.
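For concreteness, the snippet below sketches how the weighted multi-task objective and the stepwise learning-rate decay could be assembled in PyTorch. The loss weights follow the values reported later in the experimental setup ($\lambda_1 = 1.0$, $\lambda_2 = 0.5$, $\lambda_3 = 0.0001$), while the helper function itself and the StepLR realization of the decay formula are illustrative assumptions rather than the authors' training script.

```python
import torch
import torch.nn as nn

def total_loss(emotion_logits, emotion_labels, immersion_pred, immersion_true,
               model, lam1=1.0, lam2=0.5, lam3=1e-4):
    """Minimal sketch of L_total = lam1*L_emotion + lam2*L_immersion + lam3*L_reg."""
    l_emotion = nn.functional.cross_entropy(emotion_logits, emotion_labels)
    l_immersion = nn.functional.mse_loss(immersion_pred, immersion_true)
    # Explicit L2 term over model parameters, matching the L_reg equation above.
    l_reg = 0.5 * sum(w.pow(2).sum() for w in model.parameters())
    return lam1 * l_emotion + lam2 * l_immersion + lam3 * l_reg

# One possible realization of lr = lr0 * decay_rate ** (epoch / decay_steps),
# applied per epoch (an assumption, not the authors' exact schedule):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.95)
```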

Potential Challenges and Mitigation Strategies in Training with Heterogeneous Multimodal Data

When jointly training on heterogeneous multimodal data such as eye-tracking, electrocardiogram (ECG), and galvanic skin response (GSR) signals, several challenges inevitably arise, including modality-wise distribution discrepancies, inconsistent noise characteristics, and temporal synchronization errors. These modalities differ substantially in sampling rates, numerical scales, and signal dynamics. If early fusion is performed directly at the raw signal level, such heterogeneity may introduce bias during optimization and adversely affect training stability.
To alleviate the impact of modality distribution differences, the proposed framework employs modality-specific Hybrid-M encoders to perform independent normalization and feature transformation for each input stream prior to cross-modal fusion, thereby reducing direct interference at the raw signal level. In addition, multi-scale feature extraction and attention-based temporal modeling enhance robustness to modality-dependent noise, enabling the network to focus on more informative temporal segments while suppressing unreliable observations.
With respect to temporal synchronization, all modalities are aligned at the temporal-window level before feature extraction to ensure coarse-grained temporal correspondence across signals. The subsequent transformer-based temporal modeling is able to tolerate residual temporal misalignment by learning contextual dependencies, rather than relying on strict point-wise synchronization. Although these strategies do not eliminate all potential sources of bias in heterogeneous multimodal data, they provide a practical and scalable solution for achieving stable joint training in immersive virtual reality settings.

3.6. Discussion on Symmetry Awareness and Symmetry Breaking

The term “symmetry-aware” in this work does not imply strict structural symmetry at all levels, but rather emphasizes maintaining architectural consistency and symmetric processing across modalities. Hybrid-M adopts parallel and structurally aligned encoders, while Cross-Domain Fusion employs symmetric fusion and attention mechanisms to preserve symmetry at the representation level. At the same time, controlled symmetry breaking is intentionally introduced during multi-task optimization through asymmetric loss weighting, inducing gradient-level asymmetry to prevent task dominance and negative transfer. This design preserves structural symmetry while allowing limited and functional asymmetry, thereby balancing stability, generalization, and task-specific adaptation.

4. Experiments

4.1. Datasets

4.1.1. VREED Dataset Overview

The VREED (Virtual Reality for Emotion Elicitation Dataset) [10] is a multimodal dataset designed to support emotion recognition research in immersive virtual reality environments. It provides synchronized recordings of physiological signals, behavioral measurements, and subjective self-reports collected from 25 participants exposed to a range of emotion-eliciting VR scenarios. The dataset comprises the following components:
  • Visual and behavioral data, including eye-tracking metrics such as fixation duration, saccade amplitude, pupil dilation, and gaze coordinates.
  • Physiological data from electrocardiogram (ECG) recordings sampled at 256 Hz, capturing heart rate dynamics and variability.
  • Physiological data from galvanic skin response (GSR) recordings sampled at 128 Hz, reflecting autonomic arousal through skin conductance changes.
  • Self-reported emotion annotations, where participants evaluated their affective states using the Self-Assessment Manikin (SAM) scale along the valence, arousal, and dominance dimensions.
  • Post-exposure questionnaire responses, including presence-related items used to assess immersion levels.
Participants experienced eight distinct 360° VR video stimuli selected to elicit representative emotional states across the valence–arousal space, such as joy, fear, sadness, and calmness. Each recording session lasted approximately 60 s, resulting in a total of 440 synchronized multimodal sequences. The structured alignment of behavioral signals, physiological responses, and subjective reports makes VREED particularly suitable for studying the coordinated relationship between emotion and immersion in VR environments.

4.1.2. Immersion Level Extraction

Although the VREED dataset was originally developed for emotion recognition, it also contains structured information relevant to immersion assessment through post-exposure presence questionnaires. These questionnaires consist of seven presence-related items derived from the Presence Questionnaire (PQ), each rated on a 7-point Likert scale ranging from 0 to 6.
To derive a continuous immersion level annotation, we designed a systematic aggregation procedure that integrates multiple PQ items while accounting for their relative contributions to the overall sense of presence. The extraction process involves the following steps:
  • Reverse coding. For negatively phrased items (POST_PQ2, POST_PQ3, and POST_PQ5), reverse coding is applied to ensure a consistent directional interpretation, such that higher scores uniformly correspond to higher immersion levels:
    $$PQ_{\mathrm{reversed}} = 6 - PQ_{\mathrm{original}}.$$
  • Item selection. Based on factor analysis of the PQ responses, six items closely associated with immersion-related constructs are retained (POST_PQ1, POST_PQ2, POST_PQ3, POST_PQ4, POST_PQ5, and POST_PQ7). The item POST_PQ6, which primarily reflects attention to background music, is excluded due to its weak correlation with the remaining presence dimensions.
  • Weighted aggregation. A weighted average of the selected items is computed to obtain a unified immersion score:
    $$\mathrm{Immersion\_Score} = \frac{w_1 \cdot PQ_1 + w_2 \cdot (6 - PQ_2) + w_3 \cdot (6 - PQ_3) + w_4 \cdot PQ_4 + w_5 \cdot (6 - PQ_5) + w_7 \cdot PQ_7}{\sum_{i \in \{1,2,3,4,5,7\}} w_i},$$
    where $w_i$ denotes the weight associated with each PQ item. These weights are determined through principal component analysis (PCA), reflecting the relative contribution of each dimension to the overall immersion construct.
This aggregation strategy produces a continuous immersion label that preserves the balanced contribution of multiple presence-related dimensions, enabling consistent integration with physiological and behavioral features during joint emotion–immersion modeling.
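The aggregation can be expressed compactly as in the sketch below. The item keys and the equal example weights are placeholders; in practice, the weights come from the PCA/factor analysis described above.

```python
def immersion_score(pq, weights):
    """Minimal sketch of the weighted PQ aggregation described above.

    `pq` maps item names ('PQ1'...'PQ7') to raw 0-6 scores; `weights` maps the
    retained items to their PCA-derived weights (hypothetical values here).
    """
    reversed_items = {'PQ2', 'PQ3', 'PQ5'}                   # negatively phrased items
    selected = ['PQ1', 'PQ2', 'PQ3', 'PQ4', 'PQ5', 'PQ7']    # PQ6 excluded
    num, den = 0.0, 0.0
    for item in selected:
        score = 6 - pq[item] if item in reversed_items else pq[item]  # reverse coding
        num += weights[item] * score
        den += weights[item]
    return num / den

# Example with made-up responses and equal weights:
# immersion_score({'PQ1': 5, 'PQ2': 1, 'PQ3': 2, 'PQ4': 4, 'PQ5': 0, 'PQ6': 3, 'PQ7': 6},
#                 {k: 1.0 for k in ['PQ1', 'PQ2', 'PQ3', 'PQ4', 'PQ5', 'PQ7']})
```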

4.1.3. Data Preprocessing

Physiological Signal Preprocessing
Prior to model input, heterogeneous physiological and behavioral signals are processed following standard preprocessing practices commonly adopted in multimodal affective computing studies. For ECG recordings, the raw waveforms are not directly fed into the network; instead, cardiac temporal descriptors derived from RR intervals and heart rate variability are extracted to characterize cardiovascular dynamics. GSR signals are smoothed to obtain stable skin conductance representations that capture overall autonomic arousal trends while reducing high-frequency noise. Eye-tracking data rely on the preprocessed gaze-related features provided by the dataset, with short missing segments caused by blinks or tracking loss handled through interpolation. These preprocessing steps aim to improve signal robustness and modality consistency without introducing task-specific assumptions or altering the original data semantics. Prior to model training, a series of preprocessing steps were applied to ensure consistency and structural alignment across multimodal data streams:
  • Temporal alignment: All modalities were synchronized and resampled to a unified sampling rate of 64 Hz, enabling consistent temporal correspondence across signals.
  • Normalization: Z-score normalization was applied to all features to standardize their scale:
    $$x_{\mathrm{normalized}} = \frac{x - \mu}{\sigma},$$
    where $\mu$ and $\sigma$ denote the mean and standard deviation of each feature, respectively.
  • Segmentation: Continuous recordings were segmented into non-overlapping windows of 5 s, resulting in approximately 5280 multimodal segments used for model training and evaluation.
  • Feature extraction: For each modality, both statistical descriptors (e.g., mean and standard deviation) and temporal features (e.g., frequency-domain characteristics and rate-of-change measures) were extracted to capture complementary signal properties.
  • Missing data handling: Occasional missing values in eye-tracking signals, such as those caused by blinks, were addressed using forward filling for short gaps and interpolation for longer gaps to preserve temporal continuity.
After preprocessing, the dataset was divided into training (70%), validation (15%), and test (15%) subsets using stratified sampling to maintain similar distributions of emotion labels and immersion scores across splits. To prevent data leakage, all segments originating from the same participant–video pair were assigned exclusively to a single subset. For clarity, the validation set is used during model development to tune parameters and monitor performance, helping researchers decide when and how to adjust the model. In contrast, the test set is kept completely unseen during training and validation, and is used only once to provide an unbiased evaluation of the final model’s performance on new data.
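A minimal sketch of the per-signal preprocessing steps (resampling to 64 Hz, z-score normalization, and non-overlapping 5 s windowing) is shown below; the interpolation-based resampling and the function interface are illustrative assumptions rather than the exact pipeline.

```python
import numpy as np

def preprocess_segment(signal, fs_in, fs_out=64, window_s=5):
    """Sketch of resampling, z-scoring, and windowing for one 1-D signal channel."""
    # Resample to the unified 64 Hz rate via linear interpolation (assumed method).
    n_out = int(len(signal) * fs_out / fs_in)
    t_in = np.linspace(0.0, 1.0, num=len(signal))
    t_out = np.linspace(0.0, 1.0, num=n_out)
    resampled = np.interp(t_out, t_in, signal)
    # Z-score normalization.
    normalized = (resampled - resampled.mean()) / (resampled.std() + 1e-8)
    # Non-overlapping 5 s windows.
    win = fs_out * window_s
    n_windows = len(normalized) // win
    return normalized[: n_windows * win].reshape(n_windows, win)
```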

4.2. Experimental Setup

4.2.1. Implementation Details

MMEA-Net was implemented using PyTorch 1.9.0. All experiments were conducted on a workstation equipped with an NVIDIA RTX 3090 GPU with 24 GB memory, an Intel Core i9-10900K CPU, and 64 GB RAM.
Model hyperparameters were selected through grid search on the validation set and configured as follows:
  • Number of transformer layers in Hybrid-M: 4.
  • Number of sparse attention layers in Cross-Domain Fusion: 3.
  • Number of attention heads: 8.
  • Hidden dimension: 256.
  • Dropout rate: 0.3.
  • Learning rate: 0.001 with cosine annealing schedule.
  • Batch size: 32.
  • Maximum epochs: 100 with early stopping (patience = 10).
  • Loss weights:  λ 1 = 1.0  for emotion classification,  λ 2 = 0.5  for immersion estimation, and  λ 3 = 0.0001  for regularization.
To improve robustness, data augmentation techniques were applied during training, including random temporal shifting within  ± 0.5  s, magnitude scaling within  ± 10 % , and additive Gaussian noise with standard deviation  σ = 0.01 .
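The following is a minimal sketch of the augmentation operations listed above; the tensor layout, function signature, and the circular handling of the temporal shift are illustrative assumptions rather than the exact training-time implementation.

```python
import torch

def augment(segment: torch.Tensor,
            max_shift_s: float = 0.5,
            fs: int = 64,
            scale_range: float = 0.10,
            noise_std: float = 0.01) -> torch.Tensor:
    """Random temporal shift, magnitude scaling, and additive Gaussian noise.

    segment: tensor of shape (time, channels), e.g. (320, C) for 5 s at 64 Hz.
    """
    # Random temporal shift within +/- 0.5 s (implemented here as a circular shift).
    max_shift = int(max_shift_s * fs)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    augmented = torch.roll(segment, shifts=shift, dims=0)

    # Random magnitude scaling within +/- 10 %.
    scale = 1.0 + (2 * torch.rand(1) - 1.0) * scale_range
    augmented = augmented * scale

    # Additive Gaussian noise with standard deviation sigma = 0.01.
    return augmented + noise_std * torch.randn_like(augmented)

# Example: augment one 5 s, 3-channel segment during training.
example = torch.randn(320, 3)
augmented = augment(example)
```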

4.2.2. Evaluation Metrics

For the emotion state classification task, the following evaluation metrics are employed. Precision and Recall are reported as supplementary metrics that complement the primary performance measures: they provide class-sensitive insight into the reliability of the predicted emotion labels and the coverage of the true emotion samples, respectively, which aggregated metrics such as Accuracy and F1-score may not fully reflect. Precision and Recall therefore serve as auxiliary indicators and apply only to the classification task considered in this work. A brief computation sketch covering both the classification and regression metrics is provided after the metric definitions below.
  • Accuracy. The proportion of correctly classified samples:
    $\mathrm{Accuracy} = \dfrac{\text{Number of correctly classified samples}}{\text{Total number of samples}}.$
  • F1-score. The harmonic mean of precision and recall:
    $\mathrm{F1\text{-}score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$
    where Precision = TP/(TP + FP) and Recall = TP/(TP + FN).
  • Cohen’s Kappa is a measure of agreement between predicted and true labels that accounts for chance agreement:
    $\kappa = \dfrac{p_o - p_e}{1 - p_e},$
    where  p o  denotes the observed agreement and  p e  represents the expected agreement by chance.
For the immersion level estimation task, the following regression metrics are used:
  • Root Mean Square Error (RMSE). The square root of the mean squared prediction error:
    $\mathrm{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}.$
  • Mean Absolute Error (MAE). The average absolute difference between predicted and ground-truth immersion levels:
    $\mathrm{MAE} = \dfrac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|.$
  • Coefficient of Determination ( R 2 ). The proportion of variance in the target variable explained by the model:
    $R^2 = 1 - \dfrac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2},$
    where  y ¯  denotes the mean of the ground-truth immersion scores.
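For concreteness, the sketch below computes all of the above metrics with standard scikit-learn and NumPy routines; the toy predictions and the macro averaging used for the F1-score are placeholder assumptions rather than values or settings from this study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, cohen_kappa_score,
                             mean_absolute_error, mean_squared_error, r2_score)

# Toy emotion predictions (4-class) and immersion predictions (continuous).
y_true_cls = np.array([0, 2, 1, 3, 2, 0])
y_pred_cls = np.array([0, 2, 1, 1, 2, 0])
y_true_reg = np.array([4.2, 3.1, 5.0, 2.8])
y_pred_reg = np.array([4.0, 3.5, 4.6, 3.0])

# Classification metrics for emotion recognition.
accuracy = accuracy_score(y_true_cls, y_pred_cls)
f1 = f1_score(y_true_cls, y_pred_cls, average="macro")  # averaging scheme assumed
kappa = cohen_kappa_score(y_true_cls, y_pred_cls)

# Regression metrics for immersion estimation.
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
mae = mean_absolute_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)
```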

4.2.3. Baseline Models

To evaluate the effectiveness of the proposed MMEA-Net architecture, several baseline models are constructed for comparison. These baselines are obtained by replacing the Hybrid-M modules with alternative sequence modeling approaches and substituting the Cross-Domain Fusion mechanism with direct feature concatenation. This design enables a controlled assessment of the contributions of each architectural component. The baseline models include:
  • LSTM-based model [62]: Uses two-layer LSTM networks with 128 hidden units for each modality, followed by direct feature concatenation.
  • BiLSTM-based model [63]: Employs bidirectional LSTMs to model forward and backward temporal dependencies, with simple concatenation for multimodal fusion.
  • GRU-based model [64]: Replaces Hybrid-M modules with GRU networks (two layers, 128 hidden units) and applies direct feature concatenation.
  • 1D-CNN-based model [65]: Uses one-dimensional convolutional networks with multiple kernel sizes (3, 5, and 7) for temporal feature extraction, followed by feature concatenation.
  • XGBoost-based model [66]: Extracts handcrafted features from each modality and performs prediction using XGBoost, with concatenated outputs.
  • CNN-LSTM-based model [67]: Combines convolutional layers for feature extraction with LSTM layers for temporal modeling, using direct concatenation for fusion.
  • Transformer-based model [68]: Applies Transformer encoder blocks with multi-head self-attention to each modality, followed by simple feature concatenation.
  • MLP-based model [69]: Processes modality-specific features using multilayer perceptrons and concatenates the resulting representations.
  • MMEA-Net (Simple Fusion): Retains the Hybrid-M modules while replacing the Cross-Domain Fusion mechanism with direct feature concatenation.
All baseline models follow the same overall processing pipeline, training strategy, and evaluation protocol as MMEA-Net. Each model operates on the same set of modalities, namely eye-tracking, ECG, and GSR signals, and performs joint emotion classification and immersion estimation. Identical preprocessing procedures, including temporal alignment, normalization, and segmentation into 5-second windows, are applied across all methods. Hyperparameters for each baseline are optimized via grid search on the validation set, focusing on learning rate, hidden dimensions, and dropout rates. The loss formulation and task weighting remain consistent for all models, ensuring a fair and balanced comparison. This experimental design allows for a systematic evaluation of both the Hybrid-M modules and the Cross-Domain Fusion mechanism under a unified dual-task setting.

4.3. Experimental Results and Analysis

In this section, we report the experimental results of the proposed MMEA-Net and compare its performance with various baseline models on the VREED dataset. All models are evaluated on both validation and test sets using the classification and regression metrics described earlier for emotion recognition and immersion estimation.

4.3.1. Comparison with Baseline Models

Table 1 summarizes the quantitative performance of MMEA-Net and baseline methods on the validation and test sets. The baselines cover a broad range of modeling paradigms, including recurrent architectures (LSTM [62], BiLSTM [63], and GRU [64]), convolutional approaches (1D-CNN [65] and CNN-LSTM [67]), tree-based models (XGBoost [66]), attention-based models such as the Transformer [68], and MLP-based models [69]. In addition to predictive performance, Table 1 reports the number of trainable parameters and FLOPs for each method, providing a complementary comparison from the perspective of computational complexity and efficiency. As shown in Table 1 and Figure 4, MMEA-Net achieves the best overall performance across both tasks on the validation and test sets. On the emotion classification task, the proposed model attains the highest test accuracy, F1-score, and Cohen's Kappa, indicating improved discriminative capability and classification stability. For immersion estimation, MMEA-Net yields the lowest prediction errors (RMSE and MAE) and the highest coefficient of determination, demonstrating superior regression performance. Notably, these gains are achieved with comparable or even lower parameter counts and FLOPs than Transformer-based models, indicating a favorable balance between prediction accuracy and computational cost.
Several observations can be drawn from the comparative results. First, the comparison between MMEA-Net and its simple fusion variant indicates that structured cross-domain fusion contributes substantially to the performance gains, particularly for immersion prediction. The simple fusion variant adopts a straightforward concatenation-based strategy in which modality-specific features are concatenated and forwarded directly to the task-specific prediction heads, without explicitly modeling structured cross-task or cross-modality interactions. Second, recurrent and convolutional baselines exhibit competitive but consistently lower performance, suggesting limitations in capturing complex temporal and cross-modal interactions. Third, attention-based models improve over traditional sequence models but still fall short of MMEA-Net, highlighting the importance of jointly modeling multi-scale temporal features and coordinated multimodal fusion. Finally, the relatively weaker performance of the XGBoost-based model underscores the difficulty of representing multimodal physiological dynamics with handcrafted features alone.
Furthermore, recent state-space sequence models such as Mamba [70] and its improved variant Mamba-3 [71], as well as function-approximation architectures like KAN [72], demonstrate competitive performance with reduced computational complexity. Despite their efficiency advantages, however, these models still perform slightly below the Transformer-based baseline in this multimodal VR setting, further emphasizing the necessity of structured cross-domain interaction modeling for joint emotion–immersion learning. Overall, these results demonstrate that the proposed framework effectively integrates multimodal signals and jointly optimizes the emotion and immersion objectives within a unified and balanced modeling strategy. The consistent improvements observed across both tasks further suggest that the superiority of MMEA-Net arises from its structured fusion design rather than from increased model complexity.
To provide a class-sensitive, multi-perspective view of emotion recognition performance, Precision and Recall are additionally reported as supplementary metrics. They are included to complement the primary evaluation results without affecting the main experimental conclusions, and the detailed results are summarized in Table 2.

4.3.2. Statistical Significance Analysis

To further assess the reliability of the observed performance gains, we conducted paired t-tests on the test set by comparing MMEA-Net with the strongest baseline, as well as representative ablation variants and modality-reduced configurations, under identical experimental settings. The statistical analysis shows that the improvements achieved by MMEA-Net in both emotion classification and immersion estimation are statistically significant ( p < 0.05 ). In addition, all reported results are averaged over 10 independent runs, and the performance variance across runs is small, with fluctuations not exceeding  ± 0.1  for all evaluation metrics. These findings indicate that the reported gains are consistent and reproducible, and are unlikely to be caused by random initialization or stochastic training effects, thereby providing strong evidence for the effectiveness and stability of the proposed framework.
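As an illustration of this procedure, a paired t-test over per-run scores can be computed as in the sketch below; the accuracy values are placeholders rather than results from this study, and per-fold scores could be substituted in the same way.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run test accuracies, paired by run (identical seeds and splits).
mmea_net_runs = np.array([75.4, 75.5, 75.3, 75.4, 75.5, 75.3, 75.4, 75.5, 75.4, 75.3])
baseline_runs = np.array([72.9, 72.8, 72.9, 72.8, 72.9, 72.8, 72.9, 72.9, 72.8, 72.8])

# Paired t-test on the per-run differences.
t_stat, p_value = stats.ttest_rel(mmea_net_runs, baseline_runs)
significant = p_value < 0.05
```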

4.4. Cross-Subject Generalization Evaluation

Although the standard experimental protocol employs stratified random splitting at the segment level, emotion recognition and immersion estimation in VR environments are inherently subject-dependent. To rigorously assess subject-independent generalization capability, we further conducted a cross-subject evaluation experiment.
  • Experimental Protocol.
A subject-level 5-fold cross-validation strategy was adopted. The 25 participants in the VREED dataset were randomly divided into five disjoint groups (approximately five participants per group). In each fold, data from one group were reserved exclusively for testing, while the remaining subjects were used for training and validation. All segments derived from the same participant were assigned entirely to a single fold to strictly prevent subject-level information leakage. All preprocessing steps, model configurations, hyperparameters, loss weight settings, and evaluation metrics remained identical to those used in the standard experimental protocol. Hyperparameters were selected solely based on the validation data within each training fold and were not tuned on the test subjects, ensuring a strictly subject-independent evaluation. Thus, performance variation arises solely from the more challenging subject-independent setting.
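A minimal sketch of this subject-level partitioning, assuming scikit-learn's GroupKFold with illustrative participant identifiers and segment counts, is shown below.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 25 participants with a few segments each; every segment
# carries its participant id so whole subjects stay within one fold.
n_participants, segments_per_participant = 25, 8
participant_ids = np.repeat(np.arange(n_participants), segments_per_participant)
features = np.random.randn(len(participant_ids), 64)  # placeholder features

cv = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(cv.split(features, groups=participant_ids)):
    # All segments of roughly five held-out participants form the test fold;
    # the remaining subjects are used for training and validation.
    held_out = np.unique(participant_ids[test_idx])
    print(f"fold {fold}: held-out participants {held_out}")
```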
  • Comparison Baseline.
Since the Transformer-based model achieved the strongest performance among all baseline methods under the standard evaluation setting, it was selected as the primary comparison model for cross-subject analysis. This choice focuses the evaluation on the most competitive baseline with comparable attention-based sequence modeling capacity and model complexity, thereby isolating the contribution of the proposed structured fusion and symmetry-aware multi-task design (Table 3).
Under the cross-subject setting, performance slightly decreases compared to the random segment-level split due to increased inter-subject physiological variability, which is expected in subject-independent evaluation scenarios. However, MMEA-Net consistently outperforms the Transformer baseline across all classification and regression metrics. The consistent improvements in Accuracy, F1-score, Precision, and Recall indicate enhanced class-wise discrimination capability, while reductions in RMSE and MAE further demonstrate improved robustness in immersion prediction under unseen-subject conditions. The relatively small standard deviations across folds further demonstrate stable and consistent generalization behavior. These findings suggest that the proposed modality-aware structured fusion and controlled symmetry-aware multi-task optimization effectively capture subject-invariant multimodal patterns rather than overfitting to participant-specific characteristics.

4.4.1. Model Component Analysis

To examine the contribution of individual components in MMEA-Net, we conducted ablation experiments by selectively removing or replacing key modules. The quantitative results for both emotion classification and immersion prediction are summarized in Table 4, while Figure 5 provides a visual comparison of performance changes across different evaluation metrics.
The ablation results indicate that each component contributes to the overall performance of MMEA-Net. Removing the Hybrid-M module leads to a noticeable degradation in both emotion classification accuracy and immersion prediction quality, reflecting its role in preserving modality-specific temporal characteristics. Excluding the Cross-Domain Fusion module also results in consistent performance drops, underscoring the importance of structured multimodal integration for jointly modeling affective and immersive states. In addition, the removal of the Multi-scale Feature Extraction (MFE) component negatively affects performance, confirming the necessity of capturing temporal patterns at multiple resolutions, particularly for physiological signals exhibiting diverse dynamics.
Statistical Significance Analysis
To further evaluate the reliability of the observed performance differences among model variants, we conducted paired t-tests on the test set by comparing the full MMEA-Net with each ablated variant under identical experimental settings. The results indicate that the performance degradation caused by removing individual components (e.g., Hybrid-M, Cross-Domain Fusion, and MFE) is statistically significant across both emotion classification and immersion prediction metrics ( p < 0.05 ). All results are averaged over 10 independent runs, and the performance variance across runs remains small, with fluctuations not exceeding  ± 0.12  for all reported metrics. These findings confirm that the observed performance drops are consistent and not attributable to random initialization or stochastic training effects, thereby providing statistical evidence for the effectiveness of each architectural component in MMEA-Net.

4.4.2. Multimodal Contribution Analysis

To further analyze the influence of different data modalities, we evaluated MMEA-Net under single-modality, dual-modality, and full-modality settings. Table 5 summarizes the quantitative results, while Figure 6 provides a visual comparison of classification and immersion prediction performance across different modality combinations.
The results demonstrate that models relying on a single modality exhibit substantially lower performance than those using multimodal inputs, highlighting the benefit of integrating heterogeneous physiological and behavioral signals. Among individual modalities, eye-tracking yields the strongest performance, likely due to its direct association with user attention and interaction patterns. Dual-modality configurations consistently outperform single-modality setups, with the combination of eye-tracking and ECG achieving the best overall balance between classification and regression performance. Incorporating all three modalities further improves results across both tasks, indicating that complementary information from visual, cardiac, and autonomic signals jointly contributes to a more comprehensive representation of emotional and immersive states.
Statistical Significance Analysis
To further assess the statistical reliability of the observed performance differences among different modality configurations, we conducted paired t-tests on the test set by comparing the full-modality setting with single-modality and dual-modality variants under identical experimental conditions. The analysis indicates that the performance improvements achieved by incorporating multiple modalities are statistically significant across both emotion classification and immersion prediction metrics ( p < 0.05 ). In particular, the gains obtained by combining eye-tracking with ECG, as well as by integrating all three modalities, are consistently significant when compared to single-modality baselines. All reported results are averaged over 10 independent runs, with limited performance variance across runs (within  ± 0.13  for all metrics), confirming that the observed improvements are robust and reproducible rather than arising from random initialization or stochastic training effects.

4.4.3. Single-Task vs. Multi-Task Learning

To further examine the impact of joint optimization, we compared single-task learning settings, where emotion classification or immersion prediction is performed independently, with the multi-task learning configuration used in MMEA-Net. The quantitative results are reported in Table 6, while Figure 7 visualizes the performance differences between single-task and multi-task learning.
The results show that the multi-task learning strategy yields consistent improvements over single-task learning for both objectives. Emotion classification accuracy increases by 1.61% compared to the single-task setting, while immersion prediction exhibits a reduction of 5.88% in RMSE. These gains suggest that jointly learning emotion and immersion enables the model to exploit shared and complementary information between the two tasks, leading to more informative representations. In addition, the multi-task formulation contributes to improved generalization and reduced overfitting, particularly under limited data conditions.
Statistical Significance Analysis
To further examine whether the observed performance improvements stem from the multi-task learning strategy rather than random variation, we conducted paired t-tests on the test set by comparing the multi-task configuration with the corresponding single-task baselines under identical experimental settings. The statistical results indicate that the gains achieved by multi-task learning in both emotion classification and immersion prediction are statistically significant ( p < 0.05 ). Specifically, the improvements in emotion classification metrics and the reduction in immersion prediction errors remain consistent across repeated runs. All results are averaged over 10 independent runs, and the performance variance across runs is limited, with fluctuations not exceeding  ± 0.11  for all evaluation metrics. These findings provide statistical evidence that jointly optimizing emotion recognition and immersion estimation leads to more robust and generalizable representations than training each task independently.

4.4.4. Summary

Through a comprehensive set of ablation studies, we evaluated the contribution of individual components, data modalities, and learning strategies in MMEA-Net. The experimental findings can be summarized as follows. First, the Hybrid-M and Cross-Domain Fusion modules play a central role in effectively encoding and integrating multimodal physiological and behavioral signals. Second, multi-scale feature extraction enhances the model’s ability to capture temporal dynamics at different resolutions, which is especially important for physiological data. Third, multimodal configurations consistently outperform single-modality settings, demonstrating the benefit of leveraging complementary information from heterogeneous sources. Finally, the multi-task learning framework achieves better overall performance than single-task learning, indicating that emotion recognition and immersion estimation can mutually reinforce each other when modeled jointly. These observations highlight the effectiveness of multimodal, multi-scale, and multi-task learning strategies for emotion and immersion assessment in virtual reality environments.

4.5. Loss Weight Analysis

To investigate the balance between emotion classification and immersion prediction, we conducted experiments with different combinations of loss weights. The regularization weight  λ 3  was fixed at 0.0001, while  λ 1  and  λ 2  were varied from 0 to 1 in increments of 0.2, subject to the constraint  λ 1 + λ 2 = 1  to maintain a consistent overall loss scale. The experimental results are summarized in Table 7, and the corresponding performance trends are visualized in Figure 8.
As shown in Table 7, the configuration with  λ 1 = 0.4  and  λ 2 = 0.6  yields the best overall performance, achieving a balanced improvement in both emotion classification accuracy and immersion prediction error. When the loss function places full emphasis on a single task, performance on the other task degrades substantially, indicating that independent optimization fails to capture shared information between emotion and immersion. A moderate emphasis on immersion prediction appears to benefit emotion classification performance, suggesting that immersion-related signals provide complementary contextual information that supports affective inference. This observation is consistent with findings in immersive experience research, where immersion often influences the intensity and expression of emotional responses. These results demonstrate the importance of balanced task weighting in multi-task learning and confirm that appropriate coordination between emotion and immersion objectives leads to more stable and effective optimization. The loss weights  λ 1 = 0.4  and  λ 2 = 0.6  are therefore adopted in all subsequent experiments.
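To make the adopted weighting concrete, the sketch below combines a cross-entropy term for emotion classification, a mean-squared-error term for immersion estimation, and an L2 regularization term with λ1 = 0.4, λ2 = 0.6, and λ3 = 0.0001. The module, head layout, number of emotion classes, and tensor names are illustrative assumptions rather than the exact MMEA-Net implementation.

```python
import torch
import torch.nn as nn

def multitask_loss(emotion_logits: torch.Tensor,
                   emotion_targets: torch.Tensor,
                   immersion_pred: torch.Tensor,
                   immersion_targets: torch.Tensor,
                   model: nn.Module,
                   lambda1: float = 0.4,
                   lambda2: float = 0.6,
                   lambda3: float = 1e-4) -> torch.Tensor:
    """Weighted multi-task objective: lambda1 * CE + lambda2 * MSE + lambda3 * L2."""
    emotion_loss = nn.functional.cross_entropy(emotion_logits, emotion_targets)
    immersion_loss = nn.functional.mse_loss(immersion_pred, immersion_targets)
    l2_reg = sum(p.pow(2).sum() for p in model.parameters())
    return lambda1 * emotion_loss + lambda2 * immersion_loss + lambda3 * l2_reg

# Toy usage with a placeholder two-head model (4 emotion classes assumed).
model = nn.Linear(16, 5)          # stand-in for the actual network
x = torch.randn(8, 16)
emotion_logits = model(x)[:, :4]  # classification head outputs
immersion_pred = model(x)[:, 4]   # scalar immersion output
loss = multitask_loss(emotion_logits, torch.randint(0, 4, (8,)),
                      immersion_pred, torch.rand(8), model)
loss.backward()
```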

Statistical Significance Analysis

To further verify that the observed performance differences under varying loss weight configurations are statistically reliable, we conducted paired t-tests on the test set by comparing the optimal setting ( λ 1 = 0.4 ,  λ 2 = 0.6 ) with alternative weight combinations under identical experimental conditions. The statistical analysis confirms that the balanced loss weighting yields statistically significant improvements in both emotion classification and immersion prediction metrics ( p < 0.05 ). All reported results are averaged over 10 independent runs, and the performance variance across runs remains small, with fluctuations not exceeding  ± 0.12  for all evaluation metrics. These results indicate that the observed performance gains are robust and not caused by random initialization or stochastic optimization effects, providing statistical evidence for the effectiveness of the chosen loss weight configuration.

5. Downstream Task Extensions Based on the MMEA-Net Model

Beyond quantitative evaluation, the proposed MMEA-Net framework can be naturally extended to a range of downstream tasks that rely on joint emotion recognition and immersion estimation. These extensions demonstrate how multimodal affective and experiential representations can support adaptive decision-making and interaction design in virtual reality environments.
One representative extension is personalized art content recommendation driven by emotional and immersion analysis. By continuously estimating users’ emotional states and immersion levels, the system can adapt recommended content to users’ momentary affective conditions and engagement patterns. For example, when low emotional valence or declining immersion is detected, the recommendation strategy may shift toward more stimulating or interactive content to restore engagement. This task illustrates how joint emotion–immersion modeling can enhance personalization by incorporating experiential context rather than relying solely on static user preferences.
Another extension involves emotion- and immersion-aware dynamic art presentation. Using real-time predictions from MMEA-Net, presentation parameters such as pacing, scene transitions, or interaction intensity can be adjusted to respond to changes in users’ emotional states or immersion levels. When decreasing engagement or affective responses are observed, the system can adapt the presentation flow to maintain continuity and user involvement. This extension highlights the applicability of the proposed framework to adaptive presentation systems that respond to experiential feedback.
MMEA-Net also supports real-time art content switching and optimization based on affective feedback. Continuous monitoring of emotional and immersion signals enables timely identification of disengagement or emotional shifts, triggering automatic content transitions or adjustments in visual and auditory elements. Such feedback-driven adaptation allows immersive systems to respond promptly to user state changes, providing a mechanism for maintaining sustained engagement during interactive experiences.
In addition, the multimodal representations learned by MMEA-Net can be leveraged to support art creation and curation processes. By analyzing aggregated emotional responses and immersion patterns across users, the system can provide feedback regarding how different artistic elements influence affective and experiential outcomes. This information can assist artists and curators in refining presentation strategies or content composition, translating multimodal physiological and behavioral data into actionable insights for creative optimization.
Finally, emotion and immersion analysis enables broader user experience enhancement strategies. By jointly modeling affective and experiential dimensions, interactive systems can dynamically adjust interface design, guidance mechanisms, or interaction modes to support emotional regulation and sustained engagement. This extension emphasizes the role of multimodal affective computing in experience-aware system design, where emotional and immersion feedback informs adaptive interaction strategies.
Collectively, these downstream task extensions demonstrate that MMEA-Net is not limited to isolated prediction tasks but can serve as a general affective–experiential perception module for adaptive virtual reality systems. They further illustrate how joint modeling of emotion and immersion can support personalized interaction, dynamic presentation, and experience-aware optimization in immersive environments.

6. Discussion

This section discusses the implications of the proposed MMEA-Net framework beyond quantitative performance, with a particular focus on generalization across virtual reality environments and user populations. While the experimental results demonstrate consistent improvements in both emotion recognition and immersion estimation, several aspects merit further discussion regarding applicability and potential limitations.

6.1. Generalization Across VR Environments

Emotion recognition and immersion perception in virtual reality are inherently influenced by environmental factors, including scene design, interaction modality, content semantics, and display fidelity. Previous studies have shown that variations in VR environments can affect both affective responses and the subjective sense of presence, thereby altering the statistical properties of physiological signals [1,41]. As a result, models trained on data collected from a specific set of VR scenarios may encounter distribution shifts when deployed in environments with substantially different visual styles, narrative structures, or interaction mechanisms. In this work, MMEA-Net is evaluated on the VREED dataset, which contains multiple VR stimuli designed to elicit diverse emotional states. Although the experimental results suggest that the proposed multi-task and multimodal design captures stable emotion–immersion relationships within this setting, the framework does not explicitly model environment-specific factors. Therefore, its generalization to unseen VR scenarios with fundamentally different content or interaction paradigms remains an open question. Future studies involving cross-environment evaluation or domain adaptation strategies could provide deeper insights into this issue.

6.2. User Variability and Cross-Cultural Considerations

Emotion recognition research is widely acknowledged to be affected by individual differences across users, as well as by potential variations in emotional expression and perception across demographic and cultural contexts. Although physiological signals are often regarded as more stable and less consciously controllable than overt behavioral cues, existing reviews indicate that their association with emotional states may still vary due to factors such as age, gender, cultural background, and personal experience [21,73]. In immersive VR settings, these differences may be further modulated by users’ familiarity with virtual environments, subjective immersion sensitivity, and interaction habits [1]. Consequently, models trained on data from a relatively homogeneous participant group may exhibit reduced performance when applied to users from different cultural or demographic backgrounds. However, systematic investigations of cross-cultural generalization remain relatively scarce in current multimodal affective computing literature, particularly under immersive VR conditions [74]. From a methodological perspective, MMEA-Net emphasizes multimodal physiological consistency and joint modeling of emotion and immersion, which may help mitigate reliance on culture-specific expressive patterns. Nevertheless, the present study does not incorporate explicit user-adaptive, demographic-aware, or culture-aware modeling mechanisms. Accordingly, no strong claim is made regarding cross-cultural robustness. A more comprehensive evaluation involving diverse user populations and cross-cultural VR datasets would be necessary to draw definitive conclusions.

6.3. Implications for Multi-Task Affective Modeling

The discussion above highlights that generalization challenges in affective computing are closely related to both data diversity and modeling assumptions. By jointly modeling emotion recognition and immersion estimation, MMEA-Net introduces an additional structural constraint that encourages shared representation learning across related experiential dimensions. This design aligns with prior findings suggesting that multi-task learning can improve robustness by leveraging task-level complementarities [19]. However, the effectiveness of such task coupling under substantial domain or population shifts remains an open research question. Future extensions may consider incorporating adaptive weighting strategies, user-specific normalization, or continual learning mechanisms to further enhance robustness across environments and populations. Overall, the proposed framework provides a structured foundation for exploring these directions, while the present discussion clarifies the scope and boundaries of the current study.

7. Conclusions, Limitations, and Future Work

This paper presented a multimodal learning framework for joint emotion recognition and immersion estimation in virtual reality environments. By integrating eye-tracking, electrocardiogram, and galvanic skin response signals within a unified multi-task architecture, the proposed approach enables simultaneous modeling of users’ affective states and immersive experiences.
On the test set, the proposed model achieves an emotion recognition accuracy of 75.42%, an F1-score of 74.19%, and a Cohen’s Kappa of 0.66, while attaining an RMSE of 0.96, an MAE of 0.74, and an  R 2  value of 0.63 for immersion prediction. The experimental results demonstrate that jointly learning these two related tasks improves performance compared to single-task baselines, highlighting the benefit of shared representations for multimodal affective analysis. The study further illustrated how joint emotion–immersion modeling can support a range of downstream tasks, including adaptive content recommendation, dynamic presentation adjustment, real-time content switching, creative feedback analysis, and user experience optimization in immersive environments. These extensions indicate that the learned multimodal representations can serve as a general perception component for experience-aware virtual reality systems. Despite the encouraging results, this study has several inherent limitations. The proposed framework is evaluated on a controlled VR dataset with a limited participant pool, and its generalization to more diverse VR environments, user populations, and cultural contexts has not yet been fully explored.
Future work will focus on extending the framework to larger and more diverse multimodal datasets, as well as improving robustness and generalization across users, scenarios, and application domains. Incorporating long-term temporal modeling, contextual memory mechanisms, and adaptive personalization strategies may further enhance the modeling of sustained emotional and immersive experiences.

Author Contributions

Conceptualization, H.W. and M.W.; methodology, H.W.; software, H.W.; validation, H.W. and M.W.; formal analysis, H.W.; investigation, H.W.; resources, M.W.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, M.W.; visualization, H.W.; supervision, M.W.; project administration, M.W.; funding acquisition, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request. The extended VREED dataset with immersion labels used in this study is not publicly available due to participant privacy protection agreements.

Acknowledgments

The authors would like to thank Shandong University of Arts for providing institutional support and a supportive academic environment for this research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Lønne, T.; Karlsen, H.; Langvik, E.; Saksvik-Lehouillier, I. The effect of immersion on sense of presence and affect when experiencing an educational scenario in virtual reality: A randomized controlled study. Heliyon 2023, 9, e17196. [Google Scholar] [CrossRef]
  2. Yang, X.; Cheng, P.; Liu, X.; Shih, S. The impact of immersive virtual reality on art education: A study of flow state, cognitive load, brain state, and motivation. Educ. Inf. Technol. 2024, 29, 6087–6106. [Google Scholar] [CrossRef]
  3. Marín-Morales, J.; Llinares, C.; Guixeres, J.; Alcañiz, M. Emotion Recognition in Immersive Virtual Reality: From Statistics to Affective Computing. Sensors 2020, 20, 5163. [Google Scholar] [CrossRef] [PubMed]
  4. Chen, Z.; Han, Z.; Wu, L.; Huang, J. Multisensory Imagery Enhances the Aesthetic Evaluation of Paintings: A Virtual Reality Study. Empir. Stud. Arts 2026. [Google Scholar] [CrossRef]
  5. Cai, Y.; Li, X.; Li, J. Emotion recognition using different sensors, emotion models, methods and datasets: A comprehensive review. Sensors 2023, 23, 2455. [Google Scholar] [CrossRef]
  6. Moin, A.; Aadil, F.; Ali, Z.; Kang, D. Emotion recognition framework using multiple modalities for an effective human-computer interaction. J. Supercomput. 2023, 79, 9320–9349. [Google Scholar] [CrossRef]
  7. Fu, Z.; Zhang, B.; He, X.; Li, Y.; Wang, H.; Huang, J. Emotion recognition based on multi-modal physiological signals and transfer learning. Front. Neurosci. 2022, 16, 1000716. [Google Scholar] [CrossRef]
  8. Lee, Y.; Pae, D.; Hong, D.; Lim, M.; Kang, T. Emotion recognition with short-period physiological signals using bimodal sparse autoencoders. Intell. Autom. Soft Comput. 2022, 32, 657–673. [Google Scholar] [CrossRef]
  9. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.; Yazdani, A.; Ebrahimi, T.; Patras, I. DEAP: A database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 2011, 3, 18–31. [Google Scholar] [CrossRef]
  10. Tabbaa, L.; Searle, R.; Bafti, S.; Hossain, M.; Intarasisrisawat, J.; Glancy, M.; Ang, C. VREED: Virtual reality emotion recognition dataset using eye tracking and physiological measures. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2021, 5, 1–20. [Google Scholar] [CrossRef]
  11. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Advances in EEG-based emotion recognition: Challenges, methodologies, and future directions. Appl. Soft Comput. 2025, 180, 113478. [Google Scholar] [CrossRef]
  12. Yahyaeian, A.A.; Sabet, M.; Zhang, J.; Jones, A. Enhancing Immersive Learning: An Exploratory Pilot Study on Large Language Model-Powered Guidance in Virtual Reality Labs. Comput. Appl. Eng. Educ. 2026, 34, e70127. [Google Scholar] [CrossRef]
  13. De Giglio, V.; Evangelista, A.; Giannakakis, G.; Konstantaras, A.; Kamarianakis, Z.; Uva, A.E.; Manghisi, V.M. Assessing the Impact of Cinematic Virtual Reality Simulations on Young Drivers: Behavior and Physiological Responses. Virtual Real. 2026, 30, 11. [Google Scholar] [CrossRef]
  14. García-Batista, Z.E.; Guerra-Peña, K.; Jurnet, I.A.; Cano-Vindel, A.; Álvarez-Hernández, A.; Herrera-Martinez, S.; Medrano, L.A. Design and Preliminary Evaluation of AYRE: A Virtual Reality-Based Intervention for the Treatment of Emotional Disorders. J. Behav. Cogn. Ther. 2026, 36, 100560. [Google Scholar] [CrossRef]
  15. Wei, L.; Liu, L.; Faridniya, H. Promoting Mental Health and Preventing Emotional Disorders in Vulnerable Adolescent Girls through VR-Based Extreme Sports. Acta Psychol. 2026, 262, 106088. [Google Scholar] [CrossRef]
  16. Baker, N.A.; Polhemus, A.H.; Baird, J.M.; Kenney, M. Embodied Fully Immersive Virtual Reality as a Therapeutic Modality to Treat Chronic Pain: A Scoping Review. Virtual Worlds 2026, 5, 3. [Google Scholar] [CrossRef]
  17. Hernandez-Melgarejo, G.; Luviano-Juarez, A.; Fuentes-Aguilar, R. A framework to model and control the state of presence in virtual reality systems. IEEE Trans. Affect. Comput. 2022, 13, 1854–1867. [Google Scholar] [CrossRef]
  18. Ochs, C.; Sonderegger, A. The interplay between presence and learning. Front. Virtual Real. 2022, 3, 742509. [Google Scholar] [CrossRef]
  19. Liu, B.; Liu, X.; Jin, X.; Stone, P.; Liu, Q. Conflict-averse gradient descent for multi-task learning. Proc. Adv. Neural Inf. Process. Syst. 2021, 34, 18878–18890. [Google Scholar]
  20. Liu, D.; Yu, Y. MT2ST: Adaptive Multi-Task to Single-Task Learning. arXiv 2024, arXiv:2406.18038. [Google Scholar] [CrossRef]
  21. Lin, W.; Li, C. Review of studies on emotion recognition and judgment based on physiological signals. Appl. Sci. 2023, 13, 2573. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Cheng, C.; Zhang, Y. Multimodal emotion recognition based on manifold learning and convolution neural network. Multimed. Tools Appl. 2022, 81, 33253–33268. [Google Scholar] [CrossRef]
  23. Katada, S.; Okada, S. Biosignal-based user-independent recognition of emotion and personality with importance weighting. Multimed. Tools Appl. 2022, 81, 30219–30241. [Google Scholar] [CrossRef]
  24. Dissanayake, V.; Seneviratne, S.; Rana, R.; Wen, E.; Kaluarachchi, T.; Nanayakkara, S. SigRep: Toward robust wearable emotion recognition with contrastive representation learning. IEEE Access 2022, 10, 18105–18120. [Google Scholar] [CrossRef]
  25. Yan, J.; Zheng, W.; Xu, Q.; Lu, G.; Li, H.; Wang, B. Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech. IEEE Trans. Multimed. 2016, 18, 1319–1329. [Google Scholar] [CrossRef]
  26. Wang, M.; Yang, W.; Wang, S. Conditional matching preclusion number for the Cayley graph on the symmetric group. Acta Math. Appl. Sin. (Chin. Ser.) 2013, 36, 813–820. [Google Scholar]
  27. Wang, M.; Yang, W.; Guo, Y.; Wang, S. Conditional fault tolerance in a class of Cayley graphs. Int. J. Comput. Math. 2016, 93, 67–82. [Google Scholar] [CrossRef]
  28. Wang, S.; Wang, Y.; Wang, M. Connectivity and matching preclusion for leaf-sort graphs. J. Interconnect. Netw. 2019, 19, 1940007. [Google Scholar] [CrossRef]
  29. Wang, S.; Wang, M. The strong connectivity of bubble-sort star graphs. Comput. J. 2019, 62, 715–729. [Google Scholar] [CrossRef]
  30. Wang, S.; Wang, M. A Note on the Connectivity of m-Ary n-Dimensional Hypercubes. Parallel Process. Lett. 2019, 29, 1950017. [Google Scholar] [CrossRef]
  31. Jiang, J.; Wu, L.; Yu, J.; Wang, M.; Kong, H.; Zhang, Z.; Wang, J. Robustness of bilayer railway-aviation transportation network considering discrete cross-layer traffic flow assignment. Transp. Res. Part D Transp. Environ. 2024, 127, 104071. [Google Scholar] [CrossRef]
  32. Hu, Z.; Chen, L.; Luo, Y.; Zhou, J. EEG-based emotion recognition using convolutional recurrent neural network with multi-head self-attention. Appl. Sci. 2022, 12, 11255. [Google Scholar] [CrossRef]
  33. Xiao, G.; Shi, M.; Ye, M.; Xu, B.; Chen, Z.; Ren, Q. 4D attention-based neural network for EEG emotion recognition. Cogn. Neurodynamics 2022, 16, 805–818. [Google Scholar] [CrossRef] [PubMed]
  34. Chen, J.; Fan, F.; Wei, C.; Polat, K.; Alenezi, F. Decoding driving states based on normalized mutual information features and hyperparameter self-optimized Gaussian kernel-based radial basis function extreme learning machine. Chaos Solitons Fractals 2025, 199, 116751. [Google Scholar] [CrossRef]
  35. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Driver fatigue detection using EEG-based graph attention convolutional neural networks: An end-to-end learning approach with mutual information-driven connectivity. Appl. Soft Comput. 2026, 186, 114097. [Google Scholar] [CrossRef]
  36. Wei, C.; Alenezi, F.; Chen, J.; Wang, H.; Polat, K. Nonlinear Feature Decomposition and Deep Temporal–Spatial Learning for Single-Channel sEMG-Based Lower Limb Motion Recognition. IEEE Sens. J. 2026, 26, 4120–4126. [Google Scholar] [CrossRef]
  37. Wang, T.; Li, J.; Wu, H.N.; Li, C.; Snoussi, H.; Wu, Y. ResLNet: Deep residual LSTM network with longer input for action recognition. Front. Comput. Sci. 2022, 16, 166334. [Google Scholar] [CrossRef]
  38. Li, Z.; Cai, J.; Chen, Q.; Chen, L.; Qing, M.; Yang, S.X. An LSTM Network with Neural Plasticity for Driver Fatigue Recognition on Real Roads. IEEE Trans. Ind. Electron. 2025, 72, 14668–14676. [Google Scholar] [CrossRef]
  39. Liu, X.Y.; Li, G.; Zhou, X.H.; Liang, X.; Hou, Z.G. A Weight-Aware-Based Multisource Unsupervised Domain Adaptation Method for Human Motion Intention Recognition. IEEE Trans. Cybern. 2025, 55, 3131–3143. [Google Scholar] [CrossRef]
  40. Alharbi, H. Explainable feature selection and deep learning based emotion recognition in virtual reality using eye tracker and physiological data. Front. Med. 2024, 11, 1438720. [Google Scholar] [CrossRef]
  41. Souza, V.; Maciel, A.; Nedel, L.; Kopper, R. Measuring presence in virtual environments: A survey. ACM Comput. Surv. 2021, 54, 1–37. [Google Scholar] [CrossRef]
  42. Liu, X.; Zhou, H.; Liu, J. Deep learning-based analysis of the influence of illustration design on emotions in immersive art. Mob. Inf. Syst. 2022, 2022, 3120955. [Google Scholar] [CrossRef]
  43. Song, W.; Wang, X.; Jiang, Y.; Li, S.; Hao, A.; Hou, X.; Qin, H. Expressive 3D Facial Animation Generation Based on Local-to-Global Latent Diffusion. IEEE Trans. Vis. Comput. Graph. 2024, 30, 3321–3336. [Google Scholar] [CrossRef] [PubMed]
  44. Song, W.; Wang, X.; Zheng, S.; Li, S.; Hao, A.; Hou, X. TalkingStyle: Personalized Speech-Driven 3D Facial Animation with Style Preservation. IEEE Trans. Vis. Comput. Graph. 2025, 31, 4682–4694. [Google Scholar] [CrossRef]
  45. Hu, J.; Jiang, H.; Xiao, Z.; Chen, S.; Dustdar, S.; Liu, J. HeadTrack: Real-Time Human–Computer Interaction via Wireless Earphones. IEEE J. Sel. Areas Commun. 2024, 42, 990–1002. [Google Scholar] [CrossRef]
  46. Meng, T.; Shou, Y.; Ai, W.; Du, J.; Liu, H.; Li, K. A Multi-Message Passing Framework Based on Heterogeneous Graphs in Conversational Emotion Recognition. Neurocomputing 2024, 569, 127109. [Google Scholar] [CrossRef]
  47. Liu, Y.; Feng, S.; Liu, S.; Zhan, Y.; Tao, D.; Chen, Z.; Chen, Z. Sample-cohesive pose-aware contrastive facial representation learning. Int. J. Comput. Vis. 2025, 133, 3727–3745. [Google Scholar] [CrossRef]
  48. Lin, Z.; Wang, Y.; Zhou, Y.; Du, F.; Yang, Y. MLM-EOE: Automatic Depression Detection via Sentimental Annotation and Multi-Expert Ensemble. IEEE Trans. Affect. Comput. 2025, 16, 2842–2858. [Google Scholar] [CrossRef]
  49. Hu, J.; Jiang, H.; Liu, D.; Xiao, Z.; Zhang, Q.; Liu, J.; Dustdar, S. Combining IMU with acoustics for head motion tracking leveraging wireless earphone. IEEE Trans. Mob. Comput. 2023, 23, 6835–6847. [Google Scholar] [CrossRef]
  50. Hu, J.; Jiang, H.; Liu, D.; Xiao, Z.; Zhang, Q.; Min, G.; Liu, J. Real-time contactless eye blink detection using UWB radar. IEEE Trans. Mob. Comput. 2023, 23, 6606–6619. [Google Scholar] [CrossRef]
  51. Bukht, T.F.N.; Alazeb, A.; Mudawi, N.A.; Alabdullah, B.; Alnowaiser, K.; Jalal, A.; Liu, H. Robust Human Interaction Recognition Using Extended Kalman Filter. Comput. Mater. Contin. 2024, 81, 2987–3002. [Google Scholar] [CrossRef]
  52. Linton, K.F.; Abbasi, B.; Jimenez, M.G.; Aragon, J.; Gendron, A.; Gomez, R.; Hampton, S.; Michael, B.; Monson, S.; Paulus, N.; et al. Immersive virtual reality for prospective memory and eye fixation recovery following traumatic brain injury: A pilot study. Brain Heart 2024, 2, 2685. [Google Scholar] [CrossRef]
  53. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Polosukhin, I. Attention is all you need. Proc. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  54. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Liu, P. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  55. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  56. Almujally, N.A.; Rafique, A.A.; Al Mudawi, N.; Alazeb, A.; Alonazi, M.; Algarni, A.; Jalal, A.; Liu, H. Multi-modal remote perception learning for object sensory data. Front. Neurorobot. 2024, 18, 1427786. [Google Scholar] [CrossRef]
  57. Shen, X.; Li, L.; Ma, Y.; Xu, S.; Liu, J.; Yang, Z.; Shi, Y. VLCIM: A vision-language cyclic interaction model for industrial defect detection. IEEE Trans. Instrum. Meas. 2025, 74, 2538713. [Google Scholar] [CrossRef]
  58. Lv, S.; Lu, S.; Wang, R.; Yin, L.; Yin, Z.; A. AlQahtani, S.; Tian, J.; Zheng, W. Enhancing chinese dialogue generation with word–phrase fusion embedding and sparse softmax optimization. Systems 2024, 12, 516. [Google Scholar] [CrossRef]
  59. Li, G.; Bai, L.; Zhang, H.; Xu, Q.; Zhou, Y.; Gao, Y.; Wang, M.; Li, Z. Velocity Anomalies around the Mantle Transition Zone beneath the Qiangtang Terrane, Central Tibetan Plateau from Triplicated P Waveforms. Earth Space Sci. 2022, 9, e2021EA002060. [Google Scholar] [CrossRef]
  60. Lv, S.; Yang, B.; Wang, R.; Lu, S.; Tian, J.; Zheng, W.; Chen, X.; Yin, L. Dynamic multi-granularity translation system: Dag-structured multi-granularity representation and self-attention. Systems 2024, 12, 420. [Google Scholar] [CrossRef]
  61. Wang, S.; Zhang, K.; Liu, A. Flat-Lattice-CNN: A model for Chinese medical-named-entity recognition. PLoS ONE 2025, 20, e0331464. [Google Scholar] [CrossRef]
  62. Gers, F.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 2000, 12, 2451–2471. [Google Scholar] [CrossRef]
  63. Siami-Namini, S.; Tavakoli, N.; Namin, A. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; IEEE: New York, NY, USA, 2019; pp. 3285–3292. [Google Scholar] [CrossRef]
  64. Rana, R. Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech. arXiv 2016, arXiv:1612.07778. [Google Scholar] [CrossRef]
  65. Azizjon, M.; Jumabek, A.; Kim, W. 1D CNN Based Network Intrusion Detection with Normalization on Imbalanced Data. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020; IEEE: New York, NY, USA, 2020; pp. 218–224. [Google Scholar] [CrossRef]
  66. Nielsen, D. Tree Boosting with XGBoost—Why Does XGBoost Win “Every” Machine Learning Competition? Master’s Thesis, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, 2016. [Google Scholar]
  67. Lu, W.; Li, J.; Li, Y.; Sun, A.; Wang, J. A CNN-LSTM-Based Model to Forecast Stock Prices. Complexity 2020, 2020, 6622927. [Google Scholar] [CrossRef]
  68. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. Proc. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  69. Valanarasu, J.; Patel, V. UNeXt: MLP-Based Rapid Medical Image Segmentation Network. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2022, Singapore, 18–22 September 2022; Springer: Cham, Switzerland, 2022; pp. 23–33. [Google Scholar]
  70. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7 October 2024. [Google Scholar]
  71. Lahoti, A.; Li, K.; Chen, B.; Wang, C.; Bick, A.; Kolter, J.Z.; Dao, T.; Gu, A. Mamba-3: Improved Sequence Modeling using State Space Principles. In Proceedings of the Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, 23–27 April 2026. [Google Scholar]
  72. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljacic, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov–Arnold Networks. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  73. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Zhang, W. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83, 19–52. [Google Scholar] [CrossRef]
  74. Ezzameli, K.; Mahersia, H. Emotion recognition from unimodal to multimodal analysis: A review. Inf. Fusion 2023, 99, 101847. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed MMEA-Net. The framework consists of three main components: (1) modality-specific feature extraction modules for eye-tracking, ECG, and GSR signals; (2) a cross-domain fusion module for structured multimodal integration; and (3) dual prediction heads for emotion classification and immersion estimation.
Figure 2. Detailed architecture of the Hybrid-M module (right) and Cross-Domain Fusion module (left). The Hybrid-M module transforms raw multimodal inputs into structured tensor representations through vectorization, multi-scale feature extraction, normalization, transformer-based temporal modeling, and attention mechanisms.
Figure 3. Architecture of the Multi-scale Feature Extraction (MFE) module. Input features are first transformed through an MLP, followed by parallel convolutional branches with different kernel sizes to capture patterns at multiple temporal scales. The resulting features are concatenated and further processed to produce a unified multi-scale representation.
Figure 4. Visualization of test-set performance comparison among MMEA-Net and baseline models. (a) Accuracy and F1-score; (b) Cohen’s Kappa and RMSE; (c) MAE and  R 2 .
Figure 5. Ablation study results of MMEA-Net. (Left): classification performance in terms of Accuracy and F1-score. (Right): immersion regression performance evaluated by Kappa, RMSE, MAE, and  R 2 .
Figure 6. Multimodal contribution analysis of MMEA-Net. (Left): emotion classification performance (Accuracy and F1-score) under different modality combinations. (Right): immersion prediction performance evaluated by Kappa, RMSE, MAE, and  R 2  across multimodal settings.
Figure 7. Comparison between single-task and multi-task learning in MMEA-Net. (Left): emotion classification performance (Accuracy and F1-score) under single-task and multi-task settings. (Right): immersion prediction performance evaluated by Kappa, RMSE, MAE, and  R 2 .
Figure 8. Loss weight analysis of MMEA-Net. (Left): emotion classification performance (Accuracy and F1-score) under different loss weight combinations. (Right): immersion prediction performance evaluated by Kappa, RMSE, MAE, and  R 2  with varying  λ 1  and  λ 2 .
Table 1. Performance and computational complexity comparison of MMEA-Net and baseline models on the validation and test sets of the VREED dataset. MMEA-Net (Simple Fusion) adopts a straightforward concatenation-based fusion strategy without explicit structured cross-domain interaction modeling. Bold values indicate the best performance.
| Model | Params (M) | FLOPs (G) | Val Acc (%) | Val F1 (%) | Val Kappa | Val RMSE | Val MAE | Val R² | Test Acc (%) | Test F1 (%) | Test Kappa | Test RMSE | Test MAE | Test R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM-based [62] | 1.5 | 0.32 | 67.43 | 65.87 | 0.53 | 1.24 | 0.97 | 0.46 | 65.91 | 64.22 | 0.51 | 1.31 | 1.02 | 0.43 |
| BiLSTM-based [63] | 2.1 | 0.46 | 69.21 | 67.95 | 0.56 | 1.18 | 0.92 | 0.49 | 68.07 | 66.54 | 0.55 | 1.23 | 0.95 | 0.47 |
| GRU-based [64] | 1.3 | 0.28 | 67.85 | 66.32 | 0.54 | 1.21 | 0.94 | 0.47 | 66.23 | 64.89 | 0.52 | 1.28 | 0.99 | 0.44 |
| 1D-CNN-based [65] | 1.8 | 0.41 | 70.14 | 68.76 | 0.58 | 1.14 | 0.88 | 0.52 | 68.92 | 67.41 | 0.56 | 1.19 | 0.93 | 0.49 |
| XGBoost-based [66] | 0.3 | – | 65.37 | 63.54 | 0.51 | 1.29 | 1.03 | 0.41 | 64.21 | 62.38 | 0.49 | 1.35 | 1.08 | 0.38 |
| CNN-LSTM-based [67] | 2.6 | 0.55 | 71.53 | 70.24 | 0.60 | 1.08 | 0.84 | 0.55 | 70.19 | 68.75 | 0.58 | 1.15 | 0.89 | 0.52 |
| Attention-based | 3.1 | 0.68 | 72.68 | 71.42 | 0.62 | 1.06 | 0.82 | 0.57 | 71.34 | 70.05 | 0.60 | 1.12 | 0.87 | 0.54 |
| Transformer-based [68] | 5.4 | 0.96 | 74.12 | 72.98 | 0.64 | 0.99 | 0.76 | 0.61 | 72.87 | 71.65 | 0.62 | 1.05 | 0.81 | 0.58 |
| MLP-based [69] | 0.9 | 0.17 | 63.79 | 62.14 | 0.48 | 1.33 | 1.07 | 0.38 | 62.45 | 60.89 | 0.46 | 1.39 | 1.13 | 0.35 |
| Mamba-based [70] | 4.2 | 0.78 | 73.54 | 72.18 | 0.63 | 1.01 | 0.78 | 0.59 | 72.31 | 71.02 | 0.61 | 1.07 | 0.83 | 0.56 |
| Mamba-3-based [71] | 4.6 | 0.84 | 73.89 | 72.63 | 0.64 | 0.98 | 0.75 | 0.60 | 72.65 | 71.41 | 0.62 | 1.03 | 0.80 | 0.57 |
| KAN-based [72] | 3.5 | 0.65 | 73.02 | 71.86 | 0.62 | 1.03 | 0.80 | 0.58 | 71.92 | 70.74 | 0.60 | 1.09 | 0.85 | 0.55 |
| MMEA-Net (Simple Fusion) | 2.8 | 0.63 | 73.65 | 72.31 | 0.63 | 1.02 | 0.79 | 0.59 | 72.14 | 70.88 | 0.61 | 1.08 | 0.84 | 0.56 |
| MMEA-Net (Ours) | 3.2 | 0.71 | **76.93** | **75.47** | **0.68** | **0.91** | **0.70** | **0.65** | **75.42** | **74.19** | **0.66** | **0.96** | **0.74** | **0.63** |
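For reference, the classification metrics reported in Tables 1 and 2 (Accuracy, F1, Precision, Recall, Cohen’s Kappa) and the regression metrics used for immersion (RMSE, MAE, R²) can be computed with scikit-learn as sketched below. The arrays are placeholder predictions rather than actual model outputs, and the macro averaging is an assumption rather than the paper’s confirmed setting.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, cohen_kappa_score,
                             mean_squared_error, mean_absolute_error, r2_score)

# Placeholder outputs standing in for model predictions on a held-out split.
y_true_cls = np.array([0, 2, 1, 1, 3, 0, 2, 3])     # emotion labels
y_pred_cls = np.array([0, 2, 1, 0, 3, 0, 2, 1])
y_true_imm = np.array([4.2, 3.1, 5.0, 2.8, 4.6, 3.9])  # immersion scores
y_pred_imm = np.array([4.0, 3.4, 4.7, 3.0, 4.9, 3.5])

print("Acc  :", accuracy_score(y_true_cls, y_pred_cls))
print("F1   :", f1_score(y_true_cls, y_pred_cls, average="macro"))
print("Prec :", precision_score(y_true_cls, y_pred_cls, average="macro", zero_division=0))
print("Rec  :", recall_score(y_true_cls, y_pred_cls, average="macro", zero_division=0))
print("Kappa:", cohen_kappa_score(y_true_cls, y_pred_cls))
print("RMSE :", mean_squared_error(y_true_imm, y_pred_imm) ** 0.5)
print("MAE  :", mean_absolute_error(y_true_imm, y_pred_imm))
print("R2   :", r2_score(y_true_imm, y_pred_imm))
```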
Table 2. Supplementary comparison of Precision and Recall (%) for emotion classification on the validation and test sets of the VREED dataset. Bold values indicate the best performance.
| Model | Validation Precision (%) | Validation Recall (%) | Test Precision (%) | Test Recall (%) |
|---|---|---|---|---|
| LSTM-based [62] | 66.12 | 65.64 | 64.78 | 63.89 |
| BiLSTM-based [63] | 68.01 | 67.89 | 66.92 | 66.17 |
| GRU-based [64] | 66.74 | 66.01 | 65.32 | 64.56 |
| 1D-CNN-based [65] | 69.08 | 68.43 | 67.95 | 66.88 |
| XGBoost-based [66] | 64.11 | 62.98 | 63.02 | 61.85 |
| CNN-LSTM-based [67] | 70.31 | 70.17 | 69.12 | 68.44 |
| Attention-based | 71.56 | 71.28 | 70.24 | 69.86 |
| Transformer-based [68] | 73.21 | 72.84 | 72.03 | 71.18 |
| MLP-based [69] | 62.85 | 61.73 | 61.64 | 60.29 |
| Mamba-based [70] | 72.84 | 72.36 | 71.58 | 70.74 |
| Mamba-3-based [71] | 73.12 | 72.69 | 71.94 | 71.05 |
| KAN-based [72] | 72.41 | 71.93 | 71.16 | 70.28 |
| MMEA-Net (Simple Fusion) | 72.48 | 72.15 | 71.32 | 70.46 |
| MMEA-Net (Ours) | **76.12** | **75.06** | **74.88** | **73.54** |
Table 3. Cross-subject (5-fold) performance comparison (Mean ± Std). Bold values indicate the best performance.
| Model | Acc (%) | F1 (%) | Precision (%) | Recall (%) | Kappa | RMSE | MAE | R² |
|---|---|---|---|---|---|---|---|---|
| Transformer-based | 70.84 ± 1.92 | 69.37 ± 1.85 | 70.11 ± 1.78 | 68.92 ± 1.94 | 0.58 ± 0.03 | 1.08 ± 0.07 | 0.84 ± 0.06 | 0.55 ± 0.04 |
| MMEA-Net (Ours) | **73.26 ± 1.74** | **71.98 ± 1.69** | **72.84 ± 1.63** | **71.32 ± 1.71** | **0.61 ± 0.02** | **1.01 ± 0.06** | **0.79 ± 0.05** | **0.59 ± 0.03** |
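The cross-subject protocol behind Table 3 requires that all windows from a given participant fall into the same fold, so no subject is shared between training and test data. One standard way to realize such a subject-independent 5-fold split is scikit-learn’s GroupKFold, sketched below with hypothetical subject IDs and random features purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: one feature row per windowed sample, grouped by subject ID.
X = np.random.randn(200, 32)              # 200 windows, 32 fused features
y = np.random.randint(0, 4, size=200)     # 4 emotion classes
subjects = np.repeat(np.arange(20), 10)   # 20 subjects x 10 windows each

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=subjects)):
    # No subject appears in both splits, keeping evaluation subject-independent.
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test windows")
```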
Table 4. Ablation study on model components. Each row corresponds to a variant of MMEA-Net with a specific component removed or replaced.
| Model Variant | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R² |
|---|---|---|---|---|---|---|
| MMEA-Net (Full Model) | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
| w/o Hybrid-M | 71.35 | 70.21 | 0.59 | 1.14 | 0.89 | 0.54 |
| w/o Cross-Domain Fusion | 72.14 | 70.88 | 0.61 | 1.08 | 0.84 | 0.56 |
| w/o MFE | 72.93 | 71.56 | 0.62 | 1.03 | 0.80 | 0.58 |
Table 5. Ablation study on different modalities and their combinations.
| Modalities Used | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R² |
|---|---|---|---|---|---|---|
| Eye-tracking only | 65.37 | 63.92 | 0.51 | 1.28 | 1.02 | 0.44 |
| ECG only | 62.14 | 60.53 | 0.47 | 1.35 | 1.09 | 0.40 |
| GSR only | 59.86 | 58.12 | 0.45 | 1.41 | 1.15 | 0.36 |
| Eye-tracking + ECG | 71.29 | 69.87 | 0.60 | 1.09 | 0.86 | 0.56 |
| Eye-tracking + GSR | 70.45 | 68.92 | 0.59 | 1.12 | 0.89 | 0.54 |
| ECG + GSR | 68.73 | 67.18 | 0.56 | 1.18 | 0.94 | 0.51 |
| All modalities | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
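The modality-combination ablation in Table 5 amounts to encoding only a chosen subset of the eye-tracking, ECG, and GSR streams and fusing what remains. A simple concatenation-based sketch is given below; the encoder layers and feature dimensions are hypothetical stand-ins, not the model’s actual modality encoders.

```python
import torch
import torch.nn as nn

# Hypothetical per-modality encoders; input dimensions are illustrative only.
encoders = nn.ModuleDict({
    "eye": nn.Linear(30, 64),   # eye-tracking features
    "ecg": nn.Linear(20, 64),   # ECG features
    "gsr": nn.Linear(10, 64),   # GSR features
})

def fuse(batch, use=("eye", "ecg", "gsr")):
    """Encode and concatenate only the selected modalities, mimicking the
    modality-combination ablation (e.g., use=("eye", "ecg") drops GSR)."""
    parts = [encoders[name](batch[name]) for name in use]
    return torch.cat(parts, dim=-1)

batch = {"eye": torch.randn(8, 30), "ecg": torch.randn(8, 20), "gsr": torch.randn(8, 10)}
print(fuse(batch).shape)                       # all modalities -> (8, 192)
print(fuse(batch, use=("eye", "ecg")).shape)   # Eye-tracking + ECG -> (8, 128)
```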
Table 6. Comparison between single-task and multi-task learning approaches.
| Learning Approach | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R² |
|---|---|---|---|---|---|---|
| Single-task (Emotion Classification) | 73.81 | 72.54 | 0.63 | – | – | – |
| Single-task (Immersion Prediction) | – | – | – | 1.02 | 0.80 | 0.59 |
| Multi-task Learning | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
Table 7. Performance comparison under different loss weight settings for emotion classification (λ₁) and immersion prediction (λ₂).
| λ₁ | λ₂ | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R² |
|---|---|---|---|---|---|---|---|
| 0.8 | 0.2 | 74.65 | 73.28 | 0.64 | 1.15 | 0.92 | 0.49 |
| 0.6 | 0.4 | 75.21 | 73.93 | 0.65 | 1.02 | 0.81 | 0.58 |
| 0.4 | 0.6 | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
| 0.2 | 0.8 | 74.83 | 73.65 | 0.65 | 0.98 | 0.76 | 0.62 |
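Tables 6 and 7 (and Figure 8) reflect a weighted joint objective in which λ₁ scales the emotion classification loss and λ₂ scales the immersion regression loss, with the best trade-off observed at λ₁ = 0.4, λ₂ = 0.6. A minimal sketch of such a weighted multi-task loss follows; cross-entropy and mean-squared error are used here as representative task losses, and the tensors are placeholders standing in for real model outputs rather than the paper’s exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(emotion_logits, emotion_labels, immersion_pred, immersion_target,
               lambda1=0.4, lambda2=0.6):
    """Weighted multi-task objective: cross-entropy for emotion classification
    plus MSE for immersion regression. The default weights follow Table 7's
    best-performing setting; the combination itself is an illustrative sketch."""
    cls_loss = F.cross_entropy(emotion_logits, emotion_labels)
    reg_loss = F.mse_loss(immersion_pred.squeeze(-1), immersion_target)
    return lambda1 * cls_loss + lambda2 * reg_loss

# Toy usage with random tensors standing in for model outputs.
logits = torch.randn(16, 4, requires_grad=True)   # 4 emotion classes
labels = torch.randint(0, 4, (16,))
imm_pred = torch.randn(16, 1, requires_grad=True)
imm_true = torch.rand(16) * 6 + 1                 # illustrative immersion scores
loss = joint_loss(logits, labels, imm_pred, imm_true)
loss.backward()
print(float(loss))
```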
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
