Subject-Independent Multimodal Interaction Modeling for Joint Emotion and Immersion Estimation in Virtual Reality
Abstract
1. Introduction
- (1)
- (2) Insufficient design of modality-aware feature extraction and fusion mechanisms for heterogeneous physiological signals. Many approaches overlook the balanced yet distinct roles of eye-tracking, ECG, and GSR modalities, resulting in suboptimal representation learning and weak cross-modality coordination [21,22,23].
- (3)
- (1) We propose a unified multi-task framework, MMEA-Net, for jointly modeling emotion classification and immersion regression in VR environments. Compared with the best-performing Transformer-based baseline, MMEA-Net achieves a +2.55% improvement in test accuracy (75.42% vs. 72.87%), a +2.54% increase in F1-score (74.19% vs. 71.65%), and a +0.04 gain in Cohen’s Kappa (0.66 vs. 0.62) for emotion classification. For immersion estimation, the proposed model reduces RMSE by 0.09 (0.96 vs. 1.05), lowers MAE by 0.07 (0.74 vs. 0.81), and improves the coefficient of determination by +0.05 (0.63 vs. 0.58), demonstrating effective cross-task synergy.
- (2) We design a Hybrid-M modality-aware module and a Cross-Domain Fusion mechanism. Hybrid-M employs dedicated sub-networks to encode temporal characteristics of eye-tracking, ECG, and GSR signals, while the fusion mechanism facilitates structured and symmetry-aware interaction across modalities and tasks.
- (3) We extend the VREED dataset by annotating quantitative immersion scores, enabling dual-task benchmarking and supporting reproducible research in multimodal affective computing.
Organization of the Paper
2. Related Work
2.1. Emotion Recognition Using Physiological Signals
2.2. Multimodal Emotion Modeling in Virtual Reality
2.3. Multi-Task Learning for Affective and Immersion-Aware Modeling
2.4. Structured Graph-Based Dependency Modeling
3. Methodology
3.1. Overview of MMEA-Net
Overview of Emotion and Immersion Modeling
- Hybrid-M modules, which perform modality-specific feature extraction to preserve the intrinsic temporal characteristics of each signal while maintaining a consistent structural interface across modalities.
- A Cross-Domain Fusion module, which integrates modality-specific representations through coordinated interactions, allowing information to be exchanged while maintaining balance among modalities.
- A Multi-scale Feature Extraction (MFE) mechanism, embedded within both the modality-specific and fusion stages, to capture temporal patterns at different resolutions in a structurally consistent manner.
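To make the overall data flow concrete, the following minimal PyTorch skeleton illustrates how modality-specific encoders, a fusion stage, and the two task heads could be composed. All class names, dimensions, and the number of emotion classes are illustrative assumptions; the actual Hybrid-M, Cross-Domain Fusion, and MFE designs are detailed in Sections 3.2–3.4.

```python
import torch
import torch.nn as nn

class MMEANetSkeleton(nn.Module):
    """Data-flow skeleton only: each stage is reduced to a placeholder encoder.
    Names, dimensions, and the four-class emotion head are assumptions."""

    def __init__(self, modality_dims=(8, 4, 2), hidden=256, n_emotions=4):
        super().__init__()
        # Stage 1: one modality-specific encoder per signal (stands in for Hybrid-M).
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in modality_dims]
        )
        # Stage 2: cross-modal integration (stands in for Cross-Domain Fusion).
        self.fusion = nn.Sequential(
            nn.Linear(hidden * len(modality_dims), hidden), nn.ReLU()
        )
        # Stage 3: task heads for the two outputs.
        self.emotion_head = nn.Linear(hidden, n_emotions)   # classification logits
        self.immersion_head = nn.Linear(hidden, 1)          # regression score

    def forward(self, xs):
        # xs: list of per-modality tensors shaped (batch, time, features)
        feats = [b(x).mean(dim=1) for b, x in zip(self.branches, xs)]  # pool over time
        fused = self.fusion(torch.cat(feats, dim=-1))
        return self.emotion_head(fused), self.immersion_head(fused).squeeze(-1)
```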
3.2. Hybrid-M: Modality-Specific Feature Extraction
- Vectorization. Raw input signals are first transformed into vectorized representations suitable for neural processing. For eye-tracking data, this includes spatial gaze coordinates, pupil dilation, and fixation-related statistics. For ECG signals, features such as R–R intervals and heart rate variability are extracted, while GSR signals are decomposed into tonic and phasic components. This step establishes a common vector-level interface across modalities.
- Multi-scale feature extraction. To capture temporal patterns occurring at different resolutions, a multi-scale feature extraction mechanism is applied: $F_{\mathrm{ms}} = \mathrm{Conv}_{k_1}(F) \oplus \mathrm{Conv}_{k_2}(F) \oplus \mathrm{Conv}_{k_3}(F)$, where $F$ denotes the input feature sequence, $\mathrm{Conv}_{k_i}$ represents convolution with kernel size $k_i$, and $\oplus$ denotes concatenation. This design enables the model to capture both short-term fluctuations and longer-term trends in a balanced manner.
- Normalization and linear transformation. The extracted features are normalized and linearly transformed to stabilize optimization and align feature distributions across modalities: $F_{\mathrm{norm}} = \mathrm{LayerNorm}(\mathrm{Linear}(F_{\mathrm{ms}})) + F_{\mathrm{vec}}$, where $F_{\mathrm{vec}}$ denotes the original vectorized features retained through a skip connection. This operation preserves modality-specific information while enforcing scale consistency.
- Transformer-based temporal modeling. A transformer decoder is employed to model sequential dependencies, $F_{\mathrm{tr}} = \mathrm{TransformerDecoder}(F_{\mathrm{norm}})$, enabling the extraction of long-range temporal relationships that are critical for modeling evolving emotional and physiological responses.
- Linear projection and residual refinement. The transformer output is further processed by linear and fully connected layers, $F_{\mathrm{proj}} = \mathrm{FC}(\mathrm{Linear}(F_{\mathrm{tr}}))$, followed by a residual connection and normalization, $F_{\mathrm{res}} = \mathrm{LayerNorm}(F_{\mathrm{proj}} + F_{\mathrm{tr}})$. This step refines the representation while maintaining structural consistency across modalities.
- Multi-head attention. Finally, a multi-head attention mechanism is applied to emphasize informative temporal segments: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$, where each attention head is computed as $\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$ and $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$. A minimal code sketch of this branch is given after the list.
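A minimal sketch of one Hybrid-M branch follows, assuming PyTorch, kernel sizes of 3, 5, and 7, and a standard self-attention stack in place of the decoder-style transformer block. The hidden width (256), layer count (4), and head count (8) follow Section 4.2.1; everything else is illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class HybridMBranch(nn.Module):
    """Sketch of one Hybrid-M branch (eye-tracking, ECG, or GSR).
    Kernel sizes and the self-attention stand-in for the decoder block are assumptions."""

    def __init__(self, in_dim, hidden=256, kernels=(3, 5, 7), n_layers=4, n_heads=8):
        super().__init__()
        # Multi-scale temporal convolutions applied to the vectorized signal.
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_dim, hidden, k, padding=k // 2) for k in kernels]
        )
        self.skip = nn.Linear(in_dim, hidden)        # skip path for the vectorized features
        self.proj_in = nn.Linear(hidden * len(kernels), hidden)
        self.norm_in = nn.LayerNorm(hidden)
        # Temporal modeling (self-attention stack standing in for the decoder blocks).
        layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.refine = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.norm_out = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, time, in_dim)
        c = x.transpose(1, 2)                  # (batch, in_dim, time) for Conv1d
        ms = torch.cat([conv(c) for conv in self.convs], dim=1).transpose(1, 2)
        h = self.norm_in(self.proj_in(ms)) + self.skip(x)   # normalization + skip connection
        h = self.temporal(h)                                 # long-range dependencies
        h = self.norm_out(self.refine(h) + h)                # residual refinement
        h, _ = self.attn(h, h, h)                            # emphasize informative segments
        return h.mean(dim=1)                                 # (batch, hidden) summary
```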
3.3. Cross-Domain Fusion
- Initial tensor combination. The modality-specific tensors are first combined through learnable weighting: $F_{\mathrm{fuse}} = \alpha F_{\mathrm{eye}} \oplus \beta F_{\mathrm{ecg}} \oplus \gamma F_{\mathrm{gsr}}$, where $\alpha$, $\beta$, and $\gamma$ are trainable parameters and $\oplus$ denotes concatenation. This operation establishes a balanced aggregation scheme in which each modality contributes proportionally, preventing dominance by any single source.
- Normalization and linear transformation. The combined representation is then normalized and linearly transformed, $\tilde{F} = \mathrm{Linear}(\mathrm{LayerNorm}(F_{\mathrm{fuse}}))$, ensuring scale alignment and structural consistency across modalities before higher-order interaction.
- Sparse attention layers. A stack of sparse attention layers refines the normalized representation: $H^{(l)} = \mathrm{SparseAttention}(H^{(l-1)}),\; l = 1, \ldots, L$, with $H^{(0)} = \tilde{F}$. The sparsity constraint restricts attention to a subset of informative relationships, promoting efficient and structured interaction among features while reducing redundancy.
- Parallel feature projection. The output of the sparse attention layers is processed by multiple parallel multilayer perceptrons: $P_j = \mathrm{MLP}_j(H^{(L)}),\; j = 1, \ldots, J$, where each MLP captures a distinct transformation perspective. This parallel design preserves symmetry in feature processing, allowing multiple coordinated views of the fused representation.
- Cross-attention integration. The parallel feature projections are integrated through a cross-attention mechanism, $F_{\mathrm{out}} = \mathrm{CrossAttention}(P_1, \ldots, P_J)$, defined as $\mathrm{CrossAttention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$ with queries and keys/values drawn from different projections. This operation enables structured information exchange across feature perspectives, reinforcing complementary patterns while maintaining internal consistency. A sketch of the fusion stage follows this list.
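The sketch below outlines the fusion stage under simplifying assumptions: dense multi-head attention stands in for the sparse attention layers and two parallel MLPs are assumed; the three-layer depth, eight heads, and 256-dimensional hidden size follow Section 4.2.1, while all remaining names are illustrative.

```python
import torch
import torch.nn as nn

class CrossDomainFusionSketch(nn.Module):
    """Sketch of the Cross-Domain Fusion stage; dense attention replaces the
    paper's sparse attention, and the number of parallel MLPs is an assumption."""

    def __init__(self, hidden=256, n_modalities=3, n_attn_layers=3, n_heads=8, n_parallel=2):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_modalities))        # alpha, beta, gamma
        self.norm = nn.LayerNorm(hidden * n_modalities)
        self.proj = nn.Linear(hidden * n_modalities, hidden)
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(hidden, n_heads, batch_first=True) for _ in range(n_attn_layers)]
        )
        self.parallel_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
             for _ in range(n_parallel)]
        )
        self.cross_attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, feats):                     # feats: list of (batch, hidden) tensors
        weighted = [w * f for w, f in zip(self.weights, feats)]      # learnable weighting
        h = self.proj(self.norm(torch.cat(weighted, dim=-1))).unsqueeze(1)  # (batch, 1, hidden)
        for attn in self.attn_layers:             # refinement stack (dense stand-in for sparse attention)
            h = h + attn(h, h, h)[0]
        views = [mlp(h) for mlp in self.parallel_mlps]               # parallel feature projections
        q, kv = views[0], torch.cat(views[1:], dim=1)                # cross-attention across views
        fused, _ = self.cross_attn(q, kv, kv)
        return fused.squeeze(1)                                      # (batch, hidden)
```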
3.4. Multi-Scale Feature Extraction (MFE)
- Initial feature projection. The input representation $X$ is first transformed using a multilayer perceptron, $F_0 = \mathrm{MLP}(X)$, which maps the input features into a space suitable for parallel multi-scale processing.
- Parallel multi-scale convolutions. Three convolutional operations with different kernel sizes are applied in parallel: $F_{k_i} = \mathrm{Conv}_{k_i}(F_0),\; i \in \{1, 2, 3\}$, where each branch captures patterns at a distinct temporal resolution. The parallel design ensures structural symmetry across scales, allowing fine-grained, intermediate, and coarse temporal dynamics to be modeled in a balanced manner.
- Feature aggregation. The outputs from different scales are concatenated, $F_{\mathrm{cat}} = F_{k_1} \oplus F_{k_2} \oplus F_{k_3}$, preserving information from all temporal resolutions while maintaining their individual contributions.
- Integrated transformation. The aggregated representation is further processed through convolutional and nonlinear transformations, $F_{\mathrm{int}} = \sigma(\mathrm{Conv}(F_{\mathrm{cat}}))$, followed by a final MLP, $F_{\mathrm{MFE}} = \mathrm{MLP}(F_{\mathrm{int}})$, which integrates information across scales into a unified feature representation. A code sketch of this block is given below.
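A compact sketch of the MFE block, assuming kernel sizes of 3, 5, and 7 and equal channel widths across branches (both assumptions), is shown here.

```python
import torch
import torch.nn as nn

class MFESketch(nn.Module):
    """Sketch of the Multi-scale Feature Extraction block; kernel sizes and
    channel widths are illustrative assumptions."""

    def __init__(self, channels=256, kernels=(3, 5, 7)):
        super().__init__()
        self.pre_mlp = nn.Sequential(nn.Linear(channels, channels), nn.ReLU())
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernels]
        )
        self.integrate = nn.Sequential(
            nn.Conv1d(channels * len(kernels), channels, 1), nn.ReLU()
        )
        self.post_mlp = nn.Sequential(nn.Linear(channels, channels), nn.ReLU())

    def forward(self, x):                                # x: (batch, time, channels)
        h = self.pre_mlp(x).transpose(1, 2)              # initial projection, (batch, channels, time)
        multi = torch.cat([b(h) for b in self.branches], dim=1)   # parallel multi-scale convolutions
        out = self.integrate(multi).transpose(1, 2)      # aggregate and fuse the scales
        return self.post_mlp(out)                        # unified multi-scale representation
```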
3.4.1. Design Rationale of Kernel Sizes
3.4.2. Comparison with Alternative Scale Configurations
3.4.3. Impact of Using More Diverse Kernel Sizes
3.5. Training and Optimization
Potential Challenges and Mitigation Strategies in Training with Heterogeneous Multimodal Data
3.6. Discussion on Symmetry Awareness and Symmetry Breaking
4. Experiments
4.1. Datasets
4.1.1. VREED Dataset Overview
- Visual and behavioral data, including eye-tracking metrics such as fixation duration, saccade amplitude, pupil dilation, and gaze coordinates.
- Physiological data from electrocardiogram (ECG) recordings sampled at 256 Hz, capturing heart rate dynamics and variability.
- Physiological data from galvanic skin response (GSR) recordings sampled at 128 Hz, reflecting autonomic arousal through skin conductance changes.
- Self-reported emotion annotations, where participants evaluated their affective states using the Self-Assessment Manikin (SAM) scale along the valence, arousal, and dominance dimensions.
- Post-exposure questionnaire responses, including presence-related items used to assess immersion levels.
4.1.2. Immersion Level Extraction
- Reverse coding. For negatively phrased items (POST_PQ2, POST_PQ3, and POST_PQ5), reverse coding is applied to ensure a consistent directional interpretation, such that higher scores uniformly correspond to higher immersion levels: $\tilde{s}_i = (s_{\max} + 1) - s_i$, where $s_i$ is the raw response to item $i$ and $s_{\max}$ denotes the maximum value of the response scale.
- Item selection. Based on factor analysis of the PQ responses, six items closely associated with immersion-related constructs are retained (POST_PQ1, POST_PQ2, POST_PQ3, POST_PQ4, POST_PQ5, and POST_PQ7). The item POST_PQ6, which primarily reflects attention to background music, is excluded due to its weak correlation with the remaining presence dimensions.
- Weighted aggregation. A weighted average of the selected items is computed to obtain a unified immersion score, $\mathrm{Immersion} = \sum_{i} w_i \tilde{s}_i$ with $\sum_{i} w_i = 1$, where $w_i$ denotes the weight associated with each PQ item. These weights are determined through principal component analysis (PCA), reflecting the relative contribution of each dimension to the overall immersion construct. A sketch of this computation follows the list.
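The following sketch illustrates the extraction procedure, assuming a 7-point PQ response scale and first-principal-component loadings normalized to sum to one; both choices are assumptions for illustration rather than the exact annotation pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def immersion_scores(pq, scale_max=7):
    """pq: (n_participants, 7) array of POST_PQ1..POST_PQ7 responses.
    The 7-point scale and the PCA-loading normalization are assumptions."""
    pq = pq.astype(float).copy()
    # Reverse-code negatively phrased items so that higher = more immersed.
    for idx in (1, 2, 4):                       # POST_PQ2, POST_PQ3, POST_PQ5 (0-based columns)
        pq[:, idx] = (scale_max + 1) - pq[:, idx]
    # Keep the six immersion-related items; drop POST_PQ6 (background music).
    items = pq[:, [0, 1, 2, 3, 4, 6]]
    # Derive item weights from the loadings of the first principal component.
    pca = PCA(n_components=1).fit(items)
    w = np.abs(pca.components_[0])
    w = w / w.sum()
    return items @ w                            # weighted-average immersion score
```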
4.1.3. Data Preprocessing
Physiological Signal Preprocessing
- Temporal alignment: All modalities were synchronized and resampled to a unified sampling rate of 64 Hz, enabling consistent temporal correspondence across signals.
- Normalization: Z-score normalization was applied to all features to standardize their scale, $x' = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ denote the mean and standard deviation of each feature, respectively.
- Segmentation: Continuous recordings were segmented into non-overlapping windows of 5 s, resulting in approximately 5280 multimodal segments used for model training and evaluation.
- Feature extraction: For each modality, both statistical descriptors (e.g., mean and standard deviation) and temporal features (e.g., frequency-domain characteristics and rate-of-change measures) were extracted to capture complementary signal properties.
- Missing data handling: Occasional missing values in eye-tracking signals, such as those caused by blinks, were addressed using forward filling for short gaps and interpolation for longer gaps to preserve temporal continuity.
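A minimal pandas/NumPy sketch of this preprocessing chain is given below; the resampling method, the forward-fill gap limit, and the array layout are assumptions, while the 64 Hz rate and 5 s windows follow the text.

```python
import numpy as np
import pandas as pd

def preprocess(df, target_hz=64, window_s=5):
    """df: DataFrame indexed by a DatetimeIndex with one column per feature.
    Resampling by mean, gap thresholds, and windowing layout are illustrative."""
    # 1. Temporal alignment: resample every modality onto a common 64 Hz grid.
    df = df.resample(pd.Timedelta(seconds=1 / target_hz)).mean()
    # 2. Missing data: forward-fill short gaps, interpolate the remaining longer ones.
    df = df.ffill(limit=3).interpolate(limit_direction="both")
    # 3. Normalization: z-score each feature.
    df = (df - df.mean()) / (df.std() + 1e-8)
    # 4. Segmentation: non-overlapping 5 s windows.
    win = target_hz * window_s
    values = df.to_numpy()
    n_windows = len(values) // win
    return values[: n_windows * win].reshape(n_windows, win, -1)
```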
4.2. Experimental Setup
4.2.1. Implementation Details
- Number of transformer layers in Hybrid-M: 4.
- Number of sparse attention layers in Cross-Domain Fusion: 3.
- Number of attention heads: 8.
- Hidden dimension: 256.
- Dropout rate: 0.3.
- Learning rate: 0.001 with cosine annealing schedule.
- Batch size: 32.
- Maximum epochs: 100 with early stopping (patience = 10).
- Loss weights: $\lambda_{\mathrm{emo}} = 0.4$ for emotion classification, $\lambda_{\mathrm{imm}} = 0.6$ for immersion estimation, and $\lambda_{\mathrm{reg}}$ for regularization (see the loss weight analysis in Section 4.5). A sketch of the optimization setup follows this list.
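The sketch below assembles these hyperparameters into a training recipe; the optimizer choice (Adam), the L2 form of the regularizer, and its weight are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn

def build_training(model, epochs=100, lr=1e-3,
                   lam_emo=0.4, lam_imm=0.6, lam_reg=1e-4):
    """Optimization recipe following Section 4.2.1; the optimizer, the L2
    regularizer, and its weight are illustrative assumptions."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

    def loss_fn(emotion_logits, immersion_pred, emotion_y, immersion_y):
        # Weighted multi-task objective: classification + regression + regularization.
        reg = sum(p.pow(2).sum() for p in model.parameters())   # simple L2 penalty (assumption)
        return (lam_emo * ce(emotion_logits, emotion_y)
                + lam_imm * mse(immersion_pred, immersion_y)
                + lam_reg * reg)

    return optimizer, scheduler, loss_fn
```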
4.2.2. Evaluation Metrics
- Accuracy. The proportion of correctly classified samples: $\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(\hat{y}_i = y_i)$, where $N$ is the number of samples, $\hat{y}_i$ the predicted label, and $y_i$ the ground-truth label.
- F1-score. The harmonic mean of precision and recall: $F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, where Precision = TP/(TP + FP) and Recall = TP/(TP + FN).
- Cohen’s Kappa. A measure of agreement between predicted and true labels that accounts for chance agreement: $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ denotes the observed agreement and $p_e$ represents the expected agreement by chance.
- Root Mean Square Error (RMSE). The square root of the mean squared prediction error: $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}$.
- Mean Absolute Error (MAE). The average absolute difference between predicted and ground-truth immersion levels: $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert \hat{y}_i - y_i \rvert$.
- Coefficient of Determination ($R^2$). The proportion of variance in the target variable explained by the model: $R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$, where $\bar{y}$ denotes the mean of the ground-truth immersion scores. A metric-computation sketch follows this list.
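These metrics can be computed directly with scikit-learn, as in the sketch below; the macro averaging for the F1-score is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, cohen_kappa_score,
                             mean_squared_error, mean_absolute_error, r2_score)

def evaluate(emotion_true, emotion_pred, imm_true, imm_pred):
    """Classification metrics for emotion, regression metrics for immersion."""
    return {
        "accuracy": accuracy_score(emotion_true, emotion_pred),
        "f1": f1_score(emotion_true, emotion_pred, average="macro"),  # averaging assumed
        "kappa": cohen_kappa_score(emotion_true, emotion_pred),
        "rmse": np.sqrt(mean_squared_error(imm_true, imm_pred)),
        "mae": mean_absolute_error(imm_true, imm_pred),
        "r2": r2_score(imm_true, imm_pred),
    }
```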
4.2.3. Baseline Models
- LSTM-based model [62]: Uses two-layer LSTM networks with 128 hidden units for each modality, followed by direct feature concatenation.
- BiLSTM-based model [63]: Employs bidirectional LSTMs to model forward and backward temporal dependencies, with simple concatenation for multimodal fusion.
- GRU-based model [64]: Replaces Hybrid-M modules with GRU networks (two layers, 128 hidden units) and applies direct feature concatenation.
- 1D-CNN-based model [65]: Uses one-dimensional convolutional networks with multiple kernel sizes (3, 5, and 7) for temporal feature extraction, followed by feature concatenation.
- XGBoost-based model [66]: Extracts handcrafted features from each modality and performs prediction using XGBoost, with concatenated outputs.
- CNN-LSTM-based model [67]: Combines convolutional layers for feature extraction with LSTM layers for temporal modeling, using direct concatenation for fusion.
- Transformer-based model [68]: Applies Transformer encoder blocks with multi-head self-attention to each modality, followed by simple feature concatenation.
- MLP-based model [69]: Processes modality-specific features using multilayer perceptrons and concatenates the resulting representations.
- MMEA-Net (Simple Fusion): Retains the Hybrid-M modules while replacing the Cross-Domain Fusion mechanism with direct feature concatenation.
4.3. Experimental Results and Analysis
4.3.1. Comparison with Baseline Models
4.3.2. Statistical Significance Analysis
4.4. Cross-Subject Generalization Evaluation
- Experimental Protocol.
- Comparison Baseline.
4.4.1. Model Component Analysis
Statistical Significance Analysis
4.4.2. Multimodal Contribution Analysis
Statistical Significance Analysis
4.4.3. Single-Task vs. Multi-Task Learning
Statistical Significance Analysis
4.4.4. Summary
4.5. Loss Weight Analysis
Statistical Significance Analysis
5. Downstream Task Extensions Based on the MMEA-Net Model
6. Discussion
6.1. Generalization Across VR Environments
6.2. User Variability and Cross-Cultural Considerations
6.3. Implications for Multi-Task Affective Modeling
7. Conclusions, Limitations, and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Lønne, T.; Karlsen, H.; Langvik, E.; Saksvik-Lehouillier, I. The effect of immersion on sense of presence and affect when experiencing an educational scenario in virtual reality: A randomized controlled study. Heliyon 2023, 9, e17196.
2. Yang, X.; Cheng, P.; Liu, X.; Shih, S. The impact of immersive virtual reality on art education: A study of flow state, cognitive load, brain state, and motivation. Educ. Inf. Technol. 2024, 29, 6087–6106.
3. Marín-Morales, J.; Llinares, C.; Guixeres, J.; Alcañiz, M. Emotion Recognition in Immersive Virtual Reality: From Statistics to Affective Computing. Sensors 2020, 20, 5163.
4. Chen, Z.; Han, Z.; Wu, L.; Huang, J. Multisensory Imagery Enhances the Aesthetic Evaluation of Paintings: A Virtual Reality Study. Empir. Stud. Arts 2026.
5. Cai, Y.; Li, X.; Li, J. Emotion recognition using different sensors, emotion models, methods and datasets: A comprehensive review. Sensors 2023, 23, 2455.
6. Moin, A.; Aadil, F.; Ali, Z.; Kang, D. Emotion recognition framework using multiple modalities for an effective human-computer interaction. J. Supercomput. 2023, 79, 9320–9349.
7. Fu, Z.; Zhang, B.; He, X.; Li, Y.; Wang, H.; Huang, J. Emotion recognition based on multi-modal physiological signals and transfer learning. Front. Neurosci. 2022, 16, 1000716.
8. Lee, Y.; Pae, D.; Hong, D.; Lim, M.; Kang, T. Emotion recognition with short-period physiological signals using bimodal sparse autoencoders. Intell. Autom. Soft Comput. 2022, 32, 657–673.
9. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.; Yazdani, A.; Ebrahimi, T.; Patras, I. DEAP: A database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 2011, 3, 18–31.
10. Tabbaa, L.; Searle, R.; Bafti, S.; Hossain, M.; Intarasisrisawat, J.; Glancy, M.; Ang, C. VREED: Virtual reality emotion recognition dataset using eye tracking and physiological measures. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2021, 5, 1–20.
11. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Advances in EEG-based emotion recognition: Challenges, methodologies, and future directions. Appl. Soft Comput. 2025, 180, 113478.
12. Yahyaeian, A.A.; Sabet, M.; Zhang, J.; Jones, A. Enhancing Immersive Learning: An Exploratory Pilot Study on Large Language Model-Powered Guidance in Virtual Reality Labs. Comput. Appl. Eng. Educ. 2026, 34, e70127.
13. De Giglio, V.; Evangelista, A.; Giannakakis, G.; Konstantaras, A.; Kamarianakis, Z.; Uva, A.E.; Manghisi, V.M. Assessing the Impact of Cinematic Virtual Reality Simulations on Young Drivers: Behavior and Physiological Responses. Virtual Real. 2026, 30, 11.
14. García-Batista, Z.E.; Guerra-Peña, K.; Jurnet, I.A.; Cano-Vindel, A.; Álvarez-Hernández, A.; Herrera-Martinez, S.; Medrano, L.A. Design and Preliminary Evaluation of AYRE: A Virtual Reality-Based Intervention for the Treatment of Emotional Disorders. J. Behav. Cogn. Ther. 2026, 36, 100560.
15. Wei, L.; Liu, L.; Faridniya, H. Promoting Mental Health and Preventing Emotional Disorders in Vulnerable Adolescent Girls through VR-Based Extreme Sports. Acta Psychol. 2026, 262, 106088.
16. Baker, N.A.; Polhemus, A.H.; Baird, J.M.; Kenney, M. Embodied Fully Immersive Virtual Reality as a Therapeutic Modality to Treat Chronic Pain: A Scoping Review. Virtual Worlds 2026, 5, 3.
17. Hernandez-Melgarejo, G.; Luviano-Juarez, A.; Fuentes-Aguilar, R. A framework to model and control the state of presence in virtual reality systems. IEEE Trans. Affect. Comput. 2022, 13, 1854–1867.
18. Ochs, C.; Sonderegger, A. The interplay between presence and learning. Front. Virtual Real. 2022, 3, 742509.
19. Liu, B.; Liu, X.; Jin, X.; Stone, P.; Liu, Q. Conflict-averse gradient descent for multi-task learning. Proc. Adv. Neural Inf. Process. Syst. 2021, 34, 18878–18890.
20. Liu, D.; Yu, Y. MT2ST: Adaptive Multi-Task to Single-Task Learning. arXiv 2024, arXiv:2406.18038.
21. Lin, W.; Li, C. Review of studies on emotion recognition and judgment based on physiological signals. Appl. Sci. 2023, 13, 2573.
22. Zhang, Y.; Cheng, C.; Zhang, Y. Multimodal emotion recognition based on manifold learning and convolution neural network. Multimed. Tools Appl. 2022, 81, 33253–33268.
23. Katada, S.; Okada, S. Biosignal-based user-independent recognition of emotion and personality with importance weighting. Multimed. Tools Appl. 2022, 81, 30219–30241.
24. Dissanayake, V.; Seneviratne, S.; Rana, R.; Wen, E.; Kaluarachchi, T.; Nanayakkara, S. SigRep: Toward robust wearable emotion recognition with contrastive representation learning. IEEE Access 2022, 10, 18105–18120.
25. Yan, J.; Zheng, W.; Xu, Q.; Lu, G.; Li, H.; Wang, B. Sparse kernel reduced-rank regression for bimodal emotion recognition from facial expression and speech. IEEE Trans. Multimed. 2016, 18, 1319–1329.
26. Wang, M.; Yang, W.; Wang, S. Conditional matching preclusion number for the Cayley graph on the symmetric group. Acta Math. Appl. Sin. (Chin. Ser.) 2013, 36, 813–820.
27. Wang, M.; Yang, W.; Guo, Y.; Wang, S. Conditional fault tolerance in a class of Cayley graphs. Int. J. Comput. Math. 2016, 93, 67–82.
28. Wang, S.; Wang, Y.; Wang, M. Connectivity and matching preclusion for leaf-sort graphs. J. Interconnect. Netw. 2019, 19, 1940007.
29. Wang, S.; Wang, M. The strong connectivity of bubble-sort star graphs. Comput. J. 2019, 62, 715–729.
30. Wang, S.; Wang, M. A Note on the Connectivity of m-Ary n-Dimensional Hypercubes. Parallel Process. Lett. 2019, 29, 1950017.
31. Jiang, J.; Wu, L.; Yu, J.; Wang, M.; Kong, H.; Zhang, Z.; Wang, J. Robustness of bilayer railway-aviation transportation network considering discrete cross-layer traffic flow assignment. Transp. Res. Part D Transp. Environ. 2024, 127, 104071.
32. Hu, Z.; Chen, L.; Luo, Y.; Zhou, J. EEG-based emotion recognition using convolutional recurrent neural network with multi-head self-attention. Appl. Sci. 2022, 12, 11255.
33. Xiao, G.; Shi, M.; Ye, M.; Xu, B.; Chen, Z.; Ren, Q. 4D attention-based neural network for EEG emotion recognition. Cogn. Neurodynamics 2022, 16, 805–818.
34. Chen, J.; Fan, F.; Wei, C.; Polat, K.; Alenezi, F. Decoding driving states based on normalized mutual information features and hyperparameter self-optimized Gaussian kernel-based radial basis function extreme learning machine. Chaos Solitons Fractals 2025, 199, 116751.
35. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Driver fatigue detection using EEG-based graph attention convolutional neural networks: An end-to-end learning approach with mutual information-driven connectivity. Appl. Soft Comput. 2026, 186, 114097.
36. Wei, C.; Alenezi, F.; Chen, J.; Wang, H.; Polat, K. Nonlinear Feature Decomposition and Deep Temporal–Spatial Learning for Single-Channel sEMG-Based Lower Limb Motion Recognition. IEEE Sens. J. 2026, 26, 4120–4126.
37. Wang, T.; Li, J.; Wu, H.N.; Li, C.; Snoussi, H.; Wu, Y. ResLNet: Deep residual LSTM network with longer input for action recognition. Front. Comput. Sci. 2022, 16, 166334.
38. Li, Z.; Cai, J.; Chen, Q.; Chen, L.; Qing, M.; Yang, S.X. An LSTM Network with Neural Plasticity for Driver Fatigue Recognition on Real Roads. IEEE Trans. Ind. Electron. 2025, 72, 14668–14676.
39. Liu, X.Y.; Li, G.; Zhou, X.H.; Liang, X.; Hou, Z.G. A Weight-Aware-Based Multisource Unsupervised Domain Adaptation Method for Human Motion Intention Recognition. IEEE Trans. Cybern. 2025, 55, 3131–3143.
40. Alharbi, H. Explainable feature selection and deep learning based emotion recognition in virtual reality using eye tracker and physiological data. Front. Med. 2024, 11, 1438720.
41. Souza, V.; Maciel, A.; Nedel, L.; Kopper, R. Measuring presence in virtual environments: A survey. ACM Comput. Surv. 2021, 54, 1–37.
42. Liu, X.; Zhou, H.; Liu, J. Deep learning-based analysis of the influence of illustration design on emotions in immersive art. Mob. Inf. Syst. 2022, 2022, 3120955.
43. Song, W.; Wang, X.; Jiang, Y.; Li, S.; Hao, A.; Hou, X.; Qin, H. Expressive 3D Facial Animation Generation Based on Local-to-Global Latent Diffusion. IEEE Trans. Vis. Comput. Graph. 2024, 30, 3321–3336.
44. Song, W.; Wang, X.; Zheng, S.; Li, S.; Hao, A.; Hou, X. TalkingStyle: Personalized Speech-Driven 3D Facial Animation with Style Preservation. IEEE Trans. Vis. Comput. Graph. 2025, 31, 4682–4694.
45. Hu, J.; Jiang, H.; Xiao, Z.; Chen, S.; Dustdar, S.; Liu, J. HeadTrack: Real-Time Human–Computer Interaction via Wireless Earphones. IEEE J. Sel. Areas Commun. 2024, 42, 990–1002.
46. Meng, T.; Shou, Y.; Ai, W.; Du, J.; Liu, H.; Li, K. A Multi-Message Passing Framework Based on Heterogeneous Graphs in Conversational Emotion Recognition. Neurocomputing 2024, 569, 127109.
47. Liu, Y.; Feng, S.; Liu, S.; Zhan, Y.; Tao, D.; Chen, Z.; Chen, Z. Sample-cohesive pose-aware contrastive facial representation learning. Int. J. Comput. Vis. 2025, 133, 3727–3745.
48. Lin, Z.; Wang, Y.; Zhou, Y.; Du, F.; Yang, Y. MLM-EOE: Automatic Depression Detection via Sentimental Annotation and Multi-Expert Ensemble. IEEE Trans. Affect. Comput. 2025, 16, 2842–2858.
49. Hu, J.; Jiang, H.; Liu, D.; Xiao, Z.; Zhang, Q.; Liu, J.; Dustdar, S. Combining IMU with acoustics for head motion tracking leveraging wireless earphone. IEEE Trans. Mob. Comput. 2023, 23, 6835–6847.
50. Hu, J.; Jiang, H.; Liu, D.; Xiao, Z.; Zhang, Q.; Min, G.; Liu, J. Real-time contactless eye blink detection using UWB radar. IEEE Trans. Mob. Comput. 2023, 23, 6606–6619.
51. Bukht, T.F.N.; Alazeb, A.; Mudawi, N.A.; Alabdullah, B.; Alnowaiser, K.; Jalal, A.; Liu, H. Robust Human Interaction Recognition Using Extended Kalman Filter. Comput. Mater. Contin. 2024, 81, 2987–3002.
52. Linton, K.F.; Abbasi, B.; Jimenez, M.G.; Aragon, J.; Gendron, A.; Gomez, R.; Hampton, S.; Michael, B.; Monson, S.; Paulus, N.; et al. Immersive virtual reality for prospective memory and eye fixation recovery following traumatic brain injury: A pilot study. Brain Heart 2024, 2, 2685.
53. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Polosukhin, I. Attention is all you need. Proc. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
54. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Liu, P. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
55. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
56. Almujally, N.A.; Rafique, A.A.; Al Mudawi, N.; Alazeb, A.; Alonazi, M.; Algarni, A.; Jalal, A.; Liu, H. Multi-modal remote perception learning for object sensory data. Front. Neurorobot. 2024, 18, 1427786.
57. Shen, X.; Li, L.; Ma, Y.; Xu, S.; Liu, J.; Yang, Z.; Shi, Y. VLCIM: A vision-language cyclic interaction model for industrial defect detection. IEEE Trans. Instrum. Meas. 2025, 74, 2538713.
58. Lv, S.; Lu, S.; Wang, R.; Yin, L.; Yin, Z.; AlQahtani, S.A.; Tian, J.; Zheng, W. Enhancing Chinese dialogue generation with word–phrase fusion embedding and sparse softmax optimization. Systems 2024, 12, 516.
59. Li, G.; Bai, L.; Zhang, H.; Xu, Q.; Zhou, Y.; Gao, Y.; Wang, M.; Li, Z. Velocity Anomalies around the Mantle Transition Zone beneath the Qiangtang Terrane, Central Tibetan Plateau from Triplicated P Waveforms. Earth Space Sci. 2022, 9, e2021EA002060.
60. Lv, S.; Yang, B.; Wang, R.; Lu, S.; Tian, J.; Zheng, W.; Chen, X.; Yin, L. Dynamic multi-granularity translation system: DAG-structured multi-granularity representation and self-attention. Systems 2024, 12, 420.
61. Wang, S.; Zhang, K.; Liu, A. Flat-Lattice-CNN: A model for Chinese medical-named-entity recognition. PLoS ONE 2025, 20, e0331464.
62. Gers, F.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 2000, 12, 2451–2471.
63. Siami-Namini, S.; Tavakoli, N.; Namin, A. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; IEEE: New York, NY, USA, 2019; pp. 3285–3292.
64. Rana, R. Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech. arXiv 2016, arXiv:1612.07778.
65. Azizjon, M.; Jumabek, A.; Kim, W. 1D CNN Based Network Intrusion Detection with Normalization on Imbalanced Data. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020; IEEE: New York, NY, USA, 2020; pp. 218–224.
66. Nielsen, D. Tree Boosting with XGBoost—Why Does XGBoost Win “Every” Machine Learning Competition? Master’s Thesis, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, 2016.
67. Lu, W.; Li, J.; Li, Y.; Sun, A.; Wang, J. A CNN-LSTM-Based Model to Forecast Stock Prices. Complexity 2020, 2020, 6622927.
68. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. Proc. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
69. Valanarasu, J.; Patel, V. UNeXt: MLP-Based Rapid Medical Image Segmentation Network. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2022, Singapore, 18–22 September 2022; Springer: Cham, Switzerland, 2022; pp. 23–33.
70. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7 October 2024.
71. Lahoti, A.; Li, K.; Chen, B.; Wang, C.; Bick, A.; Kolter, J.Z.; Dao, T.; Gu, A. Mamba-3: Improved Sequence Modeling using State Space Principles. In Proceedings of the Fourteenth International Conference on Learning Representations, Rio de Janeiro, Brazil, 23–27 April 2026.
72. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljacic, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov–Arnold Networks. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
73. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Zhang, W. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83, 19–52.
74. Ezzameli, K.; Mahersia, H. Emotion recognition from unimodal to multimodal analysis: A review. Inf. Fusion 2023, 99, 101847.








| Model | Params (M) | FLOPs (G) | Val. Acc (%) | Val. F1 (%) | Val. Kappa | Val. RMSE | Val. MAE | Val. R2 | Test Acc (%) | Test F1 (%) | Test Kappa | Test RMSE | Test MAE | Test R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM-based [62] | 1.5 | 0.32 | 67.43 | 65.87 | 0.53 | 1.24 | 0.97 | 0.46 | 65.91 | 64.22 | 0.51 | 1.31 | 1.02 | 0.43 |
| BiLSTM-based [63] | 2.1 | 0.46 | 69.21 | 67.95 | 0.56 | 1.18 | 0.92 | 0.49 | 68.07 | 66.54 | 0.55 | 1.23 | 0.95 | 0.47 |
| GRU-based [64] | 1.3 | 0.28 | 67.85 | 66.32 | 0.54 | 1.21 | 0.94 | 0.47 | 66.23 | 64.89 | 0.52 | 1.28 | 0.99 | 0.44 |
| 1D-CNN-based [65] | 1.8 | 0.41 | 70.14 | 68.76 | 0.58 | 1.14 | 0.88 | 0.52 | 68.92 | 67.41 | 0.56 | 1.19 | 0.93 | 0.49 |
| XGBoost-based [66] | 0.3 | – | 65.37 | 63.54 | 0.51 | 1.29 | 1.03 | 0.41 | 64.21 | 62.38 | 0.49 | 1.35 | 1.08 | 0.38 |
| CNN-LSTM-based [67] | 2.6 | 0.55 | 71.53 | 70.24 | 0.60 | 1.08 | 0.84 | 0.55 | 70.19 | 68.75 | 0.58 | 1.15 | 0.89 | 0.52 |
| Attention-based | 3.1 | 0.68 | 72.68 | 71.42 | 0.62 | 1.06 | 0.82 | 0.57 | 71.34 | 70.05 | 0.60 | 1.12 | 0.87 | 0.54 |
| Transformer-based [68] | 5.4 | 0.96 | 74.12 | 72.98 | 0.64 | 0.99 | 0.76 | 0.61 | 72.87 | 71.65 | 0.62 | 1.05 | 0.81 | 0.58 |
| MLP-based [69] | 0.9 | 0.17 | 63.79 | 62.14 | 0.48 | 1.33 | 1.07 | 0.38 | 62.45 | 60.89 | 0.46 | 1.39 | 1.13 | 0.35 |
| Mamba-based [70] | 4.2 | 0.78 | 73.54 | 72.18 | 0.63 | 1.01 | 0.78 | 0.59 | 72.31 | 71.02 | 0.61 | 1.07 | 0.83 | 0.56 |
| Mamba-3-based [71] | 4.6 | 0.84 | 73.89 | 72.63 | 0.64 | 0.98 | 0.75 | 0.60 | 72.65 | 71.41 | 0.62 | 1.03 | 0.80 | 0.57 |
| KAN-based [72] | 3.5 | 0.65 | 73.02 | 71.86 | 0.62 | 1.03 | 0.80 | 0.58 | 71.92 | 70.74 | 0.60 | 1.09 | 0.85 | 0.55 |
| MMEA-Net (Simple Fusion) | 2.8 | 0.63 | 73.65 | 72.31 | 0.63 | 1.02 | 0.79 | 0.59 | 72.14 | 70.88 | 0.61 | 1.08 | 0.84 | 0.56 |
| MMEA-Net (Ours) | 3.2 | 0.71 | 76.93 | 75.47 | 0.68 | 0.91 | 0.70 | 0.65 | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
| Model | Val. Precision (%) | Val. Recall (%) | Test Precision (%) | Test Recall (%) |
|---|---|---|---|---|
| LSTM-based [62] | 66.12 | 65.64 | 64.78 | 63.89 |
| BiLSTM-based [63] | 68.01 | 67.89 | 66.92 | 66.17 |
| GRU-based [64] | 66.74 | 66.01 | 65.32 | 64.56 |
| 1D-CNN-based [65] | 69.08 | 68.43 | 67.95 | 66.88 |
| XGBoost-based [66] | 64.11 | 62.98 | 63.02 | 61.85 |
| CNN-LSTM-based [67] | 70.31 | 70.17 | 69.12 | 68.44 |
| Attention-based | 71.56 | 71.28 | 70.24 | 69.86 |
| Transformer-based [68] | 73.21 | 72.84 | 72.03 | 71.18 |
| MLP-based [69] | 62.85 | 61.73 | 61.64 | 60.29 |
| Mamba-based [70] | 72.84 | 72.36 | 71.58 | 70.74 |
| Mamba-3-based [71] | 73.12 | 72.69 | 71.94 | 71.05 |
| KAN-based [72] | 72.41 | 71.93 | 71.16 | 70.28 |
| MMEA-Net (Simple Fusion) | 72.48 | 72.15 | 71.32 | 70.46 |
| MMEA-Net (Ours) | 76.12 | 75.06 | 74.88 | 73.54 |
| Model | Acc (%) | F1 (%) | Precision (%) | Recall (%) | Kappa | RMSE | MAE | R2 |
|---|---|---|---|---|---|---|---|---|
| Transformer-based | 70.84 ± 1.92 | 69.37 ± 1.85 | 70.11 ± 1.78 | 68.92 ± 1.94 | 0.58 ± 0.03 | 1.08 ± 0.07 | 0.84 ± 0.06 | 0.55 ± 0.04 |
| MMEA-Net (Ours) | 73.26 ± 1.74 | 71.98 ± 1.69 | 72.84 ± 1.63 | 71.32 ± 1.71 | 0.61 ± 0.02 | 1.01 ± 0.06 | 0.79 ± 0.05 | 0.59 ± 0.03 |
| Model Variant | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R2 |
|---|---|---|---|---|---|---|
| MMEA-Net (Full Model) | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
| w/o Hybrid-M | 71.35 | 70.21 | 0.59 | 1.14 | 0.89 | 0.54 |
| w/o Cross-Domain Fusion | 72.14 | 70.88 | 0.61 | 1.08 | 0.84 | 0.56 |
| w/o MFE | 72.93 | 71.56 | 0.62 | 1.03 | 0.80 | 0.58 |
| Modalities Used | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R2 |
|---|---|---|---|---|---|---|
| Eye-tracking only | 65.37 | 63.92 | 0.51 | 1.28 | 1.02 | 0.44 |
| ECG only | 62.14 | 60.53 | 0.47 | 1.35 | 1.09 | 0.40 |
| GSR only | 59.86 | 58.12 | 0.45 | 1.41 | 1.15 | 0.36 |
| Eye-tracking + ECG | 71.29 | 69.87 | 0.60 | 1.09 | 0.86 | 0.56 |
| Eye-tracking + GSR | 70.45 | 68.92 | 0.59 | 1.12 | 0.89 | 0.54 |
| ECG + GSR | 68.73 | 67.18 | 0.56 | 1.18 | 0.94 | 0.51 |
| All modalities | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
| Learning Approach | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R2 |
|---|---|---|---|---|---|---|
| Single-task (Emotion Classification) | 73.81 | 72.54 | 0.63 | – | – | – |
| Single-task (Immersion Prediction) | – | – | – | 1.02 | 0.80 | 0.59 |
| Multi-task Learning | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
| Emotion Weight | Immersion Weight | Accuracy (%) | F1 (%) | Kappa | RMSE | MAE | R2 |
|---|---|---|---|---|---|---|---|
| 0.8 | 0.2 | 74.65 | 73.28 | 0.64 | 1.15 | 0.92 | 0.49 |
| 0.6 | 0.4 | 75.21 | 73.93 | 0.65 | 1.02 | 0.81 | 0.58 |
| 0.4 | 0.6 | 75.42 | 74.19 | 0.66 | 0.96 | 0.74 | 0.63 |
| 0.2 | 0.8 | 74.83 | 73.65 | 0.65 | 0.98 | 0.76 | 0.62 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Wang, H.; Wang, M. Subject-Independent Multimodal Interaction Modeling for Joint Emotion and Immersion Estimation in Virtual Reality. Symmetry 2026, 18, 451. https://doi.org/10.3390/sym18030451

