ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding
Abstract
1. Introduction
- ImmerseFM-3D architecture. The first unified foundation model that jointly solves all four 360° streaming sub-tasks through a single shared cross-modal latent representation. Unlike ensembles of task-specific models, ImmerseFM-3D eliminates the inter-task error propagation bottleneck: the bitrate decoder implicitly conditions on viewport prediction uncertainty through the shared bottleneck, a capability structurally impossible in siloed pipelines. This yields VP MAE reductions of up to 34%, BVR reduction of 63%, and tile F1 improvement of 9 pp over the best single-task baselines (Section 5.1).
- Variational information bottleneck for multi-modal streaming. The first application of information bottleneck theory [42] to compress a seven-modality immersive streaming representation into a 256-dimensional sufficient statistic. This enables (a) implicit uncertainty-aware bitrate decisions without dedicated confidence estimation modules, and (b) 72% zero-shot cross-format transfer to volumetric content versus 48% for pre-trained fine-tuning by learning domain-agnostic compressed representations (Section 5.3).
- Sample-efficient meta-learning personalization. A compound personalization module combining MAML [43], a HyperNetwork, and episodic memory that reduces the 90%-threshold adaptation time to 22 s, from 47 s for standalone MAML, and delivers statistically significant QoE gains from a single second of user data across all four sub-tasks simultaneously. It is the first 360° streaming personalization system to achieve this at scale (Section 5.2).
- IMMERSE-1M dataset. The largest publicly released multi-modal benchmark for immersive streaming research: 1000 h of 360° and volumetric video, 524 participants, 10,000 network traces, and 50,000+ subjective MOS ratings [45] with synchronized head-motion [46,47], eye-tracking [39], ambisonics audio, and network traces across six content format categories (Section 4.1).
- Volumetric 6DoF extension. Spherical harmonic viewport decoding and depth-aware tile importance weighting [44] reduce 6DoF viewport error by 40.8% and increase Point-to-Plane PSNR by 7.3 dB over the depth-agnostic 2D baseline, establishing the first foundation-model baseline for next-generation holographic streaming (Section 3.7 and Section 5.4).
2. Related Work
2.1. Viewport Prediction: From Heuristics to Foundation Representations
2.2. Adaptive Bitrate Streaming and Tile Selection
2.3. Quality of Experience Estimation and Multi-Modal Perception
2.4. Meta-Learning, Personalization, and Collaborative Adaptation
3. Proposed Methodology
3.1. System Architecture Overview
- Modality-specific encoders that map seven heterogeneous input streams into a common 512-dimensional latent space;
- Cross-modal fusion module with an information bottleneck that compresses the joint representation to 256 dimensions while discarding task-irrelevant variation;
- Task-specific decoders that transform the shared representation into actionable streaming decisions; and
- Meta-learning adaptation module enabling rapid personalization to individual users from as few as 1–30 s of interaction data.
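The data flow through the first two components can be sketched with array shapes alone. The snippet below is a minimal illustration, not the authors' implementation: it assumes each of the seven encoders has already produced a 512-dimensional vector, replaces the cross-modal attention with mean pooling (a placeholder), and uses random, untrained projection matrices `w_mu` and `w_logvar` (hypothetical names) for the variational bottleneck. Only the dimensions 7, 512, and 256 come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MODALITIES = 7      # from the text: seven heterogeneous input streams
D_LATENT = 512        # from the text: common latent width
D_BOTTLENECK = 256    # from the text: bottleneck width


def fuse(tokens):
    """Placeholder for cross-modal fusion: mean over the modality axis.

    tokens has shape (batch, N_MODALITIES, D_LATENT). The paper uses
    cross-modal attention; mean pooling only stands in for it here.
    """
    return tokens.mean(axis=1)


def bottleneck(h, w_mu, w_logvar, rng):
    """Gaussian bottleneck with the reparameterisation trick.

    Returns the sampled code z and, per example, the KL divergence of
    N(mu, diag(sigma^2)) from a standard normal prior.
    """
    mu = h @ w_mu
    logvar = h @ w_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return z, kl


batch = 4
tokens = rng.standard_normal((batch, N_MODALITIES, D_LATENT))  # stand-in encoder outputs
w_mu = 0.01 * rng.standard_normal((D_LATENT, D_BOTTLENECK))      # untrained, illustrative
w_logvar = 0.01 * rng.standard_normal((D_LATENT, D_BOTTLENECK))  # untrained, illustrative

h = fuse(tokens)
z, kl = bottleneck(h, w_mu, w_logvar, rng)
print(h.shape, z.shape, kl.shape)  # (4, 512) (4, 256) (4,)
```

The KL term is what a bottleneck objective would penalise to discard task-irrelevant variation; how it is weighted against the task losses is not specified in this outline.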
3.2. Problem Formulation
3.3. Modality-Specific Encoders
3.3.1. Video Encoder
3.3.2. Network Encoder
3.3.3. User Behavior Encoder
3.3.4. Audio Encoder
3.3.5. Depth, Eye-Tracking and Semantics Encoders
3.4. Cross-Modal Fusion with Information Bottleneck
3.4.1. Cross-Modal Attention
3.4.2. Variational Information Bottleneck
3.4.3. Uncertainty Estimation
3.5. Task-Specific Decoders
3.5.1. Viewport Prediction Decoder
3.5.2. Bitrate Allocation Decoder
3.5.3. Tile Selection and QoE Decoders
3.6. Meta-Learning Adapter for Personalization
3.6.1. Episodic Memory Module
3.6.2. Hypernetwork Adaptation
3.6.3. MAML Bi-Level Optimization
3.7. Volumetric Video Extension
3.8. Training Objective and Procedure
Algorithm 1. ImmerseFM-3D Multi-Stage Training
- Input: dataset; user splits.
- Output: optimised parameters.
- Stage 1: Pre-training
  - Randomly initialise all networks.
  - For each pre-training epoch:
    - Sample a batch.
    - Encode each modality with its encoder.
    - Fuse the encoded features.
    - Compress the fused representation via reparameterisation.
    - Decode the task outputs.
    - Compute the loss.
    - Update the parameters via AdamW.
- Stage 2: Meta-training
  - For each meta-training epoch:
    - Sample a batch of users.
    - For each user: sample a support set (5 samples) and a query set, compute the adapted parameters, and compute the query loss.
    - Update the base parameters.
- Stage 3: Fine-tuning
  - Set the fine-tuning learning rate and jointly optimise all losses for the fine-tuning epochs.
  - Return the optimised parameters.
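The Stage 2 loop (adapt on a small support set, evaluate on a query set, update the base parameters) can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in rather than the paper's procedure: each "user" is a hypothetical linear-regression task, the model is a single weight vector, and the outer update uses the first-order MAML approximation (second-order terms are dropped). Only the support-set size of 5 comes from Algorithm 1; the query size, learning rates, and batch size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)


def mse_grad(w, x, y):
    """Gradient of the mean squared error for the linear model y ≈ x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(y)


def sample_user(rng, dim=3):
    """Hypothetical user: a linear task whose weights lie near a shared mean."""
    w_true = np.ones(dim) + 0.3 * rng.standard_normal(dim)

    def draw(n):
        x = rng.standard_normal((n, dim))
        return x, x @ w_true

    return draw


def meta_step(w, users, inner_lr=0.05, outer_lr=0.05):
    """One first-order MAML update over a batch of users."""
    meta_grad = np.zeros_like(w)
    for draw in users:
        x_s, y_s = draw(5)    # support set (5 samples, as in Stage 2)
        x_q, y_q = draw(20)   # query set (size is an assumption)
        w_adapted = w - inner_lr * mse_grad(w, x_s, y_s)  # inner adaptation step
        meta_grad += mse_grad(w_adapted, x_q, y_q)        # query-loss gradient
    return w - outer_lr * meta_grad / len(users)          # base-parameter update


w = np.zeros(3)
for _ in range(300):
    w = meta_step(w, [sample_user(rng) for _ in range(8)])
print(np.round(w, 2))
```

After meta-training, `w` settles near the mean of the per-user weights, which is the initialisation from which a single inner step adapts well to a new user of the same family.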
3.9. Implementation Details
4. Experimental Setup
4.1. IMMERSE-1M Dataset
4.1.1. Video Content
4.1.2. User Behavior Data
4.1.3. Network Traces
4.1.4. Auxiliary Modalities and Dataset Splits
4.1.5. Dataset Demographics and Cultural Diversity
- Western European cohort (n = 356, 68%): Corbillon et al. [5] (59 participants, Télécom Paris), David et al. [38] (57 participants, University of Nantes), Rai et al. [47] (40 participants, University of Nantes), and new recordings collected at the University of Strathclyde, Glasgow (200 participants recruited under institutional IRB approval).
- VP MAE at 1 s: East Asian 5.34°, Western European 5.15°, overall 5.21°. The 0.19° inter-group gap is small relative to the 2.62° margin over the best baseline (STMRQ: 7.83°).
- QoE Pearson r: East Asian 0.878, Western European 0.897, overall 0.891. The 0.019 inter-group gap is below the conventional threshold of practical significance (0.05).
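The overall figures above are consistent with a participant-weighted average of the two cohorts. The short check below assumes that weighting, and it assumes the East Asian cohort consists of the Wu et al. and Xu et al. participants (48 + 120), with the remaining four sources forming the Western European cohort; neither assumption is stated explicitly in this section.

```python
# Cohort sizes assumed from the user-behavior sources listed for IMMERSE-1M.
n_east_asian = 48 + 120                # Wu et al. + Xu et al. (assumed grouping)
n_west_european = 59 + 57 + 40 + 200   # Corbillon, David, Rai, new recordings
n_total = n_east_asian + n_west_european


def pooled(east_asian, west_european):
    """Participant-weighted mean of a per-cohort metric."""
    return (n_east_asian * east_asian + n_west_european * west_european) / n_total


share_we = 100 * n_west_european / n_total
vp_mae = pooled(5.34, 5.15)    # degrees, at 1 s
qoe_r = pooled(0.878, 0.897)   # Pearson correlation

print(f"{share_we:.0f}% {vp_mae:.2f} {qoe_r:.3f}")  # prints "68% 5.21 0.891"
```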
4.2. Evaluation Protocols
4.3. Baseline Methods
4.4. Evaluation Metrics
4.4.1. Viewport Prediction
4.4.2. Bitrate Allocation
- Top-k Accuracy:
- Mean Absolute Error:
- Bandwidth Violation Rate:
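The formulas for these three metrics did not survive in this version of the text. The sketch below uses their common definitions as an assumption: top-k accuracy as the fraction of samples whose true bitrate class is among the k highest-scored classes, MAE as the mean absolute difference between predicted and target values, and bandwidth violation rate as the fraction of segments whose allocated bitrate exceeds the available bandwidth. The function names and the example data are illustrative.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true class is among the k highest scores."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean(np.any(top_k == labels[:, None], axis=1)))


def mean_absolute_error(pred, target):
    """Mean absolute difference between predictions and targets."""
    return float(np.mean(np.abs(pred - target)))


def bandwidth_violation_rate(allocated, available):
    """Fraction of segments whose allocated bitrate exceeds the bandwidth."""
    return float(np.mean(allocated > available))


# Three samples scored over four bitrate classes (made-up numbers).
scores = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.5, 0.2, 0.2, 0.1],
                   [0.1, 0.2, 0.3, 0.4]])
labels = np.array([1, 2, 2])

acc1 = top_k_accuracy(scores, labels, k=1)   # only the first sample is correct
acc3 = top_k_accuracy(scores, labels, k=3)   # every true class is in the top 3

allocated = np.array([4.0, 6.0, 8.0, 3.0])   # Mbps, illustrative
available = np.array([5.0, 5.0, 9.0, 3.0])   # Mbps, illustrative
bvr = bandwidth_violation_rate(allocated, available)   # one violation in four
mae = mean_absolute_error(allocated, available)

print(round(acc1, 3), acc3, bvr, mae)
```

In this reading, a strict inequality means that an allocation exactly equal to the available bandwidth does not count as a violation; a deployment might prefer a safety margin instead.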
4.4.3. QoE Estimation
4.4.4. Tile Selection
4.4.5. 6DoF Volumetric Metrics
5. Results
5.1. Experiment 1: Generalization to Unseen Content
5.1.1. Viewport Prediction
5.1.2. Bitrate Allocation and Tile Selection
5.1.3. QoE Estimation
5.1.4. Per-Category Generalization
5.2. Experiment 2: Few-Shot Personalization
5.3. Experiment 3: Cross-Format Transfer
5.4. Experiment 4: Volumetric Video Streaming
5.5. Experiment 5: Ablation Studies
5.6. Computational Efficiency
6. Discussion
6.1. Interpretation of Main Results
6.2. Cross-Modal Synergies
6.3. Role of the Information Bottleneck
6.4. Meta-Learning and Personalization Dynamics
6.5. Volumetric and 6DoF Streaming Analysis
6.6. Computational Considerations and Deployment Pathways
6.7. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| 6DoF | Six degrees of freedom |
| ABR | Adaptive bitrate |
| BVR | Bandwidth violation rate |
| CEAP | Continuous physiological and behavioral emotion annotation |
| CLIP | Contrastive language–image pre-training |
| CMMST | Cross-modal multi-scale transformer |
| CNN | Convolutional neural network |
| CPU | Central processing unit |
| D-SAV360 | Dataset of scanpaths on ambisonic videos |
| DASH | Dynamic adaptive streaming over HTTP |
| EDA | Electrodermal activity |
| EMD | Empirical mode decomposition |
| EPASS360 | Ensemble prediction and allocation streaming system for 360° video |
| ERP | Equirectangular projection |
| FOA | First-order ambisonics |
| FoV | Field of view |
| GFLOPs | Giga floating-point operations |
| GNN | Graph neural network |
| GOP | Group of pictures |
| GPU | Graphics processing unit |
| GRU | Gated recurrent unit |
| HMD | Head-mounted display |
| HTTP | Hypertext transfer protocol |
| IB | Information bottleneck |
| IMMERSE | Immersive multi-modal evaluation and representation for streaming |
| ITU | International Telecommunication Union |
| JUST360 | Joint utility streaming for 360° video |
| KKT | Karush–Kuhn–Tucker |
| KL | Kullback–Leibler (divergence) |
| LSTM | Long short-term memory |
| LTE | Long-term evolution |
| MADRL | Multi-agent deep reinforcement learning |
| MAE | Mean absolute error |
| MAML | Model-agnostic meta-learning |
| MDA | Multi-dimensional attention |
| MEC | Mobile edge computing |
| MFTR | Multi-modal fusion transformer |
| MLP | Multi-layer perceptron |
| MOS | Mean opinion score |
| MPEG | Moving Picture Experts Group |
| MSE | Mean squared error |
| OMAF | Omnidirectional media application format |
| P2P-PSNR | Point-to-plane peak signal-to-noise ratio |
| PSNR | Peak signal-to-noise ratio |
| QoE | Quality of experience |
| RGB | Red–green–blue |
| RL | Reinforcement learning |
| RMSE | Root mean squared error |
| RNN | Recurrent neural network |
| ROI | Region of interest |
| RTT | Round-trip time |
| SOTA | State of the art |
| SRD | Spatial relationship description |
| STMRQ | Spatiotemporal motion-aware rate-quality predictor |
| SVC | Scalable video coding |
| TCN | Temporal convolutional network |
| VATP360 | Viewport adaptive 360° video streaming based on tile priority |
| VP | Viewport prediction |
| VPT360 | Viewport prediction transformer for 360° video |
| VQA | Visual quality assessment |
| VR | Virtual reality |
References
1. Geyer, C.; Daniilidis, K. Omnidirectional video. Vis. Comput. 2003, 19, 405–416.
2. Argyriou, L.; Economou, D.; Bouki, V. Design methodology for 360° immersive video applications: The case study of a cultural heritage virtual tour. Pers. Ubiquitous Comput. 2020, 24, 843–859.
3. Hendriks Vettehen, P.; Wiltink, D.; Huiskamp, M.; Schaap, G.; Ketelaar, P. Taking the full view: How viewers respond to 360-degree video news. Comput. Hum. Behav. 2019, 91, 24–32.
4. Sitzmann, V.; Serrano, A.; Pavel, A.; Agrawala, M.; Gutierrez, D.; Masia, B.; Wetzstein, G. Saliency in VR: How Do People Explore Virtual Environments? IEEE Trans. Vis. Comput. Graph. 2018, 24, 1633–1642.
5. Corbillon, X.; Simon, G.; Devlic, A.; Chakareski, J. Viewport-Adaptive Navigable 360-Degree Video Delivery. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; IEEE: New York, NY, USA, 2017; pp. 1–7.
6. Xie, L.; Xu, Z.; Ban, Y.; Zhang, X.; Guo, Z. 360ProbDASH: Improving QoE of 360 Video Streaming Using Tile-based HTTP Adaptive Streaming. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 315–323.
7. Skupin, R.; Sanchez, Y.; Jiao, L.; Hellge, C.; Schierl, T. Tile-Based Rate Assignment for 360-Degree Video Based on Spatio-Temporal Activity Metrics. In Proceedings of the 2018 IEEE International Symposium on Multimedia (ISM), Taichung, Taiwan, 10–12 December 2018; IEEE: New York, NY, USA, 2018; pp. 65–68.
8. Son, J.; Ryu, E.S. Tile-based 360-degree video streaming for mobile virtual reality in cyber physical system. Comput. Electr. Eng. 2018, 72, 361–368.
9. De La Fuente, Y.S.; Bhullar, G.S.; Skupin, R.; Hellge, C.; Schierl, T. Delay Impact on MPEG OMAF’s Tile-Based Viewport-Dependent 360° Video Streaming. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 18–28.
10. Ozcinar, C.; Cabrera, J.; Smolic, A. Viewport-Aware Omnidirectional Video Streaming Using Visual Attention and Dynamic Tiles. In Proceedings of the 2018 7th European Workshop on Visual Information Processing (EUVIP), Tampere, Finland, 26–28 November 2018; IEEE: New York, NY, USA, 2018; pp. 1–6.
11. Jiang, X.; Naas, S.A.; Chiang, Y.H.; Sigg, S.; Ji, Y. SVP: Sinusoidal Viewport Prediction for 360-Degree Video Streaming. IEEE Access 2020, 8, 164471–164481.
12. Bao, Y.; Wu, H.; Zhang, T.; Ramli, A.A.; Liu, X. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; IEEE: New York, NY, USA, 2016; pp. 1161–1170.
13. Xu, M.; Song, Y.; Wang, J.; Qiao, M.; Huo, L.; Wang, Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2693–2708.
14. Hu, H.N.; Lin, Y.C.; Liu, M.Y.; Cheng, H.T.; Chang, Y.J.; Sun, M. Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 1396–1405.
15. Chao, F.Y.; Ozcinar, C.; Smolic, A. Transformer-based Long-Term Viewport Prediction in 360° Video: Scanpath is All You Need. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 6–8 October 2021; IEEE: New York, NY, USA, 2021; pp. 1–6.
16. Zhang, Z.; Chen, Y.; Zhang, W.; Yan, C.; Zheng, Q.; Wang, Q.; Chen, W. Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 3560–3568.
17. Tian, Y.; Zhong, Y.; Han, Y.; Chen, F. Viewport prediction with cross modal multiscale transformer for 360° video streaming. Sci. Rep. 2025, 15, 30346.
18. Guo, Y.; Xu, M.; Jiang, L.; Deng, X.; Zhou, J.; Chen, G.; Sigal, L. Proposal With Alignment: A Bi-Directional Transformer for 360° Video Viewport Proposal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11423–11437.
19. Lyu, J. A MDA-based multi-modal fusion model for panoramic viewport prediction. Adv. Eng. Innov. 2024, 15, 1–8.
20. Ghosh, A.; Aggarwal, V.; Qian, F. A Robust Algorithm for Tile-based 360-degree Video Streaming with Uncertain FoV Estimation. arXiv 2018, arXiv:1812.00816.
21. Zhao, L.; Cui, Y.; Liu, Z.; Zhang, Y.; Yang, S. Adaptive Streaming of 360 Videos with Perfect, Imperfect, and Unknown FoV Viewing Probabilities in Wireless Networks. IEEE Trans. Image Process. 2021, 30, 7744–7759.
22. Feng, W.; Wang, S.; Dai, Y. Adaptive 360-Degree Streaming: Optimizing With Multi-Window and Stochastic Viewport Prediction. IEEE Trans. Mob. Comput. 2025, 24, 5903–5915.
23. Setayesh, M.; Wong, V.W.S. Viewport Prediction, Bitrate Selection, and Beamforming Design for THz-Enabled 360° Video Streaming. IEEE Trans. Wirel. Commun. 2025, 24, 1849–1865.
24. Park, S.; Hoai, M.; Bhattacharya, A.; Das, S.R. Adaptive Streaming of 360-Degree Videos with Reinforcement Learning. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; IEEE: New York, NY, USA, 2021; pp. 1838–1847.
25. Ao, A.; Park, S. Applying Transformer-Based Computer Vision Models to Adaptive Bitrate Allocation for 360 Live Streaming. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
26. Nguyen, D.V.; Tran, H.T.T.; Thang, T.C. An Evaluation of Tile Selection Methods for Viewport-Adaptive Streaming of 360-Degree Video. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–24.
27. Pang, Z. VATP360: Viewport Adaptive 360-Degree Video Streaming based on Tile Priority. arXiv 2023, arXiv:2307.15984.
28. Croci, S.; Ozcinar, C.; Zerman, E.; Knorr, S.; Cabrera, J.; Smolic, A. Visual attention-aware quality estimation framework for omnidirectional video using spherical Voronoi diagram. Qual. User Exp. 2020, 5, 4.
29. Van Kasteren, A.; Brunnström, K.; Hedlund, J.; Snijders, C. Quality of experience of 360 video: Subjective and eye-tracking assessment of encoding and freezing distortions. Multimed. Tools Appl. 2022, 81, 9771–9802.
30. Chiariotti, F. A survey on 360-degree video: Coding, quality of experience and streaming. Comput. Commun. 2021, 177, 133–155.
31. Shen, G.; Ma, M.; Xu, G. An Optimal SVC Bitstream Schema for Viewport-dependent 360-degree Video Streaming. arXiv 2023, arXiv:2304.05654.
32. Romero Rondon, M.F.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. TRACK: A New Method from a Re-examination of Deep Architectures for Head Motion Prediction in 360-degree Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5681–5699.
33. Jiang, Y.; Poularakis, K.; Kiedanski, D.; Kompella, S.; Tassiulas, L. Robust and Resource-efficient Machine Learning Aided Viewport Prediction in Virtual Reality. arXiv 2022, arXiv:2212.09945.
34. Feng, X.; Swaminathan, V.; Wei, S. Viewport Prediction for Live 360-Degree Mobile Video Streaming Using User-Content Hybrid Motion Tracking. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2019, 3, 1–22.
35. Zhang, L.; Chen, P.; Zhang, C.; Pan, C.; Long, T.; Xu, W.; Cui, L.; Liu, J. Optimizing Mobile-Friendly Viewport Prediction for Live 360-Degree Video Streaming. IEEE Trans. Mob. Comput. 2025, 24, 10441–10455.
36. Shimamura, R.; Feng, Q.; Koyama, Y.; Nakatsuka, T.; Fukayama, S.; Hamasaki, M.; Goto, M.; Morishima, S. Audio–visual object removal in 360-degree videos. Vis. Comput. 2020, 36, 2117–2128.
37. Liu, H.; Luo, T.; Luo, K.; Jiang, Q.; Sun, P.; Wang, J.; Huang, R.; Chen, Q.; Wang, W.; Li, X.; et al. OmniAudio: Generating Spatial Audio from 360-Degree Video. arXiv 2025, arXiv:2504.14906.
38. David, E.J.; Gutiérrez, J.; Coutrot, A.; Da Silva, M.P.; Callet, P.L. A dataset of head and eye movements for 360° videos. In Proceedings of the 9th ACM Multimedia Systems Conference, Amsterdam, The Netherlands, 12–15 June 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 432–437.
39. Xu, Y.; Du, J.; Wang, J.; Ning, Y.; Zhou, S.; Cao, Y. Panonut360: A Head and Eye Tracking Dataset for Panoramic Video. In Proceedings of the ACM Multimedia Systems Conference 2024, Bari, Italy, 15–18 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 319–325.
40. Zhang, G.; Wu, C.; Gao, Q. Exploiting layer and spatial correlations to enhance SVC and tile based 360-degree video streaming. Comput. Netw. 2021, 191, 107985.
41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
42. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
43. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv 2017, arXiv:1703.03400.
44. Li, J.; Li, Z.; Liu, Z.; Zhou, P.; Hong, R.; Li, Q.; Hu, H. Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information. arXiv 2025, arXiv:2311.16462.
45. Xue, T.; El Ali, A.; Zhang, T.; Ding, G.; Cesar, P. CEAP-360VR: A Continuous Physiological and Behavioral Emotion Annotation Dataset for 360° VR Videos. IEEE Trans. Multimed. 2023, 25, 243–255.
46. Wu, C.; Tan, Z.; Wang, Z.; Yang, S. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 193–198.
47. Rai, Y.; Gutiérrez, J.; Le Callet, P. A Dataset of Head and Eye Movements for 360 Degree Images. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 205–210.
48. Martinez, J.; Black, M.J.; Romero, J. On Human Motion Prediction Using Recurrent Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 4674–4683.
49. Manfredi, G.; Racanelli, V.A.; De Cicco, L.; Mascolo, S. LSTM-based Viewport Prediction for Immersive Video Systems. In Proceedings of the 2023 21st Mediterranean Communication and Computer Networking Conference (MedComNet), Island of Ponza, Italy, 13–15 June 2023; IEEE: New York, NY, USA, 2023; pp. 49–52.
50. Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23.
51. Gao, B.; Sheng, D.; Zhang, L.; Qi, Q.; He, B.; Zhuang, Z.; Wang, J. STAR-VP: Improving Long-term Viewport Prediction in 360° Videos via Space-aligned and Time-varying Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 5556–5565.
52. Wang, M.; Peng, S.; Chen, X.; Zhao, Y.; Xu, M.; Xu, C. CoLive: An Edge-Assisted Online Learning Framework for Viewport Prediction in 360° Live Streaming. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6.
53. Wahba, M.Z.A.; Baldoni, S.; Battisti, F. Learning-Based Viewport Prediction for 360-Degree Videos: A Review. Electronics 2025, 14, 3743.
54. Qian, F.; Han, B.; Xiao, Q.; Gopalakrishnan, V. Flare: Practical Viewport-Adaptive 360-Degree Video Streaming for Mobile Devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, New Delhi, India, 29 October–2 November 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 99–114.
55. Li, Z.; Wang, Y.; Liu, Y.; Li, J.; Zhu, P. JUST360: Optimizing 360-Degree Video Streaming Systems with Joint Utility. IEEE Trans. Broadcast. 2024, 70, 468–481.
56. Zou, J.; Li, C.; Liu, C.; Yang, Q.; Xiong, H.; Steinbach, E. Probabilistic Tile Visibility-Based Server-Side Rate Adaptation for Adaptive 360-Degree Video Streaming. IEEE J. Sel. Top. Signal Process. 2020, 14, 161–176.
57. Kumar, S.; Bhagat, L.; A., A.F.; Jin, J. Multi-neural network based tiled 360° video caching with Mobile Edge Computing. J. Netw. Comput. Appl. 2022, 201, 103342.
58. Qiu, M.; Shao, F. Blind 360-degree image quality assessment via saliency-guided convolution neural network. Optik 2021, 240, 166858.
59. Chen, P.W.; Yang, T.S.; Huang, G.L.; Huang, C.W.; Chao, Y.C.; Lu, C.H.; Wu, P.Y. Viewing Bias Matters in 360 Videos Visual Saliency Prediction. IEEE Access 2023, 11, 46084–46094.
60. Vats, S.; Park, J.; Nahrstedt, K.; Zink, M.; Sitaraman, R.; Hellwagner, H. Semantic-Aware View Prediction for 360-Degree Videos at the 5G Edge. In Proceedings of the 2022 IEEE International Symposium on Multimedia (ISM), Naples, Italy, 5–7 December 2022; IEEE: New York, NY, USA, 2022; pp. 121–128.
61. Adhuran, J.; Martini, M.G. Efficient viewport prediction and tiling schemes for 360 degree video streaming. In Proceedings of the ACM Multimedia Systems Conference 2024, Bari, Italy, 15–18 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 374–380.
62. Xu, Y.; Dong, Y.; Wu, J.; Sun, Z.; Shi, Z.; Yu, J.; Gao, S. Gaze Prediction in Dynamic 360° Immersive Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 5333–5342.
63. Chao, F.Y.; Ozcinar, C.; Smolic, A. A QoE Study for Viewport Prediction Approaches in Tile-Based 360 Video Streaming. 2023. Available online: https://www.researchgate.net/publication/366974389_A_QoE_Study_for_Viewport_Prediction_Approaches_in_Tile-based_360_Video_Streaming?channel=doi&linkId=63bc3ffcc3c99660ebdf4b3f&showFulltext=true (accessed on 15 March 2026).
64. Zhou, H.; Zhao, F.; Li, C. Multi-scale Historical Trajectory Decomposition for Viewport Prediction in 360-degree Videos. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 1–22.
65. Liu, S.; Wang, Y.; Li, S.; Liu, Y. MADRL-based bitrate allocation for QoE fairness in 360° video streaming with viewport prediction. Multimed. Syst. 2025, 31, 343.
66. Corbillon, X.; De Simone, F.; Simon, G. 360-Degree Video Head Movement Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 199–204.

| Method | VP | ABR | TS | QoE | MM | Pers. | 6DoF |
|---|---|---|---|---|---|---|---|
| VPT360 [15] | ✓ | – | – | – | – | – | – |
| MFTR [16] | ✓ | – | ∘ | – | ∘ | – | – |
| CMMST [17] | ✓ | – | – | – | ∘ | – | – |
| TRACK [32] | ✓ | – | – | – | – | – | – |
| STAR-VP [51] | ✓ | – | – | – | ∘ | – | – |
| 360ProbDASH [6] | – | ✓ | ✓ | – | – | – | – |
| JUST360 [55] | – | ✓ | ✓ | – | – | – | – |
| RL-Streaming [24] | – | ✓ | – | – | – | – | – |
| VATP360 [27] | – | ✓ | ✓ | – | – | – | – |
| Voronoi-VQA [28] | – | – | – | ✓ | – | – | – |
| CEAP-360VR [45] | – | – | – | ✓ | ✓ | – | – |
| CoLive [52] | ✓ | – | – | – | – | ✓ | – |
| MAML [43] | ∘ | – | – | – | – | ✓ | – |
| STVP [44] | ✓ | – | – | – | – | – | ✓ |
| ImmerseFM-3D (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

| Modality | Architecture | Input Size | Params (M) |
|---|---|---|---|
| Video | 3D Swin Transformer (4 stages) | | 42.3 |
| Network | Dilated TCN (4 layers) | | 1.2 |
| Head Motion | Transformer Encoder (6L, 8H) | | 8.7 |
| Audio | ResNet-50 + Ambisonics MLP | Mel-spectrogram | 24.6 |
| Depth | 3D CNN (5 layers) | | 3.8 |
| Eye Tracking | BiLSTM (2 layers) | | 2.1 |
| Scene Semantics | CLIP ViT-B/32 + Transformer (frozen) | Key frames @ 1 fps | 86.4 |

| Component | Architecture | Params | Output Dim |
|---|---|---|---|
| Video Encoder | 3D Swin-T (4 stages) | 42.3 M | |
| Network Encoder | Dilated TCN (4 layers) | 1.2 M | |
| User Encoder | Transformer (6L, 8H) | 8.7 M | |
| Audio Encoder | ResNet-50 + Amb. MLP | 24.6 M | |
| Depth Encoder | 3D CNN (5 layers) | 3.8 M | |
| Eye Tracking Encoder | BiLSTM (2 layers) | 2.1 M | |
| Scene Semantics | CLIP ViT-B/32 + Trans. (frozen) | 86.4 M | |
| Cross-Modal Attention | 4 layers, 8 heads | 4.2 M | |
| Information Bottleneck | MLP | 0.13 M | |
| Task Decoders (4×) | MLP + GRU | 1.8 M | Task-dependent |
| Meta-Learning Adapter | HyperNet + Episodic Mem. | 2.3 M | |
| Total | | ≈177 M | |

| Category | Primary Source | Duration | Resolution | Format |
|---|---|---|---|---|
| 360° Static | Wu et al. [46]; Corbillon et al. [5] | 200 h | 4–8 K | ERP, Cubemap |
| 360° Dynamic | Xu et al. [62]; David et al. [38] | 250 h | 4–8 K | ERP |
| 360° High Motion | Rai et al. [47] | 150 h | 6–8 K | ERP |
| Volumetric | 8i Voxelized; Microsoft RGB-D | 200 h | | Point clouds, meshes |
| Light Field | Stanford Lytro; new captures | 100 h | views | LF images |
| Synthetic | Unreal Engine renders | 100 h | 4–8 K | ERP + depth |
| Total | | 1000 h | | |

| Source | Users | Avg. Duration | Modalities | Rate |
|---|---|---|---|---|
| Wu et al. [46] | 48 | 60 s/video | Head (6DoF) | 30 Hz |
| Corbillon et al. [5] | 59 | 70 s/video | Head (3DoF) | 30 Hz |
| Xu et al. [62] | 120 | 180 s/video | Head + Eye | 90/30 Hz |
| David et al. [38] | 57 | 30–60 s | Head + Eye | 60 Hz |
| Rai et al. [47] | 40 | 20 s/image | Head + Eye | 30 Hz |
| New recordings | 200 | 300 s/video | Head + Eye + Pupillometry | 90 Hz |
| Total | 524 | ≈1200 h | | |

| Source | Type | Samples | BW Range | RTT Range |
|---|---|---|---|---|
| FCC 4G | Real-world | 3000 | 0.5–50 Mbps | 20–150 ms |
| 5G mmWave (ns-3) | Simulated | 2000 | 10–800 Mbps | 5–30 ms |
| HSDPA | Real-world | 2000 | 0.1–10 Mbps | 50–500 ms |
| WiFi (campus) | Real-world | 1500 | 1–200 Mbps | 2–50 ms |
| Starlink | Real-world | 500 | 20–200 Mbps | 20–80 ms |
| Controlled synthetic | Generated | 1000 | Variable patterns | Variable |
| Total | | 10,000 | | |

| Split | Videos | Users | Net. Traces | Purpose |
|---|---|---|---|---|
| Pre-training | 80% (800 h) | 80% (419) | 80% (8000) | Model pre-training (Stage 1) |
| Meta-training | 10% (100 h) | 10% (52) | 10% (1000) | MAML adaptation (Stage 2) |
| Test | 10% (100 h) | 10% (53) | 10% (1000) | Final held-out evaluation |
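The user counts in the split table (419, 52, and 53 of 524 participants) are what an 80/10/10 partition yields when the first two parts are rounded and the test set takes the remainder. The sketch below shows one way to produce such a user-disjoint split; the function name, the shuffling, and the seed are assumptions for illustration, not the authors' procedure.

```python
import numpy as np


def split_users(n_users, fractions=(0.8, 0.1), seed=0):
    """Shuffle user indices and cut them into three disjoint groups.

    The first two group sizes are rounded from the given fractions and
    the last group receives the remainder, so every user is used once.
    """
    rng = np.random.default_rng(seed)
    ids = rng.permutation(n_users)
    n_pre = round(fractions[0] * n_users)
    n_meta = round(fractions[1] * n_users)
    return ids[:n_pre], ids[n_pre:n_pre + n_meta], ids[n_pre + n_meta:]


pre, meta, test = split_users(524)
print(len(pre), len(meta), len(test))  # prints "419 52 53"
```

Splitting at the user level, rather than at the sample level, keeps every test participant unseen during pre-training and meta-training.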

| Metric | East Asian (n = 168) | W. European (n = 356) | p-Value | Cohen’s d |
|---|---|---|---|---|
| Equatorial bias (%) | | | 0.014 | 0.26 |
| Exploration radius (°) | | | 0.003 | 0.30 |
| Saccade freq. (min−1) | | | 0.001 | 0.43 |

| Variant | Description |
|---|---|
| Full model | Complete ImmerseFM-3D (all components) |
| w/o cross-modal | Separate task-specific encoders; no cross-modal fusion |
| w/o bottleneck | Direct feature pass-through; no information bottleneck |
| w/o meta-learning | No personalization adapter; fixed population-level model |
| w/o uncertainty | Deterministic latent (single sample, ) |
| w/o audio | Audio encoder removed; 6 modalities only |
| w/o depth | Depth encoder removed; 2D video content only |
| w/o semantics | CLIP scene encoder removed; no semantic features |
| Single-task | Separate independently trained model per task |

| Method | Description | Reference |
|---|---|---|
| VPT360 | Transformer-based trajectory prediction | Chao et al. [15] |
| MFTR | Multi-modal fusion transformer for viewport estimation | Zhang et al. [16] |
| DHP | Deep reinforcement learning viewport prediction | Xu et al. [13] |
| EMD-ML | Empirical mode decomposition + LSTM | Zhou et al. [64] |
| STMRQ | Spatiotemporal graph convolutional network | Liu et al. [65] |

| Method | Description | Reference |
|---|---|---|
| 360ProbDASH | Probabilistic tile pre-fetching ABR | Xie et al. [6] |
| Flare | Viewport-adaptive bitrate streaming | Qian et al. [54] |
| EPASS360 | Ensemble viewport prediction + allocation | Zhang et al. [40] |
| JUST360 | Joint utility optimization for 360° streaming | Li et al. [55] |
| RL-Streaming | Deep RL for adaptive bitrate selection | Park et al. [24] |

| Method | Description | Reference |
|---|---|---|
| QoE Estimation | | |
| Voronoi-VQA | Visual attention-aware 360° quality assessment | Croci et al. [28] |
| Blind360 | Blind 360° image quality via saliency CNN | Qiu and Shao [58] |
| CEAP-360VR | Physiological + behavioral QoE estimation | Xue et al. [45] |
| Meta-Learning Personalization | | |
| MAML | Model-agnostic meta-learning | Finn et al. [43] |
| Reptile | First-order meta-learning (no second-order grad.) | Jiang et al. [33] |
| Fine-tuning | Standard transfer learning from pre-trained model | — |

| Method | 0.5 s | 1 s | 2 s | 3 s | 5 s |
|---|---|---|---|---|---|
| VPT360 [15] | 3.21 | 8.97 | 16.42 | 24.35 | 38.71 |
| MFTR | 2.84 | 7.90 | 14.63 | 22.10 | 35.44 |
| DHP | 3.47 | 9.21 | 17.05 | 25.80 | 40.12 |
| EMD-ML | 2.91 | 8.12 | 15.01 | 21.90 | 34.87 |
| STMRQ | 2.76 | 7.83 | 14.20 | 21.45 | 34.10 |
| ImmerseFM-3D (ours) | 1.93 † | 5.21 † | 9.87 † | 14.82 † | 24.16 † |
| ±95% CI | ±0.04 | ±0.09 | ±0.18 | ±0.26 | ±0.41 |
| Improvement vs. best | −30.1% | −33.5% | −30.5% | −31.0% | −29.2% |
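The "Improvement vs. best" row can be checked directly from the table. The sketch below assumes that the row reports the relative MAE reduction of ImmerseFM-3D against the lowest baseline MAE at each horizon; this reading is an interpretation rather than something stated next to the table, and the dictionary and function names are illustrative only.

```python
# Viewport-prediction MAE values copied from the table above, one entry per
# prediction horizon (0.5 s, 1 s, 2 s, 3 s, 5 s).
BASELINES = {
    "VPT360": [3.21, 8.97, 16.42, 24.35, 38.71],
    "MFTR":   [2.84, 7.90, 14.63, 22.10, 35.44],
    "DHP":    [3.47, 9.21, 17.05, 25.80, 40.12],
    "EMD-ML": [2.91, 8.12, 15.01, 21.90, 34.87],
    "STMRQ":  [2.76, 7.83, 14.20, 21.45, 34.10],
}
OURS = [1.93, 5.21, 9.87, 14.82, 24.16]
REPORTED = [30.1, 33.5, 30.5, 31.0, 29.2]  # magnitudes of the reported row


def improvement_vs_best(baselines, ours):
    """Relative MAE reduction (%) against the best baseline per horizon.

    Assumption: "best" means the lowest baseline MAE at that horizon.
    """
    reductions = []
    for i, mae in enumerate(ours):
        best = min(values[i] for values in baselines.values())
        reductions.append(100.0 * (best - mae) / best)
    return reductions


if __name__ == "__main__":
    for got, ref in zip(improvement_vs_best(BASELINES, OURS), REPORTED):
        print(f"recomputed {got:.2f}%  reported {ref:.1f}%")
```

Recomputing from the rounded table entries reproduces the reported row to within about 0.1 percentage points. Gaps of that size would be expected if the reported figures were derived from unrounded MAE values.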
| Method | Acc@1 ↑ | Acc@3 ↑ | BVR ↓ | MAE ↓ | F1 ↑ | IoU ↑ |
|---|---|---|---|---|---|---|
| 360ProbDASH [6] | 70.3 | 88.4 | 14.2 | 0.91 | 0.71 | 0.59 |
| Flare [54] | 74.8 | 90.1 | 11.7 | 0.84 | 0.74 | 0.62 |
| EPASS360 [40] | 77.1 | 91.8 | 10.3 | 0.78 | 0.77 | 0.65 |
| JUST360 [55] | 81.2 | 93.5 | 8.3 | 0.71 | 0.82 | 0.70 |
| RL-Streaming | 79.4 | 92.7 | 9.1 | 0.75 | 0.79 | 0.67 |
| ImmerseFM-3D | 92.1 † | 97.6 † | 3.1 † | 0.43 † | 0.91 † | 0.84 † |
| ±95% CI | ±0.4 | ±0.2 | ±0.1 | ±0.02 | ±0.01 | ±0.01 |
| Method | Pearson | Spearman | RMSE ↓ |
|---|---|---|---|
| Voronoi-VQA [28] | 0.763 | 0.741 | 0.521 |
| Blind360 [58] | 0.712 | 0.694 | 0.567 |
| CEAP-360VR | 0.741 | 0.728 | 0.538 |
| ImmerseFM-3D | 0.891 † | 0.874 † | 0.312 † |
| ±95% CI | ±0.008 | ±0.009 | ±0.011 |
| Method | 1 s | 5 s | 10 s | 30 s | 90%-Threshold (s) |
|---|---|---|---|---|---|
| Fine-tuning | +1.2 | +3.8 | +5.6 | +8.1 | >60 |
| Reptile | +2.1 | +5.3 | +7.4 | +10.2 | >60 |
| MAML [43] | +3.4 | +6.9 | +8.8 | +11.3 | 47 |
| ImmerseFM-3D | +4.8 † | +9.1 † | +11.7 † | +15.7 † | 22 † |
| Method | Zero-Shot | 1% | 5% | 10% | 25% |
|---|---|---|---|---|---|
| Train from scratch | — | 32 | 51 | 67 | 82 |
| Fine-tuning (pre-trained) | 48 | 61 | 74 | 84 | 92 |
| MAML transfer | 55 | 66 | 78 | 87 | 94 |
| ImmerseFM-3D | 72 † | 81 † | 89 † | 94 † | 97 † |
| Method | 6DoF VP Error: Pos. (cm) ↓ | 6DoF VP Error: [unlabeled] | Depth Estimation: RMSE ↓ | Depth Estimation: (%) ↑ | Volumetric QoE: P2P-PSNR ↑ | Volumetric QoE: MOS ↑ |
|---|---|---|---|---|---|---|
| 2D-only (no depth) | 18.4 | 12.3 | — | — | 28.3 | 0.641 |
| Depth-agnostic | 15.7 | 10.8 | 0.38 | 79.1 | 31.4 | 0.703 |
| ImmerseFM-3D | 9.3 † | 6.1 † | 0.21 † | 91.4 † | 38.7 † | 0.862 † |
| ±95% CI | ±0.4 | ±0.3 | ±0.01 | ±0.6 | ±0.4 | ±0.012 |
| Variant | VP MAE ↓ | Acc@1 ↑ | Tile F1 ↑ | QoE |
|---|---|---|---|---|
| Full model | 5.21 | 92.1 | 0.910 | 0.891 |
| w/o cross-modal | 7.83 (+50%) | 78.4 (−13.7) | 0.741 (−0.169) | 0.724 (−0.167) |
| w/o bottleneck | 6.94 (+33%) | 81.3 (−10.8) | 0.792 (−0.118) | 0.763 (−0.128) |
| w/o meta-learning | 5.21 (—) | 91.8 (−0.3) | 0.908 (−0.002) | 0.872 (−0.019) |
| w/o uncertainty | 5.47 (+5%) | 89.6 (−2.5) | 0.891 (−0.019) | 0.863 (−0.028) |
| w/o audio | 5.89 (+13%) | 87.4 (−4.7) | 0.871 (−0.039) | 0.844 (−0.047) |
| w/o depth | 5.64 (+8%) | 90.2 (−1.9) | 0.882 (−0.028) | 0.867 (−0.024) |
| w/o semantics | 6.41 (+23%) | 84.7 (−7.4) | 0.831 (−0.079) | 0.791 (−0.100) |
| Single-task | 7.21 (+38%) | 77.8 (−14.3) | 0.751 (−0.159) | 0.731 (−0.160) |
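The percentages in the VP MAE column of the ablation table match the relative increase in MAE over the full model. A minimal sketch of that arithmetic follows, assuming the percentage is computed as (variant − full) / full and rounded to the nearest integer; the names are illustrative and the values are copied from the table.

```python
# VP MAE of the full model and of each ablated variant, from the table above.
FULL_MAE = 5.21
VARIANT_MAE = {
    "w/o cross-modal": 7.83,
    "w/o bottleneck": 6.94,
    "w/o meta-learning": 5.21,
    "w/o uncertainty": 5.47,
    "w/o audio": 5.89,
    "w/o depth": 5.64,
    "w/o semantics": 6.41,
    "Single-task": 7.21,
}


def relative_increase(variant_mae, full_mae):
    """Percentage increase in MAE relative to the full model.

    Assumption: positive values mean the variant is worse (higher error).
    """
    return 100.0 * (variant_mae - full_mae) / full_mae


# Rounded to whole percentages, as in the table.
INCREASES = {
    name: round(relative_increase(mae, FULL_MAE))
    for name, mae in VARIANT_MAE.items()
}

if __name__ == "__main__":
    for name, pct in INCREASES.items():
        print(f"{name:18s} {pct:+d}%")
```

The accuracy, F1, and QoE columns are plain differences from the full model (for example, 78.4 − 92.1 = −13.7), so only the MAE column needs the relative form.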
| Removed Modality | VP MAE ↑ | BR Acc ↓ | Tile F1 ↓ | QoE ↓ | Avg. |
|---|---|---|---|---|---|
| None (full model) | 5.21° | 92.1% | 0.910 | 0.891 | — |
| Scene semantics (CLIP) | +23% | −7.4 pp | −0.079 | −0.100 | −15.4 pp |
| Audio (ambisonics) | +13% | −4.7 pp | −0.039 | −0.047 | −10.5 pp |
| Depth maps | +8% | −1.9 pp | −0.028 | −0.024 | −5.5 pp |
| Eye tracking | +5% | −2.5 pp | −0.019 | −0.028 | −4.9 pp |
| Context: component-level ablations | |||||
| w/o cross-modal fusion | +50% | −13.7 pp | −0.169 | −0.167 | −28.5 pp |
| w/o info. bottleneck | +33% | −10.8 pp | −0.118 | −0.128 | −21.8 pp |
| Single-task (no MTL) | +38% | −14.3 pp | −0.159 | −0.160 | −27.2 pp |
| Model | Platform | Lat. (ms) | Peak Mem. (MB) | GFLOPs | Rel. QoE |
|---|---|---|---|---|---|
| ImmerseFM-3D | A100 (training) | 8.2 | 6814 | 287 | 100% |
| ImmerseFM-3D | T4 (edge server) | 31.4 | 3142 | 287 | 100% |
| ImmerseFM-3D | iPhone 12 | 94.7 | 812 | 287 | 100% |
| ImmerseFM-3D | Pixel 6 | 88.3 | 742 | 287 | 100% |
| ImmerseFM-3D-Mobile | T4 (edge server) | 12.1 | 924 | 81 | 91% |
| ImmerseFM-3D-Mobile | iPhone 12 | 22.1 | 248 | 81 | 91% |
| ImmerseFM-3D-Mobile | Pixel 6 | 20.8 | 231 | 81 | 91% |
| ImmerseFM-3D-Mobile | HTC Vive Pro Eye | 18.6 | 194 | 81 | 91% |
| JUST360 [55] | T4 (edge) | 18.9 | 1203 | 124 | 88% |
| MFTR | T4 (edge) | 22.3 | 1847 | 163 | 85% |
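The efficiency table allows a direct comparison of the two model variants on the platforms where both were measured. The sketch below derives latency speedup, peak-memory saving, and FLOP reduction from the tabulated values; the dictionary layout and function name are assumptions made for illustration, and no figures beyond those in the table are used.

```python
# (latency in ms, peak memory in MB) copied from the efficiency table for
# the platforms on which both variants were measured.
FULL = {"T4": (31.4, 3142), "iPhone 12": (94.7, 812), "Pixel 6": (88.3, 742)}
MOBILE = {"T4": (12.1, 924), "iPhone 12": (22.1, 248), "Pixel 6": (20.8, 231)}
FULL_GFLOPS, MOBILE_GFLOPS = 287, 81


def compare(full, mobile):
    """Return {platform: (latency speedup, memory saving in %)}."""
    result = {}
    for platform, (lat_full, mem_full) in full.items():
        lat_mobile, mem_mobile = mobile[platform]
        speedup = lat_full / lat_mobile
        mem_saving = 100.0 * (1.0 - mem_mobile / mem_full)
        result[platform] = (speedup, mem_saving)
    return result


RATIOS = compare(FULL, MOBILE)
FLOP_REDUCTION = 100.0 * (1.0 - MOBILE_GFLOPS / FULL_GFLOPS)

if __name__ == "__main__":
    for platform, (speedup, saving) in RATIOS.items():
        print(f"{platform:10s} {speedup:.2f}x faster, {saving:.1f}% less memory")
    print(f"FLOP reduction: {FLOP_REDUCTION:.1f}%")
```

Under these figures the Mobile variant runs roughly 2.6 times faster on the T4 and about 4.2 to 4.3 times faster on the two phones, with around 70% less peak memory, at the 91% relative QoE reported in the table.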
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Gallena Watthage, R.S.; Fernando, A. ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding. Appl. Sci. 2026, 16, 3424. https://doi.org/10.3390/app16073424

