DiffVP: A Diffusion Model with Explicit Coordinate-Temporal Encoding for Viewport Prediction in 360° Videos
Abstract
1. Introduction
- DiffVP introduces diffusion models to viewport prediction in 360° videos, modeling future trajectories as probability distributions conditioned on historical trajectories and saliency information.
- The Explicit Coordinate-Temporal Encoding (ECTE) module models temporal and spatial features separately, capturing both the temporal dependencies of viewing trajectories and the spatial relationships among coordinates.
- The Coordinate-Aware Saliency Feature Fusion (CASF) module aligns saliency features with trajectory features across modalities, and performs feature interaction along the temporal and channel dimensions to strengthen the guidance that visual content provides for viewport prediction.
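The core idea of the first contribution — treating future trajectories as samples from a learned conditional distribution — rests on the standard DDPM forward process. The sketch below illustrates only that closed-form noising step; the step count, linear beta schedule, and trajectory shape are illustrative assumptions, not the authors' actual configuration, and the conditional denoising network itself is omitted.

```python
import numpy as np

# Minimal sketch of the DDPM forward (noising) process that a conditional
# diffusion predictor like DiffVP trains against. Schedule and shapes are
# assumptions for illustration, not the paper's configuration.

T = 1000                             # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative product, \bar{alpha}_t

def q_sample(x0, t, eps):
    """Closed-form forward diffusion:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
future = rng.standard_normal((25, 2))   # 25 future (yaw, pitch) points (illustrative)
eps = rng.standard_normal(future.shape)

x_noisy = q_sample(future, T - 1, eps)
# A denoiser would regress eps from (x_noisy, t, history, saliency).
# By the last step the signal coefficient is close to zero:
print(np.sqrt(alpha_bars[T - 1]))
```

Training then amounts to predicting `eps` given the noisy trajectory, the timestep, and the conditioning features; sampling runs the learned reverse process from pure noise.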
2. Related Work
2.1. Viewport Prediction in 360° Videos
2.1.1. Trajectory-Based Viewport Prediction Methods
2.1.2. Content-Based Viewport Prediction Methods
2.2. Diffusion Model
3. Method
3.1. Construction of Conditional Information
3.1.1. Trajectory Feature Extraction
3.1.2. Saliency Feature Processing
3.2. Explicit Coordinate-Temporal Encoding
3.3. Coordinate-Aware Saliency Feature Fusion
3.4. Trajectory Generation
3.5. Loss Function
4. Experiments
4.1. Datasets and Evaluation Metrics
- (1) David_MMSys [42] comprises nineteen 360° video clips, each with a duration of 20 s. The dataset records both head-tracking and eye-tracking data from 57 participants during free-viewing sessions, and focuses on free-viewing behavior with relatively homogeneous video durations.
- (2) Wu_MMSys [43] includes eighteen 360° videos covering five different scene categories (e.g., natural landscapes, urban architecture, sports events), and collects head-tracking data from 48 participants during video viewing.
- (3) Xu_PAMI [17] is the largest dataset in scale, comprising seventy-six 360° video clips of variable duration (ranging from 10 to 80 s, with an average length of 25 s). The video content covers a wide variety of scenes, and the dataset includes both head movement and eye movement data collected from 58 participants.
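The OD metric reported in the result tables is the orthodromic (great-circle) distance between the predicted and ground-truth viewport centers on the unit sphere. A straightforward implementation, assuming centers given as (longitude, latitude) in radians (the paper's exact angle convention is not restated here), is:

```python
import numpy as np

def orthodromic_distance(lon1, lat1, lon2, lat2):
    """Great-circle distance (in radians) between two viewport centers
    given as (longitude, latitude) on the unit sphere."""
    # Convert spherical coordinates to 3D unit vectors.
    p = np.array([np.cos(lat1) * np.cos(lon1),
                  np.cos(lat1) * np.sin(lon1),
                  np.sin(lat1)])
    q = np.array([np.cos(lat2) * np.cos(lon2),
                  np.cos(lat2) * np.sin(lon2),
                  np.sin(lat2)])
    # Angle between the two vectors; clip guards against floating-point
    # values just outside [-1, 1].
    return float(np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)))

# Antipodal points on the equator are pi apart:
print(orthodromic_distance(0.0, 0.0, np.pi, 0.0))  # prints 3.141592653589793
```

Lower OD means the predicted center lies closer to the true one; IoU instead measures the overlap of the corresponding viewport regions.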
4.2. Implementation Details
4.3. Comparison to State-of-the-Arts
- (1) TRACK [22]: This model utilizes three separate LSTM modules to handle trajectory features, visual features, and their concatenated features, aiming to dynamically balance the contributions of trajectory and visual information across different prediction horizons.
- (2) VPT360 [8]: A Transformer-based model that solely employs a Transformer encoder to process trajectory information for temporal prediction.
- (3) MFTR [25]: A multi-modal fusion Transformer model that adopts three Transformer-encoder-based modules to process trajectory features, visual features, and their concatenated features, respectively.
- (4) STAR-VP [33]: A Transformer-based model that converts saliency information into a compact pixel-wise representation aligned with trajectory features. It employs a gating mechanism to achieve dynamic fusion, emphasizing trajectory features for short-term predictions while reinforcing visual information for long-term predictions.
4.3.1. Quantitative Evaluation
4.3.2. Qualitative Evaluation
4.4. Computational Efficiency Analysis
4.5. Ablation Study
5. Conclusions and Future Work
5.1. Conclusions
5.2. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Guo, H.; Wang, F.; Zhang, W.; Zhu, Y.; Cui, L.; Liu, J.; Yu, F.R.; Zhang, L. Joint Adaptation for Mobile 360-Degree Video Streaming and Enhancement. IEEE Trans. Mob. Comput. 2025, 24, 7726–7741. [Google Scholar] [CrossRef]
- Delgado, C.Y.; Mayer, R.E. Implementing pretraining to optimise learning in immersive virtual reality. J. Comput. Assist. Learn. 2025, 41, e13099. [Google Scholar] [CrossRef]
- Yaqoob, A.; Bi, T.; Muntean, G.M. A survey on adaptive 360 video streaming: Solutions, challenges and opportunities. IEEE Commun. Surv. Tutor. 2020, 22, 2801–2838. [Google Scholar] [CrossRef]
- Yaqoob, A.; Muntean, G.M. Advanced predictive tile selection using dynamic tiling for prioritized 360-Degree video VR streaming. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 6. [Google Scholar]
- Subhan, F.E.; Yaqoob, A.; Muntean, C.H.; Muntean, G.M. EDGE360: Edge-Enabled Multi-Agent DRL for Region-Aware Rate Adaptation Solution to Enhance Quality of 360-Degree Video Streaming. IEEE Trans. Mob. Comput. 2025, 25, 1918–1935. [Google Scholar] [CrossRef]
- Liu, Y.; Wang, D.; Song, B. Viewport Prediction with Unsupervised Multiscale Causal Representation Learning for Virtual Reality Video Streaming. IEEE Trans. Multimed. 2025, 27, 4752–4764. [Google Scholar] [CrossRef]
- Li, X.; Wang, S.; Zhu, C.; Song, L.; Xie, R.; Zhang, W. Viewport Prediction for Panoramic Video with Multi-CNN. In Proceedings of the 2019 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB); IEEE: New York, NY, USA, 2019; pp. 1–6. [Google Scholar]
- Chao, F.Y.; Ozcinar, C.; Smolic, A. Transformer-based Long-Term Viewport Prediction in 360-Degree Video: Scanpath is All You Need. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP); IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
- Chen, X.; Kasgari, A.T.Z.; Saad, W. Deep Learning for Content-Based Personalized Viewport Prediction of 360-Degree VR Videos. IEEE Netw. Lett. 2020, 2, 81–84. [Google Scholar] [CrossRef]
- Xu, X.; Tan, X.; Wang, S.; Liu, Z.; Zheng, Q. Multi-features fusion based viewport prediction with gnn for 360-degree video streaming. In Proceedings of the 2023 IEEE International Conference on Metaverse Computing, Networking and Applications (MetaCom); IEEE: New York, NY, USA, 2023; pp. 57–64. [Google Scholar]
- Tashiro, Y.; Song, J.; Song, Y.; Ermon, S. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS Proceedings: San Diego, CA, USA, 2021; pp. 24804–24816. [Google Scholar]
- Yang, Y.; Jin, M.; Wen, H.; Zhang, C.; Liang, Y.; Ma, L.; Wang, Y.; Liu, C.M.; Yang, B.; Xu, Z.; et al. A Survey on Diffusion Models for Time Series and Spatio-Temporal Data. ACM Comput. Surv. 2024, 58, 196. [Google Scholar] [CrossRef]
- Yuan, X.; Qiao, Y. Diffusion-TS: Interpretable Diffusion for General Time Series Generation. In Proceedings of the Twelfth International Conference on Learning Representations; ICLR: Vienna, Austria, 2024; pp. 1–29. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
- Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations; ICLR: Vienna, Austria, 2021; pp. 1–20. [Google Scholar]
- Chen, Y.; Lu, H.; Qin, L.; Wu, C.; Chen, C.W. Streaming 360° VR Video with Statistical QoS Provisioning in mmWave Networks from Delay and Rate Perspectives. IEEE Trans. Wirel. Commun. 2025, 24, 4721–4737. [Google Scholar] [CrossRef]
- Xu, M.; Song, Y.; Wang, J.; Qiao, M.; Huo, L.; Wang, Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2693–2708. [Google Scholar] [CrossRef] [PubMed]
- Petrangeli, S.; Simon, G.; Swaminathan, V. Trajectory-Based Viewport Prediction for 360-Degree Virtual Reality Videos. In Proceedings of the 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR); IEEE: New York, NY, USA, 2018; pp. 157–160. [Google Scholar]
- Yaqoob, A.; Muntean, G.M. A Collaborative Trajectory-Oriented Viewport Prediction for on-Demand and Live 360-Degree VR Video Streaming. In Proceedings of the 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life Using AI, Robotics and IoT (HONET); IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
- Chen, J.; Luo, X.; Hu, M.; Wu, D.; Zhou, Y. Sparkle: User-Aware Viewport Prediction in 360-Degree Video Streaming. IEEE Trans. Multimed. 2021, 23, 3853–3866. [Google Scholar] [CrossRef]
- Li, C.; Zhang, W.; Liu, Y.; Wang, Y. Very Long Term Field of View Prediction for 360-Degree Video Streaming. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR); IEEE: New York, NY, USA, 2019; pp. 297–302. [Google Scholar]
- Rondón, M.F.R.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. TRACK: A New Method from a Re-Examination of Deep Architectures for Head Motion Prediction in 360-Degree Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5681–5699. [Google Scholar] [PubMed]
- Xu, Y.; Dong, Y.; Wu, J.; Sun, Z.; Shi, Z.; Yu, J.; Gao, S. Gaze Prediction in Dynamic 360-Degree Immersive Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 5333–5342. [Google Scholar]
- Wang, M.; Peng, S.; Chen, X.; Zhao, Y.; Xu, M.; Xu, C. CoLive: An Edge-Assisted Online Learning Framework for Viewport Prediction in 360-Degree Live Streaming. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME); IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
- Zhang, Z.; Chen, Y.; Zhang, W.; Yan, C.; Zheng, Q.; Wang, Q.; Chen, W. Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer. In Proceedings of the 31st ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2023; MM ’23; pp. 3560–3568. [Google Scholar]
- Zhang, Z.; Du, H.; Huang, S.; Zhang, W.; Zheng, Q. VRFormer: 360-Degree Video Streaming with FoV Combined Prediction and Super resolution. In Proceedings of the 2022 ISPA/BDCloud/SocialCom/SustainCom; IEEE: New York, NY, USA, 2022; pp. 531–538. [Google Scholar]
- Tang, J.; Huo, Y.; Yang, S.; Jiang, J. A Viewport Prediction Framework for Panoramic Videos. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2020; pp. 1–8. [Google Scholar]
- Chopra, L.; Chakraborty, S.; Mondal, A.; Chakraborty, S. PARIMA: Viewport Adaptive 360-Degree Video Streaming. In Proceedings of the Web Conference 2021; Association for Computing Machinery: New York, NY, USA, 2021; WWW ’21; pp. 2379–2391. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23. [Google Scholar] [CrossRef]
- Peng, S.; Hu, J.; Li, Z.; Xiao, H.; Yang, S.; Xu, C. Spherical Convolution-based Saliency Detection for FoV Prediction in 360-degree Video Streaming. In Proceedings of the 2023 International Wireless Communications and Mobile Computing (IWCMC); IEEE: New York, NY, USA, 2023; pp. 162–167. [Google Scholar]
- Wu, C.; Zhang, R.; Wang, Z.; Sun, L. A Spherical Convolution Approach for Learning Long Term Viewport Prediction in 360 Immersive Video. Proc. AAAI Conf. Artif. Intell. 2020, 34, 14003–14040. [Google Scholar] [CrossRef]
- Gao, B.; Sheng, D.; Zhang, L.; Qi, Q.; He, B.; Zhuang, Z.; Wang, J. STAR-VP: Improving Long-term Viewport Prediction in 360-Degree Videos via Space-aligned and Time-varying Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2024; MM ’24; pp. 5556–5565. [Google Scholar]
- Meijer, C.; Chen, L.Y. The rise of diffusion models in time-series forecasting. arXiv 2024, arXiv:2401.03006. [Google Scholar] [CrossRef]
- Rasul, K.; Seward, C.; Schuster, I.; Vollgraf, R. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. arXiv 2021, arXiv:2101.12072. [Google Scholar] [CrossRef]
- Alcaraz, J.L.; Strodthoff, N. Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models. Trans. Mach. Learn. Res. 2023, 1–36. Available online: https://openreview.net/forum?id=hHiIbk7ApW (accessed on 15 March 2026).
- Wang, D.; Cheng, M.; Liu, Z.; Liu, Q. TimeDART: A Diffusion Autoregressive Transformer for Self-Supervised Time Series Representation. In Proceedings of the Forty-Second International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2025; pp. 1–25. [Google Scholar]
- Wang, Y.; Zhang, F.L.; Dodgson, N.A. ScanTD: 360-Degree Scanpath Prediction based on Time-Series Diffusion. In Proceedings of the 32nd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2024; MM ’24; pp. 7764–7773. [Google Scholar]
- Yun, H.; Lee, S.; Kim, G. Panoramic Vision Transformer for Saliency Detection in 360-Degree Videos. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Part XXXV. pp. 422–439. [Google Scholar]
- Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In Proceedings of the International Conference on Learning Representations; ICLR: Appleton, WI, USA, 2021; pp. 1–17. [Google Scholar]
- Cuturi, M.; Blondel, M. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2017; Volume 70, ICML’17; pp. 894–903. [Google Scholar]
- David, E.J.; Gutiérrez, J.; Coutrot, A.; Da Silva, M.P.; Callet, P.L. A dataset of head and eye movements for 360-Degree videos. In Proceedings of the 9th ACM Multimedia Systems Conference; Association for Computing Machinery: New York, NY, USA, 2018; MMSys ’18; pp. 432–437. [Google Scholar]
- Wu, C.; Tan, Z.; Wang, Z.; Yang, S. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference; Association for Computing Machinery: New York, NY, USA, 2017; MMSys’17; pp. 193–198. [Google Scholar]
- Rondón, M.F.R.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. A unified evaluation framework for head motion prediction methods in 360-Degree videos. In Proceedings of the 11th ACM Multimedia Systems Conference; Association for Computing Machinery: New York, NY, USA, 2020; MMSys ’20; pp. 279–284. [Google Scholar]
- Chao, F.Y.; Zhang, L.; Hamidouche, W.; Deforges, O. Salgan360: Visual Saliency Prediction on 360-Degree Images with Generative Adversarial Networks. In Proceedings of the 2018 IEEE International Conference on Multimedia & Expo Workshops (ICMEW); IEEE: New York, NY, USA, 2018; pp. 1–4. [Google Scholar]
- Zhang, Z.; Xu, Y.; Yu, J.; Gao, S. Saliency Detection in 360-Degree Videos. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018, Part VII; Springer: Berlin/Heidelberg, Germany, 2018; pp. 504–520. [Google Scholar]
| Method | Pub. | David_MMSys OD ↓ | David_MMSys IoU ↑ | Wu_MMSys OD ↓ | Wu_MMSys IoU ↑ | Xu_PAMI OD ↓ | Xu_PAMI IoU ↑ |
|---|---|---|---|---|---|---|---|
| TRACK [22] | TPAMI’22 | 1.123 | 25.05% | 0.613 | 51.12% | 0.408 | 63.32% |
| VPT360 [8] | MMSP’21 | 1.127 | 26.00% | 0.624 | 52.04% | 0.421 | 62.47% |
| MFTR [25] | MM’23 | 1.064 | 27.98% | 0.599 | 52.02% | 0.418 | 62.59% |
| STAR-VP [33] | MM’24 | 0.967 | 33.26% | 0.531 | 56.82% | 0.410 | 63.10% |
| DiffVP | ’26 | 0.962 | 34.55% | 0.503 | 59.69% | 0.388 | 66.01% |
| Methods | Pub. | FLOPs (G) | Parameters (MB) | Inference Time (s) * |
|---|---|---|---|---|
| STAR-VP [33] | MM’24 | 6.9 | 1.87 | 0.03 |
| DiffVP (DDPM) | ’26 | 7.45 | 1.96 | 2.51 |
| DiffVP (DDIM) | ’26 | 7.45 | 1.96 | 0.34 |
| Baseline | Diff | ECTE | CASF | Cat_Sal * | DDPM | DDIM | David_MMSys OD ↓ | David_MMSys IoU ↑ | Wu_MMSys OD ↓ | Wu_MMSys IoU ↑ | Xu_PAMI OD ↓ | Xu_PAMI IoU ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| √ | × | × | × | × | × | × | 1.132 | 26.02% | 0.621 | 51.25% | 0.432 | 62.18% |
| √ | √ | × | × | × | × | √ | 1.076 | 28.62% | 0.589 | 53.17% | 0.411 | 63.05% |
| × | √ | √ | × | × | × | √ | 1.042 | 31.37% | 0.545 | 56.17% | 0.398 | 64.21% |
| √ | √ | × | × | √ | × | √ | 1.053 | 30.16% | 0.574 | 54.81% | 0.408 | 63.67% |
| √ | √ | × | √ | × | × | √ | 1.015 | 32.24% | 0.534 | 57.19% | 0.402 | 64.87% |
| × | √ | √ | √ | × | × | √ | 0.962 | 34.55% | 0.503 | 59.69% | 0.388 | 66.01% |
| × | √ | √ | √ | × | √ | × | 0.966 | 34.32% | 0.502 | 59.82% | 0.391 | 65.78% |
| Method | Pub. | David_MMSys OD ↓ | David_MMSys IoU ↑ | Wu_MMSys OD ↓ | Wu_MMSys IoU ↑ | Xu_PAMI OD ↓ | Xu_PAMI IoU ↑ |
|---|---|---|---|---|---|---|---|
| SalGAN360 [45] | ICMEW’18 | 1.056 | 30.27% | 0.612 | 52.34% | 0.420 | 63.15% |
| Spherical U-Net [46] | ECCV’18 | 1.002 | 31.46% | 0.552 | 55.04% | 0.405 | 63.29% |
| Offline-DHP [17] | TPAMI’19 | 0.991 | 33.75% | 0.539 | 56.32% | 0.392 | 64.33% |
| PAVER [39] | ECCV’22 | 0.962 | 34.55% | 0.503 | 59.69% | 0.388 | 66.01% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zheng, H.; Du, L.; Nie, X.; Dong, F. DiffVP: A Diffusion Model with Explicit Coordinate-Temporal Encoding for Viewport Prediction in 360° Videos. Electronics 2026, 15, 1326. https://doi.org/10.3390/electronics15061326