Learning-Based Viewport Prediction for 360-Degree Videos: A Review
Abstract
1. Introduction
- Recency: We focused on papers published within the last 10 years (2016–2025) in order to reflect the development of viewport prediction, starting from regression methods, which were already being used in 2016, and ending with modern approaches employed in the last year.
- Diversity of approaches: We intentionally included works using a variety of strategies, namely head movement-based, saliency-based, and hybrid approaches, to ensure broad coverage of the field and show the strengths and drawbacks of every approach.
- Technique variety: For each category, we selected works that employed a range of techniques, such as regression, clustering, LSTM, Transformer models, reinforcement learning (RL), and graph neural networks (GNNs), to reflect the technical diversity in the state-of-the-art methods.
- Duplicates (across databases or different papers using the same viewport prediction approach);
- Papers that were not written in English;
- Papers with no citations;
- Papers not addressing omnidirectional content;
- Papers focusing on proposing new streaming and/or bit allocation techniques employing existing viewport prediction approaches.
2. Viewport Prediction
2.1. Head Position Approaches
2.1.1. Regression Approaches
2.1.2. Clustering Approaches
2.1.3. LSTM
2.1.4. Transformers
2.2. Saliency-Based Approaches
2.3. Hybrid Approaches
2.3.1. LSTM
- A double-stacked LSTM with 256 units each, processing head movement information;
- A double-stacked LSTM with 256 units each, processing content-based saliency after flattening it;
- A double-stacked LSTM with 256 units each, processing the concatenated output from the first and second steps.
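For illustration, the following minimal PyTorch sketch shows how the three double-stacked LSTM branches described above could be wired together. The input dimensions, the prediction horizon, and the one-shot linear decoding head are our own illustrative assumptions and do not reproduce the exact configuration of any surveyed model.

```python
import torch
import torch.nn as nn


class HybridViewportLSTM(nn.Module):
    """Three-branch LSTM sketch. Apart from the 256-unit double-stacked LSTMs,
    the input sizes and the decoding head are illustrative assumptions."""

    def __init__(self, pose_dim=3, saliency_dim=16 * 9, hidden=256, horizon=25):
        super().__init__()
        self.pose_dim, self.horizon = pose_dim, horizon
        # Branch 1: double-stacked LSTM over the head-movement sequence.
        self.pose_lstm = nn.LSTM(pose_dim, hidden, num_layers=2, batch_first=True)
        # Branch 2: double-stacked LSTM over the flattened saliency maps.
        self.sal_lstm = nn.LSTM(saliency_dim, hidden, num_layers=2, batch_first=True)
        # Branch 3: double-stacked LSTM over the concatenated branch outputs.
        self.fusion_lstm = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        # Map the fused representation to the future head positions.
        self.head = nn.Linear(hidden, horizon * pose_dim)

    def forward(self, pose_seq, saliency_seq):
        # pose_seq: (batch, T, pose_dim); saliency_seq: (batch, T, saliency_dim)
        pose_feat, _ = self.pose_lstm(pose_seq)
        sal_feat, _ = self.sal_lstm(saliency_seq)
        fused, _ = self.fusion_lstm(torch.cat([pose_feat, sal_feat], dim=-1))
        # Decode the next `horizon` head positions from the last observed step.
        return self.head(fused[:, -1, :]).view(-1, self.horizon, self.pose_dim)


# Example: 30 observed frames of head pose and 16x9 flattened saliency maps.
model = HybridViewportLSTM()
pose = torch.randn(4, 30, 3)
saliency = torch.randn(4, 30, 16 * 9)
print(model(pose, saliency).shape)  # torch.Size([4, 25, 3])
```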
2.3.2. Convolutional LSTM
2.3.3. Gated Recurrent Unit
2.3.4. Transformers
2.3.5. Graph Neural Network (GNN)
2.3.6. Reinforcement Learning (RL)
3. Datasets
- Exploration videos: In this category, users tend to explore the entire scene without focusing on a specific object or region. These scenes usually contain no motion and no dominant visual target. An example is represented by natural landscapes or panoramic views. As a result, while watching the video, head movements are widely distributed, thus making viewport prediction particularly challenging.
- Static focus: These videos contain salient objects that remain fixed near the center of the scene, such as a performer in a concert or a static speaker. Because the main object naturally attracts attention, these videos are considered easier for viewport prediction tasks.
- Moving focus: In these videos, salient objects move around the scene, requiring users to follow their motion. An example of this category is represented by sports videos, where multiple players move inside the scene, or dynamic scenes with multiple actors. Predicting head orientation is more complex because of the variability in motion and attention shifts.
- Ride-based videos: This category includes camera motion during video capture. One of the most common scenarios is represented by car driving. These videos often introduce a forward-moving visual flow, which adds another layer of complexity to predicting user gaze and head movement.
- VR-EyeTracking—introduced in [26]: This is the largest known 360-degree video dataset, which contains 208 videos with a 4K resolution; each video has been watched by 45 users. The videos in the dataset have durations ranging from 20 to 60 s at 25 fps. The video content is varied, including indoor scenes, outdoor environments, sports, music performances, and documentaries. Users watched the videos using an HTC Vive headset with a 7invensun a-Glass eye tracker. This dataset is provided in the form of MP4 videos, along with the corresponding head and eye fixations.
- Sport360—introduced in [84]: This dataset includes 104 videos (extracted from the Sport-360 dataset [107]) viewed by 27 participants. As with the first dataset, the HTC Vive and the a-Glass eye tracker were used. Differently from VR-EyeTracking, this dataset provides just the video frames and the corresponding saliency maps along with the ground-truth eye fixations, making it useful for saliency estimation tasks.
- PSV-HM—introduced in [97]: This dataset contains 76 videos viewed by 58 users. The videos, taken from YouTube and VRCun, have a duration of between 10 and 80 s and resolutions varying from 3K to 8K. They were watched using the HTC Vive, with the head orientation recorded for each viewer. The dataset includes diverse content types, such as animation, gaming, sports, nature scenes, and driving. It is available as MP4 videos along with the corresponding head orientation data.
- Other datasets: Additional datasets include collections with a diverse number of videos and subjects, offering varied levels of annotation and content diversity.
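Regardless of the chosen dataset, tile-based pipelines typically convert the recorded head orientations into the set of equirectangular (ERP) tiles covered by the viewport before training or evaluation. A minimal sketch of this mapping is given below; the 16×9 tile grid, the 100° field of view, and the rectangular viewport approximation are illustrative assumptions rather than choices made by any specific dataset or surveyed paper.

```python
def viewport_tiles(yaw_deg, pitch_deg, fov_h=100.0, fov_v=100.0, cols=16, rows=9):
    """Return the set of (row, col) ERP tiles covered by a viewport centered at
    the given head orientation. The viewport is approximated as a yaw/pitch-
    aligned rectangle, ignoring spherical distortion near the poles; grid size
    and FoV extent are illustrative assumptions."""
    tiles = set()
    lon_min, lon_max = yaw_deg - fov_h / 2, yaw_deg + fov_h / 2
    lat_min = max(pitch_deg - fov_v / 2, -90.0)
    lat_max = min(pitch_deg + fov_v / 2, 90.0)
    for r in range(rows):
        # Latitude span of tile row r (+90 deg at the top, -90 deg at the bottom).
        t_lat_max = 90.0 - r * (180.0 / rows)
        t_lat_min = t_lat_max - 180.0 / rows
        if t_lat_max < lat_min or t_lat_min > lat_max:
            continue
        for c in range(cols):
            # Longitude span of tile column c in [-180, 180).
            t_lon_min = -180.0 + c * (360.0 / cols)
            t_lon_max = t_lon_min + 360.0 / cols
            # Test shifted copies to handle the wrap-around at +/-180 deg.
            if any(t_lon_max + s >= lon_min and t_lon_min + s <= lon_max
                   for s in (-360.0, 0.0, 360.0)):
                tiles.add((r, c))
    return tiles


# Example: a user looking 30 deg to the right and 10 deg up on a 16x9 tile grid.
print(sorted(viewport_tiles(yaw_deg=30.0, pitch_deg=10.0)))
```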
4. Evaluation Metrics and Comparison Methods
4.1. Evaluation Metrics
- Accuracy: the ratio of correctly classified tiles to the total number of tiles.
- Intersection over Union (IoU): the intersection between predicted tiles and ground-truth tiles divided by the union of predicted and ground-truth tiles.
- Recall: the fraction of tiles actually viewed by the user that were correctly predicted by the system.
- Precision: the fraction of tiles predicted to be in the user’s viewport that were actually in the viewport.
- F1-Score: the harmonic mean of precision and recall.
- Mean FoV Overlap: the IoU between the predicted and ground-truth viewport areas, averaged over a defined prediction window; the overlapping area is computed as the intersection of the two viewport areas divided by their union.
- Intersection Angle Error: the angular distance between the predicted point and the ground-truth point on the sphere.
- Great Circle Distance/Orthodromic Distance: the shortest distance, measured along the surface of the sphere, between the center of the predicted viewport and the center of the ground-truth viewport.
- Average Great Circle Distance: the average of the great circle distance over all future time steps.
- Viewport PSNR (V-PSNR): the PSNR computed only over the user’s viewport with respect to the reference ground truth.
- VWS-PSNR: the PSNR computed over the user’s viewport, incorporating a spherical correction and a weighted MSE with respect to the reference ground truth.
- Hit Rate: the average percentage of the predicted FoV area, computed frame by frame within a given second, relative to the total area of the actual FoV across those frames.
- Missing Ratio/Blank Ratio: the ratio of missing tiles to viewed tiles.
- Unseen Ratio: the ratio of tiles that arrived at the client but were not watched to the total number of transmitted tiles.
- Bandwidth Consumption: the amount of bandwidth used to stream the predicted tiles.
- Peak Bandwidth: the highest bandwidth consumption recorded during the streaming process.
- Total Re-Buffering Time: the overall time spent re-buffering during playback.
- Buffer Stall Count: the number of times that video playback is interrupted due to buffer depletion.
- Buffer Stall Duration: the total number of seconds for which playback is halted while waiting for the buffer to refill.
- Viewport Deviation: the percentage of blank area inside the viewport.
- Viewport Quality Variance: adopts the coefficient of variation (CV) to evaluate the viewport quality variance in one segment (as detailed in Section 2.1.1).
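To make the tile-based metrics listed above concrete, the following sketch computes accuracy, precision, recall, F1-score, and IoU from sets of predicted and viewed tile indices, together with the great circle (orthodromic) distance between two viewport centers. The 16×9 tile grid and the yaw/pitch parameterization are illustrative assumptions.

```python
import math


def tile_metrics(predicted, ground_truth, total_tiles=16 * 9):
    """Tile-level accuracy, precision, recall, F1, and IoU from sets of
    predicted and actually viewed tile indices (16x9 grid is only an example)."""
    tp = len(predicted & ground_truth)   # predicted and viewed
    fp = len(predicted - ground_truth)   # predicted but not viewed
    fn = len(ground_truth - predicted)   # viewed but not predicted
    tn = total_tiles - tp - fp - fn      # neither predicted nor viewed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total_tiles,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
        "iou": tp / len(predicted | ground_truth) if predicted | ground_truth else 0.0,
    }


def great_circle_distance(yaw1, pitch1, yaw2, pitch2):
    """Orthodromic distance (radians, unit sphere) between the predicted and
    ground-truth viewport centers, given as yaw/pitch angles in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (yaw1, pitch1, yaw2, pitch2))
    # Spherical law of cosines; clamp to avoid domain errors from rounding.
    cos_d = (math.sin(lat1) * math.sin(lat2)
             + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    return math.acos(max(-1.0, min(1.0, cos_d)))


# Example: 4 predicted tiles, 5 viewed tiles, 3 in common.
pred = {(4, 6), (4, 7), (5, 6), (5, 7)}
viewed = {(4, 6), (4, 7), (5, 6), (5, 5), (4, 5)}
print(tile_metrics(pred, viewed))
print(great_circle_distance(30.0, 10.0, 45.0, 5.0))
```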
4.1.1. Accuracy
4.1.2. Mean FoV Overlap
4.1.3. IoU
4.2. Comparison Between Methods
5. Discussion and Recommendations
- Collect rich datasets: There is a need for larger and more varied datasets that include different content types and user data (e.g., eye gaze, head traces, and saliency ground truth). These will aid in model generalization and their application to real-life scenarios.
- Unify the evaluation procedure: Shared evaluation datasets and metrics would be a relevant contribution. This will allow fair and objective comparisons between different approaches.
- Design efficient models: Future research should focus on lightweight models that are suitable for real-time deployment on low-end devices. In addition, scalability should be ensured for multi-user and long-term prediction scenarios. Furthermore, models should be developed with a user-centric perspective by incorporating QoE as a core evaluation criterion, ensuring not only the accuracy of viewport prediction but also smooth playback, reduced latency, and enhanced overall viewing satisfaction.
- Improve saliency estimation: It is important to design robust saliency estimations for 360-degree videos. They should account for spherical distortions and large frame sizes.
- Optimize input windows: The input data size should be carefully set in order to balance the input quality, the risk of noisy samples, and high computational complexity.
- Investigate diverse input types: Current approaches mainly rely on video content and users’ behavioral data. Future research could explore combining these with physiological signals.
- Realize QoE by-design: While QoE metrics are often used for model evaluation, QoE principles should also guide the design of new algorithms. To better model users’ perceptions, future works should integrate subjective pilot studies in combination with objective metrics to evaluate algorithm performance.
- Develop user-tailored prediction models: Although many approaches have been proposed, different AI models and input types may be better suited for different users. Therefore, a more general framework that integrates multiple Artificial Intelligence (AI) models and input modalities could adapt predictions to individual users.
- Integrate viewport prediction in streaming platforms: To bridge the gap between the research community and final users, viewport prediction should be integrated into VR streaming platforms. Although this is an active field, as indicated by the efforts in this direction by Meta [109], prediction techniques should first reach sufficient maturity in terms of accuracy, prediction horizons, and generalizability across multiple users.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AI | Artificial Intelligence |
ARIMA | Autoregressive Integrated Moving Average |
AUC-Judd | Area Under the Curve—Judd |
BiLSTM | Bidirectional Long Short-Term Memory |
CBAM | Convolutional Block Attention Module |
CC | Pearson Linear Correlation Coefficient |
CNN | Convolutional Neural Network |
Conv-LSTM | Convolutional Long Short-Term Memory |
DFCA | Dynamic Federated Clustering Algorithm |
DHP | Deep Reinforcement Learning-Based Head Movement Prediction |
DL | Deep Learning |
DoF | Degrees of Freedom |
DR | Dead Reckoning |
DRL | Deep Reinforcement Learning |
ERP | Equirectangular Projection |
FN | False Negative |
FoV | Field of View |
FP | False Positive |
FSM | Fused Saliency Map |
GBVS | Graph-Based Visual Saliency |
GCN | Graph Convolutional Network |
GEP | Gaze-Based Field of View Prediction |
GNN | Graph Neural Network |
GRU | Gated Recurrent Unit |
HMD | Head-Mounted Display |
IoU | Intersection over Union |
KL | Kullback–Leibler Divergence |
KNN | K-Nearest Neighbors |
LR | Linear Regression |
LSTM | Long Short-Term Memory |
MIAE | Mean Intersection Angle Error |
ML | Machine Learning |
MPC | Model Predictive Controller |
MSE | Mean Squared Error |
NSS | Normalized Scanpath Saliency |
PEVQ | Perceptual Evaluation of Video Quality |
PoI | Point of Interest |
PSNR | Peak Signal-to-Noise Ratio |
QoE | Quality of Experience |
RL | Reinforcement Learning |
RMSE | Root Mean Square Error |
RNN | Recurrent Neural Network |
RoI | Regions of Interest |
RR | Ridge Regression |
SE-net | Squeeze-and-Excitation Network |
SE-Unet | Squeeze-and-Excitation Network and U-Net |
SP-ConvGRU | Spherical Convolutional Gated Recurrent Unit |
SP-GRU | Spherical Gated Recurrent Unit |
SSIM | Structural Similarity Index |
SVM | Support Vector Machine |
SVR | Support Vector Regressor |
TN | True Negative |
TP | True Positive |
TT | Total True |
VR | Virtual Reality |
VST | Visual Saliency Transformer |
VWS-PSNR | Viewport-Weighted PSNR |
WLR | Weighted Linear Regression |
References
- Global Industry Analysts Inc. Virtual Reality (VR)-Global Strategic Business Report. 2025. Available online: https://www.researchandmarkets.com/reports/3633908/virtual-reality-vr-global-strategic-business?srsltid=AfmBOookGFSnubhPSmOYKDCSUry2qI5-UrXnjJh24schZn0F6-wX2hjn (accessed on 1 September 2025).
- Zink, M.; Sitaraman, R.; Nahrstedt, K. Scalable 360° Video Stream Delivery: Challenges, Solutions, and Opportunities. Proc. IEEE 2019, 107, 639–650. [Google Scholar] [CrossRef]
- Nguyen, A.; Yan, Z. Enhancing 360 Video Streaming through Salient Content in Head-Mounted Displays. Sensors 2023, 23, 4016. [Google Scholar] [CrossRef] [PubMed]
- Park, S.; Bhattacharya, A.; Yang, Z.; Das, S.; Samaras, D. Mosaic: Advancing User Quality of Experience in 360-Degree Video Streaming with Machine Learning. IEEE Trans. Netw. Serv. Manag. 2021, 18, 1000–1015. [Google Scholar] [CrossRef]
- Islam, R.; Desai, K.; Quarles, J. Cybersickness Prediction from Integrated HMD’s Sensors: A Multimodal Deep Fusion Approach using Eye-tracking and Head-tracking Data. In Proceedings of the 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Bari, Italy, 4–8 October 2021; pp. 31–40. [Google Scholar] [CrossRef]
- Islam, R.; Desai, K.; Quarles, J. Towards Forecasting the Onset of Cybersickness by Fusing Physiological, Head-tracking and Eye-tracking with Multimodal Deep Fusion Network. In Proceedings of the 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Singapore, 17–21 October 2022; pp. 121–130. [Google Scholar] [CrossRef]
- Yaqoob, A.; Bi, T.; Muntean, G.M. A Survey on Adaptive 360° Video Streaming: Solutions, Challenges and Opportunities. IEEE Commun. Surv. Tutorials 2020, 22, 2801–2838. [Google Scholar] [CrossRef]
- Dziubinski, K.; Bandai, M. Local and Global Viewport History Sampling for Improved User Quality of Experience in Viewport-Aware Tile-Based 360-Degree Video Streaming. IEEE Access 2024, 12, 137455–137471. [Google Scholar] [CrossRef]
- Bao, Y.; Wu, H.; Zhang, T.; Ramli, A.A.; Liu, X. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 1161–1170. [Google Scholar] [CrossRef]
- Qian, F.; Ji, L.; Han, B.; Gopalakrishnan, V. Optimizing 360 video delivery over cellular networks. In Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications and Challenges, New York, NY, USA, 3–7 October 2016; pp. 1–6. [Google Scholar] [CrossRef]
- Qian, F.; Han, B.; Xiao, Q.; Gopalakrishnan, V. Flare: Practical Viewport-Adaptive 360-Degree Video Streaming for Mobile Devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, New Delhi, India, 29 October–2 November 2018; pp. 99–114. [Google Scholar] [CrossRef]
- Ban, Y.; Xie, L.; Xu, Z.; Zhang, X.; Guo, Z.; Wang, Y. CUB360: Exploiting Cross-Users Behaviors for Viewport Prediction in 360 Video Adaptive Streaming. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar] [CrossRef]
- Baldoni, S.; Poci, O.; Calvagno, G.; Battisti, F. An Ablation Study on 360-Degree Saliency Estimation. In Proceedings of the 2023 International Symposium on Image and Signal Processing and Analysis (ISPA), Rome, Italy, 18–19 September 2023; pp. 1–6. [Google Scholar] [CrossRef]
- Aladagli, A.D.; Ekmekcioglu, E.; Jarnikov, D.; Kondoz, A. Predicting head trajectories in 360° virtual reality videos. In Proceedings of the 2017 International Conference on 3D Immersion (IC3D), Brussels, Belgium, 11–12 December 2017; pp. 1–6. [Google Scholar] [CrossRef]
- Petrangeli, S.; Simon, G.; Swaminathan, V. Trajectory-Based Viewport Prediction for 360-Degree Virtual Reality Videos. In Proceedings of the 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Taichung, Taiwan, 10–12 December 2018; pp. 157–160. [Google Scholar] [CrossRef]
- Nasrabadi, A.T.; Samiei, A.; Prakash, R. Viewport prediction for 360° videos: A clustering approach. In Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, Istanbul, Turkey, 10–11 June 2020; pp. 34–39. [Google Scholar] [CrossRef]
- Li, J.; Wang, Y.; Liu, Y. Meta360: Exploring User-Specific and Robust Viewport Prediction in 360-Degree Videos through Bi-Directional LSTM and Meta-Adaptation. In Proceedings of the 2023 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Sydney, Australia, 16–20 October 2023; pp. 652–661. [Google Scholar] [CrossRef]
- Mahmoud, M.; Rizou, S.; Panayides, A.S.; Kantartzis, N.V.; Karagiannidis, G.K.; Lazaridis, P.I.; Zaharis, Z.D. Optimized Tile Quality Selection in Multi-User 360° Video Streaming. IEEE Open J. Commun. Soc. 2024, 5, 7301–7316. [Google Scholar] [CrossRef]
- Feng, W.; Wang, S.; Dai, Y. Adaptive 360-Degree Streaming: Optimizing with Multi-window and Stochastic Viewport Prediction. IEEE Trans. Mob. Comput. 2025, 24, 5903–5915. [Google Scholar] [CrossRef]
- Chao, F.Y.; Ozcinar, C.; Smolic, A. Transformer-based Long-Term Viewport Prediction in 360° Video: Scanpath is All You Need. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 6–8 October 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Wang, H.; Long, Z.; Dong, H.; El Saddik, A. MADRL-Based Rate Adaptation for 360° Video Streaming with Multiviewpoint Prediction. IEEE Internet Things J. 2024, 11, 26503–26517. [Google Scholar] [CrossRef]
- Ao, A.; Park, S. Applying Transformer-Based Computer Vision Models to Adaptive Bitrate Allocation for 360° Live Streaming. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Fan, C.L.; Lee, J.; Lo, W.C.; Huang, C.Y.; Chen, K.T.; Hsu, C.H. Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality. In Proceedings of the 27th Workshop on Network and Operating Systems Support for Digital Audio and Video, Taipei, Taiwan, 20–23 June 2017; pp. 67–72. [Google Scholar] [CrossRef]
- Nguyen, A.; Yan, Z.; Nahrstedt, K. Your Attention is Unique: Detecting 360-Degree Video Saliency in Head-Mounted Display for Head Movement Prediction. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1190–1198. [Google Scholar] [CrossRef]
- Li, Y.; Xu, Y.; Xie, S.; Ma, L.; Sun, J. Two-Layer FoV Prediction Model for Viewport Dependent Streaming of 360-Degree Videos. In Proceedings of the International ICST Conference on Communications and Networking in China, Chengdu, China, 23–25 October 2018. [Google Scholar]
- Xu, Y.; Dong, Y.; Wu, J.; Sun, Z.; Shi, Z.; Yu, J.; Gao, S. Gaze Prediction in Dynamic 360° Immersive Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5333–5342. [Google Scholar] [CrossRef]
- Fan, C.L.; Yen, S.C.; Huang, C.Y.; Hsu, C.H. Optimizing Fixation Prediction Using Recurrent Neural Networks for 360∘ Video Streaming in Head-Mounted Virtual Reality. IEEE Trans. Multimed. 2020, 22, 744–759. [Google Scholar] [CrossRef]
- Feng, X.; Liu, Y.; Wei, S. LiveDeep: Online Viewport Prediction for Live Virtual Reality Streaming Using Lifelong Deep Learning. In Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Atlanta, GA, USA, 22–26 March 2020; pp. 800–808. [Google Scholar] [CrossRef]
- Rondón, M.F.R.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. TRACK: A New Method from a Re-Examination of Deep Architectures for Head Motion Prediction in 360° Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5681–5699. [Google Scholar] [CrossRef]
- Manfredi, G.; Racanelli, V.A.; De Cicco, L.; Mascolo, S. LSTM-based Viewport Prediction for Immersive Video Systems. In Proceedings of the 2023 21st Mediterranean Communication and Computer Networking Conference (MedComNet), Island of Ponza, Italy, 13–15 June 2023; pp. 49–52. [Google Scholar] [CrossRef]
- Wang, M.; Peng, S.; Chen, X.; Zhao, Y.; Xu, M.; Xu, C. CoLive: An Edge-Assisted Online Learning Framework for Viewport Prediction in 360° Live Streaming. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Dong, P.; Shen, R.; Xie, X.; Li, Y.; Zuo, Y.; Zhang, L. Predicting Long-Term Field of View in 360-Degree Video Streaming. IEEE Netw. 2023, 37, 26–33. [Google Scholar] [CrossRef]
- Wang, M.; Chen, X.; Yang, X.; Peng, S.; Zhao, Y.; Xu, M.; Xu, C. CoLive: Edge-Assisted Clustered Learning Framework for Viewport Prediction in 360∘ Live Streaming. IEEE Trans. Multimed. 2024, 26, 5078–5091. [Google Scholar] [CrossRef]
- Zhang, L.; Zhou, H.; Shen, L.; Liu, J.; Cui, L. Towards Attention-Aware Interactive 360-Degree Video Streaming on Smartphones. IEEE Netw. 2025, 39, 147–156. [Google Scholar] [CrossRef]
- Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 3. [Google Scholar] [CrossRef]
- Setayesh, M.; Wong, V.W.S. Viewport Prediction, Bitrate Selection, and Beamforming Design for THz-Enabled 360° Video Streaming. IEEE Trans. Wirel. Commun. 2025, 24, 1849–1865. [Google Scholar] [CrossRef]
- Zhang, Z.; Du, H.; Huang, S.; Zhang, W.; Zheng, Q. VRFormer: 360-Degree Video Streaming with FoV Combined Prediction and Super resolution. In Proceedings of the 2022 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom), Melbourne, Australia, 17–19 December 2022; pp. 531–538. [Google Scholar] [CrossRef]
- Gao, B.; Sheng, D.; Zhang, L.; Qi, Q.; He, B.; Zhuang, Z.; Wang, J. STAR-VP: Improving Long-term Viewport Prediction in 360° Videos via Space-aligned and Time-varying Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 5556–5565. [Google Scholar] [CrossRef]
- Guo, Y.; Xu, M.; Jiang, L.; Deng, X.; Zhou, J.; Chen, G.; Sigal, L. Proposal with Alignment: A Bi-Directional Transformer for 360° Video Viewport Proposal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11423–11437. [Google Scholar] [CrossRef]
- Xu, X.; Tan, X.; Wang, S.; Liu, Z.; Zheng, Q. Multi-Features Fusion based Viewport Prediction with GNN for 360-Degree Video Streaming. In Proceedings of the 2023 IEEE International Conference on Metaverse Computing, Networking and Applications (MetaCom), Kyoto, Japan, 26–28 June 2023; pp. 57–64. [Google Scholar] [CrossRef]
- Xu, M.; Song, Y.; Wang, J.; Qiao, M.; Huo, L.; Wang, Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2693–2708. [Google Scholar] [CrossRef]
- Wu, C.; Tan, Z.; Wang, Z.; Yang, S. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; pp. 193–198. [Google Scholar] [CrossRef]
- Bao, Y.; Zhang, T.; Pande, A.; Wu, H.; Liu, X. Motion-Prediction-Based Multicast for 360-Degree Video Transmissions. In Proceedings of the 2017 14th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), San Diego, CA, USA, 12–14 June 2017; pp. 1–9. [Google Scholar] [CrossRef]
- Markley, L.; Cheng, Y.; Crassidis, J.; Oshman, Y. Averaging Quaternions. J. Guid. Control Dyn. 2007, 30, 1193–1196. [Google Scholar] [CrossRef]
- Nasrabadi, A.T.; Samiei, A.; Mahzari, A.; McMahan, R.P.; Prakash, R.; Farias, M.C.Q.; Carvalho, M.M. A taxonomy and dataset for 360° videos. In Proceedings of the 10th ACM Multimedia Systems Conference, Amherst, MA, USA, 18–21 June 2019; pp. 273–278. [Google Scholar] [CrossRef]
- Lo, W.C.; Fan, C.L.; Lee, J.; Huang, C.Y.; Chen, K.T.; Hsu, C.H. 360° Video Viewing Dataset in Head-Mounted Virtual Reality. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; pp. 211–216. [Google Scholar] [CrossRef]
- Li, C.; Xu, M.; Du, X.; Wang, Z. Bridge the Gap Between VQA and Human Behavior on Omnidirectional Video: A Large-Scale Dataset and a Deep Learning Model. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 932–940. [Google Scholar] [CrossRef]
- FFmpeg Developers. FFmpeg Multimedia Framework. Available online: https://ffmpeg.org/ (accessed on 29 July 2025).
- Guo, C.; Cui, Y.; Liu, Z. Optimal Multicast of Tiled 360 VR Video in OFDMA Systems. IEEE Commun. Lett. 2018, 22, 2563–2566. [Google Scholar] [CrossRef]
- Mahmoud, M.; Valiandi, I.; Panayides, A.S.; Rizou, S.; Lazaridis, P.I.; Karagiannidis, G.K.; Kantartzis, N.V.; Zaharis, Z.D. Versatile Video Coding Performance Evaluation for Tiled 360 deg Videos. In Proceedings of the European Wireless 2023; 28th European Wireless Conference, Rome, Italy, 2–4 October 2023; pp. 191–196. [Google Scholar]
- David, E.J.; Gutiérrez, J.; Coutrot, A.; Da Silva, M.P.; Callet, P.L. A dataset of head and eye movements for 360° videos. In Proceedings of the 9th ACM Multimedia Systems Conference, Amsterdam, The Netherlands, 12–15 June 2018; pp. 432–437. [Google Scholar] [CrossRef]
- Hou, X.; Dey, S.; Zhang, J.; Budagavi, M. Predictive Adaptive Streaming to Enable Mobile 360-Degree and VR Experiences. IEEE Trans. Multimed. 2021, 23, 716–731. [Google Scholar] [CrossRef]
- Zhang, Y.; Zhao, P.; Bian, K.; Liu, Y.; Song, L.; Li, X. DRL360: 360-degree Video Streaming with Deep Reinforcement Learning. In Proceedings of the IEEE INFOCOM 2019—IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 1252–1260. [Google Scholar] [CrossRef]
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–12 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
- Lu, Y.; Zhu, Y.; Wang, Z. Personalized 360-Degree Video Streaming: A Meta-Learning Approach. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3143–3151. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Itti, L.; Koch, C.; Niebur, E. A Model of Saliency-based Visual Attention for Rapid Scene Analysis. Pattern Anal. Mach. Intell. IEEE Trans. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
- Lou, J.; Lin, H.; Marshall, D.; Saupe, D.; Liu, H. TranSalNet: Towards perceptually relevant visual saliency prediction. Neurocomputing 2022, 494, 455–467. [Google Scholar] [CrossRef]
- Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual Saliency Transformer. arXiv 2021, arXiv:2104.12099. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 2, pp. 674–679. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Cornia, M.; Baraldi, L.; Serra, G.; Cucchiara, R. A deep multi-level network for saliency prediction. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3488–3493. [Google Scholar] [CrossRef]
- Mavlankar, A.; Girod, B. Video Streaming with Interactive Pan/Tilt/Zoom. In High-Quality Visual Experience: Creation, Processing and Interactivity of High-Resolution and High-Dimensional Video Signals; Mrak, M., Grgic, M., Kunt, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 431–455. [Google Scholar] [CrossRef]
- El-Ganainy, T.; Hefeeda, M. Streaming virtual reality content. arXiv 2016, arXiv:1612.08350. [Google Scholar] [CrossRef]
- Corbillon, X.; Simon, G.; Devlic, A.; Chakareski, J. Viewport-adaptive navigable 360-degree video delivery. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; pp. 1–7. [Google Scholar] [CrossRef]
- Yu, M.; Lakshman, H.; Girod, B. A Framework to Evaluate Omnidirectional Video Coding Schemes. In Proceedings of the 2015 IEEE International Symposium on Mixed and Augmented Reality, Fukuoka, Japan, 29 September–3 October 2015; pp. 31–36. [Google Scholar] [CrossRef]
- Pan, J.; Sayrol, E.; Giro-i Nieto, X.; McGuinness, K.; O’Connor, N.E. Shallow and Deep Convolutional Networks for Saliency Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Huang, X.; Shen, C.; Boix, X.; Zhao, Q. SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 262–270. [Google Scholar] [CrossRef]
- Xie, L.; Xu, Z.; Ban, Y.; Zhang, X.; Guo, Z. 360ProbDASH: Improving QoE of 360 Video Streaming Using Tile-based HTTP Adaptive Streaming. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 315–323. [Google Scholar] [CrossRef]
- De Abreu, A.; Ozcinar, C.; Smolic, A. Look around you: Saliency maps for omnidirectional images in VR applications. In Proceedings of the 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), Erfurt, Germany, 31 May–2 June 2017; pp. 1–6. [Google Scholar] [CrossRef]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar] [CrossRef]
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2462–2470. [Google Scholar]
- Feng, X.; Swaminathan, V.; Wei, S. Viewport Prediction for Live 360-Degree Mobile Video Streaming Using User-Content Hybrid Motion Tracking. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2019, 3, 43. [Google Scholar] [CrossRef]
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A machine learning approach for precipitation nowcasting. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 1, pp. 802–810. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
- Nguyen, A.; Yan, Z. A saliency dataset for 360-degree videos. In Proceedings of the 10th ACM Multimedia Systems Conference, Amherst, MA, USA, 18–21 June 2019; pp. 279–284. [Google Scholar] [CrossRef]
- Zhang, L.; Suo, Y.; Wu, X.; Wang, F.; Chen, Y.; Cui, L.; Liu, J.; Ming, Z. TBRA: Tiling and Bitrate Adaptation for Mobile 360-Degree Video Streaming. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 4007–4015. [Google Scholar] [CrossRef]
- Dasari, M.; Bhattacharya, A.; Vargas, S.; Sahu, P.; Balasubramanian, A.; Das, S.R. Streaming 360-Degree Videos Using Super-Resolution. In Proceedings of the IEEE INFOCOM 2020—IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 1977–1986. [Google Scholar] [CrossRef]
- Peng, S.; Hu, J.; Li, Z.; Xiao, H.; Yang, S.; Xu, C. Spherical Convolution-based Saliency Detection for FoV Prediction in 360-degree Video Streaming. In Proceedings of the 2023 International Wireless Communications and Mobile Computing (IWCMC), Marrakesh, Morocco, 19–23 June 2023; pp. 162–167. [Google Scholar] [CrossRef]
- Coors, B.; Condurache, A.P.; Geiger, A. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Zhang, Z.; Xu, Y.; Yu, J.; Gao, S. Saliency Detection in 360° Videos. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VII. Springer: Berlin/Heidelberg, Germany, 2018; pp. 504–520. [Google Scholar] [CrossRef]
- Cho, K.; Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
- Chao, F.Y.; Zhang, L.; Hamidouche, W.; Deforges, O. Salgan360: Visual Saliency Prediction on 360 Degree Images with Generative Adversarial Networks. In Proceedings of the 2018 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), San Diego, CA, USA, 23–27 July 2018; pp. 1–4. [Google Scholar] [CrossRef]
- Jiang, L.; Xu, M.; Liu, T.; Qiao, M.; Wang, Z. DeepVS: A Deep Learning Based Video Saliency Prediction Approach. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XIV. Springer: Berlin/Heidelberg, Germany, 2018; pp. 625–642. [Google Scholar] [CrossRef]
- Che, Z.; Borji, A.; Zhai, G.; Min, X.; Guo, G.; Le Callet, P. How is Gaze Influenced by Image Transformations? Dataset and Model. Trans. Img. Proc. 2020, 29, 2287–2300. [Google Scholar] [CrossRef] [PubMed]
- Setayesh, M.; Wong, V.W. A Content-based Viewport Prediction Framework for 360° Video Using Personalized Federated Learning and Fusion Techniques. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 654–659. [Google Scholar] [CrossRef]
- Yun, H.; Lee, S.; Kim, G. Panoramic Vision Transformer for Saliency Detection in 360° Videos. arXiv 2022, arXiv:2209.08956. [Google Scholar] [CrossRef]
- Liu, X.; Deng, Y.; Han, C.; Renzo, M.D. Learning-Based Prediction, Rendering and Transmission for Interactive Virtual Reality in RIS-Assisted Terahertz Networks. IEEE J. Sel. Areas Commun. 2022, 40, 710–724. [Google Scholar] [CrossRef]
- Yaqoob, A.; Muntean, G.M. A Combined Field-of-View Prediction-Assisted Viewport Adaptive Delivery Scheme for 360° Videos. IEEE Trans. Broadcast. 2021, 67, 746–760. [Google Scholar] [CrossRef]
- Huang, R.; Wong, V.W.; Schober, R. Rate-Splitting for Intelligent Reflecting Surface-Aided Multiuser VR Streaming. IEEE J. Sel. Areas Commun. 2023, 41, 1516–1535. [Google Scholar] [CrossRef]
- Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Cham, Switzerland, 2016; pp. 630–645. [Google Scholar]
- Pang, H.; Zhang, C.; Wang, F.; Liu, J.; Sun, L. Towards Low Latency Multi-viewpoint 360° Interactive Video: A Multimodal Deep Reinforcement Learning Approach. In Proceedings of the IEEE INFOCOM 2019—IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 991–999. [Google Scholar] [CrossRef]
- Li, C.; Zhang, W.; Liu, Y.; Wang, Y. Very Long Term Field of View Prediction for 360-Degree Video Streaming. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; pp. 297–302. [Google Scholar] [CrossRef]
- Chen, J.; Hu, M.; Luo, Z.; Wang, Z.; Wu, D. SR360: Boosting 360-degree video streaming with super-resolution. In Proceedings of the NOSSDAV ’20: Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, Istanbul, Turkey, 10–11 June 2020; pp. 1–6. [Google Scholar] [CrossRef]
- Zhang, Z.; Chen, Y.; Zhang, W.; Yan, C.; Zheng, Q.; Wang, Q.; Chen, W. Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; MM ’23. pp. 3560–3568. [Google Scholar] [CrossRef]
- Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. arXiv 2021, arXiv:2103.17154. [Google Scholar] [CrossRef]
- Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. arXiv 2021, arXiv:2101.02702. [Google Scholar] [CrossRef]
- Dahou Djilali, Y.A.; Tliba, M.; McGuinness, K.; O’Connor, N. ATSal: An Attention Based Architecture for Saliency Prediction in 360 Videos. arXiv 2020, arXiv:2011.10600. [Google Scholar] [CrossRef]
- Vo, C.H.; Chiang, J.C.; Le, D.H.; Nguyen, T.T.; Pham, T.V. Saliency Prediction for 360-degree Video. In Proceedings of the 2020 5th International Conference on Green Technology and Sustainable Development (GTSD), Ho Chi Minh City, Vietnam, 27–28 November 2020; pp. 442–448. [Google Scholar] [CrossRef]
- Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; NIPS’16. pp. 3844–3852. [Google Scholar]
- Chopra, L.; Chakraborty, S.; Mondal, A.; Chakraborty, S. PARIMA: Viewport Adaptive 360-Degree Video Streaming. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; ACM: New York, NY, USA, 2021. WWW ’21. [Google Scholar] [CrossRef]
- Hu, H.N.; Lin, Y.C.; Liu, M.Y.; Cheng, H.T.; Chang, Y.J.; Sun, M. Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1396–1405. [Google Scholar] [CrossRef]
- Chen, P.W.; Yang, T.S.; Huang, G.L.; Huang, C.W.; Chao, Y.C.; Lu, C.H.; Wu, P.Y. Viewing Bias Matters in 360° Videos Visual Saliency Prediction. IEEE Access 2023, 11, 46084–46094. [Google Scholar] [CrossRef]
- Chen, S.; Kuzyakov, E.; Peng, R. Enhancing High-Resolution 360 Streaming with View Prediction. 2017. Available online: https://engineering.fb.com/2017/04/19/virtual-reality/enhancing-high-resolution-360-streaming-with-view-prediction/ (accessed on 31 August 2025).
Category | Subcategory | Reference | Input | Output | Technique | Horizon |
---|---|---|---|---|---|---|
Head movement approach (Section 2.1) | Regression (Section 2.1.1) | Qian et al. 2016 [10] | Head orientation traces | Head orientation | Average, LR, WLR | 2 s |
 | | Bao et al. 2016 [9] | Head orientation traces | Head orientation | Naive, LR, shallow NN | 1 s |
 | | Qian et al. 2018 [11] | Head orientation traces | Head orientation | Naive, LR, RR, SVR | 3 s |
 | | Ban et al. 2018 [12] | Head orientation traces (current and previous users) | Tile probabilities | LR, KNN | 6 s |
 | Clustering (Section 2.1.2) | Petrangeli et al. 2018 [15] | Head orientation traces (current and previous users) | Head orientation | Spectral clustering | 10 s |
 | | Nasrabadi et al. 2020 [16] | Head orientation traces (current and previous users) | Head orientation | Naive, custom clustering | 10 s |
 | | Dziubinski et al. 2024 [8] | Video frames, tiles in viewports (previous users) | Tile quality level | KNN, DBSCAN | 1, 2 s |
 | LSTM (Section 2.1.3) | Li et al. 2023 [17] | Head orientation traces | Head orientation | BiLSTM, meta-learning | 5 s |
 | | Mahmoud et al. 2024 [18] | Head orientation traces | Tile probabilities, tile quality levels | LSTM | 1 s |
 | | Feng et al. 2025 [19] | Head scanpath | Tile probabilities | LSTM | 1–6 s |
 | Transformer (Section 2.1.4) | Chao et al. 2021 [20] | Head scanpath | Head scanpath | Transformer | 5 s |
 | | Wang et al. 2024 [21] | Head scanpath | Head scanpath | Transformer, DRL | 1–5 s |
Saliency approach (Section 2.2) | Saliency estimation | Aladagli et al. 2017 [14] | Video frames | Saliency map | GBVS | 2 s |
 | Transformer | Alice et al. 2024 [22] | Video frames | Tile probabilities | TranSalNet, VST | 1 s |
Head movement with saliency approach (Section 2.3) | LSTM (Section 2.3.1) | Fan et al. 2017 [23] | Head orientation traces, motion and saliency maps | Tile probabilities | LSTM, CNN, optical flow | 1 s |
 | | Nguyen et al. 2018 [24] | Viewport maps, saliency maps | Viewport map | LSTM, PanoSalNet | 2.5 s |
 | | Li et al. 2018 [25] | Head orientation traces, navigation speed, video frames | Tile probabilities | LSTM, FSM, optical flow | 1 s, 2 s |
 | | Xu et al. 2018 [26] | Eye gaze scanpath, video frames, current and past FoV images | Gaze displacement | LSTM, FlowNet2, SalNet | 1 s |
 | | Fan et al. 2020 [27] | Head orientation traces, motion and saliency maps | Tile probabilities | LSTM, CNN, optical flow | 1 s |
 | | Feng et al. 2020 [28] | Head orientation traces, video frames | Tiles in viewport | LSTM, CNN | 2 s |
 | | Rondón et al. 2022 [29] | Head orientation traces, saliency maps | Head orientation | LSTM, PanoSalNet | 5 s |
 | | Nguyen et al. 2023 [3] | Viewport maps, saliency maps | Viewport map | LSTM, PanoSalNet | 1 s |
 | | Manfredi et al. 2023 [30] | PoIs | PoI | LSTM | 3 s |
 | Conv-LSTM (Section 2.3.2) | Wang et al. 2022 [31] | Viewport maps, video frames | Viewport map | ConvLSTM, CNN | 2 s |
 | | Dong et al. 2023 [32] | Eye gaze heatmaps (current and previous users) | Eye gaze heatmaps | Conv-LSTM, SE-Unet | 10 s |
 | | Wang et al. 2024 [33] | Video frames, viewport maps (current and previous users) | Viewport map | ConvLSTM, CNN, clustering | - |
 | | Zhang et al. 2025 [34] | Eye gaze scanpath, video frames | Tile probabilities | ConvLSTM, GCN | - |
 | GRU (Section 2.3.3) | Li et al. 2023 [35] | Video frames, viewport maps (previous users) | Viewport map | S-SPCNN, T-SPCNN, SP-ConvGRU | 2 s |
 | | Setayesh et al. 2024 [36] | Head orientation traces, video frames | Viewport tiles | PAVER, GRU | - |
 | Transformers (Section 2.3.4) | Zhang et al. 2022 [37] | FoV images and head and eye gaze scanpaths | Head scanpath | Transformer, CNN | 10 s |
 | | Gao et al. 2024 [38] | Video frames, head scanpath | Head scanpath | PAVER, LSTM, Transformer | 1, 2–5 s |
 | | Guo et al. 2024 [39] | Video frames | Head scanpath | Bidirectional Transformer, residual network | - |
 | GNN (Section 2.3.5) | Xu et al. 2023 [40] | Video frames, head orientation traces (current and previous users) | Tile probabilities | GNN, LSTM, SalNet, K-means, FlowNet | 3 s |
 | RL (Section 2.3.6) | Xu et al. 2019 [41] | Video frames, head scanpaths (current and previous users) | Head scanpath | DRL | 0.03 s |
Reference | Videos | Users | Duration | Resolution | Video Source | Data Capture System | Employed In | Available |
---|---|---|---|---|---|---|---|---|
Qian et al. 2016 [10] | 4 | 5 | 100–206 s | - | YouTube and Facebook | Head orientation from smartphone inside headset | [10] | no |
Bao et al. 2016 [9] | 16 | 153 | 30 s | 1080p–4K | YouTube | Head orientation from HMD | [9,14,15] | no |
Wu et al. 2017 [42] | 18 | 48 | 164–655 s | - | - | Head position and orientation from HMD | [12,20,22,28,31,33,38] | yes |
Fan et al. 2017 [23] | 10 | 25 | - | 4K | YouTube | Head position and orientation from sensor logger and screen capturer | [17,23,29] | no |
Lo et al. 2017 [46] | 10 | 50 | 60 s | 4K | YouTube | Head position and orientation from sensor logger and screen capturer | [8,17,25,27,29,40] | yes |
Qian et al. 2018 [11] | 10 | 130 | 117–293 s | 4K | YouTube | Head orientation from smartphone inside headset | [11,34] | yes |
David et al. 2018 [51] | 19 | 57 | 20 s | 4K | YouTube | Eye tracker and head scanpath (derived from eye) | [19,20,21,38] | yes |
Nguyen et al. 2018 [24] | 11 | 48 | 20–45 s | - | Videos from existing datasets | Eye gaze from head orientation logs in original datasets | [3,17,24,29] | yes |
Xu et al. 2018 [26] | 208 | 45 | 20–60 s | 4K | YouTube | Eye tracker | [17,26,29,32,35,37,39] | yes |
Zhang et al. 2018 [84] | 104 | 27 | 20–60 s | - | Videos from existing datasets | Eye tracker | [35,36,82] | yes |
Xu et al. 2019 [41] | 76 | 58 | 10–80 s | 3–8K | YouTube and VRcun | Head orientation from HMD and eye tracker | [17,20,29,39,41] | yes |
Li et al. 2018 [47] | 60 | 221 | 10–23 s | 4–8K | Custom and YouTube | Head orientation from HMD and eye tracker | [18,39] | yes |
Nasrabadi et al. 2019 [45] | 28 | 30 | 60 s | 2–4K | Custom (Samsung Gear 360) and YouTube | Head orientation from HMD | [16,38] | yes |
Nguyen et al. 2019 [79] | 24 | 48 | 60–655 s | - | Videos from existing datasets | Eye gaze from head orientation logs in original datasets | [31] | yes |
Manfredi et al. 2023 [30] | 31 | 30 | - | - | Manual PoI extraction from existing datasets | Actual PoIs from extracted fixations | [30] | no |
Reference | Evaluation Metric | Comparison Method |
---|---|---|
Qian et al. 2016 [10] | Accuracy | Variations of own method |
Bao et al. 2016 [9] | Mean error, RMSE, 99th percentile, and 99.9th percentile | Naive approach and variations of own method |
Qian et al. 2018 [11] | Accuracy | Naive approach |
Ban et al. 2018 [12] | Viewport deviation, V-PSNR, viewport quality variance, and bandwidth consumption | Variations of own method |
Petrangeli et al. 2018 [15] | Great circle distance and mean FoV overlap | Naive approach, [12,43] |
Nasrabadi et al. 2020 [16] | Mean FoV overlap | Naive approach and variations of own method |
Dziubinski et al. 2024 [8] | Average bitrate of video overlapping with viewport, Q3, and Q4 | Variations of own method |
Li et al. 2023 [17] | Accuracy, F1-score, mean FoV overlap, intersection angle error, and IoU | [23,24,25,26,29,41,55] |
Mahmoud et al. 2024 [18] | VWS-PSNR | [48,49,50] |
Feng et al. 2025 [19] | Q1, Q2, Q3, and re-buffering time | [11,52,53] |
Chao et al. 2021 [20] | Average great circle distance, average ratio of overlapping tiles, and mean FoV overlap | [16,24,29] |
Wang et al. 2024 [21] | Great circle distance, average great circle distance | [20,29] |
Aladagli et al. 2017 [14] | Cross-correlation | - |
Alice et al. 2024 [22] | Q1, Q2, and Q3 | [24] |
Fan et al. 2017 [23] | Accuracy, F1-score, missing ratio, bandwidth consumption, re-buffering time, PSNR, SSIM, PEVQ, and running time | Naive approach, [65], and variations of own method |
Nguyen et al. 2018 [24] | Accuracy | Variations of own method |
Li et al. 2018 [25] | Accuracy and F1-score | [10] |
Xu et al. 2018 [26] | Mean intersection angle error | Variations of own method |
Fan et al. 2020 [27] | Missing ratio, unseen ratio, bandwidth consumption, peak bandwidth, V-PSNR, and re-buffering time | Naive approach and DR |
Feng et al. 2020 [28] | Bandwidth consumption, accuracy, and processing time | [75] and variations of own method |
Rondón et al. 2022 [29] | Accuracy, F1-score, mean FoV overlap, intersection angle error, and IoU | [12,14,23,24,25,26,41] |
Nguyen et al. 2023 [3] | Buffer stall count, buffer stall duration, blank ratio, SSIM, and bandwidth saving | [71], stream all tiles (no prediction), and variations of own method |
Manfredi et al. 2023 [30] | Accuracy | Naive approach |
Wang et al. 2022 [31] | Accuracy, recall, precision, bandwidth saving, and processing time | [24,28] |
Dong et al. 2023 [32] | Hit rate, viewport deviation, MSE, viewport PSNR, bandwidth consumption, viewport quality variance, and re-buffering time | [12] and variations of own method |
Wang et al. 2024 [33] | Accuracy, precision, recall, F1-score, and bandwidth saving | [24,28,31] |
Zhang et al. 2025 [34] | PSNR, inter-chunk PSNR, and re-buffering time | [11,80,81] |
Li et al. 2023 [35] | Accuracy, precision, and recall | [11,24,87,88,89] |
Setayesh et al. 2024 [36] | Tiles overlap, Q1, Q2, Q3, and re-buffering time | [92,93,94] |
Zhang et al. 2022 [37] | Viewport overlap ratio, MSE, Q1, Q2, and re-buffering time | [12,23,53,81,97,98,99] |
Gao et al. 2024 [38] | IoU and orthodromic distance | [20,29,100] |
Guo et al. 2024 [39] | IoU, mean FoV overlap, and great circle distance | [24,41,101,102,103,104] |
Xu et al. 2023 [40] | Accuracy, Q1, Q2, and Q3 | [28,106] |
Xu et al. 2019 [41] | Mean FoV overlap | Naive approach, random viewport prediction, and [107] |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).