Review

Learning-Based Viewport Prediction for 360-Degree Videos: A Review

Department of Information Engineering, University of Padova, 35131 Padua, Italy
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3743; https://doi.org/10.3390/electronics14183743
Submission received: 18 August 2025 / Revised: 9 September 2025 / Accepted: 18 September 2025 / Published: 22 September 2025
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)

Abstract

Nowadays, virtual reality is experiencing widespread adoption, and its popularity is expected to grow in the next few decades. A relevant portion of virtual reality content is represented by 360-degree videos, which allow users to be surrounded by the video content and to explore it without limitations. However, 360-degree videos are extremely demanding in terms of storage and streaming requirements. At the same time, users are not able to enjoy the 360-degree content all at once due to the inherent limitations of the human visual system. For this reason, viewport prediction techniques have been proposed: they aim at forecasting where the user will look, thus allowing the transmission of only the viewport content or the assignment of different quality levels to viewport and non-viewport regions. In this context, artificial intelligence plays a pivotal role in the development of high-performance viewport prediction solutions. In this work, we analyze the evolution of viewport prediction based on machine and deep learning techniques in the last decade, focusing on their classification based on the employed processing technique, as well as the input and output formats. Our review shows common gaps in the existing approaches, thus paving the way for future research. An increase in viewport prediction accuracy and reliability will foster the diffusion of virtual reality content in real-life scenarios.

1. Introduction

In recent years, the diffusion of virtual reality (VR) technology has increased, offering highly immersive experiences that bring viewers into virtual environments. Unlike traditional 2D screens, VR allows users to interact with content, direct their focus, move their heads freely, and physically navigate within the virtual space. Therefore, users are provided with three or six degrees of freedom (DoF), plus the ability to interact with the environment. This enhances the sense of presence and gives the impression of physically being in the virtual scene. VR experiences are typically accessed through a head-mounted display (HMD), with various options available, such as the Meta Quest, HTC VIVE, and Apple Vision Pro. The VR market has witnessed substantial growth, with projections indicating an increase from USD 47 billion in 2024 to an estimated USD 125.5 billion by 2030 [1].
In particular, 360-degree video has emerged as a popular form of content for VR experiences, also thanks to the diffusion of lightweight and low-cost capture devices. With 360-degree videos, viewers are surrounded by the media content, as if they were at the center of a sphere, thus allowing free exploration in all directions. This provides viewers with three DoF, allowing them to move their heads along three angles of movement, namely yaw, pitch, and roll, as shown in Figure 1.
Although 360-degree videos enhance immersivity, the bandwidth requirements increase significantly with respect to traditional videos. For example, while streaming a 4K video typically requires 25 Mb/s, delivering 4K resolution to each eye for full 360-degree viewing requires about 400 Mb/s [2]. Major streaming platforms like Facebook and YouTube usually transmit the entire 360-degree video frame [3]. However, the limitations of the human visual system, as well as the HMD’s field of view (FoV), mean that users can only see a portion of the complete 360-degree video at any given moment, known as the viewport. This portion typically corresponds to less than 20% of the entire 360-degree video frame [4]. Therefore, sending the full frame is inefficient in terms of bandwidth usage, as it includes a significant amount of unused data. Moreover, transmitting a large amount of data can introduce transmission delays, leading to potential discomfort, such as cybersickness [5,6]. To address these issues, adaptive streaming technologies, such as tile-based adaptive streaming, have been developed to optimize bandwidth usage by predicting the viewport on which the user will focus [7]. These techniques split the 360-degree frames into fixed rectangular regions called tiles and prioritize the streaming of the tiles within the predicted viewport, while sending the other portions of the frame at a low resolution or omitting them entirely. However, determining which areas to stream at a high resolution requires viewport prediction techniques. In addition, the prediction accuracy is of the utmost importance, as the misprediction of the current viewport may highly impact users’ quality of experience (QoE) [8].
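To make the tile-based mechanism concrete, the following minimal Python sketch (not taken from any of the cited systems; the tile grid, FoV extent, and quality levels are illustrative assumptions) marks the tiles of an equirectangular frame that intersect a predicted viewport center and assigns them a higher quality level.

import numpy as np

def tile_quality_map(yaw_deg, pitch_deg, n_rows=6, n_cols=12,
                     fov_h_deg=100.0, fov_v_deg=90.0,
                     hi_quality=3, lo_quality=0):
    """Return an (n_rows, n_cols) array of quality levels for an equirectangular tile grid."""
    qualities = np.full((n_rows, n_cols), lo_quality, dtype=int)
    tile_w, tile_h = 360.0 / n_cols, 180.0 / n_rows
    for r in range(n_rows):
        for c in range(n_cols):
            # Tile center in degrees (yaw in [-180, 180), pitch in [-90, 90)).
            t_yaw = -180.0 + (c + 0.5) * tile_w
            t_pitch = -90.0 + (r + 0.5) * tile_h
            # Yaw distance with wrap-around at +/- 180 degrees.
            d_yaw = min(abs(t_yaw - yaw_deg), 360.0 - abs(t_yaw - yaw_deg))
            d_pitch = abs(t_pitch - pitch_deg)
            if d_yaw <= (fov_h_deg + tile_w) / 2 and d_pitch <= (fov_v_deg + tile_h) / 2:
                qualities[r, c] = hi_quality
    return qualities

print(tile_quality_map(yaw_deg=30.0, pitch_deg=0.0))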
In this scenario, artificial intelligence (AI)-based solutions come into play. In the last few years, many methods based on both machine learning (ML) and deep learning (DL) have been proposed to successfully perform viewport prediction. A possible strategy to classify viewport prediction techniques consists in considering the type of processed information. The first techniques mainly analyzed the previous head trajectories of the user to forecast subsequent head positions [9,10,11]. An evolution of this approach involves the integration of cross-user fixations to enhance predictions [12].
Approaches that rely on past head positions exhibit high efficiency when the prediction horizon is short (e.g., from 30 milliseconds to 1 s). However, their accuracy diminishes when attempting to make longer predictions, as the relevance of previous positions decreases over time. As an example, in [10], the prediction accuracy was around 96.6% when predicting 1 s ahead. However, when the prediction interval was extended to 2 s, the accuracy dropped to approximately 71.2%.
An alternative approach consists in exploiting saliency information. Saliency estimation allows the prediction of areas on which users will focus their attention [13]. Saliency algorithms usually take as input a video or a single frame and provide as output a map indicating the salient regions. Therefore, viewport prediction methods based on saliency select the next viewport as the one centered on the point with the highest value within the saliency map [14]. However, viewport prediction using saliency-based approaches has some limitations, as the predicted salient points exhibit faster and larger displacements compared to real head movements, which are generally smoother [14]. This makes the saliency-based approach less effective as a standalone method.
To enhance the prediction accuracy by leveraging the benefits of both strategies, hybrid methodologies have been explored, integrating past head positions with saliency estimation techniques. In addition, other information types can be complemented with head movements and saliency maps, such as eye gaze information.
An alternative perspective in classifying viewport prediction techniques consists in analyzing the type of employed algorithm. The types of learning architectures are extremely varied: the first methods presented in the state of the art employed less complex techniques such as regression or clustering. Over the years, however, the complexity of the employed approaches has increased, with the aim of improving the prediction performance and increasing the prediction horizon. Therefore, more complex architectures, such as Long Short-Term Memory (LSTM) and Transformers, have been proposed.
In addition, the output types are varied and may include the next head position, the next gaze position, the probabilities of tiles, or tiles within the viewport, as well as the quality level associated with each tile.
The choice of inputs and outputs, as well as of the specific algorithm, depends on several factors, such as the target complexity and the availability of data. In this survey, we propose a thorough review of the available techniques. Specifically, we classify existing state-of-the-art methods into three main categories, namely head movement-based, saliency-based, and hybrid approaches. The definition of the categories was established following a thorough investigation of the state-of-the-art methods and considering the types of information utilized by each approach for viewport prediction. Head movement-based methods rely only on users’ past head movements to predict the future viewport. Saliency-based methods utilize users’ visual attention, in the form of a saliency map, to perform the prediction. Hybrid approaches combine multiple sources of information—typically both head movement and saliency—to improve the prediction accuracy. Within each category, we identify a set of sub-classes based on the learning technique employed in each approach (e.g., regression, clustering, and Transformer models).
While prior surveys have broadly covered 360-degree video streaming and representation, this work is the first to systematically review viewport prediction methods as a standalone topic. We conduct a comprehensive review of the most notable and diverse viewport prediction methods from the past 10 years.
We selected papers based on the following inclusion criteria:
  • Recency: We focused on papers published within the last 10 years (2016–2025) in order to reflect the development of viewport prediction, starting from regression methods, which were already being used in 2016, and ending with modern approaches employed in the last year.
  • Diversity of approaches: We intentionally included works using a variety of strategies, namely head movement-based, saliency-based, and hybrid approaches, to ensure broad coverage of the field and show the strengths and drawbacks of every approach.
  • Technique variety: For each category, we selected works that employed a range of techniques, such as regression, clustering, LSTM, Transformer models, reinforcement learning (RL), and graph neural networks (GNNs), to reflect the technical diversity in the state-of-the-art methods.
To perform paper selection, we applied the following exclusion criteria:
  • Duplicates (across databases or different papers using the same viewport prediction approach);
  • Papers that were not written in English;
  • Papers with no citations;
  • Papers not addressing omnidirectional content;
  • Papers focusing on proposing new streaming and/or bit allocation techniques employing existing viewport prediction approaches.
We used the Scopus and IEEE Xplore databases as these cover the majority of high-impact computer science and multimedia venues relevant to viewport prediction. We used the following search keywords: “viewport prediction” AND (“360 video” OR “360° video” OR “360-degree video” OR “omnidirectional video”). Boolean operators AND/OR were applied consistently across both databases. Paper selection was performed independently by at least two reviewers, with disagreements resolved through discussion to reduce bias. We provide the PRISMA flow diagram in Figure 2.
The remainder of the paper is organized as follows. In Section 2.1, we review the viewport prediction techniques that process only head movement data; in Section 2.2, we describe the methods based on saliency information, and, in Section 2.3, we report the hybrid approaches. Later, in Section 3, we provide an overview of existing datasets for 360-degree videos. In Section 4, we present a comprehensive analysis of the evaluation metrics adopted by state-of-the-art methods and outline the baseline or comparison methods used by each approach. In addition, we provide a discussion of current trends and gaps in Section 5, together with some recommendations for future research directions. Finally, Section 6 concludes the paper.

2. Viewport Prediction

In this section, we review the approaches presented in the state of the art by performing broad classification into (i) head position approaches, (ii) saliency-based approaches, and (iii) hybrid approaches. For each of these categories, we provide differentiation based on the employed processing algorithm and the type of output. A general overview is provided in Figure 3.
The left side of the figure shows the possible input types. As a first option, the input might represent the traces of eye movements regarding the 360° video (eye gaze scanpaths). Alternatively, head-related information could be employed in two forms. In the first option, head movements are represented as points on the sphere or on the 360-degree equirectangular projections (ERPs) of the frames, thus realizing what is known as a head scanpath. In the second option, the input consists of head orientation traces, where orientation angles are expressed with respect to a head-centered system. Users’ viewing behavior can also be represented as viewport maps, i.e., heatmaps in the same format as the video frame that highlight the viewport region. Another option is to provide the 360-degree video frames as input, which typically occurs for saliency-based methods. Finally, points of interest (PoIs), i.e., regions of the 360° frame identified as relevant, could be extracted in advance and then provided as input to the viewport prediction algorithm.
The central column of Figure 3 shows the possible processing strategies. Viewport prediction could be realized by directly processing one or more raw inputs or by pre-processing the video frames for the extraction of saliency and/or motion information. In addition, some methods rely only on data from a single user, while others integrate the data of multiple viewers. Additionally, some approaches combine different modalities.
Finally, the outputs depend on both the input type and the processing method. Possible outputs include the predicted eye gaze, head scanpaths, or orientations. Additionally, algorithms could provide as output, for each frame tile, either its probability of being part of the viewport or its assigned quality level. In other cases, viewport prediction methods could directly output the future PoIs or a set of predicted tiles or viewports.
A summary of the reviewed approaches, together with the corresponding classification, is provided in Table 1.

2.1. Head Position Approaches

The core principle of viewport prediction based on head movements is to record historical data of head displacements to predict the following ones. There are two main sub-categories of head-based viewport prediction: (i) single-user and (ii) multi-user approaches. The former performs movement prediction for a user Ui based on the previous movements of the same user. The second class works under the assumption that users will behave similarly when watching the same video, so that the movement trajectories of different users Uj, with j ≠ i, are used to predict the head trajectory of the i-th user. In the second case, the users employed in the training process are usually grouped into sets so that the set that is most similar to the i-th user is employed for prediction. In the following, we give a detailed discussion of the different processing approaches that employ the head movement strategy for viewport prediction.

2.1.1. Regression Approaches

Regression approaches represent some of the first ML-based techniques applied to viewport prediction through head movement analysis. Among the first contributors, Qian et al. proposed a sliding window framework [10]. Specifically, they defined a window size of 1 s for the analysis of head movements, in terms of head orientation, with the aim of predicting the following 2 s of movement data. To this end, they applied three methods: (i) average, (ii) linear regression (LR), and (iii) weighted linear regression (WLR). The former is a simplistic model where the average head position is computed in the analysis time window and used as a prediction for the next 2 s. The second approach consists in training an LR model on the analysis window by weighting all training samples equally. The third approach refines the LR technique by assigning a larger weight to more recent samples. The authors compared the three techniques by performing a subjective experiment with a custom dataset consisting of four short 360-degree videos sourced from YouTube and watched by five users. The experiments were conducted using a Google Cardboard viewer with a mobile phone device. Accuracy was used as an evaluation metric. The results revealed successful predictions for short-term intervals (from 0.5 to 1 s), with WLR achieving accuracy of 96.6% ± 2.0% for 0.5 s and 92.4% ± 3.7% for 1 s. However, prediction for longer intervals (i.e., 2 s) proved to be more challenging, as the WLR method achieved accuracy of 71.2% ± 7.6%. As expected, WLR demonstrated higher accuracy compared to the other methods, with LR exhibiting better accuracy than the average method. Additionally, the authors measured the bandwidth savings in cases where viewport prediction can be successfully performed, i.e., testing the best-case scenario where the next viewport is known. Their analysis revealed that viewport prediction allows a bandwidth reduction of 80%, thus clearly showing its impact [10].
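As an illustration of the sliding-window predictors described above, the following minimal sketch (our simplification, not the authors' code; the sampling rate and weighting scheme are assumptions) extrapolates a window of yaw samples with the average, LR, and WLR strategies.

import numpy as np

def predict_yaw(yaw_history, dt=0.05, horizon=1.0, method="wlr"):
    """yaw_history: yaw angles (degrees) sampled every dt seconds, oldest first."""
    t = np.arange(len(yaw_history)) * dt
    if method == "average":
        return float(np.mean(yaw_history))
    if method == "lr":
        slope, intercept = np.polyfit(t, yaw_history, deg=1)
    else:  # "wlr": linearly increasing weights emphasize the most recent samples
        w = np.linspace(0.1, 1.0, len(yaw_history))
        slope, intercept = np.polyfit(t, yaw_history, deg=1, w=w)
    return float(slope * (t[-1] + horizon) + intercept)

window = np.linspace(10, 25, 20)          # 1 s of yaw samples at 20 Hz
print(predict_yaw(window, horizon=1.0))   # extrapolated yaw 1 s ahead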
This study was extended in [11] by collecting a new custom dataset containing head traces from 130 participants who watched 10 360-degree videos. The acquisitions were performed using a Samsung Gear VR headset with a Samsung Galaxy S8 device. In this case, the authors used an analysis window of 1.5 s and aimed at predicting the head movements of the next 3 s. To evaluate the model, they used the same accuracy metric as applied in the previous study [10]. Four models were investigated: (i) static, which assigns the current location to the predicted head position, thus assuming that no movement is performed by the user; (ii) LR; (iii) ridge regression (RR), a more complex version of LR that adds a regularization term to reduce overfitting; and (iv) a support vector regressor (SVR), which utilized support vector machines (SVMs) for regression. The static approach will be referred to as naive in the following. Based on the achieved performance, the authors selected LR to perform a one-second prediction and RR to extend the prediction up to the third second.
As an additional contribution, the authors proposed Flare, a practical streaming system for 360-degree videos on mobile devices incorporating viewport prediction. Flare has been evaluated in terms of the stall duration, average bitrate, and average quality level. Based on these parameters, Flare enhanced the quality by up to 18 times on WiFi compared to the no-prediction approach. Additionally, it achieved bandwidth reductions of up to 35% and quality enhancements of up to 4.9 times on LTE.
Regression for viewport prediction has also been applied in [9]. Specifically, the authors compare three approaches. The first is the naive technique. The second model is LR, and the third algorithm is a shallow neural network model consisting of three layers and 50 hidden neurons. In this study, the next viewport is predicted by estimating the pitch and yaw angles. As users tend to explore content horizontally (yaw change) rather than vertically (pitch change), the focus is placed on yaw prediction, while the pitch is omitted. To evaluate the prediction accuracy, four metrics have been employed: the mean error, root mean square error (RMSE), 99th percentile, and 99.9th percentile. The shallow neural network model demonstrated superior performance compared to the baseline approach, yielding approximately a 50% reduction for all metrics for a 0.2 s prediction window. With respect to the LR method, the improvement was smaller, corresponding to decreases of 0.09 and 0.32 in the mean error and RMSE, respectively, and reductions of 1.64 and 3.25 in the percentiles. The models were tested using a custom dataset generated by 153 viewers. Specifically, 35 subjects watched 16 video clips, and 118 watched three to five randomly selected video clips using an Oculus DK2 headset. The prediction results were then employed to select the area to transmit, thus controlling the amount of content delivered to the viewer. Specifically, if the prediction accuracy is high, only the content related to the viewport is sent to the user, whereas, if the accuracy is low, more content is sent to maintain good quality. In the extreme scenario of poor accuracy, such as a high failure ratio, the entire frame might be sent. The authors performed viewport prediction for 1 s windows, demonstrating good performance within 100–500 milliseconds. With a prediction window of 0.2 s, they achieved a 45% reduction in bandwidth consumption within a 1% failure ratio [9].
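As a rough illustration of the third model (not the authors' implementation; the data are synthetic and the window sizes are arbitrary assumptions), a network with a single hidden layer of 50 neurons can be trained to map a short yaw history to the yaw one step ahead.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic yaw random walks: 10 past samples per window plus the yaw one step ahead.
starts = rng.uniform(-180, 180, size=(500, 1))
walks = starts + np.cumsum(rng.normal(0, 2, (500, 11)), axis=1)
X, y = walks[:, :10], walks[:, 10]

model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]), y[:3])        # predicted vs. actual future yaw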
An alternative approach for regression has been presented in [12]. Specifically, the authors propose CUB360, which improves prediction for longer periods by complementing the personalized analysis of user behavior with information from multiple users. In more detail, LR is employed to forecast the user’s head orientations in future frames using the target user’s past trajectory data. Then, the prediction is refined by integrating head orientations from other users who have already viewed the same video, through K-Nearest Neighbors (KNN), by selecting the K closest traces. Finally, a probability voting mechanism is applied, combining the results from LR and KNN. Model evaluation has been performed using a dataset consisting of 18 videos viewed by 48 participants [42]. They tested various sizes of prediction windows, ranging from 2 to 6 s, analyzing head orientations within a sliding window of 1 s. To evaluate the proposed method, four evaluation metrics have been used: (i) the viewport peak signal-to-noise ratio (PSNR), where the PSNR is computed only over the user’s viewport, using the ground truth as a reference; (ii) viewport deviation (vD), which calculates the percentage of blank area, i.e., empty tiles, inside the viewport, so that the accuracy is computed as 1 − vD; (iii) viewport quality variance, which utilizes the coefficient of variation to evaluate the consistency of visual quality across the viewport within a single video segment—the coefficient of variation is defined as CV = (σ/μ) × 100%, where σ indicates the standard deviation and μ the mean; (iv) bandwidth occupation, which calculates the total bitrate consumed per segment, with higher values indicating increased data utilization. The proposed method was compared to the standalone LR approach, without incorporating KNN. Their findings reveal significant improvements compared to the LR approach, with an absolute improvement (i.e., accuracy(CUB360) − accuracy(LR)) of 20.2% and a relative increase (i.e., ((accuracy(CUB360) − accuracy(LR))/accuracy(LR)) × 100) of 48.1%. Additionally, CUB360 enhanced the viewport PSNR by 30.28% with respect to LR and improved bandwidth usage by 32.26% [12]. This work demonstrates that clustering approaches represent a valuable technique for viewport prediction, as will be detailed in the following section.
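As an illustration of the combination just described, the following minimal sketch (our simplification, not the CUB360 implementation; the distance metric, the value of K, and the convex combination standing in for the probability voting mechanism are assumptions) predicts the next yaw value from a personal LR extrapolation and the K nearest cross-user traces.

import numpy as np

def cub360_like_predict(own_trace, others_now, others_future, k=3, alpha=0.5):
    """own_trace: past yaw samples of the target user (degrees, oldest first).
    others_now / others_future: per-user yaw of other viewers at the current and target times."""
    # 1) Personal linear-regression extrapolation one step ahead.
    t = np.arange(len(own_trace))
    slope, intercept = np.polyfit(t, own_trace, deg=1)
    lr_pred = slope * (t[-1] + 1) + intercept
    # 2) Cross-user term: average the future positions of the K users whose current
    #    position is closest to the target user's current position.
    dist = np.abs(np.asarray(others_now) - own_trace[-1])
    knn_pred = np.mean(np.asarray(others_future)[np.argsort(dist)[:k]])
    # 3) Simple convex combination standing in for the paper's probability voting.
    return alpha * lr_pred + (1 - alpha) * knn_pred

print(cub360_like_predict([10, 12, 15, 18], [20, 5, 17, 60], [25, 8, 22, 70]))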

2.1.2. Clustering Approaches

Compared to regression-based studies, clustering approaches have been investigated to forecast the viewport over longer durations, extending up to 10 s. Based on the assumption that many users behave similarly when watching 360-degree videos, the behavior of viewers who watch the same videos can be grouped using clustering techniques. Subsequently, the viewport can be predicted for a new viewer by identifying the cluster to which his/her behavior is aligned [15,16].
For instance, in [15], users’ head orientations over time have been represented as trajectories, and spectral clustering has been applied to group similar trajectories for the same 360-degree video. From each cluster, a trend trajectory is selected as a representative of the set, showing the average behavior of trajectories within that cluster. These trend trajectories are used to predict the future head orientations of new users watching the same 360-degree video. The prediction was performed in intervals of 1, 5, and 10 s. The dataset employed in the study was the one presented in [9]. To evaluate the proposed method, two metrics were used. The first metric was the great circle distance, also known as the orthodromic distance (i.e., the shortest distance between two points on the surface of a sphere), between the center of the predicted viewport and the ground-truth viewport. The second metric measured the ratio of the overlapping area between the predicted viewport and the ground-truth viewport. The authors compared the proposed model with other state-of-the-art approaches, including the LR model introduced by Bao et al. [43], CUB360 [12], and the naive baseline approach. Surprisingly, the comparison highlighted that, for 1 s predictions, the clustering approach and CUB360 were outperformed by the naive technique. However, when the prediction duration was larger, the performance of the clustering-based approach increased. For example, for 5 s predictions, the clustering approach successfully predicted the viewport area (in terms of 50% overlap between the actual and predicted viewport) for 70% of viewers, whereas the naive approach predicted the same overlap percentage for only 50% of viewers [15].
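A minimal sketch of this idea is given below (illustrative only, not the implementation of [15]; the synthetic trajectories, the number of clusters, and the affinity parameters are assumptions): yaw trajectories are clustered, the per-cluster mean serves as the trend trajectory, and a new viewer is matched to the closest trend.

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
# 20 synthetic yaw trajectories (degrees), 50 samples each, drawn around two behaviors.
group_a = 30 + np.cumsum(rng.normal(0.5, 1.0, (10, 50)), axis=1)
group_b = -60 + np.cumsum(rng.normal(-0.3, 1.0, (10, 50)), axis=1)
trajectories = np.vstack([group_a, group_b])

# gamma is tuned to the scale of these synthetic trajectories.
labels = SpectralClustering(n_clusters=2, affinity="rbf", gamma=1e-4,
                            random_state=0).fit_predict(trajectories)
trend = {c: trajectories[labels == c].mean(axis=0) for c in np.unique(labels)}

# Match a new viewer to the closest trend over the first samples; the trend's later
# samples then serve as the long-horizon prediction.
new_start = trajectories[0, :10]
best = min(trend, key=lambda c: np.linalg.norm(trend[c][:10] - new_start))
print("predicted yaw 5 samples ahead:", trend[best][14])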
The approach presented in [15] described the position of the viewport using the roll, pitch, and yaw angles and tracked how each angle changed over time as an independent trajectory. In [16], a different clustering strategy has been employed. Specifically, the viewport has been represented as a circle, mapped to the 360-degree scene, and represented through quaternions, whose diameter represents the FoV. Viewers who have watched the same video are grouped together, using orientation traces of 1 s, if the distance between their viewport centers is less than 30 degrees for at least 90% of the time window. Every second, a new set of clusters is created. To predict the next viewport for the current user, two operations are performed: (i) cluster assignment and (ii) cluster-based viewport prediction. First, a user is assigned to a cluster if his/her current viewport center (i.e., at time t) is within 30 degrees of one of the cluster’s viewport centers. Then, the viewports at t + 1 for the users of the cluster are extracted. The quaternions associated with the centers of the viewports are averaged using the method proposed in [44] and used as a prediction of the center of the user’s predicted viewport. In the event that a user cannot be assigned to a cluster, the predicted viewport is set to be equal to the current one. In [16], prediction has been performed for 1, 3, 5, and 10 s using a dataset of 28 videos, each watched by 30 participants [45]. To evaluate the proposed approach, the effectiveness of prediction was measured by computing the overlap between the predicted viewport and the actual one, i.e., the FoV overlap. The model has been compared with a custom LR approach along with the naive approach. The authors show that their algorithm outperforms both methods in more than 40% of the dataset when using a prediction window of 5 s or more. However, for prediction periods of less than 3 s, setting the last viewport as the current prediction yields better results than their proposed method. Additionally, when analyzing the videos in the dataset, they discovered a strong correlation between the prediction accuracy and video content. Specifically, the clustering-based method performs better when a region of interest (RoI) is present in the video, e.g., a moving object, as it captures users’ attention and maintains their focus.
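The cluster assignment and quaternion averaging steps can be sketched as follows (our illustration, with assumptions on the data layout; the eigenvector-based averaging is a common implementation of quaternion averaging and may differ in detail from the method of [44]).

import numpy as np

def angular_distance_deg(q1, q2):
    """Rotation angle (degrees) between two unit quaternions (w, x, y, z)."""
    dot = abs(np.clip(np.dot(q1, q2), -1.0, 1.0))
    return np.degrees(2 * np.arccos(dot))

def average_quaternion(quats):
    """Average unit quaternions as the principal eigenvector of the outer-product sum."""
    M = sum(np.outer(q, q) for q in quats)
    avg = np.linalg.eigh(M)[1][:, -1]
    return avg / np.linalg.norm(avg)

user_q = np.array([1.0, 0.0, 0.0, 0.0])                  # user's viewport center at time t
cluster_now = [np.array([0.99, 0.10, 0.0, 0.0]),         # cluster members' centers at time t
               np.array([0.98, 0.00, 0.15, 0.0])]
cluster_next = [np.array([0.97, 0.20, 0.0, 0.0]),        # the same members' centers at t + 1
                np.array([0.96, 0.00, 0.25, 0.0])]
cluster_now = [q / np.linalg.norm(q) for q in cluster_now]
cluster_next = [q / np.linalg.norm(q) for q in cluster_next]

if any(angular_distance_deg(user_q, q) < 30 for q in cluster_now):
    print("predicted viewport center:", average_quaternion(cluster_next))
else:
    print("no cluster match: keep the current viewport")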
Clustering has been employed for viewport prediction in [8]. As viewport misprediction strongly impacts the user experience, the authors propose to enlarge the predicted viewport by including the surrounding region in the high-quality set of tiles, thus realizing viewport extension. Starting from a global repository stored on the streaming server, the authors propose extracting a local repository by identifying the subset of users that are most similar to the current one through KNN and DBSCAN clustering. Both repositories are employed to compute a score for each tile to rate its popularity. The latter is employed to assign a quality factor to each tile. Specifically, an extension region of tiles is defined to assign high quality to the tiles surrounding the predicted viewport. The method’s performance has been tested using the dataset presented in [46], consisting of 10 videos watched by 30 users. The prediction horizon was set to 1 and 2 s, and the bitrate assigned to the actual viewport was used as an evaluation metric. The experimental results show that KNN has superior performance with respect to DBSCAN. In addition, the best-performing repository, i.e., global vs. local, was dependent on the video content. For slow-paced and content-rich videos, the local repository led to better performance, as global tiles might be confused due to the high number of points of interest. In contrast, the two repositories performed similarly for fast-paced and centralized content, such as rollercoaster videos. Finally, the global repository performed better on small tiles, while the local one achieved superior performance on large tiles.

2.1.3. LSTM

Clustering methods are useful in predicting the behavior of new users by comparing them to previous users who have watched the same videos. However, these techniques have two limitations: (i) viewing trajectories of other users might not be available and (ii) the underlying assumption of similarity between users’ viewing behaviors might not hold. To address these issues, the use of LSTM methods has been explored. LSTM is a type of neural network designed to learn from sequences, thus allowing the analysis of head movements or gaze over time. This capability enables more accurate predictions of future viewports. Furthermore, LSTM can make individual predictions for each user, rather than relying on group data, which adds flexibility and personalization to the predictions.
Due to these properties, an LSTM model was selected for viewport prediction in [18]. The proposed method takes as input the pitch and yaw angle traces and produces as output the tile probabilities, together with the corresponding quality levels. Specifically, the LSTM is employed to predict tile probabilities from past user data, and a greedy method is applied to select the corresponding quality levels. All tiles are initialized with the lowest quality level, which is incrementally improved based on the resulting user’s perceived quality. The quality is opportunistically updated by taking into account the corresponding bitrate increase. Performance evaluation has been conducted with the dataset presented in [47], consisting of 60 videos watched by 221 users, with a prediction horizon of 1 s. To evaluate the proposed method, the viewport-weighted PSNR (VWS-PSNR) metric was utilized. The VWS-PSNR is a user-centric quality metric that assesses video quality specifically within the user’s viewport, rather than across the entire 360° frame. It employs a weighted mean squared error with spherical correction calculated only on the viewport pixels to better reflect the perceived quality. The authors simulated a multi-user 360° video streaming system using 20 360-degree video sequences, each segmented into 1 s intervals, along with head movement traces from 20 users. To validate the effectiveness of the proposed greedy method for quality selection, they compared it with the methods described in [48,49,50]. The results demonstrated that the proposed method significantly outperformed the baselines in terms of viewport quality, especially under bandwidth-constrained conditions.
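The greedy quality selection can be sketched as follows (not the authors' code; the utility model, bitrate table, and budget are illustrative assumptions): starting from the lowest level, the tile whose probability-weighted gain per additional bit is largest is upgraded until the bitrate budget is exhausted.

import heapq

def greedy_quality(tile_probs, bitrates, budget):
    """tile_probs: viewing probability per tile; bitrates: bitrate per quality level (increasing)."""
    levels = [0] * len(tile_probs)
    spent = bitrates[0] * len(tile_probs)            # every tile starts at the lowest level
    heap = [(-p / (bitrates[1] - bitrates[0]), i) for i, p in enumerate(tile_probs)]
    heapq.heapify(heap)                              # max-heap on gain per extra bit (negated)
    while heap:
        _, i = heapq.heappop(heap)
        nxt = levels[i] + 1
        cost = bitrates[nxt] - bitrates[levels[i]]
        if spent + cost > budget:
            continue                                 # cannot afford this upgrade
        levels[i], spent = nxt, spent + cost
        if nxt + 1 < len(bitrates):                  # re-insert with the gain of the next upgrade
            heapq.heappush(heap, (-tile_probs[i] / (bitrates[nxt + 1] - bitrates[nxt]), i))
    return levels

print(greedy_quality([0.9, 0.6, 0.1, 0.05], bitrates=[1, 2, 4, 8], budget=14))  # [3, 2, 0, 0]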
An LSTM-based approach is leveraged also in [19], where the authors define Stoc-360, a multi-window probabilistic viewport prediction and tile-level bitrate adaptation streaming algorithm. Instead of relying on a binary decision to determine whether a tile will appear in the next viewport, Stoc-360 assigns each tile a probability of being viewed, where a higher probability indicates a higher likelihood of being in the upcoming viewport. Recognizing that prediction becomes more difficult as the prediction window increases, the authors use multiple prediction models, each tailored to a specific prediction window (e.g., 1 s, 2 s, and so on). To further mitigate prediction challenges with longer windows, they introduce a patience pattern, which temporarily pauses video downloading to reduce the prediction window length. Their viewport prediction model is based on a three-layer LSTM with 128 hidden units that takes as input the head trajectory on the sphere, also referred to as the head scanpath. To evaluate the QoE, they consider four metrics: viewport quality (higher bitrate yields better visual quality), spatial variation (bitrate uniformity across tiles in the viewport), temporal variation (bitrate consistency over time), and re-buffering time (playback interruptions). Experiments were conducted on the dataset presented in [51], consisting of 19 videos watched by 57 users. Stoc-360 has been evaluated against three state-of-the-art models: Flare [11], TBAS [52], and DRL360 [53]. Stoc-360 outperformed all baselines, achieving a 16.75% higher QoE score than TBAS, a 21.46% gain over Flare, and a 48.80% increase relative to DRL360.
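A minimal PyTorch sketch of a scanpath-to-scanpath predictor in this spirit is given below (the three LSTM layers with 128 hidden units and the spherical scanpath input follow the description above; the output head and tensor layout are assumptions).

import torch
import torch.nn as nn

class ScanpathLSTM(nn.Module):
    def __init__(self, in_dim=3, hidden=128, layers=3, horizon=5):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, horizon * in_dim)   # regress `horizon` future points
        self.horizon, self.in_dim = horizon, in_dim

    def forward(self, x):                   # x: (batch, past_steps, 3) points on the sphere
        _, (h, _) = self.lstm(x)
        out = self.head(h[-1])              # final hidden state of the last LSTM layer
        return out.view(-1, self.horizon, self.in_dim)

model = ScanpathLSTM()
past = torch.randn(2, 30, 3)                # 2 users, 30 past scanpath samples
print(model(past).shape)                    # torch.Size([2, 5, 3])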
As described in the previous paragraphs, LSTM models are adept in learning patterns from sequences, such as how a user’s gaze moves over time in 360-degree videos. However, standard LSTM only looks at the past to make predictions. To successfully perform viewport prediction, it may also be helpful to look at future information in the sequence. This is where Bidirectional Long Short-Term Memory (BiLSTM) comes into play. A BiLSTM network processes the sequence in both the forward and backward directions, so that it can learn from the full context.
A BiLSTM model has been applied to viewport prediction in [17], where it has been integrated with metalearning. Metadata are incorporated to tailor personalized prediction models to individual users’ preferences. To better understand user preferences, user characteristics and behavior when viewing 360-degree videos have been analyzed using the algorithm presented in [54]. The prediction horizon for viewport prediction was set to 5 s. The experimental evaluation encompassed five datasets, with the largest comprising 208 videos watched by 30 users [26], along with four additional datasets: one with 76 videos viewed by 58 users [41], another with 11 videos viewed by 48 users [24], a third with 10 videos watched by 25 users [23], and the last corresponding to the dataset presented in [46]. To evaluate the proposed method, several metrics were used. In more detail, the authors employed both standard indicators such as accuracy, the F1-score, and the Intersection over Union (IoU) (whose interpretation in viewport prediction is provided in Section 4.1) and task-specific metrics. Specifically, they considered the mean FoV overlap, which is here defined as the area of intersection between the predicted and ground-truth tiles divided by the area of union of predicted and ground-truth tiles; the intersection angle error, which measures the distance between the predicted point and the actual point on a sphere; and the orthodromic distance. Li et al. compared their model with several state-of-the-art methods, including those presented in [23,24,25,26,29,41,55]. They reported that their model achieves better performance by enhancing the orthodromic distance by 10% to 50% in a prediction window of 5 s. The authors highlighted that the performance increase was higher in datasets with intense user head movements.
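For reference, the orthodromic distance and the tile-level FoV overlap mentioned above can be computed as follows (our helper functions, not code from [17]).

import numpy as np

def orthodromic_distance(yaw1, pitch1, yaw2, pitch2):
    """Great-circle distance in radians between two viewing directions given in degrees."""
    y1, p1, y2, p2 = map(np.radians, (yaw1, pitch1, yaw2, pitch2))
    cos_d = np.sin(p1) * np.sin(p2) + np.cos(p1) * np.cos(p2) * np.cos(y1 - y2)
    return float(np.arccos(np.clip(cos_d, -1.0, 1.0)))

def fov_overlap(pred_tiles, true_tiles):
    """Intersection over union between two sets of tile indices."""
    pred, true = set(pred_tiles), set(true_tiles)
    return len(pred & true) / len(pred | true) if pred | true else 1.0

print(orthodromic_distance(30, 10, 45, 0))      # distance in radians
print(fov_overlap({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.6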

2.1.4. Transformers

Unlike LSTM models, which process sequences step by step and rely on memory to retain important information, Transformers use a self-attention mechanism that allows them to consider the entire sequence at once. This makes them faster and more effective in capturing long-term dependencies, especially in tasks involving long sequences [56,57]. For this reason, Transformers have achieved great success in many fields and are now widely used in various applications, including viewport prediction. For instance, in [20], the Viewport Prediction Transformer (VPT360) model has been introduced. It forecasts the future positions of viewport centers by analyzing the user’s previous viewport positions over time and representing them as head scanpaths. The model examines the evolution of the scanpath and captures the dependencies between positions. In [20], a scanpath of 1 s has been employed to estimate the future viewport positions for 5 s. VPT360 was evaluated using the datasets presented in [41,42,51], and its performance was assessed using three metrics: the average great circle distance; the average ratio of overlapping tiles, which measures the percentage of overlap between the predicted and ground-truth tiles; and the mean overlap, which computes the average intersection over union between the predicted and ground-truth areas for a given prediction window. The VPT360 model was compared with [29] on the dataset presented in [51] and demonstrated superior performance for the entire 5 s window. Moreover, VPT360 was compared with other models, such as [16,24], as well as [29], on the dataset presented in [42], outperforming these methods for prediction horizons larger than 0.5 s in terms of the average ratio of overlapping tiles.
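A minimal PyTorch sketch of a Transformer encoder operating on a head scanpath is shown below (the layer sizes, positional encoding, and prediction head are assumptions and do not reproduce the VPT360 configuration).

import torch
import torch.nn as nn

class ViewportTransformer(nn.Module):
    def __init__(self, in_dim=3, d_model=64, n_heads=4, n_layers=2, horizon=5):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, 512, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, horizon * in_dim)
        self.horizon, self.in_dim = horizon, in_dim

    def forward(self, scanpath):                      # scanpath: (batch, past_steps, 3)
        x = self.embed(scanpath) + self.pos[:, :scanpath.size(1)]
        enc = self.encoder(x)                          # self-attention over the whole sequence
        out = self.head(enc[:, -1])                    # predict from the last time step
        return out.view(-1, self.horizon, self.in_dim)

model = ViewportTransformer()
print(model(torch.randn(2, 25, 3)).shape)              # torch.Size([2, 5, 3])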
Another Transformer-based approach was presented in [21]. The authors propose exploiting attention mechanisms to extract spatiotemporal information and predict multiple candidate head trajectories based on historical head data. Then, deep reinforcement learning (DRL) is employed to maximize a QoE objective based on the network conditions. Specifically, three candidate viewports are predicted, and each is assigned a different bitrate by the DRL module. The latter operates based on a QoE objective composed of four components: (i) viewport quality, measured as the bitrate assigned to the viewport region; (ii) viewport temporal variation, measured as the bitrate difference of the viewport region in consecutive instants; (iii) viewport spatial variation, measured as the quality difference between the different viewport regions; and (iv) a re-buffering measure. The proposed method has been evaluated using the dataset presented in [51] with 1 s data as input and a prediction horizon between 1 and 5 s. To evaluate viewport prediction, two metrics were used: the great circle distance and the average great circle distance. The proposed method was compared with many approaches, including [20,29]. It achieved the smallest average error across all time windows.
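The QoE objective can be sketched as a reward function combining the four components listed above (the weights and units are assumptions; the exact formulation in [21] may differ).

import numpy as np

def qoe_reward(viewport_bitrates, prev_viewport_bitrates, rebuffer_s,
               w_q=1.0, w_t=0.5, w_s=0.5, w_r=4.0):
    quality = np.mean(viewport_bitrates)                       # (i) viewport quality
    temporal = abs(quality - np.mean(prev_viewport_bitrates))  # (ii) temporal variation
    spatial = np.std(viewport_bitrates)                        # (iii) spatial variation
    return w_q * quality - w_t * temporal - w_s * spatial - w_r * rebuffer_s  # (iv) re-buffering penalty

print(qoe_reward([8, 8, 4, 4], [8, 8, 8, 4], rebuffer_s=0.2))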

2.2. Saliency-Based Approaches

The second category of input for viewport prediction algorithms is represented by saliency information.
As an example, in [14], a saliency-only viewport prediction approach has been presented. Specifically, the Graph-Based Visual Saliency (GBVS) algorithm has been employed to predict the saliency of each viewport during the viewing session, under the hypothesis that the most salient area in each frame represents the next viewport. Therefore, for each frame, a single center of attention was selected using the technique proposed in [58], where the location of attention is designated by the pixel with the highest saliency value. The evaluation was performed based on the dataset presented in [9]. Cross-correlation analysis was performed between the predicted attention center and real head movements to assess their relationship. However, the results indicate that this approach is not robust, as the predicted attention points exhibit overly rapid and extended movements compared to real head movements, which are smoother.
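The selection of the attention center can be sketched as follows (the saliency values are synthetic and GBVS itself is not re-implemented): the pixel with the highest saliency value is selected and converted to yaw/pitch coordinates on the ERP frame.

import numpy as np

def attention_center(saliency_map):
    """saliency_map: (H, W) array on the ERP frame. Returns (yaw, pitch) in degrees."""
    h, w = saliency_map.shape
    row, col = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
    yaw = (col + 0.5) / w * 360.0 - 180.0
    pitch = 90.0 - (row + 0.5) / h * 180.0
    return yaw, pitch

saliency = np.random.default_rng(2).random((90, 180))   # stand-in for a GBVS saliency map
print(attention_center(saliency))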
Viewport prediction based solely on saliency information has been explored also in [22]. Viewport prediction was performed using a pre-trained 2D Transformer-based saliency model, applied directly to 360-degree video frames without any additional training or fine-tuning specific to 360-degree content. Specifically, the 360-degree frames in ERP format were processed through a 2D saliency estimation model, after which each frame was divided into 9 × 16 rectangular tiles. The average saliency value within each tile was calculated and employed as a measure of the probability that the tile would be viewed by a user. For saliency estimation, two 2D models were employed: TranSalNet [59], which combines Transformer layers with convolutional layers, and the Visual Saliency Transformer (VST) [60], which relies solely on a Transformer architecture without additional convolutional components. The authors introduced an algorithm that computes the weighted sum of the saliency values of each tile along with the saliency values of its neighboring tiles to determine the final bitrate for each tile. The experimental evaluation was conducted using the publicly available dataset proposed in [42], and the model proposed in [24] was used as a baseline for comparison. The methods were compared in a live streaming scenario, and three metrics were employed to evaluate the QoE: the average bitrate in a user’s viewport for each frame, the variation in the bitrate within the viewport for each frame, and the variation in the bitrate across successive tiles. The proposed method is suitable for real-time live streaming, as it does not require previous head movement data; it enables real-time bitrate allocation and achieves promising performance. Moreover, the authors demonstrated that their method can achieve up to a 50% increase in bitrate for the predicted viewport compared to standard 360-degree viewing. However, with respect to [24], which benefits from previous head orientation information, the obtained results are less accurate.
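The tile-level aggregation can be sketched as follows (the neighbor weights, tile grid handling, and bitrate mapping are assumptions, not the exact rule of [22]): the saliency is averaged inside each of the 9 × 16 tiles, each tile's value is combined with its neighbors' via a weighted sum, and the result drives the per-tile bitrate choice.

import numpy as np
from scipy.ndimage import convolve

def tile_scores(saliency_map, rows=9, cols=16, self_w=0.6, neigh_w=0.05):
    h, w = saliency_map.shape
    tiles = saliency_map[: h - h % rows, : w - w % cols] \
        .reshape(rows, h // rows, cols, w // cols).mean(axis=(1, 3))   # per-tile average saliency
    kernel = np.full((3, 3), neigh_w)
    kernel[1, 1] = self_w
    return convolve(tiles, kernel, mode="nearest")                     # neighbor-aware tile scores

saliency = np.random.default_rng(3).random((1080, 1920))               # stand-in for a 2D saliency map
scores = tile_scores(saliency)
bitrates = np.digitize(scores, np.quantile(scores, [0.5, 0.8, 0.95]))  # map scores to 4 quality levels
print(bitrates.shape, int(bitrates.max()))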
Indeed, methods based only on saliency do not account for the physical limitations of head movements, which are inherently learned by approaches based on head trajectories. In addition, as 360-degree videos are explored through HMDs, head tracking can be performed by extracting telemetry information from the headset, without placing an additional burden on the video experience. Therefore, instead of relying solely on saliency, head movements and saliency analyses can be integrated to obtain a hybrid viewport prediction approach, as will be detailed in the following section.

2.3. Hybrid Approaches

As previously discussed, various ML and DL approaches have been explored to predict the future viewport. For instance, some methods rely solely on the past head movements of the viewer to predict the upcoming viewport, but these techniques are most effective for short-term predictions [9,10,11]. Another strategy involves clustering algorithms, which consider the previous head movements of the target user alongside those of other viewers who have watched the same video, enabling predictions up to 10 s. However, these methods are frequently considered unsatisfactory and impractical, as they rely on the availability of a set of previous viewing trajectories for a target video. Therefore, this type of approach may not be applicable when dealing with a new 360-degree video. Furthermore, the efficacy of this strategy depends on the type of content [15,16]. An alternative approach consists in exploiting the saliency map generated from the video frames to anticipate the future viewport, focusing on the most salient regions. However, this approach often produces unsatisfactory outcomes, particularly in achieving smooth transitions in the predicted viewport [14]. Consequently, there has been a push to explore novel methods for viewport prediction, combining the previous head movements of the target user with saliency maps of the 360-degree video. Numerous DL techniques have been investigated for this purpose, as will be detailed in the following.

2.3.1. LSTM

Fan et al. were among the first to incorporate head orientations, saliency maps, and motion maps, i.e., optical flow information, into viewport prediction [23]. Their method used an LSTM network [61], which is well suited for learning dependencies among video frames. The study involved evaluating two types of networks: the orientation-based network and the tile-based network. The orientation-based network takes as input the concatenation of the previous user’s head orientations and the downsampled version of the saliency map and motion map (64 × 64 pixels), whereas the tile-based network uses the viewed tiles instead of head orientations. Motion maps were generated using Lucas–Kanade optical flow [62] between consecutive frames, while saliency maps were estimated using a pre-trained convolutional neural network (CNN) based on VGG-16 [63], as proposed in [64]. Due to the resource-intensive nature of motion and saliency map estimation, these procedures were performed offline. The prediction horizon was set to 1 s and the simulations were performed using a custom dataset of 10 360-degree videos viewed by 25 users through the Oculus Rift HMD. The proposed method was evaluated using many metrics, including accuracy. Due to the novelty of their approach at the time, the authors compared it against three baseline methods. The first was the naive approach. The second used saliency, selecting tiles whose average saliency values exceeded a certain threshold to represent the predicted viewport. The third was the Dead Reckoning (DR) method proposed in [65], where the future viewport is estimated based on the viewer’s current head orientation and velocity. The comparison with the three baseline approaches highlighted that the proposed one offered several benefits, such as an average bandwidth reduction between 22% and 36% and shorter buffering times, with reductions of about 43% [23].
The approach presented in [23] has been extended through the definition of the Future-Aware model [27]. Specifically, the authors included information about video content and addressed distortion effects in projection. To tackle the distortion caused by projection operations (e.g., equirectangular, cubic, and rhombic dodecahedrons [66,67]) and their impacts on saliency detection, the overlapping virtual viewport technique has been employed. This technique involves generating virtual viewports covering the entire sphere, each with a size of 60 degrees and a sampling angle of 30 degrees. Saliency estimation is performed on these viewports using methods developed for 2D images, and the resulting outputs are stitched together to create the final saliency map. The proposed model has been evaluated with a prediction horizon of 1 s using a custom dataset of 10 360-degree videos watched by 50 users collected with the Oculus Rift HMD. For performance evaluation, several metrics were used, including the missing ratio and the unseen ratio, which represents the ratio of tiles that arrived at the client but were not watched compared to the total number of transmitted tiles. In addition, the authors measured the bandwidth consumption and the peak bandwidth, which indicates the highest bandwidth consumption recorded during this process. Additionally, they considered the V-PSNR defined in [68], along with the total re-buffering time, which measures the overall time spent re-buffering during playback. For comparison, they used the naive approach and the DR method proposed in [65]. The authors proved that their approach outperforms other algorithms in terms of bandwidth efficiency and re-buffering time while achieving comparable video quality. The authors show that the proposed method eliminates re-buffering events when the available bandwidth exceeds 25 Mbps, which is approximately 10 Mbps lower than required by alternative algorithms [27].
LSTM networks have been employed also in [24] to predict future head movements, utilizing saliency maps and head orientations as inputs. To facilitate processing, they represented head orientation through a tile-based map in order to align it with the saliency map. For saliency estimation, they introduce the PanoSalNet model inspired by Deep Convnet [69], a convolutional neural network for saliency detection in 2D images. PanoSalNet comprises 8 convolutional layers and one deconvolutional layer, with the initial 3 layers having the same structure as VGG-16 [63]. The subsequent 5 layers are initialized and trained on the SALICON dataset [70], a saliency map dataset for 2D images. The prediction method proposed in [24] forecasts head orientations up to 2.5 s. The overall model pipeline has been trained and evaluated using a custom saliency dataset consisting of 11 videos viewed by 48 users. The proposed method was evaluated against various scenarios presented by the authors in terms of prediction accuracy. The first scenario used only the result from Deep ConvNet as the output of viewport prediction. The second scenario employed the result of PanoSalNet as the output of viewport prediction. The third scenario used Deep ConvNet for saliency estimation, in conjunction with the head orientation, to make predictions. The authors examined the impact of the prediction window on the accuracy. They found that their model performed differently across scenarios, with the accuracy dropping from around 80% with a prediction window of 0.5 s to about 45% with a prediction window of 2.5 s. Additionally, they studied how the content of the video influenced the viewport prediction accuracy. They reported that the best performance, with accuracy of 85%, was achieved with static videos. In contrast, the accuracy decreased for videos with fast movements, falling below 70%. Overall, the model evaluation achieved promising results, yielding accurate predictions from 0.5 to 1 s, facilitating smoother playback by pre-fetching predicted frames in advance [24].
The method presented in [24] has been integrated into a 360-degree video streaming framework in [3] to enhance the video delivery quality. The proposed method has been evaluated using many metrics: (i) the buffer stall count, which represents the number of times that video playback is interrupted due to buffer depletion; (ii) the buffer stall duration, defined as the total number of seconds for which playback is halted while waiting for the buffer to refill; and (iii) the blank ratio or missing ratio. Moreover, the Structural Similarity Index (SSIM) has been used to measure the differences between tiles of the predicted viewport and the ground truth. Additionally, they measured the saved bandwidth. For comparison, the model was evaluated against three baseline methods. The first [71] used linear regression to predict the viewport by providing as input to the model the saliency map and head orientations. The second baseline performed prediction using only the saliency map, excluding the head orientation. The third baseline skipped prediction entirely and simply streamed all tiles as the output. The experimental evaluation demonstrates that the proposed model reduced bandwidth consumption by 31% with respect to the baseline approaches, while simultaneously reducing the stall duration by 64% and the stall count by 46%. These improvements significantly enhance the QoE in 360-degree video streaming systems.
Li et al. contributed to viewport prediction through LSTM by proposing a model consisting of two components [25]. The first component operates offline and is responsible for generating head maps, while the second component works online to predict future viewports for the target user using an LSTM network. In the first component, head maps are generated by combining the results from saliency estimation and motion detection. Saliency estimation employs the Fused Saliency Map (FSM) method proposed in [72], which is an adaptation of existing saliency models for 360-degree images. Motion detection utilizes Lucas–Kanade optical flow [62]. In the second component, the LSTM network predicts tile probabilities using the head maps generated by the first component, along with the previous head orientations and navigation speed of the target user. In addition, a correction module is implemented on the LSTM network output, applying a threshold to determine which tiles will be in the future viewport. The LSTM network produces two outputs: a short-term prediction for the next second and a long-term prediction for the next 2 s. The method has been evaluated on the dataset presented in [46], using the accuracy and F1-score as performance metrics. The authors also compared the proposed model with the regression-based approach presented in [10]. The comparison showed improved accuracy, with an average of 85.11% compared to 72.54% in [10] on the validation set. Additionally, the proposed method demonstrated more stability compared to [10].
In a different direction, Xu et al. introduced a model for gaze prediction in 360-degree videos, along with a large-scale eye tracking dataset specifically designed for dynamic 360-degree videos [26]. This model utilizes previous gaze data with saliency and motion features to forecast the displacement of the gaze points. The model includes three modules: (i) the trajectory encoder module, (ii) the saliency encoder module, and (iii) the displacement prediction module. The former employs an LSTM network to encode past eye gaze paths, which are crucial in predicting gaze point displacement. The saliency encoder module processes saliency maps and motion maps across multiple scales, utilizing Inception-ResNet-V2 [73] to extract features for gaze prediction. Saliency is calculated at three scales: local, FoV, and global. Local saliency is computed through the inner product between the 360-degree frame and a Gaussian window centered at the current gaze point. FoV saliency is generated by providing as input the current viewport to SalNet [69], a saliency detection method designed for 2D images. Global saliency is obtained by processing the entire 360-degree frame through SalNet. Motion maps are computed following a similar procedure across three scales, local, FoV, and global, using FlowNet2 [74]. The displacement prediction module utilizes the outputs from the trajectory encoder and saliency encoder modules, employing two fully connected layers to predict the displacement between the current and future gaze points. For performance evaluation, Xu et al. set the prediction horizon to 1 s, training and testing the model on a custom dataset. The dataset consisted of 208 360-degree videos, each viewed by 45 users wearing an HTC VIVE headset with the “7invensun a-Glass” eye tracker. The dataset encompassed diverse dynamic content, including indoor and outdoor scenes such as sports games, music shows, short movies, and documentaries. Videos were captured using both fixed and moving cameras. The proposed method has been compared with four baseline approaches: one that predicts the viewport based on the most salient point, another that uses optical flow to estimate the gaze direction, a third that relies only on the saliency encoder, and a fourth that uses only the trajectory encoder. Performance was evaluated using the mean intersection angle error (MIAE) across all videos and users. Xu et al. demonstrate that the proposed method outperforms the baselines, showing the effectiveness of their proposed model in predicting gaze point displacement.
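The local saliency computation can be sketched as follows (the standard deviation of the window is an assumption): the frame, or its saliency map, is weighted element-wise by a Gaussian window centered at the current gaze point before feature extraction.

import numpy as np

def local_saliency(frame, gaze_row, gaze_col, sigma=40.0):
    """frame: (H, W) grayscale frame or saliency map; returns the Gaussian-weighted map."""
    h, w = frame.shape
    rows, cols = np.mgrid[0:h, 0:w]
    gauss = np.exp(-((rows - gaze_row) ** 2 + (cols - gaze_col) ** 2) / (2 * sigma ** 2))
    return frame * gauss                        # element-wise product with the Gaussian window

frame = np.random.default_rng(4).random((90, 160))
print(local_saliency(frame, gaze_row=45, gaze_col=80).shape)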
LSTM networks have been employed also in [30] to predict the future PoIs in 360-degree videos. The authors observed that users tend to focus on a PoI after an initial exploration phase. Utilizing six existing datasets, they selected 31 videos with at least one PoI, manually identifying PoIs and comparing them with user-viewed viewports. This was conducted to determine if users followed the PoI or were in exploration mode, i.e., scanning the 360-degree scene without focusing on a specific point. In a preliminary analysis with 30 users across five videos, they found that users spent an average of 81.44% of the playback time viewing a PoI. For viewport prediction, 25 video chunks of 1 s were provided as input to an LSTM network to predict the next three chunks. During the initial 25 s, the naive approach is used as a viewport prediction method, as the model requires 25 s of data to work. Once the model begins producing its own predictions, each output is evaluated for success or failure. If the failure rate exceeds a predefined threshold, the system reverts to using the naive approach instead of the proposed method. Their proposed viewport prediction achieved approximately 60% accuracy, compared to about 58% accuracy for the naive approach.
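The fallback mechanism can be sketched as follows (the failure-rate window and threshold are assumptions, not the values used in [30]): the recent failure rate of the learned predictor is tracked, and the naive prediction is used whenever it exceeds a threshold.

from collections import deque

class FallbackPredictor:
    def __init__(self, model_predict, window=20, max_failure_rate=0.5):
        self.model_predict = model_predict        # callable: history -> predicted viewport/tiles
        self.outcomes = deque(maxlen=window)      # 1 = misprediction, 0 = hit
        self.max_failure_rate = max_failure_rate

    def predict(self, history):
        rate = sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0
        if rate > self.max_failure_rate:
            return history[-1]                    # naive fallback: repeat the last viewport
        return self.model_predict(history)

    def report(self, predicted, actual):
        self.outcomes.append(0 if predicted == actual else 1)

predictor = FallbackPredictor(lambda hist: hist[-1])   # placeholder model for illustration
print(predictor.predict([{1, 2}, {2, 3}]))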
Moreover, in [28], a framework for viewport prediction tailored to live streaming scenarios has been proposed. The framework aims at achieving three key objectives: (i) online training and prediction to adapt to dynamically generated live VR content; (ii) lifelong updating of the prediction model throughout the video; and (iii) real-time processing achieved through subsampling to reduce delays. The framework combines Convolutional Neural Network (CNN) and LSTM networks for prediction, leveraging both video content analysis and head orientation trajectory analysis. Initially, only the CNN operates, as there are no trajectories to train the LSTM. When partial trajectories become available, predictions are based on the combined output of the CNN and LSTM networks. The CNN employs a modified VGG network [63] to classify the input into two categories: interesting and not interesting. To reflect each user’s preferences, the CNN model is trained online, starting once the user begins watching the video. The training process takes the set of frames that the user has watched, divides it into patches, and feeds these patches to the CNN. To account for fast head movements, the LSTM network utilizes previous head trajectories to predict the tiles belonging to the next viewport. The evaluation of the framework has been performed using the dataset presented in [42], with a prediction horizon of 2 s. The model has been evaluated using three metrics: accuracy, bandwidth usage, and processing time. The processing time for each segment should be less than the segment duration, which was set to 2 s in their work. The model was compared with three baseline approaches: the first used a CNN for viewport prediction, the second utilized an LSTM model, and the third was the model proposed in [75]. The simulation tests showed high accuracy, with reported values ranging from approximately 90% up to 97–99%, along with approximately 40% bandwidth savings and fast processing times, meeting the real-time and bandwidth requirements for VR streaming.
A comprehensive analysis of LSTM-based viewport prediction has been presented in [29]. For the performance comparison, the naive approach was first compared to a deep position-only baseline, which exploits only head position information for prediction, utilizing a sequence-to-sequence framework with LSTM units for encoding and decoding. The authors set the prediction horizon to 5 s and initiated predictions after 6 s of video playback to bypass the exploration phase. These two baselines were compared to the methods presented in [14,23,24,26,41]. The authors demonstrated that the two baselines outperformed existing approaches for specific types of video content and showed that the methods incorporating saliency information failed to surpass the deep position-only baseline.
To investigate the contribution of saliency information, a saliency-only baseline was then introduced. The ground-truth saliency was derived from tracked eye positions or was set as the center of the explored viewport from the employed datasets. The authors observed the significance of past head positions for short-term prediction and demonstrated their relevance in scenarios where low-velocity movements and inertia effects are frequent in the videos. In contrast, the use of ground-truth saliency yielded a notable improvement in viewport prediction for non-exploratory videos, i.e., those containing a key PoI that attracts users’ attention, particularly for prediction horizons exceeding 3 s. However, for shorter horizons, typically less than 2.5 s, the contribution of saliency remained uncertain. The performed analysis suggests that, when using ground-truth saliency, the optimal approach involves employing a dedicated recurrent unit to process position information before merging it with saliency-based features. The observations derived from ground-truth saliency were then extended to the case in which estimated saliency is employed, using the PanoSalNet model [24]. The analysis shows that the contribution of saliency to viewport prediction decreases when the saliency estimation is not accurate.
As a final contribution, the authors proposed the TRACK model, which combines saliency and head position information in three steps (a simplified sketch is provided after the list):
  • A double-stacked LSTM with 256 units each, processing head movement information;
  • A double-stacked LSTM with 256 units each, processing content-based saliency after flattening it;
  • A double-stacked LSTM with 256 units each, processing the concatenated output from the first and second steps.
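The following PyTorch sketch illustrates this three-branch arrangement under stated assumptions: the input dimensions, the flattened-saliency size, and the output head are placeholders and do not reproduce the authors’ exact configuration.

```python
# A minimal PyTorch sketch of a TRACK-style fusion: one LSTM branch for head positions,
# one for flattened saliency, and a third LSTM over the concatenated branch outputs.
import torch
import torch.nn as nn

class TrackLikeFusion(nn.Module):
    def __init__(self, pos_dim=3, sal_dim=256, hidden=256):
        super().__init__()
        # (i) double-stacked LSTM over past head positions
        self.pos_lstm = nn.LSTM(pos_dim, hidden, num_layers=2, batch_first=True)
        # (ii) double-stacked LSTM over flattened content-based saliency
        self.sal_lstm = nn.LSTM(sal_dim, hidden, num_layers=2, batch_first=True)
        # (iii) double-stacked LSTM over the concatenated outputs of (i) and (ii)
        self.fuse_lstm = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, pos_dim)  # predicted future position per step

    def forward(self, positions, saliency):
        # positions: (B, T, pos_dim); saliency: (B, T, sal_dim)
        p, _ = self.pos_lstm(positions)
        s, _ = self.sal_lstm(saliency)
        f, _ = self.fuse_lstm(torch.cat([p, s], dim=-1))
        return self.head(f)
```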
For performance evaluation, the datasets presented in [23,24,26,41,46] have been employed. The proposed model was compared against several baseline methods (i.e., [12,14,23,24,25,26,41]). To ensure a fair and comprehensive evaluation, the authors adopted the evaluation metrics used for each respective baseline. Specifically, the accuracy and F1-score were used for comparison with the method in [23]. The mean overlap was used for comparison with the method in [41]. The intersection angle error was employed for comparison with [26], while the IoU was used to evaluate the performance relative to the approach in [24]. As a result, the proposed method achieved a mean overlap of 0.968 on average, significantly outperforming the baseline method in [41], which achieved 0.753 for a prediction horizon of 30 ms. Additionally, for a 1 s prediction horizon, the proposed method attained accuracy of 95.48%, compared to 86.35% reported by [23]. After comparing TRACK with the other state-of-the-art baselines, TRACK demonstrated superior performance with respect to [12,14,23,24,25,26,41] for prediction horizons exceeding 3 s. For shorter prediction horizons, TRACK’s performance was comparable to that of the deep position-only baseline.

2.3.2. Convolutional LSTM

A possible alternative to standard LSTM is represented by Conv-LSTM [76], which allows the integration of spatial information. In this direction, Dong et al. introduced the gaze-based field of view prediction (GEP) framework [32]. It forecasts long-term viewports by leveraging the target user’s previous eye gaze trajectories together with the future trajectories of other users watching the same video. They represented historical movement trajectories through eye gaze-based heatmaps computed over 10 s of data and selected similar users through dot product operations. Dong et al. noted that the outcome of this process resembles that of saliency detection but is constructed based on real human attention. The proposed method is based on the combination of a seven-layer U-Net [77] integrated with the Squeeze-and-Excitation Network (SE-net) [78] and Conv-LSTM. Initially, the Convolutional Long Short-Term Memory (Conv-LSTM) model performs temporal predictions using the target user’s eye gaze heatmap; then, the SE-U-Net network makes the final prediction based on the temporal prediction results and the eye gaze heatmaps of other users. To generate heatmaps, the FoV’s center was set to the tile that the eye was fixated on, while other tiles received values based on their distance from the FoV’s center. The prediction horizon was set to 10 s, and the evaluation was conducted on the dataset presented in [26], which consists of 208 videos watched by 45 users. To evaluate the proposed model, the authors used a set of metrics to assess the effectiveness of their method. These included the hit rate, which is defined as the average percentage of the predicted FoV area (computed frame by frame) relative to the actual FoV area within a given second; the viewport deviation; and the mean squared error (MSE), which measures the average squared distance between the predicted and actual FoV center positions in a 2D projection. Additional metrics are the viewport PSNR, bandwidth occupation, viewport quality variance, and re-buffering time. The proposed model was evaluated against several baselines, including the method introduced in [12], as well as various ablated versions of the proposed approach, such as configurations using only eye gaze heatmaps for prediction. The authors showed that the proposed method achieved high effectiveness in predicting long-term (10 s) viewports, outperforming other state-of-the-art methods. As an example, the PSNR achieved by the proposed method is 25.535 dB, while the method presented in [12] achieves 22.980 dB. The average re-buffering time (in seconds) of the proposed method is 0.702 with respect to the 0.822 reported in [12].
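As a concrete illustration of the heatmap construction described above, the following sketch assigns the peak value to the fixated tile and decays the value of the remaining tiles with their distance from it; the grid size and the Gaussian decay are illustrative assumptions, since the exact weighting used in [32] is not reproduced here.

```python
# Minimal sketch of a tile-level gaze heatmap: the fixated tile receives the peak value
# and other tiles decay with distance. Grid size and Gaussian decay are assumptions.
import numpy as np

def gaze_heatmap(fix_row, fix_col, rows=10, cols=20, sigma=2.0):
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Horizontal wrap-around: column distance on an equirectangular tile grid.
    dc = np.minimum(np.abs(c - fix_col), cols - np.abs(c - fix_col))
    dr = np.abs(r - fix_row)
    dist2 = dr ** 2 + dc ** 2
    return np.exp(-dist2 / (2 * sigma ** 2))

heatmap = gaze_heatmap(fix_row=4, fix_col=7)  # values in (0, 1], peak at the fixated tile
```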
In addition, in [31], the CoLIVE framework was introduced. It is designed for live streaming scenarios where viewport prediction is performed online using edge computing. A key innovation with respect to previous approaches is that, as many users exhibit similar behaviors when watching 360-degree videos, CoLIVE implements collaborative online prediction by sharing local losses with a central model, aiding new viewers in viewport prediction. In addition, CoLIVE utilizes Conv-LSTM [76] on the edge side rather than the standard LSTM, due to its capability to capture both the spatial structure of frames and temporal features. CoLIVE operates through several steps. First, a saliency detection model generates a saliency map on the server side and transmits it to the edge side. Second, viewport prediction for each chunk is conducted using Conv-LSTM on the edge side, taking saliency maps and previous viewport trajectories as input. The output provides future viewport locations, and the model parameters are updated after the user watches the video chunk. Third, these parameters are shared with a central model to assist new users in predicting their viewports. Saliency estimation is performed using a CNN with eight convolutional layers and two max-pooling layers. Viewport prediction employs Conv-LSTM followed by two convolutional layers and two de-convolutional layers to predict future viewport locations, utilizing saliency maps and previous head trajectories as input. For experimental validation, a prototype system was created and the model was tested with two public datasets [42,79]. The performance of the proposed model was evaluated using several metrics, including accuracy, recall, precision, bandwidth savings, and processing time. The evaluation was conducted in comparison with the two methods presented in [24,28]. Processing time was evaluated for both saliency estimation and viewport prediction. For viewport prediction, the model successfully predicted the next viewport within the duration of the current video chunk, which was set to 2 s, allowing for smooth and uninterrupted streaming. Wang et al. reported that their framework outperforms the state-of-the-art methods in terms of bandwidth savings and accuracy [31].
Wang et al. later enhanced their model for collaborative online learning by introducing the Dynamic Federated Clustering Algorithm (DFCA), designed to improve the viewport prediction accuracy [33]. This algorithm dynamically groups users based on the gradient updates of their model and the viewing behaviors within a sliding window. The selection of a clustering approach was motivated by the fact that the analysis of data traces revealed consistent viewing patterns among groups of users. By leveraging these similarities, the model can make more accurate predictions for users within the same cluster and provide initial predictions for new users. The DFCA primarily comprises two phases (i.e., dynamic partitioning and periodic merging) to dynamically separate or merge viewer groups based on user behavior by periodically computing the Euclidean distance between their gradient updates. The model was evaluated using the dataset presented in [42]. To evaluate the proposed method, several metrics were employed, including accuracy, precision, recall, the F1-score, and bandwidth savings. The proposed method was compared to LiveDeep [28], PanoSalNet [24], and the earlier CoLIVE model (referred to as CoLIVE w/o) [31]. In terms of average bandwidth savings, CoLIVE w/o achieved the highest value, corresponding to 34.12%, closely followed by the enhanced CoLIVE, which reached 34.04%. In contrast, PanoSalNet and LiveDeep achieved only 28.14% and 21.48%, respectively. The enhanced CoLIVE also demonstrated the highest average accuracy (96.47%), outperforming CoLIVE w/o (95.43%), PanoSalNet (65.74%), and LiveDeep (68.99%).
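A minimal sketch of the periodic merging idea is given below, assuming each cluster is summarized by the mean of its members’ gradient updates over the sliding window; the threshold and the representation are assumptions, and the dynamic partitioning phase is omitted.

```python
# Hedged sketch of DFCA-style periodic merging: clusters whose average gradient
# updates are closer than a threshold (Euclidean distance) are merged.
import numpy as np

def merge_clusters(cluster_grads, threshold):
    # cluster_grads: dict cluster_id -> mean flattened gradient vector (np.ndarray)
    ids = sorted(cluster_grads)
    merged = {i: {i} for i in ids}          # cluster_id -> set of absorbed cluster ids
    for i, a in enumerate(ids):
        if a not in merged:
            continue
        for b in ids[i + 1:]:
            if b in merged and np.linalg.norm(cluster_grads[a] - cluster_grads[b]) < threshold:
                merged[a] |= merged.pop(b)  # absorb cluster b into cluster a
    return merged
```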
Conv-LSTM has also been applied to viewport prediction for mobile devices in [34]. Specifically, the Conv-LSTM takes two inputs: (i) a saliency map estimated by a graph convolutional network (GCN)-based predictor and (ii) a tile-based map of the gaze trajectory. The latter is built by predicting future trajectories from historical data and then mapping them onto the tile grid. Each tile is allocated a different bitrate depending on its probability of being watched. The authors use the proposed viewport prediction within a larger framework to improve 360-degree video streaming. Performance assessment has been performed using the dataset presented in [11], and the proposed method was compared to three other streaming frameworks, including Flare [11,80,81]. To evaluate the proposed method, three metrics were used: the PSNR to assess video quality against the reference, the inter-chunk PSNR to measure quality consistency between adjacent video segments, and the re-buffering time to evaluate playback smoothness. The performed experimental tests show an increase of up to 17.3% in the perceived subjective quality and a 20.6% bandwidth reduction with respect to the best-performing methods in the state of the art.
In the aforementioned studies, saliency detection was typically performed using 2D convolutional layers on equirectangular frames, thus creating distortion issues due to the projection process. To overcome this issue, in [82], saliency estimation has been explored from a spherical perspective by including the spherical convolutions initially proposed in [83]. Specifically, to achieve convolution on spherical images via their ERPs, local convolution and max pooling are conducted on the tangent plane of the spherical patch. The authors in [82] introduced the SphereU-Net framework, which comprises a U-Net architecture [77] integrated with spherical convolutional layers and spherical max pooling. It takes the equirectangular video frame at time t and the saliency map at time t − 1 as input, producing the saliency map at time t as output. The experimental assessment was conducted using the saliency dataset presented in [84], and Peng et al. demonstrated that their proposed saliency detection framework outperformed other existing saliency models, such as [24,84], in terms of the Pearson linear correlation coefficient (CC), Kullback–Leibler divergence (KL), normalized scanpath saliency (NSS), and area under the curve—Judd (AUC-Judd). The key finding presented in [82] is that, by increasing the saliency estimation accuracy, this method might be able to boost the viewport prediction performance.

2.3.3. Gated Recurrent Unit

Another approach to viewport prediction uses gated recurrent units (GRUs), which are a type of recurrent neural network introduced in [85]. Gated Recurrent Units (GRUs) are similar to LSTM in their ability to handle sequential data by learning which information to keep or forget over time. However, GRUs have a simpler structure with fewer parameters, making them easier to train and more efficient while still capturing important temporal patterns.
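As a quick, hedged illustration of this parameter difference, the snippet below compares the parameter counts of single-layer LSTM and GRU modules with identical sizes; the dimensions are arbitrary examples.

```python
# Parameter-count comparison: for the same input and hidden sizes, a GRU layer uses
# three gates instead of the LSTM's four, so it has roughly 3/4 of the parameters.
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=256, num_layers=1, batch_first=True)
gru = nn.GRU(input_size=3, hidden_size=256, num_layers=1, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"LSTM: {count(lstm)} parameters, GRU: {count(gru)} parameters")
```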
Based on this, Li et al. tested the usage of GRUs in combination with spherical convolutions in [35]. They focused on a multicast scenario where previous viewports from multiple users were utilized. Viewport prediction involved the fusion of a saliency feature extraction module and an FoV prediction module. Initially, the video frame at time t was processed by the spatial feature extraction module, which comprised multiple spherical convolutional and de-convolutional layers to identify PoIs in the frame. Concurrently, the temporal feature extraction module analyzed consecutive frames from time t to time n (where n is the prediction period) to extract temporal features. The results of these modules were then fed into the Convolutional Block Attention Module (CBAM), following the work by Woo et al. [86], to enhance the saliency detection accuracy. In parallel, a set of FoV maps of other users is fed into a Spherical Convolutional Gated Recurrent Unit (SP-ConvGRU) leveraging spherical convolutions. The outputs of the two parallel branches are combined to provide the predicted FoV. The prediction horizon was set to 2 s. For experimental evaluation, they employed the datasets presented in [26,84]. To evaluate the proposed model, three metrics were used: accuracy, precision, and recall. For both saliency estimation and viewport prediction, the model was compared against five baseline methods presented in [11,24,87,88,89]. Through experimentation, Li et al. demonstrated the superior performance of their model compared to the baseline methods [35].
GRUs have also been employed by Setayesh et al. in [36], who expanded the viewport prediction model they presented in [90]. They propose a hybrid approach that encompasses three elements: a saliency estimation method, a technique for head movement prediction, and an integration approach. Concerning saliency estimation, the authors resort to PAVER [91], which processes ERP video patches through an encoder (i.e., a deformable convolution layer) and a decoder that leverages local, spatial, and temporal information to return the final estimated saliency map. The head movement prediction module is composed of an encoder and a decoder, each containing four GRU layers. The maps obtained in each of these steps are integrated through a regional fusion technique. The experimental evaluation has been performed on the dataset presented in [84]. To evaluate the performance of the model, the tile overlap has been used, defined as the intersection between the predicted and the ground-truth viewport divided by the ground-truth viewport. To assess the QoE, a score based on four key factors was defined: the average quality of tiles within the viewport, the spatial quality smoothness among these tiles, the temporal quality consistency, and the re-buffering delay. Tile quality was determined based on the bitrate and quantization parameters. The proposed method was compared against the approaches presented in [92,93,94], achieving 2.41% higher overlap compared to [93] and 15.93% higher overlap compared to [92]. In terms of the QoE score, the proposed method outperformed [92,93,94] by 23.77%, 62.7%, and 116.57%, respectively.

2.3.4. Transformers

As an alternative approach for viewport prediction, Transformers have been employed in [37] due to their advantages in handling time-series problems and capturing long-term dependencies compared to RNN networks [56,95]. Specifically, the authors introduced the VRFormer network to address challenges in FoV prediction and adaptive video streaming. The primary challenge is that the streaming quality of 360-degree videos relies on accurate FoV prediction, as lower prediction accuracy leads to wasted bandwidth and poor user experience. The second challenge arises when dealing with tile-based 360-degree video streaming, where the bitrate needs to be dynamically adjusted for different tiles based on the network environment and current video player. To deal with the first challenge, the proposed Transformer takes as input video content, previous head movements, and eye traces. For the second challenge, a DRL network is employed to model the reward for improved user QoE, dynamically controlling video content reconstruction and adaptively allocating rates for future tiles. Focusing on FoV prediction, the VRFormer network consists of three main processes: spatial feature extraction, temporal feature extraction, and content-aware FoV prediction. Spatial features are extracted by dividing 360-degree video frames into 2D FoV images and using a pre-trained ResNet-50-based CNN [96] to extract spatial features and generate attention feature maps. Temporal dependencies are captured using a Transformer encoder network, which takes into account the history of head motion and eye tracking from the initial frame to the current one. The results from both processes are then fed into a Transformer decoder network, which in turn provides its output to three fully connected layers to predict future viewports. The experimental evaluation has been performed on the dataset presented in [26], with the prediction horizon set to 10 s. To evaluate the viewport prediction performance, two metrics were used: the viewport overlap ratio and the average MSE between the predicted and actual viewport regions. The viewport overlap ratio has been computed here as the area of intersection between the predicted and ground-truth viewports divided by their union. The methods presented in [12,23,97,98] were used for comparison. To evaluate the proposed approach in terms of bitrate adaptation and QoE, three additional methods presented in [53,81,99] were considered. The QoE was assessed using three criteria: (i) the average video quality, measured by the average bitrate of tiles within the viewport; (ii) the video smoothness, reflecting fluctuations in tile quality inside the viewport; and (iii) the re-buffering time, which captures playback interruptions. The comparison demonstrates that the proposed approach offers better user QoE compared to other state-of-the-art methods. As the results show, the predicted viewport overlap declines progressively with the extension of the prediction horizon: from 95% at 1 s, to 75% at 5 s, and down to 73% at 10 s. This degradation reflects the increasing difficulty in accurately predicting the viewport over time.
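To make the temporal branch concrete, the following sketch shows a Transformer encoder over a head-motion/eye-trace history, in the spirit of the architecture described above; the feature dimensions, number of heads, and layer counts are illustrative assumptions rather than the VRFormer configuration.

```python
# Minimal sketch of a temporal Transformer encoder over head-motion and eye-trace
# samples. Dimensions and layer counts are assumptions for illustration only.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    def __init__(self, in_dim=5, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, traces):
        # traces: (B, T, in_dim) concatenated head-motion and eye-trace samples
        return self.encoder(self.embed(traces))  # (B, T, d_model) temporal features
```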
Transformers have also been employed for viewport prediction in [38]. The authors combine historical viewpoint trajectories, which should account for short-term viewing behaviors, with saliency information, which should provide a crucial contribution for long-term predictions. First, they process the video frames with the PAVER saliency estimation technique [91] and then convert the obtained saliency to a compact 2D representation that encodes the 3D spatial coordinates and the saliency value. In parallel, an LSTM module uses the past viewpoint positions, i.e., the head scanpath, to perform short-term viewpoint prediction. Then, the LSTM output and the saliency map are provided as input to a spatial attention module that aligns the two information types. The aligned saliency and short-term predictions are provided as input to a temporal attention module, which produces a long-term prediction. Finally, the long- and short-term predictions are fused through a gating fusion module. Performance has been assessed using the datasets presented in [42,45,51]. For performance evaluation, the IoU and orthodromic distance have been used. The proposed approach has been compared with LR, with a custom LSTM-based encoder–decoder module, and with state-of-the-art approaches such as the ones presented in [20,29,100]. The comparison highlighted that the regression technique, the LSTM-based approach, and the method in [20] had better performance for short-term prediction (less than 1 s). With respect to them, the approach presented in [29] had better long-term performance (2–5 s), while achieving less accurate short-term predictions. In contrast, the method proposed in [38] achieved the best long-term predictions without sacrificing short-term accuracy; specifically, it achieved the lowest orthodromic distance and the highest IoU on all datasets.
The use of Transformers for viewport prediction has been extended to bidirectional Transformers in [39]. Specifically, the proposed framework has three main components: (i) a multi-level residual module aimed at extracting multi-level features through spherical convolutions; (ii) a bidirectional encoder–decoder module based on Transformers, which takes as input the spatial features and extracts inter-frame spatiotemporal information; (iii) a spherical matching module that predicts the viewport proposal along with the corresponding saliency score in order to provide the output viewport trajectory. Performance assessment has been performed using the datasets presented in [26,41,47] based on three metrics: the IoU, mean overlap, and spherical great circle distance. The method was compared against the approaches presented in [24,41,101,102,103,104], showing improved performance. For instance, the model demonstrated its strongest performance on the dataset presented in [41], achieving a great circle distance of 0.341, an IoU of 0.667, and a mean overlap score of 0.751. In comparison, on the dataset presented in [47], the model reached a great circle distance of 0.463, an IoU of 0.588, and a mean overlap score of 0.677. Lastly, on the dataset from [26], it achieved a great circle distance of 0.510, an IoU of 0.560, and a mean overlap score of 0.652.

2.3.5. Graph Neural Network (GNN)

Another valuable approach for viewport prediction is represented by graph neural networks (GNNs) [105]. In [40], a Graph Neural Network (GNN) is employed to forecast future viewports by fusing multiple features. Specifically, a sparse directed graph is built from three extracted features: the past head orientation traces of the target user, the video saliency and motion maps, and the head orientation traces of other users viewing the same video. Past head orientations were processed through a two-layer LSTM to provide the head orientation prediction; to align its dimensions with the other components, a tile-level map is generated from the predicted orientations. Salient regions were identified using SalNet, a method tailored to saliency detection in 2D images [69], while motion maps were obtained with FlowNet [74]. As for the other users’ information, Xu et al. relied on the K-means clustering algorithm to group users exhibiting similar behaviors while watching the same video, and a cross-user heatmap was generated by dividing the video into tiles and computing the normalized viewing probability for each tile. These three categories of features are employed to define the directed graph: the number of tiles corresponds to the number of nodes, whose weights are defined as a function of the extracted features, while the edges represent the associations between the tiles. The graph is then provided as input to a two-layer GNN whose output indicates the viewing probability for each tile. The prediction horizon was set to 3 s and the model performance was evaluated using the dataset presented in [46]. To evaluate the proposed method, accuracy was used as the primary metric for comparison with other state-of-the-art approaches. Additionally, the QoE was assessed based on three key factors: the total quality of tiles within the viewport, the spatial quality variation among these tiles, and the temporal quality variation between two consecutive video segments. In this evaluation, tile quality was determined based on the total bitrate. The proposed method has been tested against the approaches presented in [28,106]. The authors showed that the proposed approach outperforms the others across various time windows. Specifically, the proposed model achieves average accuracy of 84.6% for a 1 s prediction window, outperforming the best baseline method presented in [28] by 11.1%. Even under a longer 3 s window, the method maintains accuracy of 68.7%. The authors highlight the model’s superiority over existing methods, demonstrating its versatility and adaptability across various video types.
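The tile-graph formulation can be sketched as follows, assuming each tile node carries a small feature vector assembled from the three sources above and a weighted adjacency matrix encodes the tile associations; the plain adjacency-based propagation is an illustrative stand-in for the exact GNN layers used in [40].

```python
# Minimal sketch of a two-layer tile-graph network: nodes are tiles, node features come
# from the fused inputs, and the output is a per-tile viewing probability.
import torch
import torch.nn as nn

class TileGNN(nn.Module):
    def __init__(self, in_dim=3, hidden=32):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, x, adj):
        # x: (num_tiles, in_dim) node features; adj: (num_tiles, num_tiles) weighted adjacency
        h = torch.relu(adj @ self.fc1(x))          # first propagation step
        logits = (adj @ self.fc2(h)).squeeze(-1)   # second propagation step
        return torch.sigmoid(logits)               # viewing probability per tile
```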

2.3.6. Reinforcement Learning (RL)

Furthermore, reinforcement learning (RL) approaches have been proposed for viewport prediction. For instance, Xu et al. employed DRL to forecast head movement positions, introducing a method called deep reinforcement learning-based head movement prediction (DHP) [41]. Deep Reinforcement Learning-based Head Movement Prediction (DHP) operates in both offline and online modes. The output of offline-DHP is a set of potential head positions, generated by running multiple DRL workflows to determine potential head movement positions in each 360-degree frame. This offline-DHP scheme is designed to model the attention of multiple users. In contrast, the online-DHP model predicts the next head position for an individual based on current and previous frame head movements in real time. The observed content of the 360-degree videos serves as the environment, while the magnitudes and directions of head movement scanpaths serve as actions for multiple DRL agents. The reward is based on the differences in actions between subjects and agents, reflecting how closely the agents mimic the subjects’ scanpath predictions. The agent selects the appropriate scanpath action for the next frame based on the ground truths of previous head movement scanpaths and content observation. Online-DHP emphasizes attention to content by leveraging the learned offline-DHP model. Both the offline-DHP and online-DHP networks are structured with one LSTM layer and four convolutional layers, aimed at extracting temporal and spatial content features. The online-DHP network predicts head positions for the frame at t + 1 by utilizing the head movements of target users from time 0 until t as input. In their experiments, Xu et al. developed the PVS-HM dataset, comprising 76 360-degree videos watched by 58 users with the HTC Vive HMD, including head movement data and eye fixations captured by the embedded eye-tracking module. The proposed method was evaluated using the mean overlap as the primary metric. It was tested against several other techniques, including the naive approach, a method that predicts the viewport randomly, and the method presented in [107]. The authors showed that their model achieved superior performance over other approaches, such as [107], in both online and offline head movement prediction tasks. Specifically, online-DHP predicts the user’s future head position for the next frame with an average mean overlap of 0.75, outperforming the method presented in [107], which achieves 0.40, and the naive approach, which obtains a much lower mean overlap score, i.e., 0.20.
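As a hedged illustration of the reward idea, the snippet below scores an agent’s per-frame scanpath action by its closeness to the subject’s ground-truth head displacement; the exact reward shaping used in DHP is not reproduced here.

```python
# Illustrative reward: the smaller the difference between the agent's head-movement
# action and the subject's ground-truth movement, the higher (less negative) the reward.
import numpy as np

def scanpath_reward(agent_action, subject_action):
    # actions: (d_yaw, d_pitch) per-frame head displacement in radians
    return -float(np.linalg.norm(np.asarray(agent_action) - np.asarray(subject_action)))
```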

3. Datasets

All 360-degree videos can be classified into different categories based on the visual content [29,108]. In turn, the video content influences how the user explores the 360-degree scene, thus impacting also viewport prediction. The main video categories are reported in the following.
  • Exploration videos: In this category, users tend to explore the entire scene without focusing on a specific object or region. These scenes usually contain no motion and no dominant visual target. An example is represented by natural landscapes or panoramic views. As a result, while watching the video, head movements are widely distributed, thus making viewport prediction particularly challenging.
  • Static focus: These videos contain salient objects that remain fixed near the center of the scene, such as a performer in a concert or a static speaker. Because the main object naturally attracts attention, these videos are considered easier for viewport prediction tasks.
  • Moving focus: In these videos, salient objects move around the scene, requiring users to follow their motion. An example of this category is represented by sports videos, where multiple players move inside the scene, or dynamic scenes with multiple actors. Predicting head orientation is more complex because of the variability in motion and attention shifts.
  • Ride-based videos: This category includes camera motion during video capture. One of the most common scenarios is represented by car driving. These videos often introduce a forward-moving visual flow, which adds another layer of complexity to predicting user gaze and head movement.
Among these video categories, exploration videos are considered the most difficult for viewport prediction due to the lack of obvious visual cues and motion, while static focus videos are the most predictable due to their centered, low-motion content. Additionally, the viewport prediction accuracy can vary depending on the type of content or the dataset used. Some videos begin with an impactful scene, such as an explosion, after which the actual content appears. Because of this, some viewport prediction methods cut the first six or ten seconds to achieve accurate viewport predictions; an example is provided in [29]. Another possibility is to filter the dataset so that it includes only videos with at least one PoI, as described in [30].
As demonstrated in the previous sections, several publicly available datasets have been used for the training and evaluation of viewport prediction and saliency estimation models. We provide a detailed discussion of the three largest datasets, referred to as VR-EyeTracking, Sport360, and PVS-HM, in the following.
  • VR-EyeTracking—introduced in [26]: This is the largest known 360-degree video dataset, which contains 208 videos with a 4K resolution; each video has been watched by 45 users. The videos in the dataset have durations ranging from 20 to 60 s at 25 fps. The video content is varied, including indoor scenes, outdoor environments, sports, music performances, and documentaries. Users watched the videos using an HTC Vive headset with a 7invensun a-Glass eye tracker. This dataset is provided in the form of MP4 videos, along with the corresponding head and eye fixations.
  • Sport360—introduced in [84]: This dataset includes 104 videos (extracted from the Sport-360 dataset [107]) viewed by 27 participants. As with the first dataset, the HTC Vive and the a-Glass eye tracker were used. Unlike VR-EyeTracking, this dataset provides just the video frames and the corresponding saliency maps along with the ground-truth eye fixations, making it useful for saliency estimation tasks.
  • PVS-HM—introduced in [97]: This dataset contains 76 videos viewed by 58 users. The videos, taken from YouTube and VRCun, have a duration of between 10 and 80 s and resolutions varying from 3K to 8K. They were watched using the HTC Vive, with the head orientation recorded for each viewer. The dataset includes diverse content types, such as animation, gaming, sports, nature scenes, and driving. It is available as MP4 videos along with the corresponding head orientation data.
  • Other datasets: Additional datasets include collections with a diverse number of videos and subjects, offering varied levels of annotation and content diversity.
We provide in Table 2 a summary of the datasets used for the viewport prediction task. For each dataset, we report (i) the number of videos that it contains, (ii) the number of users who viewed them, (iii) the video duration in seconds, (iv) the video resolution, (v) the source from which the videos were obtained, (vi) the modality used for tracking head and/or eye movements during the experiments, (vii) the studies or methods in which the dataset was employed, and (viii) an indication of whether the dataset is publicly available online. This overview helps to illustrate the scale and diversity of data used across the viewport prediction methods included in this review.

4. Evaluation Metrics and Comparison Methods

In this section, we review the metrics used to evaluate the performance of viewport prediction methods and provide a summary of how the papers discussed in the previous sections compare their results with the state of the art.

4.1. Evaluation Metrics

Several metrics have been employed to evaluate viewport prediction methods, with variations across the approaches. Commonly used metrics include the accuracy, F1-score, mean FoV overlap, and IoU, among others.
Some studies have also integrated viewport prediction into streaming scenarios, where the QoE is assessed using a range of indicators. These include the average bitrate within the user’s viewport for each frame (referred to as Q1), the variation in bitrate within the viewport per frame (Q2), bitrate fluctuations across consecutive chunks or segments (Q3), and bitrate variation across frames within a one-second interval (Q4). Additional metrics such as the re-buffering time and buffer stall count are also used. In addition, some methods consider network-related factors such as bandwidth consumption, the peak bandwidth, and bandwidth savings.
In Table 3, we list all state-of-the-art methods discussed in this work, along with the evaluation metrics that they employed and the comparison methods used to benchmark the proposed approaches. The employed metrics are detailed below; a small reference implementation of the most common tile-based metrics is sketched after the list.
  • Accuracy: the ratio of correctly classified tiles to the total number of tiles.
    $\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{TP + TN}{TT + TN},$
    where true positive (TP) indicates the tiles predicted to be in the viewport and actually in the user’s viewport, and true negative (TN) denotes tiles predicted to be outside the viewport and actually not in the user’s viewport. False negative (FN) indicates the tiles in the user’s viewport that were not predicted. Finally, false positive (FP) indicates the tiles predicted to be in the viewport but not in the actual user’s viewport. Total true (TT) represents the sum of TP, FN, and FP.
  • Intersection over Union (IoU): the intersection between predicted tiles and ground-truth tiles divided by the union of predicted and ground-truth tiles.
    $\mathrm{IoU} = \frac{TP}{TT} = \frac{|\mathrm{Pred} \cap \mathrm{GT}|}{|\mathrm{Pred} \cup \mathrm{GT}|}.$
  • Recall: measures how many of the tiles that the user actually looked at were predicted by the system.
    $\mathrm{Recall} = \frac{TP}{TP + FN} = \frac{|\mathrm{Pred} \cap \mathrm{GT}|}{|\mathrm{GT}|}.$
  • Precision: answers the question of how many of the tiles that were predicted to be in the user’s viewport actually were in the viewport.
    $\mathrm{Precision} = \frac{TP}{TP + FP} = \frac{|\mathrm{Pred} \cap \mathrm{GT}|}{|\mathrm{Pred}|}.$
  • F1-Score: the harmonic mean of precision and recall.
    $\text{F1-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$
  • Mean FoV Overlap: the average IoU of the predicted and ground-truth viewport areas over a defined prediction window. The overlapping area is computed as
    $\mathrm{FoV\ overlap} = \frac{\mathrm{Area}(\mathrm{Pred} \cap \mathrm{GT})}{\mathrm{Area}(\mathrm{Pred} \cup \mathrm{GT})}.$
  • Intersection Angle Error: the distance between the predicted point and the ground-truth point on a sphere.
  • Great Circle Distance/Orthodromic Distance: the shortest distance between the center of the predicted viewport and the ground-truth viewport on the surface of a sphere.
  • Average Great Circle Distance: the average of the great circle distance over all future time steps.
  • Viewport PSNR (V-PSNR): the PSNR computed only over the user’s viewport with respect to the reference ground truth.
  • VWS-PSNR: the PSNR computed over the user’s viewport, incorporating a spherical correction and a weighted MSE with respect to the reference ground truth.
  • Hit Rate: the average percentage of the predicted FoV area, computed frame by frame within a given second, relative to the total area of the actual FoV across those frames.
  • Missing Ratio/Blank Ratio: the ratio of missing tiles to viewed tiles.
  • Unseen Ratio: the ratio of tiles that arrived at the client but were not watched to the total number of transmitted tiles.
  • Bandwidth Consumption: the amount of bandwidth used to stream the predicted tiles.
  • Peak Bandwidth: the highest bandwidth consumption recorded during the streaming process.
  • Total Re-Buffering Time: the overall time spent re-buffering during playback.
  • Buffer Stall Count: the number of times that video playback is interrupted due to buffer depletion.
  • Buffer Stall Duration: the total number of seconds for which playback is halted while waiting for the buffer to refill.
  • Viewport Deviation: the percentage of blank area inside the viewport.
  • Viewport Quality Variance: adopts the coefficient of variation (CV) to evaluate the viewport quality variance in one segment (as detailed in Section 2.1.1).
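For reference, the snippet below computes the main tile-based metrics and the orthodromic distance under the definitions above, assuming the predicted and ground-truth viewports are given as sets of tile indices and the viewport centers as yaw/pitch angles in radians; it is a minimal sketch, not code from any of the surveyed works.

```python
# Minimal reference implementation of the tile-based metrics and the great circle
# (orthodromic) distance defined above. Tile grids and angle conventions are assumptions.
import math

def tile_metrics(pred, gt, all_tiles):
    # pred, gt: sets of tile indices; all_tiles: set of all tile indices in the grid
    tp, fp, fn = len(pred & gt), len(pred - gt), len(gt - pred)
    tn = len(all_tiles - (pred | gt))
    union = len(pred | gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return {
        "accuracy": (tp + tn) / len(all_tiles),
        "iou": tp / union if union else 1.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }

def orthodromic_distance(yaw1, pitch1, yaw2, pitch2):
    # Great circle distance (radians) between two viewport centers on the unit sphere.
    return math.acos(
        math.sin(pitch1) * math.sin(pitch2)
        + math.cos(pitch1) * math.cos(pitch2) * math.cos(yaw1 - yaw2)
    )

# Example on a 4x8 tile grid (indices 0..31):
print(tile_metrics(pred={5, 6, 7, 13}, gt={6, 7, 8, 14}, all_tiles=set(range(32))))
```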
We analyzed the use of these metrics across the studies and identified the three most commonly employed ones (reported in bold in Table 3): accuracy, mean FoV overlap, and re-buffering time. Notably, while the first two directly evaluate prediction performance, the re-buffering time reflects user annoyance, thus capturing the final user experience.
Despite this, in general, the high variability in the metrics employed to evaluate a viewport prediction approach makes a performance comparison between different approaches challenging. This is further complicated by the fact that some ambiguities in metric definitions exist across studies. Specifically, the ambiguities involve the definitions of the accuracy, mean FoV overlap, and IoU, for which we provide a detailed discussion in the following.

4.1.1. Accuracy

Accuracy is interpreted differently across studies. In this work, we use the term accuracy whenever it is employed in the original papers, ensuring a coherent discussion. The specific accuracy definitions are reported in the following.
In some works, it is clearly defined as the number of overlapping tiles between the predicted and the ground truth, divided by the union of predicted and actual ground-truth tiles—essentially the same as the IoU. This definition is used in [24,35,40]. In contrast, other studies define the prediction accuracy as the number of accurate predictions divided by the total number of predictions, as seen in [10,11].
Several papers define accuracy as the ratio of correctly classified tiles to the union of predicted and ground-truth tiles. However, they do not specify whether “correctly classified” includes both TP and TN or only TP. This interpretation appears in [17,23,25,29]. In [28], a stricter definition is provided: a prediction is deemed accurate only if the predicted viewport completely covers the ground-truth viewport. The final accuracy is then the ratio of correctly predicted frames to the total number of frames. Finally, some works (e.g., [30,31,33]) mention accuracy without offering an explicit definition.
Based on the general definition of accuracy, as well as to differentiate it from the IoU, we recommend the definition provided in Section 4.1.

4.1.2. Mean FoV Overlap

This metric typically reflects the average ratio of the intersection over union of the predicted and ground-truth viewport areas over a defined prediction window. This definition is used in [16,17,20,29]. Other studies use the term “tile overlap” to refer to the ratio between the intersecting tiles divided by the total tiles in the ground-truth viewport, as defined by [36]. Finally, in [37], the authors introduce the concept of the “viewport overlap ratio”, calculated per second by averaging the overlap across all frames within that interval. In this review, we employ the metric names and definitions used in the original papers.

4.1.3. IoU

The IoU is another commonly used metric, defined as the ratio of true positive tiles to the total number of unique tiles labeled as part of either the predicted or ground-truth viewport. Thus, the IoU equals the intersection divided by the union of predicted and actual viewport tiles. While similar to the mean FoV overlap, the key distinction lies in its representation: the mean FoV overlap typically deals with continuous spatial areas, while the IoU operates on discrete labeled tiles. The IoU has been used in works such as [17,29,38,39].
Based on the set of metrics employed in previous studies, we propose an evaluation methodology based on three assessments: (i) the performance of the prediction algorithm, (ii) the impact of viewport prediction on network usage, and (iii) the impact of viewport prediction on users’ QoE. For the first one, a metric that measures how much the predicted viewport overlaps with the ground-truth viewport has to be selected; based on their wide diffusion and their similar meanings, we recommend the accuracy, the IoU, and the mean FoV overlap. The second assessment should quantify the savings in terms of network resources, for which bandwidth consumption is a straightforward metric. The last assessment should relate the streaming performance to the users’ QoE. Currently, a unified score does not exist, but a metric could be defined by combining the average and standard deviation of the bitrate within the viewport per frame, the bitrate fluctuation across frames or chunks, and the occurrence and duration of buffering stalls. The score should be validated through varied subjective experiments assessing users’ perceived QoE.
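To make the suggestion concrete, the sketch below combines these factors into a single score; the weights are placeholders that would need to be calibrated through subjective experiments, and the function is a hedged proposal rather than an established metric.

```python
# Illustrative combined QoE score: rewards high and stable viewport quality, penalizes
# bitrate fluctuations across chunks and buffering stalls. Weights are placeholders.
import numpy as np

def qoe_score(viewport_bitrates, chunk_bitrates, stall_count, stall_duration_s,
              w=(1.0, 0.5, 0.5, 2.0, 4.0)):
    # viewport_bitrates: per-frame average bitrate inside the viewport
    # chunk_bitrates: per-chunk average bitrate inside the viewport
    mean_q = float(np.mean(viewport_bitrates))
    std_q = float(np.std(viewport_bitrates))
    fluct = float(np.mean(np.abs(np.diff(chunk_bitrates)))) if len(chunk_bitrates) > 1 else 0.0
    return (w[0] * mean_q - w[1] * std_q - w[2] * fluct
            - w[3] * stall_count - w[4] * stall_duration_s)
```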

4.2. Comparison Between Methods

In evaluating their models, many state-of-the-art papers use variations of the proposed approach as a baseline, meaning that they modify internal components or input types (e.g., using only saliency maps or only head orientation history) to assess the performance. The naive approach is another common comparison baseline, which assumes that the future head orientation remains the same as the last observed one. Many methods benchmark their models against this approach.
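The naive baseline is simple enough to state in a few lines; the sketch below assumes head orientation is sampled as (yaw, pitch, roll) tuples.

```python
# The naive (static) baseline: every future orientation is predicted to be equal to
# the last observed one.
def naive_baseline(history, horizon_steps):
    # history: list of (yaw, pitch, roll) samples; returns a constant future trajectory
    return [history[-1]] * horizon_steps
```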

5. Discussion and Recommendations

Over the last few years, viewport prediction techniques have evolved from simple ML models to more advanced DL approaches. This shift has expanded the prediction horizons and enlarged the set of input types, including head orientation, saliency maps, and motion maps.
A key issue to analyze is the complexity of the proposed solution. Indeed, the computational requirements of viewport prediction are critical, as the model needs to be lightweight enough for streaming scenarios and ensure timely predictions. Many approaches take this into account by evaluating the proposed methods in terms of the processing time and re-buffering time, as shown in Table 3. However, several approaches fail to mention this aspect, focusing only on the effectiveness of viewport prediction. In the following, we provide a summary of the different approaches balancing these two aspects.
In the early stages, LR was used for viewport prediction due to its simplicity and low computational overhead. It showed strong performance in short-term prediction, especially within 0.1–0.5 s and up to 1 s. However, the accuracy significantly dropped for longer prediction horizons. The reason behind this phenomenon is that the correlation between past and future head movements diminishes over longer time gaps, leading to increased prediction errors. To tackle this problem, researchers resorted to clustering techniques based on head movement data from other users watching the same video. However, this approach relies on the availability of data from other users, which may not always be the case. Additionally, the assumption that users’ viewing behaviors are similar may not hold. Another issue is that clustering methods often fail to outperform simpler, naive approaches in short-term horizons, such as 1 s, as noted in [15], or 3 s, as mentioned in [16]. Moreover, as highlighted in [16], the effectiveness of the clustering approach depends significantly on the video’s content, as the locations of RoIs within the video are crucial.
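A minimal sketch of this LR-style extrapolation is shown below; the window length, the sampling, and the per-angle independent fitting are illustrative assumptions.

```python
# Fit a line to the recent samples of one head angle and extrapolate it over the
# prediction horizon; accuracy drops as the horizon grows, as discussed above.
import numpy as np

def lr_extrapolate(angles, horizon_steps):
    # angles: 1D array of recent samples of one head angle (e.g., unwrapped yaw in radians)
    t = np.arange(len(angles))
    slope, intercept = np.polyfit(t, angles, deg=1)
    future_t = np.arange(len(angles), len(angles) + horizon_steps)
    return slope * future_t + intercept
```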
To address the shortcomings of LR and clustering methods, researchers have turned to Transformers and RNN architectures such as LSTM. These models are well suited for learning sequential patterns and analyzing user head movements over time. Their ability to capture temporal dependencies offers a more robust framework for both short- and long-term viewport prediction tasks compared to previous approaches. It is useful to note that, although Transformers have enhanced capabilities in handling temporal information with respect to other Recurrent Neural Network (RNN)-based models, they also introduce additional complexity.
A major advancement in this field has been the integration of saliency maps with previous head orientation data. This dual-input strategy leverages head orientation to handle short-term predictions while using saliency maps to improve the performance over longer horizons. Supporting this integration, Manfredi et al. [30], in a study involving five videos, each containing at least one PoI, found that users spent on average 81.44% of the time focusing on PoIs. This clearly illustrates the strong influence of visual saliency on user attention and the value of incorporating saliency maps in viewport prediction.
Different strategies have been explored to fuse previous head orientations with saliency maps. Several studies have employed LSTM-based models, but relatively few have explored the use of GRUs, Transformers, GNNs, or deep reinforcement learning.
To increase the prediction performance with respect to standard LSTM, researchers have investigated more complex architectures exploiting the same principles, such as BiLSTM [17]. However, this approach introduces two notable limitations. First, BiLSTM doubles the number of LSTM layers, using one layer for the forward pass and another for the backward pass, thus increasing the computational complexity. Second, and more critically, BiLSTM relies on future head movement data during training. This makes it less suitable for real-time applications where future data are not available at inference time, thus limiting its practicality in live streaming scenarios.
Another aspect to consider is that, when the saliency maps are integrated into viewport prediction, LSTM-based models process the saliency maps by first flattening them into 1D vectors. However, this process results in the loss of important spatial and temporal structures inherent in the saliency data. To address this limitation, Conv-LSTM models [31,32,33,34] have been proposed, as they can preserve both spatial and temporal features by directly processing the saliency maps in their original form. While this approach improves the effectiveness of processing saliency maps in viewport prediction, it also introduces additional computational complexity, which can be a drawback in resource-constrained or real-time scenarios.
To address this issue, different fusion strategies have been proposed. Some methods combine the previous head orientation with the saliency map before feeding them into the LSTM model, processing them together within the LSTM [24]. Other approaches process the head orientation through an LSTM-based architecture and then fuse the output with the saliency map, without directly processing the saliency map inside the LSTM [26]. As an alternative approach, some authors feed the head orientations and the saliency maps into separate LSTM units. The outputs from both units are then concatenated and fed into another LSTM. This approach has shown better performance compared to others, as reported in [29]. However, it also introduces increased computational complexity due to the use of additional LSTM units.
Another key perspective to consider is the impact of viewport prediction on the user’s QoE. Indeed, neglecting the QoE can lead to poor user satisfaction, as even accurate viewport predictions may fall short if issues like latency or re-buffering disrupt the viewing experience. While some state-of-the-art methods incorporate QoE evaluation—using metrics such as the average bitrate within the user’s viewport per frame, bitrate fluctuations across consecutive segments, the buffer stall count, and others—many approaches fail to consider the QoE at all. However, assessing the QoE remains essential in validating the robustness and practical effectiveness of viewport prediction methods in real streaming scenarios.
Additionally, one relevant issue to highlight is the lack of standardization in performance assessment. Indeed, many state-of-the-art methods use different baselines for comparison and employ different metrics for performance evaluation, as shown in Table 3. For instance, some state-of-the-art approaches compare their results against a naive baseline or even self-compare by altering the type of input to their models, without adequately comparing them against other state-of-the-art methods. This makes it more challenging to evaluate their effectiveness relative to others. Furthermore, there is significant variation in the evaluation metrics used across different methods. Even when the same metric is employed, its definition may differ from one approach to another, as highlighted in Section 4.1. This inconsistency complicates comparisons between state-of-the-art methods. Moreover, the datasets used can vary significantly from one study to another, as illustrated in Table 2, which poses another challenge for fair comparison. Another factor making comparisons difficult is the wide variety of output formats, as seen in Table 1. All of these limitations contribute to the difficulty in establishing a unified standard for performance assessment.
Another limitation is the small size of the available 360-degree video datasets. As reported in Table 2, the largest dataset includes 208 videos and was released in 2018. This restricts the generalization capabilities of models, especially those based on DL. The video content in the datasets is also a very important factor to consider for viewport prediction. Depending on the content characteristics, the prediction accuracy can vary significantly. For instance, Nguyen et al. [3] demonstrated that their model achieved approximately 85% accuracy on static videos (with minimal movement), but the accuracy dropped to 70% when tested on ride videos featuring fast motion. As the video subject influences users’ behavior and might have a direct impact on the performance of the models, having a varied dataset is essential to obtain robust models.
Another challenge to tackle is the lack of effective saliency estimation models for 360-degree videos. As highlighted by the current review, many methods still use 2D saliency models, which do not take into account the distortions typical of 360-degree content. Since saliency maps are often used as input in viewport prediction, improving the performance of saliency estimation models for 360-degree videos is an important research direction. The definition of new approaches specifically tailored to 360-degree content would be a valuable direction in terms of enhancing saliency estimation.
Furthermore, another relevant issue to consider is the limited prediction horizon. Many methods predict viewports for intervals from 1 to 3 s (as shown in Table 1). This short duration is often insufficient for buffering or smooth playback. Although some techniques are designed to reach longer prediction horizons, like 5 or 10 s, using clustering techniques or hybrid approaches, the accuracy of these methods degrades rapidly when the prediction horizon increases, even with advanced frameworks like Transformers. For instance, in [37], a Transformer-based model achieved an overlap ratio of 95% for a 1 s prediction horizon, but this dropped to 75% at 5 s and further declined to 73% at 10 s. These results highlight that, despite architectural advances, accurate long-term viewport prediction remains an open problem requiring further research and innovation. Indeed, while long-term prediction is needed, the model’s accuracy has to remain stable over time.
Finally, the size of the input window (e.g., the length of head trajectories or the number of video frames) is a key parameter to set. While the window size must be large enough to detect relevant motion patterns, it should not include noise or unrelated information.
Based on the survey carried out in this work, we provide the following recommendations:
  • Collect rich datasets: There is a need for larger and more varied datasets that include different content types and user data (e.g., eye gaze, head traces, and saliency ground truth). These will aid in model generalization and their application to real-life scenarios.
  • Unify the evaluation procedure: Shared evaluation datasets and metrics would be a relevant contribution. This will allow fair and objective comparisons between different approaches.
  • Design efficient models: Future research should focus on lightweight models that are suitable for real-time deployment on low-end devices. In addition, scalability should be granted for multi-user and long-term prediction scenarios. Furthermore, models should be developed with a user-centric perspective by incorporating QoE as a core evaluation criterion—ensuring not only the accuracy of viewport prediction but also smooth playback, reduced latency, and enhanced overall viewing satisfaction.
  • Improve saliency estimation: Robust saliency estimation models for 360-degree videos should be designed, accounting for spherical distortions and large frame sizes.
  • Optimize input windows: The input data size should be carefully set in order to balance the input quality, the risk of noisy samples, and high computational complexity.
  • Investigate diverse input types: Current approaches mainly rely on video content and users’ behavioral data. Future research could explore combining these with physiological signals.
  • Realize QoE by design: While QoE metrics are often used for model evaluation, QoE principles should also guide the design of new algorithms. To better model users’ perception, future works should combine subjective pilot studies with objective metrics when evaluating algorithm performance.
  • Develop user-tailored prediction models: Although many approaches have been proposed, different AI models and input types may be better suited for different users. Therefore, a more general framework that integrates multiple Artificial Intelligence (AI) models and input modalities could adapt predictions to individual users.
  • Integrate viewport prediction in streaming platforms: To bridge the gap between the research community and final users, viewport prediction should be integrated into VR streaming platforms. Although this is an active field, as indicated by the efforts in this direction by Meta [109], prediction techniques should first reach sufficient maturity in terms of accuracy, prediction horizons, and generalizability across multiple users.
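As an example of a metric that a shared evaluation procedure could standardize, the following is a minimal sketch of the great-circle (orthodromic) distance between the predicted and ground-truth viewport centers, one of the metrics recurring in Table 3; the degree-based angle convention and the function name are illustrative assumptions.

```python
import numpy as np

def great_circle_distance(yaw1, pitch1, yaw2, pitch2):
    """Orthodromic distance (in radians) between two viewport centers given as
    (yaw, pitch) angles in degrees on the unit sphere."""
    def to_unit(yaw, pitch):
        yaw, pitch = np.radians(yaw), np.radians(pitch)
        return np.array([np.cos(pitch) * np.cos(yaw),
                         np.cos(pitch) * np.sin(yaw),
                         np.sin(pitch)])
    v1, v2 = to_unit(yaw1, pitch1), to_unit(yaw2, pitch2)
    return np.arccos(np.clip(np.dot(v1, v2), -1.0, 1.0))

# Toy usage: a 30-degree yaw error at the equator corresponds to about 0.52 rad.
print(f"{great_circle_distance(0.0, 0.0, 30.0, 0.0):.3f} rad")
```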

6. Conclusions

With the diffusion of VR, 360-degree videos are expected to represent a relevant portion of the media content streamed over next-generation networks. Due to the inherent size of this type of video, as well as the limitations of the human visual system, viewport prediction is a key step in the transmission pipeline to reduce the bandwidth requirements without affecting users’ QoE.
This paper presents a thorough review of the AI-based viewport prediction methods that have emerged in the last ten years. While initial approaches relied on relatively simple head movement forecasts, recent methods employ complex DL architectures, which are better suited to capturing the intricate nature of human exploration behavior. Apart from the type of processing algorithm, methods can be categorized based on the type of input, which spans from user data, such as eye or head traces, to video content features, such as saliency or motion maps.
Despite the rich contributions of the research community, several challenges are still present. Based on this, future research should target the collection of varied and large datasets, as well as increasing the prediction horizon and reducing models’ complexity. In addition, attention towards users’ perceived QoE should be included in the method design. Solving these issues will foster the application of viewport prediction techniques in real-life scenarios.

Author Contributions

Investigation, M.Z.A.W. and S.B.; writing—original draft preparation, M.Z.A.W. and S.B.; writing—review and editing, S.B. and F.B.; visualization, S.B.; supervision, F.B.; funding acquisition, F.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the European Union under the Italian National Recovery and Resilience Plan (NRRP) Mission 4, Component 2, Investment 1.3, CUP C93C22005250001, partnership on “Telecommunications of the Future” (PE00000001 program “RESTART”).

Data Availability Statement

Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI   Artificial Intelligence
ARIMA   Autoregressive Integrated Moving Average
AUC-Judd   Area Under the Curve—Judd
BiLSTM   Bidirectional Long Short-Term Memory
CBAM   Convolutional Block Attention Module
CC   Pearson Linear Correlation Coefficient
CNN   Convolutional Neural Network
Conv-LSTM   Convolutional Long Short-Term Memory
DFCA   Dynamic Federated Clustering Algorithm
DHP   Deep Reinforcement Learning-Based Head Movement Prediction
DL   Deep Learning
DoF   Degrees of Freedom
DR   Dead Reckoning
DRL   Deep Reinforcement Learning
ERP   Equirectangular Projection
FN   False Negative
FoV   Field of View
FP   False Positive
FSM   Fused Saliency Map
GBVS   Graph-Based Visual Saliency
GCN   Graph Convolutional Network
GEP   Gaze-Based Field of View Prediction
GNN   Graph Neural Network
GRU   Gated Recurrent Unit
HMD   Head-Mounted Display
IoU   Intersection over Union
KL   Kullback–Leibler Divergence
KNN   K-Nearest Neighbors
LR   Linear Regression
LSTM   Long Short-Term Memory
MIAE   Mean Intersection Angle Error
ML   Machine Learning
MPC   Model Predictive Controller
MSE   Mean Squared Error
NSS   Normalized Scanpath Saliency
PEVQ   Perceptual Evaluation of Video Quality
PoI   Point of Interest
PSNR   Peak Signal-to-Noise Ratio
QoE   Quality of Experience
RL   Reinforcement Learning
RMSE   Root Mean Square Error
RNN   Recurrent Neural Network
RoI   Region of Interest
RR   Ridge Regression
SE-net   Squeeze-and-Excitation Network
SE-Unet   Squeeze-and-Excitation Network and U-Net
SP-ConvGRU   Spherical Convolutional Gated Recurrent Unit
SP-GRU   Spherical Gated Recurrent Unit
SSIM   Structural Similarity Index
SVM   Support Vector Machine
SVR   Support Vector Regressor
TN   True Negative
TP   True Positive
TT   Total True
VR   Virtual Reality
VST   Visual Saliency Transformer
VWS-PSNR   Viewport-Weighted PSNR
WLR   Weighted Linear Regression

References

  1. Global Industry Analysts Inc. Virtual Reality (VR)-Global Strategic Business Report. 2025. Available online: https://www.researchandmarkets.com/reports/3633908/virtual-reality-vr-global-strategic-business?srsltid=AfmBOookGFSnubhPSmOYKDCSUry2qI5-UrXnjJh24schZn0F6-wX2hjn (accessed on 1 September 2025).
  2. Zink, M.; Sitaraman, R.; Nahrstedt, K. Scalable 360° Video Stream Delivery: Challenges, Solutions, and Opportunities. Proc. IEEE 2019, 107, 639–650. [Google Scholar] [CrossRef]
  3. Nguyen, A.; Yan, Z. Enhancing 360 Video Streaming through Salient Content in Head-Mounted Displays. Sensors 2023, 23, 4016. [Google Scholar] [CrossRef] [PubMed]
  4. Park, S.; Bhattacharya, A.; Yang, Z.; Das, S.; Samaras, D. Mosaic: Advancing User Quality of Experience in 360-Degree Video Streaming with Machine Learning. IEEE Trans. Netw. Serv. Manag. 2021, 18, 1000–1015. [Google Scholar] [CrossRef]
  5. Islam, R.; Desai, K.; Quarles, J. Cybersickness Prediction from Integrated HMD’s Sensors: A Multimodal Deep Fusion Approach using Eye-tracking and Head-tracking Data. In Proceedings of the 2021 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Bari, Italy, 4–8 October 2021; pp. 31–40. [Google Scholar] [CrossRef]
  6. Islam, R.; Desai, K.; Quarles, J. Towards Forecasting the Onset of Cybersickness by Fusing Physiological, Head-tracking and Eye-tracking with Multimodal Deep Fusion Network. In Proceedings of the 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Singapore, 17–21 October 2022; pp. 121–130. [Google Scholar] [CrossRef]
  7. Yaqoob, A.; Bi, T.; Muntean, G.M. A Survey on Adaptive 360° Video Streaming: Solutions, Challenges and Opportunities. IEEE Commun. Surv. Tutorials 2020, 22, 2801–2838. [Google Scholar] [CrossRef]
  8. Dziubinski, K.; Bandai, M. Local and Global Viewport History Sampling for Improved User Quality of Experience in Viewport-Aware Tile-Based 360-Degree Video Streaming. IEEE Access 2024, 12, 137455–137471. [Google Scholar] [CrossRef]
  9. Bao, Y.; Wu, H.; Zhang, T.; Ramli, A.A.; Liu, X. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 1161–1170. [Google Scholar] [CrossRef]
  10. Qian, F.; Ji, L.; Han, B.; Gopalakrishnan, V. Optimizing 360 video delivery over cellular networks. In Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications and Challenges, New York, NY, USA, 3–7 October 2016; pp. 1–6. [Google Scholar] [CrossRef]
  11. Qian, F.; Han, B.; Xiao, Q.; Gopalakrishnan, V. Flare: Practical Viewport-Adaptive 360-Degree Video Streaming for Mobile Devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, New Delhi, India, 29 October–2 November 2018; pp. 99–114. [Google Scholar] [CrossRef]
  12. Ban, Y.; Xie, L.; Xu, Z.; Zhang, X.; Guo, Z.; Wang, Y. CUB360: Exploiting Cross-Users Behaviors for Viewport Prediction in 360 Video Adaptive Streaming. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar] [CrossRef]
  13. Baldoni, S.; Poci, O.; Calvagno, G.; Battisti, F. An Ablation Study on 360-Degree Saliency Estimation. In Proceedings of the 2023 International Symposium on Image and Signal Processing and Analysis (ISPA), Rome, Italy, 18–19 September 2023; pp. 1–6. [Google Scholar] [CrossRef]
  14. Aladagli, A.D.; Ekmekcioglu, E.; Jarnikov, D.; Kondoz, A. Predicting head trajectories in 360° virtual reality videos. In Proceedings of the 2017 International Conference on 3D Immersion (IC3D), Brussels, Belgium, 11–12 December 2017; pp. 1–6. [Google Scholar] [CrossRef]
  15. Petrangeli, S.; Simon, G.; Swaminathan, V. Trajectory-Based Viewport Prediction for 360-Degree Virtual Reality Videos. In Proceedings of the 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Taichung, Taiwan, 10–12 December 2018; pp. 157–160. [Google Scholar] [CrossRef]
  16. Nasrabadi, A.T.; Samiei, A.; Prakash, R. Viewport prediction for 360° videos: A clustering approach. In Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, Istanbul, Turkey, 10–11 June 2020; pp. 34–39. [Google Scholar] [CrossRef]
  17. Li, J.; Wang, Y.; Liu, Y. Meta360: Exploring User-Specific and Robust Viewport Prediction in 360-Degree Videos through Bi-Directional LSTM and Meta-Adaptation. In Proceedings of the 2023 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Sydney, Australia, 16–20 October 2023; pp. 652–661. [Google Scholar] [CrossRef]
  18. Mahmoud, M.; Rizou, S.; Panayides, A.S.; Kantartzis, N.V.; Karagiannidis, G.K.; Lazaridis, P.I.; Zaharis, Z.D. Optimized Tile Quality Selection in Multi-User 360° Video Streaming. IEEE Open J. Commun. Soc. 2024, 5, 7301–7316. [Google Scholar] [CrossRef]
  19. Feng, W.; Wang, S.; Dai, Y. Adaptive 360-Degree Streaming: Optimizing with Multi-window and Stochastic Viewport Prediction. IEEE Trans. Mob. Comput. 2025, 24, 5903–5915. [Google Scholar] [CrossRef]
  20. Chao, F.Y.; Ozcinar, C.; Smolic, A. Transformer-based Long-Term Viewport Prediction in 360° Video: Scanpath is All You Need. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 6–8 October 2021; pp. 1–6. [Google Scholar] [CrossRef]
  21. Wang, H.; Long, Z.; Dong, H.; El Saddik, A. MADRL-Based Rate Adaptation for 360° Video Streaming with Multiviewpoint Prediction. IEEE Internet Things J. 2024, 11, 26503–26517. [Google Scholar] [CrossRef]
  22. Ao, A.; Park, S. Applying Transformer-Based Computer Vision Models to Adaptive Bitrate Allocation for 360° Live Streaming. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  23. Fan, C.L.; Lee, J.; Lo, W.C.; Huang, C.Y.; Chen, K.T.; Hsu, C.H. Fixation Prediction for 360° Video Streaming in Head-Mounted Virtual Reality. In Proceedings of the 27th Workshop on Network and Operating Systems Support for Digital Audio and Video, Taipei, Taiwan, 20–23 June 2017; pp. 67–72. [Google Scholar] [CrossRef]
  24. Nguyen, A.; Yan, Z.; Nahrstedt, K. Your Attention is Unique: Detecting 360-Degree Video Saliency in Head-Mounted Display for Head Movement Prediction. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1190–1198. [Google Scholar] [CrossRef]
  25. Li, Y.; Xu, Y.; Xie, S.; Ma, L.; Sun, J. Two-Layer FoV Prediction Model for Viewport Dependent Streaming of 360-Degree Videos. In Proceedings of the International ICST Conference on Communications and Networking in China, Chengdu, China, 23–25 October 2018. [Google Scholar]
  26. Xu, Y.; Dong, Y.; Wu, J.; Sun, Z.; Shi, Z.; Yu, J.; Gao, S. Gaze Prediction in Dynamic 360° Immersive Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5333–5342. [Google Scholar] [CrossRef]
  27. Fan, C.L.; Yen, S.C.; Huang, C.Y.; Hsu, C.H. Optimizing Fixation Prediction Using Recurrent Neural Networks for 360 Video Streaming in Head-Mounted Virtual Reality. IEEE Trans. Multimed. 2020, 22, 744–759. [Google Scholar] [CrossRef]
  28. Feng, X.; Liu, Y.; Wei, S. LiveDeep: Online Viewport Prediction for Live Virtual Reality Streaming Using Lifelong Deep Learning. In Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Atlanta, GA, USA, 22–26 March 2020; pp. 800–808. [Google Scholar] [CrossRef]
  29. Rondón, M.F.R.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. TRACK: A New Method from a Re-Examination of Deep Architectures for Head Motion Prediction in 360° Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5681–5699. [Google Scholar] [CrossRef]
  30. Manfredi, G.; Racanelli, V.A.; De Cicco, L.; Mascolo, S. LSTM-based Viewport Prediction for Immersive Video Systems. In Proceedings of the 2023 21st Mediterranean Communication and Computer Networking Conference (MedComNet), Island of Ponza, Italy, 13–15 June 2023; pp. 49–52. [Google Scholar] [CrossRef]
  31. Wang, M.; Peng, S.; Chen, X.; Zhao, Y.; Xu, M.; Xu, C. CoLive: An Edge-Assisted Online Learning Framework for Viewport Prediction in 360° Live Streaming. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar] [CrossRef]
  32. Dong, P.; Shen, R.; Xie, X.; Li, Y.; Zuo, Y.; Zhang, L. Predicting Long-Term Field of View in 360-Degree Video Streaming. IEEE Netw. 2023, 37, 26–33. [Google Scholar] [CrossRef]
  33. Wang, M.; Chen, X.; Yang, X.; Peng, S.; Zhao, Y.; Xu, M.; Xu, C. CoLive: Edge-Assisted Clustered Learning Framework for Viewport Prediction in 360 Live Streaming. IEEE Trans. Multimed. 2024, 26, 5078–5091. [Google Scholar] [CrossRef]
  34. Zhang, L.; Zhou, H.; Shen, L.; Liu, J.; Cui, L. Towards Attention-Aware Interactive 360-Degree Video Streaming on Smartphones. IEEE Netw. 2025, 39, 147–156. [Google Scholar] [CrossRef]
  35. Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 3. [Google Scholar] [CrossRef]
  36. Setayesh, M.; Wong, V.W.S. Viewport Prediction, Bitrate Selection, and Beamforming Design for THz-Enabled 360° Video Streaming. IEEE Trans. Wirel. Commun. 2025, 24, 1849–1865. [Google Scholar] [CrossRef]
  37. Zhang, Z.; Du, H.; Huang, S.; Zhang, W.; Zheng, Q. VRFormer: 360-Degree Video Streaming with FoV Combined Prediction and Super resolution. In Proceedings of the 2022 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom), Melbourne, Australia, 17–19 December 2022; pp. 531–538. [Google Scholar] [CrossRef]
  38. Gao, B.; Sheng, D.; Zhang, L.; Qi, Q.; He, B.; Zhuang, Z.; Wang, J. STAR-VP: Improving Long-term Viewport Prediction in 360° Videos via Space-aligned and Time-varying Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 5556–5565. [Google Scholar] [CrossRef]
  39. Guo, Y.; Xu, M.; Jiang, L.; Deng, X.; Zhou, J.; Chen, G.; Sigal, L. Proposal with Alignment: A Bi-Directional Transformer for 360° Video Viewport Proposal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11423–11437. [Google Scholar] [CrossRef]
  40. Xu, X.; Tan, X.; Wang, S.; Liu, Z.; Zheng, Q. Multi-Features Fusion based Viewport Prediction with GNN for 360-Degree Video Streaming. In Proceedings of the 2023 IEEE International Conference on Metaverse Computing, Networking and Applications (MetaCom), Kyoto, Japan, 26–28 June 2023; pp. 57–64. [Google Scholar] [CrossRef]
  41. Xu, M.; Song, Y.; Wang, J.; Qiao, M.; Huo, L.; Wang, Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2693–2708. [Google Scholar] [CrossRef]
  42. Wu, C.; Tan, Z.; Wang, Z.; Yang, S. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; pp. 193–198. [Google Scholar] [CrossRef]
  43. Bao, Y.; Zhang, T.; Pande, A.; Wu, H.; Liu, X. Motion-Prediction-Based Multicast for 360-Degree Video Transmissions. In Proceedings of the 2017 14th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), San Diego, CA, USA, 12–14 June 2017; pp. 1–9. [Google Scholar] [CrossRef]
  44. Markley, L.; Cheng, Y.; Crassidis, J.; Oshman, Y. Averaging Quaternions. J. Guid. Control Dyn. 2007, 30, 1193–1196. [Google Scholar] [CrossRef]
  45. Nasrabadi, A.T.; Samiei, A.; Mahzari, A.; McMahan, R.P.; Prakash, R.; Farias, M.C.Q.; Carvalho, M.M. A taxonomy and dataset for 360° videos. In Proceedings of the 10th ACM Multimedia Systems Conference, Amherst, MA, USA, 18–21 June 2019; pp. 273–278. [Google Scholar] [CrossRef]
  46. Lo, W.C.; Fan, C.L.; Lee, J.; Huang, C.Y.; Chen, K.T.; Hsu, C.H. 360° Video Viewing Dataset in Head-Mounted Virtual Reality. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; pp. 211–216. [Google Scholar] [CrossRef]
  47. Li, C.; Xu, M.; Du, X.; Wang, Z. Bridge the Gap Between VQA and Human Behavior on Omnidirectional Video: A Large-Scale Dataset and a Deep Learning Model. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 932–940. [Google Scholar] [CrossRef]
  48. FFmpeg Developers. FFmpeg Multimedia Framework. Available online: https://ffmpeg.org/ (accessed on 29 July 2025).
  49. Guo, C.; Cui, Y.; Liu, Z. Optimal Multicast of Tiled 360 VR Video in OFDMA Systems. IEEE Commun. Lett. 2018, 22, 2563–2566. [Google Scholar] [CrossRef]
  50. Mahmoud, M.; Valiandi, I.; Panayides, A.S.; Rizou, S.; Lazaridis, P.I.; Karagiannidis, G.K.; Kantartzis, N.V.; Zaharis, Z.D. Versatile Video Coding Performance Evaluation for Tiled 360 deg Videos. In Proceedings of the European Wireless 2023; 28th European Wireless Conference, Rome, Italy, 2–4 October 2023; pp. 191–196. [Google Scholar]
  51. David, E.J.; Gutiérrez, J.; Coutrot, A.; Da Silva, M.P.; Callet, P.L. A dataset of head and eye movements for 360° videos. In Proceedings of the 9th ACM Multimedia Systems Conference, Amsterdam, The Netherlands, 12–15 June 2018; pp. 432–437. [Google Scholar] [CrossRef]
  52. Hou, X.; Dey, S.; Zhang, J.; Budagavi, M. Predictive Adaptive Streaming to Enable Mobile 360-Degree and VR Experiences. IEEE Trans. Multimed. 2021, 23, 716–731. [Google Scholar] [CrossRef]
  53. Zhang, Y.; Zhao, P.; Bian, K.; Liu, Y.; Song, L.; Li, X. DRL360: 360-degree Video Streaming with Deep Reinforcement Learning. In Proceedings of the IEEE INFOCOM 2019—IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 1252–1260. [Google Scholar] [CrossRef]
  54. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–12 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
  55. Lu, Y.; Zhu, Y.; Wang, Z. Personalized 360-Degree Video Streaming: A Meta-Learning Approach. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3143–3151. [Google Scholar] [CrossRef]
  56. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  57. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  58. Itti, L.; Koch, C.; Niebur, E. A Model of Saliency-based Visual Attention for Rapid Scene Analysis. Pattern Anal. Mach. Intell. IEEE Trans. 1998, 20, 1254–1259. [Google Scholar] [CrossRef]
  59. Lou, J.; Lin, H.; Marshall, D.; Saupe, D.; Liu, H. TranSalNet: Towards perceptually relevant visual saliency prediction. Neurocomputing 2022, 494, 455–467. [Google Scholar] [CrossRef]
  60. Liu, N.; Zhang, N.; Wan, K.; Shao, L.; Han, J. Visual Saliency Transformer. arXiv 2021, arXiv:2104.12099. [Google Scholar] [CrossRef]
  61. Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  62. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; Volume 2, pp. 674–679. [Google Scholar]
  63. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  64. Cornia, M.; Baraldi, L.; Serra, G.; Cucchiara, R. A deep multi-level network for saliency prediction. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3488–3493. [Google Scholar] [CrossRef]
  65. Mavlankar, A.; Girod, B. Video Streaming with Interactive Pan/Tilt/Zoom. In High-Quality Visual Experience: Creation, Processing and Interactivity of High-Resolution and High-Dimensional Video Signals; Mrak, M., Grgic, M., Kunt, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 431–455. [Google Scholar] [CrossRef]
  66. El-Ganainy, T.; Hefeeda, M. Streaming virtual reality content. arXiv 2016, arXiv:1612.08350. [Google Scholar] [CrossRef]
  67. Corbillon, X.; Simon, G.; Devlic, A.; Chakareski, J. Viewport-adaptive navigable 360-degree video delivery. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; pp. 1–7. [Google Scholar] [CrossRef]
  68. Yu, M.; Lakshman, H.; Girod, B. A Framework to Evaluate Omnidirectional Video Coding Schemes. In Proceedings of the 2015 IEEE International Symposium on Mixed and Augmented Reality, Fukuoka, Japan, 29 September–3 October 2015; pp. 31–36. [Google Scholar] [CrossRef]
  69. Pan, J.; Sayrol, E.; Giro-i Nieto, X.; McGuinness, K.; O’Connor, N.E. Shallow and Deep Convolutional Networks for Saliency Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  70. Huang, X.; Shen, C.; Boix, X.; Zhao, Q. SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 262–270. [Google Scholar] [CrossRef]
  71. Xie, L.; Xu, Z.; Ban, Y.; Zhang, X.; Guo, Z. 360ProbDASH: Improving QoE of 360 Video Streaming Using Tile-based HTTP Adaptive Streaming. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 315–323. [Google Scholar] [CrossRef]
  72. De Abreu, A.; Ozcinar, C.; Smolic, A. Look around you: Saliency maps for omnidirectional images in VR applications. In Proceedings of the 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), Erfurt, Germany, 31 May–2 June 2017; pp. 1–6. [Google Scholar] [CrossRef]
  73. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar] [CrossRef]
  74. Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2462–2470. [Google Scholar]
  75. Feng, X.; Swaminathan, V.; Wei, S. Viewport Prediction for Live 360-Degree Mobile Video Streaming Using User-Content Hybrid Motion Tracking. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2019, 3, 43. [Google Scholar] [CrossRef]
  76. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.k.; Woo, W.c. Convolutional LSTM Network: A machine learning approach for precipitation nowcasting. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 1, pp. 802–810. [Google Scholar]
  77. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  78. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  79. Nguyen, A.; Yan, Z. A saliency dataset for 360-degree videos. In Proceedings of the 10th ACM Multimedia Systems Conference, Amherst, MA, USA, 18–21 June 2019; pp. 279–284. [Google Scholar] [CrossRef]
  80. Zhang, L.; Suo, Y.; Wu, X.; Wang, F.; Chen, Y.; Cui, L.; Liu, J.; Ming, Z. TBRA: Tiling and Bitrate Adaptation for Mobile 360-Degree Video Streaming. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 4007–4015. [Google Scholar] [CrossRef]
  81. Dasari, M.; Bhattacharya, A.; Vargas, S.; Sahu, P.; Balasubramanian, A.; Das, S.R. Streaming 360-Degree Videos Using Super-Resolution. In Proceedings of the IEEE INFOCOM 2020—IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 1977–1986. [Google Scholar] [CrossRef]
  82. Peng, S.; Hu, J.; Li, Z.; Xiao, H.; Yang, S.; Xu, C. Spherical Convolution-based Saliency Detection for FoV Prediction in 360-degree Video Streaming. In Proceedings of the 2023 International Wireless Communications and Mobile Computing (IWCMC), Marrakesh, Morocco, 19–23 June 2023; pp. 162–167. [Google Scholar] [CrossRef]
  83. Coors, B.; Condurache, A.P.; Geiger, A. SphereNet: Learning Spherical Representations for Detection and Classification in Omnidirectional Images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  84. Zhang, Z.; Xu, Y.; Yu, J.; Gao, S. Saliency Detection in 360° Videos. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VII. Springer: Berlin/Heidelberg, Germany, 2018; pp. 504–520. [Google Scholar] [CrossRef]
  85. Cho, K.; Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. [Google Scholar] [CrossRef]
  86. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
  87. Chao, F.Y.; Zhang, L.; Hamidouche, W.; Deforges, O. Salgan360: Visual Saliency Prediction on 360 Degree Images with Generative Adversarial Networks. In Proceedings of the 2018 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), San Diego, CA, USA, 23–27 July 2018; pp. 1–4. [Google Scholar] [CrossRef]
  88. Jiang, L.; Xu, M.; Liu, T.; Qiao, M.; Wang, Z. DeepVS: A Deep Learning Based Video Saliency Prediction Approach. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XIV. Springer: Berlin/Heidelberg, Germany, 2018; pp. 625–642. [Google Scholar] [CrossRef]
  89. Che, Z.; Borji, A.; Zhai, G.; Min, X.; Guo, G.; Le Callet, P. How is Gaze Influenced by Image Transformations? Dataset and Model. Trans. Img. Proc. 2020, 29, 2287–2300. [Google Scholar] [CrossRef] [PubMed]
  90. Setayesh, M.; Wong, V.W. A Content-based Viewport Prediction Framework for 360° Video Using Personalized Federated Learning and Fusion Techniques. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 654–659. [Google Scholar] [CrossRef]
  91. Yun, H.; Lee, S.; Kim, G. Panoramic Vision Transformer for Saliency Detection in 360° Videos. arXiv 2022, arXiv:2209.08956. [Google Scholar] [CrossRef]
  92. Liu, X.; Deng, Y.; Han, C.; Renzo, M.D. Learning-Based Prediction, Rendering and Transmission for Interactive Virtual Reality in RIS-Assisted Terahertz Networks. IEEE J. Sel. Areas Commun. 2022, 40, 710–724. [Google Scholar] [CrossRef]
  93. Yaqoob, A.; Muntean, G.M. A Combined Field-of-View Prediction-Assisted Viewport Adaptive Delivery Scheme for 360° Videos. IEEE Trans. Broadcast. 2021, 67, 746–760. [Google Scholar] [CrossRef]
  94. Huang, R.; Wong, V.W.; Schober, R. Rate-Splitting for Intelligent Reflecting Surface-Aided Multiuser VR Streaming. IEEE J. Sel. Areas Commun. 2023, 41, 1516–1535. [Google Scholar] [CrossRef]
  95. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115. [Google Scholar] [CrossRef]
  96. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Cham, Switzerland, 2016; pp. 630–645. [Google Scholar]
  97. Pang, H.; Zhang, C.; Wang, F.; Liu, J.; Sun, L. Towards Low Latency Multi-viewpoint 360° Interactive Video: A Multimodal Deep Reinforcement Learning Approach. In Proceedings of the IEEE INFOCOM 2019—IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019; pp. 991–999. [Google Scholar] [CrossRef]
  98. Li, C.; Zhang, W.; Liu, Y.; Wang, Y. Very Long Term Field of View Prediction for 360-Degree Video Streaming. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; pp. 297–302. [Google Scholar] [CrossRef]
  99. Chen, J.; Hu, M.; Luo, Z.; Wang, Z.; Wu, D. SR360: Boosting 360-degree video streaming with super-resolution. In Proceedings of the NOSSDAV ’20: Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, Istanbul, Turkey, 10–11 June 2020; pp. 1–6. [Google Scholar] [CrossRef]
  100. Zhang, Z.; Chen, Y.; Zhang, W.; Yan, C.; Zheng, Q.; Wang, Q.; Chen, W. Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; MM ’23. pp. 3560–3568. [Google Scholar] [CrossRef]
  101. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. arXiv 2021, arXiv:2103.17154. [Google Scholar] [CrossRef]
  102. Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. arXiv 2021, arXiv:2101.02702. [Google Scholar] [CrossRef]
  103. Dahou Djilali, Y.A.; Tliba, M.; McGuinness, K.; O’Connor, N. ATSal: An Attention Based Architecture for Saliency Prediction in 360 Videos. arXiv 2020, arXiv:2011.10600. [Google Scholar] [CrossRef]
  104. Vo, C.H.; Chiang, J.C.; Le, D.H.; Nguyen, T.T.; Pham, T.V. Saliency Prediction for 360-degree Video. In Proceedings of the 2020 5th International Conference on Green Technology and Sustainable Development (GTSD), Ho Chi Minh City, Vietnam, 27–28 November 2020; pp. 442–448. [Google Scholar] [CrossRef]
  105. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; NIPS’16. pp. 3844–3852. [Google Scholar]
  106. Chopra, L.; Chakraborty, S.; Mondal, A.; Chakraborty, S. PARIMA: Viewport Adaptive 360-Degree Video Streaming. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; ACM: New York, NY, USA, 2021. WWW ’21. [Google Scholar] [CrossRef]
  107. Hu, H.N.; Lin, Y.C.; Liu, M.Y.; Cheng, H.T.; Chang, Y.J.; Sun, M. Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1396–1405. [Google Scholar] [CrossRef]
  108. Chen, P.W.; Yang, T.S.; Huang, G.L.; Huang, C.W.; Chao, Y.C.; Lu, C.H.; Wu, P.Y. Viewing Bias Matters in 360° Videos Visual Saliency Prediction. IEEE Access 2023, 11, 46084–46094. [Google Scholar] [CrossRef]
  109. Chen, S.; Kuzyakov, E.; Peng, R. Enhancing High-Resolution 360 Streaming with View Prediction. 2017. Available online: https://engineering.fb.com/2017/04/19/virtual-reality/enhancing-high-resolution-360-streaming-with-view-prediction/ (accessed on 31 August 2025).
Figure 1. Diagram of 360-degree video content and potential angles of movement: yaw, pitch, and roll.
Figure 2. PRISMA flow diagram of the performed search.
Figure 3. Overview of input/output combinations for viewport prediction.
Table 1. Viewport prediction approaches.

| Category | Reference | Input | Output | Technique | Horizon |
| --- | --- | --- | --- | --- | --- |
| Head movement / Regression (Section 2.1.1) | Qian et al. 2016 [10] | Head orientation traces | Head orientation | Average, LR, WLR | 2 s |
| Head movement / Regression (Section 2.1.1) | Bao et al. 2016 [9] | Head orientation traces | Head orientation | Naive, LR, shallow NN | 1 s |
| Head movement / Regression (Section 2.1.1) | Qian et al. 2018 [11] | Head orientation traces | Head orientation | Naive, LR, RR, SVR | 3 s |
| Head movement / Regression (Section 2.1.1) | Ban et al. 2018 [12] | Head orientation traces (current and previous users) | Tile probabilities | LR, KNN | 6 s |
| Head movement / Clustering (Section 2.1.2) | Petrangeli et al. 2018 [15] | Head orientation traces (current and previous users) | Head orientation | Spectral clustering | 10 s |
| Head movement / Clustering (Section 2.1.2) | Nasrabadi et al. 2020 [16] | Head orientation traces (current and previous users) | Head orientation | Naive, custom clustering | 10 s |
| Head movement / Clustering (Section 2.1.2) | Dziubinski et al. 2024 [8] | Video frames, tiles in viewports (previous users) | Tile quality level | KNN, DBSCAN | 1, 2 s |
| Head movement / LSTM (Section 2.1.3) | Li et al. 2023 [17] | Head orientation traces | Head orientation | BiLSTM, metalearning | 5 s |
| Head movement / LSTM (Section 2.1.3) | Mahmoud et al. 2024 [18] | Head orientation traces | Tile probabilities, tile quality levels | LSTM | 1 s |
| Head movement / LSTM (Section 2.1.3) | Feng et al. 2025 [19] | Head scanpath | Tile probabilities | LSTM | 1–6 s |
| Head movement / Transformer (Section 2.1.4) | Chao et al. 2021 [20] | Head scanpath | Head scanpath | Transformer | 5 s |
| Head movement / Transformer (Section 2.1.4) | Wang et al. 2024 [21] | Head scanpath | Head scanpath | Transformer, DRL | 1–5 s |
| Saliency / Saliency estimation (Section 2.2) | Aladagli et al. 2017 [14] | Video frames | Saliency map | GBVS | 2 s |
| Saliency / Transformer (Section 2.2) | Alice et al. 2024 [22] | Video frames | Tile probabilities | TranSalNet, VST | 1 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Fan et al. 2017 [23] | Head orientation traces, motion and saliency maps | Tile probabilities | LSTM, CNN, optical flow | 1 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Nguyen et al. 2018 [24] | Viewport maps, saliency maps | Viewport map | LSTM, PanoSalNet | 2.5 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Li et al. 2018 [25] | Head orientation traces, navigation speed, video frames | Tile probabilities | LSTM, FSM, optical flow | 1 s, 2 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Xu et al. 2018 [26] | Eye gaze scanpath, video frames, current and past FoV images | Gaze displacement | LSTM, FlowNet2, SalNet | 1 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Fan et al. 2020 [27] | Head orientation traces, motion and saliency maps | Tile probabilities | LSTM, CNN, optical flow | 1 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Feng et al. 2020 [28] | Head orientation traces, video frames | Tiles in viewport | LSTM, CNN | 2 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Rondón et al. 2022 [29] | Head orientation traces, saliency maps | Head orientation | LSTM, PanoSalNet | 5 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Nguyen et al. 2023 [3] | Viewport maps, saliency maps | Viewport map | LSTM, PanoSalNet | 1 s |
| Head movement with saliency / LSTM (Section 2.3.1) | Manfredi et al. 2023 [30] | PoIs | PoI | LSTM | 3 s |
| Head movement with saliency / Conv-LSTM (Section 2.3.2) | Wang et al. 2022 [31] | Viewport maps, video frames | Viewport map | ConvLSTM, CNN | 2 s |
| Head movement with saliency / Conv-LSTM (Section 2.3.2) | Dong et al. 2023 [32] | Eye gaze heatmaps (current and previous users) | Eye gaze heatmaps | Conv-LSTM, SE-Unet | 10 s |
| Head movement with saliency / Conv-LSTM (Section 2.3.2) | Wang et al. 2024 [33] | Video frames, viewport maps (current and previous users) | Viewport map | ConvLSTM, CNN, clustering | - |
| Head movement with saliency / Conv-LSTM (Section 2.3.2) | Zhang et al. 2025 [34] | Eye gaze scanpath, video frames | Tile probabilities | ConvLSTM, GCN | - |
| Head movement with saliency / GRU (Section 2.3.3) | Li et al. 2023 [35] | Video frames, viewport maps (previous users) | Viewport map | S-SPCNN, T-SPCNN, SP-ConvGRU | 2 s |
| Head movement with saliency / GRU (Section 2.3.3) | Setayesh et al. 2024 [36] | Head orientation traces, video frames | Viewport tiles | PAVER, GRU | - |
| Head movement with saliency / Transformers (Section 2.3.4) | Zhang et al. 2022 [37] | FoV images and head and eye gaze scanpaths | Head scanpath | Transformer, CNN | 10 s |
| Head movement with saliency / Transformers (Section 2.3.4) | Gao et al. 2024 [38] | Video frames, head scanpath | Head scanpath | PAVER, LSTM, Transformer | 1, 2–5 s |
| Head movement with saliency / Transformers (Section 2.3.4) | Guo et al. 2024 [39] | Video frames | Head scanpath | Bidirectional Transformer, residual network | - |
| Head movement with saliency / GNN (Section 2.3.5) | Xu et al. 2023 [40] | Video frames, head orientation traces (current and previous users) | Tile probabilities | GNN, LSTM, SalNet, K-means, FlowNet | 3 s |
| Head movement with saliency / RL (Section 2.3.6) | Xu et al. 2019 [41] | Video frames, head scanpaths (current and previous users) | Head scanpath | DRL | 0.03 s |
Table 2. Dataset summary. The dash (-) indicates that the information was not available in the reference paper.

| Reference | Videos | Users | Duration | Resolution | Video Source | Data Capture System | Employed In | Available |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qian et al. 2016 [10] | 4 | 5 | 100–206 s | - | YouTube and Facebook | Head orientation from smartphone inside headset | [10] | no |
| Bao et al. 2016 [9] | 16 | 153 | 30 s | 1080p–4K | YouTube | Head orientation from HMD | [9,14,15] | no |
| Wu et al. 2017 [42] | 18 | 48 | 164–655 s | - | - | Head position and orientation from HMD | [12,20,22,28,31,33,38] | yes |
| Fan et al. 2017 [23] | 10 | 25 | - | 4K | YouTube | Head position and orientation from sensor logger and screen capturer | [17,23,29] | no |
| Lo et al. 2017 [46] | 10 | 50 | 60 s | 4K | YouTube | Head position and orientation from sensor logger and screen capturer | [8,17,25,27,29,40] | yes |
| Qian et al. 2018 [11] | 10 | 130 | 117–293 s | 4K | YouTube | Head orientation from smartphone inside headset | [11,34] | yes |
| David et al. 2018 [51] | 19 | 57 | 20 s | 4K | YouTube | Eye tracker and head scanpath (derived from eye) | [19,20,21,38] | yes |
| Nguyen et al. 2018 [24] | 11 | 48 | 20–45 s | - | Videos from existing datasets | Eye gaze from head orientation logs in original datasets | [3,17,24,29] | yes |
| Xu et al. 2018 [26] | 208 | 45 | 20–60 s | 4K | YouTube | Eye tracker | [17,26,29,32,35,37,39] | yes |
| Zhang et al. 2018 [84] | 104 | 27 | 20–60 s | - | Videos from existing datasets | Eye tracker | [35,36,82] | yes |
| Xu et al. 2019 [41] | 76 | 58 | 10–80 s | 3–8K | YouTube and VRCun | Head orientation from HMD and eye tracker | [17,20,29,39,41] | yes |
| Li et al. 2018 [47] | 60 | 221 | 10–23 s | 4–8K | Custom and YouTube | Head orientation from HMD and eye tracker | [18,39] | yes |
| Nasrabadi et al. 2019 [45] | 28 | 30 | 60 s | 2–4K | Custom (Samsung Gear 360) and YouTube | Head orientation from HMD | [16,38] | yes |
| Nguyen et al. 2019 [79] | 24 | 48 | 60–655 s | - | Videos from existing datasets | Eye gaze from head orientation logs in original datasets | [31] | yes |
| Manfredi et al. 2023 [30] | 3 | 130 | - | - | Manual PoI extraction from existing datasets | Actual PoIs from extracted fixations | [30] | no |
Table 3. Evaluation metrics and comparison methods; metrics in bold are the top three most used ones.

| Reference | Evaluation Metric | Comparison Method |
| --- | --- | --- |
| Qian et al. 2016 [10] | **Accuracy** | Variations of own method |
| Bao et al. 2016 [9] | Mean error, RMSE, 99th percentile, and 99.9th percentile | Naive approach and variations of own method |
| Qian et al. 2018 [11] | **Accuracy** | Naive approach |
| Ban et al. 2018 [12] | Viewport deviation, V-PSNR, viewport quality variance, and bandwidth consumption | Variations of own method |
| Petrangeli et al. 2018 [15] | Great circle distance and **mean FoV overlap** | Naive approach, [12,43] |
| Nasrabadi et al. 2020 [16] | **Mean FoV overlap** | Naive approach and variations of own method |
| Dziubinski et al. 2024 [8] | Average bitrate of video overlapping with viewport, Q3, and Q4 | Variations of own method |
| Li et al. 2023 [17] | **Accuracy**, F1-score, **mean FoV overlap**, intersection angle error, and IoU | [23,24,25,26,29,41,55] |
| Mahmoud et al. 2024 [18] | VWS-PSNR | [48,49,50] |
| Feng et al. 2025 [19] | Q1, Q2, Q3, and **re-buffering time** | [11,52,53] |
| Chao et al. 2021 [20] | Average great circle distance, average ratio of overlapping tiles, and **mean FoV overlap** | [16,24,29] |
| Wang et al. 2024 [21] | Great circle distance, average great circle distance | [20,29] |
| Aladagli et al. 2017 [14] | Cross-correlation | - |
| Alice et al. 2024 [22] | Q1, Q2, and Q3 | [24] |
| Fan et al. 2017 [23] | **Accuracy**, F1-score, missing ratio, bandwidth consumption, **re-buffering time**, PSNR, SSIM, PEVQ, and running time | Naive approach, [65], and variations of own method |
| Nguyen et al. 2018 [24] | **Accuracy** | Variations of own method |
| Li et al. 2018 [25] | **Accuracy** and F1-score | [10] |
| Xu et al. 2018 [26] | Mean intersection angle error | Variations of own method |
| Fan et al. 2020 [27] | Missing ratio, unseen ratio, bandwidth consumption, peak bandwidth, V-PSNR, and **re-buffering time** | Naive approach and DR |
| Feng et al. 2020 [28] | Bandwidth consumption, **accuracy**, and processing time | [75] and variations of own method |
| Rondón et al. 2022 [29] | **Accuracy**, F1-score, **mean FoV overlap**, intersection angle error, and IoU | [12,14,23,24,25,26,41] |
| Nguyen et al. 2023 [3] | Buffer stall count, buffer stall duration, blank ratio, SSIM, and bandwidth saving | [71], stream all tiles (no prediction), and variations of own method |
| Manfredi et al. 2023 [30] | **Accuracy** | Naive approach |
| Wang et al. 2022 [31] | **Accuracy**, recall, precision, bandwidth saving, and processing time | [24,28] |
| Dong et al. 2023 [32] | Hit rate, viewport deviation, MSE, viewport PSNR, bandwidth consumption, viewport quality variance, and **re-buffering time** | [12] and variations of own method |
| Wang et al. 2024 [33] | **Accuracy**, precision, recall, F1-score, and bandwidth saving | [24,28,31] |
| Zhang et al. 2025 [34] | PSNR, inter-chunk PSNR, and **re-buffering time** | [11,80,81] |
| Li et al. 2023 [35] | **Accuracy**, precision, and recall | [11,24,87,88,89] |
| Setayesh et al. 2024 [36] | Tiles overlap, Q1, Q2, Q3, and **re-buffering time** | [92,93,94] |
| Zhang et al. 2022 [37] | Viewport overlap ratio, MSE, Q1, Q2, and **re-buffering time** | [12,23,53,81,97,98,99] |
| Gao et al. 2024 [38] | IoU and orthodromic distance | [20,29,100] |
| Guo et al. 2024 [39] | IoU, **mean FoV overlap**, and great circle distance | [24,41,101,102,103,104] |
| Xu et al. 2023 [40] | **Accuracy**, Q1, Q2, and Q3 | [28,106] |
| Xu et al. 2019 [41] | **Mean FoV overlap** | Naive approach, random viewport prediction, and [107] |
