Article

DiffVP: A Diffusion Model with Explicit Coordinate-Temporal Encoding for Viewport Prediction in 360 Videos

1 School of Computer and Artificial Intelligence, Shandong Jianzhu University, Jinan 250101, China
2 Institute of Applied AI Engineering and Technology, Shandong Jianzhu University, Jinan 250101, China
3 School of Journalism and Communication, Shandong Normal University, Jinan 250014, China
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(6), 1326; https://doi.org/10.3390/electronics15061326
Submission received: 26 January 2026 / Revised: 14 February 2026 / Accepted: 16 March 2026 / Published: 23 March 2026

Abstract

Viewport prediction is a key component in tile-based 360° video streaming. Existing viewport prediction models based on Long Short-Term Memory (LSTM) networks or Transformers typically output a single deterministic future trajectory through deterministic mapping, which fails to capture the inherent randomness in viewing behavior. Moreover, when encoding trajectory features, such models often map trajectory coordinates directly into a high-dimensional space while neglecting the spatial information inherent in the coordinates themselves. Additionally, they exhibit limitations in capturing cross-modal relationships between visual and trajectory features. To address these issues, this paper proposes DiffVP, a diffusion model for viewport prediction in 360° videos. Under the constraints of historical viewing trajectories and video saliency maps, DiffVP leverages Denoising Diffusion Implicit Models (DDIMs) to model future viewing trajectories as probability distributions, generating diverse and plausible prediction results. In the denoising network, DiffVP employs Explicit Coordinate-Temporal Encoding (ECTE) to model the temporal dependencies of trajectories and the spatial relationships among coordinates; moreover, a Coordinate-Aware Saliency Feature Fusion (CASF) module is proposed to achieve cross-modal alignment and interactive fusion of saliency and trajectory features. Experimental results on three public datasets demonstrate that DiffVP achieves the best accuracy for 2–5 s viewport prediction without sacrificing the performance of short-term (<1 s) prediction.

1. Introduction

In recent years, 360° video has garnered significant attention due to its ability to provide a more realistic and immersive viewing experience compared to traditional 2D videos [1,2]. However, owing to their typically ultra-high resolution, transmitting 360° videos often consumes substantial bandwidth, while existing transmission rates struggle to meet the demands of high-quality playback. Notably, humans do not view the entire 360° scene simultaneously during actual viewing; their field of vision is usually limited to a local region of approximately 90° × 110°, known as the “Viewport” [3,4]. Based on this, researchers have proposed tile-based rate adaptation algorithms for 360° videos. These algorithms first predict the future viewport and then allocate higher bitrates to tiles within the viewport to reduce bandwidth pressure while ensuring playback quality [5]. Therefore, accurately predicting the future viewport has become a key prerequisite for achieving efficient 360° video transmission. By predicting and pre-caching content that humans are likely to watch in the future, a more persistent cache area can be established to effectively cope with bandwidth fluctuations [6].
In previous studies, most methods have been based on Long Short-term Memory Networks (LSTM) and Transformer models, which directly learn a deterministic mapping function from historical sequences to future sequences [7,8]. These models output a single, averaged future trajectory, which overlooks the inherent randomness in viewing behavior. As illustrated in Figure 1, even under the same video content, viewing trajectories can vary significantly, reflecting diverse exploration strategies and motion patterns. Furthermore, existing methods have certain limitations in feature modeling. They often map coordinate features of viewing sequences directly into high-dimensional representations before feeding them into the model, thereby neglecting the spatial information inherently carried by the coordinates themselves. In addition, visual content plays an important role in influencing viewing trajectories. Prior methods have shown that salient regions in videos tend to attract visual attention, yet relying solely on trajectory features is insufficient to capture this influence [9,10]. Most existing methods typically combine visual and trajectory features through simple concatenation or weighted averaging. Although such methods provide basic cross-modal fusion, they still fall short in exploring deeper associations between the two modalities and struggle to capture the complex spatio-temporal interactions between trajectory and visual features.
Recently, diffusion models have attracted increasing attention for time series modeling, as they can capture long-term dependencies and model data diversity and stochasticity through appropriately designed neural networks in the reverse denoising process [11,12,13]. Viewport prediction in 360° videos can essentially be formulated as a time-series prediction problem, aiming to forecast future trajectories based on historical viewing behaviors and visual content. Motivated by this, we propose DiffVP, a diffusion-based method for viewport prediction in 360° videos. First, we innovatively introduce diffusion models into this task, modeling future viewing trajectories as probability distributions under the guidance of historical trajectories and saliency conditions, and generate diverse yet plausible predictions during the reverse denoising process. Considering that standard Denoising Diffusion Probabilistic Models (DDPMs) [14] incur high inference costs due to iterative sampling, we adopt Denoising Diffusion Implicit Models (DDIMs) [15] to significantly reduce the number of sampling steps while maintaining generation quality. Second, we propose an Explicit Coordinate-Temporal Encoding (ECTE), which separately models temporal and coordinate features. Two transformer-based subnetworks are employed to capture long-term dependencies in temporal sequences and spatial relationships among coordinates, leading to more effective modeling of viewing behaviors. Finally, to address the deficient cross-modal fusion of saliency and trajectory features, we design a Coordinate-Aware Saliency Feature Fusion (CASF) module. This module achieves cross-modal alignment while performing feature interaction and fusion across temporal and channel dimensions, thereby enhancing the guidance of visual content for prediction.
Overall, our main contributions are threefold:
  • DiffVP introduces diffusion models into the task of viewport prediction in 360° videos, modeling future trajectories as probability distributions conditioned on historical trajectories and saliency information.
  • The ECTE separately models temporal and spatial features to capture both temporal dependencies of viewing trajectories and spatial relationships among coordinates.
  • The CASF module is designed to achieve cross-modal alignment between saliency features and trajectory features, while performing feature interaction across temporal and channel dimensions to enhance the guidance of visual content in viewport prediction.

2. Related Work

Viewport prediction for 360° video streaming is a critical technology for optimizing bandwidth usage and enhancing users' Quality of Experience (QoE) in immersive media applications [16]. By accurately forecasting the future viewport, adaptive streaming systems can proactively prioritize high-quality video tiles within the predicted region. Current viewport prediction methods can be categorized into two types: trajectory-based and content-based. Additionally, diffusion models have emerged as a powerful generative paradigm for time series forecasting. By learning to reverse a gradual noising process, they excel at modeling complex data distributions and generating high-quality, diverse samples. This capability offers a promising perspective for viewport prediction. This section first reviews conventional viewport prediction methods, and then discusses the potential of diffusion models as a generative alternative for viewport prediction.

2.1. Viewport Prediction in 360° Videos

2.1.1. Trajectory-Based Viewport Prediction Methods

Trajectory-based methods rely on historical viewing behavior to infer future viewports. Early methods, such as Xu et al. [17], employed a reinforcement learning framework to fuse head motion paths from multiple users to predict future viewports, while Petrangeli et al. [18] utilized extrapolation functions to estimate long-term trajectories based on historical pitch and yaw data. With the advancement of deep learning, researchers have begun to incorporate sequence modeling techniques. For example, Chao et al. [8] proposed an efficient long-term prediction model based on the Transformer, achieving prediction using only short-term historical trajectories.
Beyond individual sequence modeling, further improvements have been realized through cross-user modeling, which leverages similarities and collaborative patterns among viewers. Yaqoob et al. [19] designed a real-time prediction model that identifies users with similar preferences through a client-side system and leverages collaborative behavior to enhance long-term prediction, thereby mitigating challenges arising from user preference diversity. Subsequent methods [20,21] have also effectively improved viewport prediction performance by exploring different perspectives, such as user similarity, cross-user enhancement, and collaborative modeling of user behavior.
However, trajectory-based methods suffer from notable limitations. These methods rely excessively on historical viewing behavior while overlooking the rich semantic and contextual cues inherent in the visual content. Although they achieve high efficiency and accuracy in short-term prediction, long-term forecasting based solely on trajectory information tends to accumulate errors over time and struggles to capture the dynamic shifts of attention driven by scene content.

2.1.2. Content-Based Viewport Prediction Methods

Content-based methods emphasize the influence of visual content on viewing behavior. Researchers have found that video content significantly affects viewing trajectories. Therefore, many methods combine visual features with trajectory features to improve viewport prediction accuracy. For instance, Rondon et al. [22] designed separate LSTM branches to process visual and trajectory features, followed by additive fusion. While some methods [7,9,23,24] adopted combined Convolutional Neural Networks (CNN) and LSTM architectures for joint modeling of both types of information, others [25,26] leveraged a CNN+Transformer fusion structure to better capture the complex relationships between visual and trajectory features through attention mechanisms.
Beyond low-level visual features, another line of content-based research leverages object motion to enhance prediction. These methods analyze the motion characteristics of foreground objects in videos to optimize performance. For instance, Tang et al. [27] proposed a trajectory selector that identifies key objects whose motion trajectories are closest to the current viewport center. These selected object trajectories are then fused with the historical viewing trajectories, and the combined features are fed into an LSTM network to predict the future viewport. Subsequently, Chopra et al. [28] further demonstrated that head movements largely depend on the trajectories of prominent objects in the videos. They convert ERP images into cube-map representations, employ YOLOv3 [29] to detect objects and obtain bounding box coordinates, and perform cross-frame object tracking based on spherical distance matching to extract motion trajectories. The viewport prediction is then achieved by combining the current viewport center coordinates with the predicted object coordinates in the next frame. However, such methods typically require object detection and tracking for each video frame, which often involves complex computational steps.
For 360° videos, special consideration must be given to the geometric distortions introduced by spherical projections during visual feature extraction. To address this, content-based methods have incorporated dedicated visual feature extraction techniques. For example, some methods [30,31,32] employ spherical convolution to process visual content, Li et al. [7] utilize cube projection combined with CNN to extract more stable image representations, and Gao et al. [33] extract visual features based on their proposed distortion modeling method.
Nevertheless, these methods often overlook the spatial information inherent in trajectory coordinates during trajectory feature encoding. Mapping trajectory coordinates directly into a high-dimensional space can easily disrupt their original structural information. In multimodal fusion, the modal differences between visual and trajectory features are rarely considered, with simple concatenation or weighted averaging being commonly used. Such methods struggle to capture the deep spatio-temporal interactions between the two modalities.

2.2. Diffusion Model

Diffusion models primarily consist of two processes: the forward process and the reverse process. The forward process generates latent variables by progressively adding Gaussian noise to the original data:
$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),$
where $\beta_t \in (0, 1)$ denotes the noise variance at step $t$, and $\mathbf{I}$ is the identity matrix. The reverse process reconstructs the original data through a learned neural network:
$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1}; \mu_\theta(x_t, t),\ \sigma_\theta(x_t, t)\, \mathbf{I}\right),$
where $\mu_\theta$ and $\sigma_\theta$ are the mean and variance functions learned by the model.
In recent years, the potential of diffusion models in time series modeling has also attracted increasing attention [34]. By designing appropriate neural networks for the reverse denoising process, such models can effectively capture long-term dependencies while modeling the diversity and stochasticity of data. Rasul et al. [35] first applied diffusion models to time series forecasting tasks, using recurrent neural networks to model temporal dependencies. Subsequently, Tashiro et al. [11] proposed a Transformer-based conditional diffusion model to more effectively capture long-term temporal dependencies. Subsequent studies [13,36,37] have further explored the application of diffusion models to various time series tasks, further validating their potential in time series forecasting.
The task of 360° video viewport prediction is inherently a time series forecasting problem, which involves predicting future trajectories based on historical viewing trajectories. Existing research has applied diffusion models to scanpath prediction in 360° images, achieving promising results [38]. Methods based on diffusion models are capable of generating diverse future trajectories that conform to distributional characteristics, thereby avoiding the issue of over-averaged predictions and better capturing the randomness inherent in viewing behavior. Building on this, this paper introduces diffusion models to the task of 360° video viewport prediction. An explicit coordinate-temporal encoder is designed to model the temporal dependencies of trajectories and the spatial relationships between coordinates. Additionally, a coordinate-aware saliency fusion module is proposed to achieve deep cross-modal alignment and fusion of visual and trajectory features, thereby effectively improving prediction accuracy.

3. Method

The core objective of DiffVP is to train a denoising network that predicts the latent noise of future viewing trajectories, guided by historical viewing trajectories and the saliency maps of upcoming video frames, as illustrated in Figure 2. Next, the details of each component will be explained in detail, and the loss function used for training the network will be presented.

3.1. Construction of Conditional Information

3.1.1. Trajectory Feature Extraction

The trajectory sequence is represented as $X_0 = \{x_{i-m}, \ldots, x_i, \ldots, x_{i+h}\} \in \mathbb{R}^{1 \times L \times 3}$, where $m$ denotes the length of the historical segment, $h$ is the length of the prediction segment, and $L = m + h$ is the total trajectory length. Each trajectory point is expressed in 3D spherical coordinates $(x, y, z)$, corresponding to the viewport center. Training samples are generated through the forward noising process, which is defined by the following Markov chain:
$q(X_{1:T} \mid X_0) = \prod_{t=1}^{T} q(X_t \mid X_{t-1}),$
$q(X_t \mid X_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, X_{t-1},\ \beta_t \mathbf{I}\right),$
$X_t = \sqrt{\alpha_t}\, X_0 + \sqrt{1-\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$
where $T$ denotes the total number of diffusion steps, $\beta_t$ is obtained via a quadratic noise scheduler, and $\alpha_t = \prod_{i=1}^{t} (1 - \beta_i)$. As illustrated in Figure 2a, after the forward noising process, the sequence is reorganized using a binary mask $M \in \{0, 1\}$, where the historical part is set to 1 and the future part to 0. The historical trajectory ground truth is preserved as the condition $X_0^m = X_0 \cdot M$, while the future trajectory is represented by the noised values $X_t^h = X_t \cdot (1 - M)$. The final input is constructed as $\tilde{X}_t = X_0^m + X_t^h \in \mathbb{R}^{1 \times L \times 3}$. The entire training process does not require additional labels and is essentially self-supervised.
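The masking and noising procedure can be illustrated with the following PyTorch-style sketch. Tensor shapes, variable names, and the scheduler interface are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of the forward noising + masking step: history points keep
# their ground truth, future points are replaced by their noised values.
import torch

def build_noised_input(x0, m, alphas_cumprod, t):
    """x0: (B, L, 3) trajectory; m: history length; t: (B,) diffusion steps."""
    eps = torch.randn_like(x0)                                # epsilon ~ N(0, I)
    a_t = alphas_cumprod[t].view(-1, 1, 1)                    # alpha_t per sample
    x_t = torch.sqrt(a_t) * x0 + torch.sqrt(1.0 - a_t) * eps  # closed-form noising
    mask = torch.zeros_like(x0)
    mask[:, :m, :] = 1.0                                      # 1 for history, 0 for future
    x_tilde = x0 * mask + x_t * (1.0 - mask)                  # X_0^m + X_t^h
    return x_tilde, eps, mask
```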

3.1.2. Saliency Feature Processing

Saliency features are another key factor in viewport prediction, as they effectively reflect the regions in video frames that attract attention. DiffVP adopts PAVER [39] to extract saliency maps of 360° video frames. This method is based on a ViT architecture tailored for projection distortions in 360° videos and incorporates deformable convolutions to adapt to the geometric distortions caused by spherical projection, thereby enabling more accurate characterization of salient regions in 360° scenes.
First, the saliency map sequence $S_L = \{s_i, s_{i+1}, \ldots, s_{i+L}\} \in \mathbb{R}^{L \times H \times W}$ is fed into a Convolutional Gated Recurrent Unit (ConvGRU) to extract spatio-temporal features, resulting in $S_L^{ts} \in \mathbb{R}^{L \times H \times W \times C}$, where $H = 96$, $W = 192$, and $C = 64$. Then, each pixel of $S_L$ is mapped onto the unit sphere, resulting in the corresponding 3D coordinate representation $S_L^{coord} \in \mathbb{R}^{L \times H \times W \times 3}$. The computation is defined as follows:
$\mathrm{yaw}[i, j] = \frac{j + 0.5}{W} \cdot 2\pi, \quad \mathrm{pitch}[i, j] = \frac{i + 0.5}{H} \cdot \pi,$
$x = \sin(\mathrm{pitch}) \cdot \cos(\mathrm{yaw}), \quad y = \sin(\mathrm{pitch}) \cdot \sin(\mathrm{yaw}), \quad z = \cos(\mathrm{pitch}),$
where $i \in [0, H-1]$ and $j \in [0, W-1]$. Each pixel is converted to a 3D coordinate $(x, y, z)$ using the normalized yaw and pitch, and the sequence $S_L$ is processed frame by frame to form $S_L^{coord} \in \mathbb{R}^{L \times H \times W \times 3}$. Since the historical trajectory is already used as a condition, the mask mechanism retains only the saliency features of future frames, which serve as visual guidance to supplement trajectory prediction.
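A minimal sketch of the pixel-to-sphere mapping defined above is given below; the grid resolution and angle conventions follow the equations, while the function name is illustrative.

```python
import math
import torch

def pixel_grid_to_sphere(H: int = 96, W: int = 192) -> torch.Tensor:
    """Map each (i, j) pixel of an H x W equirectangular grid to a unit-sphere point."""
    i = torch.arange(H, dtype=torch.float32).view(H, 1)
    j = torch.arange(W, dtype=torch.float32).view(1, W)
    yaw = (j + 0.5) / W * 2 * math.pi        # longitude in [0, 2*pi)
    pitch = (i + 0.5) / H * math.pi          # colatitude in [0, pi)
    x = torch.sin(pitch) * torch.cos(yaw)    # broadcasts to (H, W)
    y = torch.sin(pitch) * torch.sin(yaw)
    z = torch.cos(pitch).expand(H, W)
    return torch.stack([x, y, z], dim=-1)    # (H, W, 3) coordinate map
```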

3.2. Explicit Coordinate-Temporal Encoding

Unlike conventional DDIMs, which typically adopt a U-Net backbone, DiffVP uses ECTE as the core of its denoising network. To capture both temporal dependencies of trajectories and the spatial relationships among coordinates, ECTE employs a serial two-stage Transformer structure, as shown in Figure 2b. Despite employing a serial two-stage architecture, ECTE avoids redundant stacking of multi-layer Transformers. Instead, it adopts a lightweight design characterized by a single layer and low dimensionality. Both the temporal Transformer and the coordinate Transformer subnetworks are configured with only one layer (Layers = 1), and the feature channel dimension is fixed at 64. The core objective of this design is to minimize the model's parameter count and computational overhead while ensuring the accuracy of spatio-temporal feature modeling.
First, the input sequence $\tilde{X}_t$ is processed by a one-dimensional convolution followed by ReLU, producing preliminary features $\tilde{X}_t \in \mathbb{R}^{L \times 3 \times C}$ with channel dimension $C = 64$. Then, learnable temporal embeddings $E_{time} \in \mathbb{R}^{1 \times L \times 1}$ and diffusion step embeddings $E_{diff} \in \mathbb{R}^{1 \times 1 \times C}$ are added to $\tilde{X}_t$ to form $\tilde{X}_t^{in} = (\tilde{X}_t + E_{time} + E_{diff}) \in \mathbb{R}^{3 \times L \times C}$. This representation is then fed into the temporal Transformer to capture temporal dependencies, producing temporally enhanced features $\tilde{X}_t^{te} \in \mathbb{R}^{3 \times L \times C}$. Finally, a learnable coordinate encoding $E_{feature} \in \mathbb{R}^{3 \times 1 \times C}$ is added to $\tilde{X}_t^{te}$, and the result is input to the coordinate Transformer to model spatial relationships among coordinates, producing the output $\tilde{X}_t^{out} \in \mathbb{R}^{3 \times L \times C}$.
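The two-stage encoding can be sketched as follows. Module choices (convolution kernel, head count) and the exact embedding layout are assumptions; only the single-layer temporal-then-coordinate structure follows the text.

```python
import torch
import torch.nn as nn

class ECTE(nn.Module):
    """Minimal sketch of Explicit Coordinate-Temporal Encoding (assumed layer choices)."""
    def __init__(self, L: int = 40, C: int = 64, heads: int = 4):
        super().__init__()
        self.C = C
        self.lift = nn.Sequential(nn.Conv1d(1, C, kernel_size=3, padding=1), nn.ReLU())
        self.e_time = nn.Parameter(torch.zeros(1, L, C))    # learnable temporal embedding
        self.e_coord = nn.Parameter(torch.zeros(1, 3, C))   # learnable coordinate encoding
        layer = lambda: nn.TransformerEncoderLayer(d_model=C, nhead=heads, batch_first=True)
        self.time_tf = nn.TransformerEncoder(layer(), num_layers=1)   # Layers = 1
        self.coord_tf = nn.TransformerEncoder(layer(), num_layers=1)  # Layers = 1

    def forward(self, x_tilde: torch.Tensor, e_diff: torch.Tensor) -> torch.Tensor:
        # x_tilde: (B, L, 3) masked noised trajectory; e_diff: (B, 1, C) step embedding
        B, L, _ = x_tilde.shape
        h = x_tilde.permute(0, 2, 1).reshape(B * 3, 1, L)   # one 1-D signal per axis
        h = self.lift(h).permute(0, 2, 1)                   # (B*3, L, C)
        h = h + self.e_time + e_diff.repeat_interleave(3, dim=0)
        h = self.time_tf(h)                                 # attention across the L time steps
        h = h.reshape(B, 3, L, self.C).permute(0, 2, 1, 3).reshape(B * L, 3, self.C)
        h = self.coord_tf(h + self.e_coord)                 # attention across the 3 axes
        return h.reshape(B, L, 3, self.C)                   # coordinate- and time-aware features
```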

3.3. Coordinate-Aware Saliency Feature Fusion

Since saliency maps belong to the 2D image modality while the trajectory is represented as a 3D coordinate sequence, a modality gap exists between the two. To address this, a spatially weighted fusion is adopted to aggregate 2D spatial features into a compact representation with 3D coordinates. Specifically, each coordinate component $(x, y, z)$ in $S_L^{coord}$ is weighted-averaged over the spatial dimensions of $S_L^{ts}$, with normalization to ensure numerical stability. The formulation is as follows:
$S_L^{fuse} = \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} S_L^{ts}(i, j) \cdot S_L^{coord}(i, j) \cdot w_{ij}}{\sum_{i=1}^{H} \sum_{j=1}^{W} w_{ij} + \varepsilon} \in \mathbb{R}^{L \times C \times 3},$
where $w_{ij}$ denotes the spatial weight, which is uniform by default, and $\varepsilon$ is a small constant to prevent division by zero. The fused feature tensor $S_L^{fuse}$ preserves the spatio-temporal information from the saliency maps while embedding spherical coordinate information.
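A minimal sketch of this weighted reduction is given below; tensor layouts and the uniform-weight default are taken from the text, while the function signature is illustrative.

```python
import torch

def fuse_saliency_coords(s_ts, s_coord, w=None, eps: float = 1e-6):
    # s_ts: (L, H, W, C) saliency features; s_coord: (L, H, W, 3) sphere coordinates
    L, H, W, C = s_ts.shape
    if w is None:
        w = torch.ones(H, W)                               # uniform spatial weights by default
    w = w.view(1, H, W, 1, 1)
    prod = s_ts.unsqueeze(-1) * s_coord.unsqueeze(-2) * w  # (L, H, W, C, 3)
    fused = prod.sum(dim=(1, 2)) / (w.sum() + eps)         # weighted average over the H x W grid
    return fused                                           # (L, C, 3), i.e. S_L^fuse
```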
DiffVP does not simply concatenate or average saliency and trajectory features; instead, it models their dependencies separately along the temporal and channel dimensions, as shown in Figure 2c. Specifically, the trajectory features $\tilde{X}_t^{out}$ are reshaped to $\mathbb{R}^{L \times C \times 3}$, and a unified Multi-Head Attention (MHA) mechanism is employed to capture the dependencies between the two modalities. The unified MHA computation adopted in this module is defined as follows:
$\mathrm{MHA}(X, Y) = \mathrm{Concat}\!\left(\mathrm{head}_1(X, Y), \mathrm{head}_2(X, Y), \ldots, \mathrm{head}_h(X, Y)\right) \cdot W.$
The computation for a single attention head $j$ is:
$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{(X \cdot W_j^Q)(Y \cdot W_j^K)^{T}}{\sqrt{D}}\right) \cdot (Y \cdot W_j^V),$
where $h$ denotes the number of attention heads, and $W$ represents the final linear projection weight for the multi-head output. The trainable parameters $W_j^Q$, $W_j^K$, and $W_j^V$ correspond to the query, key, and value projection weights of the $j$-th attention head, respectively, mapping the input features $X$ and $Y$ into a unified attention space. The scaling factor $\sqrt{D}$ is used to mitigate the vanishing gradient issue in the softmax caused by large attention scores.
For the temporal dimension, the feature channels and spatial components are flattened so that features are unfolded along time into $\mathbb{R}^{L \times 3C}$. For the channel dimension, the time and spatial dimensions are flattened, unfolding features along the channel dimension into $\mathbb{R}^{C \times 3L}$. The specific fusion process is defined as follows:
$X_t^{sf} = \mathrm{MHA}(S_L^{fuse}, \tilde{X}_t^{out}) \cdot S_L^{fuse},$
$X_t^{xf} = \mathrm{MHA}(\tilde{X}_t^{out}, S_L^{fuse}) \cdot \tilde{X}_t^{out},$
$X_t^{fuse} = \mathrm{Norm}(X_t^{sf} + X_t^{xf}),$
where $X_t^{sf}$ and $X_t^{xf}$ are the attention outputs computed along the temporal and channel dimensions, respectively. After dimension alignment, the outputs are added and normalized to obtain the fused sequence representation $X_t^{fuse} \in \mathbb{R}^{C \times 3L}$. Following a design similar to DiffWave [40], $X_t^{fuse}$ is further processed by a series of residual layers and feed-forward convolution operations to model sequential dependencies, yielding the predicted noise $X_\epsilon^{out} \in \mathbb{R}^{1 \times L \times 3}$. After training, this process produces a parameterized conditional denoising function $\hat{\epsilon}_t = \epsilon_\theta(X_t, t \mid X_0^m, S_L^{fuse})$.
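The dual-dimension fusion can be sketched as follows. The reshaping follows the stated $\mathbb{R}^{L \times 3C}$ and $\mathbb{R}^{C \times 3L}$ layouts; the elementwise product used for the final multiplication, the head count, and the normalization choice are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CASF(nn.Module):
    """Minimal sketch of Coordinate-Aware Saliency Feature Fusion (assumed details)."""
    def __init__(self, L: int = 40, C: int = 64, heads: int = 4):
        super().__init__()
        self.attn_time = nn.MultiheadAttention(3 * C, heads, batch_first=True)  # over L tokens
        self.attn_chan = nn.MultiheadAttention(3 * L, heads, batch_first=True)  # over C tokens
        self.norm = nn.LayerNorm(3 * L)

    def forward(self, s_fuse: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
        # s_fuse, x_out: (B, L, C, 3) saliency-coordinate and trajectory features
        B, L, C, _ = s_fuse.shape
        s_t, x_t = s_fuse.reshape(B, L, 3 * C), x_out.reshape(B, L, 3 * C)       # time-major
        s_c = s_fuse.permute(0, 2, 1, 3).reshape(B, C, 3 * L)                    # channel-major
        x_c = x_out.permute(0, 2, 1, 3).reshape(B, C, 3 * L)
        a_sf, _ = self.attn_time(query=s_t, key=x_t, value=x_t)                  # MHA(S, X)
        a_xf, _ = self.attn_chan(query=x_c, key=s_c, value=s_c)                  # MHA(X, S)
        x_sf = (a_sf * s_t).reshape(B, L, C, 3).permute(0, 2, 1, 3).reshape(B, C, 3 * L)
        x_xf = a_xf * x_c                                                        # gate-style product
        return self.norm(x_sf + x_xf)                                            # (B, C, 3L)
```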

3.4. Trajectory Generation

DDIMs can generate high-quality results using significantly fewer sampling steps. Specifically, from a total of $T = 500$ diffusion steps, $S$ steps ($S \ll T$) are uniformly selected for sampling, with $S = 50$. At each sampling step, the model makes a prediction based on the current noisy sample and the conditional information. The computation is as follows:
$p_\theta(X_{1:T} \mid X_0^m, S_L^{fuse}) = p(X_T) \prod_{t=1}^{T} p_\theta(X_{t-1} \mid X_t, X_0^m, S_L^{fuse}), \quad X_T \sim \mathcal{N}(0, \mathbf{I}),$
$p_\theta(X_{t-1} \mid X_t, X_0^m, S_L^{fuse}) = \mathcal{N}\!\left(X_{t-1}; \mu_\theta(X_t, t \mid X_0^m, S_L^{fuse}),\ \sigma_\theta(X_t, t \mid X_0^m, S_L^{fuse})\, \mathbf{I}\right),$
$\mu_\theta(X_t, t \mid X_0^m, S_L^{fuse}) = \sqrt{\alpha_{t-1}} \cdot \frac{X_t - \sqrt{1-\alpha_t}\, \hat{\epsilon}_t}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}} \cdot \hat{\epsilon}_t.$
The conditional mean $\mu_\theta$ of the reverse process is derived from the forward noise coefficients $\alpha_t$. In DDIMs, to simplify computation and ensure stability, the conditional variance $\sigma_\theta$ is typically fixed to 0 for deterministic sampling, further reducing the computational overhead caused by stochastic noise. Only $S = 50$ steps need to be traversed (1/10 of the total $T = 500$ steps), significantly decreasing the number of iterations during inference.
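The deterministic sampling loop can be sketched as follows; the model interface (`model(x, t, cond)`), the scheduler representation, and the step-selection helper are assumptions, while the update rule matches the equation above with $\sigma_\theta = 0$.

```python
import torch

@torch.no_grad()
def ddim_sample(model, x_T, cond, alphas_cumprod, T: int = 500, S: int = 50):
    """Deterministic DDIM sampling over S uniformly spaced steps out of T."""
    steps = torch.linspace(T - 1, 0, S).long()          # S selected diffusion steps
    x = x_T                                             # start from Gaussian noise
    for idx, t in enumerate(steps):
        t_prev = steps[idx + 1] if idx + 1 < S else torch.tensor(-1)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
        eps = model(x, t, cond)                         # predicted noise eps_theta
        x0_pred = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
        x = torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps  # sigma = 0
    return x
```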

3.5. Loss Function

DiffVP employs a composite loss function, which consists of two components:
$\mathcal{L} = \mathcal{L}_{MSE} + \lambda \mathcal{L}_{softDTW},$
where the weight coefficient is set to $\lambda = 0.1$ based on experimental tuning. The first part is a weighted Mean Squared Error (MSE) loss that measures the accuracy of noise prediction, identical to the DDIM loss:
$\mathcal{L}_{MSE} = \mathbb{E}_{X_0, \epsilon, t}\!\left[\left\| \epsilon - \hat{\epsilon}_t \right\|_2^2\right],$
where $\epsilon$ denotes standard Gaussian noise and the conditional denoising function $\hat{\epsilon}_t$ estimates the noise vector.
The second part introduces Dynamic Time Warping (DTW) to measure the similarity between two sequences. Following the approach of ScanTD [38], the differentiable SoftDTW loss [41] is adopted and combined with spherical distance for calculation, as formulated below:
$\delta_{sph}(\epsilon^i, \hat{\epsilon}_t^j) = 2 \arcsin\!\left(\frac{1}{2}\sqrt{(\epsilon_x^i - \hat{\epsilon}_{t,x}^j)^2 + (\epsilon_y^i - \hat{\epsilon}_{t,y}^j)^2 + (\epsilon_z^i - \hat{\epsilon}_{t,z}^j)^2}\right),$
$\mathrm{DTW}_{sph}(\epsilon, \hat{\epsilon}_t) = \sum_{i=1}^{h} \sum_{j=1}^{h} w(i, j)\, \delta_{sph}(\epsilon^i, \hat{\epsilon}_t^j),$
$\mathcal{L}_{softDTW} = \mathrm{DTW}_{sph}^{\gamma}(\epsilon, \hat{\epsilon}_t),$
where $\delta_{sph}(\epsilon^i, \hat{\epsilon}_t^j)$ computes the spherical distance between two 3D coordinate points, $h$ denotes the prediction sequence length, $w(i, j)$ is the alignment weight, and $\gamma$ is the smoothing parameter of SoftDTW that ensures differentiability.
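A rough sketch of the composite loss is shown below. The spherical (chord-angle) cost matrix follows the equations; `soft_dtw` stands in for a differentiable SoftDTW implementation in the spirit of Cuturi and Blondel [41] and is not defined here, and the function names are illustrative.

```python
import torch

def spherical_dist_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a, b: (h, 3) points; returns (h, h) chord-angle distances, delta_sph
    chord = torch.cdist(a, b)                       # Euclidean chord length between points
    return 2.0 * torch.arcsin(torch.clamp(0.5 * chord, max=1.0))

def diffvp_loss(eps, eps_hat, soft_dtw, lam: float = 0.1, gamma: float = 0.1):
    """Composite loss: noise-prediction MSE plus lambda-weighted spherical SoftDTW."""
    mse = ((eps - eps_hat) ** 2).mean()             # L_MSE
    dist = spherical_dist_matrix(eps, eps_hat)      # (h, h) alignment cost matrix
    return mse + lam * soft_dtw(dist, gamma=gamma)  # L = L_MSE + lambda * L_softDTW
```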

4. Experiments

4.1. Datasets and Evaluation Metrics

To comprehensively validate the effectiveness and generalization capability of the DiffVP model in 360° video viewport trajectory prediction, this paper utilizes three widely adopted public datasets, David_MMSys [42], Wu_MMSys [43], and Xu_PAMI [17], for experimental verification, while adhering to unified preprocessing standards to normalize the data format.
(1)
David_MMSys [42] comprises nineteen 360° video clips, each with a duration of 20 s. The dataset records both head-tracking and eye-tracking data from 57 participants during free-viewing sessions, and focuses on free-viewing behavior with relatively homogeneous video durations.
(2)
Wu_MMSys [43] includes eighteen 360° videos covering five different scene categories (e.g., natural landscapes, urban architecture, sports events, etc.), and collects head-tracking data from 48 participants during video viewing.
(3)
Xu_PAMI [17] is the largest dataset in scale, comprising seventy-six 360° video clips of variable duration (ranging from 10 to 80 s, with an average length of 25 s). The video content covers a wide variety of scenes, and the dataset includes both head movement data and eye movement data collected from 58 participants.
As shown in Figure 3, subfigures (a), (b), and (c), respectively, present the Cumulative Distribution Functions (CDFs) of users’ head motion amplitudes under different temporal windows for the three public datasets. It can be observed that under short temporal windows (e.g., 0.2 s, 0.5 s and 1 s) the CDF curves of all three datasets exhibit a steep rise, indicating that head movements are dominated by small-amplitude adjustments at short time scales and that viewing viewpoint changes demonstrate strong short-term inertia.
As the temporal window increases, differences in the distribution of head motion amplitudes across datasets gradually emerge. For the David_MMSys dataset, the CDF curves shift noticeably to the right under longer temporal windows, particularly at 15 s, where the cumulative proportion of large-amplitude head rotations increases significantly. It demonstrates that wide-range viewpoint exploration becomes more prevalent during prolonged viewing. Meanwhile, the CDF curves saturate relatively quickly in the high-angle region, indicating that although head motion amplitudes are large, they remain statistically concentrated within a similar range. In contrast, the Wu_MMSys and Xu_PAMI datasets exhibit more gradual CDF growth under long temporal windows, without a pronounced rapid accumulation in the high-angle region. This implies that head motion amplitudes are not concentrated around a specific angular threshold but are instead distributed over a broader range, reflecting more progressive and diverse long-term viewpoint exploration behaviors. Overall, while the three datasets share similar head motion characteristics at short time scales, they exhibit distinctly different motion amplitude distributions during long-term viewing. These distinct long-term motion patterns indicate that David_MMSys exhibits higher predictability due to its concentrated distribution, while the broader and more uncertain distributions of Wu_MMSys and Xu_PAMI make them inherently more challenging.
Consistent with comparative methods, the evaluation employs two metrics: Orthogonal Distance (OD) and Intersection over Union (IoU). OD is a point-wise distance measure adapted to unit-sphere coordinates, used to quantify the spatial proximity between predicted and ground-truth trajectory points on the spherical surface. Its core advantage lies in its ability to accurately reflect the shortest arc-length distance between two points in spherical space, making it more aligned with the geometric properties of 360° videos compared with Euclidean distance. IoU is a region-based evaluation metric. By dividing each 360° video frame into tiles of a fixed size, IoU measures the overlap between the predicted viewport region and the ground-truth viewport region.
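A minimal sketch of the OD metric, interpreted as the great-circle arc between predicted and ground-truth viewport centers on the unit sphere, is given below; the exact definition used in the evaluation follows the referenced framework, and the function name is illustrative.

```python
import torch

def orthogonal_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred, gt: (..., 3) unit vectors of viewport centers; returns (...,) arc lengths in radians
    cos_angle = (pred * gt).sum(dim=-1).clamp(-1.0, 1.0)  # dot product of unit vectors
    return torch.arccos(cos_angle)                         # shortest arc on the unit sphere
```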

4.2. Implementation Details

We trained the DiffVP model using the Adam optimizer on two RTX 4090 GPUs, with an initial learning rate of $10^{-5}$ and a batch size of 8, for 400 epochs. In each training iteration, we feed the trajectory sequence and the saliency maps into the model, with each saliency map having a size of $H \times W = 96 \times 192$.
To align with the evaluation settings of state-of-the-art methods, this paper adopts the evaluation framework [44] established in prior work, which standardizes the sampling criteria for trajectory points. Specifically, sampling is performed every 0.2 s, and trajectory points are uniformly mapped to three-dimensional coordinates $(x, y, z)$ on a unit sphere. The historical viewing trajectory spans 3 s, while the future prediction trajectory covers 5 s. Consequently, the input trajectory sequence length for the model is $L = 40$, and the predicted trajectory sequence length is $h = 25$. During preprocessing, continuous viewing trajectories are cropped using sliding windows of the specified lengths to generate batches of training, validation, and test samples.
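The sliding-window preprocessing can be sketched as follows; the 0.2 s sampling gives 15 history points (3 s) and 25 future points (5 s), i.e. $L = 40$, while the stride and array layout are assumptions.

```python
import numpy as np

def make_windows(traj: np.ndarray, hist: int = 15, fut: int = 25, stride: int = 1):
    """traj: (N, 3) unit-sphere trajectory sampled at 0.2 s; returns (num_windows, hist + fut, 3)."""
    L = hist + fut                                   # total window length, L = 40
    windows = [traj[s:s + L] for s in range(0, len(traj) - L + 1, stride)]
    return np.stack(windows) if windows else np.empty((0, L, 3))
```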

4.3. Comparison with State-of-the-Art Methods

To validate the superiority of the DiffVP model in 360° video viewport prediction tasks, this paper selects representative viewport prediction methods as comparative baselines. Quantitative comparisons are conducted based on the OD and IoU metrics across three datasets, while qualitative comparisons focus on adaptability to different scenarios and the ability to model diverse viewing behavior patterns. To address the lack of publicly available comparative results for the Xu_PAMI [17] dataset, experimental results of mainstream methods were reproduced to supplement the comparative validation.
(1)
TRACK [22]: This model utilizes three separate LSTM modules to handle trajectory features, visual features, and their concatenated features, aiming to dynamically balance the contributions of trajectory and visual information across different prediction horizons.
(2)
VPT360 [8]: A Transformer-based model that solely employs a Transformer encoder to process trajectory information for temporal prediction.
(3)
MFTR [25]: A complex multi-modal fusion Transformer model that adopts three Transformer-encoder-based modules to process trajectory features, visual features, and their concatenated features, respectively.
(4)
STAR-VP [33]: A Transformer-based model that converts saliency information into a compact pixel-wise representation aligned with trajectory features. It employs a gating mechanism to achieve dynamic fusion, emphasizing trajectory features for short-term predictions while reinforcing visual information for long-term predictions.

4.3.1. Quantitative Evaluation

Table 1 presents the quantitative experimental results of DiffVP on three public datasets. Compared with the state-of-the-art method STAR-VP [33], DiffVP achieves consistent improvements across all datasets. On the David_MMSys [42] dataset, it reduces OD by 0.5% and improves IoU by 3.9%. On the Wu_MMSys [43] dataset, which contains videos from five different scene categories, the improvement is more pronounced: OD decreases by 5.3% and IoU increases by 5.1%. This is because DiffVP, through the explicit spatio-temporal modeling of the ECTE module and the cross-modal fusion of the CASF module, is better able to adapt to the diversity of viewing behaviors across different scenes. In contrast, the feature fusion strategy of STAR-VP [33] is relatively simplistic, resulting in weaker adaptability to scene variations. Furthermore, on the larger and more diverse Xu_PAMI [17] dataset, DiffVP also demonstrates superior performance, lowering OD by 5.4% and raising IoU by 4.6%, underscoring its strong generalization capability.
The performance gains of the proposed diffusion-based model vary across datasets, which can be attributed to differences in video characteristics and viewing behavior distributions. On David_MMSys, the improvement in OD is relatively limited (0.5%). This dataset mainly consists of short-duration videos with relatively fixed content structure, which strongly constrains viewpoint exploration within a limited temporal span. As a result, although large-amplitude head movements frequently occur over longer temporal windows, the resulting viewing trajectories tend toward a similar panoramic scan within the video duration, leading to a relatively concentrated future trajectory distribution with low uncertainty. Under such conditions, deterministic or regression-based models can already provide competitive predictions, leaving limited space for diffusion-based distribution modeling. In contrast, Wu_MMSys and Xu_PAMI demonstrate substantially larger performance gains (5.3% and 5.4%, respectively). Wu_MMSys contains multiple scene categories that induce content-dependent viewing behaviors, while Xu_PAMI further introduces a larger number of videos with varying durations and more complex scene compositions. These factors lead to greater variability across time and content, resulting in more dispersed and diverse future viewport trajectory distributions. Since diffusion models explicitly learn the full conditional probability distribution of future viewing trajectories, they are better suited to capturing such uncertainty and diversity. Consequently, their advantages become more pronounced on datasets with richer content diversity and more complex behavioral distributions.
To further analyze the performance of different methods across varying prediction horizons, Figure 4 illustrates the evolution of the OD and IoU metrics with respect to prediction time steps on three datasets. It can be clearly observed from the figure that DiffVP consistently achieves the best performance across the entire prediction range. Both in short-term prediction (<1 s) and medium-to-long-term prediction (2–5 s), its OD values remain lower and its IoU values remain higher than those of all competing methods.

4.3.2. Qualitative Evaluation

Figure 5 presents a qualitative comparison between DiffVP and two representative baseline methods: the Transformer-based STAR-VP [33] and the LSTM-based TRACK [22] across different viewing scenarios. In the figure, the red curves represent the ground-truth viewing trajectories, while the blue curves denote the predicted trajectories. The visualization results indicate that TRACK [22] tends to concentrate predictions within a confined region in multi-person scenes, failing to accurately capture the viewing trend. Moreover, in exploratory underwater scenes, its predictions deviate significantly from the actual trajectories. In comparison, STAR-VP [33] demonstrates an improved ability to follow the general motion trend of trajectories, yet its long-term predictions still exhibit noticeable offsets. In contrast, DiffVP generates predicted trajectories that closely adhere to the variations of the ground-truth paths, accurately modeling exploratory viewing behavior.
As illustrated in Figure 6, DiffVP demonstrates strong fitting capability across several representative viewport viewing behavior patterns observed in the David_MMSys dataset. The model can generate viewport trajectories that closely align with the ground truth under various motion dynamics, including rapid scanning with concise trajectories, complex exploratory movements, and hybrid behaviors that combine global browsing with local focus. These results qualitatively demonstrate the effectiveness of DiffVP in modeling diverse viewport behavior patterns within a unified probabilistic framework, highlighting the diversity generation capability of diffusion-based methods.

4.4. Computational Efficiency Analysis

To comprehensively evaluate the engineering practicality of the DiffVP, we conduct a computational complexity analysis on the same device by measuring FLOPs, the number of parameters, and per-sample inference time. We evaluate three models: DiffVP with DDPM sampling, DiffVP with DDIM sampling (the optimal configuration in this paper), and the state-of-the-art method STAR-VP [33]. The results are reported in Table 2.
The computational complexity and inference speed comparisons between STAR-VP [33] and DiffVP with different sampling strategies are shown in Table 2. DiffVP has 7.45 M parameters and 1.96 G FLOPs, values on the same order of magnitude as the 6.9 M parameters and 1.87 G FLOPs of STAR-VP [33]. This indicates that the serial dual-Transformer design does not introduce significant redundancy: while improving prediction performance, DiffVP does not substantially increase the computational burden. The average per-sample inference time for DiffVP (with DDIM) is 0.34 s, an 86.5% reduction compared with the 2.51 s of DiffVP (with DDPM). This acceleration is achieved because DDIM requires only 50 sampling steps, just one-tenth of the 500 steps required by DDPM. However, the inference speed of DiffVP remains slower than that of STAR-VP [33]. This difference stems from the core characteristics of their respective architectures: STAR-VP [33] employs a traditional encoder–decoder architecture and generates predictions through a single forward pass, whereas DiffVP is built upon a probabilistic diffusion-based generation framework that requires multi-step iterative denoising to generate trajectories, which inherently increases inference latency.

4.5. Ablation Study

To comprehensively evaluate the contribution of each core module in the DiffVP model and the effectiveness of the DDIM sampling strategy, this paper conducts a series of ablation studies across three public datasets. The baseline model, which solely employs a standard Transformer architecture, serves as the starting point for component-wise analysis. The experimental results are presented in Table 3.
The baseline model based solely on a standard Transformer achieves OD values of 1.132, 0.621, and 0.432 on the three datasets, with corresponding IoU values of 26.02%, 51.25%, and 62.18%. These results serve as a benchmark for evaluating the performance improvements contributed by subsequent components. This model only captures global temporal dependencies in trajectories, fails to adapt to trajectory uncertainty, and lacks guidance from visual saliency information, resulting in the poorest performance. After introducing the diffusion model, OD values decrease by 4.9%, 5.2%, and 4.8% across the three datasets, while IoU values increase by 10.0%, 3.7%, and 1.4%, respectively. This demonstrates that the probabilistic modeling of the diffusion model effectively captures trajectory randomness and significantly improves prediction accuracy.
After replacing the baseline Transformer with the ECTE module, the OD values further decrease by 3.2%, 7.5% and 3.2% on the three datasets, while the IoU values increase by 2.8%, 5.6% and 1.8%, respectively. By separately modeling the temporal dependencies and spatial coordinate semantics of trajectories, this module addresses the spatio-temporal feature aliasing problem inherent in traditional Transformers and fully exploits the expressive power of trajectory features. Building upon the “Diff + ECTE” model, the introduction of the CASF module leads to a further performance improvement: compared with the “Diff + ECTE” configuration, incorporating CASF results in a relative reduction of OD by 7.7%, 7.7% and 2.5% across the three datasets, while IoU shows a relative increase of 10.1%, 6.3% and 2.8%, respectively. Through its coordinate-aware dual-branch attention mechanism, CASF achieves deep alignment and interactive fusion of saliency and trajectory features, demonstrating significant synergistic effects with the ECTE module.
To directly validate the effectiveness of the CASF module, we compare it with a simple feature-fusion baseline method, Cat_Sal. Under the identical “Diff + Baseline” backbone architecture, replacing the CASF module with Cat_Sal causes a noticeable performance drop: specifically, using CASF yields an additional relative reduction in OD of 3.6%, 7.0% and 1.5% across the three datasets, and an additional relative improvement in IoU of 6.9%, 4.3% and 1.9%, compared with using Cat_Sal. This controlled experiment strongly confirms that CASF can utilize visual saliency information more effectively than a simple concatenation strategy to optimize trajectory prediction.
The final DiffVP model achieves optimal performance with DDIM sampling; only a marginal gap is observed on Wu_MMSys [43], where the results are comparable to those obtained with DDPM sampling. Nevertheless, DDIM requires only one-tenth of the sampling steps used by DDPM, substantially enhancing inference efficiency, which justifies its adoption as the definitive sampling strategy for the model.
In the DiffVP model, visual saliency features serve as one of the core inputs, and their quality directly influences the effectiveness of the CASF cross-modal fusion module, thereby determining trajectory prediction performance. To quantitatively evaluate the impact of input saliency map quality on the model’s prediction performance and identify the optimal source of saliency features, this paper selects saliency maps generated by several representative 360° video saliency detection methods as visual feature inputs and conducts comparative experiments on the David_MMSys [42] dataset. The experimental results are presented in Table 4.
Among the various saliency extraction methods, SalGAN360 [45], originally designed for 360° image tasks, produces saliency maps of insufficient quality when directly applied to video scenarios and fails to effectively enhance prediction performance; its results are even inferior to the baseline model that does not incorporate saliency. The performance of Spherical U-Net [46] and Offline-DHP [17] is also lower than that of PAVER [39]. Therefore, the high-quality saliency maps generated by PAVER [39] contribute more significantly to the model's prediction results. STAR-VP also utilizes saliency maps obtained from PAVER [39]; when these maps are integrated into our DiffVP framework, the prediction performance surpasses that of STAR-VP.

5. Conclusions and Future Work

5.1. Conclusions

This paper proposes DiffVP, a diffusion-based method for 360° video viewport prediction. Based on extensive experimental results, the following conclusions can be drawn: (1) By modeling future viewing trajectories as probability distributions, the diffusion model generates diverse and plausible predictions, thereby better capturing the dynamic variations in viewing behavior. (2) With historical trajectories and visual saliency cues as conditional inputs, DiffVP leverages the probabilistic generation mechanism of diffusion models to characterize trajectory uncertainty while providing reliable prior information for prediction. (3) The ECTE module explicitly models temporal features and spatial coordinate features, enabling more precise capture of temporal dependencies in trajectories and spatial relationships among coordinates. (4) Through coordinate-aware alignment between saliency features and trajectory features, followed by interactive integration via the CASF module, the proposed approach further enhances the guiding role of visual content in viewport prediction. Experimental results on three public datasets demonstrate that DiffVP outperforms existing methods across all evaluation metrics, achieving state-of-the-art viewport prediction performance. Overall, DiffVP not only significantly improves prediction accuracy but also offers a new research perspective and methodological framework for modeling viewing behavior in 360° videos.

5.2. Future Work

Although DiffVP has validated the effectiveness of the diffusion probabilistic framework in viewport prediction without introducing significant parameter redundancy, the inherent iterative denoising mechanism of diffusion models still limits its inference speed, making it difficult to meet the ultra-low latency requirements of real-time streaming applications. This remains a core bottleneck in transitioning the model from academic benchmarks to industrial deployment. To address this challenge and further promote the practical application of the model in real streaming systems, future research will focus on the following three directions: (1) Exploring lightweight diffusion model designs and efficient sampling strategies, incorporating compression techniques such as pruning and knowledge distillation to substantially reduce inference latency while preserving DiffVP’s ability to model probability distributions. (2) Conducting integration studies with existing Adaptive Bitrate (ABR) algorithms, coupling low-complexity viewport predictions with ABR’s rate adaptation and caching logic. (3) Carrying out small-scale A/B testing in real streaming service scenarios, selecting representative user groups and video content to quantitatively evaluate DiffVP’s actual improvement in user QoE in production environments. The test results will also guide scenario-specific fine-tuning of the model, ensuring the algorithmic design aligns more closely with industrial needs.

Author Contributions

Conceptualization, methodology, writing—original draft preparation, H.Z.; validation, writing—review and editing, L.D.; supervision, writing—review and editing, project administration, X.N.; supervision, writing—review and editing, F.D. All authors have read and agreed to the published version of the manuscript.

Funding

Major Basic Research Project of Shandong Provincial Natural Science Foundation (Grant No. ZR2024ZD03).

Data Availability Statement

The original data presented in this study are openly available at https://gitlab.com/miguelfromeror/head-motion-prediction (accessed on 20 June 2022).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, H.; Wang, F.; Zhang, W.; Zhu, Y.; Cui, L.; Liu, J.; Yu, F.R.; Zhang, L. Joint Adaptation for Mobile 360-Degree Video Streaming and Enhancement. IEEE Trans. Mob. Comput. 2025, 24, 7726–7741. [Google Scholar] [CrossRef]
  2. Delgado, C.Y.; Mayer, R.E. Implementing pretraining to optimise learning in immersive virtual reality. J. Comput. Assist. Learn. 2025, 41, e13099. [Google Scholar] [CrossRef]
  3. Yaqoob, A.; Bi, T.; Muntean, G.M. A survey on adaptive 360 video streaming: Solutions, challenges and opportunities. IEEE Commun. Surv. Tutor. 2020, 22, 2801–2838. [Google Scholar] [CrossRef]
  4. Yaqoob, A.; Muntean, G.M. Advanced predictive tile selection using dynamic tiling for prioritized 360-Degree video vr streaming. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 6. [Google Scholar]
  5. Subhan, F.E.; Yaqoob, A.; Muntean, C.H.; Muntean, G.M. EDGE360: Edge-Enabled Multi-Agent DRL for Region-Aware Rate Adaptation Solution to Enhance Quality of 360-Degree Video Streaming. IEEE Trans. Mob. Comput. 2025, 25, 1918–1935. [Google Scholar] [CrossRef]
  6. Liu, Y.; Wang, D.; Song, B. Viewport Prediction with Unsupervised Multiscale Causal Representation Learning for Virtual Reality Video Streaming. IEEE Trans. Multimed. 2025, 27, 4752–4764. [Google Scholar] [CrossRef]
  7. Li, X.; Wang, S.; Zhu, C.; Song, L.; Xie, R.; Zhang, W. Viewport Prediction for Panoramic Video with Multi-CNN. In Proceedings of the 2019 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB); IEEE: New York, NY, USA, 2019; pp. 1–6. [Google Scholar]
  8. Chao, F.Y.; Ozcinar, C.; Smolic, A. Transformer-based Long-Term Viewport Prediction in 360-Degree Video: Scanpath is All You Need. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP); IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
  9. Chen, X.; Kasgari, A.T.Z.; Saad, W. Deep Learning for Content-Based Personalized Viewport Prediction of 360-Degree VR Videos. IEEE Netw. Lett. 2020, 2, 81–84. [Google Scholar] [CrossRef]
  10. Xu, X.; Tan, X.; Wang, S.; Liu, Z.; Zheng, Q. Multi-features fusion based viewport prediction with gnn for 360-degree video streaming. In Proceedings of the 2023 IEEE International Conference on Metaverse Computing, Networking and Applications (MetaCom); IEEE: New York, NY, USA, 2023; pp. 57–64. [Google Scholar]
  11. Tashiro, Y.; Song, J.; Song, Y.; Ermon, S. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS Proceedings: San Diego, CA, USA, 2021; pp. 24804–24816. [Google Scholar]
  12. Yang, Y.; Jin, M.; Wen, H.; Zhang, C.; Liang, Y.; Ma, L.; Wang, Y.; Liu, C.M.; Yang, B.; Xu, Z.; et al. A Survey on Diffusion Models for Time Series and Spatio-Temporal Data. Acm Comput. Surv. 2024, 58, 196. [Google Scholar] [CrossRef]
  13. Yuan, X.; Qiao, Y. Diffusion-TS: Interpretable Diffusion for General Time Series Generation. In Proceedings of the Twelfth International Conference on Learning Representations; ICLR: Vienna, Austria, 2024; pp. 1–29. [Google Scholar]
  14. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  15. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations; ICLR: Vienna, Austria, 2021; pp. 1–20. [Google Scholar]
  16. Chen, Y.; Lu, H.; Qin, L.; Wu, C.; Chen, C.W. Streaming 360° VR Video with Statistical QoS Provisioning in mmWave Networks from Delay and Rate Perspectives. IEEE Trans. Wirel. Commun. 2025, 24, 4721–4737. [Google Scholar] [CrossRef]
  17. Xu, M.; Song, Y.; Wang, J.; Qiao, M.; Huo, L.; Wang, Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2693–2708. [Google Scholar] [CrossRef] [PubMed]
  18. Petrangeli, S.; Simon, G.; Swaminathan, V. Trajectory-Based Viewport Prediction for 360-Degree Virtual Reality Videos. In Proceedings of the 2018 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR); IEEE: New York, NY, USA, 2018; pp. 157–160. [Google Scholar]
  19. Yaqoob, A.; Muntean, G.M. A Collaborative Trajectory-Oriented Viewport Prediction for on-Demand and Live 360-Degree VR Video Streaming. In Proceedings of the 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life Using AI, Robotics and IoT (HONET); IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
  20. Chen, J.; Luo, X.; Hu, M.; Wu, D.; Zhou, Y. Sparkle: User-Aware Viewport Prediction in 360-Degree Video Streaming. IEEE Trans. Multimed. 2021, 23, 3853–3866. [Google Scholar] [CrossRef]
  21. Li, C.; Zhang, W.; Liu, Y.; Wang, Y. Very Long Term Field of View Prediction for 360-Degree Video Streaming. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR); IEEE: New York, NY, USA, 2019; pp. 297–302. [Google Scholar]
  22. Rondón, M.F.R.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. TRACK: A New Method from a Re-Examination of Deep Architectures for Head Motion Prediction in 360-Degree Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5681–5699. [Google Scholar] [PubMed]
  23. Xu, Y.; Dong, Y.; Wu, J.; Sun, Z.; Shi, Z.; Yu, J.; Gao, S. Gaze Prediction in Dynamic 360-Degree Immersive Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 5333–5342. [Google Scholar]
  24. Wang, M.; Peng, S.; Chen, X.; Zhao, Y.; Xu, M.; Xu, C. CoLive: An Edge-Assisted Online Learning Framework for Viewport Prediction in 360-Degree Live Streaming. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME); IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
  25. Zhang, Z.; Chen, Y.; Zhang, W.; Yan, C.; Zheng, Q.; Wang, Q.; Chen, W. Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer. In Proceedings of the 31st ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2023; MM ’23; pp. 3560–3568. [Google Scholar]
  26. Zhang, Z.; Du, H.; Huang, S.; Zhang, W.; Zheng, Q. VRFormer: 360-Degree Video Streaming with FoV Combined Prediction and Super resolution. In Proceedings of the 2022 ISPA/BDCloud/SocialCom/SustainCom; IEEE: New York, NY, USA, 2022; pp. 531–538. [Google Scholar]
  27. Tang, J.; Huo, Y.; Yang, S.; Jiang, J. A Viewport Prediction Framework for Panoramic Videos. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2020; pp. 1–8. [Google Scholar]
  28. Chopra, L.; Chakraborty, S.; Mondal, A.; Chakraborty, S. PARIMA: Viewport Adaptive 360-Degree Video Streaming. In Proceedings of the Web Conference 2021; Association for Computing Machinery: New York, NY, USA, 2021; WWW ’21; pp. 2379–2391. [Google Scholar]
  29. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  30. Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23. [Google Scholar] [CrossRef]
  31. Peng, S.; Hu, J.; Li, Z.; Xiao, H.; Yang, S.; Xu, C. Spherical Convolution-based Saliency Detection for FoV Prediction in 360-degree Video Streaming. In Proceedings of the 2023 International Wireless Communications and Mobile Computing (IWCMC); IEEE: New York, NY, USA, 2023; pp. 162–167. [Google Scholar]
  32. Wu, C.; Zhang, R.; Wang, Z.; Sun, L. A Spherical Convolution Approach for Learning Long Term Viewport Prediction in 360 Immersive Video. Proc. AAAI Conf. Artif. Intell. 2020, 34, 14003–14040. [Google Scholar] [CrossRef]
  33. Gao, B.; Sheng, D.; Zhang, L.; Qi, Q.; He, B.; Zhuang, Z.; Wang, J. STAR-VP: Improving Long-term Viewport Prediction in 360-Degree Videos via Space-aligned and Time-varying Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2024; MM ’24; pp. 5556–5565. [Google Scholar]
  34. Meijer, C.; Chen, L.Y. The rise of diffusion models in time-series forecasting. arXiv 2024, arXiv:2401.03006. [Google Scholar] [CrossRef]
  35. Rasul, K.; Seward, C.; Schuster, I.; Vollgraf, R. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. arXiv 2021, arXiv:2101.12072. [Google Scholar] [CrossRef]
  36. Alcaraz, J.L.; Strodthoff, N. Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models. Trans. Mach. Learn. Res. 2023, 1–36. Available online: https://openreview.net/forum?id=hHiIbk7ApW (accessed on 15 March 2026).
  37. Wang, D.; Cheng, M.; Liu, Z.; Liu, Q. TimeDART: A Diffusion Autoregressive Transformer for Self-Supervised Time Series Representation. In Proceedings of the Forty-Second International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2025; pp. 1–25. [Google Scholar]
  38. Wang, Y.; Zhang, F.L.; Dodgson, N.A. ScanTD: 360-Degree Scanpath Prediction based on Time-Series Diffusion. In Proceedings of the 32nd ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2024; MM ’24; pp. 7764–7773. [Google Scholar]
  39. Yun, H.; Lee, S.; Kim, G. Panoramic Vision Transformer for Saliency Detection in 360-Degree Videos. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Part XXXV. pp. 422–439. [Google Scholar]
  40. Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In Proceedings of the International Conference on Learning Representations; ICLR: Appleton, WI, USA, 2021; pp. 1–17. [Google Scholar]
  41. Cuturi, M.; Blondel, M. Soft-DTW: A differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2017; Volume 70, ICML’17; pp. 894–903. [Google Scholar]
  42. David, E.J.; Gutiérrez, J.; Coutrot, A.; Da Silva, M.P.; Callet, P.L. A dataset of head and eye movements for 360-Degree videos. In Proceedings of the 9th ACM Multimedia Systems Conference; Association for Computing Machinery: New York, NY, USA, 2018; MMSys ’18; pp. 432–437. [Google Scholar]
  43. Wu, C.; Tan, Z.; Wang, Z.; Yang, S. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference; Association for Computing Machinery: New York, NY, USA, 2017; MMSys’17; pp. 193–198. [Google Scholar]
  44. Rondón, M.F.R.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. A unified evaluation framework for head motion prediction methods in 360-Degree videos. In Proceedings of the 11th ACM Multimedia Systems Conference; Association for Computing Machinery: New York, NY, USA, 2020; MMSys ’20; pp. 279–284. [Google Scholar]
  45. Chao, F.Y.; Zhang, L.; Hamidouche, W.; Deforges, O. Salgan360: Visual Saliency Prediction on 360-Degree Images with Generative Adversarial Networks. In Proceedings of the 2018 IEEE International Conference on Multimedia & Expo Workshops (ICMEW); IEEE: New York, NY, USA, 2018; pp. 1–4. [Google Scholar]
  46. Zhang, Z.; Xu, Y.; Yu, J.; Gao, S. Saliency Detection in 360-Degree Videos. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018, Part VII; Springer: Berlin/Heidelberg, Germany, 2018; pp. 504–520. [Google Scholar]
Figure 1. Trajectories of different viewing behaviors on the same video.
Figure 2. The overall architecture of DiffVP. (a) Noise Processing, which constructs the noise input conditioned on historical trajectories; (b) ECTE, which separately models temporal and coordinate features; (c) CASF, which spatially weights and fuses SL_ts and SL_coord to align the trajectory features, while performing spatio-temporal interaction between saliency features and trajectory features.
Figure 3. The CDFs of head movements for the three datasets at different time windows. (a) David_MMSys dataset; (b) Wu_MMSys dataset; (c) Xu_PAMI dataset. The horizontal axis represents the maximum head-movement amplitude (in degrees) within T seconds after the reference time t. The vertical axis denotes the cumulative proportion of samples whose displacement does not exceed the corresponding horizontal-axis value.
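For readers who want to reproduce this style of analysis, the snippet below is a minimal sketch of how the maximum head displacement within a prediction window and its empirical CDF can be computed from a sampled yaw/pitch trajectory. It is an illustrative example under assumed input conventions (degrees, fixed sampling rate), not the authors' evaluation code.

```python
import numpy as np

def unit_vector(yaw_deg, pitch_deg):
    """Convert yaw/pitch angles (degrees) to unit vectors on the sphere."""
    yaw = np.radians(yaw_deg)
    pitch = np.radians(pitch_deg)
    return np.stack([np.cos(pitch) * np.cos(yaw),
                     np.cos(pitch) * np.sin(yaw),
                     np.sin(pitch)], axis=-1)

def max_displacement_deg(yaw_deg, pitch_deg, fps, window_s):
    """Maximum angular displacement (degrees) from each reference sample
    over the following window_s seconds."""
    v = unit_vector(np.asarray(yaw_deg), np.asarray(pitch_deg))
    horizon = int(round(fps * window_s))
    out = []
    for t in range(len(v) - horizon):
        # Dot products between the reference direction and all later samples in the window
        cos_ang = np.clip(v[t + 1:t + 1 + horizon] @ v[t], -1.0, 1.0)
        out.append(np.degrees(np.arccos(cos_ang)).max())
    return np.array(out)

def empirical_cdf(samples):
    """Return sorted displacements and cumulative proportions for plotting a CDF."""
    x = np.sort(samples)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
```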
Figure 4. Pointwise evolution of the OD and IoU metrics over the prediction horizon: (a,b) David_MMSys dataset; (c,d) Wu_MMSys dataset; (e,f) Xu_PAMI dataset.
Figure 5. Qualitative comparison of different methods on various videos. Red denotes the ground-truth viewing trajectory, and blue denotes the predicted trajectory.
Figure 6. Visualization of prediction results for different viewing behavior trajectories on the same video.
Table 1. Prediction performance (2–5 s) of different methods on three public datasets.

Method | Pub. | David_MMSys OD ↓ | David_MMSys IoU ↑ | Wu_MMSys OD ↓ | Wu_MMSys IoU ↑ | Xu_PAMI OD ↓ | Xu_PAMI IoU ↑
Track [22] | TPAMI’21 | 1.123 | 25.05% | 0.613 | 51.12% | 0.408 | 63.32%
VPT360 [8] | MMSP’21 | 1.127 | 26.00% | 0.624 | 52.04% | 0.421 | 62.47%
MFTR [25] | MM’23 | 1.064 | 27.98% | 0.599 | 52.02% | 0.418 | 62.59%
STAR-VP [33] | MM’24 | 0.967 | 33.26% | 0.531 | 56.82% | 0.410 | 63.10%
DiffVP | ’26 | 0.962 | 34.55% | 0.503 | 59.69% | 0.388 | 66.01%
Arrows indicate the direction of better performance: ↓ (lower is better), ↑ (higher is better). The best results are highlighted in bold.
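For reference, OD in Tables 1 and 4 denotes the orthodromic (great-circle) distance between the predicted and ground-truth viewport centers on the unit sphere; the reported magnitudes appear consistent with radians. The snippet below is a minimal, generic sketch of this metric; the exact averaging over users, videos, and prediction horizons follows each paper's evaluation protocol and is not shown here.

```python
import numpy as np

def orthodromic_distance(pred, gt):
    """Great-circle distance (radians) between viewport centers given as
    (longitude, latitude) pairs in radians; vectorized over leading dimensions."""
    lon_p, lat_p = pred[..., 0], pred[..., 1]
    lon_g, lat_g = gt[..., 0], gt[..., 1]
    cos_d = (np.sin(lat_p) * np.sin(lat_g)
             + np.cos(lat_p) * np.cos(lat_g) * np.cos(lon_p - lon_g))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

# Example: mean OD over a short horizon for one user/video (illustrative values)
pred = np.radians(np.array([[10.0, 5.0], [12.0, 6.0]]))  # predicted (lon, lat) in degrees -> radians
gt = np.radians(np.array([[11.0, 5.5], [15.0, 7.0]]))    # ground truth
print(orthodromic_distance(pred, gt).mean())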
Table 2. Comparison of model FLOPs (G), parameters (MB), and inference time (s).

Method | Pub. | FLOPs (G) | Parameters (MB) | Inference Time (s) *
STAR-VP [33] | MM’24 | 6.9 | 1.87 | 0.03
DiffVP (DDPM) | ’26 | 7.45 | 1.96 | 2.51
DiffVP (DDIM) | ’26 | 7.45 | 1.96 | 0.34
* Inference time refers to the average processing time for each sample, measured under the same experimental environment.
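The gap between the DDPM and DDIM rows in Table 2 comes from the number of reverse-diffusion steps: DDIM admits a deterministic sampler that traverses only a subsampled timestep schedule. The sketch below is a generic, simplified DDIM sampling loop (eta = 0); the variable names, step schedule, and denoiser interface are illustrative assumptions, not the DiffVP implementation.

```python
import torch

@torch.no_grad()
def ddim_sample(denoiser, cond, alpha_bar, num_steps, shape, device="cpu"):
    """Deterministic DDIM sampling over a subsampled timestep schedule.

    denoiser(x_t, t, cond) is assumed to predict the noise added at step t;
    alpha_bar is the cumulative product of (1 - beta_t) for the full schedule.
    """
    T = alpha_bar.shape[0]
    # Evenly spaced subset of timesteps, e.g. a few dozen DDIM steps instead of T DDPM steps
    steps = torch.linspace(T - 1, 0, num_steps, device=device).long()
    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for i, t in enumerate(steps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = denoiser(x, t_batch, cond)                              # predicted noise
        x0 = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)        # predicted clean sample
        x = torch.sqrt(a_prev) * x0 + torch.sqrt(1 - a_prev) * eps    # deterministic DDIM update
    return x
```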
Table 3. Ablation study of the proposed components on viewport prediction performance.
BaselineDiffECTECASFCat_Sal *DDPMDDIMDavid_MMSysWu_MMSysXu_PAMI
OD ↓IoU ↑OD ↓IoU ↑OD ↓IoU ↑
××××××1.13226.02%0.62151.25%0.43262.18%
××××1.07628.62%0.58953.17%0.41163.05%
××××1.04231.37%0.54556.17%0.39864.21%
×××1.05330.16%0.57454.81%0.40863.67%
×××1.01532.24%0.53457.19%0.40264.87%
×××0.96234.55%0.50359.69%0.38866.01%
×××0.96634.32%0.50259.82%0.39165.78%
✓ indicates the component is used; × indicates it is not used. Cat_Sal * denotes the variant where saliency features are integrated solely via concatenation. Arrows indicate the direction of better performance: ↓ (lower is better), ↑ (higher is better). The best results are highlighted in bold.
Table 4. Performance comparison of different saliency feature extraction methods.

Method | Pub. | David_MMSys OD ↓ | David_MMSys IoU ↑ | Wu_MMSys OD ↓ | Wu_MMSys IoU ↑ | Xu_PAMI OD ↓ | Xu_PAMI IoU ↑
SalGAN360 [45] | ICMEW’18 | 1.056 | 30.27% | 0.612 | 52.34% | 0.420 | 63.15%
Spherical U-Net [46] | ECCV’18 | 1.002 | 31.46% | 0.552 | 55.04% | 0.405 | 63.29%
Offline-DHP [39] | TPAMI’18 | 0.991 | 33.75% | 0.539 | 56.32% | 0.392 | 64.33%
PAVER [39] | ECCV’22 | 0.962 | 34.55% | 0.503 | 59.69% | 0.388 | 66.01%
Arrows indicate the direction of better performance: ↓ (lower is better), ↑ (higher is better). The best results are highlighted in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
