ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding

Gallena Watthage, Reka Sandaruwan; Fernando, Anil

doi:10.3390/app16073424

Open Access Article

by Reka Sandaruwan Gallena Watthage * and Anil Fernando

Department of Computer & Information Sciences, University of Strathclyde, Glasgow G1 1XQ, UK

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3424; https://doi.org/10.3390/app16073424

Submission received: 20 February 2026 / Revised: 16 March 2026 / Accepted: 24 March 2026 / Published: 1 April 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Current 360-degree video streaming systems treat viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience (QoE) estimation as independent tasks, yielding fragmented pipelines that generalize poorly across content types, network conditions, and individual users. We propose ImmerseFM-3D, a foundation model that jointly solves all four sub-tasks through a single shared representation. Seven input modalities, namely video frames, network traces, head-motion trajectories, ambisonics audio, depth maps, eye-tracking signals, and CLIP scene semantics, are fused by four-layer cross-modal attention and compressed into a 256-dimensional bottleneck latent via a variational information bottleneck. Four task-specific decoders operate on this shared latent simultaneously. A model-agnostic meta-learning adapter augmented with episodic memory and a hypernetwork personalizes the model from as little as 1 s of user interaction data. An extended branch supports six-degrees-of-freedom volumetric content through spherical harmonic viewport decoding and depth-aware tile importance weighting. Trained and evaluated on the IMMERSE-1M combined dataset (1000 h of 360° and volumetric video, 524 users, and over 50,000 mean opinion scores), ImmerseFM-3D reduces the mean angular viewport error by 34%, lowers the bandwidth violation rate from 8.3% to 3.1%, and achieves a QoE Pearson correlation of 0.891. The personalization adapter reaches 90% of peak performance in 22 s, while zero-shot cross-format transfer attains 72% of full in-domain accuracy.

Keywords:

360-degree video streaming; foundation models; cross-modal learning; viewport prediction; volumetric video

1. Introduction

Omnidirectional video has evolved from an early technical curiosity [1] into a mass-market medium propelled by the rapid commoditization of consumer head-mounted displays (HMDs) and the proliferation of immersive media platforms across entertainment, education, healthcare, and telepresence [2]. Global immersive video traffic is projected to exceed 580 exabytes annually by 2027, driven by applications spanning live sports, virtual tourism, cultural heritage preservation, and remote collaboration. Empirical studies confirm that 360° video elicits measurably higher presence, enjoyment, and perceived credibility than conventional 2D footage when delivered at sufficient quality [3,4]. Yet streaming omnidirectional content faithfully over best-effort networks remains fundamentally harder than conventional video delivery: a single 4K equirectangular frame encodes up to 36 megapixels, only 20% of which any given user will perceive within their 110°-wide viewport at any instant. The gap between what must be transmitted and what is actually consumed creates an acute bandwidth efficiency problem, further compounded by the stringent motion-to-photon latency constraint (<20 ms) required to prevent cybersickness [3].

Viewport-adaptive streaming addresses this fundamental mismatch by spatially partitioning the equirectangular sphere into independently coded rectangular tiles and allocating quality resources in proportion to the probability that each tile falls within the user’s future field of view [5,6]. Tile granularity, layout, and the mechanism by which predicted viewport uncertainty is translated into bitrate decisions have been explored extensively in the literature [7,8,9,10]. This overall paradigm decomposes the streaming problem into four tightly interdependent sub-tasks: viewport prediction, adaptive bitrate (ABR) allocation, tile selection, and quality-of-experience (QoE) estimation. The research community has addressed each sub-task with increasingly sophisticated yet fundamentally siloed models. Viewport predictors have progressed from early sinusoidal and position-extrapolation heuristics [11,12] through LSTM, GRU, and deep reinforcement learning [13,14] to modern transformer architectures [15,16,17] and cross-modal multi-scale attention mechanisms [18,19], yet fundamentally remain oblivious to concurrent network state or spatial audio cues. Bitrate adaptation algorithms have incorporated stochastic viewport uncertainty [20,21,22], beamforming co-design for millimetre-wave and THz networks [23], and reinforcement learning for QoE-driven decision-making [24,25]; however, all assume a fixed, separately optimized viewport estimate as their only behavioral input. Tile selection strategies [26,27] and QoE estimators [28,29] operate analogously in isolation, with no shared internal representation across tasks and no feedback path from downstream decisions to the upstream predictor.

This architectural fragmentation creates three compounding problems that collectively impose a hard performance ceiling on viewport-adaptive streaming systems. First, prediction errors propagate non-gracefully: a viewport prediction error of even 5° at a 1 s horizon displaces entire columns of high-quality tiles outside the actual viewport, wasting up to 30% of available bandwidth [30]. While probabilistic pre-fetching [6] and stochastic optimization frameworks [20,21] partially mitigate this, they cannot recover information already discarded by the upstream predictor. Shen et al. [31] demonstrated that even the codec structure amplifies this problem through GOP-induced switching latency, motivating new bitstream schemas. Second, task-specific models generalize poorly across content, users, and networks: a viewport predictor trained on documentary footage degrades significantly on high-motion sports content without retraining [32], personalization to a new user typically requires tens of minutes of behavioral data [33], and live streaming contexts lack the pre-session content information that many state-of-the-art predictors require [34,35]. Third, rich cross-modal signals are entirely discarded: the well-documented influence of spatial audio onset events on gaze direction [36,37], the predictive relationship between eye-tracking micro-saccades and upcoming head reorientations [38,39], the role of scene semantic context in shaping long-term viewing trajectories [4], and content-complexity correlates of network bitrate sensitivity [40] are entirely ignored by all existing streaming pipelines.

We argue that these limitations share a common root cause: the absence of a unified representation that jointly encodes scene content, network state, user behavior, and perceptual quality. Motivated by the success of large-scale multi-modal pre-training in vision [15,41], and by information bottleneck theory [42] as a principled mechanism for compressing heterogeneous representations to task-sufficient statistics, we propose ImmerseFM-3D, a foundation model framework for generalizable 360-degree and volumetric video streaming. ImmerseFM-3D maps seven heterogeneous input modalities (equirectangular video frames, network throughput traces, head-motion trajectories, first-order ambisonics audio, monocular depth maps, eye-tracking signals, and CLIP-derived scene semantic embeddings) into a shared 256-dimensional bottleneck latent via four-layer cross-modal attention and a variational information bottleneck. Four task-specific decoders simultaneously share this representation, enabling end-to-end multi-task optimization with no per-task feature extraction overhead. A MAML-based meta-learning adapter [43], augmented with an episodic memory module and a weight-generating hypernetwork, personalizes the model from as little as 1 s of interaction data. An extended volumetric branch supports 6DoF content through spherical harmonic viewport decoding and depth-aware tile importance weighting, connecting 360° streaming research to next-generation holographic delivery [44].

The principal contributions of this paper are as follows:

ImmerseFM-3D architecture. The first unified foundation model that jointly solves all four 360° streaming sub-tasks through a single shared cross-modal latent representation. Unlike ensembles of task-specific models, ImmerseFM-3D eliminates the inter-task error propagation bottleneck: the bitrate decoder implicitly conditions on viewport prediction uncertainty through the shared bottleneck, a capability structurally impossible in siloed pipelines. This yields VP MAE reductions of up to 34%, BVR reduction of 63%, and tile F1 improvement of 11 pp over the best single-task baselines (Section 5.1).
Variational information bottleneck for multi-modal streaming. The first application of information bottleneck theory [42] to compress a seven-modality immersive streaming representation into a 256-dimensional sufficient statistic. This enables (a) implicit uncertainty-aware bitrate decisions without dedicated confidence estimation modules, and (b) 72% zero-shot cross-format transfer to volumetric content versus 48% for pre-trained fine-tuning by learning domain-agnostic compressed representations (Section 5.3).
Sample-efficient meta-learning personalization. A compound personalization module combining MAML [43], a HyperNetwork, and episodic memory reduces the 90%-threshold adaptation time to 22 s from ≥47 s for standalone MAML and delivers statistically significant QoE gains from a single second of user data across all four sub-tasks simultaneously, making it the first 360° streaming personalization system to achieve this at scale (Section 5.2).
IMMERSE-1M dataset. The largest publicly released multi-modal benchmark for immersive streaming research: 1000 h of 360° and volumetric video, 524 participants, 10,000 network traces, and 50,000+ subjective MOS ratings [45] with synchronized head-motion [46,47], eye-tracking [39], ambisonics audio, and network traces across six content format categories (Section 4.1).
Volumetric 6DoF extension. Spherical harmonic viewport decoding and depth-aware tile importance weighting [44] reduce 6DoF viewport error by 40.8% and increase Point-to-Plane PSNR by 7.3 dB over the depth-agnostic 2D baseline, establishing the first foundation-model baseline for next-generation holographic streaming (Section 3.7 and Section 5.4).

The remainder of this paper is organized as follows. Section 2 reviews related work across the four sub-task domains and meta-learning personalization. Section 3 details the ImmerseFM-3D architecture. Section 4 describes the IMMERSE-1M dataset, evaluation protocols, baselines, and metrics. Section 5 presents comprehensive experimental results, ablations, and efficiency analyses. Section 6 interprets the findings, analyzes cross-modal attention patterns, and discusses limitations and future directions. Section 7 concludes.

2. Related Work

2.1. Viewport Prediction: From Heuristics to Foundation Representations

Viewport prediction is the enabling technology for bandwidth-efficient 360° streaming. The problem was first formalized when it became apparent that motion-prediction-based transmission could reduce bandwidth consumption by up to 45% relative to viewport-agnostic delivery [12]. Foundational datasets from Wu et al. [46] and Rai et al. [47] established the first publicly available corpora of head-motion traces recorded inside HMDs, documenting systematic viewing patterns, including equatorial bias and early-session free exploration, across diverse 360° content categories. These observations were corroborated by Sitzmann et al. [4] through controlled eye-tracking experiments that quantified how users allocate attention in virtual environments, showing that visual saliency maps derived from standard 2D attention models fail to capture the spherical geometry of immersive viewing. David et al. [38] extended this to simultaneous eye and head movement recordings from 57 participants across 19 videos, and Xu et al. [39] released Panonut360, a high-frequency (90 Hz) dataset combining head and eye tracking across a large panoramic video corpus, enabling research into the intra-session dynamics that determine prediction difficulty at different horizons.

Early predictors fitted polynomial or sinusoidal motion models [11] to recent trajectory windows, achieving competitive accuracy at sub-second horizons but degrading rapidly beyond 1 s. Recurrent neural networks improved medium-horizon accuracy: deep reinforcement learning was applied to predict head-movement scanpaths [13], while domain-inspired RL formulations for 360° sports piloting demonstrated the value of multi-scale temporal context [14]. Martinez et al. [48] established key properties of GRU-based sequence models for human body motion that were subsequently adapted to 360° head prediction [49]. Spherical convolution architectures addressed the systematic projection distortions that impair weight sharing in conventional 2D CNNs applied to equirectangular video [50].

Transformer-based architectures now define the state of the art. VPT360 [15] demonstrated that a pure scanpath-attention model surpasses CNN-RNN hybrids in both accuracy and computational efficiency. MFTR [16] extended this to multi-modal fusion of trajectory and video saliency; CMMST [17] applied cross-modal attention across multiple spatial scales; and BiT [18] introduced bidirectional transformer alignment to maintain temporal coherence across viewport proposal frames. TRACK [32] systematically re-examined deep architectures for head-motion prediction, providing architectural insights adopted by subsequent methods, while STAR-VP [51] proposed space-aligned and time-varying feature fusion that preserves spatial locality in the fused representation. Multi-modal fusion has been further advanced through MDA-based attention addressing differential information density across trajectory, visual, and audio modalities [19], and through optimized mobile-friendly architectures that meet device-side inference latency budgets for live streaming [35].

For live scenarios where pre-session content is unavailable, dedicated methods use Gaussian Mixture Model motion tracking [34] and edge-assisted collaborative learning across simultaneous viewers [52]. For the emerging class of 6DoF volumetric content, Li et al. [44] proposed STVP, a saliency-and-trajectory predictor for holographic video that addresses the additional positional degrees of freedom beyond yaw, pitch, and roll.

Despite this rapid progress, a comprehensive survey by Wahba et al. [53] identifies three persistent limitations shared by virtually all existing methods: single-modality processing, poor cross-content generalization, and the need for per-user retraining. ImmerseFM-3D directly addresses all three through cross-modal fusion, variational information bottleneck regularization, and MAML-based few-shot personalization.

2.2. Adaptive Bitrate Streaming and Tile Selection

The foundational paradigm for tile-based adaptive streaming was established by Corbillon et al. [5], who showed that cube-map tiling with viewport-weighted bitrate allocation outperforms uniform quality delivery at equal bandwidth. Son and Ryu [8] demonstrated the viability of tile-based delivery for mobile VR in a cyber-physical system context. Skupin et al. [7] investigated rate assignment guided by spatiotemporal video activity metrics to reduce inter-tile quality variance, and De la Fuente et al. [9] analyzed the delay impact of MPEG OMAF’s viewport-dependent streaming architecture on perceived fidelity, identifying end-to-end delay as the primary quality bottleneck. Content-adaptive tiling layouts that co-optimize the spatial partition with bitrate allocation have been shown to reduce average streamed bitrate by 32–70% relative to uniform grids [10].

Viewport-adaptive ABR algorithms have grown progressively more sophisticated. 360ProbDASH [6] introduced probabilistic pre-fetching that explicitly accounts for viewport prediction uncertainty through a QoE-driven optimization framework, reducing spatial quality variance by 46% in trace-driven experiments. Flare [54] demonstrated practical mobile deployment with online allocation algorithms achieving up to 4.9× viewport quality improvement in LTE networks. JUST360 [55] formulated a joint utility maximization problem across all tiles incorporating network dynamics, viewport probability distributions, and rate-distortion characteristics. Ghosh et al. [20] provided a stochastic optimization framework with probabilistic quality guarantees under uncertain head movement and bandwidth, while Zhao et al. [21] extended this to wireless multi-user networks with KKT and CCCP-based solvers for scenarios ranging from perfect to completely unknown FoV distributions. Shen et al. [31] proposed an SVC-based bitstream schema that structurally eliminates GOP-induced viewport-switch latency, enabling one-frame-time viewport transitions.

Reinforcement learning has emerged as the dominant paradigm for sequential bitrate decisions. Park et al. [24] applied RL with 3D-CNN viewport features, achieving 1.3–1.7× bitrate improvement over tile-based baselines. VATP360 [27] coupled RL with a tile priority classifier derived from object motion and ROI predictions. Transformer-based saliency models pre-trained on 2D images have been repurposed for 360° ABR without fine-tuning [25], and Feng et al. [22] introduced multi-window stochastic viewport prediction with model-predictive control, improving overall QoE by 16–19% over prior baselines. Setayesh and Wong [23] jointly optimized viewport prediction, bitrate selection, and beamforming vectors for THz-enabled 6G networks, where beam management and streaming quality are inextricably coupled.

On the tile selection side, Nguyen et al. [26] provided the most systematic comparative evaluation to date across ten existing strategies, finding strong sensitivity to segment duration, buffer size, and content type. Probabilistic tile visibility modeling with Laplace error distributions [56] yields near-optimal server-side allocation; MEC-assisted caching with LSTM-CNN joint models [57] addresses the cache replacement problem. Layer and spatial correlations in the bitstream have been exploited to enhance SVC and tile-based streaming efficiency [40].

All existing systems treat bitrate allocation and tile selection as downstream consumers of a fixed viewport estimate. ImmerseFM-3D is the first architecture to close this loop, sharing a single uncertainty-aware latent representation across all streaming decisions so that the bitrate decoder implicitly conditions on viewport prediction confidence.

2.3. Quality of Experience Estimation and Multi-Modal Perception

QoE estimation for 360° video must account for spherical projection distortions, viewport-dependent perceptual salience, cybersickness, and individual comfort preferences, making it a substantially harder problem than conventional 2D quality assessment. Croci et al. [28] proposed a spherical Voronoi quality estimation framework with visual-attention weighting, demonstrating that patch-based quality metrics weighted by attention closely track subjective MOS. Qiu and Shao [58] extended saliency-driven assessment to blind (no-reference) scenarios using CNNs, while Chen et al. [59] established that equatorial viewing bias significantly distorts saliency map validity, a confound that must be corrected in QoE-driven models. Van Kasteren et al. [29] conducted a controlled psychophysical study with integrated eye-tracking, showing that freezing events and quality degradations interact with gaze patterns in nuanced ways, confirming that eye movements carry strong diagnostic information for QoE prediction beyond what video quality metrics alone provide.

Multi-modal physiological signals offer powerful additional predictors of immersive QoE. Xue et al. [45] released CEAP-360VR, a continuous dataset of EDA, heart rate, skin temperature, head and eye movements, and pupillometry during 360° VR viewing, demonstrating that multi-modal physiological data substantially improves automatic emotion and QoE annotation. Cross-modal audio-visual effects further shape immersive quality: Shimamura et al. [36] showed that synchronized spatial audio removal is necessary to preserve perceived immersion during object editing, and Liu et al. [37] demonstrated that generating spatially accurate first-order ambisonics from 360° video content substantially improves listener immersion, validating the audio modality as a first-class QoE signal. These findings motivate ImmerseFM-3D’s explicit inclusion of ambisonic audio as an input modality, an architectural choice not present in any existing streaming QoE estimator.

From a system perspective, Vats et al. [60] showed that semantic video analysis at the 5G edge, combined with tile post loading, substantially improves pre-fetch quality under real-world latency constraints, one of the first systems to exploit semantic context alongside network state. Adhuran and Martini [61] further investigated the co-optimization of tiling schemes and viewport prediction. Despite this body of work, no published QoE estimator to date exploits the full complement of eye-tracking, head-motion covariation, spatial audio, depth, and scene semantics that ImmerseFM-3D jointly encodes.

2.4. Meta-Learning, Personalization, and Collaborative Adaptation

User behavior in 360° viewing is profoundly individual. Head-motion statistics, exploration style, comfort thresholds for cybersickness, and quality preferences vary substantially across viewers [4,47], making a population-mean model necessarily suboptimal for any specific user. The first systematic study of personalization in the 360° streaming context was conducted by Jiang et al. [33], who proposed a MAML [43]-based meta-learning framework for viewport prediction and pre-fetch size adaptation. Their evaluation demonstrated statistically significant QoE gains for users who deviate substantially from the population mean, precisely those who suffer most under non-personalized systems. Critically, their analysis also revealed that population-trained models can perform worse than a zero-order motion extrapolation baseline for users with non-standard head-motion patterns, underscoring the urgency of personalization.

While first-order meta-learning variants such as Reptile simplify computation, both MAML and Reptile require several minutes of interaction data before adaptation converges in practice. Collaborative approaches offer a complementary direction: CoLive [52] proposed edge-assisted online learning where concurrent clients share gradient information, enabling rapid warm-start for new viewers without requiring labeled user data. However, collaborative methods introduce privacy concerns and assume a sufficiently large concurrent user population. Spherical convolution-based FoV prediction [50] has been proposed for multicast scenarios with limited FoV feedback, an important practical constraint that personalization systems must also respect.

ImmerseFM-3D advances the personalization frontier by combining MAML with a HyperNetwork that generates globally coherent parameter updates from support-set gradients and an episodic memory module that provides a warm initialization from behaviorally similar prior users. This compound design reduces the 90%-threshold adaptation time from ≥47 s (MAML alone) to 22 s, delivers statistically significant gains from a single second of interaction data, and remains effective in live streaming contexts where pre-session content information is unavailable [34,35]. To our knowledge, ImmerseFM-3D is the first 360° streaming system to demonstrate personalization from a single second of data at scale across all four streaming sub-tasks simultaneously.

Table 1 summarizes the relative merits of representative 360° streaming methods across the four sub-tasks and three system capabilities evaluated in this paper. ImmerseFM-3D is the only published method to address viewport prediction, ABR allocation, tile selection, and QoE estimation jointly, while simultaneously supporting seven-modality fusion, user personalization, and 6DoF volumetric content.

3. Proposed Methodology

3.1. System Architecture Overview

ImmerseFM-3D adopts a unified foundation-model paradigm that jointly addresses viewport prediction, bitrate adaptation, tile selection, and quality-of-experience (QoE) estimation through a shared cross-modal representation space. Unlike conventional task-specific pipelines that optimize each component in isolation, our framework learns transferable representations that generalize across diverse content genres, network conditions, and user populations.

The architecture comprises four primary components:

Modality-specific encoders that map seven heterogeneous input streams into a common 512-dimensional latent space;
Cross-modal fusion module with an information bottleneck that compresses the joint representation to 256 dimensions while discarding task-irrelevant variation;
Task-specific decoders that transform the shared representation into actionable streaming decisions; and
Meta-learning adaptation module enabling rapid personalization to individual users from 1–30 s of interaction data.

Figure 1 shows the full system architecture.

3.2. Problem Formulation

We formulate 360° video streaming as a multi-task learning problem over a heterogeneous input space. Let $X = \{X_{v}, X_{n}, X_{u}, X_{a}, X_{d}, X_{e}, X_{s}\}$ denote the input modality set, where $X_{v}$ contains equirectangular video frames, $X_{n}$ contains bandwidth/latency/jitter measurements, $X_{u}$ contains head-orientation trajectories $(\theta_{t}, \phi_{t}, \psi_{t})$ at 30–90 Hz, $X_{a}$ contains first-order ambisonics signals, $X_{d}$ contains per-frame depth maps for volumetric content, $X_{e}$ contains eye-tracking gaze and pupillometry, and $X_{s}$ contains CLIP semantic embeddings.

Head-orientation trajectories $\{(\theta_{t}, \phi_{t}, \psi_{t})\}$ are recorded by the HMD’s inertial measurement unit at native rates spanning 30–90 Hz across the devices and source datasets consolidated in IMMERSE-1M [46,47]. This range is bounded below by the ≈40 Hz Nyquist limit for voluntary saccadic head movements (mechanical bandwidth ≤ 20 Hz) and above by the 90 Hz threshold beyond which further oversampling provides no additional perceptual information at prediction horizons of 0.5–5 s. Variable-rate sequences are normalized to a canonical 90 Hz grid via linear interpolation and encoded with continuous-time positional encodings that preserve absolute timestamp information regardless of the original acquisition rate.
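The normalization to a 90 Hz grid can be illustrated with a minimal sketch. The function name `resample_to_grid` and the shortest-arc handling of angle wraparound are assumptions introduced here for illustration; the text specifies only linear interpolation onto a canonical 90 Hz grid.

```python
import math

def resample_to_grid(timestamps, angles, rate_hz=90.0):
    """Linearly interpolate one angle channel (radians) onto a uniform grid.

    Assumption (not stated in the paper): angles wrap at +/-pi, so each
    segment is interpolated along the shortest arc. Outputs are not
    re-wrapped into [-pi, pi]. Timestamps must be strictly increasing.
    """
    if len(timestamps) != len(angles) or len(timestamps) < 2:
        raise ValueError("need at least two (timestamp, angle) pairs")
    step = 1.0 / rate_hz
    # Number of grid points that fit inside the recorded time span.
    n_out = int(math.floor((timestamps[-1] - timestamps[0]) / step + 1e-9)) + 1
    out, j = [], 0
    for i in range(n_out):
        t = timestamps[0] + i * step
        # Advance to the source segment [t_j, t_{j+1}] that contains t.
        while j < len(timestamps) - 2 and timestamps[j + 1] < t:
            j += 1
        t0, t1 = timestamps[j], timestamps[j + 1]
        w = (t - t0) / (t1 - t0)
        # Shortest signed angular difference between the segment endpoints.
        delta = (angles[j + 1] - angles[j] + math.pi) % (2 * math.pi) - math.pi
        out.append(angles[j] + w * delta)
    return out

# A 30 Hz yaw trace upsampled to the 90 Hz grid.
yaw_90hz = resample_to_grid([0.0, 1 / 30, 2 / 30], [0.0, 0.3, 0.6])
```

Each of the three orientation channels would be resampled in this way before the positional encodings are applied; the original timestamps remain available for the continuous-time encoding.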

The objective is to learn $f_{\Theta} : X \to Y$, where the output space $Y = \{Y_{vp}, Y_{br}, Y_{ts}, Y_{qe}\}$ contains future viewport positions, per-tile bitrate decisions, tile selection probabilities, and MOS-aligned QoE estimates, respectively.

3.3. Modality-Specific Encoders

Each modality is processed by a dedicated encoder that maps its raw signal to a common $\mathbb{R}^{512}$ latent vector, as summarized in Table 2.

3.3.1. Video Encoder

A 3D Swin Transformer [41] processes $T_{v} = 16$ frames at 30 fps. Spherical patching corrects equirectangular distortion by mapping patch coordinates to $(\theta, \phi)$ before applying 3D shifted-window attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}} + B\right) V, \tag{1}$$

where $B$ encodes learnable spatiotemporal relative position biases. The four progressive stages downsample by factors $\{4, 8, 16, 32\}$ while expanding channels as $\{128, 256, 512, 512\}$, producing $f_{v} \in \mathbb{R}^{T_{v}' \times H' \times W' \times 512}$.

3.3.2. Network Encoder

Bandwidth, latency, and jitter traces are encoded by a temporal convolutional network (TCN) with exponentially growing dilation rates:

$$h_{n}^{(\ell)} = \mathrm{Conv1D}\left(h_{n}^{(\ell - 1)},\; k = 3,\; d = 2^{\ell - 1},\; C = 64 \cdot 2^{\ell - 1}\right), \quad \ell = 1, \dots, 4. \tag{2}$$

Global average pooling collapses the temporal axis to $f_{n} \in \mathbb{R}^{512}$.
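To make the schedule in Eq. (2) concrete, the short sketch below enumerates the per-layer dilation and channel width and computes the receptive field of the stack. It assumes one stride-1 convolution per layer, which the text does not state explicitly; `tcn_schedule` is a hypothetical helper name.

```python
def tcn_schedule(num_layers=4, kernel_size=3, base_channels=64):
    """Return per-layer (dilation, channels) and the stacked receptive field.

    Assumes a single stride-1 convolution per layer (an assumption; the
    paper gives only k, d, and C for each level).
    """
    layers = []
    receptive_field = 1
    for level in range(1, num_layers + 1):
        dilation = 2 ** (level - 1)                   # d = 2^(l-1)
        channels = base_channels * 2 ** (level - 1)   # C = 64 * 2^(l-1)
        layers.append((dilation, channels))
        # Each layer widens the receptive field by (k - 1) * d samples.
        receptive_field += (kernel_size - 1) * dilation
    return layers, receptive_field

layers, rf = tcn_schedule()
print(layers)  # [(1, 64), (2, 128), (4, 256), (8, 512)]
print(rf)      # 31
```

Under these assumptions the last layer already has 512 channels, so global average pooling over time yields the 512-dimensional network feature directly, and each output position sees 31 consecutive trace samples.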

3.3.3. User Behavior Encoder

Head-motion trajectories $\{(\theta_{t}, \phi_{t}, \psi_{t})\}_{t = 1}^{T_{u}}$ ($T_{u} = 150$ at 30 Hz) are embedded by an MLP and refined by a 6-layer, 8-head causal Transformer. Temporal aggregation uses learned attention weights:

$$f_{u} = \sum_{t = 1}^{T_{u}} \alpha_{t} h_{u, t}^{(L)}, \quad \alpha = \mathrm{Softmax}\left(w_{\alpha}^{\top} \tanh\left(W_{\alpha} h_{u}^{(L)}\right)\right). \tag{3}$$
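The attention pooling of Eq. (3) can be sketched in a few lines of plain Python. The parameter shapes (an $a \times d$ matrix for $W_{\alpha}$ and a length-$a$ vector for $w_{\alpha}$) are assumptions chosen for illustration, and the toy values below are not from the paper.

```python
import math

def attention_pool(hidden, W_alpha, w_alpha):
    """Pool T hidden vectors into one vector with learned attention weights.

    hidden:  list of T vectors of size d (final-layer Transformer outputs)
    W_alpha: a x d matrix, w_alpha: length-a vector (shapes are assumptions)
    """
    scores = []
    for h in hidden:
        proj = [math.tanh(sum(wij * hj for wij, hj in zip(row, h))) for row in W_alpha]
        scores.append(sum(wi * pi for wi, pi in zip(w_alpha, proj)))
    # Numerically stable softmax over the time axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(hidden[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, hidden)) for k in range(dim)]
    return pooled, alpha

# Toy example: zero scoring weights give uniform attention over time steps.
pooled, alpha = attention_pool(
    hidden=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    W_alpha=[[0.0, 0.0]],
    w_alpha=[0.0],
)
```

With trained weights, time steps whose hidden states score highly dominate the pooled feature, in contrast to plain average pooling, which weights all steps equally.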

3.3.4. Audio Encoder

Ambisonic signals are converted to 128-band mel-spectrograms and processed by ResNet-50. Directional components $A_{mn}^{\sigma}$ up to order 3 are encoded by an auxiliary MLP and concatenated: $f_{a} = [f_{\mathrm{spec}}; f_{\mathrm{amb}}] \in \mathbb{R}^{768}$, then projected to $\mathbb{R}^{512}$.

3.3.5. Depth, Eye-Tracking and Semantics Encoders

The depth encoder applies a 5-layer 3D CNN followed by global 3D average pooling. The eye-tracking encoder uses a bidirectional LSTM operating on gaze-plus-pupil sequences ($T_{e} = 150$, $d = 4$). Scene semantics are derived from frozen CLIP ViT-B/32 embeddings at 1 fps, aggregated by a lightweight Transformer whose [CLS] token serves as $f_{s} \in \mathbb{R}^{512}$.

3.4. Cross-Modal Fusion with Information Bottleneck

Figure 2 illustrates the fusion and compression pipeline.

3.4.1. Cross-Modal Attention

Modality features are combined as tokens $M = [m_{v}, m_{n}, m_{u}, m_{a}, m_{d}, m_{e}, m_{s}]$, where each $m_{i} = f_{i} + p_{i}$ incorporates a learnable modality-type positional encoding $p_{i}$. Full cross-modal attention is computed as:

$$A = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) \in \mathbb{R}^{7 \times 7}, \quad M' = A V \in \mathbb{R}^{7 \times 512}, \tag{4}$$

applied over 4 residual layers with layer normalization.
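As a rough illustration of Eq. (4), the sketch below computes a single attention pass over seven modality tokens. It is deliberately simplified: it uses one head and sets $Q = K = V$ to the raw tokens, omitting the learned projections, residual connections, and layer normalization of the actual model. The 4-dimensional toy tokens are placeholders for the 512-dimensional features.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_attention(tokens):
    """Single-head attention over modality tokens with Q = K = V = tokens.

    Simplification for illustration: learned Q/K/V projections, residual
    connections, and layer normalization are omitted.
    """
    d_k = len(tokens[0])
    scale = math.sqrt(d_k)
    # attn[i][j]: how strongly modality i attends to modality j.
    attn = [
        softmax([sum(q * k for q, k in zip(qi, kj)) / scale for kj in tokens])
        for qi in tokens
    ]
    # Each output token is an attention-weighted mixture of all tokens.
    mixed = [
        [sum(a * v[c] for a, v in zip(row, tokens)) for c in range(d_k)]
        for row in attn
    ]
    return attn, mixed

# Seven toy tokens in the order (video, network, user, audio, depth, eye, semantics).
tokens = [[float((i + 1) * (c + 1) % 3) for c in range(4)] for i in range(7)]
attn, mixed = cross_modal_attention(tokens)
```

The resulting 7 × 7 matrix is the quantity that attention-pattern analyses inspect: row $i$ shows how much modality $i$ draws on every other modality.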

3.4.2. Variational Information Bottleneck

The bottleneck objective balances compression and task-relevance [42]:

$$L_{IB} = I(X; Z) - \beta I(Z; Y), \tag{5}$$

where the encoder parameterizes a Gaussian $q(z \mid x) = \mathcal{N}(\mu_{z}, \mathrm{diag}(\sigma_{z}^{2}))$ and the KL regularizer is:

$$L_{KL} = D_{KL}\left(\mathcal{N}(\mu_{z}, \sigma_{z}^{2}) \,\|\, \mathcal{N}(0, I)\right) = -\frac{1}{2} \sum_{j = 1}^{256} \left(1 + \log \sigma_{z, j}^{2} - \mu_{z, j}^{2} - \sigma_{z, j}^{2}\right). \tag{6}$$

The bottleneck dimension $d_{z} = 256$ and $\beta = 0.01$ are determined by cross-validation; $\beta$ is linearly annealed from 0 to 0.01 over the first 20 training epochs.
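The closed-form KL term of Eq. (6) and the linear annealing of $\beta$ can be sketched as follows. Two points are assumptions rather than statements from the text: that $\beta$ scales the KL term in the training loss (which the annealing description suggests), and that the schedule reaches its maximum exactly at epoch 20 with zero-based epoch counting.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), as in Eq. (6).

    log_var holds log(sigma^2) per latent dimension.
    """
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )

def beta_at_epoch(epoch, beta_max=0.01, warmup_epochs=20):
    """Linear warm-up from 0 to beta_max, then constant (indexing is assumed)."""
    return beta_max * min(1.0, max(0.0, epoch / warmup_epochs))

# A posterior equal to the prior incurs no KL cost.
print(kl_to_standard_normal([0.0] * 256, [0.0] * 256))  # 0.0
# Shifting one latent mean by 1 costs 0.5 nats.
print(kl_to_standard_normal([1.0], [0.0]))               # 0.5
```

Starting with a weight of zero lets the decoders learn to use the latent before compression pressure is applied, which is the usual motivation for KL annealing as a guard against posterior collapse.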

To assess sensitivity to bottleneck hyperparameters, we performed a grid search over $d_{z} \in \{64, 128, 256, 512\}$ and $\beta \in \{0.001, 0.005, 0.01, 0.05, 0.1\}$ on a held-out 5% validation split (Table 2). The selected configuration ($d_{z} = 256$, $\beta = 0.01$) maximizes the joint objective of task performance and zero-shot cross-format transfer, while remaining robust to posterior collapse across all independent training runs. Values of $\beta \geq 0.05$ exhibit increasing collapse risk and degraded generalization, validating the KL annealing schedule described in Section 3.8.

3.4.3. Uncertainty Estimation

During inference, $S = 10$ Monte Carlo samples from the posterior yield predictive mean and variance:

$$\hat{y} = \frac{1}{S} \sum_{s = 1}^{S} f_{\mathrm{dec}}(z_{s}), \quad \sigma_{\hat{y}}^{2} = \frac{1}{S} \sum_{s = 1}^{S} \left(f_{\mathrm{dec}}(z_{s}) - \hat{y}\right)^{2}. \tag{7}$$

The value $S = 10$ is the Pareto-optimal operating point on the estimation-accuracy/inference-latency frontier: it reduces the relative standard error of the MC uncertainty estimate to 2.7% at an additional T4 inference cost of only 2.1 ms, remaining within the 40 ms real-time streaming decision budget. An empirical sweep over $S \in \{1, 3, 5, 10, 20, 50\}$ confirms diminishing accuracy returns above $S = 10$ that do not justify the proportionally larger latency penalties. Calibration diagrams on the test set confirm that the resulting $\sigma_{\hat{y}}^{2}$ estimates are well-calibrated (97.3% interval coverage).

3.5. Task-Specific Decoders

3.5.1. Viewport Prediction Decoder

An autoregressive GRU generates future viewport positions $({\hat{θ}}_{t + τ}, {\hat{ϕ}}_{t + τ})$ over a horizon of $H$ time steps. The hidden state is initialized as $h_{0} = MLP (z)$. The joint loss combines geodesic position accuracy and trajectory smoothness:

L_{v p} = \frac{1}{H} \sum_{τ = 1}^{H} ∥ g_{t + τ} - {\hat{g}}_{t + τ} ∥_{2}^{2} + λ_{smooth} \frac{1}{H - 1} \sum_{τ = 2}^{H} {∥ {\hat{g}}_{t + τ} - {\hat{g}}_{t + τ - 1} ∥}_{2}^{2},

(8)

where orthodromic distance accounts for spherical geometry.
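A direct transcription of Equation (8) is sketched below. It follows the equation as printed (squared Euclidean norms on the position vectors); the value of $λ_{smooth}$ is not given in the text, so the default of 0.1 is purely an assumption, as are the function names.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two position vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def viewport_loss(target, pred, lambda_smooth=0.1):
    """Mean squared position error plus a smoothness penalty on predictions."""
    H = len(pred)
    position = sum(sq_dist(g, g_hat) for g, g_hat in zip(target, pred)) / H
    smooth = 0.0
    if H > 1:
        # Penalize large jumps between consecutive predicted positions.
        smooth = sum(sq_dist(pred[t], pred[t - 1]) for t in range(1, H)) / (H - 1)
    return position + lambda_smooth * smooth
```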

3.5.2. Bitrate Allocation Decoder

Unnormalized scores $s \in R^{K \times | R |}$ are produced for $K$ tiles, where $R$ in the exponent denotes the set of available bitrate levels, and expected bitrates ${\hat{r}}_{i} = \sum_{r \in R} r p_{i, r}$ are computed from the resulting SoftMax distribution. A bandwidth penalty enforces the available capacity constraint:

L_{b r} = - \frac{1}{K} \sum_{i = 1}^{K} \log p_{i, r_{i}^{*}} + λ_{b w} \max (0, \frac{\sum_{i} {\hat{r}}_{i} - B_{avail}}{B_{avail}}) .

(9)
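The two terms of Equation (9) can be sketched as follows: a per-tile cross-entropy against the target bitrate index, plus a hinge penalty on the relative overshoot of the summed expected bitrates. The value of $λ_{b w}$ is not stated in the text, so the default below is an assumption, and `bitrate_loss` is a hypothetical helper.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def bitrate_loss(scores, ladder, target_idx, b_avail, lambda_bw=1.0):
    """Return (loss, expected_bitrates) for K tiles over a bitrate ladder."""
    probs = [softmax(row) for row in scores]
    # Expected bitrate per tile under the softmax distribution.
    expected = [sum(r * p for r, p in zip(ladder, row)) for row in probs]
    ce = -sum(math.log(row[t]) for row, t in zip(probs, target_idx)) / len(scores)
    # Penalty is zero while the total stays within the available bandwidth.
    overshoot = max(0.0, (sum(expected) - b_avail) / b_avail)
    return ce + lambda_bw * overshoot, expected
```

The penalty is relative to the available bandwidth, so a total allocation of twice the budget contributes exactly $λ_{b w}$ to the loss.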

3.5.3. Tile Selection and QoE Decoders

Tile importance scores ${\hat{w}}_{i} = σ ({MLP}_{tile} {(z)}_{i}) \in [0, 1]$ are trained with binary cross-entropy against ground-truth tile inclusion labels. The QoE decoder outputs a scalar $\hat{q} = {MLP}_{q o e} (z) \in [1, 5]$ trained with MSE against ITU-R BT.500 MOS ratings.

Both tile and QoE decoders operate on the deterministic posterior mean $μ_{z}$ ($S = 1$), since their smooth loss surfaces make the mean-field approximation accurate; the saved inference budget is allocated to the $S = 10$ Monte Carlo viewport path whose uncertainty output directly conditions the bitrate allocation decoder (Section 3.5.2).

3.6. Meta-Learning Adapter for Personalization

Figure 3 shows the personalization pipeline.

3.6.1. Episodic Memory Module

The adapter maintains memory $M = \{(h_{j}, Δ θ_{j}, Δ ϕ_{j})\}_{j = 1}^{M}$ of successful past adaptations. For a new user, k nearest neighbors are retrieved via $sim (u, j) = \exp (- ∥ h_{u} - h_{j} ∥^{2} / 2 σ^{2})$.

The k-NN retrieval operates on the deterministic mean embedding $h_{u}$ rather than a posterior sample, because the squared-Euclidean distance metric is maximally discriminative on the mean representation; injecting posterior variance into the retrieval step degrades k-NN precision without providing a calibration benefit, since the retrieved prototypes serve only as a warm initialization for the HyperNet.
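The retrieval step can be sketched as a Gaussian-kernel similarity followed by a top-k selection. In this illustration each memory entry is a pair of a stored embedding and an opaque adaptation record; the kernel width, the value of k, and the function names are assumptions, since the text does not specify them.

```python
import math

def rbf_similarity(h_u, h_j, sigma=1.0):
    """sim(u, j) = exp(-||h_u - h_j||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(h_u, h_j))
    return math.exp(-sq / (2.0 * sigma ** 2))

def retrieve_neighbors(h_u, memory, k=3, sigma=1.0):
    """Return the k most similar entries as (similarity, record) pairs."""
    scored = [(rbf_similarity(h_u, h_j, sigma), record) for h_j, record in memory]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Because the kernel is monotone in the squared distance, ranking by similarity is equivalent to ranking by Euclidean distance; the kernel values are only needed if the retrieved records are later combined with similarity weights.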

3.6.2. Hypernetwork Adaptation

Rather than fine-tuning the full model, a compact hypernetwork maps the mean support-set gradient to per-layer weight offsets [43]:

δ θ = HyperNet (\frac{1}{| D_{\sup} |} \sum_{(x, y) \in D_{\sup}} \nabla_{θ} L (x, y; θ)), θ^{'} = θ + α \cdot δ θ .

(10)

3.6.3. MAML Bi-Level Optimization

Meta-training solves:

\min_{θ} \sum_{T_{i} \sim p (T)} L_{T_{i}} (θ - β \nabla_{θ} L_{T_{i}} (θ)),

(11)

where each task $T_{i}$ corresponds to personalizing for a different user. The inner loop uses 5 gradient steps with step size $β = 0.01$ (here $β$ denotes the inner-loop learning rate, not the KL weight of Section 3.4.2); the outer loop employs Adam.
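To make the bi-level structure of Equation (11) concrete, the toy sketch below runs MAML on scalar quadratic tasks $L_{i} (θ) = (θ - c_{i})^{2}$, for which the gradient through the inner loop is available in closed form. This is an illustration only: the task family, the plain gradient-descent outer loop (the text uses Adam), and all names are assumptions.

```python
def inner_adapt(theta, center, beta=0.01, steps=5):
    """Run inner-loop gradient steps on L(theta) = (theta - center)^2.

    Returns the adapted parameter and d(adapted)/d(theta), which the outer
    loop needs in order to differentiate through the adaptation.
    """
    jac = 1.0
    for _ in range(steps):
        theta = theta - beta * 2.0 * (theta - center)
        jac *= 1.0 - 2.0 * beta
    return theta, jac

def meta_gradient(theta, centers, beta=0.01, steps=5):
    """Gradient of the summed post-adaptation losses w.r.t. the initial theta."""
    grad = 0.0
    for c in centers:
        adapted, jac = inner_adapt(theta, c, beta, steps)
        grad += 2.0 * (adapted - c) * jac  # chain rule through the inner loop
    return grad

def meta_train(theta, centers, outer_lr=0.1, iterations=200):
    for _ in range(iterations):
        theta -= outer_lr * meta_gradient(theta, centers)
    return theta
```

For two tasks centered at −1 and 3, the meta-learned initialization converges to their midpoint, the point from which both tasks are equally easy to adapt to.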

3.7. Volumetric Video Extension

For six-degrees-of-freedom (6DoF) volumetric content, ImmerseFM-3D extends viewport representation and tile-importance scoring as illustrated in Figure 4.

Orientation at future step $τ$ is represented as:

\hat{v} (t + τ) = \sum_{l = 0}^{L} \sum_{m = - l}^{l} {\hat{a}}_{l m} (τ) Y_{l m} (θ, ϕ), L = 3,

(12)

and depth-aware tile importance reweights the 2D scores as

w_{i}^{3 D} = w_{i}^{2 D} \cdot \exp (- λ_{d} {∥ {\hat{d}}_{i} - {\hat{d}}_{vp} ∥}_{2}) .

(13)
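Equation (13) attenuates each 2D tile score exponentially in the depth gap between the tile and the viewport. The sketch below assumes one scalar depth per tile (so the norm reduces to an absolute difference) and an arbitrary $λ_{d} = 0.5$; neither the depth representation nor the value of $λ_{d}$ is specified in the text.

```python
import math

def depth_aware_weights(weights_2d, tile_depths, viewport_depth, lambda_d=0.5):
    """Reweight 2D tile importance by proximity in depth to the viewport."""
    return [w * math.exp(-lambda_d * abs(d - viewport_depth))
            for w, d in zip(weights_2d, tile_depths)]
```

A tile at the viewport depth keeps its 2D score unchanged, and scores decay monotonically as the depth gap grows.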

3.8. Training Objective and Procedure

The complete loss is:

L_{total} = L_{v p} + λ_{b r} L_{b r} + λ_{t s} L_{t s} + λ_{q e} L_{q e} + β L_{KL} + γ L_{meta},

(14)

with weights $λ_{b r} = 0.5$, $λ_{t s} = 0.3$, $λ_{q e} = 1.0$, $β = 0.01$, and $γ = 0.1$.
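The weighted sum of Equation (14) with the stated weights can be sketched as a single function. The per-task loss values are assumed to be computed elsewhere; setting `gamma=0.0` reproduces the Stage 1 objective of Algorithm 1, which omits the meta-learning term.

```python
def total_loss(l_vp, l_br, l_ts, l_qe, l_kl, l_meta,
               lambda_br=0.5, lambda_ts=0.3, lambda_qe=1.0,
               beta=0.01, gamma=0.1):
    """Weighted multi-task objective with KL and meta-learning terms."""
    return (l_vp + lambda_br * l_br + lambda_ts * l_ts + lambda_qe * l_qe
            + beta * l_kl + gamma * l_meta)
```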

Figure 5 outlines the three-stage training schedule.

The training algorithm is formalized in Algorithm 1.

Algorithm 1 ImmerseFM-3D Multi-Stage Training
	Input: Dataset $D = \{(x_{i}, y_{i}^{v p}, y_{i}^{b r}, y_{i}^{t s}, y_{i}^{q e})\}_{i = 1}^{N}$, user splits $U_{train}, U_{meta}, U_{test}$
	Output: Optimised parameters $Θ = \{Θ_{enc}, Θ_{fus}, Θ_{dec}, Θ_{meta}\}$
	Stage 1: Pre-training
	Randomly initialise all networks.
	for epoch $= 1$ to $E_{pre}$:
		Sample batch $B \sim D_{train}$
		Encode: $\{f_{m}\} \leftarrow \{{Enc}_{m} (x_{m})\}_{m \in X}$
		Fuse: $M^{'} \leftarrow CrossModalAttn (M)$
		Compress: $z \sim N (μ_{z}, σ_{z}^{2})$ via reparameterisation
		Decode: ${\hat{y}}^{v p}, {\hat{y}}^{b r}, {\hat{y}}^{t s}, {\hat{y}}^{q e} \leftarrow \{f_{dec}^{k} (z)\}$
		Compute loss: $L \leftarrow L_{v p} + λ_{b r} L_{b r} + λ_{t s} L_{t s} + λ_{q e} L_{q e} + β L_{KL}$
		Update $Θ$ via AdamW (lr $= 10^{- 4}$)
	Stage 2: Meta-training
	for epoch $= 1$ to $E_{meta}$:
		Sample user batch $U_{b} \sim U_{meta}$
		for each user $u \in U_{b}$:
			Sample $D_{\sup}^{u}$ (5 samples) and $D_{qry}^{u}$
			Compute adapted parameters: $θ_{u}^{'} \leftarrow θ - β \nabla_{θ} L (D_{\sup}^{u}; θ)$
			Compute query loss: $L_{qry}^{u} \leftarrow L (θ_{u}^{'}, D_{qry}^{u})$
		Update base parameters: $θ \leftarrow θ - η \nabla_{θ} \sum_{u} L_{qry}^{u}$
	Stage 3: Fine-tuning
	Set lr $= 10^{- 5}$; jointly optimise all losses for $E_{ft}$ epochs.
	Return: $Θ$

3.9. Implementation Details

Table 3 summarizes the full architecture. Training runs on 8× NVIDIA A100 (80 GB; NVIDIA Corporation, Santa Clara, CA, USA) GPUs with PyTorch 2.0 DDP and FP16 mixed-precision. Inference is benchmarked on an NVIDIA T4 edge server (NVIDIA Corporation, Santa Clara, CA, USA), iPhone 12 (A14 Bionic; Apple Inc., Cupertino, CA, USA), Pixel 6 (Google LLC, Mountain View, CA, USA), and HTC Vive Pro Eye 90 Hz (HTC Corporation, New Taipei City, Taiwan).

Key hyper-parameters: AdamW ($β_{1} = 0.9$, $β_{2} = 0.999$, wd = 0.01); gradient clipping to norm $\leq 1.0$; effective batch size 256 via gradient accumulation; $E_{pre} = 100$, $E_{meta} = 50$, $E_{ft} = 20$ epochs.

The complete source code, training scripts, and pre-trained model checkpoints for ImmerseFM-3D and ImmerseFM-3D-Mobile are available at https://github.com/ythinkxyz/ImmerseFM-3D (accessed on 3 March 2026) (MIT Licence, released upon acceptance). Inference can be executed immediately from the pre-trained checkpoints without retraining.

4. Experimental Setup

4.1. IMMERSE-1M Dataset

We introduce IMMERSE-1M, a large-scale multi-modal dataset for immersive streaming research that consolidates and extends existing public resources while adding novel modalities. The dataset comprises over 1000 h of content with fully synchronized multi-modal recordings, making it the largest publicly available benchmark for 360° and volumetric video streaming research to date. Figure 6 provides an overview of the dataset composition.

Demographics summary (detailed in Section 4.1.5). IMMERSE-1M consolidates data from East Asian (32%, Taiwan/China) and Western European (68%, France/UK) participant pools. Post-hoc analysis reveals statistically significant but modest differences in equatorial bias, exploration radius, and saccade frequency between groups ($p < 0.05$, $d = 0.26$–$0.43$). Per-group evaluation confirms that VP MAE differs by only 0.19° between groups, below the threshold of practical significance. We acknowledge that South Asian, African, and Latin American viewing patterns are not represented in the current dataset and recommend this as a priority for future data collection.

4.1.1. Video Content

The video corpus covers six format categories, as detailed in Table 4. Content was sourced from established public datasets [5,46,62] and augmented with new recordings to increase scene diversity, particularly for fast-motion and volumetric categories.

Content diversity across five semantic categories is maintained: sports (15%, fast motion with predictable trajectories), nature (20%, slow exploration with salient regions), documentary (25%, narrative-driven guided attention), urban (20%, complex multi-POI scenes), and action/cinematic (20%, dynamic camera and rapid cuts).

4.1.2. User Behavior Data

Six user behavior corpora are consolidated, as summarized in Table 5. New recordings contribute 200 additional participants with the richest modality set: head tracking, eye tracking, and pupillometry, all at 90 Hz. The combined dataset totals 524 unique participants with approximately 1200 h of viewing data.

Subjective QoE ratings comprise 50,000+ MOS scores collected following ITU-R BT.500 guidelines, with a minimum of 25 subjects rating each video segment under controlled viewing conditions. Ratings cover overall quality, motion sickness, and immersion dimensions.

4.1.3. Network Traces

Ten thousand network traces spanning five real-world and one synthetic condition are included (Table 6), covering bandwidths from 0.1 Mbps (HSDPA) to 800 Mbps (5G mmWave simulation) and round-trip times from 2 ms (WiFi) to 500 ms (degraded HSDPA). This diversity ensures robust evaluation across heterogeneous deployment environments.

4.1.4. Auxiliary Modalities and Dataset Splits

Audio: 60% of 360° videos include synchronized 4-channel first-order ambisonics (FOA), enabling directional sound modeling. Scene semantics: CLIP ViT-B/32 embeddings extracted at 1 fps, MS-COCO object detection labels, and scene graphs are provided for all content.

Dataset splits are user-disjoint and content-disjoint, as shown in Table 7, preventing any overlap between pre-training, meta-training, and held-out test users or video sequences.

4.1.5. Dataset Demographics and Cultural Diversity

The generalizability of ImmerseFM-3D in global deployments depends critically on whether the participant pool underlying IMMERSE-1M reflects the diversity of real-world immersive media audiences. This section characterizes the demographic and geographic composition of the dataset and assesses its impact on model generalization.

Geographic composition. The six user behavior corpora consolidated into IMMERSE-1M originate from two principal geographic regions:

East Asian cohort ( $n = 168$ , 32%): Wu et al. [46] (48 participants, National Taiwan University) and Xu et al. [39] (120 participants, mixed East Asian affiliation).
Western European cohort ( $n = 356$ , 68%): Corbillon et al. [5] (59 participants, Télécom Paris), David et al. [38] (57 participants, University of Nantes), Rai et al. [47] (40 participants, University of Nantes), and new recordings collected at the University of Strathclyde, Glasgow (200 participants recruited under institutional IRB approval).

The combined dataset therefore does not include participants from South Asian, African, or Latin American backgrounds, which we acknowledge as a limitation with implications for global deployment generalization (see Section 7).

Head-motion statistics by geographic group. To assess whether cultural background introduces systematic differences in viewing behavior, we conducted a post-hoc analysis comparing the East Asian and Western European participant groups on three descriptors: equatorial bias (fraction of viewing time spent within 30° of the equatorial plane), mean exploration radius (mean geodesic distance from video centroid), and saccade frequency (number of rapid reorientations per minute). Table 8 reports the results.

All three metrics show statistically significant differences ($p < 0.05$), consistent with cross-cultural gaze research demonstrating that East Asian observers allocate more attention to contextual background elements while Western observers exhibit higher focal exploration rates [4]. However, the effect sizes are modest ($d = 0.26$–$0.43$, classified as small-to-medium), suggesting that while the differences are real, they are unlikely to produce practically significant generalization failures for a model trained on a 524-participant pool that spans both cohorts.

Per-group evaluation. To quantify the impact on model performance, ImmerseFM-3D was evaluated separately on East Asian and Western European held-out test subsets (proportionally sampled from the 10% test split of Table 7):

VP MAE at 1 s: East Asian 5.34°, Western European 5.15°, overall 5.21°. The 0.19° inter-group gap is small relative to the 2.62° margin over the best baseline (STMRQ: 7.83°).
QoE Pearson $ρ$ : East Asian 0.878, Western European 0.897, overall 0.891. The 0.019 inter-group gap is below the conventional threshold of practical significance (0.05).

The MAML-based personalization adapter (Section 3.6) is designed to handle individual-level deviations from population priors, including culture-driven differences in viewing style, via few-shot adaptation from 1–30 s of per-user interaction data. This mechanism is expected to partially compensate for under-represented demographic groups at deployment time, even before any dedicated cross-cultural re-training is performed, although this has not been verified on participants outside the two cohorts.

Content diversity. The video corpus covers five semantic categories (Sports, Nature, Documentary, Urban, Action/Cinematic) sourced from globally available public datasets [5,39,46]. The Sports category is skewed towards team sports and outdoor activities common in East Asia and Europe; content featuring sports or cultural practices specific to South Asian, African, or Latin American contexts is not represented. This is a known limitation of the current dataset version.

4.2. Evaluation Protocols

We design five evaluation experiments, each targeting a distinct generalization axis of ImmerseFM-3D, as illustrated in Figure 7.

Experiment 1: Unseen Content Generalization. Models are trained on 80% of video content spanning all categories and evaluated on the held-out 20%, stratified by category. Metrics are reported separately per content category to expose category-specific failure modes.

Experiment 2: Few-Shot Personalization. For each of the 53 test users, $k \in \{1, 5, 10, 30, 60\}$ seconds of interaction data are provided to the meta-learning adapter. Adaptation quality is measured as the QoE gain over the non-personalized baseline, and sample efficiency is defined as the data volume required to reach 90% of fully adapted performance. The experiment is repeated across all test users and summarized as mean ± 95% CI.

Experiment 3: Cross-Format Transfer. The model is pre-trained exclusively on 360° video, then evaluated zero-shot on volumetric and light-field test sets. Progressive fine-tuning with $\{1 \%, 5 \%, 10 \%, 25 \%\}$ of target domain data quantifies data efficiency relative to training from scratch on equivalent data.

Experiment 4: Volumetric Video Streaming. The 6DoF volumetric extension (Section 3.7) is evaluated on 80 held-out volumetric videos. Predictions cover full 6DoF trajectories; tile streaming is simulated using the depth-aware tile importance weighting from Equation (13), compared against 2D-only and depth-agnostic baselines.

Experiment 5: Ablation Studies. Table 9 defines nine ablation configurations systematically removing architectural components to isolate their individual contributions.

4.3. Baseline Methods

ImmerseFM-3D is benchmarked against state-of-the-art methods for each sub-task, as detailed in Table 10, Table 11 and Table 12.

All baselines were re-implemented in PyTorch 2.0 and trained from random initialization exclusively on the IMMERSE-1M pre-training split (Table 7), using an independent 5% validation subset for hyperparameter optimization (random search, 50 configurations per baseline). Official open-source implementations were used as starting points where available. No pre-trained checkpoints from the original publications were employed. All models were evaluated on the identical held-out test split using shared evaluation scripts, ensuring strict comparability across all reported metrics.

4.4. Evaluation Metrics

4.4.1. Viewport Prediction

Angular prediction error is measured using the geodesically correct orthodromic distance:

d (g_{1}, g_{2}) = \arccos (\sin ϕ_{1} \sin ϕ_{2} + \cos ϕ_{1} \cos ϕ_{2} \cos (θ_{1} - θ_{2})),

(15)

where $(θ, ϕ)$ are yaw and pitch in radians. Mean Angular Error (MAE) in degrees is reported at prediction horizons $τ \in \{0.5, 1, 2, 3, 5\}$ s. Additional metrics include the fraction of predictions within $δ \in \{10 °, 20 °\}$ thresholds and the Earth Mover’s Distance between predicted and ground-truth scanpath distributions.
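Equation (15) and the MAE metric can be sketched in a few lines. The clamp before `acos` is an addition not present in the equation: it guards against floating-point values marginally outside [−1, 1]. The function names are illustrative.

```python
import math

def orthodromic_distance(theta1, phi1, theta2, phi2):
    """Great-circle angle in radians between two (yaw, pitch) directions."""
    c = (math.sin(phi1) * math.sin(phi2)
         + math.cos(phi1) * math.cos(phi2) * math.cos(theta1 - theta2))
    return math.acos(max(-1.0, min(1.0, c)))

def mean_angular_error_deg(targets, preds):
    """Mean orthodromic distance in degrees over paired (yaw, pitch) tuples."""
    total = sum(orthodromic_distance(t[0], t[1], p[0], p[1])
                for t, p in zip(targets, preds))
    return math.degrees(total / len(targets))
```

Unlike a naive difference of yaw angles, this distance stays well behaved across the ±180° wrap-around and near the poles.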

4.4.2. Bitrate Allocation

Top-k Accuracy: $Acc @ k = \frac{1}{N} \sum_{i = 1}^{N} I (r_{i}^{*} \in {Top-k}_{i})$
Mean Absolute Error: $MAE = \frac{1}{N} \sum_{i = 1}^{N} | {\hat{r}}_{i} - r_{i}^{*} |$
Bandwidth Violation Rate: $BVR = P (\sum_{i} {\hat{r}}_{i} > B_{avail})$
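The three bitrate metrics above can be computed as follows. In this sketch BVR is estimated empirically as the fraction of streaming decisions whose summed tile bitrates exceed the available bandwidth; the data layout (one score row per tile, one allocation list per decision) is an assumption.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of tiles whose target bitrate index is among the k top scores."""
    hits = 0
    for row, target in zip(scores, targets):
        top_k = sorted(range(len(row)), key=lambda idx: row[idx], reverse=True)[:k]
        if target in top_k:
            hits += 1
    return hits / len(targets)

def mean_absolute_error(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def bandwidth_violation_rate(allocations, budgets):
    """Fraction of decisions whose total allocated bitrate exceeds the budget."""
    violations = sum(1 for alloc, b in zip(allocations, budgets) if sum(alloc) > b)
    return violations / len(budgets)
```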

4.4.3. QoE Estimation

Quality-of-experience estimation is assessed via three complementary metrics:

\begin{matrix} Pearson : & ρ = \frac{\sum_{i} (q_{i} - \bar{q}) ({\hat{q}}_{i} - \bar{\hat{q}})}{\sqrt{\sum_{i} {(q_{i} - \bar{q})}^{2} \sum_{i} {({\hat{q}}_{i} - \bar{\hat{q}})}^{2}}}, \end{matrix}

(16)

\begin{matrix} RMSE : & RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(q_{i} - {\hat{q}}_{i})}^{2}}, \end{matrix}

(17)

with Spearman rank correlation $ρ_{s}$ additionally reported for ordinal consistency.
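Equations (16) and (17) translate directly into code. The sketch below covers Pearson correlation and RMSE only; Spearman correlation would apply the same Pearson formula to the ranks of the scores and is omitted here to avoid committing to a tie-handling convention.

```python
import math

def pearson(q, q_hat):
    """Pearson correlation between ground-truth and predicted MOS."""
    mq = sum(q) / len(q)
    mh = sum(q_hat) / len(q_hat)
    cov = sum((a - mq) * (b - mh) for a, b in zip(q, q_hat))
    var_q = sum((a - mq) ** 2 for a in q)
    var_h = sum((b - mh) ** 2 for b in q_hat)
    return cov / math.sqrt(var_q * var_h)

def rmse(q, q_hat):
    """Root-mean-square error between ground-truth and predicted MOS."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, q_hat)) / len(q))
```

The two metrics are complementary: a predictor with a constant offset can reach a Pearson correlation of 1 while still having a large RMSE.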

4.4.4. Tile Selection

Standard information-retrieval metrics are used:

Precision = \frac{TP}{TP + FP}, Recall = \frac{TP}{TP + FN}, F_{1} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall},

(18)

complemented by Intersection over Union $IoU = | S \cap \hat{S} | / | S \cup \hat{S} |$ to measure the spatial overlap between selected and ground-truth tile sets.
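Treating the selected and ground-truth tiles as sets of tile indices, Equation (18) and the IoU can be sketched as below. Returning 0.0 when a denominator is empty is a convention chosen for this illustration, not something the text specifies.

```python
def tile_selection_metrics(selected, ground_truth):
    """Precision, recall, F1, and IoU between two sets of tile indices."""
    selected, ground_truth = set(selected), set(ground_truth)
    tp = len(selected & ground_truth)
    fp = len(selected - ground_truth)
    fn = len(ground_truth - selected)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2.0 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    union = len(selected | ground_truth)
    iou = tp / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```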

4.4.5. 6DoF Volumetric Metrics

The combined 6DoF viewport error weights angular and positional components:

E_{6 DoF} = α d_{angle} (R, \hat{R}) + (1 - α) {∥ p - \hat{p} ∥}_{2},

(19)

with $α = 0.5$. Depth estimation is evaluated using RMSE and $δ_{1}$ accuracy (the percentage of pixels satisfying $\max (d_{i} / {\hat{d}}_{i}, {\hat{d}}_{i} / d_{i}) < 1.25$). Point-cloud quality is measured using point-to-point PSNR and point-to-plane PSNR for smooth surface fidelity.
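The combined error of Equation (19) and the $δ_{1}$ accuracy can be sketched as follows. The angular error is taken as a precomputed scalar, and the units of the angular and positional terms are whatever the caller supplies, since the text does not state how the two are scaled against each other; depths are assumed strictly positive.

```python
import math

def six_dof_error(angle_error, position, position_hat, alpha=0.5):
    """Weighted sum of an angular error and the Euclidean position error."""
    pos_err = math.sqrt(sum((p - q) ** 2 for p, q in zip(position, position_hat)))
    return alpha * angle_error + (1.0 - alpha) * pos_err

def delta1_accuracy(depth, depth_hat, threshold=1.25):
    """Fraction of pixels with max(d / d_hat, d_hat / d) below the threshold."""
    ok = sum(1 for d, d_hat in zip(depth, depth_hat)
             if max(d / d_hat, d_hat / d) < threshold)
    return ok / len(depth)
```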

5. Results

This section reports quantitative outcomes across all five evaluation experiments (Section 4.2). All results are mean ± 95% CI over five independent runs. Statistical significance ($p < 0.05$, Bonferroni-corrected) versus the best baseline is indicated by ^†.

5.1. Experiment 1: Generalization to Unseen Content

To validate the trace-driven evaluation methodology, we conducted a supplementary pilot deployment on a live 5G campus network (Nokia AirScale 3.5 GHz; Nokia Corporation, Espoo, Finland) with 12 participants. ImmerseFM-3D achieved a live-network BVR of 4.1% (vs. 3.1% in trace-driven evaluation, a 1.0 pp gap consistent with natural real-time network variance), VP MAE of 5.6° (vs. 5.21°), and a mean MOS of 3.9/5 (vs. predicted 3.87/5). The close agreement confirms that IMMERSE-1M’s real-world network traces provide a high-fidelity proxy for live deployment performance.

5.1.1. Viewport Prediction

Table 13 reports mean angular error (MAE, in degrees) at five prediction horizons on the held-out test set. ImmerseFM-3D achieves consistent reductions across all horizons, with the largest improvement at the challenging 3 s and 5 s horizons where multi-modal context is most beneficial.

An online user study ($n = 12$, live 5G network, four injected congestion events per video) confirms that ImmerseFM-3D’s real-time QoE predictions correlate with subjective slider ratings at $ρ = 0.871$ under live dynamic network conditions. ImmerseFM-3D’s uncertainty-aware ABR issued pre-emptive quality reductions in 81.3% of congestion events (versus 18.8% for JUST360), reducing the mean perceived quality drop during congestion from 0.79 to 0.31 MOS points (61% reduction).

5.1.2. Bitrate Allocation and Tile Selection

Table 14 presents bitrate allocation accuracy (Top-1 and Top-3), bandwidth violation rate (BVR), and tile selection F1 and IoU. ImmerseFM-3D achieves 92.1% Top-1 bitrate accuracy and reduces BVR from 8.3% (best baseline) to 3.1%, representing a critical practical improvement for streaming reliability.

5.1.3. QoE Estimation

Table 15 reports QoE estimation performance. ImmerseFM-3D achieves a Pearson correlation of 0.891 with human MOS scores, outperforming Voronoi-VQA (0.763) by 0.128, a substantial gain (Cohen’s $d = 1.47$, large effect).

5.1.4. Per-Category Generalization

Figure 8 shows the per-content-category breakdown of viewport prediction MAE at 1 s and QoE Pearson correlation, revealing that gains are consistent across all five categories. The largest relative improvement occurs in the high-motion (sports, action) categories, where multi-modal context from audio and semantics provides the strongest benefit.

5.2. Experiment 2: Few-Shot Personalization

Figure 9 shows QoE improvement over the non-personalized baseline as a function of adaptation data volume for ImmerseFM-3D and all three meta-learning baselines, averaged over 53 test users. Table 16 quantifies key efficiency thresholds.

ImmerseFM-3D reaches 90% of its maximum personalization gain with only 22 s of user data, making it more than twice as sample-efficient as MAML (47 s) and substantially more efficient than fine-tuning (>60 s). At the 1 s operating point, ImmerseFM-3D already achieves +4.8% improvement, demonstrating immediate utility even for first-session users.

5.3. Experiment 3: Cross-Format Transfer

Table 17 reports transfer ratios (performance relative to full in-domain training) at varying amounts of target-domain fine-tuning data. The zero-shot transfer ratio of 72% demonstrates that ImmerseFM-3D’s shared bottleneck representation encodes format-agnostic scene understanding. In contrast, training from scratch with 1% of target data yields only 32% of full performance.

With just 10% of target data, ImmerseFM-3D reaches 94% of full training performance, compared with 67% for scratch training, effectively a $6.6 \times$ reduction in required labeled target-domain data to reach a comparable performance level.

5.4. Experiment 4: Volumetric Video Streaming

Table 18 presents 6DoF viewport prediction, depth estimation, and volumetric QoE results on the 80-video test set.

The depth-aware tile importance weighting (Equation (13)) delivers a 7.3 dB P2P-PSNR gain over the depth-agnostic baseline, confirming the value of geometric reasoning for volumetric content delivery. The spherical harmonic orientation decoder (Equation (12)) reduces 6DoF viewport error by 40.8% relative to the 2D baseline that ignores out-of-plane rotation components.

5.5. Experiment 5: Ablation Studies

Figure 10 visualizes the contribution of each architectural component relative to the full model, averaged over all four task metrics.

Table 19 provides per-task ablation scores to complement the aggregate visualization, confirming that the cross-modal attention fusion and the variational information bottleneck are the two components with the greatest individual impact.

Table 20 presents a consolidated marginal-contribution summary, reporting the absolute and relative performance degradation caused by removing each input modality, averaged across all four tasks.

Three observations emerge from the ablation analysis. First, cross-modal fusion is the single most critical component: removing it degrades average performance by 28.5 pp, more than removing any other individual module. Second, the information bottleneck contributes the second largest gain (21.8 pp average), particularly for out-of-distribution generalization to unseen users. Third, scene semantics (CLIP features) provide a surprisingly large benefit (15.4 pp average), outweighing audio (10.5 pp) in overall importance, reflecting the strong correlation between visual scene type and head-motion statistics.

To evaluate robustness under partial modality availability, we additionally trained a mixture-of-modalities (MoM) variant in which each modality encoder is independently masked with probability 0.3 per training batch, with masked outputs replaced by a learned null embedding. The MoM model recovers 94% of full-modality performance when any single modality is absent, compared with 85–89% for the standard model, at the cost of a 2.1 pp reduction on the full-modality configuration. MoM training is recommended as the default production configuration for deployments where sensor availability cannot be guaranteed.
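The masking scheme used for the MoM variant can be sketched as an independent Bernoulli draw per modality. In this illustration the null embeddings are fixed placeholder vectors standing in for the learned ones, and the modality names are hypothetical keys; whether the original training prevents all seven modalities from being masked at once is not stated, and this sketch does not prevent it.

```python
import random

MODALITIES = ("video", "network", "head_motion", "audio",
              "depth", "eye_tracking", "semantics")

def mask_modalities(features, null_embeddings, p_mask=0.3, rng=random):
    """Replace each modality's features by its null embedding with prob. p_mask."""
    masked = {}
    for name, feat in features.items():
        # Independent draw per modality, repeated for every training batch.
        masked[name] = null_embeddings[name] if rng.random() < p_mask else feat
    return masked
```

At inference time the same null embedding is substituted for any modality that is genuinely unavailable, so the model sees a missing-input pattern it was already trained on.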

5.6. Computational Efficiency

Table 21 profiles inference latency and memory on all target hardware platforms. On the edge server (T4), ImmerseFM-3D completes a full decision cycle in 31.4 ms, comfortably within the 40 ms budget for 25 fps real-time streaming. On mobile devices, the full model exceeds this budget, motivating the knowledge-distilled student model (ImmerseFM-3D-Mobile), which preserves 91% of full-model QoE at 22.1 ms inference latency.

6. Discussion

6.1. Interpretation of Main Results

The results in Section 5 confirm that ImmerseFM-3D advances the state of the art across all four streaming sub-tasks by substantial, statistically significant margins. Three findings are particularly consequential.

Viewport prediction gains compound with horizon. The relative MAE reduction grows from 30.1% at 0.5 s to 29.2% at 5 s (Table 13), demonstrating that the foundation model’s multi-modal context provides growing dividends as prediction difficulty increases. Single-task baselines increasingly revert to inertial heuristics at longer horizons, whereas ImmerseFM-3D leverages scene semantics and audio directional cues to anticipate re-orientation events before they manifest in head-motion history alone.

Bandwidth violations substantially reduced. The BVR reduction from 8.3% to 3.1% (Table 14) has direct implications for user experience: each violation triggers rebuffering or quality degradation that reduces MOS by approximately 0.4 points on the 1–5 scale. The shared bottleneck enables the bitrate decoder to implicitly anticipate network fluctuations by conditioning on the uncertainty of the network encoder (the $σ_{z}^{2}$ term in Equation (7)), a capability absent from all baselines.

QoE estimation approaches inter-rater reliability. A Pearson correlation of 0.891 is close to the inter-rater reliability reported in large-scale subjective studies (typically 0.91–0.95). The multi-modal inputs enable the QoE decoder to capture both technical quality (video and depth features) and perceptual comfort (eye-tracking and head-motion features), whereas Voronoi-VQA is limited to visual salience alone.

6.2. Cross-Modal Synergies

The attention matrix in Figure 11 exposes several non-trivial cross-modal dependencies that corroborate known perceptual phenomena. The elevated U↔E weight (0.15–0.16) reflects the vestibulo-ocular reflex: smooth pursuit eye movements closely follow head trajectory in immersive viewing, making gaze and head-motion mutually predictive [66]. The A↔S interaction (0.14–0.15) captures the phenomenon in which salient audio events (speech onset, sound effects) redirect attention toward semantically congruent visual regions, a cue that single-modality baselines entirely ignore.

Conversely, the network modality (N) maintains near-exclusive self-attention ($A_{N N} = 0.51$), consistent with the expectation that bandwidth fluctuations are statistically independent of visual content in the IMMERSE-1M dataset. This implicit causal disentanglement is reinforced by the information bottleneck, which prevents the network encoder from absorbing spurious correlations with content features.

6.3. Role of the Information Bottleneck

The ablation (Table 19) shows that removing the bottleneck causes a 21.8 pp average performance drop, but removing cross-modal fusion causes 28.5 pp. This ordering reveals complementary roles: the cross-modal attention constructs a task-rich representation from heterogeneous inputs, while the bottleneck filters it to retain only the minimal sufficient statistics for prediction. Together they implement maximum compression with controlled task informativeness, consistent with the information bottleneck principle [42].

The KL annealing schedule (0 to 0.01 over 20 epochs, Section 3.8) is critical: training with fixed $β = 0.01$ from epoch 1 causes posterior collapse in approximately 30% of runs. Annealing allows the encoder to first build a task-relevant representation before compression pressure is applied, a finding consistent with prior variational learning literature.

6.4. Meta-Learning and Personalization Dynamics

The 90%-threshold result of 22 s for ImmerseFM-3D versus 47 s for MAML reflects two compounding advantages. First, episodic memory retrieval provides a warm initialization closer to each individual’s optimum than the population mean, reducing the number of gradient steps needed to converge. Second, the hypernetwork generates globally coherent weight adjustments from a single gradient signal, avoiding the oscillatory convergence observed in vanilla MAML with small inner-loop step counts [43]. The +4.8% QoE improvement from a single second of data is already statistically significant ($p < 0.001$), establishing ImmerseFM-3D as practically useful even in first-session deployments without any prior user data.

The “w/o meta-learning” ablation shows near-zero degradation in viewport MAE but a meaningful drop of 0.019 in QoE $ρ$. This is expected: head-motion prediction benefits from personalization primarily over longer sessions, while QoE is immediately sensitive to individual differences in perceptual comfort thresholds and quality preferences.

6.5. Volumetric and 6DoF Streaming Analysis

The 40.8% reduction in 6DoF viewport error versus the 2D-only baseline validates the spherical harmonic orientation decoder (Equation (12)), which avoids the gimbal lock and $\pm 180 °$ discontinuities inherent in direct Euler-angle regression. The depth RMSE of 0.21 m and $δ_{1}$ accuracy of 91.4% are competitive with specialist monocular depth networks, indicating that multi-modal bottleneck features provide complementary geometric context. The 7.3 dB P2P-PSNR improvement from depth-aware tile importance (Table 18) highlights the sensitivity of point-cloud quality to fetching geometrically appropriate depth layers, a benefit only achievable by architectures that jointly model depth and streaming decisions.

6.6. Computational Considerations and Deployment Pathways

The 31.4 ms edge-server latency and 22.1 ms mobile latency of ImmerseFM-3D-Mobile both satisfy the 40 ms real-time 25 fps budget (Table 21). The 9% QoE penalty of the distilled model is a reasonable trade-off for mobile deployment and is largely recovered through personalization: the student model’s adaptation gain of +13.1% at 30 s nearly offsets the 9% compression penalty on a per-user basis. The dominant training cost is Stage 1 (≈980 GPU-hours on A100s); however, fine-tuning to a new target domain requires only ≈49 GPU-hours, making the paradigm cost-effective for continual adaptation to emerging content formats or network technologies.

All ImmerseFM-3D-Mobile inference figures satisfy the 40 ms real-time decision budget required for 25 fps MPEG-DASH segment control. The knowledge-distilled ImmerseFM-3D-Mobile variant (≈28 M parameters) achieves 18.6 ms on an HTC Vive Pro Eye while retaining 91% of the full model’s QoE performance, making on-device deployment of the complete four-task pipeline practically feasible on current-generation HMD hardware. Furthermore, the unified foundation-model architecture is computationally more efficient than the ensemble of four state-of-the-art single-task baselines (≈450 GFLOPs combined versus 287 GFLOPs for ImmerseFM-3D), reinforcing the practical case for joint multi-task optimization.

The seven modality-specific encoders execute fully in parallel via CUDA stream concurrency; the encoder-stage wall-clock latency is therefore bounded by the single slowest encoder (the 3D Swin-T video encoder, ≈18.1 ms on T4) rather than the sum of all seven. A stage-by-stage latency breakdown is provided in Table 21. The 94.7 ms iPhone 12 figure applies to the full 177 M-parameter FP32 model; the intended on-device configuration is ImmerseFM-3D-Mobile (22.1 ms, FP16).

For practitioners without access to A100-class training hardware, three simplification pathways are available. First, the knowledge-distilled ImmerseFM-3D-Mobile variant (28 M parameters) runs on commodity NPU hardware while retaining 91% of the full model’s QoE performance. Second, the modular encoder–attention–decoder design permits any single task branch to be deployed in isolation, with all encoder outputs standardized to 512-dimensional vectors for clean downstream integration. Third, Stage 1 pre-training alone yields a competitive population-level model, reducing the training budget to approximately 700 A100-h or the equivalent on consumer GPUs.

6.7. Limitations and Future Work

Despite the consistent gains, ImmerseFM-3D has four limitations that define a clear research agenda.

Modality availability. The full 7-modality model requires depth maps and ambisonic audio, which are unavailable for most legacy 360° content. Although the ablation confirms graceful degradation (−8 pp without depth; −10.5 pp without audio), a mixture-of-modalities variant that handles missing inputs without re-training is desirable for practical deployment.

Privacy-preserving personalization. Eye-tracking and head-motion data are sensitive biometric signals. A federated learning extension [53] in which user encoders are trained locally, with only the compressed latent $z$ transmitted to the server, would address this concern while preserving personalization capability.

For production deployments where all seven modalities cannot be guaranteed simultaneously, the mixture-of-modalities (MoM) trained variant is recommended. It maintains 94% of full-modality performance across all single-modality-missing scenarios at a 2.1 pp overhead on the full-modality configuration. All single-modality-missing variants of ImmerseFM-3D remain competitive with or superior to the best available single-task baseline for every sub-task, confirming deployment viability across heterogeneous sensor environments.

The modality availability concern is significantly mitigated by four factors. First, five of the seven modalities (video, network traces, CLIP semantics, head motion, ambisonic audio) are available in standard streaming deployments without specialized hardware. Second, depth maps can be estimated in real time from the video stream using monocular depth networks, eliminating the need for stereoscopic capture in most deployments. Third, the mixture-of-modalities trained variant (Section 5.5) recovers 94% of full-modality performance when any single modality, including the specialized eye-tracking modality, is absent. Fourth, the personalization adapter requires only 1–30 s of head-motion data from the HMD itself, with no pre-collected per-user training data needed. IMMERSE-1M will be released publicly upon acceptance, eliminating the data acquisition barrier for researchers wishing to reproduce or extend this work.
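A minimal sketch of how random modality masking during training could look is given here, assuming NumPy. The 512-dimensional per-modality feature size follows the encoder output size stated in Section 6.6; the drop probability, the choice to always keep video, and zero-filling of masked features are illustrative assumptions, not the configuration used in Section 5.5.

```python
import numpy as np

# Order is illustrative; the names follow the seven modalities in the paper.
MODALITIES = ["video", "network", "head_motion", "audio",
              "depth", "eye_tracking", "scene_semantics"]


def sample_modality_mask(rng, n_modalities=7, p_drop=0.15, always_keep=(0,)):
    """Sample a boolean keep-mask over modalities for one training example.

    p_drop and always_keep are ASSUMED hyperparameters: each modality is
    dropped independently with probability p_drop, except the indices in
    always_keep (here index 0, video), which are never masked.
    """
    keep = rng.random(n_modalities) >= p_drop
    keep[list(always_keep)] = True
    return keep


def apply_mask(features, keep):
    """Zero the feature vectors of masked modalities.

    features: array of shape (n_modalities, dim), one row per encoder output.
    Zero-filling is an assumption; a learned placeholder token or an
    attention mask would be alternative choices.
    """
    return features * keep[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((len(MODALITIES), 512))
    keep = sample_modality_mask(rng)
    masked = apply_mask(feats, keep)
    dropped = [m for m, k in zip(MODALITIES, keep) if not k]
    print("dropped modalities:", dropped)
    print("masked feature shape:", masked.shape)
```

Sampling a fresh mask per training example exposes the fusion layers to every missing-modality pattern, which is the property the single-modality-missing results rely on.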

7. Conclusions

We presented ImmerseFM-3D, the first unified foundation model framework that jointly addresses viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience estimation for 360-degree and volumetric video streaming. By fusing seven heterogeneous input modalities through cross-modal attention and compressing their joint representation via a variational information bottleneck [42], ImmerseFM-3D produces a shared 256-dimensional latent that simultaneously informs all four streaming decisions. The five principal contributions of this work are summarized below, each with its principal empirical evidence.

(1) Joint multi-task foundation architecture.

ImmerseFM-3D is the first architecture to jointly solve all four 360° streaming sub-tasks through a single shared cross-modal representation, eliminating the inter-task error propagation bottleneck that limits siloed pipelines. On the IMMERSE-1M benchmark, the model reduces mean angular viewport error by up to 34% across prediction horizons of 0.5–5 s, cuts the bandwidth violation rate from 8.3% to 3.1% (a 5.2 pp absolute, or 63% relative, reduction), and achieves a QoE Pearson correlation of 0.891, approaching the 0.91–0.95 upper bound set by inter-rater reliability in large-scale subjective studies. These improvements are statistically significant ($p < 0.05$, Bonferroni-corrected) against all baselines across all five content categories.

(2) Variational information bottleneck and cross-format transfer. The variational IB achieves 72% zero-shot cross-format transfer from 360° to volumetric content, versus 48% for pre-trained fine-tuning baselines, and provides implicit uncertainty quantification that enables downstream bitrate decisions to condition on viewport prediction confidence without dedicated confidence estimation modules. Ablation confirms that the IB contributes the second-largest individual architectural gain, and that its KL annealing schedule is essential for avoiding posterior collapse in approximately 30% of fixed-$\beta$ training runs.

(3) Sample-efficient meta-learning personalization. A compound MAML-Hyper-Network-episodic-memory adapter reaches 90% of its maximum personalization gain in 22 s, more than twice as efficient as standalone MAML (47 s), and delivers a statistically significant +4.8% QoE improvement from a single second of user interaction data. This enables personalization in first-session and live-streaming contexts where prior methods fail and remains effective even without pre-session content information.

(4) IMMERSE-1M dataset. The IMMERSE-1M benchmark, released publicly, provides the research community with the largest multi-modal dataset for immersive streaming research to date: 1000 h of 360° and volumetric video, 524 participants, 10,000 network traces, and 50,000+ subjective MOS ratings with synchronized seven-modality recordings spanning six content format categories.

(5) Volumetric 6DoF extension. The spherical harmonic orientation decoder and depth-aware tile importance weighting reduce 6DoF viewport error by 40.8% and increase Point-to-Plane PSNR by 7.3 dB over the depth-agnostic 2D baseline, establishing a strong foundation-model baseline for next-generation holographic streaming.
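The variational information bottleneck summarized in contribution (2) can be sketched in a few lines of NumPy. The sketch shows the closed-form KL term for a 256-dimensional diagonal Gaussian, a KL weight that is annealed rather than fixed, and Monte Carlo uncertainty from several latent samples. The linear warm-up shape, `warmup_steps`, `beta_max`, and the toy decoder are assumptions made for illustration; the paper's actual schedule and decoders are not reproduced here.

```python
import numpy as np

LATENT_DIM = 256  # bottleneck size stated in the paper


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def beta_schedule(step, warmup_steps=10_000, beta_max=1e-3):
    """Linearly anneal the KL weight from 0 to beta_max.

    ASSUMED schedule: the shape and both hyperparameters are illustrative.
    """
    return beta_max * min(1.0, step / warmup_steps)


def sample_latents(mu, log_var, n_samples, rng):
    """Reparameterized samples z_s = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal((n_samples,) + mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def mc_predict(decoder, mu, log_var, n_samples, rng):
    """Monte Carlo mean and variance of a decoder output over latent samples."""
    z = sample_latents(mu, log_var, n_samples, rng)
    preds = np.asarray([decoder(z_s) for z_s in z])
    return preds.mean(axis=0), preds.var(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = 0.1 * rng.standard_normal(LATENT_DIM)
    log_var = -1.0 * np.ones(LATENT_DIM)

    def toy_decoder(z):
        # Hypothetical stand-in for a task decoder.
        return np.tanh(z[:4]).sum()

    kl = kl_to_standard_normal(mu, log_var)
    for step in (0, 5_000, 20_000):
        print(step, "beta =", beta_schedule(step), "weighted KL =", beta_schedule(step) * kl)
    mean, var = mc_predict(toy_decoder, mu, log_var, n_samples=16, rng=rng)
    print("MC mean:", mean, "MC variance:", var)
```

Starting the KL weight at zero lets the decoders learn to use the latent before the regularizer pulls the posterior toward the prior, which is the usual motivation for annealing when a fixed weight collapses the posterior.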

Despite these results, four open challenges define the primary direction for future work.

First, ImmerseFM-3D degrades when depth maps or ambisonic audio are absent. The mixture-of-modalities training regime partially demonstrated in Section 5.5, in which masking random modality subsets during training recovers 94% of full-modality performance with any single modality missing, should be generalized so that this robustness extends to arbitrary sensor configurations. Second, the user encoder processes sensitive biometric signals (eye-tracking and head-motion trajectories); a federated learning extension in which user encoders are trained locally and only the compressed latent is transmitted to the server would address these privacy concerns while preserving personalization capability. Third, causal robustness under dataset distribution shift requires invariant risk minimization beyond what the information bottleneck alone provides. Fourth, the current architecture requires all modalities to be available at the same temporal resolution; a flexible-rate encoder design would better accommodate the heterogeneous sampling rates encountered in real deployments.

Author Contributions

Conceptualization, R.S.G.W.; methodology, R.S.G.W.; software, R.S.G.W.; validation, R.S.G.W.; formal analysis, R.S.G.W.; investigation, R.S.G.W.; data curation, R.S.G.W.; writing-original draft preparation, R.S.G.W.; writing-review and editing, R.S.G.W. and A.F.; visualization, R.S.G.W.; supervision, A.F.; project administration, A.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The IMMERSE-1M dataset introduced in this study, comprising 1000 h of 360° and volumetric video, head-motion and eye-tracking recordings from 524 participants, 10,000 network traces, and 50,000+ mean opinion scores, will be deposited in a publicly accessible repository under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence upon acceptance of this manuscript. The complete source code for ImmerseFM-3D, including all modality-specific encoders, the cross-modal attention module, the variational information bottleneck, all four task decoders, the MAML–HyperNetwork personalisation adapter, and the three-stage training scripts, will be released upon acceptance as open-source software under the MIT Licence at https://github.com/ythinkxyz/ImmerseFM-3D (accessed on 15 March 2026). Pre-trained model checkpoints for ImmerseFM-3D and the knowledge-distilled ImmerseFM-3D-Mobile variant will be made available simultaneously via Hugging Face Hub to enable immediate inference without retraining. The six existing head-motion and eye-tracking corpora consolidated into IMMERSE-1M are publicly available from their original sources: Wu et al. [46], Corbillon et al. [5], Xu et al. [39], David et al. [38], and Rai et al. [47].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

6DoF	Six degrees of freedom
ABR	Adaptive bitrate
BVR	Bandwidth violation rate
CEAP	Continuous physiological and behavioral emotion annotation
CLIP	Contrastive language–image pre-training
CMMST	Cross-modal multi-scale transformer
CNN	Convolutional neural network
CPU	Central processing unit
D-SAV360	Dataset of scanpaths on ambisonic videos
DASH	Dynamic adaptive streaming over HTTP
EDA	Electrodermal activity
EMD	Empirical mode decomposition
EPASS360	Ensemble prediction and allocation streaming system for 360° video
ERP	Equirectangular projection
FOA	First-order ambisonics
FoV	Field of view
GFLOPs	Giga floating-point operations
GNN	Graph neural network
GOP	Group of pictures
GPU	Graphics processing unit
GRU	Gated recurrent unit
HMD	Head-mounted display
HTTP	Hypertext transfer protocol
IB	Information bottleneck
IMMERSE	Immersive multi-modal evaluation and representation for streaming
ITU	International Telecommunication Union
JUST360	Joint utility streaming for 360° video
KKT	Karush–Kuhn–Tucker
KL	Kullback–Leibler (divergence)
LSTM	Long short-term memory
LTE	Long-term evolution
MADRL	Multi-agent deep reinforcement learning
MAE	Mean absolute error
MAML	Model-agnostic meta-learning
MDA	Multi-dimensional attention
MEC	Mobile edge computing
MFTR	Multi-modal fusion transformer
MLP	Multi-layer perceptron
MOS	Mean opinion score
MPEG	Moving Picture Experts Group
MSE	Mean squared error
OMAF	Omnidirectional media application format
P2P-PSNR	Point-to-plane peak signal-to-noise ratio
PSNR	Peak signal-to-noise ratio
QoE	Quality of experience
RGB	Red–green–blue
RL	Reinforcement learning
RMSE	Root mean squared error
RNN	Recurrent neural network
ROI	Region of interest
RTT	Round-trip time
SOTA	State of the art
SRD	Spatial relationship description
STMRQ	Spatiotemporal motion-aware rate-quality predictor
SVC	Scalable video coding
TCN	Temporal convolutional network
VATP360	Viewport adaptive 360° video streaming based on tile priority
VP	Viewport prediction
VPT360	Viewport prediction transformer for 360° video
VQA	Visual quality assessment
VR	Virtual reality

References

Geyer, C.; Daniilidis, K. Omnidirectional video. Vis. Comput. 2003, 19, 405–416.
Argyriou, L.; Economou, D.; Bouki, V. Design methodology for 360° immersive video applications: The case study of a cultural heritage virtual tour. Pers. Ubiquitous Comput. 2020, 24, 843–859.
Hendriks Vettehen, P.; Wiltink, D.; Huiskamp, M.; Schaap, G.; Ketelaar, P. Taking the full view: How viewers respond to 360-degree video news. Comput. Hum. Behav. 2019, 91, 24–32.
Sitzmann, V.; Serrano, A.; Pavel, A.; Agrawala, M.; Gutierrez, D.; Masia, B.; Wetzstein, G. Saliency in VR: How Do People Explore Virtual Environments? IEEE Trans. Vis. Comput. Graph. 2018, 24, 1633–1642.
Corbillon, X.; Simon, G.; Devlic, A.; Chakareski, J. Viewport-Adaptive Navigable 360-Degree Video Delivery. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; IEEE: New York, NY, USA, 2017; pp. 1–7.
Xie, L.; Xu, Z.; Ban, Y.; Zhang, X.; Guo, Z. 360ProbDASH: Improving QoE of 360 Video Streaming Using Tile-based HTTP Adaptive Streaming. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 315–323.
Skupin, R.; Sanchez, Y.; Jiao, L.; Hellge, C.; Schierl, T. Tile-Based Rate Assignment for 360-Degree Video Based on Spatio-Temporal Activity Metrics. In Proceedings of the 2018 IEEE International Symposium on Multimedia (ISM), Taichung, Taiwan, 10–12 December 2018; IEEE: New York, NY, USA, 2018; pp. 65–68.
Son, J.; Ryu, E.S. Tile-based 360-degree video streaming for mobile virtual reality in cyber physical system. Comput. Electr. Eng. 2018, 72, 361–368.
De La Fuente, Y.S.; Bhullar, G.S.; Skupin, R.; Hellge, C.; Schierl, T. Delay Impact on MPEG OMAF’s Tile-Based Viewport-Dependent 360° Video Streaming. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 18–28.
Ozcinar, C.; Cabrera, J.; Smolic, A. Viewport-Aware Omnidirectional Video Streaming Using Visual Attention and Dynamic Tiles. In Proceedings of the 2018 7th European Workshop on Visual Information Processing (EUVIP), Tampere, Finland, 26–28 November 2018; IEEE: New York, NY, USA, 2018; pp. 1–6.
Jiang, X.; Naas, S.A.; Chiang, Y.H.; Sigg, S.; Ji, Y. SVP: Sinusoidal Viewport Prediction for 360-Degree Video Streaming. IEEE Access 2020, 8, 164471–164481.
Bao, Y.; Wu, H.; Zhang, T.; Ramli, A.A.; Liu, X. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; IEEE: New York, NY, USA, 2016; pp. 1161–1170.
Xu, M.; Song, Y.; Wang, J.; Qiao, M.; Huo, L.; Wang, Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2693–2708.
Hu, H.N.; Lin, Y.C.; Liu, M.Y.; Cheng, H.T.; Chang, Y.J.; Sun, M. Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 1396–1405.
Chao, F.Y.; Ozcinar, C.; Smolic, A. Transformer-based Long-Term Viewport Prediction in 360° Video: Scanpath is All You Need. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 6–8 October 2021; IEEE: New York, NY, USA, 2021; pp. 1–6.
Zhang, Z.; Chen, Y.; Zhang, W.; Yan, C.; Zheng, Q.; Wang, Q.; Chen, W. Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 3560–3568.
Tian, Y.; Zhong, Y.; Han, Y.; Chen, F. Viewport prediction with cross modal multiscale transformer for 360° video streaming. Sci. Rep. 2025, 15, 30346.
Guo, Y.; Xu, M.; Jiang, L.; Deng, X.; Zhou, J.; Chen, G.; Sigal, L. Proposal With Alignment: A Bi-Directional Transformer for 360° Video Viewport Proposal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11423–11437.
Lyu, J. A MDA-based multi-modal fusion model for panoramic viewport prediction. Adv. Eng. Innov. 2024, 15, 1–8.
Ghosh, A.; Aggarwal, V.; Qian, F. A Robust Algorithm for Tile-based 360-degree Video Streaming with Uncertain FoV Estimation. arXiv 2018, arXiv:1812.00816.
Zhao, L.; Cui, Y.; Liu, Z.; Zhang, Y.; Yang, S. Adaptive Streaming of 360 Videos with Perfect, Imperfect, and Unknown FoV Viewing Probabilities in Wireless Networks. IEEE Trans. Image Process. 2021, 30, 7744–7759.
Feng, W.; Wang, S.; Dai, Y. Adaptive 360-Degree Streaming: Optimizing With Multi-Window and Stochastic Viewport Prediction. IEEE Trans. Mob. Comput. 2025, 24, 5903–5915.
Setayesh, M.; Wong, V.W.S. Viewport Prediction, Bitrate Selection, and Beamforming Design for THz-Enabled 360° Video Streaming. IEEE Trans. Wirel. Commun. 2025, 24, 1849–1865.
Park, S.; Hoai, M.; Bhattacharya, A.; Das, S.R. Adaptive Streaming of 360-Degree Videos with Reinforcement Learning. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; IEEE: New York, NY, USA, 2021; pp. 1838–1847.
Ao, A.; Park, S. Applying Transformer-Based Computer Vision Models to Adaptive Bitrate Allocation for 360 Live Streaming. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
Nguyen, D.V.; Tran, H.T.T.; Thang, T.C. An Evaluation of Tile Selection Methods for Viewport-Adaptive Streaming of 360-Degree Video. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–24.
Pang, Z. VATP360: Viewport Adaptive 360-Degree Video Streaming based on Tile Priority. arXiv 2023, arXiv:2307.15984.
Croci, S.; Ozcinar, C.; Zerman, E.; Knorr, S.; Cabrera, J.; Smolic, A. Visual attention-aware quality estimation framework for omnidirectional video using spherical Voronoi diagram. Qual. User Exp. 2020, 5, 4.
Van Kasteren, A.; Brunnström, K.; Hedlund, J.; Snijders, C. Quality of experience of 360 video—Subjective and eye-tracking assessment of encoding and freezing distortions. Multimed. Tools Appl. 2022, 81, 9771–9802.
Chiariotti, F. A survey on 360-degree video: Coding, quality of experience and streaming. Comput. Commun. 2021, 177, 133–155.
Shen, G.; Ma, M.; Xu, G. An Optimal SVC Bitstream Schema for Viewport-dependent 360-degree Video Streaming. arXiv 2023, arXiv:2304.05654.
Romero Rondon, M.F.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. TRACK: A New Method from a Re-examination of Deep Architectures for Head Motion Prediction in 360-degree Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5681–5699.
Jiang, Y.; Poularakis, K.; Kiedanski, D.; Kompella, S.; Tassiulas, L. Robust and Resource-efficient Machine Learning Aided Viewport Prediction in Virtual Reality. arXiv 2022, arXiv:2212.09945.
Feng, X.; Swaminathan, V.; Wei, S. Viewport Prediction for Live 360-Degree Mobile Video Streaming Using User-Content Hybrid Motion Tracking. Proc. ACM on Interact. Mob. Wearable Ubiquitous Technol. 2019, 3, 1–22.
Zhang, L.; Chen, P.; Zhang, C.; Pan, C.; Long, T.; Xu, W.; Cui, L.; Liu, J. Optimizing Mobile-Friendly Viewport Prediction for Live 360-Degree Video Streaming. IEEE Trans. Mob. Comput. 2025, 24, 10441–10455.
Shimamura, R.; Feng, Q.; Koyama, Y.; Nakatsuka, T.; Fukayama, S.; Hamasaki, M.; Goto, M.; Morishima, S. Audio–visual object removal in 360-degree videos. Vis. Comput. 2020, 36, 2117–2128.
Liu, H.; Luo, T.; Luo, K.; Jiang, Q.; Sun, P.; Wang, J.; Huang, R.; Chen, Q.; Wang, W.; Li, X.; et al. OmniAudio: Generating Spatial Audio from 360-Degree Video. arXiv 2025, arXiv:2504.14906.
David, E.J.; Gutiérrez, J.; Coutrot, A.; Da Silva, M.P.; Callet, P.L. A dataset of head and eye movements for 360° videos. In Proceedings of the 9th ACM Multimedia Systems Conference, Amsterdam, The Netherlands, 12–15 June 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 432–437.
Xu, Y.; Du, J.; Wang, J.; Ning, Y.; Zhou, S.; Cao, Y. Panonut360: A Head and Eye Tracking Dataset for Panoramic Video. In Proceedings of the ACM Multimedia Systems Conference 2024, Bari, Italy, 15–18 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 319–325.
Zhang, G.; Wu, C.; Gao, Q. Exploiting layer and spatial correlations to enhance SVC and tile based 360-degree video streaming. Comput. Netw. 2021, 191, 107985.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv 2017, arXiv:1703.03400.
Li, J.; Li, Z.; Liu, Z.; Zhou, P.; Hong, R.; Li, Q.; Hu, H. Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information. arXiv 2025, arXiv:2311.16462.
Xue, T.; El Ali, A.; Zhang, T.; Ding, G.; Cesar, P. CEAP-360VR: A Continuous Physiological and Behavioral Emotion Annotation Dataset for 360° VR Videos. IEEE Trans. Multimed. 2023, 25, 243–255.
Wu, C.; Tan, Z.; Wang, Z.; Yang, S. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 193–198.
Rai, Y.; Gutiérrez, J.; Le Callet, P. A Dataset of Head and Eye Movements for 360 Degree Images. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 205–210.
Martinez, J.; Black, M.J.; Romero, J. On Human Motion Prediction Using Recurrent Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 4674–4683.
Manfredi, G.; Racanelli, V.A.; De Cicco, L.; Mascolo, S. LSTM-based Viewport Prediction for Immersive Video Systems. In Proceedings of the 2023 21st Mediterranean Communication and Computer Networking Conference (MedComNet), Island of Ponza, Italy, 13–15 June 2023; IEEE: New York, NY, USA, 2023; pp. 49–52.
Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23.
Gao, B.; Sheng, D.; Zhang, L.; Qi, Q.; He, B.; Zhuang, Z.; Wang, J. STAR-VP: Improving Long-term Viewport Prediction in 360° Videos via Space-aligned and Time-varying Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 5556–5565.
Wang, M.; Peng, S.; Chen, X.; Zhao, Y.; Xu, M.; Xu, C. CoLive: An Edge-Assisted Online Learning Framework for Viewport Prediction in 360° Live Streaming. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6.
Wahba, M.Z.A.; Baldoni, S.; Battisti, F. Learning-Based Viewport Prediction for 360-Degree Videos: A Review. Electronics 2025, 14, 3743.
Qian, F.; Han, B.; Xiao, Q.; Gopalakrishnan, V. Flare: Practical Viewport-Adaptive 360-Degree Video Streaming for Mobile Devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, New Delhi, India, 29 October–2 November 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 99–114.
Li, Z.; Wang, Y.; Liu, Y.; Li, J.; Zhu, P. JUST360: Optimizing 360-Degree Video Streaming Systems with Joint Utility. IEEE Trans. Broadcast. 2024, 70, 468–481.
Zou, J.; Li, C.; Liu, C.; Yang, Q.; Xiong, H.; Steinbach, E. Probabilistic Tile Visibility-Based Server-Side Rate Adaptation for Adaptive 360-Degree Video Streaming. IEEE J. Sel. Top. Signal Process. 2020, 14, 161–176.
Kumar, S.; Bhagat, L.; A., A.F.; Jin, J. Multi-neural network based tiled 360video caching with Mobile Edge Computing. J. Netw. Comput. Appl. 2022, 201, 103342.
Qiu, M.; Shao, F. Blind 360-degree image quality assessment via saliency-guided convolution neural network. Optik 2021, 240, 166858.
Chen, P.W.; Yang, T.S.; Huang, G.L.; Huang, C.W.; Chao, Y.C.; Lu, C.H.; Wu, P.Y. Viewing Bias Matters in 360 Videos Visual Saliency Prediction. IEEE Access 2023, 11, 46084–46094.
Vats, S.; Park, J.; Nahrstedt, K.; Zink, M.; Sitaraman, R.; Hellwagner, H. Semantic-Aware View Prediction for 360-Degree Videos at the 5G Edge. In Proceedings of the 2022 IEEE International Symposium on Multimedia (ISM), Naples, Italy, 5–7 December 2022; IEEE: New York, NY, USA, 2022; pp. 121–128.
Adhuran, J.; Martini, M.G. Efficient viewport prediction and tiling schemes for 360 degree video streaming. In Proceedings of the ACM Multimedia Systems Conference 2024, Bari, Italy, 15–18 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 374–380.
Xu, Y.; Dong, Y.; Wu, J.; Sun, Z.; Shi, Z.; Yu, J.; Gao, S. Gaze Prediction in Dynamic 360° Immersive Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 5333–5342.
Chao, F.Y.; Ozcinar, C.; Smolic, A. A QoE Study for Viewport Prediction Approaches in Tile-Based 360 Video Streaming. 2023. Available online: https://www.researchgate.net/publication/366974389_A_QoE_Study_for_Viewport_Prediction_Approaches_in_Tile-based_360_Video_Streaming?channel=doi&linkId=63bc3ffcc3c99660ebdf4b3f&showFulltext=true (accessed on 15 March 2026).
Zhou, H.; Zhao, F.; Li, C. Multi-scale Historical Trajectory Decomposition for Viewport Prediction in 360-degree Videos. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 1–22.
Liu, S.; Wang, Y.; Li, S.; Liu, Y. MADRL-based bitrate allocation for QoE fairness in 360° video streaming with viewport prediction. Multimed. Syst. 2025, 31, 343.
Corbillon, X.; De Simone, F.; Simon, G. 360-Degree Video Head Movement Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 199–204.

Figure 1. ImmerseFM-3D System Architecture showing the hierarchical inference framework with on-device lightweight uncertainty estimator (LUE) and adaptive partitioning engine.

Figure 2. Variational information bottleneck. Cross-modal features are concatenated and mapped to Gaussian parameters $(\mu_{z}, \sigma_{z}^{2}) \in \mathbb{R}^{256}$. The KL term regularizes toward $\mathcal{N}(0, I)$. Multiple samples $z_{s}$ enable Monte Carlo uncertainty estimates $\sigma_{\hat{y}}^{2}$ for all downstream tasks.

Figure 3. Meta-learning personalization pipeline. A new user’s behavior embedding retrieves similar past adaptations from episodic memory $\mathcal{M}$; a hypernetwork then generates parameter adjustments $\delta\theta$; MAML bi-level optimization ensures the base model can adapt in 5 gradient steps from 1–30 s of support data.

Figure 4. Volumetric video extension. Orientation is decoded via real spherical harmonics up to order $L = 3$; depth encoder features $f_{d}$ modulate tile importance exponentially with depth distance from the predicted viewport center; volumetric QoE is estimated from the concatenation $[z; f_{d}]$.

Figure 5. Three-stage training pipeline. Stage 1 pre-trains all components on the full multi-task objective with KL annealing. Stage 2 meta-trains the adaptation modules on held-out users via MAML. Stage 3 jointly fine-tunes at a reduced learning rate.

Figure 6. IMMERSE-1M dataset composition. (a) Distribution of 1000 h of video content across five content categories (Sports, Nature, Documentary, Urban, Action/Cinematic). (b) Video format breakdown: 360° static/dynamic, high-motion, volumetric, light-field, and synthetic content (total 1000 h); the final two segments (Light Field, Synthetic) each represent 100 h. (c) Network trace source distribution (total: 10,000 traces) spanning FCC 4G LTE, 5G mmWave (ns-3 simulated), HSDPA, campus WiFi, Starlink, and controlled synthetic conditions, covering bandwidths 0.1–800 Mbps and RTTs 2–500 ms.

Figure 7. Overview of the five evaluation experiments. All experiments draw from the IMMERSE-1M dataset; each targets a distinct generalization capability of ImmerseFM-3D with dedicated metrics.

Figure 8. Per-content-category results. (a) Viewport prediction MAE at 1 s horizon (°); lower is better. (b) QoE Pearson correlation $\rho$ with MOS; higher is better. ImmerseFM-3D (blue) consistently outperforms the best single-task baseline (orange) across all five content categories. All improvements are statistically significant ($p < 0.05$, Bonferroni-corrected).

Figure 9. QoE improvement over non-personalized baseline versus adaptation data volume. Shaded band: 95% CI for ImmerseFM-3D. The dashed horizontal line indicates 90% of ImmerseFM-3D’s maximum gain. The red marker highlights the 30 s operating point from Table 16.

Figure 10. Ablation study: average performance across all four tasks (viewport prediction, bitrate allocation, tile selection, QoE estimation) relative to the full ImmerseFM-3D model (100%). The largest individual contributions come from the cross-modal attention fusion (−28.5 pp) and the information bottleneck (−21.8 pp). Removing any single component degrades performance by at least 14.7 pp.

Figure 11. Average cross-modal attention weight matrix $A \in R^{7 \times 7}$ aggregated over the full test set. Modality codes: V = video, N = network, U = user head motion, A = audio, D = depth, E = eye tracking, S = scene semantics. Strong diagonal entries indicate self-attention dominance; notable off-diagonal peaks (U ↔ E: 0.15–0.16; A ↔ S: 0.14–0.15; V ↔ U: 0.18) reveal meaningful cross-modal dependencies learned without explicit supervision.

Table 1. Comparative summary of representative 360° streaming methods. ✓ = addressed; ∘ = partially; – = not addressed. VP: viewport prediction; ABR: adaptive bitrate; TS: tile selection; QoE: quality-of-experience estimation; MM: multi-modal input (>1 modality); Pers.: user personalization; 6DoF: volumetric support.

Method	VP	ABR	TS	QoE	MM	Pers.	6DoF
VPT360 [15]	✓	–	–	–	–	–	–
MFTR [16]	✓	–	∘	–	∘	–	–
CMMST [17]	✓	–	–	–	∘	–	–
TRACK [32]	✓	–	–	–	–	–	–
STAR-VP [51]	✓	–	–	–	∘	–	–
360ProbDASH [6]	–	✓	✓	–	–	–	–
JUST360 [55]	–	✓	✓	–	–	–	–
RL-Streaming [24]	–	✓	–	–	–	–	–
VATP360 [27]	–	✓	✓	–	–	–	–
Voronoi-VQA [28]	–	–	–	✓	–	–	–
CEAP-360VR [45]	–	–	–	✓	✓	–	–
CoLive [52]	✓	–	–	–	–	✓	–
MAML [43]	∘	–	–	–	–	✓	–
STVP [44]	✓	–	–	–	–	–	✓
ImmerseFM-3D (ours)	✓	✓	✓	✓	✓	✓	✓

Table 2. Modality-specific encoder architectures and parameter counts.

Modality	Architecture	Input Size	Params (M)
Video ( $X_{v}$ )	3D Swin Transformer (4 stages)	$16 \times H \times W \times 3$	42.3
Network ( $X_{n}$ )	Dilated TCN (4 layers, $d = 1, 2, 4, 8$ )	$64 \times 3$	1.2
Head Motion ( $X_{u}$ )	Transformer Encoder (6L, 8H)	$150 \times 3$	8.7
Audio ( $X_{a}$ )	ResNet-50 + Ambisonics MLP	Mel-spectrogram	24.6
Depth ( $X_{d}$ )	3D CNN (5 layers)	$T_{d} \times H \times W$	3.8
Eye Tracking ( $X_{e}$ )	BiLSTM (2 layers)	$150 \times 4$	2.1
Scene Semantics ( $X_{s}$ )	CLIP ViT-B/32 + Transformer (frozen)	Key frames @ 1 fps	86.4
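As a rough aid for reading the network-encoder row of Table 2, the sketch below computes the receptive field of a stack of dilated 1D convolutions with the listed dilations $d = 1, 2, 4, 8$ and compares it with the 64-step input window. The kernel size and the number of convolutions per layer are not given in the table, so the values tried here are assumptions for illustration only.

```python
def tcn_receptive_field(dilations, kernel_size, convs_per_layer=1):
    """Receptive field (in time steps) of stacked dilated 1D convolutions.

    Each convolution with kernel size k and dilation d widens the
    receptive field by (k - 1) * d steps.
    """
    return 1 + convs_per_layer * (kernel_size - 1) * sum(dilations)


DILATIONS = (1, 2, 4, 8)  # from Table 2
SEQ_LEN = 64              # first dimension of the 64 x 3 network input in Table 2

# Kernel sizes below are hypothetical; Table 2 does not state the kernel size.
for k in (2, 3, 5):
    rf = tcn_receptive_field(DILATIONS, k)
    print(f"kernel={k}: receptive field={rf} steps, covers window={rf >= SEQ_LEN}")
```

Under these assumptions, a single convolution per layer with a small kernel sees only part of the 64-step trace, which suggests the actual encoder may use larger kernels, two convolutions per residual block, or a pooling step after the TCN.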

Table 3. ImmerseFM-3D component summary with parameter counts. Total: ≈177 M.

Component	Architecture	Params	Output Dim
Video Encoder	3D Swin-T (4 stages)	42.3 M	$T^{'} \times H^{'} \times W^{'} \times 512$
Network Encoder	Dilated TCN (4 layers)	1.2 M	$R^{512}$
User Encoder	Transformer (6L, 8H)	8.7 M	$R^{512}$
Audio Encoder	ResNet-50 + Amb. MLP	24.6 M	$R^{512}$
Depth Encoder	3D CNN (5 layers)	3.8 M	$R^{512}$
Eye Tracking Encoder	BiLSTM (2 layers)	2.1 M	$R^{512}$
Scene Semantics	CLIP ViT-B/32 + Trans. (frozen)	86.4 M	$R^{512}$
Cross-Modal Attention	4 layers, 8 heads	4.2 M	$R^{7 \times 512}$
Information Bottleneck	MLP $512 \to 256$	0.13 M	$R^{256}$
Task Decoders (4×)	MLP + GRU	1.8 M	Task-dependent
Meta-Learning Adapter	HyperNet + Episodic Mem.	2.3 M	$δ θ$
Total		≈177 M
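The total in Table 3 can be checked by summing the per-component counts. The short sketch below does so, assuming the 1.8 M listed for the task decoders covers all four decoders together and that the whole 86.4 M scene-semantics branch is frozen (the table marks it "frozen" but does not break the figure down).

```python
# Parameter counts in millions, copied from Table 3.
PARAMS_M = {
    "Video Encoder": 42.3,
    "Network Encoder": 1.2,
    "User Encoder": 8.7,
    "Audio Encoder": 24.6,
    "Depth Encoder": 3.8,
    "Eye Tracking Encoder": 2.1,
    "Scene Semantics": 86.4,
    "Cross-Modal Attention": 4.2,
    "Information Bottleneck": 0.13,
    "Task Decoders (4x)": 1.8,   # assumed to be the combined count
    "Meta-Learning Adapter": 2.3,
}

total_m = sum(PARAMS_M.values())
# Assumption: the entire scene-semantics branch is frozen during training.
trainable_m = total_m - PARAMS_M["Scene Semantics"]

print(f"total = {total_m:.2f} M, trainable (assumed) = {trainable_m:.2f} M")
```

The sum comes to roughly 177.5 M, in line with the ≈177 M stated in the caption; under the frozen-branch assumption, about half of that would be updated during training.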

Table 4. IMMERSE-1M video content by category, source, and format.

Category	Primary Source	Duration	Resolution	Format
360° Static	Wu et al. [46]; Corbillon et al. [5]	200 h	4–8 K	ERP, Cubemap
360° Dynamic	Xu et al. [62]; David et al. [38]	250 h	4–8 K	ERP
360° High Motion	Rai et al. [47]	150 h	6–8 K	ERP
Volumetric	8i Voxelized; Microsoft RGB-D	200 h	$512^{3}$	Point clouds, meshes
Light Field	Stanford Lytro; new captures	100 h	$5 \times 5$ views	LF images
Synthetic	Unreal Engine renders	100 h	4–8 K	ERP + depth
Total		1000 h

Table 5. IMMERSE-1M user behavior data sources, modalities, and sampling rates.

Source	Users	Avg. Duration	Modalities	Rate
Wu et al. [46]	48	60 s/video	Head (6DoF)	30 Hz
Corbillon et al. [5]	59	70 s/video	Head (3DoF)	30 Hz
Xu et al. [62]	120	180 s/video	Head + Eye	90/30 Hz
David et al. [38]	57	30–60 s	Head + Eye	60 Hz
Rai et al. [47]	40	20 s/image	Head + Eye	30 Hz
New recordings	200	300 s/video	Head + Eye + Pupillometry	90 Hz
Total	524	≈1200 h

Table 6. IMMERSE-1M network trace sources and characteristics.

Source	Type	Samples	BW Range	RTT Range
FCC 4G	Real-world	3000	0.5–50 Mbps	20–150 ms
5G mmWave (ns-3)	Simulated	2000	10–800 Mbps	5–30 ms
HSDPA	Real-world	2000	0.1–10 Mbps	50–500 ms
WiFi (campus)	Real-world	1500	1–200 Mbps	2–50 ms
Starlink	Real-world	500	20–200 Mbps	20–80 ms
Controlled synthetic	Generated	1000	Variable patterns	Variable
Total		10,000

Table 7. IMMERSE-1M dataset splits (user-disjoint and content-disjoint).

Split	Videos	Users	Net. Traces	Purpose
Pre-training	80% (800 h)	80% (419)	80% (8000)	Model pre-training (Stage 1)
Meta-training	10% (100 h)	10% (52)	10% (1000)	MAML adaptation (Stage 2)
Test	10% (100 h)	10% (53)	10% (1000)	Final held-out evaluation
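The user counts in Table 7 (419 / 52 / 53) follow from an 80/10/10 partition of the 524 participants with the remainder assigned to the test split. A minimal sketch of such a user-disjoint split is given below; the shuffling procedure, the seed, and the integer user IDs are assumptions, since the actual assignment procedure is not described in the table.

```python
import random


def user_disjoint_split(user_ids, frac_pre=0.8, frac_meta=0.1, seed=0):
    """Partition users so that no user appears in more than one split."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)      # deterministic shuffle (seed is arbitrary)
    n_pre = int(len(ids) * frac_pre)      # floor of 80%
    n_meta = int(len(ids) * frac_meta)    # floor of 10%
    return {
        "pre-training": ids[:n_pre],
        "meta-training": ids[n_pre:n_pre + n_meta],
        "test": ids[n_pre + n_meta:],     # remainder goes to the test split
    }


# Hypothetical integer IDs standing in for the 524 participants.
splits = user_disjoint_split(range(524))
print({name: len(members) for name, members in splits.items()})
```

The same pattern would apply to videos and network traces; a content-disjoint split additionally requires that no video shared between users crosses split boundaries.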

Table 8. Head-motion statistics by participant geographic group. Values are mean ± SD across all participants within each cohort. p-values are from two-sample t-tests; Cohen’s d measures effect size.

Metric	East Asian ( $n = 168$ )	W. European ( $n = 356$ )	p-Value	Cohen’s d
Equatorial bias (%)	$68.3 \pm 9.1$	$65.7 \pm 10.4$	0.014	0.26
Exploration radius (°)	$32.4 \pm 8.3$	$35.1 \pm 9.6$	0.003	0.30
Saccade freq. (min⁻¹)	$11.2 \pm 3.4$	$12.8 \pm 4.1$	0.001	0.43
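For readers who want to relate the effect sizes in Table 8 to the reported means and standard deviations, the sketch below computes Cohen's d with the usual pooled standard deviation. Which variant of the pooled SD the authors used is not stated, so this is one plausible reading; values recomputed from the rounded table entries land within about 0.02 of the reported ones.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d using the sample-size-weighted pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return abs(mean_a - mean_b) / math.sqrt(pooled_var)


N_EAST_ASIAN, N_W_EUROPEAN = 168, 356  # cohort sizes from Table 8

# (metric, East Asian mean, SD, W. European mean, SD, reported d) from Table 8.
ROWS = [
    ("Equatorial bias (%)", 68.3, 9.1, 65.7, 10.4, 0.26),
    ("Exploration radius (deg)", 32.4, 8.3, 35.1, 9.6, 0.30),
    ("Saccade freq. (1/min)", 11.2, 3.4, 12.8, 4.1, 0.43),
]

recomputed = {
    name: cohens_d(m_a, s_a, N_EAST_ASIAN, m_b, s_b, N_W_EUROPEAN)
    for name, m_a, s_a, m_b, s_b, _ in ROWS
}
for name, *_, reported in ROWS:
    print(f"{name}: recomputed d = {recomputed[name]:.2f}, reported d = {reported}")
```

By the conventional thresholds, all three effects fall in the small-to-medium range, so the group differences are statistically detectable but modest in size.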

Table 9. Ablation study configurations. Each variant removes or replaces one component of the full ImmerseFM-3D model.

Variant	Description
Full model	Complete ImmerseFM-3D (all components)
w/o cross-modal	Separate task-specific encoders; no cross-modal fusion
w/o bottleneck	Direct feature pass-through; no information bottleneck
w/o meta-learning	No personalization adapter; fixed population-level model
w/o uncertainty	Deterministic latent (single sample, $S = 1$ )
w/o audio	Audio encoder removed; 6 modalities only
w/o depth	Depth encoder removed; 2D video content only
w/o semantics	CLIP scene encoder removed; no semantic features
Single-task	Separate independently trained model per task

Table 10. Viewport prediction baselines.

Method	Description	Reference
VPT360	Transformer-based trajectory prediction	Chao et al. [63]
MFTR	Multi-modal fusion transformer for viewport estimation	Zhang et al. [16]
DHP	Deep reinforcement learning viewport prediction	Xu et al. [13]
EMD-ML	Empirical mode decomposition + LSTM	Zhou et al. [64]
STMRQ	Spatiotemporal graph convolutional network	Liu et al. [65]

Table 11. Bitrate adaptation baselines.

Method	Description	Reference
360ProbDASH	Probabilistic tile pre-fetching ABR	Xie et al. [6]
Flare	Viewport-adaptive bitrate streaming	Qian et al. [54]
EPASS360	Ensemble viewport prediction + allocation	Zhang et al. [40]
JUST360	Joint utility optimization for 360° streaming	Li et al. [55]
RL-Streaming	Deep RL for adaptive bitrate selection	Park et al. [24]

Table 12. QoE estimation and meta-learning personalization baselines.

Method	Description	Reference
QoE Estimation
Voronoi-VQA	Visual attention-aware 360° quality assessment	Croci et al. [28]
Blind360	Blind 360° image quality via saliency CNN	Qiu and Shao [58]
CEAP-360VR	Physiological + behavioral QoE estimation	Xue et al. [45]
Meta-Learning Personalization
MAML	Model-Agnostic Meta-Learning	Golub et al. [43]
Reptile	First-order meta-learning (no second-order grad.)	Jiang et al. [33]
Fine-tuning	Standard transfer learning from pre-trained model	—

Table 13. Viewport prediction MAE (°) at increasing prediction horizons. Lower is better. Best result per horizon is bold; ^† p < 0.05 vs. best baseline.

Method	0.5 s	1 s	2 s	3 s	5 s
VPT360 [15]	3.21	8.97	16.42	24.35	38.71
MFTR	2.84	7.90	14.63	22.10	35.44
DHP	3.47	9.21	17.05	25.80	40.12
EMD-ML	2.91	8.12	15.01	21.90	34.87
STMRQ	2.76	7.83	14.20	21.45	34.10
ImmerseFM-3D (ours)	1.93 ^†	5.21 ^†	9.87 ^†	14.82 ^†	24.16 ^†
±95% CI	±0.04	±0.09	±0.18	±0.26	±0.41
Improvement vs. best	−30.1%	−33.5%	−30.5%	−31.0%	−29.2%
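The "Improvement vs. best" row of Table 13 can be recomputed from the table entries as the relative change in MAE against the strongest baseline at each horizon. The sketch below does this with the rounded values printed in the table; small deviations from the printed row (about 0.1 percentage points at 3 s and 5 s) presumably stem from that rounding.

```python
HORIZONS = ["0.5 s", "1 s", "2 s", "3 s", "5 s"]

# Viewport prediction MAE (degrees) per horizon, copied from Table 13.
BASELINES = {
    "VPT360": [3.21, 8.97, 16.42, 24.35, 38.71],
    "MFTR": [2.84, 7.90, 14.63, 22.10, 35.44],
    "DHP": [3.47, 9.21, 17.05, 25.80, 40.12],
    "EMD-ML": [2.91, 8.12, 15.01, 21.90, 34.87],
    "STMRQ": [2.76, 7.83, 14.20, 21.45, 34.10],
}
OURS = [1.93, 5.21, 9.87, 14.82, 24.16]


def improvement_vs_best(baselines, ours):
    """Relative MAE change (%) against the lowest-error baseline per horizon."""
    changes = []
    for i, mae in enumerate(ours):
        best = min(row[i] for row in baselines.values())
        changes.append(100.0 * (mae - best) / best)
    return changes


improvements = improvement_vs_best(BASELINES, OURS)
for horizon, change in zip(HORIZONS, improvements):
    print(f"{horizon}: {change:+.1f}%")
```

In this table the strongest baseline is STMRQ at every horizon, so the row effectively reports the gap to STMRQ.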

Table 14. Bitrate allocation and tile selection results. ↑ higher is better; ↓ lower is better. Best per column is bold.

Method	Acc@1 ↑	Acc@3 ↑	BVR ↓	MAE ↓	F1 ↑	IoU ↑
360ProbDASH [6]	70.3	88.4	14.2	0.91	0.71	0.59
Flare [54]	74.8	90.1	11.7	0.84	0.74	0.62
EPASS360 [40]	77.1	91.8	10.3	0.78	0.77	0.65
JUST360 [55]	81.2	93.5	8.3	0.71	0.82	0.70
RL-Streaming	79.4	92.7	9.1	0.75	0.79	0.67
ImmerseFM-3D	92.1 ^†	97.6 ^†	3.1 ^†	0.43 ^†	0.91 ^†	0.84 ^†
±95% CI	±0.4	±0.2	±0.1	±0.02	±0.01	±0.01

^† Statistically significant improvement over the best baseline ($p < 0.05$, Bonferroni-corrected paired t-test, $n = 5$ independent runs).
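The Bonferroni correction referred to in the footnote can be sketched as follows: each raw p-value is multiplied by the number of tests in the family and then compared with the significance level. The p-values and the family size of six (one test per metric column) are hypothetical and chosen only to illustrate the mechanics; they are not taken from the paper.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) for each test in the family."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


# Hypothetical raw p-values from six paired t-tests (illustrative only).
raw_p = [0.001, 0.004, 0.009, 0.020, 0.003, 0.002]

results = bonferroni(raw_p)
for p, (p_adj, significant) in zip(raw_p, results):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}, significant = {significant}")
```

The example shows why the correction matters: a raw p-value of 0.009 would pass an uncorrected 0.05 threshold but fails once the family of six tests is accounted for.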

Table 15. QoE estimation results. ↑ higher is better; ↓ lower is better.

Method	Pearson $ρ ↑$	Spearman $ρ_{s} ↑$	RMSE ↓
Voronoi-VQA [28]	0.763	0.741	0.521
Blind360 [58]	0.712	0.694	0.567
CEAP-360VR	0.741	0.728	0.538
ImmerseFM-3D	0.891 ^†	0.874 ^†	0.312 ^†
±95% CI	±0.008	±0.009	±0.011

^† Statistically significant improvement over the best baseline ($p < 0.05$, Bonferroni-corrected paired t-test, $n = 5$ independent runs).

Table 16. Few-shot personalization efficiency metrics. “90%-threshold” is the adaptation data (seconds) required to reach 90% of maximum QoE gain for each method.

Method	1 s	5 s	10 s	30 s	90%-Threshold (s)
Fine-tuning	+1.2	+3.8	+5.6	+8.1	>60
Reptile	+2.1	+5.3	+7.4	+10.2	>60
MAML [43]	+3.4	+6.9	+8.8	+11.3	47
ImmerseFM-3D	+4.8 ^†	+9.1 ^†	+11.7 ^†	+15.7 ^†	22 ^†

^† Statistically significant improvement over the best baseline ($p < 0.05$, Bonferroni-corrected paired t-test, $n = 5$ independent runs).
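The "90%-threshold" column of Table 16 can be approximated from the tabulated gains by linear interpolation. The sketch below does this for the ImmerseFM-3D row only, under the assumption that the maximum gain equals the 30 s value; the paper's curve may extend beyond 30 s, and the same assumption would not reproduce the MAML row, whose gain presumably keeps rising after 30 s.

```python
def time_to_fraction_of_max(times, gains, fraction=0.9):
    """Earliest time at which the gain reaches `fraction` of its maximum.

    Uses linear interpolation between sampled points and returns None
    if the target is never reached within the sampled range.
    """
    target = fraction * max(gains)
    if gains[0] >= target:
        return times[0]
    points = list(zip(times, gains))
    for (t0, g0), (t1, g1) in zip(points, points[1:]):
        if g0 < target <= g1:
            return t0 + (target - g0) * (t1 - t0) / (g1 - g0)
    return None


TIMES_S = [1, 5, 10, 30]              # adaptation data volumes from Table 16
GAINS_OURS = [4.8, 9.1, 11.7, 15.7]   # ImmerseFM-3D row of Table 16

threshold_s = time_to_fraction_of_max(TIMES_S, GAINS_OURS)
print(f"estimated 90% threshold: {threshold_s:.1f} s")
```

Under this assumption the estimate falls close to the 22 s reported for ImmerseFM-3D, which suggests that its gain has largely saturated by the 30 s operating point.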

Table 17. Cross-format transfer ratios (% of full in-domain training performance). Transfer source: 360° video. Target: volumetric video test set. Higher is better; ^† p < 0.05.

Method	Zero-Shot	1%	5%	10%	25%
Train from scratch	—	32	51	67	82
Fine-tuning (pre-trained)	48	61	74	84	92
MAML transfer	55	66	78	87	94
ImmerseFM-3D	72 ^†	81 ^†	89 ^†	94 ^†	97 ^†

Table 18. Volumetric video streaming results. VP: viewport prediction; Depth: monocular depth estimation; Volumetric QoE: point-cloud quality + MOS correlation.

Method	6DoF VP $E_{6 DoF}$ ↓	6DoF VP Pos. (cm) ↓	Depth RMSE ↓	Depth $δ_{1}$ (%) ↑	Vol. QoE P2P-PSNR ↑	Vol. QoE MOS $ρ$ ↑
2D-only (no depth)	18.4	12.3	—	—	28.3	0.641
Depth-agnostic	15.7	10.8	0.38	79.1	31.4	0.703
ImmerseFM-3D	9.3 ^†	6.1 ^†	0.21 ^†	91.4 ^†	38.7 ^†	0.862 ^†
±95% CI	±0.4	±0.3	±0.01	±0.6	±0.4	±0.012

Bold values indicate the best result per column. ^† Statistically significant improvement over the best baseline ($p < 0.05$, Bonferroni-corrected paired t-test, $n = 5$ independent runs). ↑ higher is better; ↓ lower is better.

Table 19. Per-task ablation results. VP MAE @ 1 s (°), Bitrate Acc@1 (%), Tile F1, and QoE $ρ$. Relative degradation vs. full model in parentheses.

Variant	VP MAE ↓	Acc@1 ↑	Tile F1 ↑	QoE $ρ ↑$
Full model	5.21	92.1	0.910	0.891
w/o cross-modal	7.83 (−50%)	78.4 (−13.7)	0.741 (−0.169)	0.724 (−0.167)
w/o bottleneck	6.94 (−33%)	81.3 (−10.8)	0.792 (−0.118)	0.763 (−0.128)
w/o meta-learning	5.21 (—)	91.8 (−0.3)	0.908 (−0.002)	0.872 (−0.019)
w/o uncertainty	5.47 (−5%)	89.6 (−2.5)	0.891 (−0.019)	0.863 (−0.028)
w/o audio	5.89 (−13%)	87.4 (−4.7)	0.871 (−0.039)	0.844 (−0.047)
w/o depth	5.64 (−8%)	90.2 (−1.9)	0.882 (−0.028)	0.867 (−0.024)
w/o semantics	6.41 (−23%)	84.7 (−7.4)	0.831 (−0.079)	0.791 (−0.100)
Single-task	7.21 (−38%)	77.8 (−14.3)	0.751 (−0.159)	0.731 (−0.160)

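↓ lower is better; ↑ higher is better. Values in parentheses indicate degradation relative to the full ImmerseFM-3D model: relative change (%) for VP MAE, where the negative sign denotes worse performance (higher error), and absolute difference for Acc@1, Tile F1, and QoE $ρ$.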
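The parenthesized values in Table 19 mix two conventions, which the sketch below reproduces for the "w/o cross-modal" row: a relative percentage for VP MAE and an absolute difference for the other three metrics. The dictionary keys are illustrative names, not identifiers from the paper.

```python
# Rows copied from Table 19.
FULL = {"vp_mae": 5.21, "acc1": 92.1, "tile_f1": 0.910, "qoe_rho": 0.891}
NO_CROSS_MODAL = {"vp_mae": 7.83, "acc1": 78.4, "tile_f1": 0.741, "qoe_rho": 0.724}


def degradation(full, variant):
    """Degradation of an ablated variant relative to the full model."""
    # VP MAE: relative increase in error, in percent of the full-model error.
    out = {"vp_mae_pct": 100.0 * (variant["vp_mae"] - full["vp_mae"]) / full["vp_mae"]}
    # Remaining metrics: absolute difference (negative means worse).
    for key in ("acc1", "tile_f1", "qoe_rho"):
        out[key + "_abs"] = variant[key] - full[key]
    return out


deg = degradation(FULL, NO_CROSS_MODAL)
for name, value in deg.items():
    print(f"{name}: {value:+.3f}")
```

Note that the computed VP MAE change is positive (error goes up by about 50%), whereas Table 19 prints it with a minus sign to mark it as a degradation; Table 20 reports the same quantity as +50%.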

Table 20. Marginal contribution of each input modality to ImmerseFM-3D performance. Each column reports the degradation relative to the full model when the indicated modality is removed. VP MAE: mean angular error at 1 s horizon (lower is better); BR Acc: Top-1 bitrate accuracy (%); Tile F1: tile selection F1 score; QoE $ρ$: Pearson correlation with MOS. Avg. $Δ$: mean relative degradation across all four tasks.

Removed Modality	VP MAE ↑	BR Acc ↓	Tile F1 ↓	QoE $ρ$ ↓	Avg. $Δ$
None (full model)	5.21°	92.1%	0.910	0.891	—
Scene semantics (CLIP)	+23%	−7.4 pp	−0.079	−0.100	−15.4 pp
Audio (ambisonics)	+13%	−4.7 pp	−0.039	−0.047	−10.5 pp
Depth maps	+8%	−1.9 pp	−0.028	−0.024	−5.5 pp
Eye tracking	+5%	−2.5 pp	−0.019	−0.028	−4.9 pp
Context: component-level ablations
w/o cross-modal fusion	+50%	−13.7 pp	−0.169	−0.167	−28.5 pp
w/o info. bottleneck	+33%	−10.8 pp	−0.118	−0.128	−21.8 pp
Single-task (no MTL)	+38%	−14.3 pp	−0.159	−0.160	−27.2 pp

For VP MAE, ↑ indicates the percentage increase in error (higher degradation is worse). For BR Acc, Tile F1, and QoE $ρ$, ↓ indicates the decrease relative to the full model (larger negative values indicate greater degradation). pp: percentage points.

Table 21. Inference latency and memory profile across target hardware platforms. ImmerseFM-3D-Mobile: knowledge-distilled student model (≈28 M params).

Model	Platform	Lat. (ms)	Peak Mem. (MB)	GFLOPs	Rel. QoE
ImmerseFM-3D	A100 (training)	8.2	6814	287	100%
ImmerseFM-3D	T4 (edge server)	31.4	3142	287	100%
ImmerseFM-3D	iPhone 12	94.7	812	287	100%
ImmerseFM-3D	Pixel 6	88.3	742	287	100%
ImmerseFM-3D-Mobile	T4 (edge server)	12.1	924	81	91%
ImmerseFM-3D-Mobile	iPhone 12	22.1	248	81	91%
ImmerseFM-3D-Mobile	Pixel 6	20.8	231	81	91%
ImmerseFM-3D-Mobile	HTC Vive Pro Eye	18.6	194	81	91%
JUST360 [55]	T4 (edge)	18.9	1203	124	88%
MFTR	T4 (edge)	22.3	1847	163	85%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gallena Watthage, R.S.; Fernando, A. ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding. Appl. Sci. 2026, 16, 3424. https://doi.org/10.3390/app16073424
