Article

ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding

by Reka Sandaruwan Gallena Watthage * and Anil Fernando
Department of Computer & Information Sciences, University of Strathclyde, Glasgow G1 1XQ, UK
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3424; https://doi.org/10.3390/app16073424
Submission received: 20 February 2026 / Revised: 16 March 2026 / Accepted: 24 March 2026 / Published: 1 April 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Current 360-degree video streaming systems treat viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience (QoE) estimation as independent tasks, yielding fragmented pipelines that generalize poorly across content types, network conditions, and individual users. We propose ImmerseFM-3D, a foundation model that jointly solves all four sub-tasks through a single shared representation. Seven input modalities, namely video frames, network traces, head-motion trajectories, ambisonics audio, depth maps, eye-tracking signals, and CLIP scene semantics, are fused by four-layer cross-modal attention and compressed into a 256-dimensional bottleneck latent via a variational information bottleneck. Four task-specific decoders operate on this shared latent simultaneously. A model-agnostic meta-learning adapter augmented with episodic memory and a hypernetwork personalizes the model from as little as 1 s of user interaction data. An extended branch supports six-degrees-of-freedom volumetric content through spherical harmonic viewport decoding and depth-aware tile importance weighting. Trained and evaluated on the IMMERSE-1M combined dataset (1000 h of 360° and volumetric video, 524 users, and over 50,000 mean opinion scores), ImmerseFM-3D reduces the mean angular viewport error by 34%, lowers the bandwidth violation rate from 8.3% to 3.1%, and achieves a QoE Pearson correlation of 0.891. The personalization adapter reaches 90% of peak performance in 22 s, while zero-shot cross-format transfer attains 72% of full in-domain accuracy.

1. Introduction

Omnidirectional video has evolved from an early technical curiosity [1] into a mass-market medium propelled by the rapid commoditization of consumer head-mounted displays (HMDs) and the proliferation of immersive media platforms across entertainment, education, healthcare, and telepresence [2]. Global immersive video traffic is projected to exceed 580 exabytes annually by 2027, driven by applications spanning live sports, virtual tourism, cultural heritage preservation, and remote collaboration. Empirical studies confirm that 360° video elicits measurably higher presence, enjoyment, and perceived credibility than conventional 2D footage when delivered at sufficient quality [3,4]. Yet streaming omnidirectional content faithfully over best-effort networks remains fundamentally harder than conventional video delivery: a single 4K equirectangular frame encodes up to 36 megapixels, only 20% of which any given user will perceive within their 110°-wide viewport at any instant. The gap between what must be transmitted and what is actually consumed creates an acute bandwidth efficiency problem, further compounded by the stringent motion-to-photon latency constraint (<20 ms) required to prevent cybersickness [3].
Viewport-adaptive streaming addresses this fundamental mismatch by spatially partitioning the equirectangular sphere into independently coded rectangular tiles and allocating quality resources in proportion to the probability that each tile falls within the user’s future field of view [5,6]. Tile granularity, layout, and the mechanism by which predicted viewport uncertainty is translated into bitrate decisions have been explored extensively in the literature [7,8,9,10]. This overall paradigm decomposes the streaming problem into four tightly interdependent sub-tasks: viewport prediction, adaptive bitrate (ABR) allocation, tile selection, and quality-of-experience (QoE) estimation. The research community has addressed each sub-task with increasingly sophisticated yet fundamentally siloed models. Viewport predictors have progressed from early sinusoidal and position-extrapolation heuristics [11,12] through LSTM, GRU, and deep reinforcement learning [13,14] to modern transformer architectures [15,16,17] and cross-modal multi-scale attention mechanisms [18,19], yet fundamentally remain oblivious to concurrent network state or spatial audio cues. Bitrate adaptation algorithms have incorporated stochastic viewport uncertainty [20,21,22], beamforming co-design for millimetre-wave and THz networks [23], and reinforcement learning for QoE-driven decision-making [24,25]; however, all assume a fixed, separately optimized viewport estimate as their only behavioral input. Tile selection strategies [26,27] and QoE estimators [28,29] operate analogously in isolation, with no shared internal representation across tasks and no feedback path from downstream decisions to the upstream predictor.
This architectural fragmentation creates three compounding problems that collectively impose a hard performance ceiling on viewport-adaptive streaming systems. First, prediction errors propagate non-gracefully: a viewport prediction error of even 5° at a 1 s horizon displaces entire columns of high-quality tiles outside the actual viewport, wasting up to 30% of available bandwidth [30]. While probabilistic pre-fetching [6] and stochastic optimization frameworks [20,21] partially mitigate this, they cannot recover information already discarded by the upstream predictor. Shen et al. [31] demonstrated that even the codec structure amplifies this problem through GOP-induced switching latency, motivating new bitstream schemas. Second, task-specific models generalize poorly across content, users, and networks: a viewport predictor trained on documentary footage degrades significantly on high-motion sports content without retraining [32], personalization to a new user typically requires tens of minutes of behavioral data [33], and live streaming contexts lack the pre-session content information that many state-of-the-art predictors require [34,35]. Third, rich cross-modal signals are entirely discarded: the well-documented influence of spatial audio onset events on gaze direction [36,37], the predictive relationship between eye-tracking micro-saccades and upcoming head reorientations [38,39], the role of scene semantic context in shaping long-term viewing trajectories [4], and content-complexity correlates of network bitrate sensitivity [40] are entirely ignored by all existing streaming pipelines.
We argue that these limitations share a common root cause: the absence of a unified representation that jointly encodes scene content, network state, user behavior, and perceptual quality. Motivated by the success of large-scale multi-modal pre-training in vision [15,41], and by information bottleneck theory [42] as a principled mechanism for compressing heterogeneous representations to task-sufficient statistics, we propose ImmerseFM-3D, a foundation model framework for generalizable 360-degree and volumetric video streaming. ImmerseFM-3D maps seven heterogeneous input modalities (equirectangular video frames, network throughput traces, head-motion trajectories, first-order ambisonics audio, monocular depth maps, eye-tracking signals, and CLIP-derived scene semantic embeddings) into a shared 256-dimensional bottleneck latent via four-layer cross-modal attention and a variational information bottleneck. Four task-specific decoders simultaneously share this representation, enabling end-to-end multi-task optimization with no per-task feature extraction overhead. A MAML-based meta-learning adapter [43], augmented with an episodic memory module and a weight-generating hypernetwork, personalizes the model from as little as 1 s of interaction data. An extended volumetric branch supports 6DoF content through spherical harmonic viewport decoding and depth-aware tile importance weighting, connecting 360° streaming research to next-generation holographic delivery [44].
The principal contributions of this paper are as follows:
  • ImmerseFM-3D architecture. The first unified foundation model that jointly solves all four 360° streaming sub-tasks through a single shared cross-modal latent representation. Unlike ensembles of task-specific models, ImmerseFM-3D eliminates the inter-task error propagation bottleneck: the bitrate decoder implicitly conditions on viewport prediction uncertainty through the shared bottleneck, a capability structurally impossible in siloed pipelines. This yields reductions in mean angular viewport error (VP MAE) of up to 34%, a 63% reduction in bandwidth violation rate (BVR), and a tile F1 improvement of 11 percentage points over the best single-task baselines (Section 5.1).
  • Variational information bottleneck for multi-modal streaming. The first application of information bottleneck theory [42] to compress a seven-modality immersive streaming representation into a 256-dimensional sufficient statistic. This enables (a) implicit uncertainty-aware bitrate decisions without dedicated confidence estimation modules, and (b) 72% zero-shot cross-format transfer to volumetric content versus 48% for pre-trained fine-tuning by learning domain-agnostic compressed representations (Section 5.3).
  • Sample-efficient meta-learning personalization. A compound personalization module combining MAML [43], a HyperNetwork, and episodic memory reduces the 90% threshold adaptation time to 22 s from ≥47 s for standalone MAML and delivers statistically significant QoE gains from a single second of user data across all four sub-tasks simultaneously. To our knowledge, it is the first 360° streaming personalization system to achieve this at scale (Section 5.2).
  • IMMERSE-1M dataset. The largest publicly released multi-modal benchmark for immersive streaming research: 1000 h of 360° and volumetric video, 524 participants, 10,000 network traces, and 50,000+ subjective MOS ratings [45] with synchronized head-motion [46,47], eye-tracking [39], ambisonics audio, and network traces across six content format categories (Section 4.1).
  • Volumetric 6DoF extension. Spherical harmonic viewport decoding and depth-aware tile importance weighting [44] reduce 6DoF viewport error by 40.8% and increase Point-to-Plane PSNR by 7.3 dB over the depth-agnostic 2D baseline, establishing the first foundation-model baseline for next-generation holographic streaming (Section 3.7 and Section 5.4).
The remainder of this paper is organized as follows. Section 2 reviews related work across the four sub-task domains and meta-learning personalization. Section 3 details the ImmerseFM-3D architecture. Section 4 describes the IMMERSE-1M dataset, evaluation protocols, baselines, and metrics. Section 5 presents comprehensive experimental results, ablations, and efficiency analyses. Section 6 interprets the findings, analyzes cross-modal attention patterns, and discusses limitations and future directions. Section 7 concludes.

2. Related Work

2.1. Viewport Prediction: From Heuristics to Foundation Representations

Viewport prediction is the enabling technology for bandwidth-efficient 360° streaming. The problem was first formalized when it became apparent that motion-prediction-based transmission could reduce bandwidth consumption by up to 45% relative to viewport-agnostic delivery [12]. Foundational datasets from Wu et al. [46] and Rai et al. [47] established the first publicly available corpora of head-motion traces recorded inside HMDs, documenting systematic viewing patterns, including equatorial bias and early-session free exploration, across diverse 360° content categories. These observations were corroborated by Sitzmann et al. [4] through controlled eye-tracking experiments that quantified how users allocate attention in virtual environments, showing that visual saliency maps derived from standard 2D attention models fail to capture the spherical geometry of immersive viewing. David et al. [38] extended this to simultaneous eye and head movement recordings from 57 participants across 19 videos, and Xu et al. [39] released Panonut360, a high-frequency (90 Hz) dataset combining head and eye tracking across a large panoramic video corpus, enabling research into the intra-session dynamics that determine prediction difficulty at different horizons.
Early predictors fitted polynomial or sinusoidal motion models [11] to recent trajectory windows, achieving competitive accuracy at sub-second horizons but degrading rapidly beyond 1 s. Recurrent neural networks improved medium-horizon accuracy: deep reinforcement learning was applied to predict head-movement scanpaths [13], while domain-inspired RL formulations for 360° sports piloting demonstrated the value of multi-scale temporal context [14]. Martinez et al. [48] established key properties of GRU-based sequence models for human body motion that were subsequently adapted to 360° head prediction [49]. Spherical convolution architectures addressed the systematic projection distortions that impair weight sharing in conventional 2D CNNs applied to equirectangular video [50].
Transformer-based architectures now define the state of the art. VPT360 [15] demonstrated that a pure scanpath-attention model surpasses CNN-RNN hybrids in both accuracy and computational efficiency. MFTR [16] extended this to multi-modal fusion of trajectory and video saliency; CMMST [17] applied cross-modal attention across multiple spatial scales; and BiT [18] introduced bidirectional transformer alignment to maintain temporal coherence across viewport proposal frames. TRACK [32] systematically re-examined deep architectures for head-motion prediction, providing architectural insights adopted by subsequent methods, while STAR-VP [51] proposed space-aligned and time-varying feature fusion that preserves spatial locality in the fused representation. Multi-modal fusion has been further advanced through MDA-based attention addressing differential information density across trajectory, visual, and audio modalities [19], and through optimized mobile-friendly architectures that meet device-side inference latency budgets for live streaming [35].
For live scenarios where pre-session content is unavailable, dedicated methods use Gaussian Mixture Model motion tracking [34] and edge-assisted collaborative learning across simultaneous viewers [52]. For the emerging class of 6DoF volumetric content, Li et al. [44] proposed STVP, a saliency-and-trajectory predictor for holographic video that addresses the additional positional degrees of freedom beyond yaw, pitch, and roll.
Despite this rapid progress, a comprehensive survey by Wahba et al. [53] identifies three persistent limitations shared by virtually all existing methods: single-modality processing, poor cross-content generalization, and the need for per-user retraining. ImmerseFM-3D directly addresses all three through cross-modal fusion, variational information bottleneck regularization, and MAML-based few-shot personalization.

2.2. Adaptive Bitrate Streaming and Tile Selection

The foundational paradigm for tile-based adaptive streaming was established by Corbillon et al. [5], who showed that cube-map tiling with viewport-weighted bitrate allocation outperforms uniform quality delivery at equal bandwidth. Son and Ryu [8] demonstrated the viability of tile-based delivery for mobile VR in a cyber-physical system context. Skupin et al. [7] investigated rate assignment guided by spatiotemporal video activity metrics to reduce inter-tile quality variance, and De la Fuente et al. [9] analyzed the delay impact of MPEG OMAF’s viewport-dependent streaming architecture on perceived fidelity, identifying end-to-end delay as the primary quality bottleneck. Content-adaptive tiling layouts that co-optimize the spatial partition with bitrate allocation have been shown to reduce average streamed bitrate by 32–70% relative to uniform grids [10].
Viewport-adaptive ABR algorithms have grown progressively more sophisticated. 360ProbDASH [6] introduced probabilistic pre-fetching that explicitly accounts for viewport prediction uncertainty through a QoE-driven optimization framework, reducing spatial quality variance by 46% in trace-driven experiments. Flare [54] demonstrated practical mobile deployment with online allocation algorithms achieving up to 4.9× viewport quality improvement in LTE networks. JUST360 [55] formulated a joint utility maximization problem across all tiles incorporating network dynamics, viewport probability distributions, and rate-distortion characteristics. Ghosh et al. [20] provided a stochastic optimization framework with probabilistic quality guarantees under uncertain head movement and bandwidth, while Zhao et al. [21] extended this to wireless multi-user networks with KKT and CCCP-based solvers for scenarios ranging from perfect to completely unknown FoV distributions. Shen et al. [31] proposed an SVC-based bitstream schema that structurally eliminates GOP-induced viewport-switch latency, enabling one-frame-time viewport transitions.
Reinforcement learning has emerged as the dominant paradigm for sequential bitrate decisions. Park et al. [24] applied RL with 3D-CNN viewport features, achieving 1.3–1.7× bitrate improvement over tile-based baselines. VATP360 [27] coupled RL with a tile priority classifier derived from object motion and ROI predictions. Transformer-based saliency models pre-trained on 2D images have been repurposed for 360° ABR without fine-tuning [25], and Feng et al. [22] introduced multi-window stochastic viewport prediction with model-predictive control, improving overall QoE by 16–19% over prior baselines. Setayesh and Wong [23] jointly optimized viewport prediction, bitrate selection, and beamforming vectors for THz-enabled 6G networks, where beam management and streaming quality are inextricably coupled.
On the tile selection side, Nguyen et al. [26] provided the most systematic comparative evaluation to date across ten existing strategies, finding strong sensitivity to segment duration, buffer size, and content type. Probabilistic tile visibility modeling with Laplace error distributions [56] yields near-optimal server-side allocation; MEC-assisted caching with LSTM-CNN joint models [57] addresses the cache replacement problem. Layer and spatial correlations in the bitstream have been exploited to enhance SVC and tile-based streaming efficiency [40].
All existing systems treat bitrate allocation and tile selection as downstream consumers of a fixed viewport estimate. ImmerseFM-3D is the first architecture to close this loop, sharing a single uncertainty-aware latent representation across all streaming decisions so that the bitrate decoder implicitly conditions on viewport prediction confidence.

2.3. Quality of Experience Estimation and Multi-Modal Perception

QoE estimation for 360° video must account for spherical projection distortions, viewport-dependent perceptual salience, cybersickness, and individual comfort preferences, a substantially harder problem than conventional 2D quality assessment. Croci et al. [28] proposed a spherical Voronoi quality estimation framework with visual-attention weighting, demonstrating that patch-based quality metrics weighted by attention closely track subjective MOS. Qiu and Shao [58] extended saliency-driven assessment to blind (no-reference) scenarios using CNNs, while Chen et al. [59] established that equatorial viewing bias significantly distorts saliency map validity, a confound that must be corrected in QoE-driven models. Van Kasteren et al. [29] conducted a controlled psychophysical study with integrated eye-tracking, showing that freezing events and quality degradations interact with gaze patterns in nuanced ways, confirming that eye movements carry strong diagnostic information for QoE prediction beyond what video quality metrics alone provide.
Multi-modal physiological signals offer powerful additional predictors of immersive QoE. Xue et al. [45] released CEAP-360VR, a continuous dataset of EDA, heart rate, skin temperature, head and eye movements, and pupillometry during 360° VR viewing, demonstrating that multi-modal physiological data substantially improves automatic emotion and QoE annotation. Cross-modal audio-visual effects further shape immersive quality: Shimamura et al. [36] showed that synchronized spatial audio removal is necessary to preserve perceived immersion during object editing, and Liu et al. [37] demonstrated that generating spatially accurate first-order ambisonics from 360° video content substantially improves listener immersion, validating the audio modality as a first-class QoE signal. These findings motivate ImmerseFM-3D’s explicit inclusion of ambisonic audio as an input modality, an architectural choice not present in any existing streaming QoE estimator.
From a system perspective, Vats et al. [60] showed that semantic video analysis at the 5G edge, combined with tile post loading, substantially improves pre-fetch quality under real-world latency constraints, one of the first systems to exploit semantic context alongside network state. Adhuran and Martini [61] further investigated the co-optimization of tiling schemes and viewport prediction. Despite this body of work, no published QoE estimator to date exploits the full complement of eye-tracking, head-motion covariation, spatial audio, depth, and scene semantics that ImmerseFM-3D jointly encodes.

2.4. Meta-Learning, Personalization, and Collaborative Adaptation

User behavior in 360° viewing is profoundly individual. Head-motion statistics, exploration style, comfort thresholds for cybersickness, and quality preferences vary substantially across viewers [4,47], making a population-mean model necessarily suboptimal for any specific user. The first systematic study of personalization in the 360° streaming context was conducted by Jiang et al. [33], who proposed a MAML [43]-based meta-learning framework for viewport prediction and pre-fetch size adaptation. Their evaluation demonstrated statistically significant QoE gains for users who deviate substantially from the population mean, precisely those who suffer most under non-personalized systems. Critically, their analysis also revealed that population-trained models can perform worse than a zero-order motion extrapolation baseline for users with non-standard head-motion patterns, underscoring the urgency of personalization.
While first-order meta-learning variants such as Reptile simplify computation, both MAML and Reptile require several minutes of interaction data before adaptation converges in practice. Collaborative approaches offer a complementary direction: CoLive [52] proposed edge-assisted online learning where concurrent clients share gradient information, enabling rapid warm-start for new viewers without requiring labeled user data. However, collaborative methods introduce privacy concerns and assume a sufficiently large concurrent user population. Spherical convolution-based FoV prediction [50] has been proposed for multicast scenarios with limited FoV feedback, an important practical constraint that personalization systems must also respect.
ImmerseFM-3D advances the personalization frontier by combining MAML with a HyperNetwork that generates globally coherent parameter updates from support-set gradients and an episodic memory module that provides a warm initialization from behaviorally similar prior users. This compound design reduces the 90%-threshold adaptation time from ≥47 s (MAML alone) to 22 s, delivers statistically significant gains from a single second of interaction data, and remains effective in live streaming contexts where pre-session content information is unavailable [34,35]. To our knowledge, ImmerseFM-3D is the first 360° streaming system to demonstrate personalization from a single second of interaction data at scale across all four streaming sub-tasks simultaneously.
Table 1 summarizes the relative merits of representative 360° streaming methods across the four sub-tasks and three system capabilities evaluated in this paper. ImmerseFM-3D is the only published method to address viewport prediction, ABR allocation, tile selection, and QoE estimation jointly, while simultaneously supporting seven-modality fusion, user personalization, and 6DoF volumetric content.

3. Proposed Methodology

3.1. System Architecture Overview

ImmerseFM-3D adopts a unified foundation-model paradigm that jointly addresses viewport prediction, bitrate adaptation, tile selection, and quality-of-experience (QoE) estimation through a shared cross-modal representation space. Unlike conventional task-specific pipelines that optimize each component in isolation, our framework learns transferable representations that generalize across diverse content genres, network conditions, and user populations.
The architecture comprises four primary components:
  • Modality-specific encoders that map seven heterogeneous input streams into a common 512-dimensional latent space;
  • Cross-modal fusion module with an information bottleneck that compresses the joint representation to 256 dimensions while discarding task-irrelevant variation;
  • Task-specific decoders that transform the shared representation into actionable streaming decisions; and
  • Meta-learning adaptation module that enables rapid personalization to individual users from 1–30 s of interaction data.
Figure 1 shows the full system architecture.

3.2. Problem Formulation

We formulate 360° video streaming as a multi-task learning problem over a heterogeneous input space. Let $\mathcal{X} = \{X_v, X_n, X_u, X_a, X_d, X_e, X_s\}$ denote the input modality set, where $X_v$ contains equirectangular video frames, $X_n$ contains bandwidth/latency/jitter measurements, $X_u$ contains head-orientation trajectories $(\theta_t, \phi_t, \psi_t)$ at 30–90 Hz, $X_a$ contains first-order ambisonics signals, $X_d$ contains per-frame depth maps for volumetric content, $X_e$ contains eye-tracking gaze and pupillometry, and $X_s$ contains CLIP semantic embeddings.
Head-orientation trajectories $\{(\theta_t, \phi_t, \psi_t)\}$ are recorded by the HMD’s inertial measurement unit at native rates spanning 30–90 Hz across the devices and source datasets consolidated in IMMERSE-1M [46,47]. This range is bounded below by the ≈40 Hz Nyquist limit for voluntary saccadic head movements (mechanical bandwidth ≤ 20 Hz) and above by the 90 Hz threshold beyond which further oversampling provides no additional perceptual information at prediction horizons of 0.5–5 s. Variable-rate sequences are normalized to a canonical 90 Hz grid via linear interpolation and encoded with continuous-time positional encodings that preserve absolute timestamp information regardless of the original acquisition rate.
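The resampling step can be illustrated with a minimal sketch. It assumes strictly increasing timestamps in seconds and a single angle channel that has already been unwrapped, so that linear interpolation does not cut across the ±180° seam. The function name `resample_linear` and its arguments are hypothetical and are not taken from the ImmerseFM-3D implementation.

```python
from bisect import bisect_right


def resample_linear(times, values, rate_hz=90.0):
    """Linearly interpolate (time, value) samples onto a uniform grid.

    Assumes `times` is strictly increasing (seconds) and `values` holds one
    unwrapped angle channel; both names are illustrative.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two samples of equal length")
    step = 1.0 / rate_hz
    # Number of grid points that fit inside the recorded time span.
    n = int((times[-1] - times[0]) * rate_hz + 1e-9) + 1
    out = []
    for i in range(n):
        t = times[0] + i * step
        # Right-hand neighbour index, clamped to a valid interval.
        j = min(max(bisect_right(times, t), 1), len(times) - 1)
        t0, t1 = times[j - 1], times[j]
        w = (t - t0) / (t1 - t0)
        out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out


# A 30 Hz trace rising by 3 units per sample, resampled to 90 Hz.
trace_90hz = resample_linear([0.0, 1 / 30, 2 / 30], [0.0, 3.0, 6.0])
```

In this toy trace, every 30 Hz interval is split into three 90 Hz steps, so the resampled values rise by one unit per grid point.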
The objective is to learn $f_\Theta : \mathcal{X} \to \mathcal{Y}$, where the output space $\mathcal{Y} = \{Y_{vp}, Y_{br}, Y_{ts}, Y_{qe}\}$ contains future viewport positions, per-tile bitrate decisions, tile selection probabilities, and MOS-aligned QoE estimates, respectively.

3.3. Modality-Specific Encoders

Each modality is processed by a dedicated encoder that maps its raw signal to a common $\mathbb{R}^{512}$ latent vector, as summarized in Table 2.

3.3.1. Video Encoder

A 3D Swin Transformer [41] processes $T_v = 16$ frames at 30 fps. Spherical patching corrects equirectangular distortion by mapping patch coordinates to $(\theta, \phi)$ before applying 3D shifted-window attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + B\right) V,$$
where $B$ encodes learnable spatiotemporal relative position biases. The four progressive stages downsample by factors $\{4, 8, 16, 32\}$ while expanding channels as $\{128, 256, 512, 512\}$, producing $f_v \in \mathbb{R}^{T_v \times H \times W \times 512}$.

3.3.2. Network Encoder

Bandwidth, latency, and jitter traces are encoded by a temporal convolutional network (TCN) with exponentially growing dilation rates:
$$h_n^{(\ell)} = \mathrm{Conv1D}\!\left(h_n^{(\ell-1)};\ k = 3,\ d = 2^{\ell-1},\ C = 64 \cdot 2^{\ell-1}\right), \qquad \ell = 1, \dots, 4.$$
Global average pooling collapses the temporal axis to $f_n \in \mathbb{R}^{512}$.
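To make the dilation schedule concrete, the following sketch tabulates the per-layer dilation, channel width, and cumulative receptive field implied by kernel size 3, dilation 2^(l-1), and 64·2^(l-1) channels for l = 1, …, 4. It assumes exactly one convolution per level, which the text does not state. The helper name `tcn_schedule` is hypothetical.

```python
def tcn_schedule(num_layers=4, kernel_size=3, base_channels=64):
    """Tabulate dilation, channels, and receptive field per TCN level.

    Assumes a single dilated convolution per level (an illustrative
    simplification, not a confirmed detail of the network encoder).
    """
    layers = []
    receptive_field = 1
    for level in range(1, num_layers + 1):
        dilation = 2 ** (level - 1)
        channels = base_channels * 2 ** (level - 1)
        # Each dilated convolution widens the field by (k - 1) * d samples.
        receptive_field += (kernel_size - 1) * dilation
        layers.append({
            "level": level,
            "dilation": dilation,
            "channels": channels,
            "receptive_field": receptive_field,
        })
    return layers


for row in tcn_schedule():
    print(row)
```

Under these assumptions the last level has 512 channels, matching the 512-dimensional output after pooling, and each output step sees 31 input samples.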

3.3.3. User Behavior Encoder

Head-motion trajectories $\{(\theta_t, \phi_t, \psi_t)\}_{t=1}^{T_u}$ ($T_u = 150$ at 30 Hz) are embedded by an MLP and refined by a 6-layer, 8-head causal Transformer. Temporal aggregation uses learned attention weights:
$$f_u = \sum_{t=1}^{T_u} \alpha_t\, h_{u,t}^{(L)}, \qquad \alpha = \mathrm{Softmax}\!\left(w_\alpha^{\top} \tanh\!\left(W_\alpha h_u^{(L)}\right)\right).$$
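The temporal aggregation can be sketched in plain Python as follows. The projection sizes and the function name `attention_pool` are illustrative assumptions. A real implementation would use a tensor library and learned parameters.

```python
import math


def attention_pool(hidden, W_alpha, w_alpha):
    """Pool T hidden states into one vector with learned attention weights.

    hidden:  T x D list of final-layer states (one row per time step)
    W_alpha: A x D projection matrix, w_alpha: length-A scoring vector
    (shapes are assumptions made for this sketch).
    """
    scores = []
    for h in hidden:
        proj = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W_alpha]
        scores.append(sum(a * p for a, p in zip(w_alpha, proj)))
    # Numerically stable softmax over the time axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden)) for d in range(dim)]
    return pooled, alphas


# A zero scoring vector gives uniform weights, so pooling is the mean.
pooled, alphas = attention_pool([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0]], [0.0])
```

With a zero scoring vector the weights are uniform and the pooled vector is the temporal mean. Non-zero scoring weights shift mass toward the time steps with the highest scores.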

3.3.4. Audio Encoder

Ambisonic signals are converted to 128-band mel-spectrograms and processed by ResNet-50. Directional components $A_{mn}^{\sigma}$ up to order 3 are encoded by an auxiliary MLP and concatenated: $f_a = [f_{\mathrm{spec}}; f_{\mathrm{amb}}] \in \mathbb{R}^{768}$, then projected to $\mathbb{R}^{512}$.

3.3.5. Depth, Eye-Tracking and Semantics Encoders

The depth encoder applies a 5-layer 3D CNN followed by global 3D average pooling. The eye-tracking encoder uses a bidirectional LSTM operating on gaze-plus-pupil sequences ($T_e = 150$, $d = 4$). Scene semantics are derived from frozen CLIP ViT-B/32 embeddings at 1 fps, aggregated by a lightweight Transformer whose [CLS] token serves as $f_s \in \mathbb{R}^{512}$.

3.4. Cross-Modal Fusion with Information Bottleneck

Figure 2 illustrates the fusion and compression pipeline.

3.4.1. Cross-Modal Attention

Modality features are combined as tokens $M = [m_v, m_n, m_u, m_a, m_d, m_e, m_s]$, where each $m_i = f_i + p_i$ incorporates a learnable modality-type positional encoding $p_i$. Full cross-modal attention is computed as:
$$A = \mathrm{SoftMax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \in \mathbb{R}^{7 \times 7}, \qquad M' = A V \in \mathbb{R}^{7 \times 512},$$
applied over 4 residual layers with layer normalization.
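A single attention layer over the seven modality tokens can be sketched as below. For brevity the sketch uses identity query, key, and value projections, a single head, and no residual connection or layer normalization. These are simplifying assumptions, and the name `cross_modal_attention` is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(tokens):
    """Single-head attention over modality tokens.

    Uses identity Q/K/V projections (an assumption for illustration).
    Returns the fused tokens and the N x N attention matrix.
    """
    d_k = len(tokens[0])
    scale = math.sqrt(d_k)
    weights = [
        softmax([sum(q * k for q, k in zip(q_tok, k_tok)) / scale for k_tok in tokens])
        for q_tok in tokens
    ]
    fused = [
        [sum(w * v_tok[d] for w, v_tok in zip(row, tokens)) for d in range(d_k)]
        for row in weights
    ]
    return fused, weights


# Seven toy tokens of dimension 4 stand in for the 512-dimensional features.
tokens = [[float(i + d) for d in range(4)] for i in range(7)]
fused, attn = cross_modal_attention(tokens)
```

Each row of the attention matrix sums to one and records how strongly one modality attends to the other six, the kind of pattern an analysis of cross-modal attention would examine.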

3.4.2. Variational Information Bottleneck

The bottleneck objective balances compression and task-relevance [42]:
$$\mathcal{L}_{\mathrm{IB}} = I(X; Z) - \beta\, I(Z; Y),$$
where the encoder parameterizes a Gaussian $q(z \mid x) = \mathcal{N}(\mu_z, \mathrm{diag}(\sigma_z^2))$ and the KL regularizer is:
$$\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_z, \sigma_z^2) \,\|\, \mathcal{N}(0, I)\right) = -\frac{1}{2} \sum_{j=1}^{256} \left(1 + \log \sigma_{z,j}^2 - \mu_{z,j}^2 - \sigma_{z,j}^2\right).$$
The bottleneck dimension $d_z = 256$ and $\beta = 0.01$ are determined by cross-validation; $\beta$ is linearly annealed from 0 to 0.01 over the first 20 training epochs.
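The closed-form KL term and the linear annealing of β can be sketched as follows. The function names, the use of a log-variance parameterization, and the zero-based epoch index are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), closed form."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def beta_schedule(epoch, beta_max=0.01, warmup_epochs=20):
    """Linear warm-up of the KL weight from 0 to beta_max, then constant.

    `epoch` is assumed to be zero-based.
    """
    return beta_max * min(1.0, epoch / warmup_epochs)


# A posterior equal to the prior incurs no KL penalty.
print(kl_to_standard_normal([0.0] * 256, [0.0] * 256))
print([round(beta_schedule(e), 4) for e in (0, 5, 10, 20, 40)])
```

Starting from a zero KL weight lets the decoders learn to use the latent before compression pressure is applied, which is the usual motivation for this kind of annealing.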
To assess sensitivity to bottleneck hyperparameters, we performed a grid search over $d_z \in \{64, 128, 256, 512\}$ and $\beta \in \{0.001, 0.005, 0.01, 0.05, 0.1\}$ on a held-out 5% validation split (Table 2). The selected configuration ($d_z = 256$, $\beta = 0.01$) maximizes the joint objective of task performance and zero-shot cross-format transfer, while remaining robust to posterior collapse across all independent training runs. Values of $\beta \geq 0.05$ exhibit increasing collapse risk and degraded generalization, validating the KL annealing schedule described in Section 3.8.

3.4.3. Uncertainty Estimation

During inference, $S = 10$ Monte Carlo samples from the posterior yield predictive mean and variance:
$$\hat{y} = \frac{1}{S} \sum_{s=1}^{S} f_{\mathrm{dec}}(z_s), \qquad \sigma_{\hat{y}}^2 = \frac{1}{S} \sum_{s=1}^{S} \left(f_{\mathrm{dec}}(z_s) - \hat{y}\right)^2.$$
The value $S = 10$ is the Pareto-optimal operating point on the estimation-accuracy/inference-latency frontier: it reduces the relative standard error of the MC uncertainty estimate to 2.7% at an additional T4 inference cost of only 2.1 ms, remaining within the 40 ms real-time streaming decision budget. An empirical sweep over $S \in \{1, 3, 5, 10, 20, 50\}$ confirms diminishing accuracy returns above $S = 10$ that do not justify the proportionally larger latency penalties. Calibration diagrams on the test set confirm that the resulting $\sigma_{\hat{y}}^2$ estimates are well-calibrated (97.3% interval coverage).
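The Monte Carlo estimate can be sketched for a scalar-valued decoder as below. The decoder passed in, the function name `mc_predict`, and the fixed seed are illustrative assumptions. The sketch draws reparameterized samples z = μ + σ·ε and reports the sample mean and the 1/S-normalized variance.

```python
import random


def mc_predict(decoder, mu, sigma, num_samples=10, seed=0):
    """Monte Carlo predictive mean and variance for a scalar decoder.

    Draws z = mu + sigma * eps with eps ~ N(0, 1) per latent dimension;
    `decoder` is any callable mapping a latent list to a float.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_samples):
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        outputs.append(decoder(z))
    mean = sum(outputs) / num_samples
    variance = sum((o - mean) ** 2 for o in outputs) / num_samples
    return mean, variance


# Toy decoder: the sum of the latent coordinates.
mean, variance = mc_predict(sum, [1.0, 2.0], [0.5, 0.5])
```

When the posterior standard deviation is zero, every sample equals the mean latent and the estimated variance is exactly zero, which corresponds to the deterministic single-sample case.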

3.5. Task-Specific Decoders

3.5.1. Viewport Prediction Decoder

An autoregressive GRU generates future viewport positions $(\hat{\theta}_{t+\tau}, \hat{\phi}_{t+\tau})$ over a horizon of $H$ time steps. The initialization is $h_0 = \mathrm{MLP}(z)$. The joint loss combines geodesic position accuracy and trajectory smoothness:
$$\mathcal{L}_{vp} = \frac{1}{H} \sum_{\tau=1}^{H} \left\lVert g_{t+\tau} - \hat{g}_{t+\tau} \right\rVert_2^2 + \lambda_{\mathrm{smooth}} \frac{1}{H-1} \sum_{\tau=2}^{H} \left\lVert \hat{g}_{t+\tau} - \hat{g}_{t+\tau-1} \right\rVert_2^2,$$
where orthodromic distance accounts for spherical geometry.
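The two terms of the viewport loss can be sketched as follows. The sketch follows the squared Euclidean form of the displayed equation on (yaw, pitch) pairs; a geodesic variant would substitute the orthodromic distance of Section 4.4.1. The default `lambda_smooth=0.1` is a placeholder, since the paper does not state this weight.

```python
def viewport_loss(target, pred, lambda_smooth=0.1):
    """Position error plus smoothness penalty over a horizon of H steps.

    target, pred: lists of (yaw, pitch) tuples of equal length H.
    lambda_smooth: hypothetical weight, not taken from the paper.
    """
    horizon = len(target)

    def squared_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    position = sum(squared_dist(g, g_hat) for g, g_hat in zip(target, pred)) / horizon
    if horizon > 1:
        # Penalize large jumps between consecutive predicted positions.
        smooth = sum(squared_dist(pred[i], pred[i - 1]) for i in range(1, horizon)) / (horizon - 1)
    else:
        smooth = 0.0
    return position + lambda_smooth * smooth

trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(viewport_loss(trajectory, trajectory))  # perfect prediction: only the smoothness term remains
```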

3.5.2. Bitrate Allocation Decoder

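Unnormalized scores $s \in \mathbb{R}^{K \times |\mathcal{R}|}$ are produced for $K$ tiles, and expected bitrates $\hat{r}_i = \sum_{r \in \mathcal{R}} r \, p_{i,r}$ are computed from the resulting softmax distribution. A bandwidth penalty enforces the available capacity constraint:
$$\mathcal{L}_{br} = -\frac{1}{K} \sum_{i=1}^{K} \log p_{i, r_i^*} + \lambda_{bw} \max\!\left(0, \frac{\sum_i \hat{r}_i - B_{\mathrm{avail}}}{B_{\mathrm{avail}}}\right).$$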
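A small worked sketch of this loss, assuming a cross-entropy term with the usual negative sign and a relative overshoot penalty. The bitrate ladder values, `lambda_bw=1.0`, and the function names are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one tile's bitrate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bitrate_loss(scores, ladder, target_idx, bandwidth, lambda_bw=1.0):
    """Cross-entropy on the target bitrate index plus a relative bandwidth-overshoot penalty.

    scores: K rows of unnormalized scores, one entry per ladder rung.
    ladder: available bitrates (same unit as bandwidth); illustrative values only.
    """
    probs = [softmax(row) for row in scores]
    expected = [sum(r * p for r, p in zip(ladder, row)) for row in probs]
    cross_entropy = -sum(math.log(row[t]) for row, t in zip(probs, target_idx)) / len(scores)
    overshoot = max(0.0, (sum(expected) - bandwidth) / bandwidth)
    return cross_entropy + lambda_bw * overshoot, expected

# Two tiles, uniform scores: expected bitrate is the ladder mean for each tile.
loss_ok, rates = bitrate_loss([[0.0, 0.0], [0.0, 0.0]], [1.0, 3.0], [0, 1], bandwidth=10.0)
loss_over, _ = bitrate_loss([[0.0, 0.0], [0.0, 0.0]], [1.0, 3.0], [0, 1], bandwidth=2.0)
print(rates, loss_ok, loss_over)
```

The penalty is zero whenever the summed expected bitrate fits the budget, so it only shapes the allocation in over-budget cases.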

3.5.3. Tile Selection and QoE Decoders

Tile importance scores $\hat{w}_i = \sigma(\mathrm{MLP}_{\mathrm{tile}}(z)_i) \in [0, 1]$ are trained with binary cross-entropy against ground-truth tile inclusion labels. The QoE decoder outputs a scalar $\hat{q} = \mathrm{MLP}_{\mathrm{qoe}}(z) \in [1, 5]$ trained with MSE against ITU-R BT.500 MOS ratings.
Both tile and QoE decoders operate on the deterministic posterior mean $\mu_z$ ($S = 1$), since their smooth loss surfaces make the mean-field approximation accurate; the saved inference budget is allocated to the $S = 10$ Monte Carlo viewport path whose uncertainty output directly conditions the bitrate allocation decoder (Section 3.5.2).

3.6. Meta-Learning Adapter for Personalization

Figure 3 shows the personalization pipeline.

3.6.1. Episodic Memory Module

The adapter maintains memory $\mathcal{M} = \{(h_j, \Delta\theta_j, \Delta\phi_j)\}_{j=1}^{M}$ of successful past adaptations. For a new user, $k$ nearest neighbors are retrieved via $\mathrm{sim}(u, j) = \exp\!\left(-\lVert h_u - h_j \rVert^2 / 2\sigma^2\right)$.
The k-NN retrieval operates on the deterministic mean embedding h u rather than a posterior sample, because the squared-Euclidean distance metric is maximally discriminative on the mean representation; injecting posterior variance into the retrieval step degrades k-NN precision without providing a calibration benefit, since the retrieved prototypes serve only as a warm initialization for the HyperNet.
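The retrieval step can be sketched with a Gaussian kernel over squared Euclidean distances. How the retrieved offsets are combined is not specified in the text, so the similarity-weighted average in `warm_start` is an assumption, as are the function names and the flattening of each stored adaptation into a single offset vector.

```python
import math

def retrieve_neighbors(h_user, memory, k=3, sigma=1.0):
    """Return the k most similar memory entries as (similarity, offset) pairs.

    memory: list of (embedding, offset_vector) tuples; offsets stand in for the
    stored parameter adaptations (hypothetical flattening).
    """
    scored = []
    for h_j, offset in memory:
        dist_sq = sum((a - b) ** 2 for a, b in zip(h_user, h_j))
        scored.append((math.exp(-dist_sq / (2.0 * sigma ** 2)), offset))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def warm_start(neighbors):
    """Similarity-weighted average of retrieved offsets (assumed combination rule)."""
    total = sum(weight for weight, _ in neighbors)
    dim = len(neighbors[0][1])
    return [sum(weight * offset[i] for weight, offset in neighbors) / total for i in range(dim)]

memory = [([0.0, 0.0], [1.0]), ([10.0, 10.0], [5.0])]
nearest = retrieve_neighbors([0.0, 0.0], memory, k=1)
print(nearest, warm_start(nearest))
```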

3.6.2. Hypernetwork Adaptation

Rather than fine-tuning the full model, a compact hypernetwork maps the averaged support-set gradient to per-layer weight offsets [43]:
$$\delta\theta = \mathrm{HyperNet}\!\left(\frac{1}{|\mathcal{D}_{\mathrm{sup}}|} \sum_{(x, y) \in \mathcal{D}_{\mathrm{sup}}} \nabla_\theta \mathcal{L}(x, y; \theta)\right), \qquad \theta' = \theta + \alpha \cdot \delta\theta.$$

3.6.3. MAML Bi-Level Optimization

Meta-training solves:
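$$\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\!\left(\theta - \beta \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)\right),$$
where each task $\mathcal{T}_i$ corresponds to personalizing for a different user. The inner loop uses 5 gradient steps ($\beta = 0.01$); the outer loop employs Adam.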
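The bi-level structure can be made concrete on a toy problem. In the sketch below each "user" is a scalar quadratic task with loss 0.5 (θ − c)², so the inner-loop gradient and the exact meta-gradient are available in closed form. The quadratic tasks, the plain gradient-descent outer loop (the paper uses Adam), the reuse of the same loss for support and query, and all names are assumptions for illustration.

```python
def inner_adapt(theta, center, inner_lr=0.01, steps=5):
    """Run the inner loop on loss 0.5 * (theta - center)^2.

    Returns the adapted parameter and d(adapted)/d(theta), which is needed
    for the exact (second-order) meta-gradient.
    """
    jacobian = 1.0
    for _ in range(steps):
        theta = theta - inner_lr * (theta - center)
        jacobian *= 1.0 - inner_lr
    return theta, jacobian

def meta_gradient(theta, centers, inner_lr=0.01, steps=5):
    """Gradient of the mean post-adaptation loss with respect to the initial theta."""
    total = 0.0
    for center in centers:
        adapted, jacobian = inner_adapt(theta, center, inner_lr, steps)
        total += (adapted - center) * jacobian  # chain rule through the inner loop
    return total / len(centers)

def meta_train(centers, theta=0.0, outer_lr=0.5, iterations=200):
    """Outer loop: plain gradient descent on the meta-objective."""
    for _ in range(iterations):
        theta -= outer_lr * meta_gradient(theta, centers)
    return theta

# The meta-learned initialization settles between the task optima.
print(meta_train([1.0, 3.0]))
```

For this toy family the best shared initialization is the mean of the task optima, which is the point from which a few inner steps help every task equally.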

3.7. Volumetric Video Extension

For six-degrees-of-freedom (6DoF) volumetric content, ImmerseFM-3D extends viewport representation and tile-importance scoring as illustrated in Figure 4.
Orientation at future step τ is represented as:
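$$\hat{v}(t+\tau) = \sum_{l=0}^{L} \sum_{m=-l}^{l} \hat{a}_{lm}(\tau) \, Y_{lm}(\theta, \phi), \qquad L = 3,$$
and depth-aware tile importance reweights the 2D scores as
$$w_i^{3D} = w_i^{2D} \cdot \exp\!\left(-\lambda_d \left(\hat{d}_i - \hat{d}_{\mathrm{vp}}\right)^2\right).$$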
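A minimal sketch of the depth-aware reweighting follows. It assumes the exponent is the negative squared difference between the estimated tile depth and the estimated viewport depth; the exact form of the exponent, the value `lambda_d=0.5`, and the function name are assumptions rather than details confirmed by the text.

```python
import math

def depth_aware_weights(weights_2d, tile_depths, viewport_depth, lambda_d=0.5):
    """Down-weight tiles whose estimated depth is far from the viewport depth.

    Assumption: Gaussian-style falloff exp(-lambda_d * (d_i - d_vp)^2).
    """
    return [
        w * math.exp(-lambda_d * (depth - viewport_depth) ** 2)
        for w, depth in zip(weights_2d, tile_depths)
    ]

# A tile at the viewport depth keeps its 2D score; a tile 1 m away is attenuated.
print(depth_aware_weights([1.0, 1.0], [2.0, 3.0], viewport_depth=2.0))
```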

3.8. Training Objective and Procedure

The complete loss is:
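$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{vp} + \lambda_{br} \mathcal{L}_{br} + \lambda_{ts} \mathcal{L}_{ts} + \lambda_{qe} \mathcal{L}_{qe} + \beta \mathcal{L}_{\mathrm{KL}} + \gamma \mathcal{L}_{\mathrm{meta}},$$
with weights $\lambda_{br} = 0.5$, $\lambda_{ts} = 0.3$, $\lambda_{qe} = 1.0$, $\beta = 0.01$, and $\gamma = 0.1$.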
Figure 5 outlines the three-stage training schedule.
The training algorithm is formalized in Algorithm 1.
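Algorithm 1 ImmerseFM-3D Multi-Stage Training
Input: Dataset $\mathcal{D} = \{(x_i, y_i^{vp}, y_i^{br}, y_i^{ts}, y_i^{qe})\}_{i=1}^{N}$, user splits $\mathcal{U}_{\mathrm{train}}$, $\mathcal{U}_{\mathrm{meta}}$, $\mathcal{U}_{\mathrm{test}}$
Output: Optimised parameters $\Theta = \{\Theta_{\mathrm{enc}}, \Theta_{\mathrm{fus}}, \Theta_{\mathrm{dec}}, \Theta_{\mathrm{meta}}\}$
Stage 1: Pre-training
  Randomly initialise all networks.
  for epoch = 1 to $E_{\mathrm{pre}}$:
     Sample batch $\mathcal{B} \sim \mathcal{D}_{\mathrm{train}}$
     Encode: $\{f_m\} \leftarrow \{\mathrm{Enc}_m(x_m)\}_{m \in \mathcal{X}}$
     Fuse: $M \leftarrow \mathrm{CrossModalAttn}(M)$
     Compress: $z \sim \mathcal{N}(\mu_z, \sigma_z^2)$ via reparameterisation
     Decode: $\hat{y}^{vp}, \hat{y}^{br}, \hat{y}^{ts}, \hat{y}^{qe} \leftarrow \{f_{\mathrm{dec}}^{k}(z)\}$
     Compute loss: $\mathcal{L} \leftarrow \mathcal{L}_{vp} + \lambda_{br} \mathcal{L}_{br} + \lambda_{ts} \mathcal{L}_{ts} + \lambda_{qe} \mathcal{L}_{qe} + \beta \mathcal{L}_{\mathrm{KL}}$
     Update $\Theta$ via AdamW (lr = $10^{-4}$)
Stage 2: Meta-training
  for epoch = 1 to $E_{\mathrm{meta}}$:
     Sample user batch $\mathcal{U}_b \sim \mathcal{U}_{\mathrm{meta}}$
     for each user $u \in \mathcal{U}_b$:
        Sample $\mathcal{D}_{\mathrm{sup}}^{u}$ (5 samples) and $\mathcal{D}_{\mathrm{qry}}^{u}$
        Compute adapted parameters: $\theta_u \leftarrow \theta - \beta \nabla_\theta \mathcal{L}(\mathcal{D}_{\mathrm{sup}}^{u}; \theta)$
        Compute query loss: $\mathcal{L}_{\mathrm{qry}}^{u} \leftarrow \mathcal{L}(\theta_u, \mathcal{D}_{\mathrm{qry}}^{u})$
     Update base parameters: $\theta \leftarrow \theta - \eta \nabla_\theta \sum_u \mathcal{L}_{\mathrm{qry}}^{u}$
Stage 3: Fine-tuning
  Set lr = $10^{-5}$; jointly optimise all losses for $E_{\mathrm{ft}}$ epochs.
Return: $\Theta$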

3.9. Implementation Details

Table 3 summarizes the full architecture. Training runs on 8× NVIDIA A100 (80 GB; NVIDIA Corporation, Santa Clara, CA, USA) GPUs with PyTorch 2.0 DDP and FP16 mixed-precision. Inference is benchmarked on an NVIDIA T4 edge server (NVIDIA Corporation, Santa Clara, CA, USA), iPhone 12 (A14 Bionic; Apple Inc., Cupertino, CA, USA), Pixel 6 (Google LLC, Mountain View, CA, USA), and HTC Vive Pro Eye 90 Hz (HTC Corporation, New Taipei City, Taiwan).
Key hyper-parameters: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, wd = 0.01); gradient clipping to norm 1.0; effective batch size 256 via gradient accumulation; $E_{\mathrm{pre}} = 100$, $E_{\mathrm{meta}} = 50$, $E_{\mathrm{ft}} = 20$ epochs.
The complete source code, training scripts, and pre-trained model checkpoints for ImmerseFM-3D and ImmerseFM-3D-Mobile are available at https://github.com/ythinkxyz/ImmerseFM-3D (accessed on 3 March 2026) (MIT Licence, released upon acceptance). Inference can be executed immediately from the pre-trained checkpoints without retraining.

4. Experimental Setup

4.1. IMMERSE-1M Dataset

We introduce IMMERSE-1M, a large-scale multi-modal dataset for immersive streaming research that consolidates and extends existing public resources while adding novel modalities. The dataset comprises over 1000 h of content with fully synchronized multi-modal recordings, making it the largest publicly available benchmark for 360° and volumetric video streaming research to date. Figure 6 provides an overview of the dataset composition.
Dataset demographics and cultural diversity are characterized in Section 4.1.5. In summary, IMMERSE-1M consolidates data from East Asian (32%, Taiwan/China) and Western European (68%, France/UK) participant pools. Post-hoc analysis reveals statistically significant but modest differences in equatorial bias, exploration radius, and saccade frequency between groups ($p < 0.05$, $d$ = 0.26–0.43). Per-group evaluation confirms that VP MAE differs by only 0.19° between groups, below the threshold of practical significance. We acknowledge that South Asian, African, and Latin American viewing patterns are not represented in the current dataset and recommend this as a priority for future data collection.

4.1.1. Video Content

The video corpus covers six format categories, as detailed in Table 4. Content was sourced from established public datasets [5,46,62] and augmented with new recordings to increase scene diversity, particularly for fast-motion and volumetric categories.
Content diversity across five semantic categories is maintained: sports (15%, fast motion with predictable trajectories), nature (20%, slow exploration with salient regions), documentary (25%, narrative-driven guided attention), urban (20%, complex multi-POI scenes), and action/cinematic (20%, dynamic camera and rapid cuts).

4.1.2. User Behavior Data

Six user behavior corpora are consolidated, as summarized in Table 5. New recordings contribute 200 additional participants with the richest modality set: head tracking, eye tracking, and pupillometry, all at 90 Hz. The combined dataset totals 524 unique participants with approximately 1200 h of viewing data.
Subjective QoE ratings comprise 50,000+ MOS scores collected following ITU-R BT.500 guidelines, with a minimum of 25 subjects rating each video segment under controlled viewing conditions. Ratings cover overall quality, motion sickness, and immersion dimensions.

4.1.3. Network Traces

Ten thousand network traces spanning five real-world and one synthetic condition are included (Table 6), covering bandwidths from 0.1 Mbps (HSDPA) to 800 Mbps (5G mmWave simulation) and round-trip times from 2 ms (WiFi) to 500 ms (degraded HSDPA). This diversity ensures robust evaluation across heterogeneous deployment environments.

4.1.4. Auxiliary Modalities and Dataset Splits

Audio: 60% of 360° videos include synchronized 4-channel first-order ambisonics (FOA), enabling directional sound modeling. Scene semantics: CLIP ViT-B/32 embeddings extracted at 1 fps, MS-COCO object detection labels, and scene graphs are provided for all content.
Dataset splits are user-disjoint and content-disjoint, as shown in Table 7, preventing any overlap between pre-training, meta-training, and held-out test users or video sequences.

4.1.5. Dataset Demographics and Cultural Diversity

The generalizability of ImmerseFM-3D in global deployments depends critically on whether the participant pool underlying IMMERSE-1M reflects the diversity of real-world immersive media audiences. This section characterizes the demographic and geographic composition of the dataset and assesses its impact on model generalization.
Geographic composition. The six user behavior corpora consolidated into IMMERSE-1M originate from two principal geographic regions:
  • East Asian cohort ($n = 168$, 32%): Wu et al. [46] (48 participants, National Taiwan University) and Xu et al. [39] (120 participants, mixed East Asian affiliation).
  • Western European cohort ($n = 356$, 68%): Corbillon et al. [5] (59 participants, Télécom Paris), David et al. [38] (57 participants, University of Nantes), Rai et al. [47] (40 participants, University of Nantes), and new recordings collected at the University of Strathclyde, Glasgow (200 participants recruited under institutional IRB approval).
The combined dataset therefore does not include participants from South Asian, African, or Latin American backgrounds, which we acknowledge as a limitation with implications for global deployment generalization (see Section 7).
Head-motion statistics by geographic group. To assess whether cultural background introduces systematic differences in viewing behavior, we conducted a post-hoc analysis comparing the East Asian and Western European participant groups on three descriptors: equatorial bias (fraction of viewing time spent within 30° of the equatorial plane), mean exploration radius (mean geodesic distance from video centroid), and saccade frequency (number of rapid reorientations per minute). Table 8 reports the results.
All three metrics show statistically significant differences ($p < 0.05$), consistent with cross-cultural gaze research demonstrating that East Asian observers allocate more attention to contextual background elements while Western observers exhibit higher focal exploration rates [4]. However, the effect sizes are modest ($d$ = 0.26–0.43, classified as small-to-medium), suggesting that while the differences are real, they are unlikely to produce practically significant generalization failures for a model trained on a 524-participant pool that spans both cohorts.
Per-group evaluation. To quantify the impact on model performance, ImmerseFM-3D was evaluated separately on East Asian and Western European held-out test subsets (proportionally sampled from the 10% test split of Table 7):
  • VP MAE at 1 s: East Asian 5.34°, Western European 5.15°, overall 5.21°. The 0.19° inter-group gap is small relative to the 2.62° margin over the best baseline (STMRQ: 7.83°).
  • QoE Pearson $\rho$: East Asian 0.878, Western European 0.897, overall 0.891. The 0.019 inter-group gap is below the conventional threshold of practical significance (0.05).
The MAML-based personalization adapter (Section 3.6) is designed to handle individual-level deviations from population priors—including culture-driven differences in viewing style—via few-shot adaptation from 1–30 s of per-user interaction data. This mechanism will naturally compensate for under-represented demographic groups at deployment time, even before any dedicated cross-cultural re-training is performed.
Content diversity. The video corpus covers five semantic categories (Sports, Nature, Documentary, Urban, Action/Cinematic) sourced from globally available public datasets [5,39,46]. The Sports category is skewed towards team sports and outdoor activities common in East Asia and Europe; content featuring sports or cultural practices specific to South Asian, African, or Latin American contexts is not represented. This is a known limitation of the current dataset version.

4.2. Evaluation Protocols

We design five evaluation experiments, each targeting a distinct generalization axis of ImmerseFM-3D, as illustrated in Figure 7.
Experiment 1: Unseen Content Generalization. Models are trained on 80% of video content spanning all categories and evaluated on the held-out 20%, stratified by category. Metrics are reported separately per content category to expose category-specific failure modes.
Experiment 2: Few-Shot Personalization. For each of the 53 test users, $k \in \{1, 5, 10, 30, 60\}$ seconds of interaction data are provided to the meta-learning adapter. Adaptation quality is measured as the QoE gain over the non-personalized baseline, and sample efficiency is defined as the data volume required to reach 90% of fully adapted performance. The experiment is repeated across all test users and summarized as mean ± 95% CI.
Experiment 3: Cross-Format Transfer. The model is pre-trained exclusively on 360° video, then evaluated zero-shot on volumetric and light-field test sets. Progressive fine-tuning with {1%, 5%, 10%, 25%} of target domain data quantifies data efficiency relative to training from scratch on equivalent data.
Experiment 4: Volumetric Video Streaming. The 6DoF volumetric extension (Section 3.7) is evaluated on 80 held-out volumetric videos. Predictions cover full 6DoF trajectories; tile streaming is simulated using the depth-aware tile importance weighting from Equation (13), compared against 2D-only and depth-agnostic baselines.
Experiment 5: Ablation Studies. Table 9 defines nine ablation configurations systematically removing architectural components to isolate their individual contributions.

4.3. Baseline Methods

ImmerseFM-3D is benchmarked against state-of-the-art methods for each sub-task, as detailed in Table 10, Table 11 and Table 12.
All baselines were re-implemented in PyTorch 2.0 and trained from random initialization exclusively on the IMMERSE-1M pre-training split (Table 7), using an independent 5% validation subset for hyperparameter optimization (random search, 50 configurations per baseline). Official open-source implementations were used as starting points where available. No pre-trained checkpoints from the original publications were employed. All models were evaluated on the identical held-out test split using shared evaluation scripts, ensuring strict comparability across all reported metrics.

4.4. Evaluation Metrics

4.4.1. Viewport Prediction

Angular prediction error is measured using the geodesically correct orthodromic distance:
$$d(g_1, g_2) = \arccos\!\left(\sin\phi_1 \sin\phi_2 + \cos\phi_1 \cos\phi_2 \cos(\theta_1 - \theta_2)\right),$$
where $(\theta, \phi)$ are yaw and pitch in radians. Mean Angular Error (MAE) in degrees is reported at prediction horizons $\tau \in \{0.5, 1, 2, 3, 5\}$ s. Additional metrics include the fraction of predictions within $\delta \in \{10^\circ, 20^\circ\}$ thresholds and the Earth Mover’s Distance between predicted and ground-truth scanpath distributions.
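The orthodromic distance and the two derived metrics can be computed as in the sketch below. The function names are illustrative; the clamp before `acos` guards against floating-point values marginally outside [−1, 1].

```python
import math

def orthodromic_distance(yaw1, pitch1, yaw2, pitch2):
    """Great-circle angle in radians between two (yaw, pitch) directions."""
    cosine = (math.sin(pitch1) * math.sin(pitch2)
              + math.cos(pitch1) * math.cos(pitch2) * math.cos(yaw1 - yaw2))
    return math.acos(max(-1.0, min(1.0, cosine)))

def angular_errors_deg(truth, pred):
    """Per-sample angular error in degrees for lists of (yaw, pitch) tuples."""
    return [math.degrees(orthodromic_distance(*g, *p)) for g, p in zip(truth, pred)]

def mean_angular_error_deg(truth, pred):
    errors = angular_errors_deg(truth, pred)
    return sum(errors) / len(errors)

def fraction_within(truth, pred, threshold_deg):
    """Share of predictions whose angular error is below the threshold."""
    errors = angular_errors_deg(truth, pred)
    return sum(1 for e in errors if e < threshold_deg) / len(errors)

truth = [(0.0, 0.0), (0.0, 0.0)]
pred = [(0.1, 0.0), (1.0, 0.0)]
print(mean_angular_error_deg(truth, pred), fraction_within(truth, pred, 10.0))
```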

4.4.2. Bitrate Allocation

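  • Top-k Accuracy: $\mathrm{Acc@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\!\left[r_i^* \in \text{Top-}k_i\right]$
  • Mean Absolute Error: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{r}_i - r_i^*|$
  • Bandwidth Violation Rate: $\mathrm{BVR} = P\!\left(\sum_i \hat{r}_i > B_{\mathrm{avail}}\right)$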
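The three bitrate metrics above can be sketched as follows. BVR is computed here as the empirical fraction of decision steps whose summed allocation exceeds the available bandwidth; the function names and the per-step data layout are assumptions.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of tiles whose target bitrate index is among the k highest-scored indices."""
    hits = 0
    for row, target in zip(scores, targets):
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top
    return hits / len(targets)

def bitrate_mae(predicted, target):
    """Mean absolute error between predicted and target bitrates."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def bandwidth_violation_rate(allocations, budgets):
    """Share of decision steps where the total allocated bitrate exceeds the budget."""
    violations = sum(1 for alloc, budget in zip(allocations, budgets) if sum(alloc) > budget)
    return violations / len(budgets)

scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(top_k_accuracy(scores, [1, 1], k=1), top_k_accuracy(scores, [1, 1], k=2))
print(bandwidth_violation_rate([[2.0, 2.0], [5.0, 6.0]], [5.0, 10.0]))
```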

4.4.3. QoE Estimation

Quality-of-experience estimation is assessed via three complementary metrics:
$$\text{Pearson:} \quad \rho = \frac{\sum_i (q_i - \bar{q})(\hat{q}_i - \bar{\hat{q}})}{\sqrt{\sum_i (q_i - \bar{q})^2} \, \sqrt{\sum_i (\hat{q}_i - \bar{\hat{q}})^2}},$$
$$\text{RMSE:} \quad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (q_i - \hat{q}_i)^2},$$
with Spearman rank correlation $\rho_s$ additionally reported for ordinal consistency.
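For reference, the three QoE metrics can be computed as in this sketch. Spearman correlation is obtained as the Pearson correlation of average ranks, which is the standard tie-aware definition; the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def rmse(x, y):
    """Root-mean-square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

mos = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]   # monotone but non-linear in MOS
print(pearson(mos, predicted), spearman(mos, predicted), rmse(mos, predicted))
```

The example shows why both correlations are reported: a monotone but non-linear predictor keeps a perfect Spearman score while its Pearson score drops below 1.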

4.4.4. Tile Selection

Standard information-retrieval metrics are used:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Prec.} \cdot \mathrm{Rec.}}{\mathrm{Prec.} + \mathrm{Rec.}},$$
complemented by Intersection over Union $\mathrm{IoU} = |S \cap \hat{S}| / |S \cup \hat{S}|$ to measure the spatial overlap between selected and ground-truth tile sets.
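Treating the selected and ground-truth tiles as sets of tile indices, the four metrics reduce to a few set operations. The sketch below is illustrative; returning 0.0 for empty denominators is a convention chosen here, not one stated in the paper.

```python
def tile_selection_metrics(selected, truth):
    """Precision, recall, F1, and IoU between two sets of tile indices."""
    selected, truth = set(selected), set(truth)
    true_pos = len(selected & truth)
    false_pos = len(selected - truth)
    false_neg = len(truth - selected)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    union = len(selected | truth)
    iou = true_pos / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

print(tile_selection_metrics({1, 2, 3}, {2, 3, 4}))
```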

4.4.5. 6DoF Volumetric Metrics

The combined 6DoF viewport error weights angular and positional components:
$$E_{6\mathrm{DoF}} = \alpha \, d_{\mathrm{angle}}(R, \hat{R}) + (1 - \alpha) \left\lVert p - \hat{p} \right\rVert_2,$$
with $\alpha = 0.5$. Depth estimation is evaluated using RMSE and $\delta_1$ accuracy (the percentage of pixels satisfying $\max(d_i / \hat{d}_i, \hat{d}_i / d_i) < 1.25$). Point-cloud quality is measured using point-to-point PSNR and point-to-plane PSNR for smooth surface fidelity.
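The combined error and the $\delta_1$ accuracy can be sketched as below. The text does not define $d_{\mathrm{angle}}$ explicitly; the sketch assumes the geodesic rotation angle in radians, computed from the trace of $R^\top \hat{R}$, and leaves the angular and positional terms in their native units. Function names are illustrative.

```python
import math

def rotation_angle(R, R_hat):
    """Geodesic angle (radians) between two 3x3 rotation matrices given as nested lists."""
    # trace(R^T R_hat) equals the element-wise (Frobenius) inner product of R and R_hat.
    trace = sum(R[row][col] * R_hat[row][col] for row in range(3) for col in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))

def six_dof_error(R, R_hat, p, p_hat, alpha=0.5):
    """Weighted sum of angular error and Euclidean position error."""
    position_error = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, p_hat)))
    return alpha * rotation_angle(R, R_hat) + (1.0 - alpha) * position_error

def delta1_accuracy(depth, depth_hat, threshold=1.25):
    """Fraction of pixels with max(d / d_hat, d_hat / d) below the threshold."""
    hits = sum(1 for d, d_hat in zip(depth, depth_hat) if max(d / d_hat, d_hat / d) < threshold)
    return hits / len(depth)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(six_dof_error(identity, quarter_turn_z, (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
print(delta1_accuracy([1.0, 2.0, 4.0], [1.1, 3.0, 4.0]))
```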

5. Results

This section reports quantitative outcomes across all five evaluation experiments (Section 4.2). All results are mean ± 95% CI over five independent runs. Statistical significance ($p < 0.05$, Bonferroni-corrected) versus the best baseline is marked where applicable.

5.1. Experiment 1: Generalization to Unseen Content

To validate the trace-driven evaluation methodology, we conducted a supplementary pilot deployment on a live 5G campus network (Nokia AirScale 3.5 GHz; Nokia Corporation, Espoo, Finland) with 12 participants. ImmerseFM-3D achieved a live-network BVR of 4.1% (vs. 3.1% in trace-driven evaluation, a 1.0 pp gap consistent with natural real-time network variance), VP MAE of 5.6° (vs. 5.21°), and a mean MOS of 3.9/5 (vs. predicted 3.87/5). The close agreement confirms that IMMERSE-1M’s real-world network traces provide a high-fidelity proxy for live deployment performance.

5.1.1. Viewport Prediction

Table 13 reports mean angular error (MAE, in degrees) at five prediction horizons on the held-out test set. ImmerseFM-3D achieves consistent reductions across all horizons, with the largest improvement at the challenging 3 s and 5 s horizons where multi-modal context is most beneficial.
An online user study ($n = 12$, live 5G network, four injected congestion events per video) confirms that ImmerseFM-3D’s real-time QoE predictions correlate with subjective slider ratings at $\rho = 0.871$ under live dynamic network conditions. ImmerseFM-3D’s uncertainty-aware ABR issued pre-emptive quality reductions in 81.3% of congestion events (versus 18.8% for JUST360), reducing the mean perceived quality drop during congestion from 0.79 to 0.31 MOS points (61% reduction).

5.1.2. Bitrate Allocation and Tile Selection

Table 14 presents bitrate allocation accuracy (Top-1 and Top-3), bandwidth violation rate (BVR), and tile selection F1 and IoU. ImmerseFM-3D achieves 92.1% Top-1 bitrate accuracy and reduces BVR from 8.3% (best baseline) to 3.1%, representing a critical practical improvement for streaming reliability.

5.1.3. QoE Estimation

Table 15 reports QoE estimation performance. ImmerseFM-3D achieves a Pearson correlation of 0.891 with human MOS scores, outperforming Voronoi-VQA (0.763) by 0.128, a substantial gain (Cohen’s $d = 1.47$, large effect).

5.1.4. Per-Category Generalization

Figure 8 shows the per-content-category breakdown of viewport prediction MAE at 1 s and QoE Pearson correlation, revealing that gains are consistent across all five categories. The largest relative improvement occurs in the high-motion (sports, action) categories, where multi-modal context from audio and semantics provides the strongest benefit.

5.2. Experiment 2: Few-Shot Personalization

Figure 9 shows QoE improvement over the non-personalized baseline as a function of adaptation data volume for ImmerseFM-3D and all three meta-learning baselines, averaged over 53 test users. Table 16 quantifies key efficiency thresholds.
ImmerseFM-3D reaches 90% of its maximum personalization gain with only 22 s of user data, making it more than twice as sample-efficient as MAML (47 s) and substantially more so than fine-tuning (>60 s). At the 1 s operating point, ImmerseFM-3D already achieves +4.8% improvement, demonstrating immediate utility even for first-session users.

5.3. Experiment 3: Cross-Format Transfer

Table 17 reports transfer ratios (performance relative to full in-domain training) at varying amounts of target-domain fine-tuning data. The zero-shot transfer ratio of 72% demonstrates that ImmerseFM-3D’s shared bottleneck representation encodes format-agnostic scene understanding. In contrast, training from scratch with 1% of target data yields only 32% of full performance.
With just 10% of target data, ImmerseFM-3D reaches 94% of full training performance, compared with 67% for scratch training. This is effectively a 6.6× reduction in the labeled target-domain data required to reach a comparable performance level.

5.4. Experiment 4: Volumetric Video Streaming

Table 18 presents 6DoF viewport prediction, depth estimation, and volumetric QoE results on the 80-video test set.
The depth-aware tile importance weighting (Equation (13)) delivers a 7.3 dB P2P-PSNR gain over the depth-agnostic baseline, confirming the value of geometric reasoning for volumetric content delivery. The spherical harmonic orientation decoder (Equation (12)) reduces 6DoF viewport error by 40.8% relative to the 2D baseline that ignores out-of-plane rotation components.

5.5. Experiment 5: Ablation Studies

Figure 10 visualizes the contribution of each architectural component relative to the full model, averaged over all four task metrics.
Table 19 provides per-task ablation scores to complement the aggregate visualization, confirming that the cross-modal attention fusion and the variational information bottleneck are the two components with the greatest individual impact.
Table 20 presents a consolidated marginal-contribution summary, reporting the absolute and relative performance degradation caused by removing each input modality, averaged across all four tasks.
Three observations emerge from the ablation analysis. First, cross-modal fusion is the single most critical component: removing it degrades average performance by 28.5 pp, more than removing any other individual module. Second, the information bottleneck contributes the second-largest gain (21.8 pp average), particularly for out-of-distribution generalization to unseen users. Third, scene semantics (CLIP features) provide a surprisingly large benefit (15.4 pp average), outweighing audio (10.5 pp) in overall importance, reflecting the strong correlation between visual scene type and head-motion statistics.
To evaluate robustness under partial modality availability, we additionally trained a mixture-of-modalities (MoM) variant in which each modality encoder is independently masked with probability 0.3 per training batch, with masked outputs replaced by a learned null embedding. The MoM model recovers 94% of full-modality performance when any single modality is absent, compared with 85–89% for the standard model, at the cost of a 2.1 pp reduction on the full-modality configuration. MoM training is recommended as the default production configuration for deployments where sensor availability cannot be guaranteed.
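The masking step of the mixture-of-modalities training can be sketched as follows. In the paper the null embedding is learned; here it is a fixed placeholder vector, and the dictionary layout and function name are assumptions for illustration.

```python
import random

def mask_modalities(features, null_embeddings, drop_prob=0.3, rng=None):
    """Independently replace each modality's features with its null embedding.

    features: dict mapping modality name to its encoder output.
    null_embeddings: dict with one placeholder vector per modality
    (a learned parameter in the paper, a constant in this sketch).
    """
    rng = rng or random.Random()
    masked = {}
    for name, feature in features.items():
        # Each modality is dropped with probability drop_prob, independently of the others.
        masked[name] = null_embeddings[name] if rng.random() < drop_prob else feature
    return masked

features = {"video": [0.4, 0.1], "audio": [0.9, 0.2], "depth": [0.3, 0.3]}
nulls = {name: [0.0, 0.0] for name in features}
print(mask_modalities(features, nulls, drop_prob=0.3, rng=random.Random(0)))
```

Because masking is independent per modality, training batches cover combinations of missing inputs, which is what allows a single model to serve deployments with different sensor sets.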

5.6. Computational Efficiency

Table 21 profiles inference latency and memory on all target hardware platforms. On the edge server (T4), ImmerseFM-3D completes a full decision cycle in 31.4 ms, comfortably within the 40 ms budget for 25 fps real-time streaming. On mobile devices, the full model exceeds this budget, motivating the knowledge-distilled student model (ImmerseFM-3D-Mobile), which preserves 91% of full-model QoE at 22.1 ms inference latency.

6. Discussion

6.1. Interpretation of Main Results

The results in Section 5 confirm that ImmerseFM-3D advances the state of the art across all four streaming sub-tasks by substantial, statistically significant margins. Three findings are particularly consequential.
Viewport prediction gains persist across horizons. The relative MAE reduction remains close to 30% from the shortest to the longest horizon (30.1% at 0.5 s, 29.2% at 5 s; Table 13). Because baseline error grows with the horizon, a near-constant relative reduction corresponds to a widening absolute gain, indicating that the foundation model’s multi-modal context continues to pay off as prediction difficulty increases. Single-task baselines increasingly revert to inertial heuristics at longer horizons, whereas ImmerseFM-3D leverages scene semantics and audio directional cues to anticipate re-orientation events before they manifest in head-motion history alone.
Bandwidth violations substantially reduced. The BVR reduction from 8.3% to 3.1% (Table 14) has direct implications for user experience: each violation triggers rebuffering or quality degradation that reduces MOS by approximately 0.4 points on the 1–5 scale. The shared bottleneck enables the bitrate decoder to implicitly anticipate network fluctuations by conditioning on the uncertainty of the network encoder (the $\sigma_z^2$ term in Equation (7)), a capability absent from all baselines.
QoE estimation approaches inter-rater reliability. A Pearson correlation of 0.891 is close to the inter-rater reliability reported in large-scale subjective studies (typically 0.91–0.95). The multi-modal inputs enable the QoE decoder to capture both technical quality (video and depth features) and perceptual comfort (eye-tracking and head-motion features), whereas Voronoi-VQA is limited to visual salience alone.

6.2. Cross-Modal Synergies

The attention matrix in Figure 11 exposes several non-trivial cross-modal dependencies that corroborate known perceptual phenomena. The elevated U↔E weight (0.15–0.16) reflects the vestibulo-ocular reflex: smooth pursuit eye movements closely follow head trajectory in immersive viewing, making gaze and head-motion mutually predictive [66]. The A↔S interaction (0.14–0.15) captures the phenomenon in which salient audio events (speech onset, sound effects) redirect attention toward semantically congruent visual regions, a cue that single-modality baselines entirely ignore.
Conversely, the network modality (N) maintains near-exclusive self-attention ($A_{NN} = 0.51$), consistent with the expectation that bandwidth fluctuations are statistically independent of visual content in the IMMERSE-1M dataset. This implicit causal disentanglement is reinforced by the information bottleneck, which prevents the network encoder from absorbing spurious correlations with content features.

6.3. Role of the Information Bottleneck

The ablation (Table 19) shows that removing the bottleneck causes a 21.8 pp average performance drop, but removing cross-modal fusion causes 28.5 pp. This ordering reveals complementary roles: the cross-modal attention constructs a task-rich representation from heterogeneous inputs, while the bottleneck filters it to retain only the minimal sufficient statistics for prediction. Together they implement maximum compression with controlled task informativeness, consistent with the information bottleneck principle [42].
The KL annealing schedule (0 to 0.01 over 20 epochs, Section 3.8) is critical: training with fixed β = 0.01 from epoch 1 causes posterior collapse in approximately 30% of runs. Annealing allows the encoder to first build a task-relevant representation before compression pressure is applied, a finding consistent with prior variational learning literature.

6.4. Meta-Learning and Personalization Dynamics

The 90%-threshold result of 22 s for ImmerseFM-3D versus 47 s for MAML reflects two compounding advantages. First, episodic memory retrieval provides a warm initialization closer to each individual’s optimum than the population mean, reducing the number of gradient steps needed to converge. Second, the hypernetwork generates globally coherent weight adjustments from a single gradient signal, avoiding the oscillatory convergence observed in vanilla MAML with small inner-loop step counts [43]. The +4.8% QoE improvement from a single second of data is already statistically significant ($p < 0.001$), establishing ImmerseFM-3D as practically useful even in first-session deployments without any prior user data.
The “w/o meta-learning” ablation shows near-zero degradation in viewport MAE but a meaningful −0.019 drop in QoE ρ . This is expected: head-motion prediction benefits from personalization primarily over longer sessions, while QoE is immediately sensitive to individual differences in perceptual comfort thresholds and quality preferences.

6.5. Volumetric and 6DoF Streaming Analysis

The 40.8% reduction in 6DoF viewport error versus the 2D-only baseline validates the spherical harmonic orientation decoder (Equation (12)), which avoids the gimbal lock and ±180° discontinuities inherent in direct Euler-angle regression. The depth RMSE of 0.21 m and $\delta_1$ accuracy of 91.4% are competitive with specialist monocular depth networks, indicating that multi-modal bottleneck features provide complementary geometric context. The 7.3 dB P2P-PSNR improvement from depth-aware tile importance (Table 18) highlights the sensitivity of point-cloud quality to fetching geometrically appropriate depth layers, a benefit only achievable by architectures that jointly model depth and streaming decisions.

6.6. Computational Considerations and Deployment Pathways

The 31.4 ms edge-server latency and 22.1 ms mobile latency of ImmerseFM-3D-Mobile both satisfy the 40 ms real-time 25 fps budget (Table 21). The 9% QoE penalty of the distilled model is a reasonable trade-off for mobile deployment and is largely recovered through personalization: the student model’s adaptation gain of +13.1% at 30 s nearly offsets the 9% compression penalty on a per-user basis. The dominant training cost is Stage 1 (≈980 GPU-hours on A100s); however, fine-tuning to a new target domain requires only ≈49 GPU-hours, making the paradigm cost-effective for continual adaptation to emerging content formats or network technologies.
All ImmerseFM-3D-Mobile inference figures satisfy the 40 ms real-time decision budget required for 25 fps MPEG-DASH segment control. The knowledge-distilled ImmerseFM-3D-Mobile variant (≈28 M parameters) achieves 18.6 ms on an HTC Vive Pro Eye while retaining 91% of the full model’s QoE performance, making on-device deployment of the complete four-task pipeline practically feasible on current-generation HMD hardware. Furthermore, the unified foundation-model architecture is computationally more efficient than the ensemble of four state-of-the-art single-task baselines (≈450 GFLOPs combined versus 287 GFLOPs for ImmerseFM-3D), reinforcing the practical case for joint multi-task optimization.
The seven modality-specific encoders execute fully in parallel via CUDA stream concurrency; the encoder-stage wall-clock latency is therefore bounded by the single slowest encoder (the 3D Swin-T video encoder, ≈18.1 ms on T4) rather than the sum of all seven. A stage-by-stage latency breakdown is provided in Table 21. The 94.7 ms iPhone 12 figure applies to the full 177 M-parameter FP32 model; the intended on-device configuration is ImmerseFM-3D-Mobile (22.1 ms, FP16).
For practitioners without access to A100-class training hardware, three simplification pathways are available. First, the knowledge-distilled ImmerseFM-3D-Mobile variant (28 M parameters) runs on commodity NPU hardware while retaining 91% of the full model’s QoE performance. Second, the modular encoder–attention–decoder design permits any single task branch to be deployed in isolation, with all encoder outputs standardized to 512-dimensional vectors for clean downstream integration. Third, Stage 1 pre-training alone yields a competitive population-level model, reducing the training budget to approximately 700 A100-h or an equivalent on consumer GPUs.

6.7. Limitations and Future Work

Despite the consistent gains, ImmerseFM-3D has limitations that define a clear research agenda.
Modality availability. The full 7-modality model requires depth maps and ambisonic audio, which are unavailable for most legacy 360° content. The ablation confirms graceful degradation of the standard model (−8 pp without depth; −10.5 pp without audio), but handling missing inputs without re-training is required for practical deployment, which motivates the mixture-of-modalities variant evaluated in Section 5.5.
Privacy-preserving personalization. Eye-tracking and head-motion data are sensitive biometric signals. A federated learning extension [53] in which user encoders are trained locally, with only the compressed latent z transmitted to the server, would address this concern while preserving personalization capability.
For production deployments where all seven modalities cannot be guaranteed simultaneously, the mixture-of-modalities (MoM) trained variant is recommended. It maintains 94% of full-modality performance across all single-modality-missing scenarios, at the cost of a 2.1 pp drop when all seven modalities are present. All single-modality-missing variants of ImmerseFM-3D remain competitive with or superior to the best available single-task baseline for every sub-task, confirming deployment viability across heterogeneous sensor environments.
The modality availability concern is significantly mitigated by four factors. First, five of the seven modalities (video, network traces, CLIP semantics, head motion, ambisonic audio) are available in standard streaming deployments without specialized hardware. Second, depth maps can be estimated in real time from the video stream using monocular depth networks, eliminating the need for stereoscopic capture in most deployments. Third, the mixture-of-modalities trained variant (Section 5.5) recovers 94% of full-modality performance when any single modality, including the specialized eye-tracking modality, is absent. Fourth, the personalization adapter requires only 1–30 s of head-motion data from the HMD itself, with no pre-collected per-user training data needed. IMMERSE-1M will be released publicly upon acceptance, eliminating the data acquisition barrier for researchers wishing to reproduce or extend this work.

7. Conclusions

We presented ImmerseFM-3D, the first unified foundation model framework that jointly addresses viewport prediction, adaptive bitrate allocation, tile selection, and quality-of-experience estimation for 360-degree and volumetric video streaming. By fusing seven heterogeneous input modalities through cross-modal attention and compressing their joint representation via a variational information bottleneck [42], ImmerseFM-3D produces a shared 256-dimensional latent that simultaneously informs all four streaming decisions. The five principal contributions of this work are summarized below, each with its principal empirical evidence.
(1) Joint multi-task foundation architecture.
ImmerseFM-3D is the first architecture to jointly solve all four 360° streaming sub-tasks through a single shared cross-modal representation, eliminating the inter-task error propagation bottleneck that limits siloed pipelines. On the IMMERSE-1M benchmark, the model reduces mean angular viewport error by up to 34% across prediction horizons of 0.5–5 s, cuts the bandwidth violation rate from 8.3% to 3.1% (a 5.2 pp absolute, or 63% relative, reduction), and achieves a QoE Pearson correlation of 0.891, approaching the 0.91–0.95 upper bound set by inter-rater reliability in large-scale subjective studies. These improvements are statistically significant ($p < 0.05$, Bonferroni-corrected) against all baselines across all five content categories.
(2) Variational information bottleneck and cross-format transfer. The variational IB achieves 72% zero-shot cross-format transfer from 360° to volumetric content, versus 48% for pre-trained fine-tuning baselines, and provides implicit uncertainty quantification that enables downstream bitrate decisions to condition on viewport prediction confidence without dedicated confidence estimation modules. Ablation confirms that the IB contributes the second-largest individual architectural gain, and that its KL annealing schedule is essential for avoiding the posterior collapse observed in approximately 30% of fixed-$\beta$ training runs.
(3) Sample-efficient meta-learning personalization. A compound MAML-Hyper-Network-episodic-memory adapter reaches 90% of its maximum personalization gain in 22 s, more than twice as efficient as standalone MAML (47 s), and delivers a statistically significant +4.8% QoE improvement from a single second of user interaction data. This enables personalization in first-session and live-streaming contexts where prior methods fail and remains effective even without pre-session content information.
(4) IMMERSE-1M dataset. The IMMERSE-1M benchmark, released publicly, provides the research community with the largest multi-modal dataset for immersive streaming research to date: 1000 h of 360° and volumetric video, 524 participants, 10,000 network traces, and 50,000+ subjective MOS ratings with synchronized seven-modality recordings spanning six content format categories.
(5) Volumetric 6DoF extension. The spherical harmonic orientation decoder and depth-aware tile importance weighting reduce 6DoF viewport error by 40.8% and increase Point-to-Plane PSNR by 7.3 dB over the depth-agnostic 2D baseline, establishing a strong foundation-model baseline for next-generation holographic streaming.
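The bottleneck mechanics referred to in contribution (2) can be sketched in a few functions. This is a minimal illustration rather than the authors' implementation: it assumes a diagonal Gaussian posterior parameterised by mean and log-variance, a linear warm-up for the KL weight (the actual annealing schedule is not specified in this section), and a scalar-valued decoder; all function names are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def beta_schedule(step, warmup_steps, beta_max):
    """Assumed linear KL annealing: beta rises from 0 to beta_max, then stays flat."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


def sample_latent(mu, log_var, rng):
    """Reparameterised sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def mc_mean_and_variance(decoder, mu, log_var, rng, num_samples=16):
    """Monte Carlo mean and variance of a scalar decoder output over latent samples."""
    outputs = [decoder(sample_latent(mu, log_var, rng)) for _ in range(num_samples)]
    mean = sum(outputs) / num_samples
    variance = sum((o - mean) ** 2 for o in outputs) / num_samples
    return mean, variance


rng = random.Random(0)
mu, log_var = [0.0] * 256, [0.0] * 256   # 256-dimensional latent, as in the paper
task_loss = 1.0                          # placeholder value for the multi-task loss
for step in (0, 500, 1000):
    beta = beta_schedule(step, warmup_steps=1000, beta_max=1e-3)
    total = task_loss + beta * kl_to_standard_normal(mu, log_var)
    print(step, beta, total)
```

Starting with a KL weight of zero lets the decoders learn to use the latent before the regulariser pulls the posterior toward the prior, which is the usual motivation for annealing over a fixed weight. The variance returned by `mc_mean_and_variance` is one simple way to expose prediction confidence to a downstream decision.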
Despite these results, four open challenges define the primary direction for future work.
First, ImmerseFM-3D degrades when depth maps or ambisonic audio are absent; a mixture-of-modalities training regime (partially demonstrated in Section 5.5, where masking random modality subsets during training recovers 94% of full-modality performance with any single modality missing) would generalize this robustness to arbitrary sensor configurations. Second, the user encoder processes sensitive biometric signals (eye-tracking and head-motion trajectories); a federated learning extension in which user encoders are trained locally and only the compressed latent is transmitted to the server would address these privacy concerns while preserving personalization capability. Third, causal robustness under dataset distribution shift requires invariant risk minimization beyond what the information bottleneck alone provides. Fourth, the current architecture requires all modalities to be available at the same temporal resolution; a flexible-rate encoder design would better accommodate the heterogeneous sampling rates encountered in real deployments.
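The random modality masking mentioned in the first challenge can be sketched as follows. This is an illustrative assumption about how such masking might be implemented, not the procedure used in Section 5.5: dropped modalities are replaced by zero vectors and a presence mask is returned alongside them, the drop probability is a free parameter, and the function name `mask_modalities` is hypothetical. The 512-dimensional encoder outputs follow the standardization described in the deployment discussion.

```python
import random

MODALITIES = ("video", "network", "head_motion", "audio",
              "depth", "eye_tracking", "scene_semantics")


def mask_modalities(features, drop_prob, rng, keep_at_least=1):
    """Randomly drop modalities; return (masked_features, presence_mask).

    Dropped modalities become zero vectors of the same length. At least
    `keep_at_least` modalities are always retained.
    """
    names = list(features)
    present = {name: rng.random() >= drop_prob for name in names}
    kept = [name for name in names if present[name]]
    if len(kept) < keep_at_least:
        dropped = [name for name in names if not present[name]]
        for name in rng.sample(dropped, keep_at_least - len(kept)):
            present[name] = True
    masked = {
        name: list(vec) if present[name] else [0.0] * len(vec)
        for name, vec in features.items()
    }
    return masked, present


rng = random.Random(42)
features = {name: [1.0] * 512 for name in MODALITIES}  # dummy encoder outputs
masked, present = mask_modalities(features, drop_prob=0.3, rng=rng)
print(sorted(name for name, ok in present.items() if not ok))
```

Passing the presence mask to the fusion stage, rather than relying on zeros alone, would let the attention layers distinguish a missing modality from a genuinely all-zero feature; whether the published model does this is not stated here.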

Author Contributions

Conceptualization, R.S.G.W.; methodology, R.S.G.W.; software, R.S.G.W.; validation, R.S.G.W.; formal analysis, R.S.G.W.; investigation, R.S.G.W.; data curation, R.S.G.W.; writing-original draft preparation, R.S.G.W.; writing-review and editing, R.S.G.W. and A.F.; visualization, R.S.G.W.; supervision, A.F.; project administration, A.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The IMMERSE-1M dataset introduced in this study, comprising 1000 h of 360° and volumetric video, head-motion and eye-tracking recordings from 524 participants, 10,000 network traces, and 50,000+ mean opinion scores, will be deposited in a publicly accessible repository under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence upon acceptance of this manuscript. The complete source code for ImmerseFM-3D, including all modality-specific encoders, the cross-modal attention module, the variational information bottleneck, all four task decoders, the MAML–HyperNetwork personalisation adapter, and the three-stage training scripts, will be released upon acceptance as open-source software under the MIT Licence at https://github.com/ythinkxyz/ImmerseFM-3D (accessed on 15 March 2026). Pre-trained model checkpoints for ImmerseFM-3D and the knowledge-distilled ImmerseFM-3D-Mobile variant will be made available simultaneously via Hugging Face Hub to enable immediate inference without retraining. The six existing head-motion and eye-tracking corpora consolidated into IMMERSE-1M are publicly available from their original sources: Wu et al. [46], Corbillon et al. [5], Xu et al. [39], David et al. [38], and Rai et al. [47].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
6DoF: Six degrees of freedom
ABR: Adaptive bitrate
BVR: Bandwidth violation rate
CEAP: Continuous physiological and behavioral emotion annotation
CLIP: Contrastive language–image pre-training
CMMST: Cross-modal multi-scale transformer
CNN: Convolutional neural network
CPU: Central processing unit
D-SAV360: Dataset of scanpaths on ambisonic videos
DASH: Dynamic adaptive streaming over HTTP
EDA: Electrodermal activity
EMD: Empirical mode decomposition
EPASS360: Ensemble prediction and allocation streaming system for 360° video
ERP: Equirectangular projection
FOA: First-order ambisonics
FoV: Field of view
GFLOPs: Giga floating-point operations
GNN: Graph neural network
GOP: Group of pictures
GPU: Graphics processing unit
GRU: Gated recurrent unit
HMD: Head-mounted display
HTTP: Hypertext transfer protocol
IB: Information bottleneck
IMMERSE: Immersive multi-modal evaluation and representation for streaming
ITU: International Telecommunication Union
JUST360: Joint utility streaming for 360° video
KKT: Karush–Kuhn–Tucker
KL: Kullback–Leibler (divergence)
LSTM: Long short-term memory
LTE: Long-term evolution
MADRL: Multi-agent deep reinforcement learning
MAE: Mean absolute error
MAML: Model-agnostic meta-learning
MDA: Multi-dimensional attention
MEC: Mobile edge computing
MFTR: Multi-modal fusion transformer
MLP: Multi-layer perceptron
MOS: Mean opinion score
MPEG: Moving Picture Experts Group
MSE: Mean squared error
OMAF: Omnidirectional media application format
P2P-PSNR: Point-to-plane peak signal-to-noise ratio
PSNR: Peak signal-to-noise ratio
QoE: Quality of experience
RGB: Red–green–blue
RL: Reinforcement learning
RMSE: Root mean squared error
RNN: Recurrent neural network
ROI: Region of interest
RTT: Round-trip time
SOTA: State of the art
SRD: Spatial relationship description
STMRQ: Spatiotemporal motion-aware rate-quality predictor
SVC: Scalable video coding
TCN: Temporal convolutional network
VATP360: Viewport adaptive 360° video streaming based on tile priority
VP: Viewport prediction
VPT360: Viewport prediction transformer for 360° video
VQA: Visual quality assessment
VR: Virtual reality

References

  1. Geyer, C.; Daniilidis, K. Omnidirectional video. Vis. Comput. 2003, 19, 405–416.
  2. Argyriou, L.; Economou, D.; Bouki, V. Design methodology for 360° immersive video applications: The case study of a cultural heritage virtual tour. Pers. Ubiquitous Comput. 2020, 24, 843–859.
  3. Hendriks Vettehen, P.; Wiltink, D.; Huiskamp, M.; Schaap, G.; Ketelaar, P. Taking the full view: How viewers respond to 360-degree video news. Comput. Hum. Behav. 2019, 91, 24–32.
  4. Sitzmann, V.; Serrano, A.; Pavel, A.; Agrawala, M.; Gutierrez, D.; Masia, B.; Wetzstein, G. Saliency in VR: How Do People Explore Virtual Environments? IEEE Trans. Vis. Comput. Graph. 2018, 24, 1633–1642.
  5. Corbillon, X.; Simon, G.; Devlic, A.; Chakareski, J. Viewport-Adaptive Navigable 360-Degree Video Delivery. In Proceedings of the 2017 IEEE International Conference on Communications (ICC), Paris, France, 21–25 May 2017; IEEE: New York, NY, USA, 2017; pp. 1–7.
  6. Xie, L.; Xu, Z.; Ban, Y.; Zhang, X.; Guo, Z. 360ProbDASH: Improving QoE of 360 Video Streaming Using Tile-based HTTP Adaptive Streaming. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 315–323.
  7. Skupin, R.; Sanchez, Y.; Jiao, L.; Hellge, C.; Schierl, T. Tile-Based Rate Assignment for 360-Degree Video Based on Spatio-Temporal Activity Metrics. In Proceedings of the 2018 IEEE International Symposium on Multimedia (ISM), Taichung, Taiwan, 10–12 December 2018; IEEE: New York, NY, USA, 2018; pp. 65–68.
  8. Son, J.; Ryu, E.S. Tile-based 360-degree video streaming for mobile virtual reality in cyber physical system. Comput. Electr. Eng. 2018, 72, 361–368.
  9. De La Fuente, Y.S.; Bhullar, G.S.; Skupin, R.; Hellge, C.; Schierl, T. Delay Impact on MPEG OMAF’s Tile-Based Viewport-Dependent 360° Video Streaming. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 18–28.
  10. Ozcinar, C.; Cabrera, J.; Smolic, A. Viewport-Aware Omnidirectional Video Streaming Using Visual Attention and Dynamic Tiles. In Proceedings of the 2018 7th European Workshop on Visual Information Processing (EUVIP), Tampere, Finland, 26–28 November 2018; IEEE: New York, NY, USA, 2018; pp. 1–6.
  11. Jiang, X.; Naas, S.A.; Chiang, Y.H.; Sigg, S.; Ji, Y. SVP: Sinusoidal Viewport Prediction for 360-Degree Video Streaming. IEEE Access 2020, 8, 164471–164481.
  12. Bao, Y.; Wu, H.; Zhang, T.; Ramli, A.A.; Liu, X. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; IEEE: New York, NY, USA, 2016; pp. 1161–1170.
  13. Xu, M.; Song, Y.; Wang, J.; Qiao, M.; Huo, L.; Wang, Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2693–2708.
  14. Hu, H.N.; Lin, Y.C.; Liu, M.Y.; Cheng, H.T.; Chang, Y.J.; Sun, M. Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 1396–1405.
  15. Chao, F.Y.; Ozcinar, C.; Smolic, A. Transformer-based Long-Term Viewport Prediction in 360° Video: Scanpath is All You Need. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), Tampere, Finland, 6–8 October 2021; IEEE: New York, NY, USA, 2021; pp. 1–6.
  16. Zhang, Z.; Chen, Y.; Zhang, W.; Yan, C.; Zheng, Q.; Wang, Q.; Chen, W. Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 3560–3568.
  17. Tian, Y.; Zhong, Y.; Han, Y.; Chen, F. Viewport prediction with cross modal multiscale transformer for 360° video streaming. Sci. Rep. 2025, 15, 30346.
  18. Guo, Y.; Xu, M.; Jiang, L.; Deng, X.; Zhou, J.; Chen, G.; Sigal, L. Proposal With Alignment: A Bi-Directional Transformer for 360° Video Viewport Proposal. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 11423–11437.
  19. Lyu, J. A MDA-based multi-modal fusion model for panoramic viewport prediction. Adv. Eng. Innov. 2024, 15, 1–8.
  20. Ghosh, A.; Aggarwal, V.; Qian, F. A Robust Algorithm for Tile-based 360-degree Video Streaming with Uncertain FoV Estimation. arXiv 2018, arXiv:1812.00816.
  21. Zhao, L.; Cui, Y.; Liu, Z.; Zhang, Y.; Yang, S. Adaptive Streaming of 360 Videos with Perfect, Imperfect, and Unknown FoV Viewing Probabilities in Wireless Networks. IEEE Trans. Image Process. 2021, 30, 7744–7759.
  22. Feng, W.; Wang, S.; Dai, Y. Adaptive 360-Degree Streaming: Optimizing With Multi-Window and Stochastic Viewport Prediction. IEEE Trans. Mob. Comput. 2025, 24, 5903–5915.
  23. Setayesh, M.; Wong, V.W.S. Viewport Prediction, Bitrate Selection, and Beamforming Design for THz-Enabled 360° Video Streaming. IEEE Trans. Wirel. Commun. 2025, 24, 1849–1865.
  24. Park, S.; Hoai, M.; Bhattacharya, A.; Das, S.R. Adaptive Streaming of 360-Degree Videos with Reinforcement Learning. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; IEEE: New York, NY, USA, 2021; pp. 1838–1847.
  25. Ao, A.; Park, S. Applying Transformer-Based Computer Vision Models to Adaptive Bitrate Allocation for 360 Live Streaming. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
  26. Nguyen, D.V.; Tran, H.T.T.; Thang, T.C. An Evaluation of Tile Selection Methods for Viewport-Adaptive Streaming of 360-Degree Video. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–24.
  27. Pang, Z. VATP360: Viewport Adaptive 360-Degree Video Streaming based on Tile Priority. arXiv 2023, arXiv:2307.15984.
  28. Croci, S.; Ozcinar, C.; Zerman, E.; Knorr, S.; Cabrera, J.; Smolic, A. Visual attention-aware quality estimation framework for omnidirectional video using spherical Voronoi diagram. Qual. User Exp. 2020, 5, 4.
  29. Van Kasteren, A.; Brunnström, K.; Hedlund, J.; Snijders, C. Quality of experience of 360 video—Subjective and eye-tracking assessment of encoding and freezing distortions. Multimed. Tools Appl. 2022, 81, 9771–9802.
  30. Chiariotti, F. A survey on 360-degree video: Coding, quality of experience and streaming. Comput. Commun. 2021, 177, 133–155.
  31. Shen, G.; Ma, M.; Xu, G. An Optimal SVC Bitstream Schema for Viewport-dependent 360-degree Video Streaming. arXiv 2023, arXiv:2304.05654.
  32. Romero Rondon, M.F.; Sassatelli, L.; Aparicio-Pardo, R.; Precioso, F. TRACK: A New Method from a Re-examination of Deep Architectures for Head Motion Prediction in 360-degree Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5681–5699.
  33. Jiang, Y.; Poularakis, K.; Kiedanski, D.; Kompella, S.; Tassiulas, L. Robust and Resource-efficient Machine Learning Aided Viewport Prediction in Virtual Reality. arXiv 2022, arXiv:2212.09945.
  34. Feng, X.; Swaminathan, V.; Wei, S. Viewport Prediction for Live 360-Degree Mobile Video Streaming Using User-Content Hybrid Motion Tracking. Proc. ACM on Interact. Mob. Wearable Ubiquitous Technol. 2019, 3, 1–22.
  35. Zhang, L.; Chen, P.; Zhang, C.; Pan, C.; Long, T.; Xu, W.; Cui, L.; Liu, J. Optimizing Mobile-Friendly Viewport Prediction for Live 360-Degree Video Streaming. IEEE Trans. Mob. Comput. 2025, 24, 10441–10455.
  36. Shimamura, R.; Feng, Q.; Koyama, Y.; Nakatsuka, T.; Fukayama, S.; Hamasaki, M.; Goto, M.; Morishima, S. Audio–visual object removal in 360-degree videos. Vis. Comput. 2020, 36, 2117–2128.
  37. Liu, H.; Luo, T.; Luo, K.; Jiang, Q.; Sun, P.; Wang, J.; Huang, R.; Chen, Q.; Wang, W.; Li, X.; et al. OmniAudio: Generating Spatial Audio from 360-Degree Video. arXiv 2025, arXiv:2504.14906.
  38. David, E.J.; Gutiérrez, J.; Coutrot, A.; Da Silva, M.P.; Callet, P.L. A dataset of head and eye movements for 360° videos. In Proceedings of the 9th ACM Multimedia Systems Conference, Amsterdam, The Netherlands, 12–15 June 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 432–437.
  39. Xu, Y.; Du, J.; Wang, J.; Ning, Y.; Zhou, S.; Cao, Y. Panonut360: A Head and Eye Tracking Dataset for Panoramic Video. In Proceedings of the ACM Multimedia Systems Conference 2024, Bari, Italy, 15–18 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 319–325.
  40. Zhang, G.; Wu, C.; Gao, Q. Exploiting layer and spatial correlations to enhance SVC and tile based 360-degree video streaming. Comput. Netw. 2021, 191, 107985.
  41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
  42. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
  43. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv 2017, arXiv:1703.03400.
  44. Li, J.; Li, Z.; Liu, Z.; Zhou, P.; Hong, R.; Li, Q.; Hu, H. Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information. arXiv 2025, arXiv:2311.16462.
  45. Xue, T.; El Ali, A.; Zhang, T.; Ding, G.; Cesar, P. CEAP-360VR: A Continuous Physiological and Behavioral Emotion Annotation Dataset for 360° VR Videos. IEEE Trans. Multimed. 2023, 25, 243–255.
  46. Wu, C.; Tan, Z.; Wang, Z.; Yang, S. A Dataset for Exploring User Behaviors in VR Spherical Video Streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 193–198.
  47. Rai, Y.; Gutiérrez, J.; Le Callet, P. A Dataset of Head and Eye Movements for 360 Degree Images. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 205–210.
  48. Martinez, J.; Black, M.J.; Romero, J. On Human Motion Prediction Using Recurrent Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 4674–4683.
  49. Manfredi, G.; Racanelli, V.A.; De Cicco, L.; Mascolo, S. LSTM-based Viewport Prediction for Immersive Video Systems. In Proceedings of the 2023 21st Mediterranean Communication and Computer Networking Conference (MedComNet), Island of Ponza, Italy, 13–15 June 2023; IEEE: New York, NY, USA, 2023; pp. 49–52.
  50. Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–23.
  51. Gao, B.; Sheng, D.; Zhang, L.; Qi, Q.; He, B.; Zhuang, Z.; Wang, J. STAR-VP: Improving Long-term Viewport Prediction in 360° Videos via Space-aligned and Time-varying Fusion. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 5556–5565.
  52. Wang, M.; Peng, S.; Chen, X.; Zhao, Y.; Xu, M.; Xu, C. CoLive: An Edge-Assisted Online Learning Framework for Viewport Prediction in 360° Live Streaming. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6.
  53. Wahba, M.Z.A.; Baldoni, S.; Battisti, F. Learning-Based Viewport Prediction for 360-Degree Videos: A Review. Electronics 2025, 14, 3743.
  54. Qian, F.; Han, B.; Xiao, Q.; Gopalakrishnan, V. Flare: Practical Viewport-Adaptive 360-Degree Video Streaming for Mobile Devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, New Delhi, India, 29 October–2 November 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 99–114.
  55. Li, Z.; Wang, Y.; Liu, Y.; Li, J.; Zhu, P. JUST360: Optimizing 360-Degree Video Streaming Systems with Joint Utility. IEEE Trans. Broadcast. 2024, 70, 468–481.
  56. Zou, J.; Li, C.; Liu, C.; Yang, Q.; Xiong, H.; Steinbach, E. Probabilistic Tile Visibility-Based Server-Side Rate Adaptation for Adaptive 360-Degree Video Streaming. IEEE J. Sel. Top. Signal Process. 2020, 14, 161–176.
  57. Kumar, S.; Bhagat, L.; A., A.F.; Jin, J. Multi-neural network based tiled 360video caching with Mobile Edge Computing. J. Netw. Comput. Appl. 2022, 201, 103342.
  58. Qiu, M.; Shao, F. Blind 360-degree image quality assessment via saliency-guided convolution neural network. Optik 2021, 240, 166858.
  59. Chen, P.W.; Yang, T.S.; Huang, G.L.; Huang, C.W.; Chao, Y.C.; Lu, C.H.; Wu, P.Y. Viewing Bias Matters in 360 Videos Visual Saliency Prediction. IEEE Access 2023, 11, 46084–46094.
  60. Vats, S.; Park, J.; Nahrstedt, K.; Zink, M.; Sitaraman, R.; Hellwagner, H. Semantic-Aware View Prediction for 360-Degree Videos at the 5G Edge. In Proceedings of the 2022 IEEE International Symposium on Multimedia (ISM), Naples, Italy, 5–7 December 2022; IEEE: New York, NY, USA, 2022; pp. 121–128.
  61. Adhuran, J.; Martini, M.G. Efficient viewport prediction and tiling schemes for 360 degree video streaming. In Proceedings of the ACM Multimedia Systems Conference 2024, Bari, Italy, 15–18 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 374–380.
  62. Xu, Y.; Dong, Y.; Wu, J.; Sun, Z.; Shi, Z.; Yu, J.; Gao, S. Gaze Prediction in Dynamic 360° Immersive Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 5333–5342.
  63. Chao, F.Y.; Ozcinar, C.; Smolic, A. A QoE Study for Viewport Prediction Approaches in Tile-Based 360 Video Streaming. 2023. Available online: https://www.researchgate.net/publication/366974389_A_QoE_Study_for_Viewport_Prediction_Approaches_in_Tile-based_360_Video_Streaming?channel=doi&linkId=63bc3ffcc3c99660ebdf4b3f&showFulltext=true (accessed on 15 March 2026).
  64. Zhou, H.; Zhao, F.; Li, C. Multi-scale Historical Trajectory Decomposition for Viewport Prediction in 360-degree Videos. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 1–22.
  65. Liu, S.; Wang, Y.; Li, S.; Liu, Y. MADRL-based bitrate allocation for QoE fairness in 360° video streaming with viewport prediction. Multimed. Syst. 2025, 31, 343.
  66. Corbillon, X.; De Simone, F.; Simon, G. 360-Degree Video Head Movement Dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 199–204.
Figure 1. ImmerseFM-3D System Architecture showing the hierarchical inference framework with on-device lightweight uncertainty estimator (LUE) and adaptive partitioning engine.
Figure 2. Variational information bottleneck. Cross-modal features are concatenated and mapped to Gaussian parameters $(\mu_z, \sigma_z^2) \in \mathbb{R}^{256}$. The KL term regularizes toward $\mathcal{N}(0, I)$. Multiple samples $z_s$ enable Monte Carlo uncertainty estimates $\sigma_{\hat{y}}^2$ for all downstream tasks.
Figure 3. Meta-learning personalization pipeline. A new user’s behavior embedding retrieves similar past adaptations from episodic memory $\mathcal{M}$; a hypernetwork then generates parameter adjustments $\delta\theta$; MAML bi-level optimization ensures the base model can adapt in 5 gradient steps from 1–30 s of support data.
Figure 4. Volumetric video extension. Orientation is decoded via real spherical harmonics up to order $L = 3$; depth encoder features $f_d$ modulate tile importance exponentially with depth distance from the predicted viewport center; volumetric QoE is estimated from the concatenation $[z; f_d]$.
Figure 5. Three-stage training pipeline. Stage 1 pre-trains all components on the full multi-task objective with KL annealing. Stage 2 meta-trains the adaptation modules on held-out users via MAML. Stage 3 jointly fine-tunes at a reduced learning rate.
Figure 6. IMMERSE-1M dataset composition. (a) Distribution of 1000 h of video content across five content categories (Sports, Nature, Documentary, Urban, Action/Cinematic). (b) Video format breakdown: 360° static/dynamic, high-motion, volumetric, light-field, and synthetic content (total 1000 h); the final two segments (Light Field, Synthetic) each represent 100 h. (c) Network trace source distribution (total: 10,000 traces) spanning FCC 4G LTE, 5G mmWave (ns-3 simulated), HSDPA, campus WiFi, Starlink, and controlled synthetic conditions, covering bandwidths 0.1–800 Mbps and RTTs 2–500 ms.
Figure 7. Overview of the five evaluation experiments. All experiments draw from the IMMERSE-1M dataset; each targets a distinct generalization capability of ImmerseFM-3D with dedicated metrics.
Figure 8. Per-content-category results. (a) Viewport prediction MAE at 1 s horizon (°); lower is better. (b) QoE Pearson correlation $\rho$ with MOS; higher is better. ImmerseFM-3D (blue) consistently outperforms the best single-task baseline (orange) across all five content categories. All improvements are statistically significant ($p < 0.05$, Bonferroni-corrected).
Figure 9. QoE improvement over non-personalized baseline versus adaptation data volume. Shaded band: 95% CI for ImmerseFM-3D. The dashed horizontal line indicates 90% of ImmerseFM-3D’s maximum gain. The red marker highlights the 30 s operating point from Table 16.
Figure 10. Ablation study: average performance across all four tasks (viewport prediction, bitrate allocation, tile selection, QoE estimation) relative to the full ImmerseFM-3D model (100%). The largest individual contributions come from the cross-modal attention fusion (−28.5 pp) and the information bottleneck (−21.8 pp). Removing any single component degrades performance by at least 14.7 pp.
Figure 11. Average cross-modal attention weight matrix $A \in \mathbb{R}^{7 \times 7}$ aggregated over the full test set. Modality codes: V = video, N = network, U = user head motion, A = audio, D = depth, E = eye tracking, S = scene semantics. Strong diagonal entries indicate self-attention dominance; notable off-diagonal peaks (U ↔ E: 0.15–0.16; A ↔ S: 0.14–0.15; V ↔ U: 0.18) reveal meaningful cross-modal dependencies learned without explicit supervision.
Table 1. Comparative summary of representative 360° streaming methods. ✓ = addressed; ∘ = partially; – = not addressed. VP: viewport prediction; ABR: adaptive bitrate; TS: tile selection; QoE: quality-of-experience estimation; MM: multi-modal input (>1 modality); Pers.: user personalization; 6DoF: volumetric support.
Method | VP | ABR | TS | QoE | MM | Pers. | 6DoF
VPT360 [15]
MFTR [16]
CMMST [17]
TRACK [32]
STAR-VP [51]
360ProbDASH [6]
JUST360 [55]
RL-Streaming [24]
VATP360 [27]
Voronoi-VQA [28]
CEAP-360VR [45]
CoLive [52]
MAML [43]
STVP [44]
ImmerseFM-3D (ours)
Table 2. Modality-specific encoder architectures and parameter counts.
Modality | Architecture | Input Size | Params (M)
Video ($X_v$) | 3D Swin Transformer (4 stages) | $16 \times H \times W \times 3$ | 42.3
Network ($X_n$) | Dilated TCN (4 layers, $d = 1, 2, 4, 8$) | $64 \times 3$ | 1.2
Head Motion ($X_u$) | Transformer Encoder (6L, 8H) | $150 \times 3$ | 8.7
Audio ($X_a$) | ResNet-50 + Ambisonics MLP | Mel-spectrogram | 24.6
Depth ($X_d$) | 3D CNN (5 layers) | $T_d \times H \times W$ | 3.8
Eye Tracking ($X_e$) | BiLSTM (2 layers) | $150 \times 4$ | 2.1
Scene Semantics ($X_s$) | CLIP ViT-B/32 + Transformer (frozen) | Key frames @ 1 fps | 86.4
Table 3. ImmerseFM-3D component summary with parameter counts. Total: ≈177 M.
Component | Architecture | Params | Output Dim
Video Encoder | 3D Swin-T (4 stages) | 42.3 M | $T \times H \times W \times 512$
Network Encoder | Dilated TCN (4 layers) | 1.2 M | $\mathbb{R}^{512}$
User Encoder | Transformer (6L, 8H) | 8.7 M | $\mathbb{R}^{512}$
Audio Encoder | ResNet-50 + Amb. MLP | 24.6 M | $\mathbb{R}^{512}$
Depth Encoder | 3D CNN (5 layers) | 3.8 M | $\mathbb{R}^{512}$
Eye Tracking Encoder | BiLSTM (2 layers) | 2.1 M | $\mathbb{R}^{512}$
Scene Semantics | CLIP ViT-B/32 + Trans. (frozen) | 86.4 M | $\mathbb{R}^{512}$
Cross-Modal Attention | 4 layers, 8 heads | 4.2 M | $\mathbb{R}^{7 \times 512}$
Information Bottleneck | MLP $512 \to 256$ | 0.13 M | $\mathbb{R}^{256}$
Task Decoders (4×) | MLP + GRU | 1.8 M | Task-dependent
Meta-Learning Adapter | HyperNet + Episodic Mem. | 2.3 M | $\delta\theta$
Total | | ≈177 M |
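As a quick arithmetic check, the component counts in Table 3 can be summed directly. The sketch below assumes that the 1.8 M listed for the four task decoders is their combined size (the table does not say whether it is per decoder); the dictionary keys are illustrative labels, not identifiers from the authors' code.

```python
# Parameter counts in millions, copied from Table 3.
# Assumption: the 1.8 M for "Task Decoders (4x)" covers all four decoders together.
PARAMS_M = {
    "Video Encoder": 42.3,
    "Network Encoder": 1.2,
    "User Encoder": 8.7,
    "Audio Encoder": 24.6,
    "Depth Encoder": 3.8,
    "Eye Tracking Encoder": 2.1,
    "Scene Semantics": 86.4,
    "Cross-Modal Attention": 4.2,
    "Information Bottleneck": 0.13,
    "Task Decoders": 1.8,
    "Meta-Learning Adapter": 2.3,
}

total_m = sum(PARAMS_M.values())
scene_m = PARAMS_M["Scene Semantics"]
other_m = total_m - scene_m

print(f"total: {total_m:.2f} M")
print(f"scene-semantics share of total: {100 * scene_m / total_m:.1f}%")
print(f"all other components: {other_m:.2f} M")
```

Under that assumption the sum is 177.53 M, which agrees with the ≈177 M total reported in the table; the scene-semantics encoder alone accounts for roughly half of it.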
Table 4. IMMERSE-1M video content by category, source, and format.
Category | Primary Source | Duration | Resolution | Format
360° Static | Wu et al. [46]; Corbillon et al. [5] | 200 h | 4–8 K | ERP, Cubemap
360° Dynamic | Xu et al. [62]; David et al. [38] | 250 h | 4–8 K | ERP
360° High Motion | Rai et al. [47] | 150 h | 6–8 K | ERP
Volumetric | 8i Voxelized; Microsoft RGB-D | 200 h | $512^3$ | Point clouds, meshes
Light Field | Stanford Lytro; new captures | 100 h | $5 \times 5$ views | LF images
Synthetic | Unreal Engine renders | 100 h | 4–8 K | ERP + depth
Total | | 1000 h | |
Table 5. IMMERSE-1M user behavior data sources, modalities, and sampling rates.
Source | Users | Avg. Duration | Modalities | Rate
Wu et al. [46] | 48 | 60 s/video | Head (6DoF) | 30 Hz
Corbillon et al. [5] | 59 | 70 s/video | Head (3DoF) | 30 Hz
Xu et al. [62] | 120 | 180 s/video | Head + Eye | 90/30 Hz
David et al. [38] | 57 | 30–60 s | Head + Eye | 60 Hz
Rai et al. [47] | 40 | 20 s/image | Head + Eye | 30 Hz
New recordings | 200 | 300 s/video | Head + Eye + Pupillometry | 90 Hz
Total | 524 | ≈1200 h | |
Table 6. IMMERSE-1M network trace sources and characteristics.
Source | Type | Samples | BW Range | RTT Range
FCC 4G | Real-world | 3000 | 0.5–50 Mbps | 20–150 ms
5G mmWave (ns-3) | Simulated | 2000 | 10–800 Mbps | 5–30 ms
HSDPA | Real-world | 2000 | 0.1–10 Mbps | 50–500 ms
WiFi (campus) | Real-world | 1500 | 1–200 Mbps | 2–50 ms
Starlink | Real-world | 500 | 20–200 Mbps | 20–80 ms
Controlled synthetic | Generated | 1000 | Variable patterns | Variable
Total | | 10,000 | |
Table 7. IMMERSE-1M dataset splits (user-disjoint and content-disjoint).
Split | Videos | Users | Net. Traces | Purpose
Pre-training | 80% (800 h) | 80% (419) | 80% (8000) | Model pre-training (Stage 1)
Meta-training | 10% (100 h) | 10% (52) | 10% (1000) | MAML adaptation (Stage 2)
Test | 10% (100 h) | 10% (53) | 10% (1000) | Final held-out evaluation
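A user-disjoint partition with the user counts from Table 7 can be sketched as follows. This is only an illustration of the idea: the integer user IDs, the seed, and the helper name `split_users` are hypothetical, and the authors' actual assignment procedure is not described here.

```python
import random

def split_users(user_ids, n_pretrain, n_meta, seed=0):
    """Shuffle users once and cut into three non-overlapping groups."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    pretrain = ids[:n_pretrain]
    meta = ids[n_pretrain:n_pretrain + n_meta]
    held_out = ids[n_pretrain + n_meta:]
    return pretrain, meta, held_out

# 524 users in total; 419 / 52 / 53 as listed in Table 7.
pretrain_users, meta_users, test_users = split_users(range(524), 419, 52)

# No user may appear in more than one split.
assert not set(pretrain_users) & set(meta_users)
assert not set(pretrain_users) & set(test_users)
assert not set(meta_users) & set(test_users)

print(len(pretrain_users), len(meta_users), len(test_users))
```

Splitting by user rather than by session is what keeps a held-out viewer's head-motion traces out of training; a content-disjoint split would apply the same pattern to video IDs.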
Table 8. Head-motion statistics by participant geographic group. Values are mean ± SD across all participants within each cohort. p-values are from two-sample t-tests; Cohen’s d measures effect size.
Metric | East Asian ($n = 168$) | W. European ($n = 356$) | p-Value | Cohen's d
Equatorial bias (%) | 68.3 ± 9.1 | 65.7 ± 10.4 | 0.014 | 0.26
Exploration radius (°) | 32.4 ± 8.3 | 35.1 ± 9.6 | 0.003 | 0.30
Saccade freq. ($\mathrm{min}^{-1}$) | 11.2 ± 3.4 | 12.8 ± 4.1 | 0.001 | 0.43
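The effect sizes in Table 8 can be approximately recomputed from the reported means, standard deviations, and cohort sizes. The sketch below uses the common pooled-standard-deviation form of Cohen's d; the table does not state which variant was used, so this choice is an assumption.

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d with a pooled standard deviation (absolute value)."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return abs(m1 - m2) / math.sqrt(pooled_var)

N_EA, N_WE = 168, 356  # cohort sizes from Table 8

# (mean, SD) for East Asian and W. European cohorts, from Table 8.
rows = {
    "Equatorial bias (%)": ((68.3, 9.1), (65.7, 10.4)),
    "Exploration radius (deg)": ((32.4, 8.3), (35.1, 9.6)),
    "Saccade freq. (1/min)": ((11.2, 3.4), (12.8, 4.1)),
}

effect = {
    name: cohens_d(a[0], a[1], N_EA, b[0], b[1], N_WE)
    for name, (a, b) in rows.items()
}

for name, d in effect.items():
    print(f"{name}: d = {d:.2f}")
```

With the rounded table inputs this gives roughly 0.26, 0.29, and 0.41, close to the reported 0.26, 0.30, and 0.43; the small gaps are plausibly due to rounding of the published means and SDs or to a different pooling convention.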
Table 9. Ablation study configurations. Each variant removes or replaces one component of the full ImmerseFM-3D model.
Variant | Description
Full model | Complete ImmerseFM-3D (all components)
w/o cross-modal | Separate task-specific encoders; no cross-modal fusion
w/o bottleneck | Direct feature pass-through; no information bottleneck
w/o meta-learning | No personalization adapter; fixed population-level model
w/o uncertainty | Deterministic latent (single sample, $S = 1$)
w/o audio | Audio encoder removed; 6 modalities only
w/o depth | Depth encoder removed; 2D video content only
w/o semantics | CLIP scene encoder removed; no semantic features
Single-task | Separate independently trained model per task
Table 10. Viewport prediction baselines.
Method | Description | Reference
VPT360 | Transformer-based trajectory prediction | Chao et al. [63]
MFTR | Multi-modal fusion transformer for viewport estimation | Zhang et al. [16]
DHP | Deep reinforcement learning viewport prediction | Xu et al. [13]
EMD-ML | Empirical mode decomposition + LSTM | Zhou et al. [64]
STMRQ | Spatiotemporal graph convolutional network | Liu et al. [65]
Table 11. Bitrate adaptation baselines.
Method | Description | Reference
360ProbDASH | Probabilistic tile pre-fetching ABR | Xie et al. [6]
Flare | Viewport-adaptive bitrate streaming | Qian et al. [54]
EPASS360 | Ensemble viewport prediction + allocation | Zhang et al. [40]
JUST360 | Joint utility optimization for 360° streaming | Li et al. [55]
RL-Streaming | Deep RL for adaptive bitrate selection | Park et al. [24]
Table 12. QoE estimation and meta-learning personalization baselines.
Method | Description | Reference
QoE Estimation
Voronoi-VQA | Visual attention-aware 360° quality assessment | Croci et al. [28]
Blind360 | Blind 360° image quality via saliency CNN | Qiu and Shao [58]
CEAP-360VR | Physiological + behavioral QoE estimation | Xue et al. [45]
Meta-Learning Personalization
MAML | Model-Agnostic Meta-Learning | Golub et al. [43]
Reptile | First-order meta-learning (no second-order grad.) | Jiang et al. [33]
Fine-tuning | Standard transfer learning from pre-trained model | –
Table 13. Viewport prediction MAE (°) at increasing prediction horizons. Lower is better. Best result per horizon is bold; p < 0.05 vs. best baseline.
Method | 0.5 s | 1 s | 2 s | 3 s | 5 s
VPT360 [15] | 3.21 | 8.97 | 16.42 | 24.35 | 38.71
MFTR | 2.84 | 7.90 | 14.63 | 22.10 | 35.44
DHP | 3.47 | 9.21 | 17.05 | 25.80 | 40.12
EMD-ML | 2.91 | 8.12 | 15.01 | 21.90 | 34.87
STMRQ | 2.76 | 7.83 | 14.20 | 21.45 | 34.10
ImmerseFM-3D (ours) | 1.93 | 5.21 | 9.87 | 14.82 | 24.16
±95% CI | ±0.04 | ±0.09 | ±0.18 | ±0.26 | ±0.41
Improvement vs. best | −30.1% | −33.5% | −30.5% | −31.0% | −29.2%
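The "Improvement vs. best" row of Table 13 is the relative change in MAE against the strongest baseline at each horizon. A minimal sketch of that arithmetic, using only the values printed in the table (variable names are illustrative):

```python
HORIZONS = ["0.5 s", "1 s", "2 s", "3 s", "5 s"]

# Baseline MAE in degrees, copied from Table 13.
BASELINES = {
    "VPT360": [3.21, 8.97, 16.42, 24.35, 38.71],
    "MFTR": [2.84, 7.90, 14.63, 22.10, 35.44],
    "DHP": [3.47, 9.21, 17.05, 25.80, 40.12],
    "EMD-ML": [2.91, 8.12, 15.01, 21.90, 34.87],
    "STMRQ": [2.76, 7.83, 14.20, 21.45, 34.10],
}
OURS = [1.93, 5.21, 9.87, 14.82, 24.16]

improvement_pct = []
for i, horizon in enumerate(HORIZONS):
    # Lowest MAE among the baselines at this horizon.
    best_name, best_mae = min(
        ((name, vals[i]) for name, vals in BASELINES.items()),
        key=lambda item: item[1],
    )
    change = 100.0 * (OURS[i] - best_mae) / best_mae
    improvement_pct.append(change)
    print(f"{horizon}: best baseline {best_name} ({best_mae:.2f}), change {change:+.1f}%")
```

Recomputing from the rounded table entries reproduces the reported row to within about 0.1 percentage points; the 3 s and 5 s entries come out marginally smaller in magnitude than the printed −31.0% and −29.2%, presumably because the published figures were derived from unrounded MAE values.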
Table 14. Bitrate allocation and tile selection results. ↑ higher is better; ↓ lower is better. Best per column is bold.
Method | Acc@1 ↑ | Acc@3 ↑ | BVR ↓ | MAE ↓ | F1 ↑ | IoU ↑
360ProbDASH [6] | 70.3 | 88.4 | 14.2 | 0.91 | 0.71 | 0.59
Flare [54] | 74.8 | 90.1 | 11.7 | 0.84 | 0.74 | 0.62
EPASS360 [40] | 77.1 | 91.8 | 10.3 | 0.78 | 0.77 | 0.65
JUST360 [55] | 81.2 | 93.5 | 8.3 | 0.71 | 0.82 | 0.70
RL-Streaming | 79.4 | 92.7 | 9.1 | 0.75 | 0.79 | 0.67
ImmerseFM-3D | 92.1 | 97.6 | 3.1 | 0.43 | 0.91 | 0.84
±95% CI | ±0.4 | ±0.2 | ±0.1 | ±0.02 | ±0.01 | ±0.01
Statistically significant improvement over the best baseline (p < 0.05, Bonferroni-corrected paired t-test, n = 5 independent runs).
Table 15. QoE estimation results. ↑ higher is better; ↓ lower is better.
Method | Pearson ρ | Spearman $\rho_s$ | RMSE ↓
Voronoi-VQA [28] | 0.763 | 0.741 | 0.521
Blind360 [58] | 0.712 | 0.694 | 0.567
CEAP-360VR | 0.741 | 0.728 | 0.538
ImmerseFM-3D | 0.891 | 0.874 | 0.312
±95% CI | ±0.008 | ±0.009 | ±0.011
Statistically significant improvement over the best baseline (p < 0.05, Bonferroni-corrected paired t-test, n = 5 independent runs).
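For readers who want to reproduce the three metrics in Table 15 on their own predictions, the following standard-library sketch computes Pearson correlation, Spearman rank correlation (with average ranks for ties), and RMSE. The MOS and prediction values are made-up toy numbers for illustration only, not data from IMMERSE-1M.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

# Hypothetical toy data: ground-truth MOS and predicted QoE on a 1-5 scale.
mos = [1.5, 2.0, 3.2, 4.1, 4.8]
predicted = [1.8, 2.1, 3.0, 4.4, 4.6]

rho = pearson(predicted, mos)
rho_s = spearman(predicted, mos)
err = rmse(predicted, mos)
print(f"Pearson={rho:.3f}  Spearman={rho_s:.3f}  RMSE={err:.3f}")
```

Pearson rewards a linear relationship with MOS, Spearman only the ordering, and RMSE the absolute calibration, which is why the table reports all three.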
Table 16. Few-shot personalization efficiency metrics. “90%-threshold” is the adaptation data (seconds) required to reach 90% of maximum QoE gain for each method.
Method | 1 s | 5 s | 10 s | 30 s | 90%-Threshold (s)
Fine-tuning | +1.2 | +3.8 | +5.6 | +8.1 | >60
Reptile | +2.1 | +5.3 | +7.4 | +10.2 | >60
MAML [43] | +3.4 | +6.9 | +8.8 | +11.3 | 47
ImmerseFM-3D | +4.8 | +9.1 | +11.7 | +15.7 | 22
Statistically significant improvement over the best baseline (p < 0.05, Bonferroni-corrected paired t-test, n = 5 independent runs).
Table 17. Cross-format transfer ratios (% of full in-domain training performance). Transfer source: 360° video. Target: volumetric video test set. Higher is better; p < 0.05.
Method | Zero-Shot | 1% | 5% | 10% | 25%
Train from scratch | – | 32 | 51 | 67 | 82
Fine-tuning (pre-trained) | 48 | 61 | 74 | 84 | 92
MAML transfer | 55 | 66 | 78 | 87 | 94
ImmerseFM-3D | 72 | 81 | 89 | 94 | 97
Table 18. Volumetric video streaming results. VP: viewport prediction; Depth: monocular depth estimation; Volumetric QoE: point-cloud quality + MOS correlation.
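Method | 6DoF VP Error: $E_{6\mathrm{DoF}}$ | 6DoF VP Error: Pos. (cm) ↓ | Depth Estimation: RMSE ↓ | Depth Estimation: $\delta_1$ (%) ↑ | Volumetric QoE: P2P-PSNR ↑ | Volumetric QoE: MOS ρ
2D-only (no depth) | 18.4 | 12.3 | – | – | 28.3 | 0.641
Depth-agnostic | 15.7 | 10.8 | 0.38 | 79.1 | 31.4 | 0.703
ImmerseFM-3D | 9.3 | 6.1 | 0.21 | 91.4 | 38.7 | 0.862
±95% CI | ±0.4 | ±0.3 | ±0.01 | ±0.6 | ±0.4 | ±0.012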
Bold values indicate the best result per column. Statistically significant improvement over the best baseline (p < 0.05, Bonferroni-corrected paired t-test, n = 5 independent runs). ↑ higher is better; ↓ lower is better.
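The depth columns of Table 18 report RMSE and the threshold accuracy δ1. The sketch below assumes the conventional definition from the monocular depth literature, in which δ1 is the percentage of valid pixels whose ratio max(pred/gt, gt/pred) falls below 1.25; the threshold actually used by the authors is not stated in this part of the article. The depth values are toy numbers.

```python
import math

def delta1(pred, gt, threshold=1.25):
    """Percentage of valid pixels with max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return 100.0 * hits / len(valid)

def depth_rmse(pred, gt):
    """Root-mean-square depth error over all pixels."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

# Hypothetical flattened depth maps (same arbitrary unit for both).
gt_depth = [1.0, 2.0, 4.0, 5.0]
pred_depth = [1.1, 2.6, 3.9, 5.2]

acc = delta1(pred_depth, gt_depth)
err = depth_rmse(pred_depth, gt_depth)
print(f"delta1 = {acc:.1f}%  RMSE = {err:.3f}")
```

In this toy case only the second pixel (ratio 1.3) misses the threshold, so δ1 is 75%. Because δ1 is a ratio test it is scale-relative, whereas RMSE penalizes absolute error and is dominated by far-range pixels.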
Table 19. Per-task ablation results: VP MAE @ 1 s (°), Bitrate Acc@1 (%), Tile F1, and QoE ρ. Degradation vs. the full model is given in parentheses (relative, in %, for VP MAE; absolute for the other three metrics).
Variant | VP MAE ↓ | Acc@1 ↑ | Tile F1 ↑ | QoE ρ ↑
Full model | 5.21 | 92.1 | 0.910 | 0.891
w/o cross-modal | 7.83 (−50%) | 78.4 (−13.7) | 0.741 (−0.169) | 0.724 (−0.167)
w/o bottleneck | 6.94 (−33%) | 81.3 (−10.8) | 0.792 (−0.118) | 0.763 (−0.128)
w/o meta-learning | 5.21 (–) | 91.8 (−0.3) | 0.908 (−0.002) | 0.872 (−0.019)
w/o uncertainty | 5.47 (−5%) | 89.6 (−2.5) | 0.891 (−0.019) | 0.863 (−0.028)
w/o audio | 5.89 (−13%) | 87.4 (−4.7) | 0.871 (−0.039) | 0.844 (−0.047)
w/o depth | 5.64 (−8%) | 90.2 (−1.9) | 0.882 (−0.028) | 0.867 (−0.024)
w/o semantics | 6.41 (−23%) | 84.7 (−7.4) | 0.831 (−0.079) | 0.791 (−0.100)
Single-task | 7.21 (−38%) | 77.8 (−14.3) | 0.751 (−0.159) | 0.731 (−0.160)
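↓ lower is better; ↑ higher is better. Values in parentheses indicate degradation relative to the full ImmerseFM-3D model: for VP MAE, the relative increase in error (shown with a negative sign); for Acc@1, Tile F1, and QoE ρ, the absolute change.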
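The parenthesized values in Table 19 can be recomputed from the raw entries: VP MAE uses a relative change in percent, while the other three metrics use absolute differences. A minimal sketch with the table's numbers (field names are illustrative):

```python
# Full-model scores from Table 19: (VP MAE, Acc@1, Tile F1, QoE rho).
FULL = (5.21, 92.1, 0.910, 0.891)

VARIANTS = {
    "w/o cross-modal": (7.83, 78.4, 0.741, 0.724),
    "w/o bottleneck": (6.94, 81.3, 0.792, 0.763),
    "w/o meta-learning": (5.21, 91.8, 0.908, 0.872),
    "w/o uncertainty": (5.47, 89.6, 0.891, 0.863),
    "w/o audio": (5.89, 87.4, 0.871, 0.844),
    "w/o depth": (5.64, 90.2, 0.882, 0.867),
    "w/o semantics": (6.41, 84.7, 0.831, 0.791),
    "Single-task": (7.21, 77.8, 0.751, 0.731),
}

def degradation(scores, full=FULL):
    """Relative MAE increase (%) and absolute drops for the other metrics."""
    mae, acc, f1, rho = scores
    return {
        "vp_mae_increase_pct": 100.0 * (mae - full[0]) / full[0],
        "acc1_change_pp": acc - full[1],
        "tile_f1_change": f1 - full[2],
        "qoe_rho_change": rho - full[3],
    }

deltas = {name: degradation(scores) for name, scores in VARIANTS.items()}

for name, d in deltas.items():
    print(
        f"{name}: MAE +{d['vp_mae_increase_pct']:.0f}%, "
        f"Acc@1 {d['acc1_change_pp']:+.1f} pp, "
        f"F1 {d['tile_f1_change']:+.3f}, rho {d['qoe_rho_change']:+.3f}"
    )
```

Note the sign convention: the table prints the MAE change with a minus sign to mark degradation, whereas the computed value is a positive increase in error (for example, about +50% without cross-modal fusion).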
Table 20. Marginal contribution of each input modality to ImmerseFM-3D performance. Each column reports the degradation relative to the full model when the indicated modality is removed. VP MAE: mean angular error at 1 s horizon (lower is better); BR Acc: Top-1 bitrate accuracy (%); Tile F1: tile selection F1 score; QoE ρ: Pearson correlation with MOS. Avg. Δ: mean relative degradation across all four tasks.
Removed Modality | VP MAE ↑ | BR Acc ↓ | Tile F1 ↓ | QoE ρ ↓ | Avg. Δ
None (full model) | 5.21° | 92.1% | 0.910 | 0.891 | –
Scene semantics (CLIP) | +23% | −7.4 pp | −0.079 | −0.100 | −15.4 pp
Audio (ambisonics) | +13% | −4.7 pp | −0.039 | −0.047 | −10.5 pp
Depth maps | +8% | −1.9 pp | −0.028 | −0.024 | −5.5 pp
Eye tracking | +5% | −2.5 pp | −0.019 | −0.028 | −4.9 pp
Context: component-level ablations
w/o cross-modal fusion | +50% | −13.7 pp | −0.169 | −0.167 | −28.5 pp
w/o info. bottleneck | +33% | −10.8 pp | −0.118 | −0.128 | −21.8 pp
Single-task (no MTL) | +38% | −14.3 pp | −0.159 | −0.160 | −27.2 pp
For VP MAE, ↑ indicates the percentage increase in error (higher degradation is worse). For BR Acc, Tile F1, and QoE ρ, ↓ indicates the decrease relative to the full model (larger negative values indicate greater degradation). pp: percentage points.
Table 21. Inference latency and memory profile across target hardware platforms. ImmerseFM-3D-Mobile: knowledge-distilled student model (≈28 M params).
Model | Platform | Lat. (ms) | Peak Mem. (MB) | GFLOPs | Rel. QoE
ImmerseFM-3D | A100 (training) | 8.2 | 6814 | 287 | 100%
ImmerseFM-3D | T4 (edge server) | 31.4 | 3142 | 287 | 100%
ImmerseFM-3D | iPhone 12 | 94.7 | 812 | 287 | 100%
ImmerseFM-3D | Pixel 6 | 88.3 | 742 | 287 | 100%
ImmerseFM-3D-Mobile | T4 (edge server) | 12.1 | 924 | 81 | 91%
ImmerseFM-3D-Mobile | iPhone 12 | 22.1 | 248 | 81 | 91%
ImmerseFM-3D-Mobile | Pixel 6 | 20.8 | 231 | 81 | 91%
ImmerseFM-3D-Mobile | HTC Vive Pro Eye | 18.6 | 194 | 81 | 91%
JUST360 [55] | T4 (edge) | 18.9 | 1203 | 124 | 88%
MFTR | T4 (edge) | 22.3 | 1847 | 163 | 85%
MDPI and ACS Style

Gallena Watthage, R.S.; Fernando, A. ImmerseFM-3D: A Foundation Model Framework for Generalizable 360-Degree Video Streaming with Cross-Modal Scene Understanding. Appl. Sci. 2026, 16, 3424. https://doi.org/10.3390/app16073424